CISC 7510X - Database Systems

CISC 7510X HW# 1 (due by 3rd class;): For the below `store' schema:
product(productid,description,listprice)
customer(customerid,username,fname,lname,street1,street2,city,state,zip)
purchase(purchaseid,purchasetimestamp,customerid)
purchase_items(itemid,purchaseid,productid,quantity,price)

Using SQL, answer these questions (write a SQL query that answers these questions):

What is the description of productid=42?
What's the name and address of customerid=42?
What products did customerid=42 purchase?
List customers who bought productid=24?
List customer names who have never puchased anything.
List product descriptions who have never been purchased by anyone.
What products were purchased by customers with zip code 10001?
What percentage of customers have ever purchased productid=42?
Of customers who purchased productid=42, what percentage also purchased productid=24?
What is the most popular (purchased most often) product in NY state?
What is the most popular (purchased most often) product in Tri-state Area? (NJ, NY, CT)
Who purchased productid=24 prior to July 4th, 2020?
For each customer, find all products from their last purchase.
For each customer, find all products from their last 10 purchases.
Names of customers who have purchased product 42 in the last 3 months.

Also, install PostgreSQL.

CISC 7510X HW# 2 (due by 4th class;): Install PostgreSQL.
For the below schema for a company door:
doorlog(eventid,doorid,tim,username,event)

Where doorid represents the door for this event. e.g. Front door may be doorid=1, and bathroom may be doorid=2, etc. tim is timestamp, username is the user who is opening or closing the door. event is "E" for entry, and "X" for exit.

Using SQL, answer these questions (write a SQL query that answers these questions):

How many users entered through doorid=1
If doorid=2 is bathroom, how many people are currently in the bathroom?
If doorid=1 is front entrance door, and doorid=3 is back entrance door, and these are the only doors in the building, how many people are currently in the building?
How many people were in the building on July 4th, at 10PM? (watching fireworks)
If doorid=7 is for floor 42, what's the daily occupancy of floor 42 for entire 2021 (give a number for every day in 2021; not just days that had activity; if nobody entered/left floor, then return 0 for that day)
What is the daily average (and standard deveation) occupancy of floor 42 for 2021? (single number; use above question results)
What percentage of the people work on floor 42 (assume if they entered the floor, they work there).
What's the average number of times per day that people use the bathroom? (bathroom is doorid=2).
What percentage of employees stayed after 5:15PM on July 3rd, 2022?
List all employees who left work before 1PM on July 3rd, 2022 (assume they arrived to work on July 3rd, before 1pm).

CISC 7510X HW# 3 (due by Nth class;): Write a command line program to "join" .csv files. Use any programming language you're comfortable with (Python suggested). Your program should work similarly to the unix "join" utility (google for it). Unlike the unix join, your program will not require files to be sorted on the key. Your program must also accept the "type" of join to use---merge join, inner loop join, or hash join, etc. Assume that first column is the join key---or you can accept the column number as paramater (like unix join command).

Do not use libraries with join-capabilities (e.g. Pandas, Dataset, or pass your files to unix "join" command, etc. that defeats the purpose of this homework.). Use lists, hashes, your own data-structures, etc., not a library that's essentially a mini-database. Test your program on "large" files (e.g. make sure it wouldn't blow up on one-million-records [e.g. do not store everything in memory], etc.)

Submit source code for the program.

Also... load all files in ctsdata.20140211.tar (link on the left) into Oracle or Postgres (or whichever works for you). The format of these files is: cts(tdate,symbol,open,high,low,close,volume), splits(tdate,symbol,post,pre), dividend(tdate,symbol,dividend). Submit (email) whatever commands/files you used to load the data into whatever database you're using, as well as the raw space usage of the tables in your database.

CISC 7510X HW# 4 (due by Nth class): If you haven't done so already, load all files in ctsdata.20140211.tar (link on the left) into Oracle or PostgreSQL (or whichever works for you; postgresql recommended!). The format of these files is: cts(date,symbol,open,high,low,close,volume), splits(date,symbol,post,pre), dividend(date,symbol,dividend). Submit (email) whatever commands/files you used to load the data into whatever database you're using, as well as the raw space usage of the tables in your database. (this was part of previous homework).

After loading the data, using a create-table-as SQL statement, create another table DAILY_PRCNT, with fields: TDATE,SYMBOL,PRCNT which will have the daily percentage gain/loss adjusted for dividends and splits.

Do NOT write procedural code (Java, C#, C/C++, etc.) for this homework (all code must be SQL, etc.).

HINT: MSFT (Microsoft) on 2004-11-12 closed at 29.97.
They issued a dividend of 3.08, with ex-dividend date of 2004-11-15. Meaning anyone who buys the stock on-or-after 2004-11-15 is NOT entitled to the dividned.
On 2004-11-12 it was $29.97 equity, by morning 2004-11-15 it turned into (26.89 equity + 3.08 cash). When markets closed on 2004-11-15 at 27.39, the gain was from 26.89 to 27.39.
(preclose - dividend) * (1+r) = close
(29.97 - 3.08)*(1+r) = 27.39
r = (27.39/(29.97 - 3.08))-1 = 0.018594
or 1.8594% daily gain.
Can test: (29.97 - 3.08) * (1+0.018594) = 27.390, which matches closing price on 2004-11-15.

HINT: splits, MSFT did a 1 to 2 split on 2003-02-18. During a split, each share of a company gets turned into several shares of lower value each. The total value held by investors is not changed.
Closing price on 2003-02-14 is 48.30
Closing price on 2003-02-18 is 24.96
On 2003-02-14 it was 48.30 stock, by morning 2003-02-18 it turned into 2 * 24.150 equity (total value still 48.30).
(prevclose * pre/post) * (1+r) = close
(48.30 * 1/2) * (1+r) = 24.96
The gain/loss is caluclated from 24.15 (value after split) to 24.96 (closing price on 2003-02-18).
r = (24.96 / (48.30 * 1/2))-1 = 0.033540
or 3.3540% daily gain.
Can test: (48.30 * 1/2) * (1+ 0.033540) = 24.96
Loss is just a negative percentage.

Submit query used to construct the DAILY_PRCNT table (e.g. "create table DAILY_PRCNT as select ..."). We'll do more stuff with this DAILY_PRCNT dataset in subsequent homeworks---so don't put it off and get it done on time.

CISC 7510X HW# 5 (due by Nth class;): Doing something useful with the data from HW4: Background: Pairs trading. Using the percentage returns table you built in HW4: Your task is to identify potential symbol pairs that have HIGH correlation, and are suitable for pairs trading. While everyone agrees that this strategy works, nobody agrees on the best way to identify correlation---especially when considered in relation to the rest of the market.

For this homework, feel free to use whatever you think is appropriate for correlation (if not sure, try Pearson; Take a log of the percentage gain, and apply pearson on top of that. Yes, you can do all this in SQL.).

Submit 10 "best" symbol pairs, each of which trades at least ~$10m a day, suitable for pairs trading in December 2013 (yah, I know it's an old date). Along with the pairs, submit their correlation coefficients for previous year, and the month of December 2013. (assume you were trading $1m worth, and you traded those exact 10 pairs, how much would you have gained/lost during that period?). Also submit the sql code to get those 10 symbols from the dataset.

CISC 7510X HW# 6: Your buddy stops over for lunch and tells you about their wonderful idea of building software for junk yards. Junk yards are places that aquire cheap old cars and sell individual parts---a $1k old junky car may have 100 parts in it that each can be sold for $20-$50, etc. A typical junk yard may have dozens to hundreds of old cars, and if you need a part, you drive by and ask... the attendant would know what car/part you're looking for and would know whether they have anything compatible in the inventory. (e.g. a "left side mirror from a white 2013 Ford Mustang" may be repainted to be compatible with a red 2014 Ford Mustang, etc.).

Now, the attendant would likely know these things (they have enormous domain knowledge). But it's still a major inventory hassle to find compatible parts... Your buddy has an idea of building such an `inventory management system' for junk yards... so anyone can start a junk yard, and junk yards can get much bigger. Maybe even hookup with ebay/amazon for used-parts! (You can't sell used car parts on ebay unless you know what parts you have!).

The idea is that the customer would drive in, type in the car/part they're looking for, and the system would tell them if there's a compatible car/part available (and where it is), or can be made compatible with minor tweaks (such as repainting, etc.). If part is not available locally, the software should be internet enabled to find the compatible parts in other junk yards running the same software. Your buddy estimates license per junk yard, $20k, with $2k/year maintenance, and your buddy thinks he can immediately sell it to at least 10 junk yards near major city centers, and perhaps a few hundred over the next few years. So now you have a case for a lucrative business... your task is to build it.

Go through the process of designing this inventory system. What are objects? What are events? Create a database schema, etc. How would the search process work? (e.g. go through the motions of: new junky car arrives, how is it inventoried? new customer arrives looking for a part, how does the system find a compatible part? where can humans be eliminated from this process?).

Submit writeup of the design (nothing too complicated, just a 1 page description---something that would convince me that you're the right contractor for this project---that you know what you're doing). While you can use chatgpt for this, it is almost certain that chatgpt will take your job if you do---the easiest jobs to eliminate are those that chatgpt can do easily. Submit database schema (DDL, create table statements), and query statements/process to find a compatible part.

CISC 7510X HW# 7: below is a schema for an HR database:
employee(empid, fname, lname, managerid, departmentid, employee_rank)

It's an employee table, which has employee id, first name, last name, manager id (which is an employee id), department id, and employee_rank, such as VP, CEO, SVP, etc.
Using SQL, answer these questions (write a SQL query that answers these questions) [tip: use recursive queries].

Who (empid, fname, lname) is the immediate manager of employee 42?
Who (empid, fname, lname) reports directly to employee 42?
Who (empid, fname, lname) reports (directly or indicretly) to employe 42?
Count of all employees who report to employee 42?
Who does employee 42 reports to (directly or indirectly)?
Who is employee 42's most immediate SVP manager?
How many levels up is employee 42's most immediate SVP manager?
How many levels away is employee 42 from CEO?
Maximum number of employee levels reporting to employee 42?
For employee 42, find the path-of-managers directly to the CEO?
For employee 42, find the path-of-managers directly to the manager John Doe?
Employees 42 and 24, most immediate common manager?
Employees 42 wants to send employee 24 a message: company policy states that the message must travel up the management chain to the common manager, then down the chain to the appropriate employee. What is the path (empid, fname, lname) of the message?

CISC 7510X HW# 8: It's year 2032, and flying cars are finally here. For safety, each is equipped with a black box that logs (every second): GPS timestamp, GPS latitude, longitude, and altitude.

To keep its flying registration, your flying car transmits this data nightly to the FAA, so your safe flying record can be verified (or appropriate tickets issued if not).

Your job as a contractor for the FAA is to write queries that generate tickets. Your database gets 50 million car records per day (all flying cars in the US, in year 2032), each one logging their position every second for the entire day.

Flying Car Law 123B states: Airborne flying cars must keep a distance of 50 feet away from any other airborne flying car. Your task: write a query that identifies violators and tickets *both* cars [assume airborne means 50 feet altitude]. What's the running time of your query in big-O notation.

Note, millions of records, you can't just do an inner join and compare distances: you may assume the existance of a "distance" function that takes two sets of coordinates as arguments, e.g.: distance(x1,y1,z1,x2,y2,z2) and returns distance.

CISC 7510X HW# 9: (this homework is inspired by an interview question I've been asked): In this homework you'll be using this file:
[quotes_UsConsolidated_....txt.gz]. [hint]

This file includes US stock quote data. Each row is a quote. A quote could be from a single venue or a consolidated quote across all venues. Each file is 10 minutes for a subset of stocks.

File Specification: The first row from the sample file:
86|1|18:10:00.000|U|0||5|3|BP.N||3=13:09:59.993|1=16|0=39.93|2=0x52|8=13:09:59.993|6=21|5=39.94|11=2017-12-11|1715=13:09:59.993|7=0x52|1427=C|

Each row contains two parts:
The header is comprised of 10 pipe-delimited fields. The only relevant field for this problem is the 9th, the symbol.

Symbols have the form "AAA.BB" where AAA is the ticker and BB is the venue.
Quotes with symbols ending in "." (e.g. "AAA.") are consolidated quotes.
Quotes with venues specified (e.g. "AAA.BB") are venue quotes that contribute to the consolidated quotes for their ticker (e.g. "AAA.").

The body is comprised of a variable number of pipe-delimited key-value fields representing the latest known value for a ticker/venue combination.
If a key is missing, its value is retained from the prior entry for that ticker/venue. Values for a given ticker/venue are valid for a given trade date until explicitly updated.
If a key is specified but has no value (e.g. "|3=|") then the prior value does not carry over, but is instead missing. This may occur if, for example, a venue has no bids for a security at the moment.

The relevant keys are:
0: bid
1: bid size
3: bid time
5: ask
6: ask size
8: ask time
11: trade date

In general, both venue and consolidated quotes are valid until updated. The consolidated quote represents the highest valid bid (or lowest ask) across all venues. Certain condition codes on venue quotes can indicate that the venue is no longer valid for inclusion in the consolidated quote.

Task1: Write ETL code to save the following fields from the venue quotes in Parquet format:
ticker, date, time, venue, bid, bid size, ask, ask size.

The data written should be fully reflective of the state of the market as of each quote—i.e. if the current bid is unspecified in a row on the input because it is unchanged, it nonetheless should appear in the Parquet data. If the current bid is unavailable because it was explicitly nulled (i.e. a |0=| entry in the file) it should appear as a null in the Parquet data.

Task2: For each date, ticker and minute from 09:31 through 16:00, calculate the number of venues that are showing the same bid price as the consolidated quote at the end of the minute interval. Include only quotes for the trade date specified in the file name.

Submit the Spark program to do Task1 and Task2.

CISC 7510X HW# 10: Write a program to peform a database backup. Generate a public/private key pair (using GnuPG, or anything else). Your backup program/script must run daily, backup a table in a database into a .csv.gz file (comma delimited, gzip compressed [do not use database specific backup formats]). And encrypt the backup using your PUBLIC key.

Note that you can use any language, database, utility, configuration, etc. (cron script or windows scheduler is ok). The key is that the backup is comma delimited (do not export to database specific formats), gzip compressed, AND encrypted with public key---and is generated daily (without your intervention). Email me the program/script and instructions on how to set it up to run daily (for cron, I want the crontab line, etc., for windows scheduler, I want a batch file to setup to run and instruction on how to set it up in scheduler)