main
April 18th, 2024    

CISC 7512X
Main
Files
Syllabus
Links
Homeworks

Notes
0001

DB2
Intro
SQL Intro
More SQL
Oracle Primer
MySQL Primer
PostgreSQL Primer
AnalyticFuncs
Indexes/Joins
DB Design
Normal Forms
Data Loads
Hierarchical
Partitions
DB Procs
Big Data
HBase Primer
Hadoop/Hive
Spark Primer


Past Tests
S2024 Midterm
F2023 Final(key)
F2023 Midterm
S2023 Final(key)
S2023 Midterm(key)
F2022 Final(key)
F2022 Midterm(key)
S2020 Final(key)
F2021 Midterm
F2021 Final
S2022 Midterm(key)
S2022 Final(key)
S2020 Final(key)
S2021 Midterm
S2021 Final(key)
F2020 Midterm (key)
F2020 Final Exam (key)
S2019 Midterm(key) S2019 Midterm2(key)
F2019 Midterm(key)
S2018 Midterm(key)
S2018 Midterm2(key)
S2018 Final(key)
S2018 Final2(key)
S2017 Midterm
F2017 Midterm(key)
F2017 Final(key)
S2017 Midterm(key)
S2017 Midterm2(key)
S2017 Final(key)
S2017 Final2(key)
F2015 Midterm
S2014 Midterm
S2014 Final
F2016 Final


Sample Data
ctsdata.20140211.tar
Stock Ordrs


SQLRunner

CISC 7512X (DB2) Homeworks

You should EMAIL me homeworks, alex at theparticle dot com. Start email subject with "CISC 7512X HW#". Homeworks without the subject line risk being deleted and not counted.

CISC 7512X HW# 1 (due by 2nd class;): Email me your name, prefered email address, IM account (if any), major, and year. (oh, and install PostgreSQL).


CISC 7512X HW# 2 (due by 3rd class;): For the below `bank' schema:
customer(customerid,username,fname,lname,street1,street2,city,state,zip)
account(accountid,customerid,description,)
transaction(transactionid,trantimestamp,accountid,amount)

A customer may have several accounts, and each account may participate in many transactions. Each transaction will have at least two records, one deducting amount from an account, and one adding amount to an account (for a single transactionid, the sum of amounts will equal zero).
Using SQL, answer these questions (write a SQL query that answers these questions):

  1. What is the balance of accountid=42?
  2. What was the transaction amount of transactionid=42?
  3. Which transactionids do not sum up to zero (are invalid)?
  4. List of customers without accounts?
  5. What is the balance (total across all accounts) for customerid=42?
  6. What is the total balance of all customers living in zip code 10001?
  7. Which zip code has the highest balance?
  8. List the top 1% of customers (ordered by total balance).
  9. Using balances for previous two months, predict what the balances will be next month. (tip: find slope of a line; x-axis is days, y-axis is balance. 2 previous months means you have 2 points, finding slope is easy. Use slope to predict where next month's balance will be.)
  10. List top 10 fastest growing accounts (using previous 2 months). (tip: same as above, fastest growing means steepest slope).
  11. For each account, what was the closing balance on December 31, 2023?
  12. What percentage of bank's money is held by people in the tri-state area today? (NY, NJ, CT)
  13. Write a query to add 0.01% to each savings account (note that the money has to be accounted for).
  14. Find all accounts with 30-day moving average balance less than $1500.
  15. Find all accounts with less than 2 transactions in the previous 30-days.
  16. Find all accounts with negative balances for the entire previous 30-days.
  17. Find all accounts that meet all conditions from previous 3 bullet points.
  18. Find customers who move money between their accounts too frequencly (e.g. transaction count among accounts that they own is above 99-th percentile of all customers in the bank).
  19. We (the bank) define a round-trip as a transaction from account A to account B, that is followed by a transaction from account B back to account A within 30 days, for an amount that is within 10% of the original amount. Find all round-trip transactions in the last year. Make sure not to count transactions twice. [e.g. A-to-B, and then B-to-A, followed by another B-to-A, etc., shouldn't count it as two round-trips, since the 2nd B-to-A isn't paired up with its own A-to-B].
  20. Generate a list of customers who engaged in at least 10 round-trips in the last year.
  21. We (the bank) define a loop as any set of transactions, less than 4, that create a loop, e.g.: A-to-B, and B-to-C and C-to-A... within 30 days, and within 10% of amount, etc., This is just a bigger version of a round-trip, involving more than 2 participants. Find all loops in the last year. Again, do not overcount.

CISC 7512X HW# 3 (due by Nth class): Your buddy stops over for lunch and tells you about this wonderful idea of building apps for phones (for profit!). The gist of the idea: ride sharing! (``Urgh, not again!'', you think). Unlike other ride-sharing ideas, this app is designed for the usual commuter who uses the car to get to work---and is willing to share the ride with someone else to lower their costs. Going out of the way to pickup folks is out of the question (the driver also needs to get to work themselves). Also, the driver prefers the fastest possible route (highways, etc.,) even if it means not picking up someone. Since everyone (including the driver) are benefitting from the ride, the goal is to lower the commute cost for everyone (including driver and passenger [passenger would use their own car if it costs them less]). The business takes a small slice of the money saved (so it's a win-win for everyone involved). Also, folks will be able to pay for the ride in bitcoins. This all seems like crazy talk until your buddy mentions there's a potential $10m investment (from the same folks who seeded Uber), and all they need from you is a working prototype and a write-up of the architecture by next week.

Your task: Design and build a database to run this business. What tables would you need? What events would you capture? Etc. Write up what interface and functionality would be needed to interact with the database. Make the investors see that this is a real viable idea that will actually work. Produce a business plan, design document, whitepaper, architecture, prototype, etc., whatever it takes to get that investment.


CISC 7512X HW# 4: below is a schema for an HR database:
employee(empid, fname, lname, managerid, departmentid, employee_rank)

It's an employee table, which has employee id, first name, last name, manager id (which is an employee id), department id, and employee_rank, such as VP, CEO, SVP, etc.
Using SQL, answer these questions (write a SQL query that answers these questions) [tip: use recursive queries].

  1. Who (empid, fname, lname) is the immediate manager of employee 42?
  2. Who (empid, fname, lname) reports directly to employee 42?
  3. Who (empid, fname, lname) reports (directly or indicretly) to employe 42?
  4. Count of all employees who report to employee 42?
  5. Who does employee 42 reports to (directly or indirectly)?
  6. Who is employee 42's most immediate SVP manager?
  7. How many levels up is employee 42's most immediate SVP manager?
  8. How many levels away is employee 42 from CEO?
  9. Maximum number of employee levels reporting to employee 42?
  10. For employee 42, find the path-of-managers directly to the CEO?
  11. For employee 42, find the path-of-managers directly to the manager John Doe?
  12. Employees 42 and 24, most immediate common manager?
  13. Employees 42 wants to send employee 24 a message: company policy states that the message must travel up the management chain to the common manager, then down the chain to the appropriate employee. What is the path (empid, fname, lname) of the message?

CISC 7512X HW# 5: Load all files in ctsdata.20140211.tar (link on the left) into Oracle or PostgreSQL (or whichever works for you; postgresql recommended!). The format of these files is: cts(date,symbol,open,high,low,close,volume), splits(date,symbol,post,pre), dividend(date,symbol,dividend). Submit (email) whatever commands/files you used to load the data into whatever database you're using, as well as the raw space usage of the tables in your database. (this was part of previous homework).

Part1: After loading the data, using a create-table-as SQL statement, create another table DAILY_PRCNT, with fields: TDATE,SYMBOL,PRCNT which will have the daily percentage gain/loss adjusted for dividends and splits.

Do NOT write procedural code (Java, C#, C/C++, etc.) for this homework (all code must be SQL, etc.).

HINT: MSFT (Microsoft) on 2004-11-12 closed at 29.97.
They issued a dividend of 3.08, with ex-dividend date of 2004-11-15. Meaning anyone who buys the stock on-or-after 2004-11-15 is NOT entitled to the dividned.
On 2004-11-12 it was $29.97 equity, by morning 2004-11-15 it turned into (26.89 equity + 3.08 cash). When markets closed on 2004-11-15 at 27.39, the gain was from 26.89 to 27.39.
(preclose - dividend) * (1+r) = close
(29.97 - 3.08)*(1+r) = 27.39
r = (27.39/(29.97 - 3.08))-1 = 0.018594
or 1.8594% daily gain.
Can test: (29.97 - 3.08) * (1+0.018594) = 27.390, which matches closing price on 2004-11-15.

HINT: splits, MSFT did a 1 to 2 split on 2003-02-18. During a split, each share of a company gets turned into several shares of lower value each. The total value held by investors is not changed.
Closing price on 2003-02-14 is 48.30
Closing price on 2003-02-18 is 24.96
On 2003-02-14 it was 48.30 stock, by morning 2003-02-18 it turned into 2 * 24.150 equity (total value still 48.30).
(prevclose * pre/post) * (1+r) = close
(48.30 * 1/2) * (1+r) = 24.96
The dain/loss is caluclated from 24.15 (value after split) to 24.96 (closing price on 2003-02-18).
r = (24.96 / (48.30 * 1/2))-1 = 0.033540
or 3.3540% daily gain.
Can test: (48.30 * 1/2) * (1+ 0.033540) = 24.96
Loss is just a negative percentage.

Part2: Using the DAILY_PRCNT table, calculate CTS_OUTLIER table, that will have tdate, symbol, closing price, previous-day closing price, prcnt return on tdate, 20-day moving avearge of return on tdate, 20-day moving standard deviation of return on tdate, and z-value (hint). [Note, you need to do calculation on price change, not percentage... so unless you want to repeat part1 calculations, you should use table daily_prcnt and convert it into a price, since that's adjusted for splits/dividends].

Part3: Identify 10 worst-days. On each day, a certain number of symbols will have z-value below 3. Find 10 worst-days by that count.

Submit qeuries for part1 and part2, and submit queries and output for part3. (e.g. "create table DAILY_PRCNT as select ...").


CISC 7512X HW# 6: In the not-so-distant future, flying cars are commonplace---everyone on the planet got one. Yes, there are ~10 billion flying cars all over the globe. Each one logs its coordinates every 10 milliseconds, even when parked. Assume x,y,z coordinates, with z being altitude, and x,y, some cartesian equivalent of GPS. To avoid accidents, regulation states that no car can be next to any other car by more than 10 feet while in the air (z > 0) for longer than 1 second. Cars can go really fast, ~500mph. YOUR TASK: write an algorithm and program to find all violators. Assume input is a HUGE file (10 billion cars logging "VIN,timestamp,x,y,z" every 10 milliseconds all-the-time).

Install Apache Hadoop. [hadoop]. Write a Hive query (or a series of queries), or a MapReduce program to find all violators (cars that are next to other cars while in flight). Assume data is in "cars" table in Hive (or "/use/hive/warehouse/cars" file on HDFS).

What is the running time of your implementation? If it's O(N^2), can you make it run in O(N log N) time? (note that with this much data, N^2 is not practical, even N log N is a bit long). Using your 1 node Hadoop cluster, estimate the amount of resources this whole task will consume (to apply it on 10 billion cars), and put a dollar amount value (assuming it costs $0.10/hour to rent 1 node (machine); how much will your solution cost per day/month/year?); rationalize your answer. (note that you can't answer "I'll rent 1 node, and let it run until it's done."; You must process data at least as fast as it is being generated by all those billions of cars).

Submit whatever you create to solve this problem (source code for map reduce tasks, or hive queries, etc.,). Note, your solution must run (on small dataset) on a 1-node hadoop cluster. You may ``ignore'' the GPS coordinate system, and simply assume those are cartesian x,y,z coordinates.


CISC 7512X HW# 7: (this homework is inspired by an interview question I've been asked): In this homework you'll be using this file:
[quotes_UsConsolidated_....txt.gz]. [hint]

This file includes US stock quote data. Each row is a quote. A quote could be from a single venue or a consolidated quote across all venues. Each file is 10 minutes for a subset of stocks.

File Specification: The first row from the sample file:
86|1|18:10:00.000|U|0||5|3|BP.N||3=13:09:59.993|1=16|0=39.93|2=0x52|8=13:09:59.993|6=21|5=39.94|11=2017-12-11|1715=13:09:59.993|7=0x52|1427=C|

Each row contains two parts:
The header is comprised of 10 pipe-delimited fields. The only relevant field for this problem is the 9th, the symbol.

Symbols have the form "AAA.BB" where AAA is the ticker and BB is the venue.
Quotes with symbols ending in "." (e.g. "AAA.") are consolidated quotes.
Quotes with venues specified (e.g. "AAA.BB") are venue quotes that contribute to the consolidated quotes for their ticker (e.g. "AAA.").

The body is comprised of a variable number of pipe-delimited key-value fields representing the latest known value for a ticker/venue combination.
If a key is missing, its value is retained from the prior entry for that ticker/venue. Values for a given ticker/venue are valid for a given trade date until explicitly updated.
If a key is specified but has no value (e.g. "|3=|") then the prior value does not carry over, but is instead missing. This may occur if, for example, a venue has no bids for a security at the moment.

The relevant keys are:
0: bid
1: bid size
3: bid time
5: ask
6: ask size
8: ask time
11: trade date

In general, both venue and consolidated quotes are valid until updated. The consolidated quote represents the highest valid bid (or lowest ask) across all venues. Certain condition codes on venue quotes can indicate that the venue is no longer valid for inclusion in the consolidated quote.

Task1: Write ETL code to save the following fields from the venue quotes in Parquet format:
ticker, date, time, venue, bid, bid size, ask, ask size.

The data written should be fully reflective of the state of the market as of each quote—i.e. if the current bid is unspecified in a row on the input because it is unchanged, it nonetheless should appear in the Parquet data. If the current bid is unavailable because it was explicitly nulled (i.e. a |0=| entry in the file) it should appear as a null in the Parquet data.

Task2: For each date, ticker and minute from 09:31 through 16:00, calculate the number of venues that are showing the same bid price as the consolidated quote at the end of the minute interval. Include only quotes for the trade date specified in the file name.

Submit the Spark program to do Task1 and Task2.


CISC 7512X HW# 8 (due by Nth class;): Install HBase. You have a relational database from HW2:
customer(customerid,username,fname,lname,street1,street2,city,state,zip)
account(accountid,customerid,description,)
transaction(transactionid,trantimestamp,accountid,amount)

You would like to port it to HBase. How would you organize the data to make it easy to answer HW2 questions using HBase? What would you use as keys? Do you need to store multiple copies of the data?

Outline pseudo code (please don't write actual java) that would answer the following questions using your design:

  1. What is the balance of accountid=42?
  2. What was the transaction amount of transactionid=42?
  3. Which transactionids do not sum up to zero (are invalid)?
  4. List of customers without accounts?
  5. What is the balance (total across all accounts) for customerid=42?
  6. What is the total balance of all customers living in zip code 10001?
  7. Which zip code has the highest balance?





































© 2006, Particle