main
March 27th, 2017    

CISC 7510X/7512X
Main
Files
Syllabus
Links
Homeworks

Notes
0001

DB1
Intro
SQL Intro
More SQL
Indexes/Joins
Data Loads
DB Design

DB2
Intro
SQL Intro
More SQL
Normal Forms
Oracle Primer
MySQL Primer
Data Loads
DB Design
AnalyticFuncs
Indexes/Joins
DB Procs
Hierarchical
Partitions
Big Data

Sample Data
ctsdata.20140211.tar
Stock Ordrs

SQLRunner

CISC 7510X (DB1) Homeworks

You should EMAIL me homeworks, alex at theparticle dot com. Start email subject with "CISC 7510X HW#". Homeworks without the subject line risk being deleted and not counted.

CISC 7510X HW# 1 (due by 2nd class;): Email me your name, prefered email address, IM account (if any), major, and year.


CISC 7510X HW# 2 (due by 3rd class;): For the below `store' schema:
product(productid,description,)
customer(customerid,username,fname,lname,street1,street2,city,state,zip)
purchase(purchaseid,purchasetimestamp,customerid,productid,quantity,price)

Using SQL, answer these questions (write a SQL query that answers these questions):

  1. What is the description of productid=42?
  2. What's the name and address of customerid=42?
  3. What products did customerid=42 purchase?
  4. List customer names who have never puchased anything.
  5. List product descriptions who have never been purcahsed by anyone.
  6. What products were purchased by customers with zip code 10001?
  7. What percentage of customers have ever purchased productid=42?
  8. Of customers who purchased productid=42, what percentage also purchased productid=24?
  9. What is the most popular (purchased most often) product in NY state?
  10. What is the most popular (purchased most often) product in North East?

CISC 7510X HW# 3 (due by Nth class;): Write a command line program to "join" .csv files. Use any programming language you're comfortable with. Your program should work similarly to the unix "join" utility (google for it). Unlike the unix join, your program will not require files to be sorted on the key. Your program must also accept the "type" of join to use---merge join, inner loop join, or hash join, etc. Test your program on "large" files (e.g. make sure it doesn't blow up on one million records, etc.)

Submit source code for the program.

Also... load all files in ctsdata.20140211.tar (link on the left) into Oracle or Postgres (or whichever works for you). The format of these files is: cts(date,symbol,open,high,low,close,volume), splits(date,symbol,post,pre), dividend(date,symbol,dividend). Submit (email) whatever commands/files you used to load the data into whatever database you're using, as well as the raw space usage of the tables in your database.


CISC 7510X HW# 4 (due by Nth class): Load all files in ctsdata.20140211.tar (link on the left) into Oracle or PostgreSQL (or whichever works for you; postgresql recommended!). The format of these files is: cts(date,symbol,open,high,low,close,volume), splits(date,symbol,post,pre), dividend(date,symbol,dividend). Submit (email) whatever commands/files you used to load the data into whatever database you're using, as well as the raw space usage of the tables in your database. (this was part of previous homework).

After loading the data, create another table with fields: TDATE,SYMBOL,PRCNT which will have the daily percentage gain/loss adjusted for dividends and splits.

Do not write procedural code (Java, C#, C/C++, etc.) for this homework (all code must be SQL, etc.).

That is, if stock XYZ today (20140211) issued dividend of $0.25, previous day's (20140210) close was $50 and today's (20140211) close is $51, then today's percentage gain is: 1.5%, since 50 + 0.25 + 50*0.015 = 51. So your table will have: 20140211,XYZ,1.5
Loss is just a negative percentage. Similar accounting for splits.

Submit query used to construct the table. We'll do more stuff with this dataset in subsequent homeworks.


CISC 7510X HW# 5 (due by Nth class): Your buddy stops over for lunch and tells you about his wonderful idea of building software for junk yards. Junk yards are places that aquire cheap old cars and sell individual parts---a $1k old junky car may have 100 parts in it that each can be sold for $20-$50, etc. A typical junk yard may have dozens to hundreds of old cars, and if you need a part, you drive by and ask... the attendant would know what car/part you're looking for and would know whether they have anything compatible in the inventory. (e.g. a "left side mirror from a white 2013 Ford Mustang" may be repainted to be compatible with a red 2014 Ford Mustang, etc.).

Now, the attendant would likely know these things (they have enormous domain knowledge). But it's still a major inventory hassle to find compatible parts... Your buddy has an idea of building such an `inventory management system' for junk yards... so anyone can start a junk yard, and junk yards can get much bigger. The idea is that the customer would drive in, type in the car/part they're looking for, and the system would tell them if there's a compatible car/part available (and where it is), or can be made compatible with minor tweaks (such as repainting, etc.). If part is not available locally, the software should be internet enabled to find the compatible parts in other junk yards running the same software. License per junk yard, $20k, with $2k/year maintenance, and your buddy thinks he can sell it to at least 10 junk yards near major city centers. So now you have a case for a lucrative business... your task is to build it.

Go through the process of designing this inventory system. What are objects? What are events? Create a database schema, etc. How would the search process work? (e.g. go through the motions of: new junky car arrives, how is it inventoried? new customer arrives looking for a part, how does the system find a compatible part? where can humans be eliminated from this process?).

Submit writeup of the design (nothing too complicated, just a 1 page description---something that would convince me that you're the right contractor for this project---that you know what you're doing). Also submit database schema (DDL, create table statements), and query statements/process to find a compatible part.


CISC 7512X (DB2) Homeworks

You should EMAIL me homeworks, alex at theparticle dot com. Start email subject with "CISC 7512X HW#". Homeworks without the subject line risk being deleted and not counted.

CISC 7512X HW# 1 (due by 2nd class;): Email me your name, prefered email address, IM account (if any), major, and year.


CISC 7512X HW# 2 (due by 3rd class;): For the below `bank' schema:
customer(customerid,username,fname,lname,street1,street2,city,state,zip)
account(accountid,customerid,description,)
transaction(transactionid,trantimestamp,accountid,amount)

A customer may have several accounts, and each account may participate in many transactions. Each transaction will have at least two records, one deducting amount from an account, and one adding amount to an account (for a single transactionid, the sum of amounts will equal zero).
Using SQL, answer these questions (write a SQL query that answers these questions):

  1. What is the balance of accountid=42?
  2. What was the transaction amount of transactionid=42?
  3. Which transactionids do not sum up to zero (are invalid)?
  4. List of customers without accounts?
  5. What is the balance (total across all accounts) for customerid=42?
  6. What is the total balance of all customers living in zip code 10001?
  7. Which zip code has the highest balance?
  8. List the top 1% of customers (ordered by total balance).
  9. Using balances for previous two months, predict what the balances will be next month. (tip: find slope of a line; x-axis is days, y-axis is balance. 2 previous months means you have 2 points, finding slope is easy. Use slope to predict where next month's balance will be.)
  10. List top 10 fastest growing accounts (using previous 2 months). (tip: same as above, fastest growing means steepest slope).
  11. Write a query to add 0.01% to each savings account (note that the money has to be accounted for).
  12. For each account, what was the closing balance on December 31, 2016?
  13. What percentage of bank's money is held by people in the tri-state area today? (NY, NJ, CT)

CISC 7512X HW# 3: Imagine you have a database table with columns: phoneid, time, gps_latitude, gps_longitude. Assume these records are logged approximately every few seconds for every phone. Your task is to detect speeding: Write a database query (in SQL) to find anyone whose *average* speed is between 90 and 200mph for at least a minute. If can't write SQL query, write detailed procedural speudo code (assume input is coming from a comma delimited text file). Submit code via email, with subject "CISC 7512X HW3".


CISC 7512X HW# 4: In the not-so-distant future, flying cars are commonplace---everyone on the planet got one. Yes, there are ~10 billion flying cars all over the globe. Each one logs its coordinates every 10 milliseconds, even when parked. Assume x,y,z coordinates, with z being altitude, and x,y, some cartesian equivalent of GPS. To avoid accidents, regulation states that no car can be next to any other car by more than 10 feet while in the air (z > 0) for longer than 1 second. Cars can go really fast, ~500mph. YOUR TASK: write an algorithm and program to find all violators. Assume input is a HUGE file (10 billion cars logging "VIN,timestamp,x,y,z" every 10 milliseconds all-the-time).

Install Apache Hadoop. [hadoop]. Write a Hive query (or a series of queries), or a MapReduce program to find all violators (cars that are next to other cars while in flight). Assume data is in "cars" table in Hive (or "cars.txt" file on HDFS). What is the running time of your algorithm? If it's O(N^2), can you make it run in O(N log N) time? (note that with this much data, N^2 is not practical, even N log N is a bit long). Using your 1 node Hadoop cluster, estimate the amount of resources this whole task will consume (to apply it on 10 billion cars), and put a dollar amount value (assuming it costs $0.10/hour to rent 1 node (machine); how much will your solution cost per day/month/year?); rationalize your answer. (note that you can't answer "I'll rent 1 node, and let it run until it's done."; You must process data at least as fast as it is being generated by all those billions of cars).

Good Hadoop installation guide. Hive installation is much simpler, just unzip, set HIVE_HOME, add bin folder to PATH, and then just run "hive". Here's some tips on trying to get Hive running for first time (links may be outdated).

Submit whatever you create to solve this problem (source code for map reduce tasks, or hive queries, etc.,). Note, your solution must run (on small dataset) on a 1-node hadoop cluster.






































© 2006, Particle