CISC 7510X (DB1) Homeworks
You should EMAIL me homeworks, alex at theparticle dot com. Start email subject with "CISC 7510X HW#". Homeworks without the subject line risk being deleted and not counted.
CISC 7510X HW# 1 (due by 2nd class;): Email me your name, preferred email address, IM account (if any), major, and year.
CISC 7510X HW# 2 (due by 3rd class;): For the below `store' schema:
CISC 7510X HW# 3 (due by Nth class;): Write a command line program to "join" .csv files. Use any programming language you're comfortable with. Your program should work similarly to the unix "join" utility (google for it). Unlike the unix join, your program will not require files to be sorted on the key. Your program must also accept the "type" of join to use---merge join, nested loop join, or hash join, etc. Test your program on "large" files (e.g. make sure it doesn't blow up on one million records, etc.)
Submit source code for the program.
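A minimal sketch of the hash-join variant in Python (the file layout, key column, and function name here are illustrative choices, not requirements):

```python
import csv
from collections import defaultdict

def hash_join(left_path, right_path, key_col=0):
    """Inner hash join of two unsorted .csv files on the given key column.

    Builds an in-memory hash table on the (ideally smaller) left file,
    then streams the right file and probes the table row by row."""
    table = defaultdict(list)
    with open(left_path, newline="") as f:
        for row in csv.reader(f):
            table[row[key_col]].append(row)
    with open(right_path, newline="") as f:
        for row in csv.reader(f):
            for match in table[row[key_col]]:
                # emit: full left row, then the right row's non-key columns
                yield match + [c for i, c in enumerate(row) if i != key_col]
```

A merge join would instead sort both inputs on the key and scan them in lockstep; a nested loop join would rescan the right file for every left row. Comparing their behavior on the million-record test is the point of the exercise.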
Also... load all files in ctsdata.20140211.tar (link on the left) into Oracle or Postgres (or whichever works for you). The format of these files is: cts(date,symbol,open,high,low,close,volume), splits(date,symbol,post,pre), dividend(date,symbol,dividend). Submit (email) whatever commands/files you used to load the data into whatever database you're using, as well as the raw space usage of the tables in your database.
CISC 7510X HW# 4 (due by Nth class): Load all files in ctsdata.20140211.tar (link on the left) into Oracle or PostgreSQL (or whichever works for you; postgresql recommended!). The format of these files is: cts(date,symbol,open,high,low,close,volume), splits(date,symbol,post,pre), dividend(date,symbol,dividend). Submit (email) whatever commands/files you used to load the data into whatever database you're using, as well as the raw space usage of the tables in your database. (this was part of previous homework).
After loading the data, create another table with fields: TDATE,SYMBOL,PRCNT which will have the daily percentage gain/loss adjusted for dividends and splits.
Do not write procedural code (Java, C#, C/C++, etc.) for this homework (all code must be SQL, etc.).
That is, if stock XYZ today (20140211) issued dividend of $0.25, previous day's (20140210) close was $50 and today's (20140211) close is $51, then today's percentage gain is: 1.5%, since 50 + 0.25 + 50*0.015 = 51. So your table will have: 20140211,XYZ,1.5
Submit query used to construct the table. We'll do more stuff with this dataset in subsequent homeworks.
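A sketch of the adjustment logic, run here against SQLite through Python so it is easy to test locally (the homework targets Oracle/PostgreSQL, where the same query shape works; the post/pre split handling is an assumption about the feed's convention, and the gain formula follows the worked XYZ example above, i.e. close = prev_close + dividend + prev_close * gain):

```python
import sqlite3

# Requires SQLite >= 3.25 for window functions (bundled with Python 3.8+).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cts(date INT, symbol TEXT, open REAL, high REAL,
                 low REAL, close REAL, volume INT);
CREATE TABLE splits(date INT, symbol TEXT, post REAL, pre REAL);
CREATE TABLE dividend(date INT, symbol TEXT, dividend REAL);
""")

# The worked XYZ example from the homework text:
con.execute("INSERT INTO cts VALUES (20140210,'XYZ',50,50,50,50,1000)")
con.execute("INSERT INTO cts VALUES (20140211,'XYZ',51,51,51,51,1000)")
con.execute("INSERT INTO dividend VALUES (20140211,'XYZ',0.25)")

# Per the example: prcnt = (close*ratio - dividend - prev_close) / prev_close * 100,
# where ratio = post/pre adjusts for a split on that day (1 when there is none).
con.executescript("""
CREATE TABLE prcnt_tbl AS
SELECT tdate, symbol,
       ROUND((close*ratio - div - prev_close) / prev_close * 100, 4) AS prcnt
FROM (SELECT c.date AS tdate, c.symbol, c.close,
             COALESCE(d.dividend, 0)       AS div,
             COALESCE(s.post*1.0/s.pre, 1) AS ratio,
             LAG(c.close) OVER (PARTITION BY c.symbol ORDER BY c.date) AS prev_close
      FROM cts c
      LEFT JOIN dividend d ON d.date = c.date AND d.symbol = c.symbol
      LEFT JOIN splits   s ON s.date = c.date AND s.symbol = c.symbol)
WHERE prev_close IS NOT NULL;
""")
rows = con.execute("SELECT * FROM prcnt_tbl").fetchall()
```

On the XYZ example this produces the expected (20140211, 'XYZ', 1.5) row.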
CISC 7510X HW# 5 (due by Nth class): Your buddy stops over for lunch and tells you about his wonderful idea of building software for junk yards. Junk yards are places that acquire cheap old cars and sell individual parts---a $1k old junky car may have 100 parts in it that each can be sold for $20-$50, etc. A typical junk yard may have dozens to hundreds of old cars, and if you need a part, you drive by and ask... the attendant would know what car/part you're looking for and would know whether they have anything compatible in the inventory. (e.g. a "left side mirror from a white 2013 Ford Mustang" may be repainted to be compatible with a red 2014 Ford Mustang, etc.).
Now, the attendant would likely know these things (they have enormous domain knowledge). But it's still a major inventory hassle to find compatible parts... Your buddy has an idea of building such an `inventory management system' for junk yards... so anyone can start a junk yard, and junk yards can get much bigger. The idea is that the customer would drive in, type in the car/part they're looking for, and the system would tell them if there's a compatible car/part available (and where it is), or one that can be made compatible with minor tweaks (such as repainting, etc.). If a part is not available locally, the software should be internet enabled to find compatible parts in other junk yards running the same software. The license is $20k per junk yard, with $2k/year maintenance, and your buddy thinks he can sell it to at least 10 junk yards near major city centers. So now you have a case for a lucrative business... your task is to build it.
Go through the process of designing this inventory system. What are objects? What are events? Create a database schema, etc. How would the search process work? (e.g. go through the motions of: new junky car arrives, how is it inventoried? new customer arrives looking for a part, how does the system find a compatible part? where can humans be eliminated from this process?).
Submit writeup of the design (nothing too complicated, just a 1 page description---something that would convince me that you're the right contractor for this project---that you know what you're doing). Also submit database schema (DDL, create table statements), and query statements/process to find a compatible part.
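To give a flavor of one possible design (every table, column, and value here is an illustrative guess, not the expected answer), a starting schema and compatibility query sketched via SQLite:

```python
import sqlite3

# Objects: cars, a catalog of part types, physical parts on hand.
# The fitment table records which (make, model, year-range) each part
# type is compatible with, and what tweak (if any) that requires.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE car(
  carid INTEGER PRIMARY KEY, vin TEXT, make TEXT, model TEXT,
  year INT, color TEXT, arrived DATE);
CREATE TABLE part_type(
  parttypeid INTEGER PRIMARY KEY, name TEXT);      -- e.g. 'left side mirror'
CREATE TABLE part(
  partid INTEGER PRIMARY KEY, carid INT REFERENCES car,
  parttypeid INT REFERENCES part_type,
  condition TEXT, location TEXT, price REAL,
  status TEXT DEFAULT 'in_stock');                 -- in_stock / sold
CREATE TABLE fitment(
  parttypeid INT REFERENCES part_type,
  make TEXT, model TEXT, year_from INT, year_to INT,
  tweak TEXT);                                     -- e.g. 'repaint', NULL if none
""")

# Events: a junky car arrives (insert a car row plus one part row per
# salvageable part); a customer asks for a part (run the query below).
con.execute("INSERT INTO part_type VALUES (1, 'left side mirror')")
con.execute("INSERT INTO car VALUES (1,'VIN1','Ford','Mustang',2013,'white','2014-01-01')")
con.execute("INSERT INTO part VALUES (1, 1, 1, 'good', 'row 3', 40, 'in_stock')")
con.execute("INSERT INTO fitment VALUES (1, 'Ford', 'Mustang', 2013, 2014, 'repaint')")

# Find in-stock parts compatible with a 2014 Ford Mustang left side mirror:
QUERY = """
SELECT p.partid, p.location, p.price, f.tweak
FROM part p
JOIN part_type pt ON pt.parttypeid = p.parttypeid
JOIN fitment f    ON f.parttypeid = p.parttypeid
WHERE pt.name = 'left side mirror' AND p.status = 'in_stock'
  AND f.make = 'Ford' AND f.model = 'Mustang'
  AND 2014 BETWEEN f.year_from AND f.year_to
"""
rows = con.execute(QUERY).fetchall()
```

The same query run against other yards' databases (over the network) covers the "not available locally" case.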
CISC 7510X HW# 6 (due by Nth class;): Doing something useful with the data from HW4: Background: Pairs trading. Using the percentage returns table you built in HW4: Your task is to identify potential symbol pairs that have HIGH correlation, and are suitable for pairs trading. While everyone agrees that this strategy works, nobody agrees on the best way to identify correlation---especially when considered in relation to the rest of the market.
For this homework, feel free to use whatever you think is appropriate for correlation (if not sure, try Pearson: take the log of the gross daily return, i.e. log(1 + gain), and apply Pearson on top of that (the log of a raw negative percentage is undefined). Yes, you can do this in SQL.).
Submit 10 "best" symbol pairs, each of which trades at least ~$10m a day, suitable for pairs trading in December 2013 (yah, I know it's an old date). Along with the pairs, submit their correlation coefficients for previous year, and the month of December 2013. (assume you were trading $1m worth, and you traded those exact 10 pairs, how much would you have gained/lost during that period?). Also submit the sql code to get those 10 symbols from the dataset.
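Pearson's r for every symbol pair can be assembled from plain SUM/COUNT aggregates over a self-join on date. A sketch against SQLite through Python (the square root is done in Python only because the bundled SQLite may lack math functions; in PostgreSQL the whole computation, or just corr(), fits in one query):

```python
import sqlite3, math

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE prcnt_tbl(tdate INT, symbol TEXT, prcnt REAL)")
# toy data: B moves at exactly twice A's rate, so corr(A, B) should be 1
moves = [1.0, -0.5, 2.0, 0.5]
con.executemany("INSERT INTO prcnt_tbl VALUES (?,?,?)",
                [(d, 'A', x) for d, x in enumerate(moves)] +
                [(d, 'B', 2*x) for d, x in enumerate(moves)])

# One SQL pass per pair: self-join on date, collect the sums r is made of.
sums = con.execute("""
SELECT a.symbol, b.symbol, COUNT(*) AS n,
       SUM(a.prcnt)         AS sx,  SUM(b.prcnt)         AS sy,
       SUM(a.prcnt*a.prcnt) AS sxx, SUM(b.prcnt*b.prcnt) AS syy,
       SUM(a.prcnt*b.prcnt) AS sxy
FROM prcnt_tbl a
JOIN prcnt_tbl b ON a.tdate = b.tdate AND a.symbol < b.symbol
GROUP BY a.symbol, b.symbol
""").fetchall()

# r = (n*sxy - sx*sy) / sqrt((n*sxx - sx^2) * (n*syy - sy^2))
corrs = {(sa, sb): (n*sxy - sx*sy) /
                   math.sqrt((n*sxx - sx*sx) * (n*syy - sy*sy))
         for sa, sb, n, sx, sy, sxx, syy, sxy in sums}
```

Ranking `corrs` and filtering on dollar volume (avg close * volume) gives the candidate pairs.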
CISC 7510X HW# 7 (due by Nth class;): Doing something useful with the data from HW4: Background: portfolio theory. The gist is that you can lower risk by investing in things that have LOW correlation. While a single stock will go up and down, the *average* returns from say 20 will be a lot steadier---provided of course that they're not all correlated (don't move in the same direction at the same time). These are the kinds of funds your retirement account is invested in.
Build a portfolio of 20 symbols, each of which has daily average volume over $10m (avg taken over the last year), has paid dividends every year (no skipping) for the last 10 years, has returned at least a cumulative 2% a year for the last 15 years (including dividends/splits and stock price), and each of which has the lowest correlation with the rest of the symbols in your portfolio.
Calculate the return on those 20 symbols for 2013. Is that better or worse than S&P500 for same period? (You can use "SPY" symbol as stand in for S&P500) How about last 10 years? Last 20?
Submit 20 symbols, along with their aggregate 2013 return, compared to S&P500 return for the same time period. ...and the SQL code.
If you get pairs that have almost -1 correlation, then something is very wrong (you're likely not using profitable stocks; make sure all your stocks give a POSITIVE net yearly return of at least 2%---within that set, you shouldn't have any pairs with -1 correlation).
Note that most folks in the class should end up with more or less the same list (depending on how everyone defines correlation). Feel free to collaborate with classmates---but everyone must submit the homework (no group submissions).
Don't forget: write all code in SQL only (use analytical functions, etc.)... let the database do the data crunching. Break up large steps into smaller steps using temp tables. Test your query on a subset of symbols, etc.
Submit everything in an email; put "CISC 7510X HW7" in email subject.
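For intuition on the selection step only (the submission itself must still be all SQL, as noted above), one common heuristic is a greedy build: start with one candidate, then repeatedly add the symbol whose average correlation with the already-chosen set is lowest. A pure-Python sketch of that idea (the frozenset-keyed correlation map is an illustrative interface):

```python
def build_portfolio(symbols, corr, k=20):
    """Greedy low-correlation selection.
    corr maps frozenset({a, b}) -> correlation coefficient of pair (a, b)."""
    chosen = [symbols[0]]
    while len(chosen) < min(k, len(symbols)):
        # candidate with the lowest average correlation to the chosen set
        best = min((s for s in symbols if s not in chosen),
                   key=lambda s: sum(corr[frozenset({s, c})]
                                     for c in chosen) / len(chosen))
        chosen.append(best)
    return chosen
```

In SQL the same loop becomes repeated inserts into a temp "chosen" table, each step picking the MIN average correlation against it.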
CISC 7510X HW# 8: Given a database table such as: create table file(id int, parentid int, name varchar(1024), size int, type char(1)); write a single (recursive) database query to list FULL PATH of all files. [assume type is either "F" or "D" for file or directory]. Your query should give you similar output to unix command: "find . -type f".
Submit query in an email; put "CISC 7510X HW8" in email subject.
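A sketch of the recursive query, assuming root rows have a NULL parentid (run here against SQLite, whose WITH RECURSIVE syntax is close to PostgreSQL's):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE file(id INT, parentid INT,
                                 name VARCHAR(1024), size INT, type CHAR(1))""")
con.executemany("INSERT INTO file VALUES (?,?,?,?,?)", [
    (1, None, ".",     0, 'D'),
    (2, 1,    "etc",   0, 'D'),
    (3, 2,    "motd",  7, 'F'),
    (4, 1,    "a.txt", 3, 'F'),
])

# Walk from each root down, accumulating the path as we recurse;
# keep only files ('F'), mirroring `find . -type f`.
rows = con.execute("""
WITH RECURSIVE tree(id, path, type) AS (
  SELECT id, name, type FROM file WHERE parentid IS NULL
  UNION ALL
  SELECT f.id, t.path || '/' || f.name, f.type
  FROM file f JOIN tree t ON f.parentid = t.id
)
SELECT path FROM tree WHERE type = 'F' ORDER BY path
""").fetchall()
```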
CISC 7510X HW# 9: Write a program to perform a database backup. Generate a public/private key pair (using GnuPG, or anything else). Your backup program/script must run daily, back up a table in a database into a .csv.gz file (comma delimited, gzip compressed [do not use database specific binary formats]), and encrypt the backup using your PUBLIC key.
Note that you can use any language, database, utility, configuration, etc. (cron script or windows scheduler is ok). The key is that the backup is comma delimited, gzip compressed, AND encrypted with the public key---and is generated daily (without your intervention). Email me the program/script and instructions on how to set it up to run daily (for cron, I want the crontab line, etc.; for windows scheduler, I want the batch file and instructions on how to set it up in the scheduler).
CISC 7510X HW# 10: Download and install Spark. spark.apache.org. Port the code from CISC 7510X HW# 6 to run on Spark/Scala [run a tiny example using Spark/Scala on the same dataset: read either .csv or PostgreSQL via Spark].
Submit a Scala/Spark script (whatever you type in spark-shell) to solve HW6.
CISC 7512X (DB2) Homeworks
You should EMAIL me homeworks, alex at theparticle dot com. Start email subject with "CISC 7512X HW#". Homeworks without the subject line risk being deleted and not counted.
CISC 7512X HW# 1 (due by 2nd class;): Email me your name, preferred email address, IM account (if any), major, and year.
CISC 7512X HW# 2 (due by 3rd class;): For the below `bank' schema:
CISC 7512X HW# 3: Imagine you have a database table with columns: phoneid, time, gps_latitude, gps_longitude. Assume these records are logged approximately every few seconds for every phone. Your task is to detect speeding: Write a database query (in SQL) to find anyone whose *average* speed is between 90 and 200mph for at least a minute. If you can't write the SQL query, write detailed procedural pseudo code (assume input is coming from a comma delimited text file). Submit code via email, with subject "CISC 7512X HW3".
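If you end up going the procedural route, the sliding-window logic might look like this sketch (great-circle distance is the standard haversine formula; checking the first span that reaches one minute from each sample is one possible reading of "for at least a minute"):

```python
import math
from collections import defaultdict

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two GPS points."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp/2)**2 + math.cos(p1)*math.cos(p2)*math.sin(dl/2)**2
    return 2 * r * math.asin(math.sqrt(a))

def speeders(records, lo=90, hi=200, window=60):
    """records: (phoneid, epoch_seconds, lat, lon) tuples, any order.
    Returns phones whose average speed over some span of at least
    `window` seconds falls between lo and hi mph."""
    byphone = defaultdict(list)
    for pid, t, lat, lon in sorted(records, key=lambda r: (r[0], r[1])):
        byphone[pid].append((t, lat, lon))
    flagged = set()
    for pid, pts in byphone.items():
        # cumulative path distance up to each GPS sample
        cum = [0.0]
        for (t1, la1, ln1), (t2, la2, ln2) in zip(pts, pts[1:]):
            cum.append(cum[-1] + haversine_miles(la1, ln1, la2, ln2))
        for i in range(len(pts)):
            for j in range(i + 1, len(pts)):
                dt = pts[j][0] - pts[i][0]
                if dt >= window:  # first span reaching one minute
                    mph = (cum[j] - cum[i]) / (dt / 3600.0)
                    if lo <= mph <= hi:
                        flagged.add(pid)
                    break
    return flagged
```

The SQL version would use LAG() per phone ordered by time to get per-hop distance and duration, then a windowed sum over roughly a minute.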
CISC 7512X HW# 4: In the not-so-distant future, flying cars are commonplace---everyone on the planet got one. Yes, there are ~10 billion flying cars all over the globe. Each one logs its coordinates every 10 milliseconds, even when parked. Assume x,y,z coordinates, with z being altitude, and x,y, some cartesian equivalent of GPS. To avoid accidents, regulation states that no car may come within 10 feet of any other car while in the air (z > 0) for longer than 1 second. Cars can go really fast, ~500mph. YOUR TASK: write an algorithm and program to find all violators. Assume input is a HUGE file (10 billion cars logging "VIN,timestamp,x,y,z" every 10 milliseconds all-the-time).
Install Apache Hadoop. [hadoop]. Write a Hive query (or a series of queries), or a MapReduce program to find all violators (cars that are within 10 feet of other cars while in flight). Assume data is in the "cars" table in Hive (or the "cars.txt" file on HDFS). What is the running time of your algorithm? If it's O(N^2), can you make it run in O(N log N) time? (note that with this much data, N^2 is not practical; even N log N is a bit long). Using your 1 node Hadoop cluster, estimate the amount of resources this whole task will consume (to apply it to 10 billion cars), and put a dollar value on it (assuming it costs $0.10/hour to rent 1 node (machine), how much will your solution cost per day/month/year?); rationalize your answer. (note that you can't answer "I'll rent 1 node, and let it run until it's done."; You must process data at least as fast as it is being generated by all those billions of cars).
Good Hadoop installation guide. Hive installation is much simpler: just unzip, set HIVE_HOME, add the bin folder to PATH, and then just run "hive". Here are some tips on getting Hive running for the first time (links may be outdated).
Submit whatever you create to solve this problem (source code for map reduce tasks, or hive queries, etc.,). Note, your solution must run (on small dataset) on a 1-node hadoop cluster.
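The per-timestamp neighbor search is the part that must avoid O(N^2). A common trick is to bucket cars into grid cells the size of the threshold, so each car is compared only against cars in its own and adjacent cells, which is near-linear for realistic densities. A sketch for a single timestamp (a full solution would additionally require a pair to stay close across ~100 consecutive 10ms samples, and in MapReduce the cell id becomes the shuffle key):

```python
from collections import defaultdict
from itertools import product
from math import dist, floor

CELL = 10.0  # feet; cell size matches the proximity threshold

def close_pairs(points):
    """points: (vin, x, y, z) samples for ONE timestamp, airborne cars only.
    Returns unordered VIN pairs within 10 feet of each other."""
    grid = defaultdict(list)
    for vin, x, y, z in points:
        grid[(floor(x/CELL), floor(y/CELL), floor(z/CELL))].append((vin, x, y, z))
    pairs = set()
    for (cx, cy, cz), cars in grid.items():
        # any car within 10ft must lie in this cell or one of its 26 neighbors
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for vin2, x2, y2, z2 in grid.get((cx+dx, cy+dy, cz+dz), []):
                for vin, x, y, z in cars:
                    if vin < vin2 and dist((x, y, z), (x2, y2, z2)) <= CELL:
                        pairs.add((vin, vin2))  # vin < vin2 dedups the pair
    return pairs
```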
CISC 7512X HW# 5 (due by Nth class;): Install HBase on your cluster from HW4. You have a relational database from HW2:
You would like to port it to HBase. How would you organize the data to make it easy to answer HW2 questions using HBase? What would you use as keys? Do you need to store multiple copies of the data?
Outline pseudo code (please don't write actual java) that would answer the following questions using your design:
CISC 7512X HW# 6: Download and install Spark. spark.apache.org. Port the code from HW4 to run on Spark/Scala [run a tiny example using Spark/Scala].
Submit a Scala/Spark script (whatever you type in spark-shell) to solve HW4.
CISC 7512X HW# 7: Download ~100-1000 "news" articles (e.g. grab RSS feeds, http://www.cnn.com/services/rss/ then use wget or something to grab all the URLs in each of those feeds; don't limit yourself to just cnn, etc.). Write a script to generate an N by M matrix (N = number of documents, M = number of unique words in all those documents). Each entry of the matrix counts how many times a particular word appears in a particular document. Lowercase everything. For example, matrix[i][j] would be the count of how many times word 'j' appears in document 'i'. Use NMF (non-negative matrix factorization) to create two matrices, Nx10 and 10xM (10 being the number of `features'). For each feature, identify the most common word ("topic word"). For each document, identify the most common feature, then label the document with that "topic word". Submit your code, list of URLs *along with the label* (i don't want the actual html documents, just URLs and labels), along with a log of output. [if you can't get this to work, submit your best attempt at it... There are also Spark packages for NMF. Also, there's an implementation of NMF in pMatrix.java, though that one won't work on 1000x1000 matrix].
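A sketch of just the matrix-construction step in pure Python (the tokenizer regex is an assumption about what counts as a word; the NMF factorization itself would come from a library or your own multiplicative-update loop):

```python
import re
from collections import Counter

def term_doc_matrix(docs):
    """Build the N x M count matrix from a list of document strings.
    Returns (matrix, vocab) where matrix[i][j] counts vocab[j] in docs[i]."""
    tokenized = [re.findall(r"[a-z']+", d.lower()) for d in docs]
    vocab = sorted(set(w for doc in tokenized for w in doc))
    index = {w: j for j, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(tokenized):
        for w, c in Counter(doc).items():
            matrix[i][index[w]] = c
    return matrix, vocab
```

After factorizing matrix ~ W (Nx10) x H (10xM), the "topic word" for feature f is argmax over H[f], and each document i gets labeled by argmax over W[i].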
CISC 7512X HW# 8: Write an implementation of the k-Means algorithm in SQL. Imagine you have a table such as cust_attributes(custid,attributename,attributetype,attributevalue). You'd like to use only "numeric" values to cluster all of your customers into say 7 clusters. In other words, you'd like to generate another table with columns cust_cluster(custid,clusterid), where clusterid groups this customer together with other customers that have similar attribute values. Note that you'll need some mechanism of running the same query over and over again---you can do that via an external script, or use recursive queries to iterate. [before, you've used SQL as a query engine---in this homework you're using it as a computation engine].
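For reference, the plain procedural algorithm that the SQL must reproduce (this Python sketch is for understanding only; each of the two commented steps below becomes one SQL statement, repeated via your script or a recursive query):

```python
import random
from math import dist

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means over equal-length numeric tuples.
    Returns labels[i] in 0..k-1 assigning points[i] to a cluster."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # init: k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        # (in SQL: join points to centers, keep the MIN-distance center)
        labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                  for p in points]
        # update step: each center moves to the mean of its members
        # (in SQL: GROUP BY clusterid, AVG each attribute)
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(xs) / len(members)
                                   for xs in zip(*members))
    return labels
```

In the homework's schema, "point coordinates" come from pivoting the numeric rows of cust_attributes per custid.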