CISC 7510X (DB1) Homeworks
You should EMAIL me homeworks, alex at theparticle dot com. Start email subject with "CISC 7510X HW#". Homeworks without the subject line risk being deleted and not counted.
CISC 7510X HW# 1 (due by 2nd class;): Email me your name, prefered email address, IM account (if any), major, and year.
CISC 7510X HW# 2 (due by 3rd class;): For the below `store' schema:
Also, install PostgreSQL.
CISC 7512X (DB2) Homeworks
You should EMAIL me homeworks, alex at theparticle dot com. Start email subject with "CISC 7512X HW#". Homeworks without the subject line risk being deleted and not counted.
CISC 7512X HW# 1 (due by 2nd class;): Email me your name, prefered email address, IM account (if any), major, and year.
CISC 7512X HW# 2 (due by 3rd class;): For the below `bank' schema:
CISC 7512X HW# 3: Imagine you have a database table with columns: phoneid, time, gps_latitude, gps_longitude. Assume these records are logged approximately every few seconds for every phone. Your task is to detect speeding: Write a database query (in SQL) to find anyone whose *average* speed is between 90 and 200mph for at least a minute. If can't write SQL query, write detailed procedural speudo code (assume input is coming from a comma delimited text file). Submit code via email, with subject "CISC 7512X HW3".
CISC 7512X HW# 4: In the not-so-distant future, flying cars are commonplace---everyone on the planet got one. Yes, there are ~10 billion flying cars all over the globe. Each one logs its coordinates every 10 milliseconds, even when parked. Assume x,y,z coordinates, with z being altitude, and x,y, some cartesian equivalent of GPS. To avoid accidents, regulation states that no car can be next to any other car by more than 10 feet while in the air (z > 0) for longer than 1 second. Cars can go really fast, ~500mph. YOUR TASK: write an algorithm and program to find all violators. Assume input is a HUGE file (10 billion cars logging "VIN,timestamp,x,y,z" every 10 milliseconds all-the-time).
Install Apache Hadoop. [hadoop]. Write a Hive query (or a series of queries), or a MapReduce program to find all violators (cars that are next to other cars while in flight). Assume data is in "cars" table in Hive (or "/app/cars/data" file on HDFS). What is the running time of your algorithm? If it's O(N^2), can you make it run in O(N log N) time? (note that with this much data, N^2 is not practical, even N log N is a bit long). Using your 1 node Hadoop cluster, estimate the amount of resources this whole task will consume (to apply it on 10 billion cars), and put a dollar amount value (assuming it costs $0.10/hour to rent 1 node (machine); how much will your solution cost per day/month/year?); rationalize your answer. (note that you can't answer "I'll rent 1 node, and let it run until it's done."; You must process data at least as fast as it is being generated by all those billions of cars).
Good Hadoop installation guide. Hive installation is much simpler, just unzip, set HIVE_HOME, add bin folder to PATH, and then just run "hive". Here's some tips on trying to get Hive running for first time (links may be outdated).
Submit whatever you create to solve this problem (source code for map reduce tasks, or hive queries, etc.,). Note, your solution must run (on small dataset) on a 1-node hadoop cluster.