CISC 7510X (DB1) Homeworks

You should EMAIL me homeworks, alex at theparticle dot com. Start email subject with "CISC 7510X HW#". Homeworks without the subject line risk being deleted and not counted.

CISC 7510X HW# 1 (due by 3rd class): For the below `store' schema:
Also, install PostgreSQL.

CISC 7510X HW# 2 (due by 4th class): Install PostgreSQL
Email the query text.

CISC 7510X HW# 3 (due by Nth class): Write a command line program to "join" .csv files. Use any programming language you're comfortable with (Python suggested). Your program should work similarly to the unix "join" utility (google for it). Unlike the unix join, your program will not require files to be sorted on the key. Your program must also accept the "type" of join to use---merge join, inner loop join, or hash join, etc. Assume that the first column is the join key---or you can accept the column number as a parameter (like the unix join command). Do not use libraries with join capabilities (e.g. Pandas, Dataset, etc.), and do not pass your files to the unix "join" command; that defeats the purpose of this homework. Use lists, hashes, your own data structures, etc., not a library that's essentially a mini-database. Test your program on "large" files (e.g. make sure it wouldn't blow up on one million records; do not store everything in memory). Submit source code for the program.

Also... load all files in ctsdata.20140211.tar (link on the left) into Oracle or Postgres (or whichever works for you). The format of these files is: cts(tdate,symbol,open,high,low,close,volume), splits(tdate,symbol,post,pre), dividend(tdate,symbol,dividend). Submit (email) whatever commands/files you used to load the data into whatever database you're using, as well as the raw space usage of the tables in your database.

CISC 7510X HW# 4 (due by Nth class): If you haven't done so already, load all files in ctsdata.20140211.tar (link on the left) into Oracle or PostgreSQL (or whichever works for you; PostgreSQL recommended!). The format of these files is: cts(tdate,symbol,open,high,low,close,volume), splits(tdate,symbol,post,pre), dividend(tdate,symbol,dividend). Submit (email) whatever commands/files you used to load the data into whatever database you're using, as well as the raw space usage of the tables in your database (this was part of the previous homework).
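Going back to the HW# 3 join program: the hash-join variant can be sketched roughly as below. This is only a minimal illustration, not a reference solution. The function name `hash_join` and the parameters `left_key` and `right_key` are made up for this sketch, and it assumes the left file is the smaller one and fits in memory while the right file is streamed.

```python
import csv
from collections import defaultdict


def hash_join(left_path, right_path, out, left_key=0, right_key=0):
    """Inner-join two CSV files on the given key columns using a hash table.

    Assumption: the left file fits in memory; the right file is streamed.
    """
    # Build phase: index every left row by its join key.
    table = defaultdict(list)
    with open(left_path, newline="") as f:
        for row in csv.reader(f):
            if row:
                table[row[left_key]].append(row)

    # Probe phase: read the right file one row at a time and emit matches.
    writer = csv.writer(out)
    with open(right_path, newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue
            # Drop the key from the right row so it is not repeated in the output.
            rest = row[:right_key] + row[right_key + 1:]
            for left_row in table.get(row[right_key], ()):
                writer.writerow(left_row + rest)
```

Calling `hash_join("a.csv", "b.csv", sys.stdout)` (file names are placeholders, and `sys` would need importing) writes the joined rows to standard output. Neither input needs to be sorted. If neither file fits in memory, one option is to first split both files into partitions by a hash of the key and join the partitions pairwise; merge join and loop join would need their own functions.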
After loading the data, using a create-table-as SQL statement, create another table DAILY_PRCNT, with fields: TDATE,SYMBOL,PRCNT which will have the daily percentage gain/loss adjusted for dividends and splits. Do NOT write procedural code (Java, C#, C/C++, etc.) for this homework (all code must be SQL, etc.). HINT: MSFT (Microsoft) on 2004-11-12 closed at 29.97. HINT: splits, MSFT did a 1 to 2 split on 2003-02-18. During a split, each share of a company gets turned into several shares of lower value each. The total value held by investors is not changed. Submit the query used to construct the DAILY_PRCNT table (e.g. "create table DAILY_PRCNT as select ..."). We'll do more stuff with this DAILY_PRCNT dataset in subsequent homeworks---so don't put it off and get it done on time.

CISC 7510X HW# 5: Your buddy stops over for lunch and tells you about their wonderful idea of building software for junk yards. Junk yards are places that acquire cheap old cars and sell individual parts---a $1k old junky car may have 100 parts in it that each can be sold for $20-$50, etc. A typical junk yard may have dozens to hundreds of old cars, and if you need a part, you drive by and ask... the attendant would know what car/part you're looking for and would know whether they have anything compatible in the inventory (e.g. a "left side mirror from a white 2013 Ford Mustang" may be repainted to be compatible with a red 2014 Ford Mustang, etc.). Now, the attendant would likely know these things (they have enormous domain knowledge). But it's still a major inventory hassle to find compatible parts... Your buddy has an idea of building such an `inventory management system' for junk yards... so anyone can start a junk yard, and junk yards can get much bigger. Maybe even hook up with eBay/Amazon for used parts! (You can't sell used car parts on eBay unless you know what parts you have!)
The idea is that the customer would drive in, type in the car/part they're looking for, and the system would tell them if there's a compatible car/part available (and where it is), or one that can be made compatible with minor tweaks (such as repainting, etc.). If the part is not available locally, the software should be internet enabled to find compatible parts in other junk yards running the same software. Your buddy estimates a license per junk yard at $20k, with $2k/year maintenance, and your buddy thinks he can immediately sell it to at least 10 junk yards near major city centers, and perhaps a few hundred over the next few years. So now you have a case for a lucrative business... your task is to build it.

Go through the process of designing this inventory system. What are the objects? What are the events? Create a database schema, etc. How would the search process work? (e.g. go through the motions of: a new junky car arrives, how is it inventoried? a new customer arrives looking for a part, how does the system find a compatible part? where can humans be eliminated from this process?). Submit a writeup of the design (nothing too complicated, just a 1 page description---something that would convince me that you're the right contractor for this project---that you know what you're doing). While you can use ChatGPT for this, it is almost certain that ChatGPT will take your job if you do---the easiest jobs to eliminate are those that ChatGPT can do easily. Submit the database schema (DDL, create table statements), and the query statements/process to find a compatible part.

CISC 7512X (DB2) Homeworks

You should EMAIL me homeworks, alex at theparticle dot com. Start email subject with "CISC 7512X HW#". Homeworks without the subject line risk being deleted and not counted.

CISC 7512X HW# 1 (due by 3rd class): For the below `bank' schema:
CISC 7512X HW# 2 (due by Nth class): Your buddy stops over for lunch and tells you about this wonderful idea of building apps for phones (for profit!). The gist of the idea: ride sharing! (``Urgh, not again!'', you think). Unlike other ride-sharing ideas, this app is designed for the usual commuter who uses the car to get to work---and is willing to share the ride with someone else to lower their costs. Going out of the way to pick up folks is out of the question (the driver also needs to get to work themselves). Also, the driver prefers the fastest possible route (highways, etc.) even if it means not picking up someone. Since everyone (including the driver) is benefiting from the ride, the goal is to lower the commute cost for everyone (including driver and passenger [the passenger would use their own car if it costs them less]). The business takes a small slice of the money saved (so it's a win-win for everyone involved). Also, folks will be able to pay for the ride in bitcoins. This all seems like crazy talk until your buddy mentions there's a potential $10m investment (from the same folks who seeded Uber), and all they need from you is a working prototype and a write-up of the architecture by next week.

Your task: Design and build a database to run this business. What tables would you need? What events would you capture? Etc. Write up what interface and functionality would be needed to interact with the database. Make the investors see that this is a real viable idea that will actually work. Produce a business plan, design document, whitepaper, architecture, prototype, etc., whatever it takes to get that investment.

CISC 7512X HW# 3: below is a schema for an HR database:
CISC 7512X HW# 4: In the not-so-distant future, flying cars are commonplace---everyone on the planet got one. Yes, there are ~10 billion flying cars all over the globe. Each one logs its coordinates every 10 milliseconds, even when parked. Assume x,y,z coordinates, with z being altitude, and x,y some cartesian equivalent of GPS. To avoid accidents, regulation states that no car can be within 10 feet of any other car while in the air (z > 0) for longer than 1 second. Cars can go really fast, ~500mph.

YOUR TASK: write an algorithm and program to find all violators. Assume input is a HUGE file (10 billion cars logging "VIN,timestamp,x,y,z" every 10 milliseconds all-the-time). Install Apache Hadoop. Write a Hive query (or a series of queries), or a MapReduce program to find all violators (cars that are next to other cars while in flight). Assume data is in the "cars" table in Hive (or the "/user/hive/warehouse/cars" file on HDFS). What is the running time of your implementation? If it's O(N^2), can you make it run in O(N log N) time? (note that with this much data, N^2 is not practical, even N log N is a bit long). Using your 1 node Hadoop cluster, estimate the amount of resources this whole task will consume (to apply it on 10 billion cars), and put a dollar amount value on it (assuming it costs $0.10/hour to rent 1 node (machine), how much will your solution cost per day/month/year?); rationalize your answer. (note that you can't answer "I'll rent 1 node, and let it run until it's done."; you must process data at least as fast as it is being generated by all those billions of cars).

Submit whatever you create to solve this problem (source code for map reduce tasks, or hive queries, etc.). Note, your solution must run (on a small dataset) on a 1-node hadoop cluster. You may ``ignore'' the GPS coordinate system, and simply assume those are cartesian x,y,z coordinates.
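One way to avoid comparing every car with every other car is to bucket readings into a grid, so each reading is only compared with readings in its own and adjacent cells. The sketch below illustrates that idea in plain Python on a small in-memory list; it is not a Hive or MapReduce solution. It reads the rule as "closer than 10 feet for more than 1 second", and assumes coordinates in feet and integer millisecond timestamps on a fixed 10 ms tick. The names `close_pairs`, `violators`, `RADIUS_FT` and `TICK_MS` are invented for this sketch.

```python
import math
from collections import defaultdict
from itertools import product

RADIUS_FT = 10.0  # separation limit from the assignment (assumed to be in feet)
TICK_MS = 10      # logging interval from the assignment (assumed integer ms)


def close_pairs(records, radius=RADIUS_FT):
    """Return {(timestamp, vin_a, vin_b)} for airborne cars closer than radius.

    records: iterable of (vin, timestamp, x, y, z) tuples.
    """
    # Bucket airborne readings by timestamp and by grid cell of side `radius`.
    cells = defaultdict(list)
    for vin, ts, x, y, z in records:
        if z > 0:
            key = (ts, math.floor(x / radius), math.floor(y / radius),
                   math.floor(z / radius))
            cells[key].append((vin, x, y, z))

    found = set()
    for (ts, cx, cy, cz), cars in cells.items():
        # Two cars closer than `radius` must sit in the same or adjacent cells.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            neighbours = cells.get((ts, cx + dx, cy + dy, cz + dz), ())
            for vin_b, bx, by, bz in neighbours:
                for vin_a, ax, ay, az in cars:
                    # vin_a < vin_b reports each pair once and skips self-pairs.
                    if vin_a < vin_b and math.dist((ax, ay, az), (bx, by, bz)) < radius:
                        found.add((ts, vin_a, vin_b))
    return found


def violators(records, radius=RADIUS_FT, max_ms=1000):
    """Return VIN pairs that stay too close for longer than max_ms without a gap."""
    by_pair = defaultdict(list)
    for ts, a, b in close_pairs(records, radius):
        by_pair[(a, b)].append(ts)

    result = set()
    for pair, stamps in by_pair.items():
        stamps.sort()
        run_start = prev = stamps[0]
        for ts in stamps[1:]:
            if ts - prev > TICK_MS:  # a missing tick ends the current run
                run_start = ts
            prev = ts
            if prev - run_start > max_ms:
                result.add(pair)
                break
    return result
```

When cars are spread out, each cell holds few readings and the work grows roughly with the number of readings rather than its square. In a MapReduce or Hive setting, the (timestamp, cell) key could play the role of the grouping key, with each reading also sent to its neighbouring cells. For the cost estimate, note the input rate: 10 billion cars at 100 readings per second each is 10^12 records per second.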