CISC 7700X - Introduction to Data Science

CISC 7700X HW# 1 (due by 2nd class): Using the Iris dataset, build a kNN model to identify the species of a flower given sepal_length, sepal_width, petal_length, and petal_width. Feel free to use whatever language/tool you are comfortable with. I encourage you to write C/C++/Java/C#/SQL/Python code. You may also use Excel, Weka, Colab, or whatever other library/tool you find. Submit the model code via email.
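
As a starting point, here is a minimal kNN sketch in plain Python. It assumes Euclidean distance and a simple majority vote, neither of which the assignment prescribes. The function name `knn_predict` and the few rows in `toy_train` are illustrative only; a real solution would load the full Iris dataset.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict a label for `query` from `train`, a list of (features, label) pairs."""
    # Keep the k training rows closest to the query (Euclidean distance).
    neighbors = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    # Majority vote among the labels of those k neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Illustrative rows in (sepal_length, sepal_width, petal_length, petal_width) order.
toy_train = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
]

print(knn_predict(toy_train, (5.0, 3.4, 1.5, 0.2), k=3))
```

The choice of k and of the distance measure both affect the result, so it is worth trying a few values and checking accuracy on held-out rows.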

CISC 7700X HW# 2: Continuing with the Iris dataset, plot the histograms for each of the attributes: sepal_length, sepal_width, petal_length, petal_width. Find the average and standard deviation of sepal_length, sepal_width, petal_length, and petal_width for each label. Find the median and IQR of the same four attributes for each label. Use the bootstrap method to find error bounds on all of the above.
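
One possible way to get bootstrap error bounds is the percentile method: resample the data with replacement many times, recompute the statistic each time, and read off the lower and upper percentiles. The sketch below assumes that method and a 95% interval. The names `bootstrap_interval` and `iqr`, the resample count, and the sample values are all illustrative.

```python
import random
import statistics

def bootstrap_interval(values, stat, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `stat` computed over `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Recompute the statistic on n_resamples samples drawn with replacement.
    estimates = sorted(
        stat([rng.choice(values) for _ in range(n)]) for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

# Made-up measurements standing in for one attribute of one label.
sample = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]
for name, stat in [("mean", statistics.mean), ("stdev", statistics.stdev),
                   ("median", statistics.median), ("iqr", iqr)]:
    print(name, stat(sample), bootstrap_interval(sample, stat))
```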

CISC 7700X HW# 3: We have a labeled training data set: hw3.data1.csv.gz.

Thinking of a linear model, we come up with:

y = 24*column1 + -15*column2 + -38*column3 + -7*column4 + -41*column5 + 35*column6 + 0*column7 + -2*column8 + 19*column9 + 33*column10 + -3*column11 + 7*column12 + 3*column13 + -47*column14 + 26*column15 + 10*column16 + 40*column17 + -1*column18 + 3*column19 + 0*column20 + -6

If y > 0, predict 1; otherwise predict -1.

What is the accuracy? Calculate the confusion matrix for this model. If the cost of a false negative is $1000 and the cost of a false positive is $100 (and $0 for an accurate answer), what is the expected economic gain?
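
The bookkeeping can be sketched as follows, assuming labels are coded as 1 and -1 as in the model above. Treating "expected economic gain" as the negative of the average cost per instance is an assumption about the intended definition. The function names and the five example labels are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for labels coded as 1 and -1."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["tp" if a == 1 else "fp"] += 1
        else:
            counts["tn" if a == -1 else "fn"] += 1
    return counts

def accuracy(counts):
    """Fraction of instances classified correctly."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())

def expected_gain(counts, fn_cost=1000, fp_cost=100):
    """Average gain per instance (assumed definition): negative mean cost."""
    total_cost = counts["fn"] * fn_cost + counts["fp"] * fp_cost
    return -total_cost / sum(counts.values())

# Made-up labels, not taken from hw3.data1.csv.gz.
actual = [1, 1, -1, -1, 1]
predicted = [1, -1, -1, 1, 1]
counts = confusion_counts(actual, predicted)
print(counts, accuracy(counts), expected_gain(counts))
```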

How can we tweak the model to increase economic gain? Come up with a model that maximizes economic gain (approximations are OK; try guesstimating a few possibilities in a spreadsheet, etc.).
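
One tweak worth trying (an assumption, not the only option) is to keep the weights fixed and move the decision threshold, since a false negative costs ten times as much as a false positive. The sketch below sweeps candidate thresholds over the model's raw scores y and keeps the cheapest one. The names `total_cost` and `best_threshold` and the example scores are hypothetical.

```python
def total_cost(scores, labels, threshold, fn_cost=1000, fp_cost=100):
    """Total cost when predicting 1 if score > threshold, else -1."""
    cost = 0
    for score, actual in zip(scores, labels):
        predicted = 1 if score > threshold else -1
        if predicted == -1 and actual == 1:
            cost += fn_cost   # missed positive
        elif predicted == 1 and actual == -1:
            cost += fp_cost   # false alarm
    return cost

def best_threshold(scores, labels):
    """Try a threshold below every score, then each observed score in turn."""
    candidates = [min(scores) - 1] + sorted(set(scores))
    return min(candidates, key=lambda t: total_cost(scores, labels, t))

# Made-up scores and labels for illustration.
scores = [-5, -2, 1, 3, 8]
labels = [-1, 1, -1, 1, 1]
print(total_cost(scores, labels, 0))          # cost at the original cutoff of 0
t = best_threshold(scores, labels)
print(t, total_cost(scores, labels, t))       # cheapest cutoff found by the sweep
```

Shifting the threshold is equivalent to changing the constant term of the linear model; adjusting individual weights is a separate, larger search.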

Email the numbers and the steps you used to calculate things (you can do most of this homework in a spreadsheet [Excel?], but I highly encourage you to write code---learn Python if not sure where to start).

CISC 7700X HW# 4: Using data from anywhere on the internet, take the previous 1 or 2 years of data (excluding the latest quarter!) and build linear [ y = a+b*x ], logarithmic [ y = a+b*log(x) ], exponential [ y = b*exp(a*x) ], and power curve [ y = b*x^a ] models on revenue, earnings, and dividends for symbols IBM, MSFT, AAPL, GOOG, META, PG, GE.

Which model works best for which metric/symbol? Show with numbers (e.g., r-squared score). Read through: Coefficient of determination.
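
A minimal sketch of fitting the four curves and scoring them with r-squared follows. It assumes x is the quarter index (1, 2, 3, ...) and that all x and y values are positive, which the logarithms require. The exponential and power curves are fitted by least squares on log-transformed values, which is a common shortcut rather than a true least-squares fit in the original units. The function names and the synthetic data are illustrative.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def r_squared(ys, preds):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def fit_models(xs, ys):
    """Return {model name: (predict function, r-squared in original units)}."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    a1, b1 = fit_line(xs, ys)        # linear:      y = a + b*x
    a2, b2 = fit_line(log_x, ys)     # logarithmic: y = a + b*log(x)
    c3, a3 = fit_line(xs, log_y)     # exponential: log(y) = log(b) + a*x
    c4, a4 = fit_line(log_x, log_y)  # power:       log(y) = log(b) + a*log(x)
    predictors = {
        "linear": lambda x: a1 + b1 * x,
        "logarithmic": lambda x: a2 + b2 * math.log(x),
        "exponential": lambda x: math.exp(c3) * math.exp(a3 * x),
        "power": lambda x: math.exp(c4) * x ** a4,
    }
    return {name: (f, r_squared(ys, [f(x) for x in xs])) for name, f in predictors.items()}

# Synthetic series, not real company data: eight quarters of exponential growth.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2 * math.exp(0.3 * x) for x in xs]
results = fit_models(xs, ys)
best = max(results, key=lambda name: results[name][1])
print(best, results[best][1], results[best][0](9))  # prediction for quarter 9
```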

Using the best model for each metric, make a prediction for the "next quarter" revenue, earnings, and dividends. Remember, you didn't use the last number to build your models. Compare your model's prediction to the last quarter's number. What's the error? [hint]

Note: I used to suggest stockrow.com, but it seems they no longer let you get the data without payment. Same story with finance.yahoo.com. This data is available in LOTS of places, though. SEC EDGAR has it for free, but you'll need to aggregate it across multiple reports. See if you can find another source for this data. It's public, it's out there, so find it. Or just grab the last 8 quarterly reports (the 10-Q reports) from SEC EDGAR.

CISC 7700X HW# 5: Using data from spambase, build a Naive Bayes email classifier. Nothing too fancy, just a training module and a classifier module. Submit the code and the accuracy you get on the spambase dataset.
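
The two modules might be sketched as below. Because the spambase attributes are numeric, this sketch assumes a Gaussian likelihood per feature; the assignment does not prescribe that, and binning the features or using another likelihood is just as valid. The names `train_gaussian_nb` and `classify`, the small variance floor, and the four toy rows are illustrative.

```python
import math
from collections import defaultdict

def train_gaussian_nb(rows, labels):
    """Training module: per-class log prior, feature means, and feature variances."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # Small floor keeps the variance nonzero for constant features.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(rows)), means, variances)
    return model

def classify(model, row):
    """Classifier module: pick the class with the highest log posterior."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
    return max(model, key=log_posterior)

# Toy data standing in for spambase rows (features only; labels kept separately).
rows = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
labels = [0, 0, 1, 1]
model = train_gaussian_nb(rows, labels)
print(classify(model, [0.05, 0.05]), classify(model, [0.95, 0.95]))
```

Working in log space avoids numeric underflow when many per-feature probabilities are multiplied together.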

CISC 7700X HW# 6: Use the hw6 data to build a classification model. The last column in the dataset is the label. Randomly split the dataset into 70% training instances and 30% test instances. Construct a classifier on the training data, and report the accuracy results using the test dataset. Feel free to use any classifier (kNN, linear, etc.). Submit the code, a short description of your model, accuracy, etc.

CISC 7700X HW# 7 (due by Nth class):

Run your model from HW6 on the MNIST dataset (http://yann.lecun.com/exdb/mnist/). Use only the "digits" datasets. What accuracy do you get on MNIST when you train on the "train" dataset and test on the "test" dataset? Submit code/model and accuracy. You can also download the files here (link on left menu).