Homeworks

HW1 (20140910): Download some big collection of text. I highly recommend Project Gutenberg (google for it; you can download the entire DVD). Part 1: Write a parser/reader that will collect probabilities of every word following any other word. Convert everything to lowercase. Punctuation breaks the sequence (e.g. end of sentence, etc.). Part 2: Using the probabilities from part 1, generate text... start with a random word, then randomly (weighted by the probabilities) pick the following word, and continue. Email the code for the project and a sample of the generated output. Other things to experiment with: gather probabilities of any word following a pair or triplet of words, etc. (Sketch below.)

HW2 (20140917): Given the points: (67.53,241.95), (11.75,68.29), (20.87,92.94), (3.51,37.16), (74.29,258.33), (68.49,242.46), (4.25,37.04), (34.97,137.55), (17.29,86.79), (76.56,270.25), (39.28,154.74), (97.28,336.08), (35.35,143.59), (68.79,242.51), (17.84,85.56), (76.38,268.30), (12.90,64.93), (80.57,280.01), (82.70,289.04), (28.38,116.09), fit a line, a 3rd-degree polynomial, a 5th-degree polynomial, an exponential, and a power function. I highly recommend you write your own matrix inverter (Numerical Recipes in C is a great reference for that). You can also use Octave, R, etc. Submit your code along with the equations for the line, polynomials, exponential, and power function. The above was generated via (but please ignore this when doing the homework): perl -e'for(1..20){ $x=rand(100); $y=3.14*$x+23+rand(10); printf("(%.2f,%.2f), ",$x,$y) }; print "\n"' (Sketch below.)

HW3 (): Create a neural network with 8 inputs, 3 hidden units, and 8 outputs. Your inputs are 8 binary strings: 00000001, 00000010, 00000100, 00001000, 00010000, 00100000, 01000000, 10000000. Your expected outputs match the inputs. In other words, you're giving the network 00000010 and expecting it to produce 00000010, etc. You should use backpropagation for training, but feel free to use whatever other method works. After successful training, your network should NOT make any mistakes (you give it 00000010 and it always outputs 00000010, etc.). Now you can cut away the last layer, and you end up with a network that turns your input string into a 3-bit binary representation. Note that 00000010 will not necessarily correspond to 010 ('2' in binary). Submit your code along with a log of output... (Sketch below.)

HW4 (): Write a program to do naive Bayes classification of email. Create 2 folders, "spam" and "notspam", and fill them with text of the respective categories. You can use your own spam/notspam emails, or just make up some text that seems like it fits these categories. Write a program to train your naive Bayes classifier on the text in the folders. Note that you're estimating the P(D|C) probabilities... you already know the class (which folder the text is in). Now, given a previously unseen document, use Bayes' rule to find P(C|D)... and assign either the spam or notspam tag to the new document. Submit your code, training data, etc., along with a log of output... (Sketch below.)
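For HW1, a minimal bigram sketch in Python (the file name corpus.txt, the tokenization regex, and the 50-word output length are my own choices; raw counts are stored and the probability weighting is done at sampling time, which is equivalent):

    import random, re
    from collections import defaultdict

    # read the corpus, lowercase it, and let punctuation break the sequence
    text = open("corpus.txt", encoding="utf-8", errors="ignore").read().lower()
    sentences = re.split(r'[.!?;:,"()]+', text)

    counts = defaultdict(lambda: defaultdict(int))   # counts[w1][w2] = times w2 followed w1
    for s in sentences:
        words = re.findall(r"[a-z']+", s)
        for w1, w2 in zip(words, words[1:]):
            counts[w1][w2] += 1

    def next_word(w):
        # pick a follower of w, weighted by how often it followed w
        followers = list(counts[w])
        weights = [counts[w][f] for f in followers]
        return random.choices(followers, weights=weights)[0]

    word = random.choice(list(counts))
    out = [word]
    for _ in range(50):
        if word not in counts:                       # dead end: restart at a random word
            word = random.choice(list(counts))
        word = next_word(word)
        out.append(word)
    print(" ".join(out))

For the pair/triplet variant, the same idea works with counts keyed on tuples of the last two or three words.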
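For HW2, a sketch using numpy's polyfit (numpy is my own choice here; the assignment equally allows Octave, R, or a hand-written matrix inverter, and the exponential and power fits below use the usual log-linearization shortcut rather than a true nonlinear least-squares fit):

    import numpy as np

    pts = [(67.53,241.95), (11.75,68.29), (20.87,92.94), (3.51,37.16),
           (74.29,258.33), (68.49,242.46), (4.25,37.04), (34.97,137.55),
           (17.29,86.79), (76.56,270.25), (39.28,154.74), (97.28,336.08),
           (35.35,143.59), (68.79,242.51), (17.84,85.56), (76.38,268.30),
           (12.90,64.93), (80.57,280.01), (82.70,289.04), (28.38,116.09)]
    x = np.array([p[0] for p in pts])
    y = np.array([p[1] for p in pts])

    print("line   :", np.polyfit(x, y, 1))   # coefficients, highest power first
    print("cubic  :", np.polyfit(x, y, 3))
    print("quintic:", np.polyfit(x, y, 5))

    # exponential y = a*exp(b*x): linearize as ln(y) = ln(a) + b*x
    b, ln_a = np.polyfit(x, np.log(y), 1)
    print("exponential: y = %.4f * exp(%.4f x)" % (np.exp(ln_a), b))

    # power y = a*x^b: linearize as ln(y) = ln(a) + b*ln(x)
    b, ln_a = np.polyfit(np.log(x), np.log(y), 1)
    print("power      : y = %.4f * x^%.4f" % (np.exp(ln_a), b))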
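For HW3, a sketch of an 8-3-8 sigmoid network trained with plain batch backpropagation on squared error (the learning rate, epoch count, and random seed are guesses that should reach zero mistakes; if they don't, re-run with a different seed or more epochs):

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.eye(8)                        # the 8 one-hot input patterns
    T = X.copy()                         # targets equal the inputs

    W1 = rng.normal(0, 0.5, (8, 3)); b1 = np.zeros(3)
    W2 = rng.normal(0, 0.5, (3, 8)); b2 = np.zeros(8)

    def sig(z):
        return 1.0 / (1.0 + np.exp(-z))

    lr = 2.0
    for epoch in range(20000):
        H = sig(X @ W1 + b1)             # hidden activations (8x3)
        Y = sig(H @ W2 + b2)             # outputs (8x8)
        dY = (Y - T) * Y * (1 - Y)       # output-layer delta (squared-error loss)
        dH = (dY @ W2.T) * H * (1 - H)   # hidden-layer delta
        W2 -= lr * H.T @ dY; b2 -= lr * dY.sum(0)
        W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(0)

    H = sig(X @ W1 + b1); Y = sig(H @ W2 + b2)
    print("hidden codes (rounded):\n", np.round(H))   # the 3-bit representation
    print("mistakes:", int((Y.argmax(1) != T.argmax(1)).sum()))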
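For HW4, a word-count naive Bayes sketch with add-one smoothing (the folder names come from the assignment; the smoothing choice and the name of the unseen test file, unseen.txt, are mine):

    import os, re, math
    from collections import Counter

    def words(path):
        text = open(path, encoding="utf-8", errors="ignore").read().lower()
        return re.findall(r"[a-z']+", text)

    counts, docs, vocab = {}, {}, set()
    for c in ("spam", "notspam"):                    # training: count words per class
        counts[c] = Counter()
        docs[c] = 0
        for fname in os.listdir(c):
            counts[c].update(words(os.path.join(c, fname)))
            docs[c] += 1
        vocab |= set(counts[c])

    def classify(path):
        w = words(path)
        best, best_score = None, -math.inf
        for c in ("spam", "notspam"):
            total = sum(counts[c].values())
            # log P(C) + sum of log P(word|C), with add-one smoothing
            score = math.log(docs[c] / sum(docs.values()))
            for word in w:
                score += math.log((counts[c][word] + 1) / (total + len(vocab)))
            if score > best_score:
                best, best_score = c, score
        return best

    print(classify("unseen.txt"))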
HW5 (): Download ~100-1000 "news" articles (e.g. grab RSS feeds, http://www.cnn.com/services/rss/ then use wget or something to grab all the URLs in each of those feeds; don't limit yourself to just CNN, etc.). Write a script to generate an N by M matrix (N = number of documents, M = number of unique words in all those documents). Each entry of the matrix is the number of times a particular word appears in a particular document; lowercase everything. For example, matrix[i][j] would be the count of how many times word 'j' appears in document 'i'. Use NMF (non-negative matrix factorization) to create two matrices, Nx10 and 10xM (10 being the number of 'features'). For each feature, identify the most common word (the "topic word"). For each document, identify the most common feature, then label the document with that feature's "topic word". Submit your code and the list of URLs along with their labels (I don't want the actual HTML documents, just the URLs), along with a log of output... (Sketch below.)

HW6 (): Come up with a Hopfield network example program, preferably done in Javascript. Let the user specify ~5 memories for the network to learn, then let the user specify a pattern, and have the network retrieve one of the memories. (Sketch below.)

HW7 (): Google for a simple example use of a Hidden Markov Model (e.g. there's a tiny example program on Wikipedia), and implement it (create an actual computer program that you can run). Using that code, train a 2-state HMM with A..Z observables: take any text document and feed it to that simple HMM implementation (1 character at a time)... Submit your code, along with the HMM probability matrices. (Hint: your HMM will learn to differentiate vowels vs. consonants.) (Sketch below.)

HW8 (): Write a program to take ANY database table (or .csv file) and calculate a decision tree for EVERY (non-numeric) field. (Sketch below.)

HW9 (): Write a program to take ANY database table (or .csv file) and use a boosting technique to build a classifier for EVERY column (same as HW8, except instead of a decision tree, use boosting; note you can use the HW8 classifier as a weak learner in this homework). (Sketch below.)
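For HW5, a sketch of the factorization and labeling step only, using the Lee-Seung multiplicative NMF updates; the count matrix V and the vocabulary are faked with random numbers just so the sketch runs, whereas in the real homework they come from counting words in your downloaded articles:

    import numpy as np

    rng = np.random.default_rng(0)
    V = rng.integers(0, 5, size=(100, 400)).astype(float)   # stand-in for the NxM count matrix
    words = ["word%d" % j for j in range(V.shape[1])]        # stand-in for the M unique words

    N, M, K = V.shape[0], V.shape[1], 10
    W = rng.random((N, K))                                   # Nx10 document-feature matrix
    H = rng.random((K, M))                                   # 10xM feature-word matrix
    eps = 1e-9
    for _ in range(200):                                     # multiplicative updates
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)

    topic_word = [words[H[k].argmax()] for k in range(K)]    # heaviest word per feature
    for i in range(N):
        print("doc %3d -> %s" % (i, topic_word[W[i].argmax()]))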
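For HW6, the assignment asks for an interactive Javascript page; to keep all the sketches in one language, here is just the Hebbian storage and asynchronous recall logic in Python (pattern length 64, 5 random memories, and 8 flipped bits are arbitrary choices), which ports to Javascript directly:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 64
    memories = np.array([rng.choice([-1, 1], n) for _ in range(5)])   # +1/-1 patterns

    W = np.zeros((n, n))
    for p in memories:
        W += np.outer(p, p)              # Hebbian storage
    np.fill_diagonal(W, 0)

    def recall(pattern, steps=2000):
        s = pattern.copy()
        for _ in range(steps):           # asynchronous updates in random order
            i = rng.integers(n)
            s[i] = 1 if W[i] @ s >= 0 else -1
        return s

    probe = memories[2].copy()
    flip = rng.choice(n, 8, replace=False)
    probe[flip] *= -1                    # corrupt 8 of the 64 bits
    print("bits still wrong after recall:", int((recall(probe) != memories[2]).sum()))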
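For HW7, a sketch of the Baum-Welch training written directly in numpy rather than adapted from the Wikipedia example (the input file name book.txt, the 20,000-character cap, and the 20 iterations are my own choices to keep the runtime reasonable):

    import numpy as np

    # read any text, keep only the letters a..z, map them to 0..25
    text = open("book.txt", encoding="utf-8", errors="ignore").read().lower()
    obs = np.array([ord(c) - ord('a') for c in text if 'a' <= c <= 'z'])[:20000]

    S, V = 2, 26
    rng = np.random.default_rng(0)
    A = rng.random((S, S)); A /= A.sum(1, keepdims=True)    # state transition matrix
    B = rng.random((S, V)); B /= B.sum(1, keepdims=True)    # letter emission matrix
    pi = np.full(S, 1.0 / S)
    T = len(obs)

    for it in range(20):                                    # Baum-Welch iterations
        # forward pass with per-step scaling
        alpha = np.zeros((T, S)); scale = np.zeros(T)
        alpha[0] = pi * B[:, obs[0]]
        scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
        for t in range(1, T):
            alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
            scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
        # backward pass, reusing the scale factors
        beta = np.zeros((T, S)); beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = (A @ (B[:, obs[t+1]] * beta[t+1])) / scale[t+1]
        gamma = alpha * beta
        gamma /= gamma.sum(1, keepdims=True)
        # re-estimate pi, A, B
        xi_sum = np.zeros((S, S))
        for t in range(T - 1):
            xi_sum += (alpha[t][:, None] * A * B[:, obs[t+1]] * beta[t+1]) / scale[t+1]
        pi = gamma[0]
        A = xi_sum / xi_sum.sum(1, keepdims=True)
        for v in range(V):
            B[:, v] = gamma[obs == v].sum(0)
        B /= B.sum(1, keepdims=True)

    print("transitions A:\n", np.round(A, 3))
    print("emissions B (columns a..z):\n", np.round(B, 3))

Per the hint in the assignment, one state's emission row should end up concentrated on a, e, i, o, u.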
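For HW8, an ID3-style sketch (entropy / information gain) that grows a tree for each non-numeric column of a CSV, using the other non-numeric columns as candidate attributes; the file name data.csv and the assumption that the first row is a header are mine:

    import csv, math
    from collections import Counter

    def entropy(rows, target):
        c = Counter(r[target] for r in rows)
        return -sum(n/len(rows) * math.log2(n/len(rows)) for n in c.values())

    def build(rows, attrs, target):
        labels = Counter(r[target] for r in rows)
        if len(labels) == 1 or not attrs:
            return labels.most_common(1)[0][0]               # leaf: majority label
        def gain(a):                                         # information gain of splitting on a
            rem = 0.0
            for v in set(r[a] for r in rows):
                sub = [r for r in rows if r[a] == v]
                rem += len(sub)/len(rows) * entropy(sub, target)
            return entropy(rows, target) - rem
        best = max(attrs, key=gain)
        branches = {}
        for v in set(r[best] for r in rows):
            sub = [r for r in rows if r[best] == v]
            branches[v] = build(sub, [a for a in attrs if a != best], target)
        return (best, branches)

    with open("data.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    def is_numeric(col):                 # treat a column as numeric if every value parses as a float
        try:
            for r in rows:
                float(r[col])
            return True
        except ValueError:
            return False

    categorical = [c for c in rows[0] if not is_numeric(c)]
    for target in categorical:
        print(target, "->", build(rows, [a for a in categorical if a != target], target))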
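For HW9, a discrete AdaBoost sketch. The assignment suggests the HW8 tree as the weak learner; to keep this short, the weak learner below is a one-attribute stump, and only a single two-class target column (hypothetically named "label") is boosted, since looping over every column works the same way as in HW8:

    import csv, math

    with open("data.csv", newline="") as f:
        rows = list(csv.DictReader(f))
    target = "label"
    classes = sorted(set(r[target] for r in rows))            # assumed to be exactly two
    y = [1.0 if r[target] == classes[1] else -1.0 for r in rows]
    attrs = [c for c in rows[0] if c != target]

    def train_stump(w):
        best = None
        for a in attrs:
            table = {}                                        # weighted votes per attribute value
            for r, wi, yi in zip(rows, w, y):
                table.setdefault(r[a], [0.0, 0.0])
                table[r[a]][0 if yi < 0 else 1] += wi
            pred = {v: (1.0 if pos >= neg else -1.0) for v, (neg, pos) in table.items()}
            err = sum(wi for r, wi, yi in zip(rows, w, y) if pred[r[a]] != yi)
            if best is None or err < best[0]:
                best = (err, a, pred)
        return best

    w = [1.0 / len(rows)] * len(rows)
    ensemble = []
    for t in range(10):                                       # boosting rounds
        err, a, pred = train_stump(w)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)               # this weak learner's vote weight
        ensemble.append((alpha, a, pred))
        w = [wi * math.exp(-alpha * yi * pred[r[a]]) for r, wi, yi in zip(rows, w, y)]
        s = sum(w); w = [wi / s for wi in w]

    def classify(row):
        score = sum(alpha * pred.get(row[a], 1.0) for alpha, a, pred in ensemble)
        return classes[1] if score >= 0 else classes[0]

    right = sum(classify(r) == r[target] for r in rows)
    print("training accuracy: %d/%d" % (right, len(rows)))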