Homework 2
Due: February 8, 2005
You may work with others on this assignment, but you should turn in separate writeups, and you should understand the solutions. Consult the book and your professor for help if you need it.
This assignment must be written in LaTeX and turned in as a printout of the resulting PostScript or PDF file.
Announcements
- Friday, 1/28/2005
- Here is a Matlab function and associated data for finding the
F-distribution inverse values, which you will need for doing stepwise
variable selection. I am giving you these since our Matlab does not have
the statistics toolbox.
You should use the hamerly_f_inv function like this:
f_critical_value = hamerly_f_inv(alpha, a, b);
Here alpha is the value of α. Only critical values for α = 0.95 and α = 0.99 are included. The arguments a and b correspond to the two parameters of the F distribution and must lie in the ranges 1 ≤ a ≤ 300 and 10 ≤ b ≤ 1600.
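As a rough illustration of where the critical value enters stepwise selection, the sketch below computes the standard F statistic for adding one candidate variable to a linear model. The data are synthetic, and treating a and b as the numerator and denominator degrees of freedom is an assumption about how hamerly_f_inv expects its arguments. The call is guarded so the script still runs when the function is not on the path.

```matlab
% Sketch only: F test for adding one candidate variable (synthetic data).
N  = 20;
x1 = (1:N)';                         % variable already in the model
x2 = cos((1:N)');                    % candidate variable
y  = 1 + 0.5*x1 + 2*x2 + 0.1*sin(7*(1:N)');   % made-up response

X0 = [ones(N,1) x1];                 % smaller model: intercept + x1
X1 = [X0 x2];                        % larger model: adds the candidate
rss0 = sum((y - X0*(X0\y)).^2);      % residual sum of squares, smaller model
rss1 = sum((y - X1*(X1\y)).^2);      % residual sum of squares, larger model

p0 = 1;  p1 = 2;                     % number of predictors (excluding intercept)
a = p1 - p0;                         % assumed: numerator degrees of freedom
b = N - p1 - 1;                      % assumed: denominator degrees of freedom
F = ((rss0 - rss1)/a) / (rss1/b);

% Compare against the critical value if the provided function is available.
if exist('hamerly_f_inv', 'file')
    f_critical_value = hamerly_f_inv(0.95, a, b);
    keep_candidate = F > f_critical_value;
end
```

With the real data, X0 and X1 would be built from the pixel columns currently in the model and the one being tested.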
Assignment
- Reading. You should read chapter 3 thoroughly.
- Exercises.
- Do exercise 3.1.
- Do exercise 3.2.
- Experiments on classifying digits
Download the ZIP code data from the book's website. You will use it to perform several experiments on classification. Each row of the data has one class variable (0-9) followed by 256 brightness variables (pixels).
You can view a training or test point, or even the learned parameters, by doing something like the following:
imagesc(reshape(train(1,2:end), 16, 16)');
This reshapes the training point (256 pixel variables) to be a 16-by-16 matrix, takes the transpose (otherwise it looks sideways), and makes a picture of it.
Experiments:
- The "base rate" is the percentage accuracy for making constant predictions, such as always predicting the digit 4. What is the base rate of the training set for each digit? What about the test set?
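A minimal sketch of the base rate computation, using a made-up label vector in place of the real data. With the ZIP code data, labels would be the first column of the matrix, for example train(:,1).

```matlab
% Toy labels standing in for the class column of the data matrix.
labels = [0 1 1 4 4 4 9 3 8 4]';
n = numel(labels);

base_rate = zeros(10, 1);            % entry d+1 holds the base rate for digit d
for d = 0:9
    % Percent accuracy obtained by always predicting digit d.
    base_rate(d+1) = 100 * sum(labels == d) / n;
end
```

Here base_rate(5) is the base rate for always predicting the digit 4, since Matlab indices start at 1.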
- Build a k-nearest neighbor classifier. Using the training
set as the nearest neighbors, find the confusion matrix of the
training set for k = 1, k = 5, k = 25, and k = 125. Then find the
confusion matrix of the test set for the same values of k.
A confusion matrix is the matrix that shows how often something of class A was classified as every other class, like this:
                    True class A   True class B   True class C   ...
Predicted class A   aa             ba             ca             ...
Predicted class B   ab             bb             cb             ...
Predicted class C   ac             bc             cc             ...
...                 ...            ...            ...            ...
Give the accuracy of your different classifiers (on both the training and test sets). Accuracy is defined as accuracy = (aa + bb + cc + ...) / n, where n is the number of examples classified.
What conclusions can you draw from your different nearest neighbor classifiers? Do you see any evidence of overfitting or underfitting?
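The following sketch shows one way the pieces above could fit together: a brute-force k-nearest neighbor vote, a confusion matrix with predicted classes as rows and true classes as columns, and the accuracy computed from its diagonal. The toy 2-D data, the squared Euclidean distance, and the tie-breaking rule are all assumptions made for illustration.

```matlab
% Toy data with three classes (0, 1, 2); the real data would use the pixel columns.
Xtrain = [0 0; 0 1; 1 0; 5 5; 5 6; 6 5; 10 0; 10 1; 11 0];
ytrain = [0 0 0 1 1 1 2 2 2]';
Xtest  = [0.5 0.5; 5.5 5.5; 10.5 0.5; 4 4];
ytest  = [0 1 2 0]';

k = 3;
num_classes = 3;
ypred = zeros(size(ytest));
for i = 1:size(Xtest, 1)
    diffs = Xtrain - repmat(Xtest(i,:), size(Xtrain,1), 1);
    d2 = sum(diffs.^2, 2);                   % squared Euclidean distances
    [sorted_d2, order] = sort(d2);
    neighbors = ytrain(order(1:k));          % labels of the k closest points
    votes = accumarray(neighbors + 1, 1, [num_classes 1]);
    [top_votes, winner] = max(votes);        % ties go to the smallest label
    ypred(i) = winner - 1;
end

% Confusion matrix: rows are predicted classes, columns are true classes.
C = accumarray([ypred + 1, ytest + 1], 1, [num_classes num_classes]);
accuracy = trace(C) / numel(ytest);
```

To get the training-set confusion matrix, the same loop can be run with Xtest and ytest replaced by the training points themselves.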
- Build a linear regression model for classifying between the
digit 3 and the digit 8. One way to code the output is as -1 for
three, and +1 for eight. Then you would predict three if the
response is < 0, and predict eight if the response is ≥ 0.
You will probably need to eliminate some variables immediately, or you will not be able to invert the X'X matrix (due to numerical instability). There are many ways to eliminate variables, but to begin with you should eliminate variables that have very low variance (across all threes and eights). For example, I find that there are 13 variables that have variance < 0.01 among all threes and eights in the training data. These low-variance variables are generally ones around the left and right edges of the image. Report which variables you eliminated and which you used; an image would be the best way to show this.
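A minimal sketch of the steps described so far, on a tiny made-up data matrix: code the output as -1/+1, drop columns whose variance is below 0.01, fit by least squares, and threshold the response at 0. The toy values and the use of the backslash operator for the fit are choices made for this illustration.

```matlab
% Toy stand-in: column 1 is the digit (3 or 8), the rest play the role of pixels.
data = [3 -1.0 0.2 -1;
        3 -0.8 0.1 -1;
        3 -0.9 0.4 -1;
        8  0.7 0.3 -1;
        8  0.9 0.0 -1;
        8  1.0 0.5 -1];
X = data(:, 2:end);
y = 2 * (data(:,1) == 8) - 1;            % -1 for three, +1 for eight

keep = var(X) >= 0.01;                   % drop near-constant columns
Xk = [ones(size(X,1),1) X(:, keep)];     % intercept column plus kept variables
beta = Xk \ y;                           % least-squares coefficients

response = Xk * beta;
pred = 3 + 5 * (response >= 0);          % 8 if response >= 0, otherwise 3
accuracy = mean(pred == data(:,1));
```

The logical vector keep can be padded back to 256 entries and displayed with imagesc to show which pixels were retained.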
What is the accuracy of your linear regression model on the training set? On the test set?
After you have eliminated variables with low variance, eliminate variables that do not have strong predictive power using stepwise variable selection (backward and forward; see section 3.4.1 in your book) and ridge regression (section 3.4.3). Show accuracy curves on the training data and the test data for various numbers of variables (for stepwise selection) and various settings of the penalty term (for ridge regression). Which do you think is a better way to do variable selection for regression?
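For the ridge regression part, a sketch of sweeping the penalty might look like the following. It assumes the common convention of centering the inputs and leaving the intercept unpenalized; the toy data and the list of penalty values are invented for illustration.

```matlab
% Toy inputs and -1/+1 outputs (threes and eights would be coded the same way).
X = [-1.0 0.2; -0.8 0.1; -0.9 0.4; 0.7 0.3; 0.9 0.0; 1.0 0.5];
y = [-1 -1 -1 1 1 1]';

mu = mean(X);
Xc = X - repmat(mu, size(X,1), 1);       % center inputs; intercept is not penalized
intercept = mean(y);

lambdas = [0 0.1 1 10 100];              % assumed grid of penalty values
p = size(Xc, 2);
accuracy  = zeros(size(lambdas));
beta_norm = zeros(size(lambdas));
for j = 1:numel(lambdas)
    % Ridge solution: (X'X + lambda*I) \ X'y on the centered inputs.
    beta = (Xc'*Xc + lambdas(j)*eye(p)) \ (Xc'*y);
    response = intercept + Xc*beta;
    pred = 2 * (response >= 0) - 1;      % +1 if response >= 0, else -1
    accuracy(j)  = mean(pred == y);
    beta_norm(j) = norm(beta);           % shrinks as the penalty grows
end
```

Test-set accuracy would be computed the same way, reusing mu, intercept, and beta from the training fit, and the two accuracy vectors can then be plotted against lambdas.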
- Discuss the differences between the k-nearest neighbor classifier (for
classifying all digits) and the linear regression classifier for
classifying 3's and 8's. Comment on the following:
- How well does each of them generalize to the test set?
- Is one more efficient to learn?
- Is one more accurate?
- How intuitive are the controls over the model (for example, choosing k, choosing α, and choosing the ridge regression penalties)?
Copyright © 2005 Greg Hamerly.
Computer Science Department
Baylor University