Homework 2
Due: February 10, 2006
You may work with others on this assignment, but you should turn in separate writeups, and you should understand the solutions. Consult the book and your professor for help if you need it.
This assignment must be done in LaTeX and turned in as a printout from a PostScript or PDF file. Treat this assignment like a research assignment. Do interesting things, and present your results in a compelling way.
Assignment
- Reading. You should read chapter 3 thoroughly.
- Exercises.
- Do exercise 3.1.
- Do exercise 3.2. For part 1, I interpret this to mean that at point x0, you should plot x0 + j and x0 - j, where j is the 95% confidence level for the estimated variance of the output measurements, sigma squared. For part 2, follow the description in the text it refers to.
- Experiments on classifying digits
Download the ZIP code data from the book's website. You will use it to perform several experiments on classification. Each row of the data has one class variable (0-9) followed by 256 brightness variables (pixels).
You can view a training or test point, or even the learning parameters, by doing something like the following:
imagesc(reshape(train(1,2:end), 16, 16)');
This reshapes the training point (256 pixel variables) to be a 16-by-16 matrix, takes the transpose (otherwise it looks sideways), and makes a picture of it.
Experiments:
- The "base rate" is the percentage accuracy for making constant predictions, such as always predicting the digit 4. What is the base rate of the training set for each digit? What about the test set?
- Build a classifier from a linear model. Your predictor should
classify between images of the digit 3 and the digit 8. One way to
code the output is as -1 for three, and +1 for eight. Then you
would predict three if the response is < 0, and predict eight if
the response is ≥ 0.
You will probably need to eliminate some variables immediately, or you will not be able to invert the X'X matrix (due to numerical instability). There are many ways to eliminate variables, but to begin with you should eliminate variables that have very low variance (computed over all threes and all eights). For example, I find that there are 13 variables that have variance < 0.01 among all threes and eights in the training data. These variables with low variance are generally ones around the left and right edges of the image. Report which variables you eliminated and which you used; an image would be best to show this.
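The variance filter described above might be sketched as follows. This is only one possible approach, not a reference solution. The synthetic train matrix is a stand-in so the snippet runs by itself, and the names is38, pixel_var, keep, and mask_img are illustrative choices rather than anything the assignment requires.

```matlab
% Synthetic stand-in for the real data (an assumption, so this runs alone):
% column 1 is the digit label, columns 2..257 are the 256 pixels.
n = 200;
labels = [3*ones(n/2,1); 8*ones(n/2,1)];
pixels = rand(n, 256);              % random brightness values
pixels(:, [1 16 241]) = -1;         % three constant pixels, variance 0
train = [labels, pixels];

% Keep only the rows labeled three or eight.
is38 = (train(:,1) == 3) | (train(:,1) == 8);
X = train(is38, 2:end);

% Variance of each pixel over all threes and eights together.
pixel_var = var(X);                 % 1-by-256 row vector
keep = pixel_var >= 0.01;           % logical mask of retained pixels

fprintf('eliminated %d of %d pixels\n', sum(~keep), numel(keep));

% The mask laid out in the same orientation as the digit images.
mask_img = reshape(double(keep), 16, 16)';
```

Passing mask_img to imagesc would then show which pixels were kept and which were eliminated. With the real data you would skip the synthetic block and use your own train matrix.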
What is the accuracy of your linear regression model on the training set? On the test set?
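As a rough sketch of how those two accuracies could be computed, the snippet below fits a least-squares model with the -1/+1 coding and thresholds the response at 0. The data is synthetic (an assumption, so the code runs on its own), the variable names are hypothetical, and it uses Matlab's backslash operator instead of forming and inverting X'X explicitly.

```matlab
% Synthetic stand-in data (an assumption): two well-separated classes.
n = 100; p = 5;
Xtr = [randn(n,p) - 1; randn(n,p) + 1];   % n "threes", then n "eights"
ytr = [-ones(n,1); ones(n,1)];            % -1 for three, +1 for eight
Xte = [randn(n,p) - 1; randn(n,p) + 1];
yte = [-ones(n,1); ones(n,1)];

% Least-squares fit with an intercept column. Backslash solves the
% least-squares problem without explicitly inverting X'X.
Atr = [ones(size(Xtr,1),1), Xtr];
beta = Atr \ ytr;

% Predict eight (+1) when the response is >= 0, three (-1) otherwise.
classify38 = @(X) 2*([ones(size(X,1),1), X] * beta >= 0) - 1;

acc_train = mean(classify38(Xtr) == ytr);
acc_test  = mean(classify38(Xte) == yte);
fprintf('train accuracy %.3f, test accuracy %.3f\n', acc_train, acc_test);
```

On the real data, Xtr and Xte would be the retained pixel columns of the threes and eights from the training and test files.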
After you have eliminated variables with low variance, then eliminate variables that do not have strong predictive power using stepwise variable elimination (backwards and forwards; see section 3.4.1 in your book) and ridge regression (section 3.4.3). Show accuracy curves on the training data and the test data for various numbers of variables (for stepwise selection) and various settings of the penalty term (for ridge regression). Which do you think is a better way to do variable selection for regression?
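For the ridge regression half of this experiment, a minimal sketch of the accuracy-versus-penalty computation might look like the following. It assumes the usual formulation in which the inputs are centered and the intercept is left unpenalized. The data is synthetic, and the grid of penalty values in lambdas is an arbitrary choice for illustration.

```matlab
% Synthetic stand-in data (an assumption): two well-separated classes.
n = 100; p = 5;
Xtr = [randn(n,p) - 1; randn(n,p) + 1];
ytr = [-ones(n,1); ones(n,1)];            % -1 for three, +1 for eight
Xte = [randn(n,p) - 1; randn(n,p) + 1];
yte = [-ones(n,1); ones(n,1)];

lambdas = [0 0.1 1 10 100 1000];          % arbitrary illustrative grid

% Center inputs and response so the intercept is not penalized.
mu = mean(Xtr);
ybar = mean(ytr);
Xc = Xtr - repmat(mu, size(Xtr,1), 1);

acc_train = zeros(size(lambdas));
acc_test  = zeros(size(lambdas));
for k = 1:numel(lambdas)
    % Ridge coefficients: solve (Xc'Xc + lambda*I) beta = Xc'(y - ybar).
    beta = (Xc'*Xc + lambdas(k)*eye(p)) \ (Xc'*(ytr - ybar));
    b0 = ybar - mu*beta;                  % intercept on the original scale
    acc_train(k) = mean((2*(Xtr*beta + b0 >= 0) - 1) == ytr);
    acc_test(k)  = mean((2*(Xte*beta + b0 >= 0) - 1) == yte);
end

% One row per penalty value: lambda, training accuracy, test accuracy.
disp([lambdas; acc_train; acc_test]');
```

The two accuracy vectors can then be plotted against the penalty, for example with semilogx over the positive values of lambdas.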
Try choosing a model using subset selection with cross-validation to estimate the model error. How does this compare with the methods you tried above?
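One way the cross-validation estimate could be organized is sketched below for a single candidate model. It uses K-fold cross-validation with K = 5, which is an assumption on my part since the assignment does not fix K. The data is synthetic and the names are illustrative.

```matlab
% Synthetic stand-in data (an assumption): two well-separated classes.
n = 100; p = 5;
X = [randn(n,p) - 1; randn(n,p) + 1];
y = [-ones(n,1); ones(n,1)];              % -1 for three, +1 for eight
N = size(X,1);

K = 5;                                    % number of folds (a choice)
perm = randperm(N);                       % random shuffle of the rows
fold_of = zeros(N,1);
fold_of(perm) = mod(0:N-1, K) + 1;        % fold label 1..K for every row

err = zeros(K,1);
for k = 1:K
    te = (fold_of == k);                  % held-out rows for this fold
    tr = ~te;                             % rows used for fitting
    beta = [ones(sum(tr),1), X(tr,:)] \ y(tr);
    yhat = 2*([ones(sum(te),1), X(te,:)] * beta >= 0) - 1;
    err(k) = mean(yhat ~= y(te));         % misclassification rate
end

cv_error = mean(err);
fprintf('%d-fold CV error estimate: %.3f\n', K, cv_error);
```

To compare subsets, you would repeat this loop with X restricted to each candidate set of columns and keep the subset with the lowest cv_error.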
You might try using basis expansions to improve on your best model. Are you able to get better results?
- Overall, once you have a good model, try to explain the model. For example, if you used all 256 variables to build a linear model, you could make a picture of the parameter settings of that model. This picture should look like one of the original training images, but instead of representing a 3 or 8, it represents which pixels it found are good for classifying 3s and 8s.
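If you eliminated some variables first, the coefficients have to be placed back into their original pixel positions before making the picture. A small sketch of that step follows. The mask keep and the coefficient vector beta here are made-up stand-ins, and the intercept is assumed to have been dropped from beta already.

```matlab
% Hypothetical mask of retained pixels (stand-in for your own mask).
keep = true(1, 256);
keep([1 16 241]) = false;

% Stand-in coefficients, one per retained pixel (intercept excluded).
beta = randn(sum(keep), 1);

% Put each coefficient back at its original pixel index; eliminated
% pixels get a coefficient of 0.
full = zeros(1, 256);
full(keep) = beta;

% Same layout as the digit images: reshape to 16-by-16, then transpose.
coef_img = reshape(full, 16, 16)';
```

Calling imagesc(coef_img), followed by colorbar, would then show which pixels push the prediction toward 3 (negative weights) and which toward 8 (positive weights).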
Code for you to use
Here is a Matlab function and associated data for finding the F-distribution inverse values, which you will need for doing stepwise variable selection. I am giving you these since our Matlab does not have the statistics toolbox.
You should use the hamerly_f_inv function like this:
f_critical_value = hamerly_f_inv(α, a, b);
Only critical values for α = 0.95 and α = 0.99 are included. The variables a and b correspond to the two parameters of the F distribution, and can be in the ranges 1 ≤ a ≤ 300, and 10 ≤ b ≤ 1600.
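To show where the critical value enters a stepwise procedure, here is a hedged sketch of the F statistic for comparing a smaller model with a larger nested one, as given in chapter 3 of the book. The data is synthetic. Taking a and b to be the numerator and denominator degrees of freedom is my reading of the function's parameters. The call to hamerly_f_inv is left in a comment so the snippet runs without the course files.

```matlab
% Synthetic stand-in data (an assumption): y depends strongly on x1.
N = 200;
x1 = randn(N,1);
y = 2*x1 + 0.5*randn(N,1);

% Smaller model: intercept only (p0 = 0 inputs).
% Larger model: intercept plus x1 (p1 = 1 input).
A0 = ones(N,1);         p0 = 0;
A1 = [ones(N,1), x1];   p1 = 1;

% Residual sums of squares of the two least-squares fits.
rss0 = sum((y - A0*(A0\y)).^2);
rss1 = sum((y - A1*(A1\y)).^2);

% F = ((RSS0 - RSS1)/(p1 - p0)) / (RSS1/(N - p1 - 1))
a = p1 - p0;            % numerator degrees of freedom
b = N - p1 - 1;         % denominator degrees of freedom
F = ((rss0 - rss1)/a) / (rss1/b);
fprintf('F = %.1f with a = %d, b = %d\n', F, a, b);

% With the course function on the path, the comparison would be:
%   f_critical_value = hamerly_f_inv(0.95, a, b);
%   add_variable = F > f_critical_value;
```

In forward selection you would compute F for each candidate variable and add the one with the largest F, as long as it exceeds the critical value. Backward elimination drops the variable with the smallest F when it falls below the critical value.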
Copyright © 2006 Greg Hamerly.
Computer Science Department
Baylor University