Homework 4
Due: March 24, 2005
You may work with others on this assignment, but you should turn in separate writeups, and you should understand the solutions. Consult the book and your professor for help if you need it. If you work with someone else, you must acknowledge them on your report.
This assignment must be written in LaTeX and turned in as a printout of the resulting PostScript or PDF file.
Announcements
- Thursday, March 10, 2005
- Please read this description of how to rank attributes for this assignment using minimum conditional entropy. If you have questions, please ask.
Also, one good preprocessing step for text documents is to remove "stop words": words that are so common in English that they are unlikely to help with classification. Search Google for "stopword list" to find lists of such words. But don't just use one without thinking about it!
- Tuesday, March 1, 2005
- The assignment has been posted. This is a difficult assignment, with a lot of work involved (more than previous assignments). You must start on it now, and ask questions if something is unclear. This should be an interesting assignment!
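The stop-word preprocessing step mentioned in the March 10 announcement could be sketched as follows. This is only an illustration: the `STOP_WORDS` set, the `tokenize` regular expression, and the `keep` parameter are hypothetical choices, not part of the assignment. The `keep` argument is one way to act on the warning about not using a list blindly, since a word that is usually uninformative may still separate your particular classes.

```python
import re

# Hypothetical, deliberately tiny stop list; a real list should be reviewed
# against the classes being separated before it is used.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in", "it", "for"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS, keep=()):
    """Drop stop words, except any that are explicitly listed in `keep`."""
    blocked = set(stop_words) - set(keep)
    return [t for t in tokens if t not in blocked]

tokens = tokenize("The price of the offer is free for a limited time")
filtered = remove_stop_words(tokens)
print(filtered)
```

Calling `remove_stop_words(tokens, keep={"for"})` would leave "for" in the output, which is how a word from a downloaded list can be retained if it turns out to matter for the task.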
Assignment
- Reading. You should read chapter 6 of Mitchell (the handout given in class) thoroughly.
- Write a naive Bayes classifier. Implement a naive Bayes classifier in your language of choice (Matlab, C++, Java, etc.). It should work with any type of discrete data, not just text. You may implement it as one function, a set of functions, one program, or a set of programs. However, your software must be able to perform the following tasks:
- Learning a model
The inputs for this task are:
- data set for training (see format below)
- smoothing parameter lambda
The output for this task is the model (see format below).
- Evaluating a model
The inputs for this task are:
- the model learned in task 1
- the number of features to report
The output for this task is a ranking of the input features that are most predictive between each pair of classes.
- Making predictions
The inputs for this task are:
- the model learned in task 1
- a test data set (same format as training set)
The output for this task is the list of class predictions for each test record.
- Learning with naive Bayes. Choose a text-based data set that has multiple classes to test with your software. For example, you could take messages from your own email that are SPAM or NOT-SPAM. You can also look here for ideas of other types of datasets. However, you should construct your own dataset.
- Construct a dataset and representation for the data (vocabulary size, words in the vocabulary, etc.). Explain your choices of dataset, the classes, and the vocabulary.
- Process the dataset into training and test sets in the given dataset format.
- Train a model on the training set and output a model.
- Test the learned model on the training set and the held-out test dataset. Report your confusion matrices and accuracy/error rates.
- Evaluate the model for attributes that are useful for the task.
- Try to reduce the number of features that you use based on your evaluation, and see how small you can make the feature set and still achieve good performance.
For each of these steps, you should design the experiments carefully and analyze the results just as carefully.
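The attribute evaluation step above can be illustrated with a small sketch of ranking by minimum conditional entropy. The linked description is not reproduced in this document, so what follows is the standard definition of H(class | attribute) computed from raw, unsmoothed counts for each pair of classes; the course's intended procedure may differ in its details. The names `COUNTS`, `conditional_entropy`, and `rank_attributes` are hypothetical, and the counts are copied from the example model in the "Model file format" section.

```python
import math
from itertools import combinations

# Per-class counts of each attribute value, copied from the example model
# in the "Model file format" section of this assignment.
COUNTS = {
    "TEMPERATURE": {"SUNNY": {"LOW": 13, "MEDIUM": 30, "HIGH": 17},
                    "RAINY": {"LOW": 24, "MEDIUM": 11, "HIGH": 5}},
    "PRESSURE": {"SUNNY": {"LOW": 9, "MEDIUM": 19, "HIGH": 32},
                 "RAINY": {"LOW": 29, "MEDIUM": 8, "HIGH": 3}},
    "SEASON": {"SUNNY": {"WINTER": 8, "SPRING": 10, "SUMMER": 23, "FALL": 19},
               "RAINY": {"WINTER": 15, "SPRING": 18, "SUMMER": 4, "FALL": 3}},
}

def conditional_entropy(by_class, pair):
    """H(class | attribute) in bits, using only the records of the classes in `pair`."""
    values = set()
    for cls in pair:
        values.update(by_class[cls])
    total = sum(by_class[cls].get(v, 0) for cls in pair for v in values)
    if total == 0:
        return 0.0
    entropy = 0.0
    for v in values:
        n_v = sum(by_class[cls].get(v, 0) for cls in pair)
        for cls in pair:
            n_cv = by_class[cls].get(v, 0)
            if n_cv > 0:
                # P(v) * P(c|v) * log2(1 / P(c|v)) = (n_cv / total) * log2(n_v / n_cv)
                entropy += (n_cv / total) * math.log2(n_v / n_cv)
    return entropy

def rank_attributes(counts, classes, top_k):
    """For each pair of classes, return the top_k attributes with the lowest H(class | attribute)."""
    rankings = {}
    for pair in combinations(classes, 2):
        scored = sorted((conditional_entropy(by_class, pair), attr)
                        for attr, by_class in counts.items())
        rankings[pair] = [attr for _, attr in scored[:top_k]]
    return rankings

rankings = rank_attributes(COUNTS, ["SUNNY", "RAINY"], top_k=3)
print(rankings[("SUNNY", "RAINY")])
```

A lower conditional entropy means that knowing the attribute leaves less uncertainty about the class. With these counts the sketch ranks PRESSURE first, then SEASON, then TEMPERATURE. The same ordering is one possible basis for the feature reduction step: keep only the first few attributes and retrain.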
Data file format
The data format will be the following:
NUM_CLASSES CLASS_1 CLASS_2 CLASS_3...
NUM_ATTRIBUTES ATTRIBUTE_1 ATTRIBUTE_2 ATTRIBUTE_3...
NUM_RECORDS
CLASS ATTRIBUTE_VALUE_1 ATTRIBUTE_VALUE_2 ...
CLASS ATTRIBUTE_VALUE_1 ATTRIBUTE_VALUE_2 ...
CLASS ATTRIBUTE_VALUE_1 ATTRIBUTE_VALUE_2 ...
...
The first line gives the number of classes and their names. The second line gives the number of attributes and their names. The third line gives the number of records that follow in the dataset.
Then for each record, we have a CLASS, which is an integer between 1 and NUM_CLASSES (inclusive). Following the class is the list of attribute values, which can be any value (string, integer, etc.). Every unique string listed as an attribute value is assumed to be a unique value.
So an example of a training dataset with 100 records for predicting weather (SUNNY or RAINY) might be:
2 SUNNY RAINY
3 TEMPERATURE PRESSURE SEASON
100
SUNNY HIGH HIGH SUMMER
SUNNY LOW HIGH SUMMER
RAINY MEDIUM LOW SPRING
SUNNY LOW HIGH WINTER
RAINY HIGH MEDIUM FALL
...
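A reader for this format might look like the sketch below. It assumes that tokens are separated by whitespace and contain no embedded spaces, which the format description does not state explicitly. The class field is kept as an opaque token, so the same code handles both the integer form described above and the class names used in the weather example. The function name `parse_dataset` is hypothetical, and the sample is shortened to five records so that NUM_RECORDS matches the lines present.

```python
def parse_dataset(text):
    """Parse the data file format into (classes, attributes, records)."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    num_classes, classes = int(lines[0][0]), lines[0][1:]
    num_attributes, attributes = int(lines[1][0]), lines[1][1:]
    num_records = int(lines[2][0])
    if len(classes) != num_classes or len(attributes) != num_attributes:
        raise ValueError("header counts do not match the names listed")
    rows = lines[3:]
    if len(rows) != num_records:
        raise ValueError("NUM_RECORDS does not match the number of record lines")
    records = []
    for row in rows:
        if len(row) != num_attributes + 1:
            raise ValueError("record has the wrong number of fields: " + " ".join(row))
        # The first token is the class; the rest are the attribute values.
        records.append((row[0], row[1:]))
    return classes, attributes, records

SAMPLE = """\
2 SUNNY RAINY
3 TEMPERATURE PRESSURE SEASON
5
SUNNY HIGH HIGH SUMMER
SUNNY LOW HIGH SUMMER
RAINY MEDIUM LOW SPRING
SUNNY LOW HIGH WINTER
RAINY HIGH MEDIUM FALL
"""

classes, attributes, records = parse_dataset(SAMPLE)
print(classes, attributes, len(records))
```

Validating the header counts against the lines actually present catches truncated or hand-edited files early, before they distort the learned counts.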
Model file format
The model format will be the following:
NUM_CLASSES CLASS_1 CLASS_2 CLASS_3...
NUM_ATTRIBUTES ATTRIBUTE_1 ATTRIBUTE_2 ATTRIBUTE_3...
LAMBDA
CLASS_1 COUNT
CLASS_2 COUNT
...
CLASS_1 ATTRIBUTE_1 NUM_ATTRIBUTE_1_VALUES VALUE_1 COUNT_1 VALUE_2 COUNT_2...
CLASS_1 ATTRIBUTE_2 NUM_ATTRIBUTE_2_VALUES VALUE_1 COUNT_1 VALUE_2 COUNT_2...
...
CLASS_2 ATTRIBUTE_1 NUM_ATTRIBUTE_1_VALUES VALUE_1 COUNT_1 VALUE_2 COUNT_2...
CLASS_2 ATTRIBUTE_2 NUM_ATTRIBUTE_2_VALUES VALUE_1 COUNT_1 VALUE_2 COUNT_2...
...
The first two lines are the same as the first two lines in data files. The third line is the smoothing parameter lambda. Then there are NUM_CLASSES lines giving the number of occurrences of each class (for the prior probabilities). Then there are NUM_CLASSES*NUM_ATTRIBUTES lines, each describing Pr(attribute | class). Each of these lines has the class name, then the attribute name, then the number of attribute values, then a list of each attribute value and the count for that value. For each class, the attribute values should be listed in the same order.
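The learning task can be sketched as counting followed by serialization into the layout just described. The names `learn_counts` and `format_model` are hypothetical. The sketch stores raw counts and leaves lambda to be applied when probabilities are computed, which seems consistent with a model file that records both lambda and counts, but that reading is an assumption. It also assumes that a value never seen with a given class is still listed for that class with a count of 0, so that every class lists the same values in the same order.

```python
from collections import Counter

def learn_counts(classes, attributes, records):
    """Count class occurrences and per-class attribute value occurrences."""
    class_counts = Counter()
    value_counts = {(c, a): Counter() for c in classes for a in attributes}
    values_seen = {a: [] for a in attributes}  # first-seen order, shared by all classes
    for cls, values in records:
        class_counts[cls] += 1
        for attr, value in zip(attributes, values):
            value_counts[(cls, attr)][value] += 1
            if value not in values_seen[attr]:
                values_seen[attr].append(value)
    return class_counts, value_counts, values_seen

def format_model(classes, attributes, lam, class_counts, value_counts, values_seen):
    """Serialize the counts in the model file layout described above."""
    lines = [
        " ".join([str(len(classes))] + list(classes)),
        " ".join([str(len(attributes))] + list(attributes)),
        str(lam),
    ]
    lines += [f"{c} {class_counts[c]}" for c in classes]
    for c in classes:
        for a in attributes:
            # A Counter returns 0 for values never seen with this class.
            pairs = " ".join(f"{v} {value_counts[(c, a)][v]}" for v in values_seen[a])
            lines.append(f"{c} {a} {len(values_seen[a])} {pairs}")
    return "\n".join(lines)

classes = ["SUNNY", "RAINY"]
attributes = ["TEMPERATURE", "PRESSURE"]
records = [("SUNNY", ["HIGH", "HIGH"]), ("SUNNY", ["LOW", "HIGH"]), ("RAINY", ["LOW", "LOW"])]
model_text = format_model(classes, attributes, 1, *learn_counts(classes, attributes, records))
print(model_text)
```

For large text vocabularies, the list-based `values_seen` lookup would be slow; a dictionary that preserves insertion order would serve the same purpose.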
Here is an example of the model format for the above training data:
2 SUNNY RAINY
3 TEMPERATURE PRESSURE SEASON
1
SUNNY 60
RAINY 40
SUNNY TEMPERATURE 3 LOW 13 MEDIUM 30 HIGH 17
SUNNY PRESSURE 3 LOW 9 MEDIUM 19 HIGH 32
SUNNY SEASON 4 WINTER 8 SPRING 10 SUMMER 23 FALL 19
RAINY TEMPERATURE 3 LOW 24 MEDIUM 11 HIGH 5
RAINY PRESSURE 3 LOW 29 MEDIUM 8 HIGH 3
RAINY SEASON 4 WINTER 15 SPRING 18 SUMMER 4 FALL 3
Note that for every attribute the value counts sum to 60 for the SUNNY class and to 40 for the RAINY class, matching the class counts above.
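To illustrate the prediction task and the confusion matrix report, the sketch below reads the example model above, scores records in log space, and tallies the predictions. The assignment does not spell out the smoothing formula, so the sketch assumes the common additive form Pr(value | class) = (count + lambda) / (class count + lambda * number of values) and leaves the class priors as unsmoothed relative frequencies. The function names and the three test records (including their true labels) are hypothetical.

```python
import math

MODEL_TEXT = """\
2 SUNNY RAINY
3 TEMPERATURE PRESSURE SEASON
1
SUNNY 60
RAINY 40
SUNNY TEMPERATURE 3 LOW 13 MEDIUM 30 HIGH 17
SUNNY PRESSURE 3 LOW 9 MEDIUM 19 HIGH 32
SUNNY SEASON 4 WINTER 8 SPRING 10 SUMMER 23 FALL 19
RAINY TEMPERATURE 3 LOW 24 MEDIUM 11 HIGH 5
RAINY PRESSURE 3 LOW 29 MEDIUM 8 HIGH 3
RAINY SEASON 4 WINTER 15 SPRING 18 SUMMER 4 FALL 3
"""

def parse_model(text):
    """Read the model file layout into (classes, attributes, lam, class_counts, counts)."""
    lines = [ln.split() for ln in text.strip().splitlines()]
    classes, attributes, lam = lines[0][1:], lines[1][1:], float(lines[2][0])
    n = len(classes)
    class_counts = {name: int(count) for name, count in lines[3:3 + n]}
    counts = {}  # (class, attribute) -> {value: count}
    for cls, attr, num_values, *pairs in lines[3 + n:]:
        counts[(cls, attr)] = {pairs[i]: int(pairs[i + 1])
                               for i in range(0, 2 * int(num_values), 2)}
    return classes, attributes, lam, class_counts, counts

def safe_log(p):
    """Natural log that maps a zero probability to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")

def predict(model, record):
    """Return the class with the highest log prior plus summed log likelihoods."""
    classes, attributes, lam, class_counts, counts = model
    total = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for cls in classes:
        score = safe_log(class_counts[cls] / total)  # unsmoothed prior (assumption)
        for attr, value in zip(attributes, record):
            table = counts[(cls, attr)]
            denom = class_counts[cls] + lam * len(table)
            # Assumed additive smoothing; an unseen value has count 0.
            score += safe_log((table.get(value, 0) + lam) / denom if denom > 0 else 0.0)
        if best_class is None or score > best_score:
            best_class, best_score = cls, score
    return best_class

def confusion_matrix(classes, actual, predicted):
    """matrix[i][j] = number of records of class i that were predicted as class j."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

model = parse_model(MODEL_TEXT)
test_records = [("SUNNY", ["HIGH", "HIGH", "SUMMER"]),
                ("RAINY", ["LOW", "LOW", "WINTER"]),
                ("RAINY", ["MEDIUM", "HIGH", "FALL"])]
actual = [cls for cls, _ in test_records]
predicted = [predict(model, values) for _, values in test_records]
matrix = confusion_matrix(model[0], actual, predicted)
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / len(test_records)
print(predicted, matrix, accuracy)
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow, which matters for text data where a record may have thousands of attributes.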
Copyright © 2005 Greg Hamerly.
Computer Science Department
Baylor University