Homework 1

Pen-and-paper

  1. Derive the discriminant function for two-class (binary outcome) LDA and show that it is a linear function of the predictors. Show your work. (For extra credit, derive the discriminant function for QDA as well.)
  2. Show that LDA and linear regression are equivalent when outcomes are binary. What goes wrong when there are more than two classes?

Data analysis

For this exercise, you will use a dataset representing email messages. The goal is to classify messages as spam (or not). You will need to install the package ElemStatLearn as described in the Resources section of the course homepage. To load the dataset, get information on it, and get meaningful names for the message features, do the following:

# load data
> library(ElemStatLearn)
> data(spam)
# get information on spam dataset
> ?spam
# get meaningful feature names
# nms is a vector of names conforming to the description given by ?spam
> nms <- read.table("http://www.biostat.jhsph.edu/~hcorrada/PracticalML/Data/spam_names.txt",
  stringsAsFactors=FALSE)
# if you want to add these names to the spam data frame do the following
> names(spam)[-ncol(spam)] <- nms$x

Your goal is to test methods we have used in class (or variations thereof) on this dataset. To see how well the methods are working, hold out subsets of the data as test sets. For example, to create a test set with 500 messages and use the remaining messages as a training set, do the following:

> set.seed(1)
> p <- sample(nrow(spam))
> test.ind <- p[1:500]
> spam.train <- spam[-test.ind,]
> spam.test <- spam[test.ind,]
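
As a rough illustration of how such a split might be used, the sketch below fits LDA on the training rows and measures the error rate on the held-out rows. It is only a sketch under assumptions: it runs on simulated stand-in data (the data frame `toy`, its columns `x1`, `x2`, `y`, and the 200-row test size are all hypothetical) rather than on the spam data, and it uses `lda()` from the MASS package, which may or may not be the implementation used in lecture.

```r
# Sketch only: simulated stand-in data, NOT the spam dataset.
library(MASS)  # provides lda()

set.seed(1)
n <- 1000
# hypothetical toy data: two numeric features whose means differ by class
y <- factor(sample(c("email", "spam"), n, replace = TRUE))
toy <- data.frame(x1 = rnorm(n, mean = ifelse(y == "spam", 1, 0)),
                  x2 = rnorm(n, mean = ifelse(y == "spam", -1, 0)),
                  y = y)

# same splitting pattern as above: permute the rows, hold out the first 200
p <- sample(nrow(toy))
test.ind <- p[1:200]
toy.train <- toy[-test.ind, ]
toy.test <- toy[test.ind, ]

# fit on the training rows only, then predict the held-out rows
fit <- lda(y ~ ., data = toy.train)
pred <- predict(fit, newdata = toy.test)$class

# confusion table and test error rate
conf.tab <- table(predicted = pred, actual = toy.test$y)
test.err <- mean(pred != toy.test$y)
print(conf.tab)
print(test.err)
```

With the spam data, the same pattern would presumably apply with `spam.train` and `spam.test` in place of the toy data frames and the class column of `spam` as the response.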

Try a few of the different methods we have used so far and report on your experience. One thing to keep in mind: don't be shy about using transformations of the data. A useful one may be to convert word frequencies to binary predictors (present or not). Remember that the R code for each lecture shows examples of all the methods used so far.
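
The binary transformation mentioned above could be sketched as follows. This is a minimal example on a made-up data frame (`toy`, with hypothetical columns `freq.free`, `freq.meeting`, and `label`), assuming the class label sits in the last column as it does in `spam`. Note that, according to ?spam, the last three predictors in the spam data are capital-run-length statistics rather than frequencies, so you would probably want to restrict the transformation to the frequency columns.

```r
# Sketch only: a tiny made-up data frame standing in for the spam data.
toy <- data.frame(freq.free = c(0, 0.32, 0, 1.5),
                  freq.meeting = c(0.8, 0, 0, 0),
                  label = factor(c("email", "spam", "email", "spam")))

# indices of the feature columns (everything except the last column)
feat <- seq_len(ncol(toy) - 1)

# 1 if the word occurs at all, 0 otherwise; the label column is left untouched
toy.bin <- toy
toy.bin[feat] <- lapply(toy[feat], function(x) as.numeric(x > 0))
print(toy.bin)
```

Whichever transformation you choose, apply exactly the same one to the training set and to any data you predict on.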

Make some predictions

Use this training data to predict the outcomes for this data. You should give the 500 predictions and an estimate of the number of mistakes you've made. Please send a text file with only the predictions (separated by spaces). Include a description of what you did to make the predictions. Whoever predicts best wins first prize. Whoever best estimates the number of mistakes they make comes in second. Prizes will be handed out.
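
One possible way to estimate the number of mistakes is cross-validation on the training data: estimate an error rate from held-out folds and scale it to 500 messages. The sketch below shows that idea together with writing space-separated predictions to a text file. Everything in it is an assumption for illustration: the data are simulated (`toy` and `toy.new` are hypothetical), the classifier is `lda()` from MASS, the number of folds is arbitrary, and the file name and label coding are placeholders rather than a required format.

```r
# Sketch only: simulated stand-in data, NOT the homework data.
library(MASS)  # provides lda()

set.seed(1)
n <- 600
y <- factor(sample(c("email", "spam"), n, replace = TRUE))
toy <- data.frame(x1 = rnorm(n, mean = ifelse(y == "spam", 1, 0)),
                  x2 = rnorm(n, mean = ifelse(y == "spam", -1, 0)),
                  y = y)

# K-fold cross-validation: each row is predicted by a model that never saw it
K <- 5
fold <- sample(rep(1:K, length.out = n))
cv.pred <- factor(rep(NA, n), levels = levels(toy$y))
for (k in 1:K) {
  fit <- lda(y ~ ., data = toy[fold != k, ])
  cv.pred[fold == k] <- predict(fit, newdata = toy[fold == k, ])$class
}
cv.err <- mean(cv.pred != toy$y)

# estimated number of mistakes on a new set of 500 messages
n.new <- 500
est.mistakes <- round(cv.err * n.new)
print(est.mistakes)

# refit on all labeled data and predict a hypothetical unlabeled set
toy.new <- data.frame(x1 = rnorm(n.new), x2 = rnorm(n.new))
final.fit <- lda(y ~ ., data = toy)
new.pred <- predict(final.fit, newdata = toy.new)$class

# write the predictions on one line, separated by spaces
out.file <- file.path(tempdir(), "predictions.txt")  # placeholder file name
cat(as.character(new.pred), file = out.file, sep = " ")
```

The cross-validated error rate is only an estimate, and it is only trustworthy if every step that looks at the labels (including any feature selection or tuning) is repeated inside each fold.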

Handing in

This homework is due on Monday February 22. The pen-and-paper section is due at the beginning of class (1:30pm) along with writeups of the analysis and prediction sections. Please send the code you used for the analysis and prediction sections along with your 500 predictions by email to with subject [Practical ML HW 1].