Homework 3

Pen-and-paper

  1. None...

Data analysis 1

You will conclude your exploration of regression by using smoothing methods: loess and smoothing splines. Use the ozone-level measurements data (dataset ozone in the ElemStatLearn package) as in Homework 2. For loess, report the 5-fold cross-validated RSS for local polynomials of degrees 1 and 2.

Model selection for loess in this case entails choosing the span. Within each cross-validation fold, choose the span by your favorite method (e.g. five-fold cross-validation). See the note below on tuning support vector machines for a nice function you can use to make this easier; a sketch of span selection appears after the loess example below. Use span values in (0,1] so they refer to the proportion of training points used to fit the local polynomials.

This example shows how to fit a loess model with local polynomials of degree 2 and a span of 20% of the training points, and plot the fit.

library(ElemStatLearn)
data(ozone)
x <- ozone$temperature
y <- ozone$ozone
plot(x, y)
# surface="direct" lets predict() extrapolate beyond the training range
lfit <- loess(y ~ x, degree = 2, span = 0.2, control = loess.control(surface = "direct"))
x.to.plot <- seq(50, 100, len = 50)
yhat <- predict(lfit, newdata = data.frame(x = x.to.plot))
lines(x.to.plot, yhat)

Update: when using loess for prediction, make sure the call is made as in the code example above, with control=loess.control(surface="direct"); otherwise predict() returns NA for points outside the range of the training data.
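Putting these pieces together, here is one sketch of span selection using the tune function from the e1071 package (the same function described in the support vector machine section below). It reuses x and y from the example above; the span grid is illustrative, and you would repeat this for degrees 1 and 2 within each outer cross-validation fold:

library(e1071)
# 5-fold CV over an illustrative grid of spans, for degree-1 local polynomials;
# the degree and control arguments are passed through tune() to loess()
tune.res <- tune(loess, y ~ x, data = data.frame(x = x, y = y),
                 ranges = list(span = seq(0.2, 1, len = 5)),
                 degree = 1,
                 control = loess.control(surface = "direct"),
                 tunecontrol = tune.control(cross = 5))
best.span <- tune.res$best.parameters$span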

Report the 5-fold cross-validated RSS for a smoothing spline. Model selection for a smoothing spline entails choosing the penalty parameter lambda (similar to ridge regression). In R, however, the smooth.spline function can do that for you:

sfit <- smooth.spline(x, y)   # lambda is chosen automatically (by generalized cross-validation)
yhat <- predict(sfit, x.to.plot)$y
lines(x.to.plot, yhat)

Include some plots of the data and the fits, including the best model from HW 2 (e.g. ridge regression on polynomials of degree 4), and include that model in your 5-fold cross-validation report.
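For instance, a minimal sketch of overlaying a ridge fit on a degree-4 polynomial basis, assuming lm.ridge from the MASS package was used in HW 2 (the lambda value below is a placeholder for whatever you selected there):

library(MASS)
X <- poly(x, 4, raw = TRUE)          # degree-4 polynomial basis
rfit <- lm.ridge(y ~ X, lambda = 1)  # placeholder lambda; substitute your HW 2 choice
# coef() back-transforms to the original basis, intercept first
yhat <- cbind(1, poly(x.to.plot, 4, raw = TRUE)) %*% coef(rfit)
lines(x.to.plot, yhat, lty = 2)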

Data analysis 2

You will pit your digit classifier from HW 2 against a support vector machine. The following example shows how to fit a support vector machine on a random subset of 200 digits from the HW 2 training set.

library(ElemStatLearn)
library(e1071)
data(zip.train)
# indices defining the HW 2 training set
load(url("http://www.biostat.jhsph.edu/~hcorrada/PracticalML/Data/zip_indices.rda"))
train.set <- zip.train[train.inds,]
# draw a random subset of 200 digits
set.seed(1)
sample.inds <- sample(1:nrow(train.set))[1:200]
train.sample <- train.set[sample.inds,]
x <- train.sample[,-1]          # pixel features
g <- factor(train.sample[,1])   # digit labels
svmfit <- svm(x, g)

By default, this function uses the "radial" kernel (i.e., the Gaussian kernel we saw in class). Model selection for SVMs in this case entails choosing the cost parameter C (think of it in terms similar to lambda in ridge regression) and the kernel bandwidth gamma (you saw this in kernel smoothing). The e1071 package has a nice function you can use to select these parameters with cross-validation. The following example shows how to do model selection for an SVM with 5-fold cross-validation over a 5-by-5 grid of cost and bandwidth values.

tune.res <- tune.svm(x, g, kernel = "radial", cost = 2^seq(-5, 5, len = 5),
                     gamma = 10^seq(-3, 6, len = 5), tunecontrol = tune.control(cross = 5))
print(tune.res)
svmfit <- tune.res$best.model   # the SVM refit on all the data with the best parameters
table(g, predict(svmfit))       # training confusion matrix

The tune function, which tune.svm wraps, can be used for general cross-validation purposes. Here is an example that computes 5-fold cross-validated error for a smoothing spline.

data(ozone)
x <- ozone$temperature
y <- ozone$ozone
tune.res <- tune(smooth.spline, x, y,
                 predict.func = function(object, newdata) predict(object, newdata)$y,
                 tunecontrol = tune.control(cross = 5))
# best.performance is the cross-validated mean squared error;
# multiply by the number of observations for a number on the RSS scale
rss <- tune.res$best.performance * length(y)

Report the 5-fold cross-validation error of your classifier from HW 2 and of a properly selected SVM (using the tune.svm function) on this subset of the training set of digits from HW 2 (train.set above). Use the "misclassified ones count as two errors" metric; one way to plug this metric into cross-validation is sketched below. Include a copy of the description of your classifier from HW 2 in your report.
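One way to wire this metric into tune.svm is through the error.fun argument of tune.control. The helper below is a hypothetical implementation of the metric; dividing by the number of validation points is an assumption about how the metric is normalized:

# hypothetical error function: a misclassified "1" counts as two errors
ones.count.double <- function(true, pred) {
  err <- as.character(true) != as.character(pred)
  sum(ifelse(true == "1", 2, 1) * err) / length(true)
}
tune.res <- tune.svm(x, g, kernel = "radial", cost = 2^seq(-5, 5, len = 5),
                     gamma = 10^seq(-3, 6, len = 5),
                     tunecontrol = tune.control(cross = 5, error.fun = ones.count.double))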

A final hint: you can use the class.weights argument to the svm function to train with uneven classification error weights.
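For example, a minimal sketch that makes errors on the digit 1 twice as costly, assuming the class labels in g are the digits 0 through 9 as in zip.train:

wts <- rep(1, 10)
names(wts) <- 0:9   # class labels as they appear in g
wts["1"] <- 2       # weight errors on the digit 1 twice as heavily
svmfit <- svm(x, g, class.weights = wts)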

Handing in

This homework is due on Monday March 22. Please send your writeup and code to with subject [Practical ML HW 3].