You will conclude your exploration of regression by using smoothing methods: loess and smoothing splines. Use the ozone-level measurements data (dataset ozone in the ElemStatLearn package) as in Homework 2. For loess, report the 5-fold cross-validated RSS for local polynomials of degrees 1 and 2.
Model selection for loess in this case entails choosing the span. Within each cross-validation fold, choose the span by your favorite method (e.g. five-fold cross-validation). See the note below on tuning support vector machines for a nice function you can use to make this easier; a sketch of that approach follows. Use span values in [0,1] so they refer to the proportion of the training points used to fit the local polynomials.
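As one illustration (not the required approach), the tune function from the e1071 package, described in the SVM note below, can search a grid of span values by cross-validation. The span grid, degree, and fold count here are placeholder choices, not part of the assignment:
# A sketch of choosing span inside one training fold with e1071's tune();
# the span grid, degree, and fold count are illustrative assumptions.
library(ElemStatLearn)
library(e1071)
data(ozone)
fold.df <- data.frame(x=ozone$temperature, y=ozone$ozone)  # stand-in for one training fold
span.tune <- tune(loess, y~x, data=fold.df,
                  ranges=list(span=seq(0.2, 1, by=0.1)),
                  degree=1, control=loess.control(surface="direct"),
                  tunecontrol=tune.control(cross=5))
best.span <- span.tune$best.parameters$span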
This example shows how to fit a loess model with local polynomials of degree 2 and a span of 20% of the training points, and plot the fit.
library(ElemStatLearn)
data(ozone)
x <- ozone$temperature
y <- ozone$ozone
plot(x, y)
# surface="direct" makes predict() compute the fit exactly at new points,
# including points outside the range of the training data
lfit <- loess(y~x, degree=2, span=.2, control=loess.control(surface="direct"))
# evaluate and draw the fitted curve on a grid of temperatures
x.to.plot <- seq(50, 100, len=50)
yhat <- predict(lfit, newdata=data.frame(x=x.to.plot))
lines(x.to.plot, yhat)
Update: when using loess for prediction, make sure the call is made as in the code example above (in particular, pass newdata as a data frame and set control=loess.control(surface="direct")).
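For reference, here is one possible shape of the outer 5-fold cross-validation loop for the loess RSS. This is a sketch only: the span is fixed at an arbitrary value, whereas your solution should select it within each fold as described above.
# A sketch of 5-fold cross-validated RSS for loess; span=0.5 is an
# arbitrary placeholder, not a recommended value.
library(ElemStatLearn)
data(ozone)
df <- data.frame(x=ozone$temperature, y=ozone$ozone)
set.seed(1)
folds <- sample(rep(1:5, length.out=nrow(df)))  # random fold assignment
rss <- 0
for (k in 1:5) {
  train <- df[folds != k, ]
  test <- df[folds == k, ]
  fit <- loess(y~x, data=train, degree=2, span=0.5,
               control=loess.control(surface="direct"))
  yhat <- predict(fit, newdata=test)
  rss <- rss + sum((test$y - yhat)^2)
}
rss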
Report the 5-fold cross-validated RSS for a smoothing spline. Model selection for a smoothing spline entails choosing the penalty parameter lambda (similar to ridge regression). In R, however, the smooth.spline function can do that for you:
# smooth.spline chooses lambda automatically (by generalized cross-validation)
sfit <- smooth.spline(x, y)
yhat <- predict(sfit, x.to.plot)$y  # predict() returns a list; take its $y component
lines(x.to.plot, yhat)
Include some plots of the data and fits, including the best model from HW 2 (e.g. ridge regression on polynomials of degree 4). Include this model in your 5-fold cross-validation report.
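One possible way to overlay such a fit (a sketch only: lm.ridge from the MASS package is one choice of ridge implementation, and the lambda below is a placeholder, not the value selected in HW 2):
# A sketch of overlaying ridge regression on a degree-4 polynomial basis;
# lambda=1 is a placeholder, not the HW2-selected value.
library(MASS)
X <- poly(x, 4)                  # orthogonal polynomial basis
rfit <- lm.ridge(y~X, lambda=1)
X.plot <- predict(X, x.to.plot)  # evaluate the same basis on the plotting grid
yhat.ridge <- drop(cbind(1, X.plot) %*% coef(rfit))
lines(x.to.plot, yhat.ridge, lty=2)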
You will pit your digit classifier from HW2 against a support vector machine. The following example shows how to fit a support vector machine on a subset of 200 digits from the training set from HW2.
library(ElemStatLearn)
library(e1071)
data(zip.train)
# load the train/test split used in HW2 (defines train.inds)
load(url("http://www.biostat.jhsph.edu/~hcorrada/PracticalML/Data/zip_indices.rda"))
train.set <- zip.train[train.inds,]
# draw a random subset of 200 digits
set.seed(1)
sample.inds <- sample(1:nrow(train.set))[1:200]
train.sample <- train.set[sample.inds,]
x <- train.sample[,-1]         # pixel features
g <- factor(train.sample[,1])  # digit labels
svmfit <- svm(x, g)
By default, this function uses the "radial" kernel (a.k.a. the Gaussian kernel) we saw in class. Model selection for SVMs in this case entails choosing the cost parameter C (think of it in terms similar to lambda in ridge regression) and the kernel bandwidth gamma (which you saw in kernel smoothing). The e1071 package has a nice function you can use to select these parameters by cross-validation. The following example shows how to do model selection for an SVM with 5-fold cross-validation over a 5-by-5 grid of cost and bandwidth values.
tune.res <- tune.svm(x, g, kernel="radial",
                     cost=2^seq(-5, 5, len=5),
                     gamma=10^seq(-3, 6, len=5),
                     tunecontrol=tune.control(cross=5))
print(tune.res)
# the model refit on all the data with the best parameter pair
svmfit <- tune.res$best.model
# training-set confusion table (predict() without newdata returns fitted values)
table(g, predict(svmfit))
This function can be used for general cross-validation purposes. Here is an example to get the 5-fold cross-validated RSS for a smoothing spline.
data(ozone)
x <- ozone$temperature
y <- ozone$ozone
# predict() for smooth.spline returns a list, so wrap it to extract the $y component
tune.res <- tune(smooth.spline, x, y,
                 predict.func=function(object, newdata) predict(object, newdata)$y,
                 tunecontrol=tune.control(cross=5))
rss <- tune.res$best.performance
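One caveat (our note, not part of the original example): for regression, tune reports the cross-validated mean squared error by default, so a number on the RSS scale requires rescaling by the sample size:
# best.performance is a mean squared error; multiplying by n puts it
# on the scale of a residual sum of squares.
rss.total <- tune.res$best.performance * length(y)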
Report the 5-fold cross-validation error of your classifier from HW2 and of a properly selected (with the tune.svm function) SVM on this subset of the training set of digits from HW2 (train.set above). Use the "misclassified ones count as two errors" metric (see the sketch below). Include a copy of the description of your classifier from HW2 in your report.
A final hint: you can use the class.weights argument to the svm function to train with uneven classification error weights (see the sketch below).
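For instance, a minimal sketch (the weight value is illustrative) that makes training errors on the digit "1" twice as costly:
# A sketch: weight class "1" twice as heavily as the other digits
wts <- rep(1, length(levels(g)))
names(wts) <- levels(g)
wts["1"] <- 2
svmfit.w <- svm(x, g, class.weights=wts)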
This homework is due on Monday March 22. Please send your writeup and code to with subject [Practical ML HW 3].