* Cl7EX1.DO Logistic Regression * Kyphosis Data from S-PLUS system * * * Raw data: kyphosis.dat * Assumes files are in folder [path]\regress * If files are in another folder, change cd command below to * point Stata to the correct folder * To run this program, use the following Stata commands: * cd [path]\regress ... change directory to folder regress * do cl7ex1 * OUTLINE: * Part a. Input and display data, calculate new variables * Part b. Scatterplot Matrix + ksm lowess smoothed curves in a single image * Part c. Boxplots by Kyphosis -- put into one image * Part d. Non-linearity display -- logs odds vs quintiles of X * put plots in one image * Part e. Fit Logistic Regression Model * Part f. Fit Alternative models, get Deviances using glm * Part h. Sensitivity, Specificity, ROC curves * Part i. Cross Validation with split samples * Part j. Check Fit using Hosmer-Lemeshow chi-square -- 10 groups * Part k. Calculations of Deviance and AIC * Housekeeping * Clear workspace clear * Turn off -more- pause set more off * Change the working directory cd "C:/Elizabeth/Regression Course/Materials 2004" * Save log file on disk, use .txt so Notepad will open it capture log close log using cl7ex1.log, replace * Make subfolder for graphs *shell md cl7ex1 * Extend linesize for log set linesize 100 * Part a. Input and display data, calculate new variables infile obs str7 kyph age number start using "Data\kyphosis.dat" gen kyphosis=1 if kyph=="present" replace kyphosis=0 if kyph=="absent" * Get means, medians, etc codebook kyphosis age number start , tabulate(2) * Center continuous Xs -- so beta0 has interpretation -- at medians, rounded gen agec = age-84 gen startc = start-13 gen numberc = number -4 gen startp=(start-10)*(start>10) if start~=. * Make "plus" function for broken-arrow spline with a break at 84 months gen agep = (age-84)*(age>84) if age~=. * Part b. Scatterplot Matrix + ksm lowess smoothed curves in a single image * Allow for long lines split over several lines separated by semi-colons set textsize 150 graph matrix age number start kyphosis, half jitter(5) ms(.) title("SCATTERPLOT MATRIX: KYPHOSIS DATA") graph export "cl7ex1\figb1.wmf", replace set textsize 150 lowess kyphosis age, adjust ms(. i) jitter(5) bwidth(0.8) title("KYPHOSIS -VS- AGE") saving("cl7ex1\one.gph", replace) graph export "cl7ex1\figb2.wmf",replace lowess kyphosis number, adjust ms(. i) jitter(5) bwidth(0.8) title("KYPHOSIS -VS- NUMBER") saving("cl7ex1\two.gph", replace) graph export "cl7ex1\figb3.wmf", replace lowess kyphosis start, adjust ms(. i) jitter(5) bwidth(0.8) title("KYPHOSIS -VS- START") saving("cl7ex1\three.gph", replace) graph export "cl7ex1\figb4.wmf",replace set textsize 170 graph combine "cl7ex1\one.gph" "cl7ex1\two.gph" "cl7ex1\three.gph" graph export "cl7ex1\figb.wmf",replace * Make the same plots but on the logit scale lowess kyphosis age, logit bwidth(0.8) title("KYPHOSIS -VS- AGE") saving("cl7ex1\onel.gph", replace) graph export "cl7ex1\figb2l.wmf",replace lowess kyphosis number, logit bwidth(0.8) title("KYPHOSIS -VS- NUMBER") saving("cl7ex1\twol.gph", replace) graph export "cl7ex1\figb3l.wmf", replace lowess kyphosis start, logit bwidth(0.8) title("KYPHOSIS -VS- START") saving("cl7ex1\threel.gph", replace) graph export "cl7ex1\figb4l.wmf",replace set textsize 170 graph combine "cl7ex1\onel.gph" "cl7ex1\twol.gph" "cl7ex1\threel.gph" graph export "cl7ex1\figbl.wmf",replace set textsize 150 * Make a better graph of kyphosis by start -- use more jitter and label; * the axes more completely, including tick marks; twoway scatter kyphosis start, jitter(10) ms(o) xtick(0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18) xlabel(0 3 6 9 12 15 18) ylabel(0 .5 1.0) ytitle("KYPHOSIS") xtitle("START") title("KYPHOSIS -VS- STARTING VERTEBRA") graph export "cl7ex1\figb5.wmf",replace set textsize 100 * Part c. Boxplots by Kyphosis -- put into one image sort kyphosis set textsize 150 graph box age , over(kyphosis) title("AGE") saving("cl7ex1\one.gph",replace) graph export "cl7ex1\figc1.wmf",replace graph box number, over(kyphosis) title("NUMBER") saving("cl7ex1\two.gph",replace) graph export "cl7ex1\figc2.wmf",replace graph box start, over(kyphosis) title("START") saving("cl7ex1\three.gph",replace) graph export "cl7ex1\figc3.wmf",replace set textsize 170 graph combine "cl7ex1\one.gph" "cl7ex1\two.gph" "cl7ex1\three.gph" graph export "cl7ex1\figc.wmf", replace set textsize 100 * Part d. Non-linearity display -- logs odds vs quintiles of X * put plots in one image set textsize 150 * Age egen q1 = pctile(age), p(20) egen q2 = pctile(age), p(40) egen q3 = pctile(age), p(60) egen q4 = pctile(age), p(80) egen qmin = min(age) egen qmax = max(age) gen quintile = (q4+qmax)/2 if age > q4 replace quintile = (q3+q4 )/2 if age <= q4 replace quintile = (q2+q3 )/2 if age <= q3 replace quintile = (q1+q2 )/2 if age <= q2 replace quintile = (qmin+q1)/2 if age <= q1 replace quintile =. if age==. egen qcat = group(quintile) tab quintile qcat egen prop = mean(kyphosis), by (qcat) gen lodds = log(prop/(1-prop)) twoway scatter lodds quintile , sort c(l) ms(O) title("AGE") ytitle("LOG ODDS") xtitle("QUINTILES") saving(cl7ex1\one.gph,replace) graph export "cl7ex1\figd1.wmf",replace drop q1-q4 quintile qmin qmax qcat prop lodds * Start egen q1 = pctile(start), p(20) egen q2 = pctile(start), p(40) egen q3 = pctile(start), p(60) egen q4 = pctile(start), p(80) egen qmin = min(start) egen qmax = max(start) gen quintile = (q4+qmax)/2 if start > q4 replace quintile = (q3+q4 )/2 if start <= q4 replace quintile = (q2+q3 )/2 if start <= q3 replace quintile = (q1+q2 )/2 if start <= q2 replace quintile = (qmin+q1)/2 if start <= q1 replace quintile =. if start==. egen qcat = group(quintile) tab quintile qcat egen prop = mean(kyphosis), by (qcat) gen lodds = log(prop/(1-prop)) twoway scatter lodds quintile , sort c(l) ms(O) title("START") ytitle("LOG ODDS") xtitle("QUINTILES") saving(cl7ex1\two.gph,replace) graph export "cl7ex1\figd1.wmf",replace drop q1-q4 quintile qmin qmax qcat prop lodds * Number egen q1 = pctile(number), p(20) egen q2 = pctile(number), p(40) egen q3 = pctile(number), p(60) egen q4 = pctile(number), p(80) egen qmin = min(number) egen qmax = max(number) gen quintile = (q4+qmax)/2 if number > q4 replace quintile = (q3+q4 )/2 if number <= q4 replace quintile = (q2+q3 )/2 if number <= q3 replace quintile = (q1+q2 )/2 if number <= q2 replace quintile = (qmin+q1)/2 if number <= q1 replace quintile =. if start==. egen qcat = group(quintile) tab quintile qcat egen prop = mean(kyphosis), by (qcat) gen lodds = log(prop/(1-prop)) twoway scatter lodds quintile , sort c(l) ms(O) title("NUMBER") ytitle("LOG ODDS") xtitle("QUINTILES") saving(cl7ex1\three.gph,replace) graph export "cl7ex1\figd1.wmf",replace drop q1-q4 quintile qmin qmax qcat prop lodds graph combine cl7ex1\one.gph cl7ex1\two.gph cl7ex1\three.gph graph export "cl7ex1\figd.wmf",replace set textsize 100 * Part e. Fit Logistic Regression Model logit kyphosis agec agep startc numberc * Estimate beta1+beta2 - slope after age 7 lincom agec+agep * Get Odds Ratios logistic kyphosis agec agep startc numberc lincom agec+agep * Verify global test for Xs=0 fitting null vs extended models quietly logit kyphosis age agep startc numberc est store A quietly logit kyphosis lrtest A * Display correlation matrix for estimates -- rarely done/needed * -- show correlations among betas for verifying beta1+beta2 estimate quietly logit kyphosis age agep startc numberc vce , corr * Part f. Fit Alternative models, get Deviances using glm * Model with intercept only -- to confirm global test logit kyphosis quietly glm kyphosis , family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) " P-Value: " chiprob(e(df),e(deviance_s)) * Model with all Xs: age agep startc numberc logit kyphosis agec agep startc numberc quietly glm kyphosis agec agep startc numberc, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) " P-Value: " chiprob(e(df),e(deviance_s)) * Model A: agec agep logistic kyphosis agec agep logit quietly glm kyphosis agec agep, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) " P-Value: " chiprob(e(df),e(deviance_s)) * Model B: start logistic kyphosis startc logit quietly glm kyphosis startc, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) " P-Value: " chiprob(e(df),e(deviance_s)) * Model C: numberc logistic kyphosis numberc logit quietly glm kyphosis numberc, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) " P-Value: " chiprob(e(df),e(deviance_s)) * Model D: age agep startc logistic kyphosis agec agep startc logit quietly glm kyphosis agec agep startc, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) " P-Value: " chiprob(e(df),e(deviance_s)) * Model E: age agep numberc logistic kyphosis agec agep numberc logit quietly glm kyphosis agec agep numberc, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) " P-Value: " chiprob(e(df),e(deviance_s)) * Model F: age agep startp logistic kyphosis agec agep startp logit quietly glm kyphosis agec agep startp, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) " P-Value: " chiprob(e(df),e(deviance_s)) * Model G: age agep startc startp replace startc=(start-10) if start~=. logistic kyphosis agec agep startc startp logit quietly glm kyphosis agec agep startc startp, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) " P-Value: " chiprob(e(df),e(deviance_s)) * Model H: age agep startc numberc logistic kyphosis agec agep startc numberc logit quietly glm kyphosis agec agep startc numberc, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) " P-Value: " chiprob(e(df),e(deviance_s)) * Part g. Graph fitted -vs- observed quietly logistic kyphosis age agep startp lpredict muhat * Lowess smoothed plot set textsize 150 lowess kyphosis muhat, adjust s(. i) jitter(5) bwidth(0.8) title("KYPHOSIS -VS- PREDICTED PROBS") graph export cl7ex1\figg1.wmf",replace * Plot deciles fitted -vs- observed * muhat egen q1 = pctile(muhat), p(10) egen q2 = pctile(muhat), p(20) egen q3 = pctile(muhat), p(30) egen q4 = pctile(muhat), p(40) egen q5 = pctile(muhat), p(50) egen q6 = pctile(muhat), p(60) egen q7 = pctile(muhat), p(70) egen q8 = pctile(muhat), p(80) egen q9 = pctile(muhat), p(90) egen qmin = min(muhat) egen qmax = max(muhat) gen decile = (q9+qmax)/2 if muhat > q9 replace decile = (q8+q9 )/2 if muhat <= q9 replace decile = (q7+q8 )/2 if muhat <= q8 replace decile = (q6+q7 )/2 if muhat <= q7 replace decile = (q5+q6 )/2 if muhat <= q6 replace decile = (q4+q5 )/2 if muhat <= q5 replace decile = (q3+q4 )/2 if muhat <= q4 replace decile = (q2+q3 )/2 if muhat <= q3 replace decile = (q1+q2 )/2 if muhat <= q2 replace decile = (qmin+q1)/2 if muhat <= q1 replace decile =. if muhat==. egen deccat = group(decile) tab decile deccat egen prop = mean(kyphosis), by (deccat) * Plot obs -vs- pred and y=x for comparison with perfect fit twoway scatter prop decile decile , sort c(l l) ms(O i) title("OBSERVED -VS- PREDICTED PROPORTIONS -- DECILES") graph export "cl7ex1\figg2.wmf",replace set textsize 100 * Part h. Sensitivity, Specificity, ROC curves set textsize 125 * Model with age agep startc numberc quietly logit kyphosis age agep startc numberc * Sensitivity, Specificity, etc for a give cut point lstat , cutoff(0.5) lstat , cutoff(0.3) * Plot Sensitivity -vs- Specificity -- all possible cut points lsens , title("MODEL: AGE AGEP STARTC NUMBERC") ytitle("Sensitivity / Specificity") graph export "cl7ex1\figh1.wmf",replace * ROC CURVES lroc , title("MODEL: AGEC AGEP STARTC NUMBERC") ytitle("Sensitivity") graph export "cl7ex1\figh2.wmf",replace * Better model: double broken-arrow model - age agep startp quietly logit kyphosis age agep startp lroc , title("MODEL: AGE AGEP STARTP") ytitle("Sensitivity") graph export "cl7ex1\figh3.wmf",replace * Part i. Cross Validation with split samples * Use modification of Rick Thompson's macro: crossval.ado * -- can be downloaded from course website * Put crossval.ado in the current folder or in the personal ADO folder * Note: Parameter numgrps = k , where 1/k = random fraction of data deleted * for cross-validation. k=2 splits data in random halves. * Recommend setting k=n, which deletes one observation at a time crossval kyphosis age agep startp, numgrp(81) * Part j. Check Fit using Hosmer-Lemeshow chi-square -- 10 groups quietly logit kyphosis age agep startc numberc lfit , group(10) quietly logit kyphosis age agep startp lfit , group(10) set textsize 100 * Part k. Calculations of Deviance and AIC * Generate 4 new "predictors", all containing random numbers set seed 568123457 gen x5 = uniform() gen x6 = uniform() gen x7 = uniform() gen x8 = uniform() * Get Deviances and AICs (uses mlfit.ado macro -- downloadable from website) quietly glm kyphosis agec agep startc numberc , family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) quietly logit kyphosis agec agep startc numberc mlfit quietly glm kyphosis agec agep startc numberc x5, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) quietly logit kyphosis agec agep startc numberc x5 mlfit quietly glm kyphosis agec agep startc numberc x5 x6, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) quietly logit kyphosis agec agep startc numberc x5 x6 mlfit quietly glm kyphosis agec agep startc numberc x5 x6 x7, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) quietly logit kyphosis agec agep startc numberc x5 x6 x7 mlfit quietly glm kyphosis agec agep startc numberc x5 x6 x7 x8, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) quietly logit kyphosis agec agep startc numberc x5 x6 x7 x8 mlfit * Close log file -- Only when all errors have been fixed *log close