* Cl7EX1.DO Logistic Regression * Kyphosis Data from S-PLUS system * * * Raw data: kyphosis.dat * Assumes files are in folder [path]\regress * If files are in another folder, change cd command below to * point Stata to the correct folder * To run this program, use the following Stata commands: * cd [path]\regress ... change directory to folder regress * do cl7ex1 * OUTLINE: * Part a. Input and display data, calculate new variables * Part b. Scatterplot Matrix + ksm lowess smoothed curves in a single image * Part c. Boxplots by Kyphosis -- put into one image * Part d. Non-linearity display -- logs odds vs quintiles of X * put plots in one image * Part e. Fit Logistic Regression Model * Part f. Fit Alternative models, get Deviances using glm * Part h. Sensitivity, Specificity, ROC curves * Part i. Cross Validation with split samples * Part j. Check Fit using Hosmer-Lemeshow chi-square -- 10 groups * Part k. Calculations of Deviance and AIC * Housekeeping * Clear workspace clear * Turn off -more- pause set more off * Change the working directory * cd path/regress * Save log file on disk, use .txt so Notepad will open it capture log close log using cl7ex1.log, replace * Make subfolder for graphs shell md cl7ex1 * Extend linesize for log set linesize 100 * Part a. Input and display data, calculate new variables infile obs str7 kyph age number start using kyphosis.dat gen kyphosis=1 if kyph=="present" replace kyphosis=0 if kyph=="absent" * Get means, medians, etc codebook kyphosis age number start , tabulate(2) * Center continuous Xs -- so beta0 has interpretation -- at medians, rounded gen agec = age-84 gen startc = start-13 gen numberc = number -4 gen startp=(start-10)*(start>10) if start~=. * Make "plus" function for broken-arrow spline with a break at 84 months gen agep = (age-84)*(age>84) if age~=. * Part b. Scatterplot Matrix + ksm lowess smoothed curves in a single image * Allow for long lines split over several lines separated by semi-colons #delimit ; set textsize 150; graph age number start kyphosis, matrix half jitter(5) symbol(.)connect(s) bands(4) t1("SCATTERPLOT MATRIX: KYPHOSIS DATA"); #delimit cr; gphprint , saving(cl7ex1\figb1.wmf,replace) #delimit ; set textsize 150; ksm kyphosis age, lowess xlab ylab adjust sy(.i) jitter(5) bwidth(0.8) saving(cl7ex1\one.gph,replace) t1("KYPHOSIS -VS- AGE"); gphprint , saving(cl7ex1\figb2.wmf,replace) ; ksm kyphosis number, lowess xlab ylab adjust sy(.i) jitter(5) bwidth(0.8) saving(cl7ex1\two.gph,replace) t1("KYPHOSIS -VS- NUMBER"); gphprint , saving(cl7ex1\figb3.wmf,replace) ; ksm kyphosis start, lowess xlab ylab adjust sy(.i) jitter(5) bwidth(0.8) saving(cl7ex1\three.gph,replace) t1("KYPHOSIS -VS- START"); gphprint , saving(cl7ex1\figb4.wmf,replace) ; set textsize 170; graph using cl7ex1\one.gph cl7ex1\two.gph cl7ex1\three.gph, saving(cl7ex1\figb.wmf,replace); set textsize 150; * Make a better graph of kyphosis by start -- use more jitter and label; * the axes more completely, including tick marks; graph kyphosis start, jitter(10) symbol(o) xtick(0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18) xlabel (0 3 6 9 12 15 18) ylabel(0 .5 1.0) l1(" ") l2("KYPHOSIS") b2("START") t1("KYPHOSIS -VS- STARTING VERTEBRA"); gphprint , saving(figb5.wmf,replace); #delimit cr; set textsize 100 * Part c. Boxplots by Kyphosis -- put into one image sort kyphosis set textsize 150 graph age , by (kyphosis) box t2("AGE") ylab saving(cl7ex1\one.gph,replace) gphprint , saving(cl7ex1\figc1.wmf,replace) graph number, by (kyphosis) box t2("NUMBER") ylab saving(cl7ex1\two.gph,replace) gphprint , saving(cl7ex1\figc2.wmf,replace) graph start, by (kyphosis) box t2("START") ylab saving(cl7ex1\three.gph,replace) b2("KYPHOSIS") gphprint , saving(cl7ex1\figc3.wmf,replace) set textsize 170 graph using cl7ex1\one.gph cl7ex1\two.gph cl7ex1\three.gph, saving(cl7ex1\figc.wmf,replace) set textsize 100 * Part d. Non-linearity display -- logs odds vs quintiles of X * put plots in one image set textsize 150 * Age egen q1 = pctile(age), p(20) egen q2 = pctile(age), p(40) egen q3 = pctile(age), p(60) egen q4 = pctile(age), p(80) egen qmin = min(age) egen qmax = max(age) gen quintile = (q4+qmax)/2 if age > q4 replace quintile = (q3+q4 )/2 if age <= q4 replace quintile = (q2+q3 )/2 if age <= q3 replace quintile = (q1+q2 )/2 if age <= q2 replace quintile = (qmin+q1)/2 if age <= q1 replace quintile =. if age==. egen qcat = group(quintile) tab quintile qcat egen prop = mean(kyphosis), by (qcat) gen lodds = log(prop/(1-prop)) graph lodds quintile , connect(l) symbol(O) xlab ylab t1("AGE") l1(" ") l2("LOG ODDS") b2("QUINTILES") saving(cl7ex1\one.gph,replace) gphprint , saving(cl7ex1\figd1.wmf,replace) drop q1-q4 quintile qmin qmax qcat prop lodds * Start egen q1 = pctile(start), p(20) egen q2 = pctile(start), p(40) egen q3 = pctile(start), p(60) egen q4 = pctile(start), p(80) egen qmin = min(start) egen qmax = max(start) gen quintile = (q4+qmax)/2 if start > q4 replace quintile = (q3+q4 )/2 if start <= q4 replace quintile = (q2+q3 )/2 if start <= q3 replace quintile = (q1+q2 )/2 if start <= q2 replace quintile = (qmin+q1)/2 if start <= q1 replace quintile =. if start==. egen qcat = group(quintile) tab quintile qcat egen prop = mean(kyphosis), by (qcat) gen lodds = log(prop/(1-prop)) graph lodds quintile , connect(l) symbol(O) xlab ylab t1("START") l1(" ") l2("LOG ODDS") b2("QUINTILES") saving(cl7ex1\two.gph,replace) gphprint , saving(cl7ex1\figd1.wmf,replace) drop q1-q4 quintile qmin qmax qcat prop lodds * Number egen q1 = pctile(number), p(20) egen q2 = pctile(number), p(40) egen q3 = pctile(number), p(60) egen q4 = pctile(number), p(80) egen qmin = min(number) egen qmax = max(number) gen quintile = (q4+qmax)/2 if number > q4 replace quintile = (q3+q4 )/2 if number <= q4 replace quintile = (q2+q3 )/2 if number <= q3 replace quintile = (q1+q2 )/2 if number <= q2 replace quintile = (qmin+q1)/2 if number <= q1 replace quintile =. if start==. egen qcat = group(quintile) tab quintile qcat egen prop = mean(kyphosis), by (qcat) gen lodds = log(prop/(1-prop)) graph lodds quintile , connect(l) symbol(O) xlab ylab t1("NUMBER") l1(" ") l2("LOG ODDS") b2("QUINTILES") saving(cl7ex1\three.gph,replace) gphprint , saving(cl7ex1\figd1.wmf,replace) graph using cl7ex1\one.gph cl7ex1\two.gph cl7ex1\three.gph, saving(cl7ex1\figd.wmf,replace) set textsize 100 * Part e. Fit Logistic Regression Model logit kyphosis agec agep startc numberc * Estimate beta1+beta2 - slope after age 7 lincom agec+agep * Get Odds Ratios logistic kyphosis agec agep startc numberc lincom agec+agep * Verify global test for Xs=0 fitting null vs extended models quietly logit kyphosis age agep startc numberc lrtest, saving(0) quietly logit kyphosis lrtest * Display correlation matrix for estimates -- rarely done/needed * -- show correlations among betas for verifying beta1+beta2 estimate quietly logit kyphosis age agep startc numberc vce , corr * Part f. Fit Alternative models, get Deviances using glm * Model with intercept only -- to confirm global test logit kyphosis quietly glm kyphosis , family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) " P-Value: " chiprob(e(df),e(deviance_s)) * Model with all Xs: age agep startc numberc logit kyphosis agec agep startc numberc quietly glm kyphosis agec agep startc numberc, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) " P-Value: " chiprob(e(df),e(deviance_s)) * Model A: agec agep logistic kyphosis agec agep logit quietly glm kyphosis agec agep, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) " P-Value: " chiprob(e(df),e(deviance_s)) * Model B: start logistic kyphosis startc logit quietly glm kyphosis startc, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) " P-Value: " chiprob(e(df),e(deviance_s)) * Model C: numberc logistic kyphosis numberc logit quietly glm kyphosis numberc, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) " P-Value: " chiprob(e(df),e(deviance_s)) * Model D: age agep startc logistic kyphosis agec agep startc logit quietly glm kyphosis agec agep startc, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) " P-Value: " chiprob(e(df),e(deviance_s)) * Model E: age agep numberc logistic kyphosis agec agep numberc logit quietly glm kyphosis agec agep numberc, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) " P-Value: " chiprob(e(df),e(deviance_s)) * Model F: age agep startp logistic kyphosis agec agep startp logit quietly glm kyphosis agec agep startp, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) " P-Value: " chiprob(e(df),e(deviance_s)) * Model G: age agep startc startp replace startc=(start-10) if start~=. logistic kyphosis agec agep startc startp logit quietly glm kyphosis agec agep startc startp, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) " P-Value: " chiprob(e(df),e(deviance_s)) * Model H: age agep startc numberc logistic kyphosis agec agep startc numberc logit quietly glm kyphosis agec agep startc numberc, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) " P-Value: " chiprob(e(df),e(deviance_s)) * Part g. Graph fitted -vs- observed quietly logistic kyphosis age agep startp lpredict muhat * Lowess smoothed plot set textsize 150 ksm kyphosis muhat, lowess ylab xlab s(.i) bwidth(0.8) t1("KYPHOSIS -VS- PREDICTED PROBS") gphprint , saving(cl7ex1\figg1.wmf,replace) * Plot deciles fitted -vs- observed drop q1-q4 quintile qmin qmax qcat prop lodds * muhat egen q1 = pctile(muhat), p(10) egen q2 = pctile(muhat), p(20) egen q3 = pctile(muhat), p(30) egen q4 = pctile(muhat), p(40) egen q5 = pctile(muhat), p(50) egen q6 = pctile(muhat), p(60) egen q7 = pctile(muhat), p(70) egen q8 = pctile(muhat), p(80) egen q9 = pctile(muhat), p(90) egen qmin = min(muhat) egen qmax = max(muhat) gen decile = (q9+qmax)/2 if muhat > q9 replace decile = (q8+q9 )/2 if muhat <= q9 replace decile = (q7+q8 )/2 if muhat <= q8 replace decile = (q6+q7 )/2 if muhat <= q7 replace decile = (q5+q6 )/2 if muhat <= q6 replace decile = (q4+q5 )/2 if muhat <= q5 replace decile = (q3+q4 )/2 if muhat <= q4 replace decile = (q2+q3 )/2 if muhat <= q3 replace decile = (q1+q2 )/2 if muhat <= q2 replace decile = (qmin+q1)/2 if muhat <= q1 replace decile =. if muhat==. egen deccat = group(decile) tab decile deccat egen prop = mean(kyphosis), by (deccat) * Plot obs -vs- pred and y=x for comparison with perfect fit graph prop decile decile , connect(ll) symbol(Oi) xlab ylab t1("OBSERVED -VS- PREDICTED PROPORTIONS -- DECILES") gphprint , saving(cl7ex1\figg2.wmf,replace) set textsize 100 * Part h. Sensitivity, Specificity, ROC curves set textsize 125 * Model with age agep startc numberc quietly logit kyphosis age agep startc numberc * Sensitivity, Specificity, etc for a give cut point lstat , cutoff(0.5) lstat , cutoff(0.3) * Plot Sensitivity -vs- Specificity -- all possible cut points lsens , t1("MODEL: AGE AGEP STARTC NUMBERC") l1(".") l2("Sensitivity / Specificity") gphprint , saving(cl7ex1\figh1.wmf,replace) * ROC CURVES lroc , t1("MODEL: AGEC AGEP STARTC NUMBERC") l1(" ") l2("Sensitivity") gphprint , saving(cl7ex1\figh2.wmf,replace) * Better model: double broken-arrow model - age agep startp quietly logit kyphosis age agep startp lroc , t1("MODEL: AGE AGEP STARTP") l1(" ") l2("Sensitivity") gphprint , saving(cl7ex1\figh3.wmf,replace) * Part i. Cross Validation with split samples * Use modification of Rick Thompson's macro: crossval.ado * -- can be downloaded from course website * Put crossval.ado in the current folder or in the personal ADO folder * Note: Parameter numgrps = k , where 1/k = random fraction of data deleted * for cross-validation. k=2 splits data in random halves. * Recommend setting k=n, which deletes one observation at a time crossval kyphosis age agep startp, numgrp(81) * Part j. Check Fit using Hosmer-Lemeshow chi-square -- 10 groups quietly logit kyphosis age agep startc numberc lfit , group(10) quietly logit kyphosis age agep startp lfit , group(10) set textsize 100 * Part k. Calculations of Deviance and AIC * Generate 4 new "predictors", all containing random numbers set seed 568123457 gen x5 = uniform() gen x6 = uniform() gen x7 = uniform() gen x8 = uniform() * Get Deviances and AICs (uses mlfit.ado macro -- downloadable from website) quietly glm kyphosis agec agep startc numberc , family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) quietly logit kyphosis agec agep startc numberc mlfit quietly glm kyphosis agec agep startc numberc x5, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) quietly logit kyphosis agec agep startc numberc x5 mlfit quietly glm kyphosis agec agep startc numberc x5 x6, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) quietly logit kyphosis agec agep startc numberc x5 x6 mlfit quietly glm kyphosis agec agep startc numberc x5 x6 x7, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) quietly logit kyphosis agec agep startc numberc x5 x6 x7 mlfit quietly glm kyphosis agec agep startc numberc x5 x6 x7 x8, family(binomial) link(logit) display "Deviance: " e(deviance_s) " d.f.: " e(df) quietly logit kyphosis agec agep startc numberc x5 x6 x7 x8 mlfit * Close log file -- Only when all errors have been fixed *log close