Stata Lab Q&A:
|
|||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||
Some Stata Tips for Working with Another Data Set 1. Are your data contained in an Excel Spread sheet? You could;
You will notice that sex is a string variable, the dates are not in a format that would allow you to subtract them, and the rate variables have values of 9 represent a missing value but Stata requires a “.” for a missing value. 2. Need to change the format of certain variables?
Create the following do file to make some changes: codebook sex tab sex encode sex, gen(sexn) tab sexn codebook sexn gen interview=date(interview_dt,"MDY") codebook interview gen followup=date(followup_dt,"MDY") gen time = followup-interview stem time list interview_dt followup_dt time foreach var of varlist rate_1-rate_2{ replace `var'=. if `var'==9 } tab rate_1, missing tab rate_2, missing In the results window, you will see: . do "C:\practice.do” . codebook sex ----------------------------------------------------------------------------------------- sex (unlabeled) ----------------------------------------------------------------------------------------- type: string (str1) unique values: 2 missing "": 0/6 tabulation: Freq. Value 3 "F" 3 "M" . tab sex sex | Freq. Percent Cum. ------------+----------------------------------- F | 3 50.00 50.00 M | 3 50.00 100.00 ------------+----------------------------------- Total | 6 100.00 . encode sex, gen(sexn) . tab sexn sexn | Freq. Percent Cum. ------------+----------------------------------- F | 3 50.00 50.00 M | 3 50.00 100.00 ------------+----------------------------------- Total | 6 100.00 . codebook sexn -----------------------------------------------------------------------------------------sexn (unlabeled) ----------------------------------------------------------------------------------------- type: numeric (long) label: sexn range: [1,2] units: 1 unique values: 2 missing .: 0/6 tabulation: Freq. Numeric Label 3 1 F 3 2 M . gen interview=date(interview_dt,"MDY") . codebook interview ----------------------------------------------------------------------------------------- interview (unlabeled) ----------------------------------------------------------------------------------------- type: numeric (float) range: [16802,17535] units: 1 unique values: 6 missing .: 0/6 tabulation: Freq. Value 1 16802 1 17059 1 17210 1 17242 1 17271 1 17535 . gen followup=date(followup_dt,"MDY") . gen time = followup-interview . stem time Stem-and-leaf plot for time 5* | 9 6* | 7* | 8* | 9 9* | 1112 . list interview_dt followup_dt time +-------------------------------+ | intervi~t followup~t time | |-------------------------------| 1. | 1/1/2006 3/1/2006 59 | 2. | 2/13/2007 5/13/2007 89 | 3. | 4/15/2007 7/15/2007 91 | 4. | 9/15/2006 12/15/2006 91 | 5. | 3/17/2007 6/17/2007 92 | |-------------------------------| 6. | 1/4/2008 4/4/2008 91 | +-------------------------------+ . foreach var of varlist rate_1-rate_2{ 2. replace `var'=. if `var'==9 3. } (2 real changes made, 2 to missing) (2 real changes made, 2 to missing) . tab rate_1, missing rate_1 | Freq. Percent Cum. ------------+----------------------------------- 1 | 1 16.67 16.67 2 | 2 33.33 50.00 4 | 1 16.67 66.67 . | 2 33.33 100.00 ------------+----------------------------------- Total | 6 100.00 . tab rate_2, missing rate_2 | Freq. Percent Cum. ------------+----------------------------------- 1 | 1 16.67 16.67 3 | 1 16.67 33.33 4 | 2 33.33 66.67 . | 2 33.33 100.00 ------------+----------------------------------- Total | 6 100.00 end of do-file 3. Are your data are contained in an Excel Spread sheet or a different format such as a SAS data file or SPSS data file? You could; a. Open the StatTransfer program in the computer lab rooms. StatTransfer allows you to transfer an input file of a certain specification (e.g., Excel, SAS, SPSS) to a Stata10 output file. Note: the second tab on the left of the StatTransfer window will allow you to select certain variables; the third tab on the left will allow you to select certain observations. By default, Stata transfers all observations and all variables and it will transfer dates into date format for you. . list +----------------------------------------------------------+ | age sex intervi~t rate_1 rate_2 followu~t id | |----------------------------------------------------------| 1. | 32 M 01 Jan 06 1 3 01 Mar 06 1 | 2. | 15 F 13 Feb 07 2 4 13 May 07 2 | 3. | 12 M 15 Apr 07 9 1 15 Jul 07 3 | 4. | 19 M 15 Sep 06 4 9 15 Dec 06 4 | 5. | 8 F 17 Mar 07 9 9 17 Jun 07 5 | |----------------------------------------------------------| 6. | 6 F 04 Jan 08 2 4 04 Apr 08 6 | +----------------------------------------------------------+ . codebook ------------------------------------------------------------------------------ age (unlabeled) ------------------------------------------------------------------------------ type: numeric (byte) range: [6,32] units: 1 unique values: 6 missing .: 0/6 tabulation: Freq. Value 1 6 1 8 1 12 1 15 1 19 1 32 ------------------------------------------------------------------------------sex (unlabeled) ----------------------------------------------------------------------------- type: string (str1) unique values: 2 missing "": 0/6 tabulation: Freq. Value 3 "F" 3 "M" ------------------------------------------------------------------------------ interview_dt (unlabeled) ------------------------------------------------------------------------------ type: numeric daily date (long) range: [16802,17535] units: 1 or equivalently: [01jan2006,04jan2008] units: days unique values: 6 missing .: 0/6 tabulation: Freq. Value 1 16802 01jan2006 1 17059 15sep2006 1 17210 13feb2007 1 17242 17mar2007 1 17271 15apr2007 1 17535 04jan2008 ------------------------------------------------------------------------------ rate_1 (unlabeled) ------------------------------------------------------------------------------ type: numeric (byte) range: [1,9] units: 1 unique values: 4 missing .: 0/6 tabulation: Freq. Value 1 1 2 2 1 4 2 9 ------------------------------------------------------------------------------ rate_2 (unlabeled) ------------------------------------------------------------------------------ type: numeric (byte) range: [1,9] units: 1 unique values: 4 missing .: 0/6 tabulation: Freq. Value 1 1 1 3 2 4 2 9 ------------------------------------------------------------------------------ followup_dt (unlabeled) ------------------------------------------------------------------------------ type: numeric daily date (long) range: [16861,17626] units: 1 or equivalently: [01mar2006,04apr2008] units: days unique values: 6 missing .: 0/6 tabulation: Freq. Value 1 16861 01mar2006 1 17150 15dec2006 1 17299 13may2007 1 17334 17jun2007 1 17362 15jul2007 1 17626 04apr2008 . 4. Have a large data set? Before you open it in Stata, type “set mem 35m” in the command line. 5. Need to merge two data sets? (Two data sets with different different variables on the same individuals.) Both data sets must have the same unique id for individuals; both data sets must be sort by id. . use "C:\practice1.dta", clear . sort id . merge id using "C:\practice2.dta" . tab _merge _merge | Freq. Percent Cum. ------------+----------------------------------- 3 | 6 100.00 100.00 ------------+----------------------------------- Total | 6 100.00 Stata creates a variable names _merge such that 1 indicates only in file 1, 2 indicates only in file 2 and 3 indicates in both files. +----------------------------------------------------------------------------------------+ | id age sexn interv~w followup rate_1 rate_2 outcome1 outcome2 _merge | |----------------------------------------------------------------------------------------| 1. | 1 32 M 16802 16861 1 3 Y N 3 | 2. | 2 15 F 17210 17299 2 4 N N 3 | 3. | 3 12 M 17271 17362 . 1 Y Y 3 | 4. | 4 19 M 17059 17150 4 . N Y 3 | 5. | 5 8 F 17242 17334 . . Y N 3 | |----------------------------------------------------------------------------------------| 6. | 6 6 F 17535 17626 2 4 N Y 3 | +----------------------------------------------------------------------------------------+ 6. Need to append two data sets? (Two data sets with same variables on different individuals.) . use "C:\practice1.dta", clear . sort id . append using "C:\practice3.dta" 7. Do you have multiple records for the same individual (same id). The Stata reshape command allows one to go from data in a “long” format with multiple records per person to a “wide” format with a single record per person. (long form) i j x_ij id year sex inc ----------------------- 1 80 0 5000 1 81 0 5500 1 82 0 6000 2 80 1 2000 2 81 1 2200 2 82 1 3300 3 80 0 3000 3 81 0 2000 3 82 0 1000 (wide form) i ....... x_ij ........ id sex inc80 inc81 inc82 ------------------------------- 1 0 5000 5500 6000 2 1 2000 2200 3300 3 0 3000 2000 1000 Here is the example from the Stata help for the reshape command. Given these data, you could use reshape to convert from one form to the other: . reshape wide inc, i(id) j(year) (goes long to wide) . reshape long inc, i(id) j(year) (goes from wide to long) 8. Don’t forget to use the Stata help menu. It may look ominous but if you scroll down, often there are examples at the end of the help file for a certain command. 9. Don’t forget to look back at your Biostat 621-623 lecture notes, problem sets, and Stata notes for tips. 10. Biostat 624 requires a data analysis project of your choice so this course will be helpful to you if you are working with another data set. 11. Are we missing a question that you may have? Please let us know. |
|||||||||||||||||||||||||||||||||||||||||||||||||
Home | Schedule | Classes | Problem Sets | e-Quizzes | Contact Us |