Stata Tips

Stata Lab Q&A:

Use egen to calculate anthropometric z-scores
Converting range into SD from normally distributed data
Can I increase the capacity of the Results window in Stata?
Can I increase the number of commands stored in the Review window?
How do I list a variable, x, in order of the frequency counts?
How do I format regression output for Word?
Converting ASCII file to Stata using SAS program -- e.g., convert NHANES III dataset to Stata format
How can I take random samples from an existing dataset?
What to do if Stata does not start when you double-click a do-file
Stata error codes
With a Mac, how can I store the last 50 commands in a do-file?
How to merge variables from 2 or more datasets
Analyze slopes from longitudinal data
How to get higher resolution TIFF files with Stata/Macintosh
What if the do-file is too large for the do-file editor?
Some Stata Tips for Working with Another Data Set

Can I increase the capacity of the Results window in Stata?

Yes, but it uses additional memory. To permanently increase the capacity of the Results window from 32,000 characters to 2,000,000 characters, give the following command in a do-file or Stata session. The new setting takes effect the next time you launch Stata:
set scrollbufsize 2000000

Why doesn't Stata work correctly?

The cause is not always clear, but first make sure everything is up to date. After you have installed Stata 12 from CD, give the commands:

update all (wait for it to finish)

update swap (renames the wstata.bin just downloaded -- Stata will restart)

How do I list a variable, x, in order of the frequency counts?

Use the following Stata statements, interactively or (better) in a .do file:

* contract replaces the data in memory, so save the dataset before running it
contract x
sort _freq
list x _freq

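An alternative that leaves the dataset in memory, but is harder to read, is the following (it assumes x takes non-negative integer values below 1000):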

bysort x: gen count=_N
gen count_x=1000*count+x
tab count_x

In SAS this is done with the order=freq option:

proc freq order=freq;
table x;
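
Because contract replaces the data in memory, one way to avoid saving and reloading is to wrap it in preserve and restore. The following is a minimal sketch, assuming Stata's built-in auto dataset and its rep78 variable stand in for your data and x; gsort -_freq is used here so that the most frequent values are listed first:

```stata
* Illustration only: auto.dta and rep78 stand in for your dataset and x
sysuse auto, clear
preserve                 // keep a copy of the full dataset
contract rep78           // one row per value of rep78, with counts in _freq
gsort -_freq             // most frequent values first
list rep78 _freq
restore                  // bring back the original observations
```

After restore, the full dataset is back in memory, so later commands are unaffected by the frequency listing.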

Can I increase the number of commands stored in the Review window?

In Stata, you can increase the number of commands retained to 500 using the command:

set reventries 500, permanently

How can I take random samples from an existing dataset?

The Stata website gives a full description of how this is done:

http://www.stata.com/support/faqs/stat/sampling.html

Examples:

* draw and save a 10% sample
sample 10
save samp10pct.dta, replace

* reload the original dataset first: sample keeps only the sampled observations in memory
* draw and save a sample of size 1000
sample 1000, count
save samp1000.dta, replace
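
Two details are easy to miss: sample drops the unsampled observations from memory, and the draw differs from run to run unless a seed is set. The sketch below assumes the built-in auto dataset (74 observations) stands in for your data; the seed value and the sample sizes are arbitrary choices for illustration:

```stata
* Illustration only: auto.dta stands in for your dataset
sysuse auto, clear
set seed 20240101        // arbitrary seed so the same sample is drawn each run

preserve
sample 50                // keep a 50% sample
count
restore                  // full dataset is back for the next draw

sample 20, count         // keep exactly 20 randomly chosen observations
count
```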

What to do if Stata does not start when you double-click a do-file

Open Windows Explorer.

Click Tools => Folder Options => File Types, then scroll down to the "DO" file extension.

If the details for the "DO" extension do not say "Opens with stata", click Change and find stata.exe.

If it already says "Opens with stata", click Advanced.

Under Actions, there should be entries that include edit and open.

Highlight "open", click "Edit", and make sure the following is in the "Application used to perform action" box (include the quotes):

"C:\Program Files\Stata12\stata.exe" do "%1"

Also check that stata is in the Application box.

Stata error messages

With a Mac, how can I store the last 50 commands in a do-file?

#review 50

Cut and paste the commands from the log into the do-file (you will have to remove the line numbers).

How to merge variables from 2 or more datasets
Example 1: Merge 2 datasets:
data1 and data2, using variable id to link

clear
use data1
sort id
save data1, replace

clear
use data2
sort id
save data2, replace

merge id using data1
*keep observations only if on both datasets -- _merge code==3
keep if _merge==3
save mergedata, replace

Example 2: Merge 3 datasets:
data1, data2 and data3, using variable id to link

clear
use data1
sort id
save data1, replace

clear
use data2
sort id
save data2, replace

clear
use data3
sort id
save data3, replace

merge id using data1 data2
*keep observations if on all 3 datasets
keep if _merge1==1 & _merge2==1
drop _merge1 _merge2 _merge
save mergedata, replace
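
The examples above use the older merge syntax, which Stata 11 and later still accept. In those versions the merge 1:1 form is also available; it does not require the datasets to be sorted first. A minimal sketch, using two small made-up datasets (the variables x and y and the tempfile name d1 are hypothetical, for illustration only):

```stata
* First toy dataset: ids 1-3 with variable x
clear
input id x
1 10
2 20
3 30
end
tempfile d1
save `d1'

* Second toy dataset: ids 2-4 with variable y
clear
input id y
2 200
3 300
4 400
end

* one-to-one merge on id; no prior sort is needed
merge 1:1 id using `d1'

* keep observations only if on both datasets
keep if _merge == 3
drop _merge
list
```

Only ids 2 and 3 appear in both toy datasets, so those are the observations that remain.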

Analyze slopes from longitudinal data

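I have a longitudinal data set and want to regress an outcome (yvar) on time within each id and attach the resulting beta coefficient and SE to the id.

* placeholders for the slope and its standard error
gen coef = .
gen seb = .
* loop over each distinct value of id (id is assumed to be numeric)
levelsof id, local(levels)
foreach l of local levels {
    * fit the regression for this id only
    regress yvar time if id==`l'
    mat b = e(b)
    mat V = e(V)
    * time is the first term, so b[1,1] is its slope and V[1,1] its variance
    replace coef = b[1,1] if id==`l'
    replace seb = sqrt(V[1,1]) if id==`l'
}
list id coef seb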
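
The same per-id slopes can also be collected with statsby, which runs a command by group and leaves one row of results per group. This is a sketch under stated assumptions: the toy panel below is made up for illustration, and unlike the loop above, statsby with the clear option replaces the data in memory with the results, so they would need to be merged back by id if you want them attached to the original records.

```stata
* Toy panel (hypothetical data): 3 ids, 4 time points each
clear
set obs 3
gen id = _n
expand 4
bysort id: gen time = _n
* true slope is 2*id; the +1/-1 wiggle adds scatter without changing the fitted slope
gen yvar = 1 + 2*id*time + cond(inlist(time, 1, 4), 1, -1)

* run the regression separately for each id and keep the slope and its SE
statsby coef=_b[time] seb=_se[time], by(id) clear: regress yvar time
list id coef seb
```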

How to format regression output for Word
- outreg.ado
  
  Installation:
  ssc install outreg, replace all
  Use help ssc for details on the ssc command.
  Use help outreg for details on the outreg command.
- (outreg2 is a related alternative and can be installed as above)
- Description
  
  outreg creates an ASCII text file with columns separated with tab
  characters. The file can be converted automatically to a table in
  word processors and spreadsheets.
  
  For example, in Microsoft Word:
  Open or Insert the file created by outreg.
  Select the estimation output text that is in columns
  (not the notes at the bottom of the table or the title at
  the top, if any).
  
  Select Table / Convert / Text to Table.
  
  With some adjustment of the column widths, fonts, etc.
  the final table is ready in Word.
- Example:
  Store the following commands in outreg_demo.do and run it:
  
  * Illustrate using built-in auto Stata dataset
  sysuse auto, clear
  
  * Fit regression equation
  regress mpg foreign weight headroom trunk length turn displacement
  
  * Allow commands to span multiple lines
  #delimit ;
  outreg using outreg.txt, replace pvalue noaster label
  title("Title 1", "Title 2")
  ctitle("Column title")
  addnote("note1", "note2", "note3")
  ;
  #delimit cr
  
  Note:
  The ASCII file with the table, outreg.txt, will be found in the
  same folder as outreg_demo.do
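
An alternative to #delimit ; for long commands is the /// continuation marker, which works in do-files. The sketch below applies it to the same regression on the built-in auto dataset; it does not use outreg, so it runs without installing anything:

```stata
sysuse auto, clear

* /// at the end of a line tells Stata the command continues on the next line
regress mpg foreign weight headroom trunk ///
    length turn displacement
```

With ///, there is no need to end each command with a semicolon or to switch back with #delimit cr.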

How to get higher resolution TIFF files with Stata/Macintosh
- Add the height( ) option to the graph export command
  as in the following example:
  graph export eq3\figr3.tif, replace height(2000)
  
  This increases the number of pixels about 100-fold.
  The maximum allowed height is 16000.

What if the do-file is too large for the do-file editor?
- Some do-files, especially those with names and variable labels from large public use datasets, will cause a Stata error message stating that the do-file is too large.
  1) Try increasing memory using the set memory command (this applies to Stata 11 and earlier; Stata 12 and later manage memory automatically and ignore the command).
  
  2) Create or edit the do-file outside Stata using Notepad (Start->Programs->Accessories->Notepad) or an equivalent text editor. Save the file after finishing your edits, but leave Notepad open. Then double-click the do-file: this starts Stata, changes the working directory to the folder containing the do-file and data, and attempts to run the do-file. If there are errors to be fixed or edits to be made, you can make them in Notepad, save, return to Stata and type (if the do-file is named bigdo.do):
  - do bigdo.do

What if the number of variables exceeds 2047?
- You may receive an error for one of the very large public use datasets because the number of variables exceeds 2047. You can either use Stata/SE (which we do not have) or use SAS to create a SAS dataset and then use StatTransfer to create a Stata dataset with a smaller number of variables.

Some Stata Tips for Working with Another Data Set

1. Are your data contained in an Excel spreadsheet? You could:

a. Open the Excel file, highlight (select) the observations and variables of interest, copy and paste into the upper left cell of the Stata data editor; the variable names and values will be copied.

Age	sex	interview_dt	rate_1	rate_2	followup_dt	id
32	M	1/1/2006	1	3	3/1/2006	1
15	F	2/13/2007	2	4	5/13/2007	2
12	M	4/15/2007	9	1	7/15/2007	3
19	M	9/15/2006	4	9	12/15/2006	4
8	F	3/17/2007	9	9	6/17/2007	5
6	F	1/4/2008	2	4	4/4/2008	6

You will notice that sex is a string variable, the dates are not in a format that would allow you to subtract them, and the rate variables use the value 9 to represent a missing value, whereas Stata requires a “.” for a missing value.

2. Need to change the format of certain variables?

a. Convert string variable to numeric variable – use the “encode” command or the “destring” command

b. Change date format to number of days so that it may be used in analysis.

c. Change missing values coded as “9” to “.” (Stata's missing value).

Create the following do-file to make these changes:

codebook sex
tab sex
encode sex, gen(sexn)
tab sexn
codebook sexn
gen interview=date(interview_dt,"MDY")
codebook interview
gen followup=date(followup_dt,"MDY")
gen time = followup-interview
stem time
list interview_dt followup_dt time
foreach var of varlist rate_1-rate_2 {
    replace `var'=. if `var'==9
}
tab rate_1, missing
tab rate_2, missing

In the results window, you will see:

. do "C:\practice.do"

. codebook sex

-----------------------------------------------------------------------------------------
sex (unlabeled)
-----------------------------------------------------------------------------------------

                  type:  string (str1)

         unique values:  2                        missing "":  0/6

            tabulation:  Freq.  Value
                             3  "F"
                             3  "M"

. tab sex

        sex |      Freq.     Percent        Cum.
------------+-----------------------------------
          F |          3       50.00       50.00
          M |          3       50.00      100.00
------------+-----------------------------------
      Total |          6      100.00

. encode sex, gen(sexn)

. tab sexn

       sexn |      Freq.     Percent        Cum.
------------+-----------------------------------
          F |          3       50.00       50.00
          M |          3       50.00      100.00
------------+-----------------------------------
      Total |          6      100.00

. codebook sexn

-----------------------------------------------------------------------------------------
sexn (unlabeled)
-----------------------------------------------------------------------------------------

                  type:  numeric (long)
                 label:  sexn

                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  0/6

            tabulation:  Freq.   Numeric  Label
                             3         1  F
                             3         2  M

. gen interview=date(interview_dt,"MDY")

. codebook interview

-----------------------------------------------------------------------------------------
interview (unlabeled)
-----------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [16802,17535]                units:  1
         unique values:  6                        missing .:  0/6

            tabulation:  Freq.  Value
                             1  16802
                             1  17059
                             1  17210
                             1  17242
                             1  17271
                             1  17535

. gen followup=date(followup_dt,"MDY")

. gen time = followup-interview

. stem time

Stem-and-leaf plot for time

  5* | 9
  6* |
  7* |
  8* | 9
  9* | 1112

. list interview_dt followup_dt time

     +-------------------------------+
     | intervi~t   followup~t   time |
     |-------------------------------|
  1. |  1/1/2006     3/1/2006     59 |
  2. | 2/13/2007    5/13/2007     89 |
  3. | 4/15/2007    7/15/2007     91 |
  4. | 9/15/2006   12/15/2006     91 |
  5. | 3/17/2007    6/17/2007     92 |
     |-------------------------------|
  6. |  1/4/2008     4/4/2008     91 |
     +-------------------------------+

. foreach var of varlist rate_1-rate_2{
  2. replace `var'=. if `var'==9
  3. }
(2 real changes made, 2 to missing)

. tab rate_1, missing

     rate_1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          1       16.67       16.67
          2 |          2       33.33       50.00
          4 |          1       16.67       66.67
          . |          2       33.33      100.00
------------+-----------------------------------
      Total |          6      100.00

. tab rate_2, missing

     rate_2 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          1       16.67       16.67
          3 |          1       16.67       33.33
          4 |          2       33.33       66.67
          . |          2       33.33      100.00
------------+-----------------------------------
      Total |          6      100.00

end of do-file
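
The do-file above assumes the pasted data are already in memory. For trying the same steps without an Excel file, the sketch below types the six example records in with input. Two choices here are additions rather than part of the original do-file: format ... %td, which makes the numeric dates display as dates, and mvdecode, which is a built-in alternative to the foreach loop for recoding 9 to missing.

```stata
* Enter the six example records directly (same values as the table above)
clear
input age str1 sex str10 interview_dt rate_1 rate_2 str10 followup_dt id
32 "M" "1/1/2006"  1 3 "3/1/2006"   1
15 "F" "2/13/2007" 2 4 "5/13/2007"  2
12 "M" "4/15/2007" 9 1 "7/15/2007"  3
19 "M" "9/15/2006" 4 9 "12/15/2006" 4
8  "F" "3/17/2007" 9 9 "6/17/2007"  5
6  "F" "1/4/2008"  2 4 "4/4/2008"   6
end

* string to labeled numeric
encode sex, gen(sexn)

* string dates to Stata daily dates, displayed in date format
gen interview = date(interview_dt, "MDY")
gen followup  = date(followup_dt, "MDY")
format interview followup %td

* days between interview and follow-up
gen time = followup - interview

* recode 9 to missing in both rate variables
mvdecode rate_1 rate_2, mv(9)

list id sexn interview followup time rate_1 rate_2
```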

3. Are your data contained in an Excel spreadsheet or in a different format such as a SAS data file or SPSS data file? You could:

a. Open the StatTransfer program in the computer lab rooms. StatTransfer converts an input file of a given type (e.g., Excel, SAS, SPSS) to a Stata 10 output file. Note: the second tab on the left of the StatTransfer window lets you select certain variables; the third tab on the left lets you select certain observations. By default, StatTransfer transfers all observations and all variables, and it converts dates into Stata date format for you.

. list

     +----------------------------------------------------------+
     | age   sex   intervi~t   rate_1   rate_2   followu~t   id |
     |----------------------------------------------------------|
  1. |  32     M   01 Jan 06        1        3   01 Mar 06    1 |
  2. |  15     F   13 Feb 07        2        4   13 May 07    2 |
  3. |  12     M   15 Apr 07        9        1   15 Jul 07    3 |
  4. |  19     M   15 Sep 06        4        9   15 Dec 06    4 |
  5. |   8     F   17 Mar 07        9        9   17 Jun 07    5 |
     |----------------------------------------------------------|
  6. |   6     F   04 Jan 08        2        4   04 Apr 08    6 |
     +----------------------------------------------------------+

. codebook

------------------------------------------------------------------------------
age (unlabeled)
------------------------------------------------------------------------------

                  type:  numeric (byte)

                 range:  [6,32]                       units:  1
         unique values:  6                        missing .:  0/6

            tabulation:  Freq.  Value
                             1  6
                             1  8
                             1  12
                             1  15
                             1  19
                             1  32

------------------------------------------------------------------------------
sex (unlabeled)
------------------------------------------------------------------------------

                  type:  string (str1)

         unique values:  2                        missing "":  0/6

            tabulation:  Freq.  Value
                             3  "F"
                             3  "M"

------------------------------------------------------------------------------
interview_dt (unlabeled)
------------------------------------------------------------------------------

                  type:  numeric daily date (long)

                 range:  [16802,17535]                units:  1
       or equivalently:  [01jan2006,04jan2008]        units:  days
         unique values:  6                        missing .:  0/6

            tabulation:  Freq.  Value
                             1  16802  01jan2006
                             1  17059  15sep2006
                             1  17210  13feb2007
                             1  17242  17mar2007
                             1  17271  15apr2007
                             1  17535  04jan2008

------------------------------------------------------------------------------
rate_1 (unlabeled)
------------------------------------------------------------------------------

                  type:  numeric (byte)

                 range:  [1,9]                        units:  1
         unique values:  4                        missing .:  0/6

            tabulation:  Freq.  Value
                             1  1
                             2  2
                             1  4
                             2  9

------------------------------------------------------------------------------
rate_2 (unlabeled)
------------------------------------------------------------------------------

                  type:  numeric (byte)

                 range:  [1,9]                        units:  1
         unique values:  4                        missing .:  0/6

            tabulation:  Freq.  Value
                             1  1
                             1  3
                             2  4
                             2  9

------------------------------------------------------------------------------
followup_dt (unlabeled)
------------------------------------------------------------------------------

                  type:  numeric daily date (long)

                 range:  [16861,17626]                units:  1
       or equivalently:  [01mar2006,04apr2008]        units:  days
         unique values:  6                        missing .:  0/6

            tabulation:  Freq.  Value
                             1  16861  01mar2006
                             1  17150  15dec2006
                             1  17299  13may2007
                             1  17334  17jun2007
                             1  17362  15jul2007
                             1  17626  04apr2008

4. Have a large data set? In Stata 11 and earlier, type “set mem 35m” in the command line before you open it. Stata 12 and later manage memory automatically, so this step is not needed there.

5. Need to merge two data sets? (Two data sets with different variables on the same individuals.) Both data sets must have the same unique id for individuals, and both data sets must be sorted by id.

. use "C:\practice1.dta", clear

. sort id

. merge id using "C:\practice2.dta"

. tab _merge

     _merge |      Freq.     Percent        Cum.
------------+-----------------------------------
          3 |          6      100.00      100.00
------------+-----------------------------------
      Total |          6      100.00

Stata creates a variable named _merge such that 1 indicates the observation is only in file 1, 2 indicates it is only in file 2, and 3 indicates it is in both files.

     +----------------------------------------------------------------------------------------+
     | id   age   sexn   interv~w   followup   rate_1   rate_2   outcome1   outcome2   _merge |
     |----------------------------------------------------------------------------------------|
  1. |  1    32      M      16802      16861        1        3          Y          N        3 |
  2. |  2    15      F      17210      17299        2        4          N          N        3 |
  3. |  3    12      M      17271      17362        .        1          Y          Y        3 |
  4. |  4    19      M      17059      17150        4        .          N          Y        3 |
  5. |  5     8      F      17242      17334        .        .          Y          N        3 |
     |----------------------------------------------------------------------------------------|
  6. |  6     6      F      17535      17626        2        4          N          Y        3 |
     +----------------------------------------------------------------------------------------+

6. Need to append two data sets? (Two data sets with same variables on different individuals.)

. use "C:\practice1.dta", clear

. sort id

. append using "C:\practice3.dta"
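
To see what append does without the practice files, here is a small sketch with two made-up datasets. The tempfile name extra and the variable name source are hypothetical; the generate() option, available in Stata 11 and later, is an optional extra that marks which dataset each observation came from:

```stata
* Second toy dataset: different individuals, same variables
clear
input id age
7 40
8 25
end
tempfile extra
save `extra'

* First toy dataset
clear
input id age
1 32
2 15
end

* stack the second dataset under the first; source is 0 for the data
* in memory and 1 for the appended file
append using `extra', generate(source)
list
```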

7. Do you have multiple records for the same individual (same id)? The Stata reshape command allows one to go from data in a “long” format with multiple records per person to a “wide” format with a single record per person, and back.

(long form)

 i    j         x_ij
id  year  sex    inc
-----------------------
 1   80     0   5000
 1   81     0   5500
 1   82     0   6000
 2   80     1   2000
 2   81     1   2200
 2   82     1   3300
 3   80     0   3000
 3   81     0   2000
 3   82     0   1000

(wide form)

 i   ....... x_ij ........
id  sex  inc80  inc81  inc82
-------------------------------
 1    0   5000   5500   6000
 2    1   2000   2200   3300
 3    0   3000   2000   1000

Here is the example from the Stata help for the reshape command.

Given these data, you could use reshape to convert from one form to the other:

. reshape wide inc, i(id) j(year) (goes from long to wide)

. reshape long inc, i(id) j(year) (goes from wide to long)
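
The example can be run end to end by typing the long-form data in with input and then converting in both directions. This is a sketch for experimenting; the values are the ones shown in the long-form table above:

```stata
* Long form: one record per id and year
clear
input id year sex inc
1 80 0 5000
1 81 0 5500
1 82 0 6000
2 80 1 2000
2 81 1 2200
2 82 1 3300
3 80 0 3000
3 81 0 2000
3 82 0 1000
end

* long to wide: one record per id, with inc80 inc81 inc82
reshape wide inc, i(id) j(year)
list

* wide back to long: one record per id and year again
reshape long inc, i(id) j(year)
list
```

Note that sex is constant within each id, which is what allows it to be carried along as a single column in the wide form.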

8. Don’t forget to use the Stata help menu. It may look daunting, but if you scroll down, there are often examples at the end of the help file for a given command.

9. Don’t forget to look back at your Biostat 621-623 lecture notes, problem sets, and Stata notes for tips.

10. Biostat 624 requires a data analysis project of your choice so this course will be helpful to you if you are working with another data set.

11. Are we missing a question that you may have? Please let us know.