Class 1 - Introduction; Overview of Stata -- LECTURE NOTES

Contents

1. Topics

2. Syllabus

3. Course objectives

4. Course organization
    4.1 Web site
    4.2 Userid and password
    4.3 Grading
    4.4 Data analysis project

5. Stata statistical package
    5.1 Introduction
    5.2 Flavors of Stata
    5.3 Requesting more memory for Stata
    5.4 On-line help
    5.5 Resources for learning about Stata
    5.6 Stata software pricing
    5.7 Customizing Stata
    5.8 Keeping Stata up-to-date
    5.9 Datasets
    5.10 Stata commands
    5.11 How to re-issue commands
    5.12 Program files - do files
    5.13 A special do-file – profile.do
    5.14 How to start Stata and set the working directory
    5.15 Keeping a log of your work
    5.16 Getting data into Stata
    5.17 Stata tutorial on data input
    5.18 Saving a Stata dataset
    5.19 Loading a Stata dataset

6. Stata programs – “do-files”
    6.1 What are and why use do-files
    6.2 “Hello Mom” program
    6.3 Start Stata do-file editor
    6.4 Edit and re-run “do” Program
    6.5 Another program

7. Using Stata to create “do” files

8. Stat/Transfer for importing/exporting data

9. Example 1: exploratory analysis of data from Altman’s Exercise 3-1
    9.1 Listing of data file
    9.2 Analysis Plan
    9.3 Box-Cox transform
    9.4 Techniques Illustrated
    9.5 Log Showing Commands and Output

10. Example 2: input and display of data from Altman’s exercise 3-2
    10.1 Source data from Altman
    10.2 Raw data - text file on disk
    10.3 Analysis plan
    10.4 Stata log

11. Common data analysis applications
    11.1 Descriptive statistics
    11.2 Stem-and-leaf charts
    11.3 Boxplots
    11.4 Confidence interval for a mean
    11.5 Confidence interval for a proportion
    11.6 Student’s t-test
    11.7 Test for binomial proportions
    11.8 Correlation
    11.9 Simple linear regression
    11.10 Analysis of variance
    11.11 Multiple linear regression
    11.12 Multiple logistic regression
    11.13 Epidemiologic calculations - epitab
    11.14 Sample size and power calculations

Biostatistics 624 © 2011 by JHU Biostatistics Dept. Sun, 27 Mar 2011 (6:47p) CLASS 1 - 1


1. Topics

! Outline course

! Overview of Stata

! Handouts

— Website and Schedule

— Lecture Notes #1

— e-Quiz #1 (due Fri, 8 Apr 2011)

2. Syllabus

! Multiple regression models:

— Linear

— Logistic

— Conditional logistic (case-control studies)

— Log-linear (Poisson) for counts & rates

— Log-linear for contingency tables

— Cox proportional hazards

! Longitudinal data analysis (repeated measures), analysis of clustered data

! Random effects/mixed effects/multilevel models

! Model checking: analysis of residuals, measures of leverage and influence

! Special topics: methods for missing data; reliability, inter-rater agreement, diagnostic tests, reference intervals, sample size, regression for survey samples

3. Course objectives

! Students who master the course contents will be able to:

— Frame a scientific question about the dependence of a continuous, binary, count, or time-to-event response on explanatory variables in terms of a linear, logistic, log-linear, or survival regression model whose parameters represent quantities of scientific interest

— Design a tabular or graphical display of a dataset that makes apparent the association between explanatory variables and the response

— Choose a specific linear, logistic, log-linear, or survival regression model appropriate to address a scientific question and correctly interpret the meaning of its parameters

— Appreciate that the interpretation of a particular multiple regression coefficient depends on which other explanatory variables are in the model

— Estimate the unknown coefficients and their standard errors using maximum (or partial) likelihood and perform tests of relevant null hypotheses about the association with the response of particular subsets of explanatory variables

— Check whether a model fits the data well; identify ways to improve a model when necessary

— Use several models for the analysis of a dataset to effectively answer the main scientific questions

— Understand how longitudinal data differ from cross-sectional data and why special regression methods are sometimes needed for their analysis

— Summarize in a table the results of linear, logistic, log-linear, and survival regressions and write a description of the statistical methods, results, and main findings for a scientific report

— Perform data management, including input, editing, and merging of datasets, necessary to analyze data in Stata or equivalent statistical software

— Complete a data analysis project, including data analysis and a written summary in the form of a scientific paper

4. Course organization

! The course contents, schedule, and procedures are summarized in course website pages:

— “Home” page: organizational details

— “Schedule” page: classes, e-quizzes, exam, project

4.1 Web site

! Web site URL:

http://biostat.jhsph.edu/courses/bio624/

4.2 Userid and password

! Some parts of the course site require a Userid and Password, which are

Userid: bio624

Password: theedge


4.3 Grading

20% e-quizzes (5 of these)

50% Data analysis project

    1% Preliminary abstract (must be on time)

    49% Completed project

30% Examination (in-class; required for grade of A, otherwise optional)

4.4 Data analysis project

! Conduct an analysis to address a scientific topic using appropriate statistical methods

— Students must identify topics and datasets independently – ie, topics and datasets will not be assigned or provided

— The analysis should involve regression modeling with at least two explanatory variables

— The dataset and analysis should address a public health topic, with “public health” interpreted broadly

— Typically, datasets will have between 100 and 100,000 observations; however, larger or smaller datasets may also be appropriate - ask if in doubt

— Datasets with fewer than 50 observations are discouraged, but not prohibited

— IMPORTANT: Conduct the final analysis and write the final report INDEPENDENTLY

— However, CONSULTING/COLLABORATING with instructors, TAs, students or others about the data or analysis IS ENCOURAGED

— It is also OK to share datasets, as long as the final analysis (do-file), tables, and report are done INDEPENDENTLY

! Prepare a report summarizing your findings in the form of a mini scientific paper in the following format:

0. Title

1. Abstract (structured)

2. Introduction

3. Methods (including sample size considerations)

4. Results (including at least one figure and one table)

5. Discussion

6. Appropriate other tables, figures, etc

7. Appendices (as applicable)

    a. Variable list

    b. Model checks (residuals, influential points)

    c. Sensitivity analyses (with/without influential points, etc)

    d. Step-wise variable selection

    e. Non-linearity checks

    f. Collinearity assessment

    g. Interaction assessment

    h. Confounding -- compare adjusted and unadjusted models

    i. Likelihood testing or F-tests for nested models

    j. Stata do-file(s) - REQUIRED

    k. Stata logs and graphs with enough results to confirm statements in the paper


! Possible sources for datasets:

— An important part of the project is to identify and gain access to an appropriate dataset

— The best dataset is one that you are familiar with from past work that you can use to address questions that have not been addressed before

— Next best is a dataset from an advisor or colleague, ideally one whose subject matter is of interest to you

— It is OK to use datasets from other classes or the MPH capstone project if they include enough material to support a regression analysis; if in doubt, ask an instructor from this class

— Online datasets. There are numerous datasets online that could be used for a project. Some links to possible sources for datasets are posted on the course website (“Other links” on the home page):

http://www.biostat.jhsph.edu/courses/bio624/misc/datasets.htm

— Government and institutional websites (a few are listed below) contain an enormous amount of data, but will require some exploration to find downloadable raw data suitable for analysis:

    www.fedstats.gov    FEDSTATS (federal statistics locator)

    www.cdc.gov         Centers for Disease Control, including the National Center for Health Statistics

    NCHS public use data files and documentation:

    www.cdc.gov/nchs/datawh/ftpserv/ftpdata/ftpdata.htm

    www.census.gov      US Census Bureau

    www.who.ch          World Health Organization

    Emory Biostatistics Dept - excellent list of online databases:

    http://www.sph.emory.edu/bios/bioslist.html#database

— Statistical data warehouse with library of data and data stories (ie, documentation): www.stat.cmu.edu

— click DASL under Related Links

If you decide to use one of these datasets, you must consult the source paper(s) for the dataset and attach them with the supporting materials for your project report

— Some textbooks have collections of datasets that may be suitable for further analysis

Again, if you decide to use one of these datasets, make sure to consult the source paper(s) for the dataset and attach them with the supporting materials for your project report

LC Hamilton, Statistics with Stata

www.stata.com/bookstore/swsdl.html

Duxbury publishing website - site contains datasets from health statistics textbooks. Click “Data Library”:

http://www.thomsonedu.com/statistics/discipline_content/dataLibrary.html

Hosmer and Lemeshow: Applied Survival Analysis:

ftp://ftp.wiley.com/public/sci_tech_med/survival/

Hosmer and Lemeshow: Applied Logistic Regression Analysis: Datasets are contained in the University of Massachusetts Datasets Archive, which contains links to other data resources (make sure to type the URL exactly as given below and then scroll down to the list of datasets by type of analysis - DO NOT USE the low birthweight dataset)

http://www-unix.oit.umass.edu/~statdata/statdata/

Moore and McCabe: Introduction to the Practice of Statistics (IPS), arguably the best introductory statistics text available. The applets help master statistical concepts. The datasets will require finding the source papers

http://www.whfreeman.com/ips/

http://www.whfreeman.com/ips/


5. Stata statistical package

5.1 Introduction

! Stata, according to its authors, is used for:

— Managing data

— Analyzing data

— Graphing data

! Stata offers a common interface across different computers and operating systems: DOS, Windows, Macintosh, Unix, and others. Files created on one system may be used on another without any conversion

! The Stata interface is command-driven: “type a little, get a little.”

! But commands can be a pain at times, so Stata also offers a menu-based interface

! Stata is very fast, due mainly to storage of datasets in memory during processing (as opposed to disk processing). Graphics are not so fast!

! Stata is capable of processing a large variety of datasets with the sole restriction that the dataset must fit into available computer memory. This restriction rules out really large datasets such as Medicare or other health information systems.

! Data integrity: Stata works on a copy of your dataset in memory, making it safe for interactive use. You can still destroy your data by explicitly saving over it.

Tip: always make copies of your key datasets before data handling activities that involve saving results. Note that analysis activities are “safe” with very little risk of harm to your data. Data management activities are “risky.”

! Stata is case-sensitive: The name “Myfile” is different from “myfile”. When in doubt, use lower case

! Stata is programmable: many parts of Stata are written in the Stata programming language. This language can be used to generate, in principle, any statistical analysis whether or not it is explicitly part of Stata (see “do” and “ado” files in the Manual)

! Stata has a very large and active on-line users group. Members meet via the Internet using a “listserv” e-mail system. Stata is continually updated and many updates come from users. You may submit questions to the “listserv” -- your questions go to all members of the “listserv” – currently 25 questions per day are submitted

! The Stata website (www.stata.com) has a good Support section, especially the FAQs

! Stata’s e-mail based user support is very responsive and helpful. Remember to provide your serial number in the e-mail along with your question
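The data-integrity tip above can be sketched as a short do-file. This is only an illustration under stated assumptions: the file names mycars.dta and mycars_backup.dta are hypothetical, and the example uses the auto dataset that ships with Stata rather than any course data.

```stata
* Sketch only: mycars.dta and mycars_backup.dta are hypothetical file names
sysuse auto, clear                           // load an example dataset shipped with Stata
save mycars, replace                         // write mycars.dta to the working directory
copy mycars.dta mycars_backup.dta, replace   // keep an untouched backup copy on disk

replace price = price/1000                   // "risky" step: changes only the copy in memory
save mycars, replace                         // overwrites mycars.dta; the backup is unchanged
```

Until the final save, the file on disk is untouched; the backup protects against the explicit overwrite in the last line.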

5.2 Flavors of Stata

! Stata 11 was released in 2009

— Major revisions occur about every 3 years

— Menus for nearly all commands

— Vastly improved graphics

— Enhancements to statistics, especially survival analysis

— We will use Stata 11 in this course

— We will try to accommodate Macintosh users, but some programs may not work with Macs

— Macintosh users: see notes under “Other Links” on the home page:

http://www.biostat.jhsph.edu/~courses/bio624

! Stata comes in four forms:

— Stata IC (Intercooled - we use this)

— Small Stata - not for this course

— Stata/SE (Special Edition “super-size”)

— Stata/MP (Multiple processors)

! Stata/SE

— Can analyze datasets with as many as 32,767 variables, and the only limit on observations is the amount of RAM on your computer

— Maximum length of a string variable is 244 characters

— Matrices may be up to 11,000 x 11,000

! Intercooled (IC) Stata

— Can analyze datasets with as many as 2,047 variables, and the only limit on observations is the amount of RAM on your computer

— Maximum length of a string variable is 80 characters

— Matrices may be up to 800 x 800

— Computer should have at least 32 megabytes of RAM

5.3 Requesting more memory for Stata

! By default, Intercooled Stata starts with 1 megabyte of memory for datasets and work space. This can be increased in one of 2 ways:

— Change memory:

To change from 1 megabyte to 800 megabytes, give the following command:

set memory 800m

To make the change permanent every time you start Stata,

set memory 800m, permanently
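As a rough sketch of how the memory request might look inside a do-file, assuming Stata 11 as used in this course (releases after Stata 11 manage memory automatically, so set memory is no longer needed there):

```stata
* Sketch for Stata 11; assumes the computer has at least 800 megabytes free
clear                 // set memory requires that no dataset be in memory
set memory 800m       // request 800 megabytes for this session
query memory          // display the current memory settings
```

The clear step matters: Stata refuses to change the memory allocation while a dataset is loaded.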


5.4 On-line help

! Stata has lots of on-line help available -- all sections of the written documentation are on-line in “abbreviated” form (sometimes too abbreviated, especially for statistical techniques)

! A good way to access on-line help is via the Help pull-down menu - portal to all Stata Help including the complete set of manuals in well-indexed PDF format.

! If you know the name of the command, you can access online help via the help command. For example, to get help for the summarize command:

help summarize

Note, upper right: dialog: summarize

– Nearly every Stata command has a dialog screen to construct the command

Note: [R] summarize -- Summary statistics

– Nearly every Stata command has an [R] link to the PDF Documentation entry

! If you want to look up a topic, use the “findit” command, which searches help files as well as internet resources at Stata. The results are hyperlinked for easy access. For example, to get information on “logistic regression”:

findit logistic regression

5.5 Resources for learning about Stata

! The primary documentation now spans 5,000+ pages. The main components are the Reference Manual, the User’s Guide, and the Graphics Manual. While somewhat intimidating and irritating, these are now included in a PDF - a necessity for “serious” users of Stata

! Introductory materials (may be purchased using the Stata website):

— Statistics with Stata by LC Hamilton: the best book on Stata

! The Stata Journal is a refereed journal and is published quarterly with articles about statistics, data analysis, teaching methods, and effective use of Stata’s language

! Net courses on Stata. These range in length from a few to 12 weeks. They are done via e-mail. There is a charge for the courses.

5.6 Stata software pricing

! Prices vary for academic institutions, businesses, and students. Prices also depend on whether the system will be used on a network and how many users there will be

! Manuals are purchased separately - some are available in the JHMI bookstore

! Subscriptions to the Stata Journal, which comes in both hard copy and PDF format, also cost extra

! Unlike some other statistical packages such as SAS, Stata has no annual renewal fee, and it offers regular free updates containing fixes and extensions

! The Stata web site, www.stata.com, has the latest prices and information on how to purchase items

! BSPH has a GRADPLAN for students to purchase the latest version of Stata. Online ordering is at www.stata.com/gpdirect

5.7 Customizing Stata

! Changing the size and fonts for Stata windows -- to improve readability

— From the Edit menu, select:

Preferences / Manage Preferences / Load Preferences / Maximized Window Settings

... Make font changes, etc. to taste

Preferences / Manage Preferences / Load Preferences / New Preferences Set / YOUR INITIALS

— Demonstrate changing the font and font size by using the control button at the upper left of each window; the Results window is the most important one to change

1. Click the control button and select Font

2. Select a fixed-space font -- one of the larger Stata fonts or fixedsys is a good choice

3. Make sure the font size is at least 9

4. IMPORTANT – save the windowing preferences or the changes disappear


5.8 Keeping Stata up-to-date

! MAKE SURE your Stata is up-to-date:

— Updates are free

— Fixes and extends Stata

! The current version of Stata is updated frequently, about every two weeks. Updates are free. To see what version of Stata you are using, type the following commands:

about

query born

! To see if you need an update (you must be connected tothe internet), either use the Help menu or type thecommand:

update query

! This will advise you to do one of the following:

1. Do nothing, all files up to date

2. Update both the executable and ado files

Click: update all

3. Update only the executable

Click: update executable

4. Update only the ado files

Click: update ado

! The new ado files are installed and ready to use as soon as the download is completed

! One extra step is required to install a new executable:

Click: update swap

! After installing an update, you can find out what has beenadded or changed by typing:

help whatsnew

5.9 Datasets

! In Stata, “Data” are a rectangular table of numbers and character strings

— Each row is an “observation” on all the variables

— Each column contains all the observations for a given variable

— Variables (columns) are represented by short names (up to 32 characters in Stata 11; early versions allowed only 8)

— Observations (rows) are numbered from 1 to _N

— Schematic on how data are stored in Stata

                             Columns = variable names

    Rows = observations      var1   var2   var3   ...   varn

             1
             2
             3
             ...
             N

    cell i,j = data value for variable j on observation i

— Stata gives the following simple example of “Data”

           Var1   Var2   Name

      1.      1      2   Bill
      2.      3      4   Mary
      3.      5      6   Pat
      4.      7      8   Roger
      5.      9     10   Sean

! In Stata, a “Dataset” is “Data” plus labels, formats, notes, and characteristics
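The distinction between “Data” and a “Dataset” can be explored with a few commands. The sketch below assumes the auto example dataset that ships with Stata; the note text is made up for illustration.

```stata
sysuse auto, clear              // example dataset shipped with Stata
describe                        // variable names, storage types, formats, and labels
display _N                      // number of observations (rows) in memory
list make price mpg in 1/5      // first five observations of three variables

* Attach a note to the dataset, then list all stored notes
notes: Example note attached to the dataset in memory
notes
```

describe and notes report the extra material (labels, formats, notes) that turns a table of values into a “Dataset”, while list shows the table itself.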

5.10 Stata commands

! There are 200+ commands in Stata, many of which are commands to obtain specific statistical analyses

! An early User’s Guide lists 37 commands that “everyone should know” by function:

— Getting on-line help
    lookup, help (and pull-down Help menu)

— Operating system interface
    pwd, cd

— Using and saving data from disk
    use, save
    append, merge
    compress

— Inputting data into Stata
    input
    edit
    infile
    infix
    insheet

— Basic data reporting
    describe
    codebook
    list
    browse
    count
    inspect
    table
    tabulate


— Data manipulation
    generate, replace
    recode
    egen
    rename
    drop, keep
    sort
    encode, decode
    order
    by
    reshape

— Keeping track of your work
    log
    notes

— Convenience
    display

! Newer commands worth noting

— Handling subsets: define/analyze summary statistics
    collapse
    contract
    statsby

— Tabulation - more compact results than tabulate or summarize
    table
    tabstat
    tab_chi (use findit tab_chi for install/help)
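Two of the newer commands listed above, tabstat and collapse, might be tried as follows. This is a minimal sketch using the auto example dataset shipped with Stata, not a course dataset.

```stata
sysuse auto, clear

* Compact table of summary statistics by group
tabstat price mpg, by(foreign) statistics(n mean sd)

* Replace the data in memory with one row of means per group,
* then bring back the original data
preserve
collapse (mean) price mpg, by(foreign)
list
restore
```

Because collapse replaces the data in memory, wrapping it in preserve and restore keeps the original observations available for later commands.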

5.11 How to re-issue commands

! Stata stores a long list of the commands you issue in the Review window

! These commands can be accessed and re-issued – VERY useful for correcting errors without re-typing the whole command

To retrieve commands, use either:

Page Up/Page Down

or

Click the command in the Review window

5.12 Program files - do files

! “Do-files” contain a collection of Stata statements that perform a variety of tasks – called a Stata program

! Do-files will be used extensively in this course and by experienced Stata practitioners

! Do-files allow you to document your work by making it possible to exactly reproduce key analyses – a step towards “reproducible research”
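As a rough illustration of what a do-file looks like, the sketch below could be saved in a plain text file; the name example.do is hypothetical, and the analysis uses the auto dataset shipped with Stata.

```stata
* example.do -- hypothetical file name; a minimal reproducible analysis
version 11                  // run under Stata 11 rules so results stay reproducible
sysuse auto, clear          // load the example dataset
summarize price mpg weight  // descriptive statistics
regress price mpg weight    // linear regression of price on mpg and weight
```

Typing do example in the Command window, with the file in the working directory, would re-run every step in the same order each time.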

5.13 A special do-file – profile.do

! When Stata begins, it looks for a file named profile.do, containing commands that are to be executed as Stata starts

! In particular, Stata looks for the profile.do file in c:\data, among other places, so you can execute a set of commands every time you start Stata by placing them in a text file named profile.do, which you store in c:\data

! The profile.do file recommended for this course is as follows and can be downloaded from various places on the e-Quizzes page on the course website:

    * profile.do for starting Stata
    * Place in C:\DATA or any working folder containing your files

    set memory 750m
    set linesize 75
    set more off

5.14 How to start Stata and set the workingdirectory

! The “working directory” in Stata is the folder where Stata looks for data and program files. By default, the working directory is

c:\data

! When you start Stata from the Stata icon, the working directory is set to the default:

c:\data

! You can change the working directory to the folder containing your files:

File / Change Working Directory

... Browse to folder

! Or, you can change the working directory by starting Stata by double-clicking a dataset or program (do-file) in the folder containing the files related to your chosen project – most prefer this method!
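The working directory can also be inspected and changed from the command line. In this sketch the folder c:\data\myproject is a hypothetical example; substitute a folder that exists on your computer.

```stata
pwd                         // show the current working directory
cd "c:\data\myproject"      // hypothetical project folder; adjust to your own
dir *.dta                   // list the Stata datasets in the working directory
```

Placing the cd command near the top of a do-file is one way to make sure the program finds its data no matter how Stata was started.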


5.15 Keeping a log of your work

! For documentation of your work, you should keep log files, which are transcripts of what appears in a Stata session – the log command or the Log button on the toolbar are used to manage logs

! These logs can be kept in either of two formats

text (recommended – very easy to import into word processors)

or

smcl (a formatted log that preserves hyperlinks, fonts and colors)

! You can translate from one format to the other:

translate mylog.smcl mylog.log

! You would usually store the log(s) in the same folder with your data files related to your work
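A typical logging pattern inside a do-file might look like the sketch below. The file name mysession.log is hypothetical, and the commands between opening and closing the log are placeholders for real work.

```stata
capture log close                        // close any log that is already open
log using mysession.log, text replace    // hypothetical file name; text format
sysuse auto, clear
summarize price mpg
log close                                // stop logging and close the file
```

The replace option overwrites an earlier log of the same name, which suits do-files that are re-run many times; use append instead to keep adding to one file.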

5.16 Getting data into Stata

! The easiest way to enter a small amount of data into Stata is with the edit command. This is an interactive, spreadsheet-like process that is very intuitive -- demonstrate

! If the data are stored in a file on disk and have spaces between each variable, use infile as we have done in the example below

! Files with more complicated formats, such as variable items with no spaces between them or character strings with embedded blanks, require more complicated input via infile or infix with a data dictionary; details are in the Reference Manual, the User’s Guide, and on-line Help. By the way, Stata advises against the use of the data dictionary approach since there are other, easier ways to do it
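To try free-format infile without a data file at hand, one option is to write a small space-delimited file first and then read it back. In this sketch cars.raw is a hypothetical file name and the values come from the auto dataset shipped with Stata.

```stata
* Create a small space-delimited text file (cars.raw is a hypothetical name)
sysuse auto, clear
keep in 1/5
outfile mpg weight price using cars.raw, replace

* Read the file back: variables are separated by spaces, one observation per line
clear
infile mpg weight price using cars.raw
list
```

The variable names given to infile are assigned in order to the values on each line, so they must match the layout of the text file.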

5.17 Stata tutorial on data input

! In addition to the resources mentioned above, there is an old tutorial on data input that still applies to Stata:

In this tutorial we show you how to enter your data into Stata.

You can enter your data                  by using
--------------------------               --------------------------------------
directly from the keyboard               edit  (Stata for Windows or Macintosh)
                                         input (all versions of Stata)

indirectly from a file                   insheet
                                         infile
                                         infix
                                         a transfer program
--------------------------               --------------------------------------
Then you save your data                  by using save

-------------------------------------------------------------------------------
edit is the easiest way to enter a small amount of data.  You type

    . clear         (to drop any data in memory)
    . edit          (to enter the spreadsheet editor)

Only Stata for Windows and Stata for Macintosh users can use edit.  We are
not going to demonstrate it here.  See the Getting Started manual or just
try it.  input is available on all versions of Stata:
-------------------------------------------------------------------------------

. clear

. input id mpg weight price

            id        mpg     weight      price
  1. 1 22 2930 4099
  2. 2 17 3350 4749
  3. 3 22 2640 3799
  4. 4 20 3250 4816
  5. 5 15 4080 7827
  6. end

-------------------------------------------------------------------------------
input continues to accept observations until you type 'end'.  Once you have
some data in memory, typing input by itself adds new observations:
-------------------------------------------------------------------------------

. input

            id        mpg     weight      price
  6. 6 26 2230 4453
  7. end

-------------------------------------------------------------------------------
Another way to enter this data would be to type it into a word processor or an
editor, save it in a file, and then read the file.  We have such a file:
-------------------------------------------------------------------------------

. type "h:\stata\auto1.raw"
make, mpg,weight, price
AMC Concord, 22, 2930, 4099
AMC Pacer, 17, 3350, 4749
AMC Spirit, 22, 2640, 3799
Buick Century, 20, 3250, 4816
Buick Electra, 15,4080, 7827

-------------------------------------------------------------------------------
Our file has the variable names at the top (that is not required) and we used
commas to separate values one from the other.  To read this, we can type:
-------------------------------------------------------------------------------

. clear

. insheet using "h:\stata\auto1.raw"
(4 vars, 5 obs)

. list

              make        mpg     weight      price
  1.   AMC Concord         22       2930       4099
  2.     AMC Pacer         17       3350       4749
  3.    AMC Spirit         22       2640       3799
  4. Buick Century         20       3250       4816
  5. Buick Electra         15       4080       7827

-------------------------------------------------------------------------------
It's easy.  insheet will read comma- or tab-delimited files, so it will read
text files created by spreadsheet and database programs.
-------------------------------------------------------------------------------

-------------------------------------------------------------------------------
If your values are separated by blanks rather than commas or tabs, you use
infile to read it.  Here is such a file:
-------------------------------------------------------------------------------

. type "h:\stata\autodata.raw"
"AMC Concord" 22 2930 4099
"AMC Pacer" 17 3350 4749
"AMC Spirit" 22 2640 3799
"Buick Century" 20 3250 4816
"Buick Electra" 15 4080 7827

. clear

. infile str14 make mpg weight price using "h:\stata\autodata"
(5 observations read)

. list in 1/2

              make        mpg     weight      price
  1.   AMC Concord         22       2930       4099
  2.     AMC Pacer         17       3350       4749

-------------------------------------------------------------------------------
Finally, if you have a formatted file, you use infile or infix to read it:
-------------------------------------------------------------------------------

. type "h:\stata\auto3.raw"
AMC Concord
2229304099
AMC Pacer
1733504749
AMC Spirit
2226403799
Buick Century
2032504816
Buick Electra
1540807827

. clear

. infix 1: str make 1-18 2: mpg 1-2 weight 3-6 price 7-11
> using "h:\stata\auto3.raw"
(5 observations read)

. list

              make        mpg     weight      price
  1.   AMC Concord         22       2930       4099
  2.     AMC Pacer         17       3350       4749
  3.    AMC Spirit         22       2640       3799
  4. Buick Century         20       3250       4816
  5. Buick Electra         15       4080       7827

Saving data
-----------

After you have entered data into Stata, you can save it.  The command is:

    save filename

If you do not specify the extension for the filename, Stata assumes the
extension '.dta'.  For instance, we could type 'save auto' to save this data.
It would be saved in the file auto.dta.  The command to retrieve previously
saved data is:

    use filename [, clear]

Thus, the next time we want to use auto.dta, we could type 'use auto' or 'use
auto, clear'.  Sometimes 'use auto' will work, but 'use auto, clear' will
always work.  Stata stores data in memory.  The clear option tells Stata that
it's okay to drop the data in memory in order to retrieve the new data.

5.18 Saving a Stata dataset

! To save the dataset in the current work space to disk, give the command below along with the appropriate path to the folder containing the file

! Command:

save blah.dta, replace

5.19 Loading a Stata dataset

! To load a saved dataset from disk into the work area

! Command:

use blah.dta, clear
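As a small illustration of the save and use commands together, the sketch below saves a dataset and reads it back. It assumes Stata's bundled auto dataset (loaded with sysuse) as stand-in data, and it writes blah.dta to whatever the current working directory happens to be.

```stata
* Show the current working directory; blah.dta will be written there
pwd

* Stand-in data: the auto dataset that ships with Stata
sysuse auto, clear

* Save to disk, overwriting any earlier copy
save blah.dta, replace

* Empty the work space, then load the saved copy back in
clear
use blah.dta, clear
describe
```

Without the replace option, save stops with an error if blah.dta already exists, which protects earlier work from being overwritten by accident.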


6. Stata programs – “do-files”

6.1 What are and why use do-files

! “Do-files” contain a collection of Stata statements that perform a variety of tasks; such a collection is called a Stata program

! Always use do-files to make your work “reproducible” and well-documented

! Note: you can enter commands interactively and then save the commands into a do-file by right-clicking anywhere in the Review window: select Save Review Contents... and navigate to the folder where you want to save the file

! For example, we include a do-file for each e-Quiz except the first, containing all the commands to carry out the analyses: eq2.do, eq3.do, etc.

Demonstrate how to "run" eq1.do

! “Do-files” document your work

! “Do-files” permit reproducible analyses

! “Do-files” make re-running a series of commands very easy: one step

! “Do-files” for particular tasks can be copied and modified to perform similar tasks, so “do-files” serve as templates for future work

! See Stata User’s Guide, for full documentation on what “do-files” can accomplish

6.2 “Hello Mom” program

! This program simply displays the message “Hello Mom” -- an easy way to try the do-file approach

! The name of the program file will be mom.do

! Store the program in a folder: My Documents\bio624

6.3 Start Stata do-file editor

! To create a program file:

Click: Start

Click: Stata icon

Click: Do-editor icon (envelope)

Note: You can also use NOTEPAD, WORDPAD or even WORD -- anything that allows files to be read and written in “text” format

! Type the following Stata command into the file:

display "Hello Mom"

... Make sure you press [Enter] after typing the line

! Save the file:

Click File / Save As

Type: MyDocuments\bio624\mom.do

! Run the “do” file:

do mom.do (as a Stata command)

or,

Click: Do current file icon (in do-file editor)

6.4 Edit and re-run “do” Program

! Return to Do-file editor:

Click mom.do on the Task Bar

! Make the fixes (change to “Hello Mother Dear”) and then (IMPORTANT) save the file

Click File / Save

! Re-run the program:

Click Intercooled... on the Task Bar

do mom.do

or (as above),

Click: Do current file icon (in do-file editor)

! Repeat the “Edit - Run” cycle until done or tired

6.5 Another program

! This program is a little more complicated. Try it for fun and for practice in making do-files

! Open Stata by clicking profile.do in MyDocuments\bio624

! Input faculty IQ data and summarize it

! The name of the program will be blah.do

! The program is in folder: MyDocuments\bio624


! To create a program file:

Click: File/New or start the Stata do-file editor as shown above

! Type the following Stata commands into the do-file editor to enter the data and generate the summary statistics:

* Turn off annoying -more- message
set more off

* Open log file on disk
* Trick for automatically opening a log file in a do-file
capture log close
log using blah.log, replace

input sno IQ
1 138
2 142
3 136
4 124
5 158
6 108
7 116
8 128
9 125
10 88
end

list

summarize IQ , detail

histogram IQ , bin(10) fraction norm

graph export blah.wmf, replace

log close

! Save the file:

Click File / Save As

Type: MyDocuments\bio624\blah.do

! Change the working directory to the folder containing the “do” program file, if needed -- the current working directory is shown on the lower left in the Status Bar:

cd "MyDocuments\bio624"

(Always change to the working directory, which will contain related datasets, graphs, etc.)

! Run the “do” file:

do blah.do

or,

Click: Do current file

! Edit + re-run “do” Program


! Return to do-file editor:

Click blah.do on the Task Bar

! Make the fixes and then save

Click File / Save

! Re-run the program:

Click Intercooled... on the Task Bar

do blah.do

or,

Click Do current file

7. Using Stata to create “do” files

! A good way to make do-files is to enter the commands interactively and then copy them to a do-file for further work:

Drag mouse to select commands (or select all)

Right click anywhere in the Review window

Click Save All or Save Selected

Paste into the do-file editor (or into Notepad or Wordpad)
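Stata's cmdlog command offers a related route that the notes above do not mention: it writes the commands typed in the Command window (without their output) to a plain text file, which can then be edited into a do-file. The sketch below is meant to be typed interactively; the file name mycommands.do is made up for the example, and the auto dataset is only stand-in data.

```stata
* Start recording typed commands to a plain text file
cmdlog using mycommands.do, replace

* Commands typed from here on are recorded, one per line
sysuse auto, clear
summarize mpg weight

* Stop recording and look at what was captured
cmdlog close
type mycommands.do
```

The recorded file will usually need a little editing, for example removing mistyped commands, before it is re-run with do.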

8. Stat /Transfer for importing/exporting data

! Most often data are entered and managed using software other than Stata. This might be done in a spreadsheet such as Excel, a database such as Access or Oracle, or another statistical package such as SAS or SPSS

! In many cases, you can Copy/Paste the data from the outside source into the Stata Data Editor, which transfers the data in simple cases

! If worst comes to worst, data may be transferred to Stata for analysis by writing a space- or comma-delimited ASCII text file to disk and then reading that into Stata using infile or infix

! The best option for translating data into or from Stata format is to use a “transfer program” such as Stat/Transfer -- available in the PC Labs on the 3rd floor

! DEMO: To make the transfer, start Stat/Transfer, specify the input file and select its type, then select the output file and select its type (Stata version). Note that you may also translate a Stata dataset into any of the other supported file formats, i.e., you could translate a Stata dataset for further analysis using SAS or SPSS, for example

— Example: translate the SAS dataset alt3-1.sd2 into a Stata dataset named alt3-1.dta

Start Stat/Transfer: Start Button, Program, ... click the Stat/Transfer icon

Click the About tab and verify the version is 5 or higher; earlier versions of Stat/Transfer may not correctly transfer SAS datasets

Select SAS for Windows/OS2 from the Input File Type selection box

Click Browse; locate and select the SAS file alt3-1.sd2 for the input File Specification box

Select Stata from the Output File Type selection box

Type alt3-1.dta in the output File Specification box

Click the Transfer button

... SAS dataset should be converted to Stata format


To test the transfer:

Start Stata and give the commands

use alt3-1.dta , clear

describe

! Using the clipboard to import datasets

— Some datasets, such as spreadsheets, can be “copied” to the clipboard

— These can be “pasted” into the Stata Data Editor, which often is a very quick way to transfer data into Stata

— Demonstrate transfer from Excel to Stata

— Data can be exported from Stata using the clipboard by reversing the process
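When neither a transfer program nor the clipboard is convenient, the delimited text file route mentioned earlier in this section can be sketched as below. This is an illustration only: it uses the bundled auto dataset as stand-in data, the file name auto_export.csv is made up, and it reads the file back with insheet (which handles comma-delimited files) rather than infile.

```stata
* Stand-in data: a few variables from the bundled auto dataset
sysuse auto, clear
keep make mpg weight price

* Write a comma-delimited text file with variable names in the first row
outsheet using auto_export.csv, comma replace

* Read the same file back into Stata
insheet using auto_export.csv, comma clear
describe
list in 1/5
```

A spreadsheet or another statistical package can open auto_export.csv directly, which is what makes this a workable fallback.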

9. Example 1: exploratory analysis of data from Altman’s Exercise 3-1

! Data Source: The data come from Exercise 3 on p.45 of the well-written textbook Practical Statistics for Medical Research (Chapman & Hall) by Douglas Altman

! Data Story: The data concern 65 patients with rheumatoid arthritis, whether they experienced adverse drug reactions (REAC) to sodium aurothiomalate (SA), and whether age, dose, or an index (SI = sulphoxidation index) bears any relationship to the adverse reactions

! Data sheet:


9.1 Listing of data file

! Below there is a listing of the contents of the file

alt3-1ex.dat,

which contains the raw data, one line (row) per patient

! The variables (columns) for each patient are as follows:

Id Number                                   sno
Reaction (1=Yes 2=No)                       react
Age (years)                                 age
Dose (mg)                                   sadose
Sulphoxidation Index (no units)             si
Whether Index is censored (1=Yes 0=No)      censor

1 2 44 1560 1.0 0
2 2 65 1310 1.2 0
3 2 58 850 1.2 0
4 2 57 1250 1.7 0
5 2 51 950 1.8 0
6 2 64 850 1.8 0
7 2 33 1200 1.9 0
8 2 61 1390 2.0 0
9 2 49 1450 2.3 0
10 2 67 3300 2.8 0
11 2 39 2760 2.8 0
12 2 42 860 3.4 0
13 2 35 1810 3.4 0
14 2 31 1310 3.8 0
15 2 37 1250 3.8 0
16 2 43 1210 4.2 0
17 2 39 1460 4.9 0
18 2 53 2310 5.4 0
19 2 44 1360 5.9 0
20 2 41 1910 6.2 0
21 2 72 910 12.0 0
22 2 61 1410 18.8 0
23 2 48 2460 47.0 0
24 2 59 1350 70.0 0
25 2 72 810 80.0 1
26 2 59 1460 80.0 1
27 2 71 760 80.0 1
28 2 53 910 80.0 1
1 1 53 360 2.0 0
2 1 74 2010 2.0 0
3 1 29 1390 2.0 0
4 1 53 660 3.0 0
5 1 67 1135 3.5 0
6 1 67 510 5.3 0
7 1 54 410 5.7 0
8 1 51 910 6.5 0
9 1 57 360 13.0 0
10 1 62 1260 13.0 0
11 1 51 560 13.9 0
12 1 68 1135 14.7 0
13 1 50 1410 15.4 0
14 1 38 1110 15.7 0
15 1 61 960 16.6 0
16 1 59 1310 16.6 0
17 1 68 910 16.6 0
18 1 44 1235 22.0 0
19 1 57 2950 22.3 0
20 1 49 360 33.2 0
21 1 49 1935 47.0 0
22 1 63 1660 61.0 0
23 1 29 435 65.0 0
24 1 53 310 65.0 0
25 1 53 310 80.0 1
26 1 49 410 80.0 1
27 1 42 690 80.0 1
28 1 44 910 80.0 1
29 1 59 1260 80.0 1
30 1 51 1260 80.0 1
31 1 46 1310 80.0 1
32 1 46 1350 80.0 1
33 1 41 1410 80.0 1
34 1 39 1460 80.0 1
35 1 62 1535 80.0 1
36 1 49 1560 80.0 1
37 1 53 2050 80.0 1

9.2 Analysis Plan

— Means, SDs, percentiles with summarize

— List data for checking with list

— Stem-and-leaf plots for continuous variables using stem

— Scatterplot matrix to show bivariate relationships among continuous variables using graph matrix

— Dot diagrams to show point distributions within groups using dotplot

— Boxplots by group using graph box

— Shapiro-Wilk test for normal distribution using swilk

— Diagnostic plots for normal distribution using qnorm

— Pick transformation using the Box-Cox transformation: boxcox
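The plan above maps onto one command per step. The sketch below is a compact illustration rather than the do-file used for the log in section 9.5: it assumes alt3-1ex.dta (with the variables sno, react, age, sadose, si and censor) is already in the working directory, and it uses current option names such as over(), which differ from the older options that appear in the log.

```stata
* Assumption: alt3-1ex.dta is in the current working directory
use "alt3-1ex.dta", clear

summarize age sadose si, detail     // means, SDs, percentiles
list in 1/10                        // list data for checking
stem age                            // stem-and-leaf plot
graph matrix age sadose si          // scatterplot matrix
dotplot si, over(react)             // dot diagram by reaction group
graph box si, over(react)           // boxplots by reaction group
swilk age sadose si                 // Shapiro-Wilk test of normality
qnorm si                            // normal quantile plot
boxcox si                           // Box-Cox estimate of the power for si
```

The // comments work in a do-file but not at the interactive Command prompt, so run this from the do-file editor.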

9.3 Box-Cox transform

! The Box-Cox transform is used to find a scale on which the response variable is approximately normally distributed. It does not always work, but it is worth trying. Don’t apply it without applying common sense to the result

! It can be used in a regression model to find a transformation that makes the errors in the regression model approximately normally distributed

! The transform represents a family of “power” transformations commonly used in data analysis:


! See boxcox in the Stata reference manual for more details and examples
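For reference, the family of power transformations is usually written as follows. This is the standard textbook form for a strictly positive response y with power parameter lambda; the notation may differ from the one used in class.

```latex
y^{(\lambda)} =
\begin{cases}
  \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[1.5ex]
  \ln y, & \lambda = 0,
\end{cases}
\qquad y > 0 .
```

Familiar special cases, up to a shift and rescaling, are lambda = 1 (no transformation), lambda = 1/2 (square root), lambda = 0 (natural log) and lambda = -1 (reciprocal).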

9.4 Techniques Illustrated

! Use of comment statements for documentation

! Clear Stata’s work space

! Change the working folder (directory) on disk from Stata

! Make a folder from Stata to help organize your work

! Print results by sending them to a file on disk so they can be incorporated into a word processor and printed

! Input free-format data from a data file on disk

! Label variables

! Label variable values

! List data

! Get summary statistics

! Get stem-and-leaf plots

! Get a scatterplot matrix

! Store Stata graphs on disk in “Windows metafile format” (.wmf) for incorporation into word processing programs and printing

! Get dot diagrams

! Get boxplots

! Generate the Shapiro-Wilk statistic for testing normality

! Produce a quantile-quantile plot for assessing goodness of fit to a normal distribution

! Use the Box-Cox transform to suggest a transformation to normality

! NOTE: The do-file and data file are on the website as alt3-1ex.do and alt3-1ex.dat
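The housekeeping techniques in this list can be sketched as the opening and closing lines of a do-file. This is an illustration under stated assumptions, not the contents of alt3-1ex.do: the folder c:\data and the graph file name hist_age.wmf are made up, and alt3-1ex.dat is assumed to sit in that folder.

```stata
* Lines that begin with an asterisk are comments
clear                                   // empty Stata's work space
cd "c:\data"                            // assumed folder; use your own project folder
capture mkdir alt3-1ex                  // make a subfolder; capture ignores "already exists"
capture log close
log using alt3-1ex.log, replace         // send results to a text file on disk

* Free-format input, then variable and value labels
infile sno react age sadose si censor using alt3-1ex.dat
label variable react "Adverse Reaction"
label define reactlbl 1 "Yes" 2 "No"
label values react reactlbl

* Draw a graph and store it for a word processor
histogram age
graph export alt3-1ex\hist_age.wmf, replace   // .wmf export is a Stata for Windows feature

log close
```

The text log written by log using can be opened in a word processor and printed from there.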

9.5 Log Showing Commands and Output

. * Turn off MORE feature
. set more off

. * Input data
. infile sno react age sadose si censor using alt3-1ex.dat
(65 observations read)

. * Variable labels
. label variable sno "Study No."
. label variable react "Adverse Reaction"
. label variable age "Age in years"
. label variable sadose "Dose of SA (mg)"
. label variable si "Sulphoxidation Index"

. * Value labels
. label define reactlbl 1 "Yes" 2 "No"
. label values react reactlbl

. * Save Stata dataset
. save alt3-1ex.dta, replace
file alt3-1ex.dta saved

. * List data for checking
. list in 1/10

     +-------------------------------------------+
     | sno   react   age   sadose    si   censor |
     |-------------------------------------------|
  1. |   1      No    44     1560     1        0 |
  2. |   2      No    65     1310   1.2        0 |
  3. |   3      No    58      850   1.2        0 |
  4. |   4      No    57     1250   1.7        0 |
  5. |   5      No    51      950   1.8        0 |
     |-------------------------------------------|
  6. |   6      No    64      850   1.8        0 |
  7. |   7      No    33     1200   1.9        0 |
  8. |   8      No    61     1390     2        0 |
  9. |   9      No    49     1450   2.3        0 |
 10. |  10      No    67     3300   2.8        0 |
     +-------------------------------------------+


. * Descriptive Statistics
. summarize , detail

                          Study No.
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              1
 5%            2              1
10%            4              2       Obs                  65
25%            9              2       Sum of Wgt.          65

50%           17                      Mean           17.06154
                        Largest       Std. Dev.      9.974776
75%           25             34
90%           31             35       Variance       99.49615
95%           34             36       Skewness       .1632394
99%           37             37       Kurtosis       2.000031

                      Adverse Reaction
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              1
 5%            1              1
10%            1              1       Obs                  65
25%            1              1       Sum of Wgt.          65

50%            1                      Mean           1.430769
                        Largest       Std. Dev.      .4990375
75%            2              2
90%            2              2       Variance       .2490385
95%            2              2       Skewness       .2796164
99%            2              2       Kurtosis       1.078185

                        Age in years
-------------------------------------------------------------
      Percentiles      Smallest
 1%           29             29
 5%           33             29
10%           38             31       Obs                  65
25%           44             33       Sum of Wgt.          65

50%           53                      Mean           52.12308
                        Largest       Std. Dev.      11.19641
75%           61             71
90%           67             72       Variance       125.3596
95%           71             72       Skewness      -.0659275
99%           74             74       Kurtosis       2.326933

                       Dose of SA (mg)
-------------------------------------------------------------
      Percentiles      Smallest
 1%          310            310
 5%          360            310
10%          410            360       Obs                  65
25%          860            360       Sum of Wgt.          65

50%         1260                      Mean           1249.538
                        Largest       Std. Dev.      622.3134
75%         1460           2460
90%         2010           2760       Variance         387274
95%         2460           2950       Skewness       .9572716
99%         3300           3300       Kurtosis       4.426923


                    Sulphoxidation Index
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              1
 5%          1.7            1.2
10%          1.9            1.2       Obs                  65
25%          3.4            1.7       Sum of Wgt.          65

50%         14.7                      Mean           31.54308
                        Largest       Std. Dev.       33.2201
75%           80             80
90%           80             80       Variance       1103.575
95%           80             80       Skewness       .6044778
99%           80             80       Kurtosis       1.543044

                           censor
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs                  65
25%            0              0       Sum of Wgt.          65

50%            0                      Mean           .2615385
                        Largest       Std. Dev.      .4428926
75%            1              1
90%            1              1       Variance       .1961538
95%            1              1       Skewness       1.085217
99%            1              1       Kurtosis       2.177696

. * Stem and leaf
. stem age

Stem-and-leaf plot for age (Age in years)

  2. | 99
  3* | 13
  3. | 578999
  4* | 112234444
  4. | 66899999
  5* | 0111133333334
  5. | 77789999
  6* | 1112234
  6. | 577788
  7* | 1224

. stem sadose

Stem-and-leaf plot for sadose (Dose of SA (mg))

  0*** | 310,310,360,360,360
  0*** | 410,410,435,510,560
  0*** | 660,690,760
  0*** | 810,850,850,860,910,910,910,910,910,950,960
  1*** | 110,135,135
  1*** | 200,210,235,250,250,260,260,260,310,310,310,310,350,350,360,390,390
  1*** | 410,410,410,450,460,460,460,535,560,560
  1*** | 660
  1*** | 810,910,935
  2*** | 010,050
  2*** | 310
  2*** | 460
  2*** | 760
  2*** | 950
  3*** |
  3*** | 300

[Figure: SCATTERPLOT MATRIX of Adverse Reaction, Age in years, Dose of SA (mg) and Sulphoxidation Index]

. stem si

Stem-and-leaf plot for si (Sulphoxidation Index)

si rounded to nearest multiple of .1
plot in units of .1

  0** | 10,12,12,17,18,18,19,20,20,20,20,23,28,28,30,34,34,35,38,38,42,49
  0** | 53,54,57,59,62,65
  1** | 20,30,30,39,47
  1** | 54,57,66,66,66,88
  2** | 20,23
  2** |
  3** | 32
  3** |
  4** |
  4** | 70,70
  5** |
  5** |
  6** | 10
  6** | 50,50
  7** | 00
  7** |
  8** | 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00

. * Scatterplots Matrix
. graph box age, over (react) t1(AGE BOXPLOTS) t2(" ") l1(AGE) b1(REACTION)
(file alt3-1ex\boxplot1.gph saved)

. graph export alt3-1ex\scatmat.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\scatmat.wmf written in Windows Metafile format)

[Figure: AGE DOTPLOT. Age in years (vertical axis, labelled AGE) by Adverse Reaction (Yes, No; horizontal axis, labelled REACTION)]

[Figure: SA DOSE DOTPLOT. Dose of SA (mg) (vertical axis, labelled SADOSE MG) by Adverse Reaction (Yes, No; horizontal axis, labelled REACTION)]

. * Dot diagram

. sort react

. dotplot age , by (react) t1(AGE DOTPLOT) l1(AGE) b1(REACTION)
(file alt3-1ex\dotplot1.gph saved)

. graph export alt3-1ex\dotplot1.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\dotplot1.wmf written in Windows Metafile format)

. dotplot sadose, by (react) t1(SA DOSE DOTPLOT) l1(SADOSE MG) b1(REACTION)
(file alt3-1ex\dotplot2.gph saved)

. graph export alt3-1ex\dotplot2.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\dotplot2.wmf written in Windows Metafile format)

. dotplot si, by (react) t1(SI DOSE DOTPLOT) l1(SI) b1(REACTION)
(file alt3-1ex\dotplot3.gph saved)

. graph export alt3-1ex\dotplot3.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\dotplot3.wmf written in Windows Metafile format)

[Figure: SI DOSE DOTPLOT. Sulphoxidation Index (vertical axis, labelled SI) by Adverse Reaction (Yes, No; horizontal axis, labelled REACTION)]

. * Letter values, outliers by reaction subgroup

. lv age if react==1 ,generate

 #     37               Age in years
               ------------------------------
 M     19      |              53            |     spread  pseudosigma
 F     10      |      46    52.5      59    |         13     10.05177
 E    5.5      |    41.5   53.25      65    |       23.5     10.80392
 D      3      |      38      53      68    |         30     10.23727
 C      2      |      29    48.5      68    |         39     11.47614
 B    1.5      |      29      50      71    |         42     11.27376
        1      |      29    51.5      74    |         45     10.79743
               |                            |
               |                            |    # below      # above
inner fence    |    26.5            78.5    |          0            0
outer fence    |       7              98    |          0            0

. list age if react==1 & ( (age >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (age <=( r(l_F) - 1.5*(r(u_F) - r(l_F)))) )

. lv age if react==2 ,generate

 #     28               Age in years
               ------------------------------
 M   14.5      |              52            |     spread  pseudosigma
 F    7.5      |    41.5   51.25      61    |       19.5     14.65586
 E      4      |      37      52      67    |         30     13.28402
 D    2.5      |      34   52.75    71.5    |       37.5     13.11905
 C    1.5      |      32      52      72    |         40     11.51282
        1      |      31    51.5      72    |         41     10.41174
               |                            |
               |                            |    # below      # above
inner fence    |   12.25           90.25    |          0            0
outer fence    |     -17           119.5    |          0            0

. list age if react==2 & ( (age >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (age <=( r(l_F) - 1.5*(r(u_F) - r(l_F)))) )


. lv sadose if react==1 ,generate

 #     37              Dose of SA (mg)
               ------------------------------
 M     19      |            1135            |     spread  pseudosigma
 F     10      |     560     985    1410    |        850     657.2313
 E    5.5      |     385   997.5    1610    |       1225      563.183
 D      3      |     360    1185    2010    |       1650     563.0501
 C      2      |     310    1180    2050    |       1740     512.0124
 B    1.5      |     310    1405    2500    |       2190     587.8463
        1      |     310    1630    2950    |       2640     633.4493
               |                            |
               |                            |    # below      # above
inner fence    |    -715            2685    |          0            1
outer fence    |   -1990            3960    |          0            0

. list sadose if react==1 & ( (sadose >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (sadose <=( r(l_F) - 1.5*(r(u_F) - r(l_F)))) )

     +--------+
     | sadose |
     |--------|
 37. |   2950 |
     +--------+

. lv sadose if react==2 , generate

 #     28              Dose of SA (mg)
               ------------------------------
 M   14.5      |            1330            |     spread  pseudosigma
 F    7.5      |     930    1220    1510    |        580     435.9179
 E      4      |     850    1580    2310    |       1460      646.489
 D    2.5      |     830    1720    2610    |       1780     622.7175
 C    1.5      |     785  1907.5    3030    |       2245      646.157
        1      |     760    2030    3300    |       2540     645.0197
               |                            |
               |                            |    # below      # above
inner fence    |      60            2380    |          0            3
outer fence    |    -810            3250    |          0            1

. list sadose if react==2 & ( (sadose >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (sadose <=( r(l_F) - 1.5*(r(u_F) - r(l_F)))) )

     +--------+
     | sadose |
     |--------|
 26. |   2460 |
 27. |   2760 |
 28. |   3300 |
     +--------+

. lv si if react==1 ,generate

 #     37            Sulphoxidation Index
               ------------------------------
 M     19      |            22.3            |     spread  pseudosigma
 F     10      |      13    46.5      80    |         67     51.80529
 E    5.5      |     4.4    42.2      80    |       75.6     34.75644
 D      3      |       2      41      80    |         78     26.61691
 C      2      |       2      41      80    |         78     22.95228
 B    1.5      |       2      41      80    |         78     20.93699
        1      |       2      41      80    |         78     18.71555
               |                            |
               |                            |    # below      # above
inner fence    |   -87.5           180.5    |          0            0
outer fence    |    -188             281    |          0            0

. list si if react==1 & ( (si >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (si <=( r(l_F) - 1.5*(r(u_F) - r(l_F)))) )


. lv si if react==2 ,generate

 #     28            Sulphoxidation Index
               ------------------------------
 M   14.5      |             3.8            |     spread  pseudosigma
 F    7.5      |    1.95   8.675    15.4    |      13.45     10.10879
 E      4      |     1.7   40.85      80    |       78.3      34.6713
 D    2.5      |     1.2    40.6      80    |       78.8      27.5675
 C    1.5      |     1.1   40.55      80    |       78.9     22.70904
        1      |       1    40.5      80    |         79     20.06164
               |                            |
               |                            |    # below      # above
inner fence    | -18.225          35.575    |          0            6
outer fence    |   -38.4           55.75    |          0            5

. list si if react==2 & ( (si >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (si <=( r(l_F) - 1.5*(r(u_F) - r(l_F)))) )

     +----+
     | si |
     |----|
 23. | 47 |
 24. | 70 |
 25. | 80 |
 26. | 80 |
 27. | 80 |
     |----|
 28. | 80 |
     +----+

.

.

.

.

.

.


[Figure: AGE BOXPLOTS. Boxplots of age in years (AGE) by REACTION (Yes, No).]

[Figure: SA DOSE BOXPLOTS. Boxplots of dose of SA in mg (DOSE MG) by REACTION (Yes, No).]

. * Boxplots

.

. sort react

.

. graph box age, over (react) t1(AGE BOXPLOTS) t2(" ") l1(AGE) b1(REACTION)
(file alt3-1ex\boxplot1.gph saved)

. graph export alt3-1ex\boxplot1.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\boxplot1.wmf written in Windows Metafile format)

.

. graph box sadose, over (react) t1(SA DOSE BOXPLOTS) t2(" ") l1(DOSE MG) b1(REACTION)
(file alt3-1ex\boxplot2.gph saved)

. graph export alt3-1ex\boxplot2.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\boxplot2.wmf written in Windows Metafile format)

.

. graph box si, over (react) t1(SI DOSE BOXPLOTS) t2(" ") l1(SI) b1(REACTION)

. graph export alt3-1ex\boxplot3.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\boxplot3.wmf written in Windows Metafile format)


[Figure: SI DOSE BOXPLOTS. Boxplots of Sulphoxidation Index (SI) by REACTION (Yes, No).]

.

. * Shapiro-Wilk Test for Normality

.

. swilk age sadose si

                   Shapiro-Wilk W test for normal data

    Variable |    Obs        W          V         z      Prob>z
-------------+-------------------------------------------------
         age |     65    0.98503      0.868    -0.307   0.62061
      sadose |     65    0.92756      4.199     3.107   0.00094
          si |     65    0.82921      9.901     4.964   0.00000

.

.

.


[Figure: AGE Q-Q PLOT. Age in years (AGE) plotted against Inverse Normal. Grid lines are 5, 10, 25, 50, 75, 90, and 95 percentiles.]

[Figure: SA DOSE Q-Q PLOT. Dose of SA in mg (SA DOSE) plotted against Inverse Normal. Grid lines are 5, 10, 25, 50, 75, 90, and 95 percentiles.]

. * Diagnostic Plot for Normal Distribution (Q-Q plot)

.

. qnorm age , grid b1(AGE Q-Q PLOT) l1(AGE)

. graph export alt3-1ex\qqplot1.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\qqplot1.wmf written in Windows Metafile format)

. qnorm sadose , grid b1(SA DOSE Q-Q PLOT) l1(SA DOSE)

. graph export alt3-1ex\qqplot2.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\qqplot2.wmf written in Windows Metafile format)

.

. qnorm si , grid b1(SI Q-Q PLOT) l1(SI)

. graph export alt3-1ex\qqplot3.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-1ex\qqplot3.wmf written in Windows Metafile format)


[Figure: SI Q-Q PLOT. Sulphoxidation Index (SI) plotted against Inverse Normal. Grid lines are 5, 10, 25, 50, 75, 90, and 95 percentiles.]

.

. * Box-Cox method to choose transformation to normality

.

. * nolog option suppresses iterations - nothing to do with logarithms

.

. boxcox age , nolog
Fitting comparison model

Fitting full model

                                                  Number of obs   =         65
                                                  LR chi2(0)      =       0.00
Log likelihood = -248.73918                       Prob > chi2     =          .

------------------------------------------------------------------------------
         age |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      /theta |   1.028826    .527121     1.95   0.051    -.004312    2.061964
------------------------------------------------------------------------------

Estimates of scale-variant parameters
----------------------------
             |      Coef.
-------------+--------------
Notrans      |
       _cons |    55.8456
-------------+--------------
      /sigma |   12.44209
----------------------------

---------------------------------------------------------
   Test         Restricted     LR statistic      P-Value
    H0:       log likelihood       chi2       Prob > chi2
---------------------------------------------------------
theta = -1     -256.76965         16.06          0.000
theta =  0     -250.73362          3.99          0.046
theta =  1     -248.74068          0.00          0.956
---------------------------------------------------------


. boxcox sadose, nolog
Fitting comparison model

Fitting full model

                                                  Number of obs   =         65
                                                  LR chi2(0)      =       0.00
Log likelihood = -505.33421                       Prob > chi2     =          .

------------------------------------------------------------------------------
      sadose |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      /theta |   .4100593   .1929563     2.13   0.034     .031872    .7882467
------------------------------------------------------------------------------

Estimates of scale-variant parameters
----------------------------
             |      Coef.
-------------+--------------
Notrans      |
       _cons |   41.58575
-------------+--------------
      /sigma |   9.273821
----------------------------

---------------------------------------------------------
   Test         Restricted     LR statistic      P-Value
    H0:       log likelihood       chi2       Prob > chi2
---------------------------------------------------------
theta = -1     -530.33416         50.00          0.000
theta =  0     -507.58528          4.50          0.034
theta =  1     -509.90097          9.13          0.003
---------------------------------------------------------

. boxcox si , nolog
Fitting comparison model

Fitting full model

                                                  Number of obs   =         65
                                                  LR chi2(0)      =       0.00
Log likelihood = -285.74575                       Prob > chi2     =          .

------------------------------------------------------------------------------
          si |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      /theta |   .0403967   .1055843     0.38   0.702    -.1665448    .2473382
------------------------------------------------------------------------------

Estimates of scale-variant parameters
----------------------------
             |      Coef.
-------------+--------------
Notrans      |
       _cons |   2.770815
-------------+--------------
      /sigma |    1.64801
----------------------------

---------------------------------------------------------
   Test         Restricted     LR statistic      P-Value
    H0:       log likelihood       chi2       Prob > chi2
---------------------------------------------------------
theta = -1      -333.2825         95.07          0.000
theta =  0     -285.81928          0.15          0.701
theta =  1      -319.4322         67.37          0.000
---------------------------------------------------------
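The three hypotheses tested in each boxcox table (theta = -1, 0, 1) correspond to the reciprocal, the logarithm, and no transformation. A minimal sketch of the transformation, assuming the usual Box-Cox form (y^theta - 1)/theta with log(y) at theta = 0; Python is used only as a calculator and the function name box_cox is made up for this illustration.

```python
import math

def box_cox(y, theta):
    """Box-Cox transform of a positive value y (assumed standard form)."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if theta == 0:
        return math.log(y)          # limiting case as theta -> 0
    return (y ** theta - 1) / theta

print(box_cox(10, 1))    # 9.0  (y - 1: a shift only, i.e. no transformation)
print(box_cox(4, -1))    # 0.75 (1 - 1/y: a reciprocal scale)
print(box_cox(10, 0))    # natural log of 10
```

In the log above, theta = 1 is not rejected for age (P = 0.956) and theta = 0 is not rejected for si (P = 0.701), which is consistent with leaving age untransformed and taking the log of si. For sadose all three values are rejected at the 5% level, and the estimated theta is about 0.41.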

.

.

.

.

.

. * Close the log -- may want to use for production runs

. *log close

10. Example 2: input and display of data from Altman’s exercise 3-2


! Data: These data are found on p.47 of Altman (Exercise 3.2). The data concern airplane accidents (counts, rates per 1000 pilots, and rates per 100,000 flight hours) and how they relate to the occupation of the pilot

! Script of Stata commands contained in alt3-2ex.do

! NOTE: The script file and data file are on the class disk as alt3-2ex.do and alt3-2ex.dat

10.1 Source data from Altman

10.2 Raw data — text file on disk


Professional pilots           1302      15.9       0.2
Lawyers                         57      11.0       1.5
Farmers                        166      10.1       1.3
Sales representatives          137       9.0       1.2
Physicians                      76       8.7       1.8
Mechanics and repairmen         44       6.9       1.5
Policemen and detectives        48       6.6       1.8
Managers and administrators    643       6.0       0.7
Engineers                      125       4.7       1.1
Teachers                        43       4.2       1.1
Housewives                      29       3.7       3.2
Academic students              188       3.2       3.7
Armed Forces Members           111       1.6       0.7

10.3 Analysis plan

! Explore this simple dataset with several graphs using the graph command

— Show how counts of accidents are related to occupation of pilot

— Show how rates per 1000 pilots are related to occupation

— Show how rates per 100,000 flight hours are related to occupation

— Show how the two rates are related to one another

! Consider other approaches to analysis

10.4 Stata log

.

.

. * Turn off MORE feature

.

. set more off

.

.

.

. * Input data, embedded blanks in string

.

. infix str occup 1-29 accid 30-34 rate1 40-44 rate2 50-54 using alt3-2ex.dat
(13 observations read)
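The infix command reads each variable from a fixed range of columns, which is how the occupation names can contain embedded blanks. The same idea can be sketched by slicing a text record at those column positions. This is only an illustration in Python: the function name parse_record is made up, and the sample record is built here to fit the column ranges rather than copied from alt3-2ex.dat.

```python
def parse_record(line):
    """Split one fixed-column record using the column ranges from the infix command."""
    occup = line[0:29].strip()    # columns 1-29 (string, may contain blanks)
    accid = int(line[29:34])      # columns 30-34
    rate1 = float(line[39:44])    # columns 40-44
    rate2 = float(line[49:54])    # columns 50-54
    return occup, accid, rate1, rate2

# Hypothetical record laid out to match the column ranges above
record = ("Professional pilots".ljust(29) + "1302".rjust(5) + " " * 5
          + "15.9".rjust(5) + " " * 5 + "0.2".rjust(5))

print(parse_record(record))  # ('Professional pilots', 1302, 15.9, 0.2)
```

Because the fields are located by position and not by separators, a record whose columns are shifted by even one character would be read incorrectly, which is why the notes list the data for checking right after input.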

.

.

.

. * Variable labels

. label variable occup "Occupation"

. label variable accid "No. of Accidents"

. label variable rate1 "Rate per 1000"

. label variable rate2 "Rate per 100,000 hr"

.

. * List data for checking

.

. list

     +-----------------------------------------------------+
     |                       occup   accid   rate1   rate2 |
     |-----------------------------------------------------|
  1. |         Professional pilots    1302    15.9      .2 |
  2. |                     Lawyers      57      11     1.5 |
  3. |                     Farmers     166    10.1     1.3 |
  4. |       Sales representatives     137       9     1.2 |
  5. |                  Physicians      76     8.7     1.8 |
     |-----------------------------------------------------|
  6. |     Mechanics and repairmen      44     6.9     1.5 |
  7. |    Policemen and detectives      48     6.6     1.8 |
  8. | Managers and administrators     643       6      .7 |
  9. |                   Engineers     125     4.7     1.1 |
 10. |                    Teachers      43     4.2     1.1 |
     |-----------------------------------------------------|
 11. |                  Housewives      29     3.7     3.2 |
 12. |           Academic students     188     3.2     3.7 |
 13. |        Armed Forces Members     111     1.6      .7 |
     +-----------------------------------------------------+

[Figure 1: AIRPLANE ACCIDENTS. Horizontal bar graph of No. of Accidents by OCCUPATION.]

.

.

.

. * Code occupations for graphs

. encode occup, gen(occup1)

.

.

.

. * Make shorter labels for graphs

.

. #delimit ;
delimiter now ;
. label define occuplab 1 "Acad" 2 "Armed For" 3 "Engin"
> 4 "Farm" 5 "Housewife" 6 "Law"
> 7 "Mgrs" 8 "Mech" 9 "MD"
> 10 "Police" 11 "Pro Pilot" 12 "Sales"
> 13 "Teach" ;

. #delimit cr
delimiter now cr
.

. label values occup1 occuplab
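The short labels only line up with the right occupations because encode numbers the distinct string values in alphabetical order, so code 1 is "Academic students", code 2 is "Armed Forces Members", and so on. A quick sketch of that numbering, in Python and for illustration only (the dictionary name codes is made up):

```python
# Occupations in the order they appear in the data file
occupations = [
    "Professional pilots", "Lawyers", "Farmers", "Sales representatives",
    "Physicians", "Mechanics and repairmen", "Policemen and detectives",
    "Managers and administrators", "Engineers", "Teachers", "Housewives",
    "Academic students", "Armed Forces Members",
]

# Number the distinct values 1, 2, ... in alphabetical order
codes = {name: i for i, name in enumerate(sorted(set(occupations)), start=1)}

for name, code in sorted(codes.items(), key=lambda item: item[1]):
    print(code, name)
```

Comparing the printed list with the label define command shows the correspondence, for example 9 with "MD" (Physicians) and 11 with "Pro Pilot" (Professional pilots). If the data file changed, the codes could shift and the labels would need to be checked again.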

.

.

.

.

. * Save as Stata dataset

.

. save alt3-2ex.dta, replace
file alt3-2ex.dta saved

.

.

. * Bar graph, See Figure 1

.

. sort occup1

.

. graph hbar accid , over(occup1,sort(1)) ytitle(" ") l1(OCCUPATION) b1(No. of Accidents) t1 (AIRPLANE ACCIDENTS)

. graph export alt3-2ex\fig1.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-2ex\fig1.wmf written in Windows Metafile format)


[Figure 2: AIRPLANE ACCIDENTS. Horizontal bar graph of Rate per 1000 Pilots by OCCUPATION.]

[Figure 3: AIRPLANE ACCIDENTS. Horizontal bar graph of Rate per 100000 hrs by OCCUPATION.]

.

.

.

. * Bar graph, See Figure 2

.

. graph hbar rate1 , over(occup1,sort(1)) ytitle(" ") l1(OCCUPATION) b1(Rate per 1000 Pilots) t1 (AIRPLANE ACCIDENTS)

. graph export alt3-2ex\fig2.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-2ex\fig2.wmf written in Windows Metafile format)

.

. * Bar graph See Figure 3

.

. graph hbar rate2 , over(occup1,sort(1)) ytitle(" ") l1(OCCUPATION) b1(Rate per 100000 hrs) t1 (AIRPLANE ACCIDENTS)
(file alt3-2ex\fig3.gph saved)

. graph export alt3-2ex\fig3.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-2ex\fig3.wmf written in Windows Metafile format)

.

.

. * Scatterplot See Figure 4

.

. graph twoway scatter rate1 rate2, mlabel(occup1) t1(AIRPLANE ACCIDENT RATES)

. graph export alt3-2ex\fig4.wmf,replace
(file C:\jt\bio624\2004\progs\alt3-2ex\fig4.wmf written in Windows Metafile format)


[Figure 4: AIRPLANE ACCIDENT RATES. Scatterplot of Rate per 1000 against Rate per 100,000 hr, with points labeled by occupation.]

.

.

. log close


11. Common data analysis applications

! For simplicity of illustration, the rheumatoid arthritis data introduced earlier will be used in all the examples, some of which may be contrived or inappropriate

! The examples shown below assume that the Stata dataset has been loaded into the work space through input of the raw data or by loading a saved dataset (e.g., use alt3-1ex\alt3-1ex.dta)

11.1 Descriptive statistics

! Means, SDs, and other descriptive statistics

! Variables: age, sadose, and si

! Command:

summarize age sadose si , detail

11.2 Stem-and-leaf charts

! Stem-and-Leaf to show distribution of continuous variable -- must do one variable at a time

! Variable: age

! Command:

stem age

11.3 Boxplots

! Boxplot to show distribution of a variable in subgroups of the data. Data must be sorted by the subgrouping variables. Store the graph in a folder (sub-directory) in metafile format (*.wmf), so it can be imported into a word processor for printing

! Variables:

— Subgrouping: reac

— Analysis: age

! Commands: [Type each command below on a single, long line]

sort react

graph box age, over (react) marker(1,mlab(sno)) t1(AGE BOXPLOTS) t2(" ") l1(AGE) b1(REACTION)

graph export alt3-1ex\boxplot1.wmf,replace

11.4 Confidence interval for a mean

! Calculate a 95% confidence interval for the mean value of a variable

! Variable: age

! Command:

ci age

! Immediate form of command, used as a “calculator” to produce a 95% CI from n, mean, and SD

. cii 65 52.12 11.20

11.5 Confidence interval for a proportion

! Calculate a 95% confidence interval for the proportion positive in a binomial distribution. Stata calculates exact binomial limits.

Note: Stata can also calculate limits for the mean of a Poisson distribution using the poisson option of the ci or cii commands.

! Variable: censor

! Command:

ci censor , binomial

! Immediate form of command, used as a “calculator” to produce a 95% CI from n, # of events

. cii 65 17

! Poisson example ( 27 deaths, 645 person-years):

cii 645 27 , poisson
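To see what "exact binomial limits" means for the cii 65 17 example, the limits can be found numerically from the binomial distribution itself. The sketch below assumes that "exact" refers to the Clopper-Pearson construction: the lower limit is the proportion at which observing 17 or more events has probability 2.5%, and the upper limit is the proportion at which observing 17 or fewer has probability 2.5%. Python is used only as a calculator; all function names are made up for this illustration.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def solve_increasing(g, target, iters=100):
    """Bisection on [0, 1] for an increasing function g, solving g(p) = target."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_binomial_ci(k, n, level=0.95):
    """Clopper-Pearson limits (assumed to be what 'exact' refers to)."""
    alpha = 1 - level
    # Lower limit: P(X >= k | p) = alpha/2
    lower = 0.0 if k == 0 else solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper limit: P(X <= k | p) = alpha/2, i.e. P(X > k | p) = 1 - alpha/2
    upper = 1.0 if k == n else solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper

lower, upper = exact_binomial_ci(17, 65)
print(17 / 65, lower, upper)
```

The interval is not symmetric about the observed proportion 17/65, unlike the familiar normal-approximation interval.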


11.6 Student’s t-test

! Used to test equality of means. It comes in 3 forms:

— Test that variable has a mean equal to a specific number: this is the one-sample t-test

— Test that variable1 has the same mean as variable2: this is the paired t-test

— Test that variable has the same mean within two groups defined by a grouping variable groupvar: this is the two-sample t-test

Note: Stata gives p-values for the t-tests, but also gives 95% confidence intervals on means and differences in means

! Variables: age with reac as the subgrouping variable

! Commands:

— One-sample t-test: Test mean age = 50

ttest age = 50

— Paired t-test: (Stupidly, for illustration) test mean sadose = si

ttest sadose = si

— Two-sample t-test: Test age means are equal within reaction groups

ttest age ,by (reac) or,

ttest age ,by (reac) unequal ... does not assume equal variances

! Immediate forms of commands can be used as a “calculator” to get a t-test given summary data on n, and the observed means and standard deviations (sd):

— One-sample test (n=24, observed mean=62.6, sd=15.8; test mean=75)

ttesti 24 62.6 15.8 75

— Paired t-test: there is no immediate command for this

— Two-sample t-test: (n1=20, m1=20, sd1=5; n2=32, m2=15, sd2=4; test means equal)

ttesti 20 20 5 32 15 4
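The t statistics behind the two ttesti examples can be reproduced by hand from the summary data. The sketch below uses the textbook formulas, with the pooled (equal-variance) form for the two-sample test; it computes only the statistic and its degrees of freedom, not the p-value. Python is used only as a calculator and the function names are made up for this illustration.

```python
from math import sqrt

def one_sample_t(n, mean, sd, mu0):
    """t statistic and degrees of freedom for H0: population mean = mu0."""
    se = sd / sqrt(n)
    return (mean - mu0) / se, n - 1

def two_sample_t(n1, m1, sd1, n2, m2, sd2):
    """Pooled-variance t statistic and degrees of freedom for H0: equal means."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df

t1, df1 = one_sample_t(24, 62.6, 15.8, 75)
t2, df2 = two_sample_t(20, 20, 5, 32, 15, 4)
print(round(t1, 2), df1)  # about -3.84 on 23 degrees of freedom
print(round(t2, 2), df2)  # about 3.98 on 50 degrees of freedom
```

With the unequal option Stata does not pool the variances, so both the standard error and the degrees of freedom would differ from the pooled values shown here.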

11.7 Test for binomial proportions

! Use to test equality of proportions within two subgroups

Note: Stata gives the 2x2 chi-square test and p-value. It also gives the Fisher’s exact test p-value

! Variables: proportion censored (censor) within reactivity groups (reac)

! Commands:

tab censor reac, chi2 exact

! Immediate forms of commands can be used as a “calculator” to test equality of proportions in a 2x2 table. Enter the rows of the table separated by a “\” character:

tabi 24 24 \ 13 4 , chi2 exact
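The chi-square statistic for the tabi example can be checked by hand with the shortcut formula for a 2x2 table. The sketch below is the uncorrected Pearson statistic with 1 degree of freedom, and the p-value uses the relation between a 1-df chi-square and the standard normal. It does not reproduce Fisher's exact test. Python is used only as a calculator and the function name is made up for this illustration.

```python
from math import erfc, sqrt

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and p-value for rows (a, b) and (c, d)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(stat / 2))   # upper tail of chi-square with 1 df
    return stat, p_value

stat, p_value = chi2_2x2(24, 24, 13, 4)
print(round(stat, 2), round(p_value, 3))  # about 3.59 and 0.058
```

With cell counts this small (one cell is 4), the exact option in the Stata command is the safer choice, which is why the notes request both.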

11.8 Correlation

! Obtain either the Pearson’s or Spearman’s (rank) estimated correlation coefficient of two measured responses x and y

! Variables: age and si

! Commands:

corr age si

spearman age si

Note: Pairs of correlations among a set of variables may be obtained by specifying the list of variables. E.g., to obtain age-sadose, age-si, and sadose-si correlations:

corr age sadose si

11.9 Simple linear regression

! Estimate simple linear model relating a measured response (dependent) variable y to a fixed covariate (independent) variable x: y = α+βx+ε

Stata produces an analysis of variance, p-values, coefficient estimates, standard errors, and 95% confidence intervals

! Variables: Dependent = si and independent = age

! Commands:

regress si age

! Commands to obtain a graph of the data, fitted line, and 95% CIs: (Type the graph command on one line)

graph twoway (scatter si age) || (lfitci si age) , t1("si=30.15+.0268age")

graph export alt3-1ex\lreg.wmf,replace


11.10 Analysis of variance

! Used to test equality of means within two or more subgroups (usually 3 or more, as the t-test is usually used for 2 groups)

! Variables: Dependent variable = si, subgrouping variable = reac (only 2 groups in this example)

! Command:

oneway si reac

11.11 Multiple linear regression

! Use either regress

! For details refer to the Reference Manual or

help regress

! Also see Stata User’s Guide Chapters 26 and 35 (in the handout for Part 1) for more details on fitting regression models

11.12 Multiple logistic regression

! Use logistic for logistic regression for binary responses

! Use clogit for matched or highly stratified case-control studies (including “frequency-matched” studies)

! Use ologit for logistic regression for ordered responses with more than 2 categories

! Use mlogit for logistic regression for responses with more than 2 categories (not ordered)

! For details refer to the Reference Manual or

help logistic

help clogit

help ologit

help mlogit

! Also see Stata User’s Guide Chapters 26 and 35 (in the handout for Part 1) for more details on fitting regression models

11.13 Epidemiologic calculations - epitab

! Most of the common calculations for epidemiologic analysis have been included in Stata in a group of commands labeled “epitab” in the Reference Manual

! Most of the commands have an “immediate” form so that they may be applied to summary tables, rather than to the raw data, which may not be available

! Details may be found in the Manual or by typing

help epitab

For convenience, the Help text is included below


. help epitab

-------------------------------------------------------------------------------
help for epitab, ir, iri, cs, csi, cc, cci, mcc, mcci      (manual: [R] epitab)
-------------------------------------------------------------------------------

Tables for epidemiologists
--------------------------

ir case_var ex_var time_var [weight] [if exp] [in range] [, level(#) tb by(varname) fast estandard istandard standard(varname) ird nocrude pool nohet ]

iri #a #b #N1 #N2 [, level(#) tb ]

cs case_var ex_var [weight] [if exp] [in range] [, level(#) exact tb woolf by(varname) fast or estandard istandard standard(varname) nocrude pool nohet ]

csi #a #b #c #d [, level(#) exact or tb woolf ]

cc case_var ex_var [weight] [if exp] [in range] [, level(#) exact tb woolf by(varname) fast estandard istandard standard(varname) nocrude pool nohet ]

cci #a #b #c #d [, level(#) exact tb woolf ]

mcc ex_case_var ex_cntl_var [weight] [if exp] [in range] [, level(#) tb ]

mcci #a #b #c #d [, level(#) tb ]

Description
-----------

ir is used with incidence rate (incidence density or person-time) data; point estimates and confidence intervals for the incidence rate ratio and difference are calculated along with attributable or prevented fractions for the exposed and total population. iri is the immediate form of ir; see help immed. Also see help nbreg, help poisson and help stcox for related commands.

cs is used with cohort study data with equal follow-up time per subject and, in some cases, cross-sectional data. Risk is then the proportion of subjects who become cases. Point estimates and confidence intervals for the risk difference, risk ratio, and (optionally) the odds ratio are calculated along with attributable or prevented fractions for the exposed and total population. csi is the immediate form of cs; see help immed. Also see help logistic and help glogit for related commands.

cc is used with case-control and cross-sectional data. Point estimates and confidence intervals for the odds ratio are calculated along with attributable or prevented fractions for the exposed and total population. cci is the immediate form of cc; see help immed. Also see help logistic and help glogit for related commands.

mcc is used with matched case-control data. McNemar's chi-squared, point estimates and confidence intervals for the difference, ratio, and relative difference of the proportion with the factor, along with the odds ratio, are calculated. mcci is the immediate form of mcc; see help immed. Also see help clogit for a related command.

Options
-------

level(#) specifies in percent the confidence level for confidence intervals.

exact requests Fisher's exact P be calculated rather than the chi-squared and its significance level. We recommend specifying exact whenever samples are small. A conservative rule-of-thumb for 2x2 tables is to specify exact when the least-frequent cell contains fewer than 1,000 cases. Note that exact does not affect whether exact confidence intervals are calculated; commands always calculate exact confidence intervals where they can unless tb or woolf is specified.

by(varname) specifies that the tables are stratified on varname. Within-stratum statistics are shown then combined with Mantel-Haenszel weights. If estandard, istandard, or standard() is also specified (see below), the weights specified are used in place of Mantel-Haenszel weights.


fast specifies that calculations of within-stratum confidence intervals are not to be made. This speeds execution of the command, although in the case of ir, it makes little difference and for the remaining commands, woolf or tb are almost as fast.

or is allowed only with the cs and csi commands. Specified without by(), or reports the calculation of the odds ratio in addition to the risk ratio. With by(), or specifies that a Mantel-Haenszel estimate of the combined odds ratio be made rather than the Mantel-Haenszel estimate of the risk ratio. In either case, this is the same calculation as would be made by cc or cci and, typically, the use of those commands is to be preferred for obtaining odds ratios.

tb requests that test-based confidence intervals be calculated wherever appropriate in place of confidence intervals based on other approximations or exact confidence intervals. We recommend that test-based confidence intervals be used only for pedagogical purposes and never be used for research work.

woolf requests that the Woolf approximation, also known as the Taylor expansion, be used for calculating the standard error of the odds ratio. Otherwise, the Cornfield approximation is used. The Cornfield approximation takes substantially longer (a few seconds) to calculate than the Woolf approximation. This standard error is used in calculating a confidence interval for the odds ratio. (For matched case-control data, exact confidence intervals are always calculated.)

estandard, istandard, and standard(varname) request that within-stratum statistics are to be combined with external, internal, or user-specified weights to produce a standardized estimate. These options are mutually exclusive and can only be used when by() is also specified. (When by() is specified without one of these options, Mantel-Haenszel weights are used.)

estandard external weights are the person-time for the unexposed (ir), the total number of unexposed (cs), or the number of unexposed controls (cc).

istandard internal weights are person-time for the exposed (ir), the total number of exposed (cs), or the number of exposed controls (cc). istandard can be used, among other things, to produce standardized mortality ratios (SMRs).

standard(varname) allows user-specified weights. varname must contain a constant within stratum and be nonnegative. The scale of varname is irrelevant.

ird may be used only with estandard, istandard, or standard(); it requests ir calculate the standardized incidence rate difference rather than the default incidence rate ratio.

rd may be used only with estandard, istandard, or standard(); it requests that cs calculate the standardized risk difference rather than the default risk ratio.

nocrude specifies that in a stratified analysis, the crude estimate -- the estimate one would obtain without regard to strata -- not be displayed. nocrude is relevant only if by() is also specified.

pool specifies that in a stratified analysis, the directly pooled estimate should also be displayed. The pooled estimate is a weighted average of the stratum-specific estimates using inverse-variance weights. pool is relevant only if by() is also specified.

nohet specifies that a chi-squared test for heterogeneity not be included in the output of a stratified analysis. This tests whether the exposure effect is the same across strata and can be performed for any pooled estimate -- directly pooled or Mantel-Haenszel. nohet is relevant only if by() is also specified.

Examples: incidence rate data
------------------------------

The table for incidence rate data is

                  Exposed   Unexposed
     ------------+---------------------
     Cases       |    a         b
     Person-time |    N1        N0


The basic syntax (ignoring options) for iri is "iri #a #b #N1 #N2". For example:

. iri 41 15 28010 19017

. iri 41 15 28010 19017, level(90)

. iri 41 15 28010 19017, level(90) tb

The basic syntax (ignoring options) for ir is "ir case_var ex_var time_var". case_var contains the number of cases represented by an observation. ex_var contains 0 if the observation represents unexposed and nonzero (e.g., 1) if the observation represents exposed. time_var contains the exposure time (e.g., person-years) represented by the observation. ir obtains the table by summing across observations. Observations with missing values are not used.

. list cases exposed time
  1. 20 1 14000
  2. 21 1 14010
  3. 15 0 19017

. ir cases exposed time, level(90)
(output omitted)

To obtain Mantel-Haenszel combined IRR:

. list agegrp deaths exposed pyears
  1. 1 14 1 1516
  2. 1 10 0 1701
  3. 2 76 1 949
  4. 2 121 0 2245

. ir deaths exposed pyears, by(agegrp)

To obtain internally standardized IRR:

. ir deaths exposed pyears, by(agegrp) istandard

To weight each group equally:

. gen wgt=1

. ir deaths exposed pyears, by(agegrp) standard(wgt)
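As a check on what ir and iri report as point estimates, the crude incidence rate ratio and rate difference can be computed directly from the table. The sketch below first sums the three listed observations into the a, b, N1, N0 table, as ir does, and then forms the two rates. It gives point estimates only, with no confidence intervals. Python is used only as a calculator and the names are made up for this illustration.

```python
# Observations as listed in the help text above: (cases, exposed, time)
rows = [(20, 1, 14000), (21, 1, 14010), (15, 0, 19017)]

# Sum across observations to build the incidence-rate table
a = sum(cases for cases, exposed, time in rows if exposed != 0)   # exposed cases
b = sum(cases for cases, exposed, time in rows if exposed == 0)   # unexposed cases
n1 = sum(time for cases, exposed, time in rows if exposed != 0)   # exposed person-time
n0 = sum(time for cases, exposed, time in rows if exposed == 0)   # unexposed person-time
print(a, b, n1, n0)  # 41 15 28010 19017, the numbers used with iri

rate_exposed = a / n1
rate_unexposed = b / n0
irr = rate_exposed / rate_unexposed   # incidence rate ratio
ird = rate_exposed - rate_unexposed   # incidence rate difference
print(round(irr, 3), round(ird, 6))   # about 1.856 and 0.000675
```

The summed table matches the iri 41 15 28010 19017 example, which is why the two forms of the command give the same estimates.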

Examples: cohort-study data
----------------------------

The table for cohort-study data is

                  Exposed   Unexposed
    ------------+---------------------
          Cases |    a          b
       Noncases |    c          d

The basic syntax (ignoring options) for csi is "csi #a #b #c #d". For example:

. csi 7 12 9 2

. csi 7 12 9 2, exact

. csi 7 12 9 2, exact level(90) tb
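The risk ratio and risk difference for the csi example can also be worked out by hand. The sketch assumes the usual definitions (risk = cases / (cases + noncases) within each exposure group); the scalar names are hypothetical.

```stata
* Hand calculation of the point estimates for ". csi 7 12 9 2"
* (scalar names are illustrative only)
* exposed: a = 7 cases, c = 9 noncases
scalar risk1 = 7/(7 + 9)
* unexposed: b = 12 cases, d = 2 noncases
scalar risk0 = 12/(12 + 2)

* risk ratio and risk difference
scalar rr = risk1/risk0
scalar rd = risk1 - risk0

display "risk ratio = " %6.4f rr "   risk difference = " %7.4f rd
```

The risk ratio comes to about 0.51 and the risk difference to about -0.42, so exposure is associated with lower risk in this table.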

The basic syntax (ignoring options) for cs is "cs case_var ex_var". case_var contains 1 if the observation represents a case and 0 if it represents a noncase. ex_var contains 0 if the observation represents unexposed and nonzero (e.g., 1) if it represents exposed. Frequency weights are allowed.

. list case exp pop

        case   exp   pop
  1.       0     0     2
  2.       0     1     9
  3.       1     0    12
  4.       1     1     2
  5.       1     1     5

. cs case exp [freq=pop] (output omitted)

If "[freq=pop]" is not specified, each observation contributes 1.
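The frequency-weighted example can be reproduced by entering the five listed observations. This sketch uses only the listed values; the tabulate call is added here just to display the weighted 2x2 table.

```stata
* Enter the five observations from the listing; pop is the frequency weight
clear
input case exp pop
0 0 2
0 1 9
1 0 12
1 1 2
1 1 5
end

* Weighted cross-tabulation, then the cohort-study analysis
tabulate case exp [freq=pop]
cs case exp [freq=pop]
```

The two exposed-case rows add to 7, so the weighted table has a = 7, b = 12, c = 9, d = 2, the same counts as ". csi 7 12 9 2".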

Stratified tables work as with ir. To obtain the Mantel-Haenszel combined risk ratio:

. cs case exposed [freq=pop], by(age)

To obtain internally standardized risk ratio:

. cs case exposed [freq=pop], by(age) istandard

To obtain externally standardized risk ratio:

. cs case exposed [freq=pop], by(age) estandard

To weight each age group equally:

. gen wgt=1
. cs case exposed [freq=pop], by(age) standard(wgt)

Examples: case-control data
----------------------------

cc and cci work just like cs and csi. They differ in that they report the odds ratio rather than the risk ratio.
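As an illustration only, the counts from the csi example can be reused as if they came from a case-control study (an assumption made here, not something the help text does). The odds ratio is then (a*d)/(b*c); or_hand is a made-up scalar name.

```stata
* Counts reused from the csi example, read as case-control data
* (a = 7, b = 12, c = 9, d = 2; this reuse is an illustrative assumption)
cci 7 12 9 2

* Odds ratio by hand: (a*d)/(b*c)
scalar or_hand = (7*2)/(12*9)
display "odds ratio = " %6.4f or_hand
```

The hand calculation gives about 0.13, noticeably further from 1 than the risk ratio of about 0.51 for the same counts, because the outcome is common in this table.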

Examples: matched case-control data
------------------------------------

mcc and mcci work just like cc and cci except that they report different statistics. Stratified tables are not allowed with mcc.

Also see
--------

Manual:  [R] epitab
On-line: help for bitest, ci, clogit, dstdize, immed, logistic, nbreg, poisson, st, stcox, tabulate

11.14 Sample size and power calculations

• The Stata command sampsi performs sample size or power calculations for comparisons of means or proportions

• Also see the free sample size software from Dupont and Plummer, listed under “Other Links” on the course website Home page

• For details, refer to sampsi in the Reference Manual or type

help sampsi

For convenience, the Help text is given below:


. help sampsi

-------------------------------------------------------------------------------
help for sampsi                                            (manual: [R] sampsi)
-------------------------------------------------------------------------------

Sample size and power determination
-----------------------------------

sampsi #1 #2 [, alpha(#) power(#) n1(#) n2(#) ratio(#) sd1(#) sd2(#) onesample onesided ]

Description
-----------

sampsi estimates required sample size or power of tests for comparisons of means or proportions. If n1() or n2() is specified, sampsi computes power; otherwise, it computes sample size. sampsi is an immediate command; all of its arguments are numbers; see help immed.

sampsi computes sample size or power for four types of tests:

1. Two-sample comparison of means. The postulated values of the means are #1 and #2. The postulated standard deviations are sd1() and sd2().

2. One-sample comparison of mean to hypothesized value. Option onesample must be specified. The hypothesized value (null hypothesis) is #1. The postulated mean (alternative hypothesis) is #2. The postulated standard deviation is sd1().

3. Two-sample comparison of proportions. The postulated values of the proportions are #1 and #2.

4. One-sample comparison of proportion to hypothesized value. Option onesample must be specified. The hypothesized proportion (null hypothesis) is #1. The postulated proportion (alternative hypothesis) is #2.

Options
-------

alpha(#) specifies the significance level of the test; the default is alpha(.05). (More correctly, the default is 1-level/100 from set level, see help level.)

power(#) is power of the test. Default is power(.90).

n1(#) specifies the size of the first (or only) sample and n2(#) specifies the size of the second sample. If specified, sampsi reports the power calculation. If not specified, sampsi computes sample size.

ratio(#) is an alternative way to specify n2() in two-sample tests. In a two-sample test, if n2() is not specified, n2() is assumed to be n1()*ratio(). That is, ratio() = n2()/n1(). The default is ratio(1).

sd1(#) and sd2(#) are the standard deviations for comparison of means. If not specified, comparison of proportions is assumed. In two-sample cases, if only sd1() is specified, sd2() is assumed to equal sd1().

onesample indicates a one-sample test. The default is a two-sample test.

onesided indicates a one-sided test. The default is a two-sided test.

Examples
--------

1. Two-sample comparison of mean1 to mean2. Compute sample sizes with n2/n1 = 2:

. sampsi 132.86 127.44, p(0.8) r(2) sd1(15.34) sd2(18.23)

Compute power with n1 = n2, sd1 = sd2, and alpha = 0.01 one-sided:


. sampsi 5.6 6.1, n1(100) sd1(1.5) a(0.01) onesided

2. One-sample comparison of mean to hypothesized value = 180. Compute sample size:

. sampsi 180 211, sd(46) onesam

One-sample comparison of mean to hypothesized value = 0. Compute power:

. sampsi 0 -2.5, sd(4) n(25) onesam

3. Two-sample comparison of proportions. Compute sample size with n1 = n2 (i.e., ratio = 1, the default) and power = 0.9 (the default):

. sampsi 0.25 0.4

Compute power with n1 = 300 and ratio = n2/n1 = 0.5:

. sampsi 0.25 0.4, n1(300) r(0.5)

4. One-sample comparison of proportion to hypothesized value = 0.5:

. sampsi 0.5 0.75, power(0.8) onesample

Compute power:

. sampsi 0.5 0.6, n(200) onesam

Also see
--------

Manual:  [R] sampsi
On-line: help for immed
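The comparison-of-means examples above can be cross-checked with the standard normal-approximation formulas. That sampsi uses exactly these formulas and this rounding is an assumption to verify against its output. The scalar names are made up, and the proportion examples are not covered.

```stata
* Hand checks for three of the means examples (scalar names are illustrative)

* Two-sample size: sampsi 132.86 127.44, p(0.8) r(2) sd1(15.34) sd2(18.23)
* n1 = (sd1^2 + sd2^2/ratio) * (z_alpha/2 + z_power)^2 / (m1 - m2)^2
scalar zsum    = invnormal(1 - 0.05/2) + invnormal(0.80)
scalar n1_hand = ceil((15.34^2 + 18.23^2/2) * zsum^2 / (132.86 - 127.44)^2)
display "two-sample: n1 = " n1_hand "   n2 = 2*n1 = " 2*n1_hand

* Two-sample power: sampsi 5.6 6.1, n1(100) sd1(1.5) a(0.01) onesided
* n2 = n1 and sd2 = sd1 by default; one-sided test uses z_alpha
scalar se_diff = sqrt(1.5^2/100 + 1.5^2/100)
scalar pw_hand = normal(abs(5.6 - 6.1)/se_diff - invnormal(1 - 0.01))
display "two-sample power = " %6.4f pw_hand

* One-sample size: sampsi 180 211, sd(46) onesam
* defaults: alpha = 0.05 two-sided, power = 0.90
scalar zone   = invnormal(1 - 0.05/2) + invnormal(0.90)
scalar n_hand = ceil((zone*46/(211 - 180))^2)
display "one-sample: n = " n_hand
```

Under these formulas the results are n1 = 108, power of about 0.51, and n = 24. Any difference from what sampsi prints would point to a different rounding rule or correction in the command.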
