Elliott Morrell 3e Solutions

  • Instructor Solutions Manual

    for

    Learning SAS in the Computer Lab

    3rd EDITION

    Rebecca J. Elliott, Statistically Significant

    Christopher H. Morrell, Loyola University Maryland

    Prepared by

    Christopher H. Morrell, Loyola University Maryland

  • Printed in the United States of America 1 2 3 4 5 6 7 11 10 09 08 07

    2010 Brooks/Cole, Cengage Learning ALL RIGHTS RESERVED. No part of this work covered by the copyright herein may be reproduced, transmitted, stored, or used in any form or by any means graphic, electronic, or mechanical, including but not limited to photocopying, recording, scanning, digitizing, taping, Web distribution, information networks, or information storage and retrieval systems, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the publisher except as may be permitted by the license terms below.

    For product information and technology assistance, contact us at Cengage Learning Customer & Sales Support,

    1-800-354-9706

    For permission to use material from this text or product, submit all requests online at www.cengage.com/permissions

    Further permissions questions can be emailed to [email protected]

    ISBN-13: 978-0-495-82797-9 ISBN-10: 0-495-82797-5 Brooks/Cole 20 Channel Center Street Boston, MA 02210 USA Cengage Learning is a leading provider of customized learning solutions with office locations around the globe, including Singapore, the United Kingdom, Australia, Mexico, Brazil, and Japan. Locate your local office at: international.cengage.com/region Cengage Learning products are represented in Canada by Nelson Education, Ltd. For your course and learning solutions, visit academic.cengage.com Purchase any of our products at your local college store or at our preferred online store www.ichapters.com

    NOTE: UNDER NO CIRCUMSTANCES MAY THIS MATERIAL OR ANY PORTION THEREOF BE SOLD, LICENSED, AUCTIONED,

    OR OTHERWISE REDISTRIBUTED EXCEPT AS MAY BE PERMITTED BY THE LICENSE TERMS HEREIN.

    READ IMPORTANT LICENSE INFORMATION

    Dear Professor or Other Supplement Recipient: Cengage Learning has provided you with this product (the Supplement) for your review and, to the extent that you adopt the associated textbook for use in connection with your course (the Course), you and your students who purchase the textbook may use the Supplement as described below. Cengage Learning has established these use limitations in response to concerns raised by authors, professors, and other users regarding the pedagogical problems stemming from unlimited distribution of Supplements. Cengage Learning hereby grants you a nontransferable license to use the Supplement in connection with the Course, subject to the following conditions. The Supplement is for your personal, noncommercial use only and may not be reproduced, posted electronically or distributed, except that portions of the Supplement may be provided to your students IN PRINT FORM ONLY in connection with your instruction of the Course, so long as such students are advised that they may not copy or distribute any portion of the Supplement to any third party. You may not sell, license, auction, or otherwise redistribute the Supplement in any form. We ask that you take reasonable steps to protect the Supplement from unauthorized use, reproduction, or distribution. Your use of the Supplement indicates your acceptance of the conditions set forth in this Agreement. If you do not accept these conditions, you must return the Supplement unused within 30 days of receipt. All rights (including without limitation, copyrights, patents, and trade secrets) in the Supplement are and will remain the sole and exclusive property of Cengage Learning and/or its licensors. The Supplement is furnished by Cengage Learning on an "as is" basis without any warranties, express or implied. This Agreement will be governed by and construed pursuant to the laws of the State of New York, without regard to such State's conflict of law rules. Thank you for your assistance in helping to safeguard the integrity of the content contained in this Supplement. We trust you find the Supplement a useful teaching tool.

  • iii

    CONTENTS

    PREFACE ... iv
    MODULE 1: THE BASICS ... 1
    MODULE 2: MORE SAS BASICS ... 4
    MODULE 3: DATA MANAGEMENT ... 7
    MODULE 4: SAS FUNCTIONS ... 10
    MODULE 5: DESCRIPTIVE STATISTICS I ... 12
    MODULE 6: PROC GCHART ... 14
    MODULE 7: DESCRIPTIVE STATISTICS II ... 17
    MODULE 8: GENERATING RANDOM OBSERVATIONS ... 21
    MODULE 9: X-Y PLOTS ... 23
    MODULE 10: ONE SAMPLE TESTS FOR μ, p ... 26
    MODULE 11: TWO SAMPLE T-TESTS ... 31
    MODULE 12: ONE-WAY ANOVA ... 33
    MODULE 13: TWO-WAY ANOVA AND MORE ... 36
    MODULE 14: MODEL CHECKING IN ANOVA ... 38
    MODULE 15: CORRELATIONS ... 41
    MODULE 16: SIMPLE LINEAR REGRESSION ... 43
    MODULE 17: MODEL CHECKING IN REGRESSION ... 46
    MODULE 18: MULTIPLE LINEAR REGRESSION ... 50
    MODULE 19: MULTIPLE REGRESSION-CHOOSING THE BEST MODEL ... 53
    MODULE 20: TESTS FOR CATEGORICAL DATA ... 56
    MODULE 21: NON-PARAMETRIC TESTS ... 60
    MODULE 22: ANALYSIS OF COVARIANCE ... 62
    MODULE 23: LOGISTIC REGRESSION ... 63
    MODULE 24: MATRIX COMPUTATIONS ... 64
    MODULE 25: MACRO VARIABLES AND PROGRAMS ... 66

  • iv

    PREFACE

    This solutions manual provides the SAS code needed for problems in Learning SAS in the Computer Lab, 3rd Edition. There are many possible ways to write programs that will run and generate the desired output. This manual provides one set of solutions.

    In this manual, SAS code will be displayed in a Courier font. Parts of problems (a, b, c, and so on) are often related and should be incorporated in one SAS program. The solution may have program code common to all parts of the problem listed first, followed by code for particular parts listed under a, b, c, and so on. In some cases, more common code follows the code for the parts.

    Problems in the early chapters call for label and title statements as well as the use of PROC FORMAT. Solutions for later chapters do not include these statements although I recommend they be assigned. Students should also be required/strongly encouraged to properly document their SAS program with comments.

    There are many different ways to read the data sets included with the manual. I have used different formats throughout the solutions manual as examples. Instructors may also wish to include some data sets as Microsoft Excel files for the students to read so that students can gain experience reading data in this common format.

    In Learning SAS in the Computer Lab, 3rd Edition, I recommend that SAS code be formatted in ways that make the code easy to read and debug. In order to save space, I have not included such formatting in the solutions.

    For some problems, answers to the statistical questions are provided. This may help to decide which problems to assign.
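    As an illustration of the formatting and documentation recommended above, the following is a minimal sketch of how the solution to Problem 1.1 might be laid out with indentation and comments. The layout and the comment wording are one possible choice made for this example, not a style prescribed by the textbook.

    ```sas
    /* Problem 1.1: read pH, time, and temperature, then list the data. */
    data one;
        input pH time temp;    /* list input: values separated by blanks */
        datalines;
    4.5 20 125
    4.1 22 133
    4.8 18 149
    4.0 26 120
    5.0 25 120
    6.0 21 138
    ;
    run;

    /* Print every variable in the data set just created. */
    proc print data = one;
    run;
    ```

    Naming the data set on the PROC statement is optional here, since PROC PRINT defaults to the most recently created data set, but it makes longer programs easier to debug.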

  • 1

    MODULE 1: THE BASICS

    1.1 data one;
    input pH time temp;
    datalines;
    4.5 20 125
    4.1 22 133
    4.8 18 149
    4.0 26 120
    5.0 25 120
    6.0 21 138
    ;
    run;
    proc print; run;

    1.2 Use the same data step as in 1.1 and then

    proc print; var temp pH; run;

    1.3 data sizes;
    input size $ color $ price shipcost;
    datalines;
    large   red    18.97  0.25
    medium  blue   24.68  1.10
    x-large black  29.99  1.75
    small   orange 15.89  0.90
    ;
    run;
    proc print; var size color price shipcost; run;

    1.4 Use the same data step as in 1.3 and then

    proc print; var color size price; run;

    1.5 data schools;
    input school $ no_teach no_stud;
    datalines;
    granite 5829 200486
    jordan 12433 318992
    davis 2358 126331
    ;
    run;
    proc print; var school no_teach no_stud; run;

  • 2

    1.6 The input statement in 1.1 changes to

    input pH 1-3 time 5-6 temp 8-10; datalines;

    1.7 The input statement in 1.1 changes to

    input @1 pH @5 time @8 temp;

    1.8 The input statement in 1.3 changes to

    input size $ 1-7 color $ 9-14 price 16-20 shipcost 23-26;

    1.9 The input statement in 1.3 changes to

    input @1 size $7. @9 color $6. @16 price 5.2 @23 shipcost 4.2;

    1.10 data appoint;
    input time $ 1-5 person $ 8-12 where $ 15-27 subject $ 29-44 length 48-49;
    datalines;
    11:00  Sally  room 30       personnel review   45
    1:00   Jim    Jim's office  brake design       30
    3:00   Nancy  lab           test results       30
    ;
    run;
    proc print; var time person where subject length; run;

    1.11 The input statement in 1.10 changes to

    input @1 time $5. @8 person $5. @15 where $12. @29 subject $16. @48 length 2.0;

  • 3

    1.12 data popcorn;
    input @1 brand $20. @22 time $4. @27 notpop 3.0;
    datalines;
    Orville Redenbacker  2:15  80
    Orville Redenbacker  2:15  89
    Orville Redenbacker  2:30  57
    Orville Redenbacker  2:30  60
    Orville Redenbacker  2:45  60
    Orville Redenbacker  2:45  46
    Smith's              2:15 170
    Smith's              2:15 147
    Smith's              2:30 196
    Smith's              2:30 114
    Smith's              2:45  98
    Smith's              2:45  90
    Pop Secret           2:15 215
    Pop Secret           2:15  78
    Pop Secret           2:30  98
    Pop Secret           2:30  83
    Pop Secret           2:45  75
    Pop Secret           2:45  65
    ;
    run;
    proc print; run;

  • 4

    MODULE 2: MORE SAS BASICS

    2.1 a data one; infile 'utility.dat';
    input @1 month $3. @5 year 2. phone 9-14 fuel 18-22 elec 25-29;
    if month='Jan' then monthnum=1;
    else if month='Feb' then monthnum=2;
    else if month='Mar' then monthnum=3;
    else if month='Apr' then monthnum=4;
    else if month='May' then monthnum=5;
    else if month='Jun' then monthnum=6;
    else if month='Jul' then monthnum=7;
    else if month='Aug' then monthnum=8;
    else if month='Sep' then monthnum=9;
    else if month='Oct' then monthnum=10;
    else if month='Nov' then monthnum=11;
    else if month='Dec' then monthnum=12;
    totalexp = phone + fuel + elec;
    run;
    proc print; run;

    b Use the same data step as in (a) and then

    proc sort; by year monthnum; run; proc print; by year; var month phone; run;

    c Use the same data step as in (a) and then

    proc sort; by monthnum year; run; proc print; by monthnum; var year phone; run;

    d Use the same data step as in (a) and then

    proc print; where year = 92; run;

    e Use the same data step as in (a) and then

    proc sort data = one; by year; proc print; where month = 'Jan' or month='Feb' or month='Mar'; by year; run;

    f Sort by year and month to compare years across months.

    Sort by month and year to compare months across years.

  • 5

    2.2 a data one; infile 'china#l.dat'; input year total exports imports; deficit = exports - imports; run; proc print; run;

    b data two; set one;

    if 1955

  • 6

    2.5 a, b proc format; value $ktfmt 'o' = 'Overhand' 'f' = 'Figure8'; value rfmt 1 = 'Cotton' 2 = 'Twine' 3 = 'Nylon'; value kdfmt 1 = 'Parallel' 2 = 'Perpendicular'; run; data one; infile 'knots.dat'; input Knot_Type $ 4 Rope 7 Knot_Direction 10 Weight 13-15; Break_Weight=Weight-162; Brk_Wgt_kg=Break_Weight/2.2; format Knot_Type $ktfmt. Rope rfmt. Knot_Direction kdfmt.; run; proc sort; by descending Break_Weight; run; proc print; run;

    2.6 proc format;
    value htnfmt 1='Normotensive' 2='IDH' 3='ISH' 4='Hypertension';
    run;
    data one; infile 'btt.dat';
    input childid sex bweight gestage momage parity mdbp msbp momeduc mmedaid socio dbp5 sbp5 ht5 wt5 hdl5 ldl5 trig5 smoke5 medaid5 socio5;
    bmi5 = wt5/(ht5*ht5);
    /* Check for missing values first: SAS treats a missing value as smaller than any number, so the comparisons below would otherwise assign a category to missing readings. */
    if msbp = . or mdbp = . then htn = .;
    else if msbp >= 140 and mdbp >= 90 then htn = 4;
    else if msbp >= 140 and mdbp < 90 then htn = 3;
    else if msbp < 140 and mdbp >= 90 then htn = 2;
    else if msbp < 140 and mdbp < 90 then htn = 1;
    format htn htnfmt.;
    run;

    a data one10; set one; if _n_

  • 7

    MODULE 3: DATA MANAGEMENT 3.1 data one; infile 'china#l.dat';

    input year 1-4 total 6-10 exports 12-16 imports 18-22; run;
    /* It is first necessary to put data in year order before computing the change in exports or imports */

    a proc sort; by year; run;

    data two; set one;
    /* The next two lines compute change in exports */
    lastyrex = lag(exports); changeex = exports - lastyrex;

    b /* The next two lines compute change in imports */

    lastyrim = lag(imports); changeim = imports - lastyrim; run;

    proc print; var year exports lastyrex changeex imports lastyrim changeim; run;
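    As a possible alternative to the two-step LAG computation in 3.1, the year-to-year change can also be obtained in one step with the DIF function, which returns the current value minus the lagged value. This is only a sketch: it assumes data set one has already been sorted by year as in part a, and the data set name two_dif is a hypothetical name chosen for this example.

    ```sas
    /* Assumes data set ONE exists and is already sorted by year. */
    data two_dif;
        set one;
        changeex = dif(exports);   /* exports - lag(exports) */
        changeim = dif(imports);   /* imports - lag(imports) */
    run;

    proc print data = two_dif;
        var year exports changeex imports changeim;
    run;
    ```

    As with LAG, the first observation has a missing change because there is no previous year to compare against.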

    3.2 data utils; infile 'utility.dat';
    input @1 month $3. @5 year 2.0 phone 9-14 fuel 18-22 elec 25-29;
    if month = 'Jan' then monthnum = 1;
    else if month = 'Feb' then monthnum = 2;
    else if month = 'Mar' then monthnum = 3;
    else if month = 'Apr' then monthnum = 4;
    else if month = 'May' then monthnum = 5;
    else if month = 'Jun' then monthnum = 6;
    else if month = 'Jul' then monthnum = 7;
    else if month = 'Aug' then monthnum = 8;
    else if month = 'Sep' then monthnum = 9;
    else if month = 'Oct' then monthnum = 10;
    else if month = 'Nov' then monthnum = 11;
    else if month = 'Dec' then monthnum = 12;
    run;
    /* Put data in year month order */
    proc sort; by year monthnum; run;

    a data year90; set utils; if year = 90;

    lastmonth = lag(phone); change = phone - lastmonth; run; proc print; var year month phone lastmonth change; run;

    b data winter; set utils; if month = 'Jan';

    lastyr = lag(fuel); change = fuel - lastyr; run; proc print; var month year fuel lastyr change; run;

  • 8

    3.4 data DH; input flavor $ 1-10 height; brand = 'DH';
    datalines;
    DevilsFood 39.0
    DevilsFood 36.5
    White      30.5
    White      34.5
    Yellow     37.0
    Yellow     35.0
    ;
    run;
    data BC; input flavor $ 1-10 height; brand = 'BC';
    datalines;
    Yellow     35.5
    Yellow     36.0
    DevilsFood 35.5
    DevilsFood 37.5
    White      32.5
    White      32.5
    ;
    run;

    a * Concatenate the two data sets ;

    data Cake; set DH BC; file 'Module3-4a.dat'; put flavor $ 1-10 brand $ 12-13 height 15-18 .1; run;

    b * Reformulate Duncan Hines data for match merging ;

    data DH1; set dh; dhht = height; keep flavor dhht; run; proc sort; by flavor; run; * Reformulate Betty Crocker data for match merging; data BC1; set BC; bcht = height; keep flavor bcht; run; proc sort; by flavor; run; data Cake1; merge dh1 bc1; by flavor; file 'Module3-4b.dat'; put flavor $ 1-10 dhht 12-16 .1 bcht 18-22 .1; run;

    3.5 data ml_first25; infile 'moonlake.dat' obs=25;

    input propane 1 naturalgas 2 eeproducts 3 sshacwhs 4 ewrs 5 remr 6 garbage 7 tagto 8 internet 9 hss 10 New 12 OneBill 14 NG 15 Elec 16 PG 17 FuelOil 18 Wood 19 Coal 20 Solar 21 Source 22 AgeHeat 24 TypeWater 25 Agewater 26 HowLng 27 PCHome 34 PCPlan 35 Internet 36 Provider 37 Age 40 Educ 41 Income 42 sex 43; run; proc print; run;

  • 9

    3.6 data ml_26_50; infile 'moonlake.dat' firstobs=26 obs=50;

    Input statement as in 3.5.

    data ml_251_300; infile 'moonlake.dat' firstobs=251 obs = 300;

    Input statement as in 3.5.

    data ml2; set ML_26_50 ML_251_300; run; proc print data = ml2; run;

  • 10

    MODULE 4: SAS FUNCTIONS

    4.1 data well; infile 'well#1.dat';
    input @1 date $8. nitrate zinc TDS;
    month = substr(date,1,3); day = substr(date,4,2);
    run;
    proc print; run;

    4.2 data one; input value;
    posval = abs(value); root = sqrt(posval); newval = sqrt(abs(value));
    datalines;
    2.7
    -6.9
    3.4
    0.5
    1.3
    ;
    run;
    proc print; run;

    4.3 data one; input x;

    a cumprob = probbnml(0.23,13,x);

    b greater = 1 - cumprob;

    c if x = 0 then lessprob = .; else lessprob = probbnml(.23,13,x-1);
    datalines;
    0
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    ;
    run;
    proc print; run;

  • 11

    4.4 data binomial; input x; n = 5; p = 0.40;

    a cdf=probbnml(p, n, x);

    b pdf=cdf-lag(cdf); if x = 0 then pdf = cdf;
    datalines;
    0
    1
    2
    3
    4
    5
    ;
    run;
    proc print; run;
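    The differencing approach in 4.4 b can be cross-checked against the built-in CDF and PDF functions. The sketch below assumes the same n = 5 and p = 0.40; the data set name binomial_check and the variable names cdf_x and pdf_x are hypothetical names used only for this illustration.

    ```sas
    data binomial_check;
        n = 5;
        p = 0.40;
        do x = 0 to n;
            cdf_x = cdf('BINOMIAL', x, p, n);   /* P(X <= x) */
            pdf_x = pdf('BINOMIAL', x, p, n);   /* P(X = x)  */
            output;
        end;
    run;

    proc print data = binomial_check;
    run;
    ```

    The pdf_x column should agree with the pdf values from 4.4 b, since each is the difference of successive cumulative probabilities.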

    4.5 data norm; mu=12.6; sigma=2.3;

    x = 10; z=(x-mu)/sigma; x1 = 15; z1=(x1-mu)/sigma; x2 = 7.6; z2=(x2-mu)/sigma;

    a prob_a=probnorm(z);

    b prob_b=probnorm(z1)-probnorm(z2);

    run; proc print; run;
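    The standardization in 4.5 can also be left to SAS. As a sketch under the same values of mu and sigma, the CDF function with the 'NORMAL' distribution takes the mean and standard deviation directly, so no z-scores are needed. The data set name norm_check is a hypothetical name for this example.

    ```sas
    data norm_check;
        mu = 12.6;
        sigma = 2.3;
        /* Same quantity as probnorm((10 - mu)/sigma) */
        prob_a = cdf('NORMAL', 10, mu, sigma);
        /* Same quantity as probnorm(z1) - probnorm(z2) with x1 = 15 and x2 = 7.6 */
        prob_b = cdf('NORMAL', 15, mu, sigma) - cdf('NORMAL', 7.6, mu, sigma);
    run;

    proc print data = norm_check;
    run;
    ```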

  • 12

    MODULE 5: DESCRIPTIVE STATISTICS I 5.1 data utils; infile 'utility.dat';

    input @1 date $6. @5 year 2.0 phone 9-14 fuel 17-22 elec 25-29; total = phone + fuel + elec; label phone = 'phone costs'

    fuel = 'fuel costs' elec = 'electricity costs' total = 'total utility costs'; run;

    proc univariate plot; var phone fuel elec total; id date; title 'Descriptive Stats for utility Costs'; run;

    Extreme phone costs: Low--Jan92, Jan89, Dec91, Oct92, Jan93. High--May90, Jan91, Apr90, Jan90, Jun90. No outliers.

    Extreme fuel costs: Low--Jul92, Jul90, Aug90, Jul89, Aug89. High--Jan92, Feb89, Jan89, Feb93, Jan92. No outliers.

    Extreme elec costs: Low--Jun92, Sep91, Mar92, Apr90, Mar90. High--Jun89, Nov88, Jan89, Oct88, Dec88. Dec88 is an outlier.

    Extreme total costs: Low--Sep92, Aug92, Oct92, Aug91, May92. High--Jan89, Feb91, Dec88, Jan90, Jan91. No outliers.

    5.2 Use data step as in 5.1 and then

    proc sort; by year; run; proc univariate; by year; var total; id date; title 'Total utility Costs for each Year'; run;

    5.3 proc format; value lsfmt 1 = "Athletic" 2 = "Sedentary"; run;

    data athlete; infile 'athlete.dat'; input sbp 1-3 dbp 6-7 sex $ 10 ls 13; label sbp = 'Systolic Blood Pressure'

    dbp = 'Diastolic Blood Pressure' ls = 'Lifestyle'; format ls lsfmt.; run; proc sort data = athlete; by sex ls; run; * Compare bp's among the 4 sex by lifestyle groups ;

    a proc univariate plots; var dbp; by sex ls;

    title 'Description of diastolic bp by sex and lifestyle'; run;

    b proc univariate plots normal; var sbp;

    probplot sbp / normal(mu = est sigma = est); title 'Checking whether sbp is normal'; run;

  • 13

    5.4 data one; infile 'china#l.dat'; input year 1-4 total 6-10 exports 12-16 imports 18-22; deficit = imports - exports; run; proc univariate plot; var imports exports deficit; id year; title 'Statistics on China''s Trade'; run;

    5.5 proc format;

    value bfmt 1 = 'Duracell' 2 = 'Energizer' 3 = 'Rayovac' 4 = 'Radio Shack'; run;

    data one; infile 'battery.dat'; input brand 1 load 4-6 time 9-11; label brand = 'Battery Brand'

    time = 'Time to discharge'; format brand bfmt.; run; proc boxplot; plot time*brand / boxstyle=schematic cboxes = black; title 'Comparing discharge times among battery brands'; run;

    5.6 data park; infile 'parking.dat';

    input id miles; if miles = 99 then miles = .; label miles = 'Distance live from campus'; run; proc univariate plot; var miles; id id; title 'Descriptive statistics of distance live from campus'; run;

    5.7 data quarterback; infile 'quarterback.dat';

    input player $ 5-22 rating 101-105; label rating = 'Quarterback rating'; run; proc univariate plot; var rating; id player; title 'Descriptive statistics of quarterback ratings'; run;

    5.8 proc format;

    value sfmt 1 = 'Male' 2 = 'Female'; run; data btt; infile 'btt.dat'; input childid 1-4 sex 6 bweight 8-11 gestage 13-14; label bweight = 'Birth weight'

    gestage = 'Gestational age'; format sex sfmt.; run; proc sort; by sex; run; proc univariate plot; var bweight gestage; id childid; by sex; title 'Statistics for birth weight and gestational age by sex'; run;

  • 14

    MODULE 6: PROC GCHART 6.1 data one; infile 'utility.dat';

    input @1 date $char6. @5 year 2.0 phone fuel elec; total = phone + fuel + elec; label phone = 'phone costs'

    fuel = 'fuel costs' elec = 'electricity costs' total = 'total utility costs'; run;

    proc gchart; vbar phone fuel elec total / space = 0; title 'Histograms of utility costs'; run;

    The distributions are right skewed.

    6.2 Use the same data step as in 6.1 and then

    data two; set one; if 90

  • 15

    6.5 proc format; value $sexfmt 'F'='Female' 'M'='Male'; run; data run; infile 'running.dat'; input class sex $ @5 minute1 1.0 @7 second1 2.0 @10 minute2 1.0 @12 second2 2.0; time1 = minute1*60 + second1; time2 = minute2*60 + second2; label class = 'Grade in School'

    time1 = 'Running Time for First Race' time2 = 'Running Time for Second Race';

    format sex $sexfmt.; run; goptions htext = 2; proc gchart data = run; vbar time1 / space = 0 width = 10 midpoints = 70 to 140 by 10; vbar time2 / space = 0 width = 10 midpoints = 70 to 130 by 10; run;

    6.6 proc format;

    value sfmt 1 = 'Natural Gas' 2 = 'Electricity' 3 = 'Propane Gas' 4 = ' ' 5 = 'Wood' 6 = 'Coal';

    value incfmt 1='=$75,000' 6='Refuse'; run;

    data ml; infile 'moonlake.dat'; input propane 1 Source 22 Income 42; label propane = "Interest in purchasing propane (1=Not, 5=Very)"

    Source = "Primary Energy Source for Heat" Income = "Annual Household Income";

    format Source sfmt. Income incfmt.; run; proc gchart data = ml; * The bars for source ordered from highest to lowest; hbar source / midpoints = 1 3 2 5 6 ; hbar propane / midpoints = 1 to 6 by 1; hbar income / midpoints = 1 to 6 by 1; run;

  • 16

    6.7 proc format; value fsfmt 0 = 'Student' 1 = 'Faculty/Staff'; value usrn 1='Usually' 2='Sometimes' 3='Rarely' 4='Never'; run; data park; infile 'parking.dat'; input id miles bus_convenient carpool years status bus Monday Tuesday Wednesday Thursday Friday drive permit meters lots; if id = 400 then fac_staff = 0; if years = 99 then years = .; if bus = 99 then bus = .; if Monday = 99 then Monday = .; if Tuesday = 99 then Tuesday = .; if Wednesday = 99 then Wednesday = .; if Thursday = 99 then Thursday = .; if Friday = 99 then Friday = .; busdays = Monday + Tuesday + Wednesday + Thursday + Friday; if bus = 2 then busdays = 0; if lots = 99 then lots = .; format fac_staff fsfmt. bus yn. lots usrn.; run; proc gchart data = park;

    a hbar years / space = 0 width = 6 midpoints = 1 to 7 by 1;

    c vbar busdays /

    space = 0 width = 10 midpoints = 0 to 5 by 1; run;

    b proc sort data = park; by fac_staff bus; run; proc gchart data = park; hbar lots / midpoints = 1 to 4 by 1; by fac_staff bus; run;

    6.8 proc format;

    value mefmt 1 = '= HS'; run;

    data btt; infile 'btt.dat'; input childid 1-4 momeduc 29 socio 33 socio5 73; format momeduc mefmt.; run; goptions htext = 1;

    a proc gchart data = btt;

    vbar momeduc / midpoints = 1 to 4 by 1; run; goptions htext = 2;

    b proc gchart data = btt;

    hbar socio socio5 / midpoints = 0 to 4 by 1; run;

  • 17

    MODULE 7: DESCRIPTIVE STATISTICS II 7.1 data ml; infile 'moonlake.dat';

    input propane 1 naturalgas 2 eeproducts 3 sshacwhs 4 ewrs 5 remr 6 garbage 7 tagto 8 internet 9 hss 10; run;

    data omitmissing; set ml; if propane = 6 then propane = .; if naturalgas = 6 then naturalgas = .; if eeproducts = 6 then eeproducts = .; if sshacwhs = 6 then sshacwhs = .; if ewrs = 6 then ewrs = .; if remr = 6 then remr = .; if garbage = 6 then garbage = .; if tagto = 6 then tagto = .; if internet = 6 then internet = .; if hss = 6 then hss = .; run; proc means data = omitmissing; var propane naturalgas eeproducts sshacwhs ewrs remr garbage tagto internet hss; run;
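    The ten IF statements in 7.1 all apply the same rule, so one possible alternative is to loop over the variables with an array. This is a sketch only: it assumes data set ml has been created as above, and the data set name omitmissing2, the array name items, and the index k are hypothetical names introduced for this example.

    ```sas
    /* Assumes data set ML exists with the ten survey variables. */
    data omitmissing2;
        set ml;
        array items{*} propane naturalgas eeproducts sshacwhs ewrs
                       remr garbage tagto internet hss;
        do k = 1 to dim(items);
            if items{k} = 6 then items{k} = .;   /* recode 6 to missing */
        end;
        drop k;
    run;

    proc means data = omitmissing2;
        var propane naturalgas eeproducts sshacwhs ewrs
            remr garbage tagto internet hss;
    run;
    ```

    The array version is shorter and makes it harder to mistype one of the variable names in a single IF statement.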

    7.2 proc format;

    value $sexfmt 'F'='Female' 'M'='Male'; run; data running; infile 'running.dat'; input class 1 sex $ 3 min1 5 sec1 7-8 min2 10 sec2 12-13; time1=min1*60+sec1; time2=min2*60+sec2; label time1 = 'Time for Race 1'

    time2 = 'Time for Race 2'; format sex $sexfmt.; run; proc means data = running; class sex class; var time1; run;

    7.3 proc format;

    value lsfmt 1 = "Athletic" 2 = "Sedentary"; run; data athlete; infile 'athlete.dat'; input sbp 1-3 dbp 6-7 sex $ 10 ls 13; label sbp = 'Systolic Blood Pressure'

    dbp = 'Diastolic Blood Pressure' ls = 'Lifestyle';

    format ls lsfmt.; run; proc means data = athlete; class ls; var sbp dbp; run;

    7.4 data golf; infile 'golf.dat';

    input Golfer 1 Compression 3-5 Material 8 Distance ; run; proc means data = golf; class Golfer; var distance; run;

  • 18

    7.5 proc format; value contfmt 0 = 'Not Contaminated' 1 = 'Contaminated'; run; data well; infile 'well#1.dat'; input date $ 1-5 month $ 1-3 day 4-5 year 7-8 nitrate 11-15 .3 zinc 18-22 .3 TDS 25-27; if (nitrate > 0.12) or (zinc > 0.02) or (TDS > 516) then contaminate = 1; else contaminate = 0; format contaminate contfmt.; run;

    a proc freq;

    table contaminate;

    b table contaminate*year; run;

    All of the data in 1990 is contaminated. In 1991, half is contaminated.

    7.6 proc format;

    value outfmt 0 = 'Failure' 1 = 'Success'; run; data one; infile 'survresp.dat'; input incentive n_cont n_treat r_cont r_treat; label n_cont = 'Sample size for control group'

    n_treat = 'Sample size for treatment group' r_cont = 'Response rate for control group' r_treat = 'Response rate for treatment group';

    if r_cont < r_treat then outcome = 1; else outcome = 0; format outcome outfmt.; run; proc freq; tables outcome outcome*incentive; run;

  • 19

    7.7 proc format; value bgfmt 0 = 'Bad' 1 = 'Good'; run;
    data skin; infile 'sclero.dat';
    input clinic id drug thick1 thick2 mobil1 mobil2 assess1 assess2;
    if thick1 > thick2 then r_thick = 1; else r_thick = 0;
    if mobil1 < mobil2 then r_mobil = 1; else r_mobil = 0;
    if assess1 > assess2 then r_assess = 1; else r_assess = 0;
    label r_thick = 'Skin thickening improvement'

    r_mobil = 'Skin mobility improvement' r_assess = 'Patient assessment improvement';

    format r_thick bgfmt. r_mobil bgfmt. r_assess bgfmt. ; run;

    a proc freq; tables clinic; run;

    Clinics #46 and #49 had the largest number of patients in the study.

    b proc freq; tables drug*clinic; run;

    c proc freq data = skin;

    where clinic = 46 or clinic = 48 or clinic = 49; tables clinic*drug*(r_thick r_mobil r_assess); run;

    d proc freq data = skin;

    where drug = 1; tables r_thick*r_assess; run;

    34.38%, 21.88%

  • 20

    7.8 proc format; value sfmt 1 = 'Natural Gas' 2 = 'Electricity' 3 = 'Propane Gas'

    4 = ' ' 5 = 'Wood' 6 = 'Coal'; value agefmt 1='18-34' 2='35-49' 3='50-64' 4='>=65' 5='Refuse'; run; data moonlake; infile 'moonlake.dat'; /* PCHome (column 34, as in 3.5) is read because part e subsets on it */ input propane 1 internet 9 NG 15 Elec 16 PG 17 FuelOil 18 Wood 19 Coal 20 Solar 21 Source 22 PCHome 34 Internet 36 Age 40; label propane = "Interest in purchasing propane (1=Not, 5=Very)"

    Source = "Primary Energy Source for Heat"; format Solar availfmt. Source sfmt. age agefmt.; run; proc freq data = moonlake;

    a table Source ;

    b table NG Elec PG FuelOil Wood Coal Solar;

    c table NG*Propane;

    d table Internet*age; run;

    e proc freq data = moonlake;

    where PCHome = 1; table Internet*age; run;

    7.9 proc format;

    value mefmt 1 = '= HS'; run;

    data btt; infile 'btt.dat'; input momeduc 29 socio 33 socio5 73; format momeduc mefmt.; run; proc freq data = btt; table socio*socio5 socio*momeduc; run;

  • 21

    MODULE 8: GENERATING RANDOM OBSERVATIONS 8.1 data one; do i=1 to 1000; obs = rannor(4241)*20 + 50;

    output; end; run; proc gchart data = one; vbar obs / space = 0; title 'Random samples from N(50,400)'; run;

    8.2 a data one; do i=1 to 50; obs = rannor(70776)*10 + 10;

    output; end; run; proc gchart; vbar obs / space = 0; title 'Random sample of 50 obs of N(10,100)'; run;

    b data two; do i=1 to 500; obs = rannor(70776)*10 + 10;

    output; end; run; proc gchart; vbar obs / space = 0; title 'Random sample of 500 obs of N(10,100)'; run;

    c data three; do i=1 to 5000; obs = rannor(70776)*10 + 10;

    output; end; run; proc gchart; vbar obs / space = 0; title 'Random sample of 5000 obs of N(10,100)'; run;

    8.3 data exp; do i = 1 to 1000; x = ranexp(6664)/7;

    output; end; run; proc gchart; vbar x / space = 0 width = 6; title 'An exponential distribution with lambda=7'; run;

    8.4 data poisson; do i = 1 to 700; y = ranpoi(9001, 5);

    output; end; run; proc gchart; vbar y / space = 0; title 'A Poisson distribution with mean=5'; run;

    8.5 data bin; do i = 1 to 500; xval = ranbin(2721, 40, 0.2);

    output; end; run; proc gchart; vbar xval / space = 0 midpoints = 0 to 20 by 1; title 'A Binomial distribution with n=40 and p=0.2'; run;

  • 22

    8.6 data new; do i = 1 to 1000; x1 = ranexp(434911)/7; x2 = ranexp(434911)/7; x3 = ranexp(434911)/7; x4 = ranexp(434911)/7; x5 = ranexp(434911)/7; x6 = ranexp(434911)/7; x7 = ranexp(434911)/7; x8 = ranexp(434911)/7; x9 = ranexp(434911)/7; x10= ranexp(434911)/7; average = (x1+x2+x3+x4+x5+x6+x7+x8+x9+x10)/10; output; end; run; proc gchart; vbar average / space = 0 width = 6; title 'Distribution of average of exponential r.v.''s '; run;
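    The ten separate assignments in 8.6 can also be written as an inner loop that accumulates a running total. The following is a sketch of that idea rather than the manual's solution: the data set name new_loop and the variables j and total are hypothetical names used for this example, and the generated values are not claimed to match those from the code above.

    ```sas
    data new_loop;
        do i = 1 to 1000;
            total = 0;
            do j = 1 to 10;
                /* one exponential value with rate 7, added to the sum */
                total = total + ranexp(434911)/7;
            end;
            average = total/10;
            output;              /* one sample mean per outer iteration */
        end;
        drop j total;
    run;

    proc gchart data = new_loop;
        vbar average / space = 0 width = 6;
        title 'Distribution of average of exponential r.v.''s';
    run;
    ```

    Changing the sample size then only requires editing the upper bound of the inner loop and the divisor.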

    8.7 data uniform; do i = 1 to 1000;

    val1 = ranuni(887890)*10 + 10; val2 = ranuni(887890)*10 + 10; val3 = ranuni(887890)*10 + 10; val4 = ranuni(887890)*10 + 10; val5 = ranuni(887890)*10 + 10; val6 = ranuni(887890)*10 + 10; val7 = ranuni(887890)*10 + 10; val8 = ranuni(887890)*10 + 10; val9 = ranuni(887890)*10 + 10; val10 = ranuni(887890)*10 + 10; ave = (val1+val2+val3+val4+val5+val6+val7+val8+val9+val10)/10; output; end; run; proc gchart; vbar ave / space = 0 width = 6; title 'Distribution of average of a Uniform r.v. on (10,20)'; run;

  • 23

    MODULE 9: X-Y PLOTS 9.1 data utility; infile 'utility.dat';

    input month $ 1-3 year 5-6 phone 9-15 fuel 17-22 elec 25-29; total=phone + fuel + elec; if month = 'Jan' then mnth = 1; else if month = 'Feb' then mnth = 2; else if month = 'Mar' then mnth = 3; else if month = 'Apr' then mnth = 4; else if month = 'May' then mnth = 5; else if month = 'Jun' then mnth = 6; else if month = 'Jul' then mnth = 7; else if month = 'Aug' then mnth = 8; else if month = 'Sep' then mnth = 9; else if month = 'Oct' then mnth = 10; else if month = 'Nov' then mnth = 11; else mnth = 12; if 89

  • 24

    9.3 data well; infile 'well#8.dat'; input @1 month $3. @4 day 2. @7 year 2. zinc; if month = 'Jan' then mo = 1; else if month = 'Feb' then mo = 2; else if month = 'Mar' then mo = 3; else if month = 'Apr' then mo = 4; else if month = 'May' then mo = 5; else if month = 'Jun' then mo = 6; else if month = 'Jul' then mo = 7; else if month = 'Aug' then mo = 8; else if month = 'Sep' then mo = 9; else if month = 'Oct' then mo = 10; else if month = 'Nov' then mo = 11; else if month = 'Dec' then mo = 12; format date date7. ; date = mdy (mo, day, year) ; run; proc sort; by year mo day; run; goptions csymbol = black; symbol1 value = dot i = join; proc gplot; by year; plot zinc*date; title 'Zinc concentrations over time'; run;

    9.4 data one; infile 'handinj.dat';

    input id $ type $ dayslost cost; label dayslost = 'Days of work lost'

    cost = 'Cost in Irish pounds'; run; goptions csymbol = black htext = 2; symbol1 value = dot; proc gplot; plot dayslost*cost; title 'Lost work days vs. cost'; run;

    9.5 data two; infile 'survresp.dat';

    input incentive n_cont n_treat r_cont r_treat; improve =(r_treat - r_cont)/r_cont; label n_cont = 'Sample size for control group'

    n_treat = 'Sample size for treatment group' r_cont = 'Response for control group' r_treat = 'Response for treatment group' improve = 'Improvement in response rate'; run;

    goptions csymbol = black htext = 2; symbol1 value = dot; proc gplot; plot improve*incentive; title 'Improvement in response vs. types of incentive'; run;

  • 25

    9.6 data athlete; infile 'athlete.dat'; input sbp 1-3 dbp 6-7 sex $ 10 ls 13; label sbp = 'Systolic Blood Pressure'

    dbp = 'Diastolic Blood Pressure'; run; goptions csymbol = black htext = 2; symbol1 value = dot; proc gplot; plot sbp*dbp; title 'Plot of systolic vs. diastolic blood pressure'; run;

    9.7 data injury; infile 'injury.dat';

    input year 1-4 burns 6-10 amputations 12-16; run; goptions csymbol = black; symbol1 value = dot i = join; symbol2 value = star i = join line = 2; axis1 label = ('Injuries'); legend1 label = (H = 1.5 cell) value = (H = 1.5 cell); proc gplot; plot burns*year=1 amputations*year=2 /

    overlay vaxis=axis1 legend=legend1; title 'Plot of burns and amputations by year'; run;

    9.8 proc format;

    value $efmt 's' = 'Southern' 'n' = 'Northern'; run; data trees; infile 'trees.dat'; input location $ 1 elevation 3-6 damage 8-9; format location $efmt.; run; goptions csymbol = black htext = 2; symbol1 value = 'n'; symbol2 value = 's'; proc gplot data = trees; plot damage*elevation = location; run;

    9.9 data quarterback; infile 'quarterback.dat';

    input YdsPerGame 63-67 TD 70-71 Int 74-75 Rating 101-105; run; goptions csymbol = black htext = 2; symbol1 value = dot; proc gplot; plot Rating*(YdsPerGame TD Int); run;

  • 26

    MODULE 10: ONE SAMPLE TESTS FOR μ, p

    10.1 data well; infile 'well#1.dat';

    input @11 nitrate 4. @18 zinc 5. @25 tds 3.; testnitr = nitrate - 0.1; testzinc = zinc - 0.01; testtds = tds - 475; run; proc means n mean std t probt; var testnitr testzinc testtds; run;

    a p-value = 0.22725

    b p-value = 0.05735

    c p-value = 0.0001
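The solution to 10.1 tests each mean by subtracting the hypothesized value and testing the shifted variable against zero. As a hedged alternative sketch (not the manual's method), the H0= option of proc ttest states the null value directly. It assumes SAS/STAT is available and reuses the file name and input layout from 10.1.

```sas
/* Sketch only: same file and column layout as the 10.1 solution. */
data well;
  infile 'well#1.dat';
  input @11 nitrate 4. @18 zinc 5. @25 tds 3.;
run;

/* H0= gives the hypothesized mean, so no shifted variable is needed.
   Each variable has its own null value, hence one step per variable. */
proc ttest data = well h0 = 0.1;
  var nitrate;
run;

proc ttest data = well h0 = 0.01;
  var zinc;
run;

proc ttest data = well h0 = 475;
  var tds;
run;
```

Like the probt value from proc means, the p-value printed by default is two-sided.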

    10.2 data utils; infile 'utility.dat';

    input @9 phone 6. @17 fuel 6. @25 elec 5.; testphone = phone - 50; testelec = elec - 30; run; proc means n mean std t probt; var testphone testelec; run;

    a p-value < 0.00005

    b p-value = 0.0498

    10.3 data running; infile 'running.dat';

    input class 1 sex $ 3 min1 5 sec1 7-8 min2 10 sec2 12-13; time1=min1*60+sec1; time2=min2*60+sec2; testt1_78=time1-78; testt2_95=time2-95; label time1 = 'Time for Race 1'

    time2 = 'Time for Race 2'; run; proc means data = running n mean std t probt; where sex = 'F'; var testt1_78 testt2_95; run;

    a p-value = 0.0217

    b p-value = 0.03435

  • 27

    10.4 data debate; infile 'debate.dat'; input id school gender compare argue research reason speak; if compare = 1 then debate_more =1; else debate_more =0; if compare = . then debate_more = .; if argue = 1 then argue_very =1; else argue_very =0; if argue = . then argue_very = .; if research = 1 then research_very =1; else research_very =0; if research = . then research_very = .; if reason = 1 then reason_very =1; else reason_very =0; if reason = . then reason_very = .; if speak = 1 then speak_very =1; else speak_very =0; if speak = . then speak_very = .; run; proc freq;

    a tables debate_more / chisq testp = (0.25, 0.75);

    c tables argue_very / chisq testp = (0.2, 0.8);

    e tables research_very / chisq testp = (0.25, 0.75);

    f tables reason_very / chisq testp = (0.05, 0.95); run;

    a) p = 0.771, p-value = 0.3857/2 = 0.19285. c) p = 0.853, p-value = 0.0187. e) p = 0.722, p-value = 0.2564/2 = 0.1282. f) p = 0.893, p-value < 0.0001.

    b proc freq; where school = 8;

    tables debate_more / chisq testp = (0.25, 0.75); run;

    d proc freq; where gender = 1; tables argue_very / chisq testp = (0.2, 0.8); run;

    g proc freq; where gender=2 and school=9;

    tables speak_very / chisq testp = (0.25, 0.75); run;

    b) p = 0.887, p-value= 0.0127/2 = 0.00635. d) p = 0.881, p-value=0.0058. g) p = 0.708, p-value = 0.6374/2 = 0.3187.
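Several answers in 10.4 halve the p-value printed by proc freq. A plausible reading, assumed here because the exercise statements are not reproduced, is that those parts pose a one-sided alternative while the chi-square test reports a two-sided p-value. When the sample proportion falls on the side favored by the alternative, the conversion is:

```latex
\[
p_{\text{one-sided}} = \frac{p_{\text{two-sided}}}{2},
\qquad \text{for example, in part (a): } \frac{0.3857}{2} = 0.19285 .
\]
```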

  • 28

    10.5 data src; infile 'src.dat'; input @8 environ 2. @18 plant_an 2. @30 employ 2. @55 libcon 1.; if environ in (8, 9, 10) then env_strong = 1; else if 1

  • 29

    10.7 data bball; input baskets; datalines; 12 8 11 10 12 6 10 14 12 8 12 12 6 8 12 15 13 9 11 10 ; run; proc means data = bball n mean std clm; var baskets; run;

    Note: The clm option tells proc means to compute a confidence interval for the mean.
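By default clm produces a 95% interval. A minimal sketch of how a different confidence level could be requested with the ALPHA= option follows; the 90% level is an illustrative choice, not part of the exercise, and the trailing @@ is added here so that all values can sit on one data line.

```sas
data bball;
  input baskets @@;   /* @@ holds the line so several values are read from it */
  datalines;
12 8 11 10 12 6 10 14 12 8 12 12 6 8 12 15 13 9 11 10
;
run;

/* alpha = 0.10 requests a 90% confidence interval for the mean */
proc means data = bball n mean std clm alpha = 0.10;
  var baskets;
run;
```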

  • 30

    10.8 proc format; value sfmt 1 = 'Male' 2 = 'Female'; run; data btt; infile 'btt.dat'; input childid 1-4 sex 6 bweight 8-11 gestage 13-14; testgage=gestage - 266/7; if bweight < 2500 then low_wt = 1; else low_wt = 0; if bweight = . then low_wt = .; testbwgt = bweight - 3332; format sex sfmt.; run; proc freq data = btt;

    a tables sex / chisq testp = (0.5, 0.5);

    c tables low_wt / chisq testp = (0.918, 0.082); run;

    b, d proc means data = btt n mean std t probt;

    var testgage testbwgt; run;

    a) Girls = 0.475, p-value = 0.4552. b) t = 1.96, p-value = 0.0518. c) p = 0.055, p-value = 0.1517. d) t = 5.29, p-value is

  • 31

    MODULE 11: TWO SAMPLE T-TESTS

    11.1 data lens; infile 'cataract.dat';

    input type $ astig; run; proc ttest; class type; var astig; run;

    Variances unequal: t=2.00, p-value=0.0724.

    11.2 data gas; infile 'gas.dat';

    input @43 trans $1. @45 mileage 4.; run; proc ttest; class trans; var mileage; run;

    Variances unequal: t=4.03, p-value=0.0051/2 = 0.00255.

    11.3 data grades; infile 'grades.dat';

    input @5 gender $1. @25 final 3.0; run; proc ttest; class gender; var final; run;

    Variances equal: t=1.00, p-value=0.3229.

    11.4 data hands; infile 'handinj.dat';

    input @7 type $5. @13 days 2.0 @16 cost 4.0; run; proc ttest; class type; var days cost; run;

    a Variances unequal: t=1.08, p-value=0.2904. b Variances unequal: t=0.68, p-value=0.5039.

    11.5 data src; infile 'src.dat';

    input @6 gender $1. @8 environ 2.0; run; proc ttest; class gender; var environ; run;

    Variances equal: t=0.33, p-value=0.7391.

    11.6 data robots; infile 'robot.dat';

    input put_humn put_robt qul_humn qul_robt; put_diff = put_humn - put_robt; qul_diff = qul_humn - qul_robt; run; proc means n mean std t prt; var put_diff qul_diff; run;

    a Paired t-test: t=2.63, p-value=0.0340. b Paired t-test: t=1.96, p-value=0.0914.
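The solution to 11.6 builds the paired differences in the data step and tests them with proc means. As a hedged alternative sketch (not the manual's method), the PAIRED statement of proc ttest forms the same differences internally. It assumes SAS/STAT is available and reuses the file name and variable names from 11.6.

```sas
/* Sketch only: same file and variables as the 11.6 solution. */
data robots;
  infile 'robot.dat';
  input put_humn put_robt qul_humn qul_robt;
run;

/* Each a*b pair is analyzed as the difference a - b,
   matching put_diff and qul_diff in the original solution. */
proc ttest data = robots;
  paired put_humn*put_robt qul_humn*qul_robt;
run;
```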

  • 32

    11.7 proc format; value $sexfmt 'F'='Female' 'M'='Male'; run; data running; infile 'running.dat'; input class 1 sex $ 3 min1 5 sec1 7-8 min2 10 sec2 12-13; time1=min1*60+sec1; time2=min2*60+sec2; label time1 = 'Time for Race 1'

    time2 = 'Time for Race 2'; format sex $sexfmt.; run; proc ttest data = running; class sex; var time1 time2; run;

    Time1: Variances unequal: t=2.31, p-value=0.0411. Time2: Variances equal: t=2.33, p-value=0.0336.

    11.8 proc format;

    value lsfmt 1 = "Athletic" 2 = "Sedentary"; run; data athlete; infile 'athlete.dat'; input sbp 1-3 dbp 6-7 sex $ 10 ls 13; label sbp = 'Systolic Blood Pressure'

    dbp = 'Diastolic Blood Pressure' ls = 'Lifestyle';

    format ls lsfmt.; run; proc ttest data = athlete; class ls; var dbp sbp; run;

    DBP: Variances equal: t=2.02, p-value=0.0503. SBP: Variances unequal: t=5.75, p-value

  • 33

    MODULE 12: ONE-WAY ANOVA

    12.1 data one; infile 'taillite.dat';

    input @13 zone 2. @4 truck 1. @17 response 3. @7 group 1.; run ;

    a data zone30; set one; if zone = 30 and group = 1; run; proc glm; class truck; model response = truck; means truck / tukey lines; run;

    F=16.4, p-value

  • 34

    12.4 data airplanes; infile 'airplanes.dat' delimiter = ','; input design $ paper $ hang_time; run;

    a proc anova data = airplanes; class design;

    model hang_time = design; means design / snk lines ; run;

    F = 9.91, p-value=0.0004, design groups: glide vs. dart, sonic.

    b proc anova data = airplanes; class paper; model hang_time = paper; means paper / snk lines ; run;

    F = 1.29, p-value=0.2954.

    12.5 data popcorn; input @1 brand $20. @22 time $4. @27 notpop 3.0;

    datalines;
    Orville Redenbacker  2:15 80
    Orville Redenbacker  2:15 89
    Orville Redenbacker  2:30 57
    Orville Redenbacker  2:30 60
    Orville Redenbacker  2:45 60
    Orville Redenbacker  2:45 46
    Smith's              2:15 170
    Smith's              2:15 147
    Smith's              2:30 196
    Smith's              2:30 114
    Smith's              2:45 98
    Smith's              2:45 90
    Pop Secret           2:15 215
    Pop Secret           2:15 78
    Pop Secret           2:30 98
    Pop Secret           2:30 83
    Pop Secret           2:45 75
    Pop Secret           2:45 65
    ;
    run; proc anova data = popcorn; class brand; model notpop = brand; means brand / tukey; run;

    F = 4.30, p-value=0.0334, Smiths = Pop Secret, Pop Secret = Orville Redenbacker

  • 35

    12.7 proc format; value bfmt 1 = 'Duracell' 2 = 'Energizer'

    3 = 'Rayovac' 4 = 'Radio Shack'; run; data battery; infile 'battery.dat'; input Brand 1 load 4-6 time 9-11; format Brand bfmt.; run;

    a proc anova data = battery;

    class brand; model time = brand; means brand; run;

    F = 0.02,p-value=0.9950.

    b proc anova data = battery;

    class load; model time = load; means load / snk; run;

    F=996.24, p-value

  • 36

    MODULE 13: TWO-WAY ANOVA AND MORE

    13.1 data one; infile 'taillite.dat';

    input @4 type 1. @7 group 1. @10 position 1. @13 zone 2. @17 response 3. @23 follow 2.; run;

    a proc glm; class group type;

    model response = group type group*type; means group type / tukey lines; run;

    Group and type are significant. The interaction is not. Type groupings: 4 vs. 3, 1, 2. Group: F=4.63, p-value=0.0317. Type: F=9.38, p-value

  • 37

    13.4 data calls; infile 'calls.dat'; input week shift day $ number; run; proc glm; class shift day; model number = shift day shift*day; means shift day; run;

    Model: F=0.95, p-value=0.5087. Shift: F=2.26, p-value=0.1080; Day: F=1.00, p-value=0.4119. SxD: F=0.60, p-value=0.7788.

    13.5 proc format;

    value bfmt 1 = 'Duracell' 2 = 'Energizer' 3 = 'Rayovac' 4 = 'Radio Shack'; run;

    data battery; infile 'battery.dat'; input Brand 1 load 4-6 time 9-11; format Brand bfmt.; run; proc glm data = battery; class load brand; model time = load brand load*brand; run;

    No interaction: p = 0.8208. No effect of brand on time: p = 0.1117. Significant effect of load on time: p < 0.0001.

    13.6 data airplanes; infile 'airplanes.dat' delimiter = ',';

    input design $ paper $ hang_time; run; proc glm data = airplanes; class paper design; model hang_time = paper design paper*design; run;

    Significant interaction: p = 0.0149.

    13.10 data btt; infile 'btt.dat';

    input childid 1-4 sex 6 msbp 25-27 mmedaid 31 socio 33; run; proc glm data = btt; class mmedaid socio; model msbp = socio mmedaid socio*mmedaid; means socio*mmedaid; run;

    socio: F = 2.28, p-value=0.0623. mmedaid: F = 6.05, p-value=0.0147. socio*mmedaid: F = 5.94, p-value=0.0156.

  • 38

    MODULE 14: MODEL CHECKING IN ANOVA

    14.1 data one; infile 'taillite.dat';

    input @4 type 1. @7 group 1. @10 position 1. @13 zone 2. @17 response 3. @23 follow 2.; run;

    a proc glm; class group type; model response = group type group*type; output out=new p=yhat student = sresid; run; goptions csymbol = black htext = 2; symbol1 value = dot; proc gplot data = new; plot sresid*yhat / vref = 0; plot sresid*group / vref = 0; plot sresid*type / vref = 0; run; proc univariate data = new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

    b proc glm; class group zone; model response = group zone group*zone; output out=new p=yhat student = sresid; run; goptions csymbol = black htext = 2; symbol1 value = dot; proc gplot data = new; plot sresid*yhat / vref = 0; plot sresid*group / vref = 0; plot sresid*zone / vref = 0; run; proc univariate data = new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

    c proc glm; class group position; model response = group position group*position; output out=new p=yhat student = sresid; run; goptions csymbol = black htext = 2; symbol1 value = dot; proc gplot data = new; plot sresid*yhat / vref = 0; plot sresid*group / vref = 0; plot sresid*position / vref = 0; run; proc univariate data = new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

    14.2 data one; infile 'brownie.dat'; input day pan $ mix $ width; run; proc glm; class pan mix; model width = pan mix pan*mix; output out=new p=yhat student = sresid; run; goptions csymbol = black htext = 2; symbol1 value = dot; proc gplot data = new; plot sresid*yhat / vref = 0; plot sresid*pan / vref = 0; plot sresid*mix / vref = 0; run; proc univariate data = new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

  • 39

    14.3 data wear; infile 'wear.dat'; input grit $ 1-5 cut wear; run; proc glm; class grit cut; model wear = grit cut grit*cut; output out=new p=yhat student = sresid; run; goptions csymbol = black htext = 2; symbol1 value = dot; proc gplot data = new; plot sresid*yhat / vref = 0; plot sresid*grit / vref = 0; plot sresid*cut / vref = 0; run; proc univariate data = new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

    14.4 data calls; infile 'calls.dat';

    input week shift day $ number; run; proc glm; class shift day; model number = shift day shift*day; output out=new p=yhat student = sresid; run; goptions csymbol = black htext = 2; symbol1 value = dot; proc gplot data = new; plot sresid*yhat / vref = 0; plot sresid*shift / vref = 0; plot sresid*day / vref = 0; run; proc univariate data = new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

    14.5 proc format;

    value bfmt 1 = 'Duracell' 2 = 'Energizer' 3 = 'Rayovac' 4 = 'Radio Shack'; run;

    data battery; infile 'battery.dat'; input Brand 1 load 4-6 time 9-11; format Brand bfmt.; run; proc glm data = battery; class load brand; model time = load brand load*brand; output out=new p=yhat student = sresid; run; goptions csymbol = black htext = 2; symbol1 value = dot; proc gplot data = new; plot sresid*yhat / vref = 0; plot sresid*load / vref = 0; plot sresid*brand / vref = 0; run; proc univariate data = new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

  • 40

    14.10 data btt; infile 'btt.dat'; input childid 1-4 sex 6 msbp 25-27 mmedaid 31 socio 33; run; proc glm data = btt; class mmedaid socio; model msbp = socio mmedaid socio*mmedaid; means socio*mmedaid; output out=new p=yhat student = sresid; run; goptions csymbol = black htext = 2; symbol1 value = dot; proc gplot data = new; plot sresid*yhat / vref = 0; plot sresid*socio / vref = 0; plot sresid*mmedaid / vref = 0; run; proc univariate data = new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

    14.11 proc format; value sfmt 1 = 'Male' 2 = 'Female'; run;

    data btt; infile 'btt.dat'; input childid 1-4 sex 6 bweight 8-11 momeduc 29; format sex sfmt.; run; proc glm data = btt; class momeduc; model bweight = momeduc; means momeduc / hovtest = levene; output out=new p=yhat student = sresid; run; goptions csymbol = black htext = 2; symbol1 value = dot; proc gplot data = new; plot sresid*yhat / vref = 0; plot sresid*momeduc / vref = 0; run; proc univariate data = new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

    Constant variance assumption OK: p-value = 0.3738. Normality of residuals OK: p-value = 0.5631. Plots all look OK.

  • 41

    MODULE 15: CORRELATIONS

    15.1 data one; infile 'electric.dat';

    input house income air index number load; run; proc corr; var house number index income; run;

    a No, p-value

  • 42

    15.5 data utils; infile 'utility.dat'; input @9 phone 6. @17 fuel 6. @25 elec 5.; run; proc corr; var phone fuel elec; run;

    Fuel and electricity.

    15.6 proc format; value $sexfmt 'F'='Female' 'M'='Male'; run;

    data running; infile 'running.dat'; input class 1 sex $ 3 min1 5 sec1 7-8 min2 10 sec2 12-13; time1=min1*60+sec1; time2=min2*60+sec2; label time1 = 'Time for Race 1'

    time2 = 'Time for Race 2'; format sex $sexfmt.; run; proc corr data = running; var time1 time2; run; proc sort data = running; by sex class; run; proc corr data = running; var time1 time2; by sex class; run;

    15.8 data quarterback; infile 'quarterback.dat';

    input Rank 1-2 Player $ 5-22 Team $ 25-27 Comp 30-32 Att 35-37 Pct 40-43 AttPerGame 46-49 Yds 52-55 Avg 58-60 YdsPerGame 63-67 TD 70-71 Int 74-75 FirstDown 77-80 FirstDownPct 83-86 Over20 89-90 Over40 93-94 Sack 97-98 Rating 101-105; run; proc corr data = quarterback; var rating comp pct yds int sack; run;

    Percent completed has the highest correlation with quarterback rating followed by total yards and number of completions.

  • 43

    MODULE 16: SIMPLE LINEAR REGRESSION

    16.1 goptions csymbol = black htext = 2; symbol1 value = dot;

    data one; infile 'bonescor.dat'; input index ccratio csi width score pct; run; proc reg; model score = pct; plot score*pct; run;

    a Yhat = 4.845 + .0253x, R2=0.0864. c Bone score and % young normal do not appear to be linearly related.

    16.2 data one; infile 'electric.dat';

    input house income air applindx number peakload; run;

    a proc reg; model peakload = air; plot peakload*air; run;

    Yhat = 2.265 + 0.742x, R2=0.8598.

    b proc reg;

    model peakload = applindx; plot peakload*applindx; run;

    Yhat=-0.729 + 0.947x, R2=0.7851.

    c proc reg;

    model peakload = number; plot peakload*number; run;

    Yhat=4.809 - 0.0581x, R2=0.0045.

  • 44

    16.3 data one; infile 'gas.dat'; input disp power torque ratio axle barrel speed clen cwid cwt trans mileage; run;

    a proc reg;

    model power = disp; plot power*disp ; run;

    Yhat = 33.5 + 0.362x, R2=0.8848.

    b proc reg;

    model torque = disp; plot torque*disp ; run;

    Yhat = 15.48 + .7085x, R2=0.9793.

    c proc reg;

    model torque= power; plot torque*power ; run;

    Yhat = -27.835 + 1.794x, R2=0.9300.

    d proc reg;

    model mileage = disp; plot mileage*disp ; run;

    Yhat=33.49 - 0.0471x, R2=0.7601.

    e proc reg;

    model mileage = torque; plot mileage*torque ; run;

    Yhat=33.996 - 0.064x, R2=0.7214.

    f proc reg;

    model mileage = power; plot mileage*power; run;

    Yhat=35.35 - 0.112x, R2=0.6345.

    16.4 data one; infile 'electric.dat';

    input house income air applindx number peakload; run;

    a proc reg; model peakload = air / clm; run;

    b proc reg;

    model peakload = applindx / cli; run;

  • 45

    16.5 data one; infile 'gas.dat'; input disp power torque ratio axle barrel speed clen cwid cwt trans mileage; run;

    a proc reg;

    model power = disp / clm; run;

    b proc reg; model torque = disp / cli; run;

    c proc reg;

    model torque= power / clm; run;

    d proc reg; model mileage = disp / cli; run;

    e proc reg;

    model mileage = torque / clm; run;

    f proc reg; model mileage = power / cli; run;

    16.6 data quarterback; infile 'quarterback.dat';

    input Rank Player $ 5-22 Team $ 25-27 Comp Att Pct AttPerGame Yds Avg YdsPerGame TD Int FirstD FirstDP P20 P40 Sck Rate; run; proc reg data = quarterback; model rate = pct / clm cli; plot rate*pct; title 'Regression model for Quarterback Rating'; run;

    a Yhat = -60.92505 + 2.340x, R2=0.6331. c The line appears to fit the trend in the data very well.

  • 46

    MODULE 17: MODEL CHECKING IN REGRESSION

    17.1 goptions csymbol = black htext = 2;

    symbol1 value = dot; data one; infile 'bonescor.dat' ; input index ccratio csi width score pct; run; proc reg; model score = pct; plot score*p.; plot student.*p.; plot student.*pct; output out = new p = yhat r = resid student = sresid; run; proc univariate data = new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

    17.2 data one; infile 'electric.dat' ;

    input house income air applindx number peakload; run;

    a proc reg data = one; model peakload = air; plot peakload*p.; plot student.*p.; plot student.*air; output out = new p = yhat r = resid student = sresid; run; proc univariate data = new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

    There may be some curvature in the residual plots. A linear model may not be appropriate.

    b proc reg data = one;

    model peakload = applindx; plot peakload*p.; plot student.*p.; plot student.*applindx; output out = new p = yhat r = resid student = sresid; run; proc univariate data = new normal ; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

    Assumptions appear to be satisfied.

    c proc reg data = one;

    model peakload = number; plot peakload*p.; plot student.*p.; plot student.*number; output out = new p = yhat r = resid student = sresid; run; proc univariate data = new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

    Assumptions appear to be satisfied.

  • 47

    17.3 data one; infile 'gas.dat'; input disp power torque ratio axle barrel speed clen cwid cwt trans mileage; run;

    a proc reg data = one; model power = disp;

    plot power*p.; plot student.*p.; plot student.*disp; output out = new p = yhat r = resid student = sresid; run; proc univariate data = new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

    Assumptions appear to be satisfied.

    b proc reg data = one; model torque = disp;

    plot torque*p.; plot student.*p.; plot student.*disp; output out = new p = yhat r = resid student = sresid; run; proc univariate data = new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

    There may be increasing variation.

    c proc reg data = one; model torque= power;

    plot torque*p.; plot student.*p.; plot student.*power; output out = new p = yhat r = resid student = sresid; run; proc univariate data = new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

    There may be some curvature in the residuals.

    d proc reg data = one; model mileage = disp;

    plot mileage*p.; plot student.*p.; plot student.*disp; output out = new p = yhat r = resid student = sresid; run; proc univariate data = new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

    Assumptions appear to be satisfied.

    e proc reg data = one; model mileage = torque;

    plot mileage*p.; plot student.*p.; plot student.*torque; output out = new p = yhat r = resid student = sresid; run; proc univariate data = new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

    Assumptions appear to be satisfied.

  • 48

    f proc reg data = one; model mileage = power; plot mileage*p.; plot student.*p.; plot student.*power; output out = new p = yhat r = resid student = sresid; run; proc univariate data = new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

    Assumptions appear to be satisfied.

    17.4 data one; infile 'grades.dat';

    input id $ sex $ class $ quiz exam1 exam2 lab finalexam; run;

    a proc reg data=one; model finalexam = exam1; plot finalexam*p.; plot student.*p.; plot student.*exam1; output out = new p = yhat r = resid student = sresid; run; proc univariate data = new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

    Data contains an outlier.

    b proc reg data=one;

    model finalexam = exam2; plot finalexam*p.; plot student.*p.; plot student.*exam2; output out = new p = yhat r = resid student = sresid; run; proc univariate data = new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

    Data contains an outlier.

    c proc reg data=one;

    model finalexam = quiz; plot finalexam*p.; plot student.*p.; plot student.*quiz; output out = new p = yhat r = resid student = sresid; run; proc univariate data = new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

    Assumptions appear to be satisfied.

  • 49

    17.5 data quarterback; infile 'quarterback.dat'; input Rank Player $ 5-22 Team $ 25-27 Comp Att Pct AttPerGame Yds Avg YdsPerGame TD Int FirstD FirstDP P20 P40 Sck Rate; run; proc reg data = quarterback; model rate = pct; plot rate*p.; plot student.*p.; plot student.*pct; output out = new p = yhat r = resid student = sresid; run; proc univariate data = new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

    There may be increasing variation.

  • 50

    MODULE 18: MULTIPLE LINEAR REGRESSION

    18.1 goptions csymbol = black htext = 2; symbol1 value = dot;

    data one; infile 'gas.dat'; input @45 mileage 4. @7 power 3. @38 car_wt 4. @11 torque 3. @1 disp 5.; run;

    a proc reg; model mileage = power car_wt torque;

    output out=new1 p=yhat student=resid; run; proc gplot data=new1; plot mileage*yhat; title 'Mileage vs. yhat'; plot resid*yhat / vref = 0; title 'Resid vs. yhat'; plot resid*power / vref = 0; title 'Resid vs. x1'; plot resid*car_wt / vref = 0; title 'Resid vs. x2'; plot resid*torque / vref = 0; title 'Resid vs. x3'; run; proc univariate data=new1 normal; var resid; probplot resid / normal (mu = est sigma = est) square; title 'Probability plot of residuals'; run ;

    b proc reg data=one; model mileage = power disp torque;

    output out=new2 p=yhat2 student=resid2; run; proc gplot data=new2; plot mileage*yhat2; title 'Mileage vs. yhat'; plot resid2*yhat2 / vref = 0; title 'Resid vs. yhat'; plot resid2*power / vref = 0; title 'Resid vs. x1'; plot resid2*disp / vref = 0; title 'Resid vs. x2'; plot resid2*torque / vref = 0; title 'Resid vs. x3'; run; proc univariate data = new2 normal; var resid2; probplot resid2 / normal (mu = est sigma = est) square; title 'Probability plot of residuals'; run ;

    18.2 data one; infile 'grades.dat';

    input id $ sex $ class $ quiz exam1 exam2 lab final; run; proc reg; model final = quiz exam1 exam2 lab; plot final*p.; plot student.*p.; plot student.*quiz; plot student.*exam1; plot student.*exam2; plot student.*lab; output out=new p=yhat student=sresid; title 'Multiple Regression Model and Model Checking Plots'; run;

    There appears to be a low outlier.

    proc univariate data=new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; title 'Probability plot of residuals'; run;

    With a p-value of 0.0239 for the normality test, the residuals may be nonnormal.

  • 51

    18.3 data one; infile 'electric.dat'; input house income air_cond index fam_num peakload; run; proc reg; model peakload = house income air_cond fam_num; plot peakload*p.; plot student.*p.; plot student.*house; plot student.*income; plot student.*air_cond; plot student.*fam_num; output out = new p=yhat student=sresid; title 'Multiple Regression Model and Model Checking Plots'; run; proc univariate data=new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; title 'Probability plot of residuals'; run;

    Number in family: t=1.401, p-value=0.1668. Family number is not needed in the model. The model assumptions appear valid.

    18.4 data prod; input prod temp light;

    templight=temp*light; datalines;
    45 64 60
    49 64 65
    47 66 60
    57 66 65
    48 68 60
    53 68 65
    51 70 60
    54 70 65
    56 72 60
    64 72 65
    ;
    run; proc reg data = prod; model prod = temp light templight; run;

    Interaction not significant; eliminate it.

    proc reg data = prod; model prod = temp light; plot prod*p.; plot student.*p.; plot student.*temp; plot student.*light; output out = new p=yhat student=sresid; run; proc univariate data=new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; title 'Probability plot of residuals'; run;

    R2 = 79.92%, model p-value = 0.0163; Residuals are normal: p-value = 0.8184; Plots: may be increasing variation, curvature vs. temp, and increasing variability with light.

  • 52

    18.7 goptions csymbol = black htext = 2; symbol1 value = dot; data quarterback; infile 'quarterback.dat'; input Rank 1-2 Player $ 5-22 Team $ 25-27 Comp 30-32 Att 35-37 Pct 40-43 AttPerGame 46-49 Yds 52-55 Avg 58-60 YdsPerGame 63-67 TD 70-71 Int 74-75 FirstDown 77-80 FirstDownPct 83-86 Over20 89-90 Over40 93-94 Sack 97-98 Rating 101-105; run; proc reg data = quarterback; model rating = comp pct yds int sack; plot rating*p.; plot student.*p.; plot student.*comp; plot student.*pct; plot student.*yds; plot student.*int; plot student.*sack; output out=new p=yhat student=sresid; run; proc univariate data=new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

    R2 = 0.9433, model p-value < 0.0001. Sack is not needed in the model: p-value = 0.4234. Residuals not quite normal: p-value = 0.0137. Otherwise residual plots look OK.

  • 53

    MODULE 19: MULTIPLE REGRESSION - CHOOSING THE BEST MODEL

    19.1 data one; infile 'gas.dat';

    input disp power torque ratio axle barrels speeds car_ln car_wd car_wt trans mileage; run; proc reg; model mileage = disp power torque speeds car_wt car_ln / influence collin spec ; run;

    19.2 data one; infile 'gas.dat';

    input disp power torque ratio axle barrels speeds car_ln car_wd car_wt trans mileage; run;

    a proc reg;

    model mileage = disp power torque speeds car_wt car_ln / selection=stepwise; run;

    b proc reg;

    model mileage = disp power torque speeds car_wt car_ln / selection=backward; run;

    c proc reg;

    model mileage = disp power torque speeds car_wt car_ln / selection=forward; run;

    d proc reg;

    model mileage = disp power torque speeds car_wt car_ln / selection=rsquare cp adjrsq mse; run;

    19.5 data grades; infile 'grades.dat';

    input quiz 9-10 exam1 12-14 exam2 16-18 lab 20-22 final 25-27; run; proc reg; model final = exam1 exam2 quiz lab / influence collin spec ; run;

  • 54

    19.6 data grades; infile 'grades.dat'; input quiz 9-10 exam1 12-14 exam2 16-18 lab 20-22 final 25-27; run;

    a proc reg; model final = exam1 exam2 quiz lab / selection = stepwise; run;

    b proc reg; model final = exam1 exam2 quiz lab / selection = backward; run;

    c proc reg; model final = exam1 exam2 quiz lab / selection = forward; run;

    d proc reg; model final = exam1 exam2 quiz lab / selection = rsquare cp adjrsq mse; run;

    19.8 data pharmacy; infile 'pharmacy.dat';

    input pharmacy 1-2 volume 5-6 floor_space 9-12 rx_space 15-16 parking 19-20 shop_center 23 income 26-27; format shop_center scfmt.; sc_fs=shop_center*floor_space; sc_rxs=shop_center*rx_space; sc_p=shop_center*parking; sc_i=shop_center*income; run;

    * Model allowing for interactions between shopping center and other variables; proc reg data = pharmacy; model volume = floor_space rx_space parking shop_center income sc_fs sc_rxs sc_p sc_i; run;

    None of the interactions is significant. After sequentially eliminating non-significant terms, we obtain the final model.

    goptions csymbol = black htext = 2; symbol1 value = dot; symbol2 value = square; * Final model; proc reg data = pharmacy; model volume = floor_space rx_space; plot volume*p.; output out=new p=yhat student=sresid; run; proc univariate data=new normal; var sresid; probplot sresid / normal (mu = est sigma = est) square; run;

    Normality of residuals OK.

    proc gplot data = new; plot sresid*yhat=shop_center / vref = 0; plot sresid*floor_space=shop_center / vref = 0; plot sresid*rx_space=shop_center / vref = 0; run;

    Residual plots all look OK.
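The sequential elimination described in 19.8 was presumably carried out by hand. A rough sketch of how proc reg could automate a similar search is shown below. It assumes the pharmacy data set from 19.8 already exists with the interaction variables sc_fs, sc_rxs, sc_p and sc_i, and the 0.05 stay level is an illustrative choice rather than a value taken from the manual.

```sas
/* Sketch only: requires the pharmacy data set built in 19.8. */
proc reg data = pharmacy;
  model volume = floor_space rx_space parking shop_center income
                 sc_fs sc_rxs sc_p sc_i
        / selection = backward   /* drop one term at a time            */
          slstay = 0.05;         /* keep a term only if p-value < 0.05 */
run;
```

Automated backward elimination treats every term alike, so it may drop a main effect while keeping an interaction that involves it. The selected model should therefore be checked against the hand-built one before it is trusted.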

  • 55

    19.12
    data quarterback;
      infile 'quarterback.dat';
      input Rank 1-2 Player $ 5-22 Team $ 25-27 Comp 30-32 Att 35-37
            Pct 40-43 AttPerGame 46-49 Yds 52-55 Avg 58-60 YdsPerGame 63-67
            TD 70-71 Int 74-75 FirstDown 77-80 FirstDownPct 83-86
            Over20 89-90 Over40 93-94 Sack 97-98 Rating 101-105;
    run;

    a  proc reg data = quarterback;
         model rating = Comp Att Pct AttPerGame Yds Avg YdsPerGame TD Int
                        FirstDown FirstDownPct Over20 Over40 Sack
               / selection = stepwise;
       run;

    Using defaults, variables in final model: Avg, TD, Int, Pct, Att, Comp.

    b  proc reg data = quarterback;
         model rating = Comp Att Pct AttPerGame Yds Avg YdsPerGame TD Int
                        FirstDown FirstDownPct Over20 Over40 Sack
               / selection = backward;
       run;

    Using defaults, variables in final model: Pct, AttPerGame, Yds, Avg, YdsPerGame, TD, Int, FirstDown, FirstDownPct, Over40.

    c  proc reg data = quarterback;
         model rating = Comp Att Pct AttPerGame Yds Avg YdsPerGame TD Int
                        FirstDown FirstDownPct Over20 Over40 Sack
               / selection = forward;
       run;

    Using defaults, variables in final model: Avg, TD, Int, Pct, Over40, Att, Comp, YdsPerGame, AttPerGame, Yds.
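The three methods in 19.12 end with different models partly because PROC REG uses different default significance levels for each: SLENTRY = 0.50 for forward selection, SLSTAY = 0.10 for backward elimination, and 0.15 for both in stepwise selection. The sketch below reruns forward selection with an explicit entry level. The 0.15 cutoff is an illustrative assumption, not a value from the original solution, and the quarterback data set from 19.12 is assumed to exist.

```sas
/* Sketch only: forward selection with an explicit entry threshold.        */
/* slentry = 0.15 is an assumed value, chosen to match the stepwise default. */
proc reg data = quarterback;
  model rating = Comp Att Pct AttPerGame Yds Avg YdsPerGame TD Int
                 FirstDown FirstDownPct Over20 Over40 Sack
        / selection = forward slentry = 0.15;
run;
quit;
```

With a stricter entry level, forward selection would be expected to admit fewer variables than it does with the 0.50 default.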


    MODULE 20: TESTS FOR CATEGORICAL DATA

    20.1
    data debate;
      infile 'debate.dat';
      input id school gender compare argue research reason speak;
      if school = 3 or school = 5 or school = 6 or school = 8;
      if research = 2 or research = 3 then research = 4;
    run;

    proc freq data=debate;
      tables (research reason speak argue)*school / chisq expected;
    run;

    20.2 Use the data step from 20.1 and then

    data skyline;
      set debate;
      if school = 8;
    run;

    proc freq data=skyline;
      tables gender*(compare argue research reason speak) / chisq expected;
    run;

    20.3
    proc format;
      value gfmt 1 = 'Female' 5 = 'Male';
    run;

    data src;
      infile 'src.dat';
      input id gender environ quality air health plants jobslost pop jobs
            hours income age party libcon;
      if environ =1 or environ =2 or environ =3 then env =1;
      else if 4


    20.5
    proc format;
      value fsfmt 0 = 'Student' 1 = 'Faculty/Staff';
      value yn 1 = 'Yes' 2 = 'No';
      value yndk 1 = 'Yes' 2 = 'No' 3 = 'Dont Know';
      value statfmt 1='OnCampus' 2='OffCampus' 3='On+Off' 4='DontWork';
      value perfmt 1 = 'No' 2 = 'Yearly' 3 = 'Quarterly';
      value usrn 1='Usually' 2='Sometimes' 3='Rarely' 4='Never';
    run;

    data park;
      infile 'parking.dat';
      input id miles bus_convenient carpool years status bus Monday Tuesday
            Wednesday Thursday Friday drive permit meters lots;
      if id = 400 then fac_staff = 0;
      if bus_convenient = 99 then bus_convenient = .;
      if id = . then fac_staff = .;
      if miles = 99 then miles = .;
      if carpool = 99 then carpool = .;
      if years = 99 then years = .;
      if status = 99 then status = .;
      if bus = 99 then bus = .;
      if Monday = 99 then Monday = .;
      if Tuesday = 99 then Tuesday = .;
      if Wednesday=99 then Wednesday = .;
      if Thursday = 99 then Thursday = .;
      if Friday = 99 then Friday = .;
      if drive = 99 then drive = .;
      if permit = 99 then permit = .;
      if meters = 99 then meters = .;
      if lots = 99 then lots = .;
      format fac_staff fsfmt. bus yn. bus_convenient yndk. status statfmt.
             permit perfmt. meters usrn. lots usrn.;
    run;

       proc freq data = park;
    b    table fac_staff*bus_convenient/chisq expected cellchi2;
    d    table fac_staff*meters/chisq expected cellchi2;
    f    table bus*permit/chisq expected cellchi2;
       run;

       proc sort data =park;
         by fac_staff;
       run;

       proc freq data =park;
         table bus*permit/chisq expected cellchi2;
         by fac_staff;
       run;


    20.6
    proc format;
      value mfmt 1 = 'Never' 2 = 'Occasional' 3 = 'Regular';
      value pfmt 1 = 'Neither' 2 = 'One' 3 = 'Both';
    run;

    data a;
      input s_marijuana p_alc_drug count;
      datalines;
    1 1 141
    1 2 68
    1 3 17
    2 1 54
    2 2 44
    2 3 11
    3 1 40
    3 2 51
    3 3 19
    ;
    run;

    proc freq data = a;
      table p_alc_drug*s_marijuana / chisq expected cellchi2;
      weight count;
      format s_marijuana mfmt. p_alc_drug pfmt.;
    run;

    20.7
    proc format;
      value agefmt 1 = '15-54' 2 = '55-64' 3 = '65-74' 4 = 'Over 74';
      value locfmt 1 = 'Home' 2 = 'Acute-Care' 3 = 'Chronic-Care';
    run;

    data a;
      input Age Location Count;
      datalines;
    1 1 94
    1 2 418
    1 3 23
    2 1 116
    2 2 524
    2 3 34
    3 1 156
    3 2 581
    3 3 109
    4 1 138
    4 2 558
    4 3 238
    ;
    run;

    proc freq data = a;
      table age*location / chisq expected cellchi2;
      weight count;
      format age agefmt. location locfmt.;
    run;


    20.8
    proc format;
      value sfmt 1 = 'Male' 2 = 'Female';
    run;

    data btt;
      infile 'btt.dat';
      input childid 1-4 sex 6 momeduc 29 mmedaid 31 socio 33
            smoke5 69 medaid5 71 socio5 73;
      format sex sfmt.;
    run;

       proc freq data = btt;
    a    tables sex / chisq testp = (0.5, 0.5);
    b    tables momeduc*socio / chisq expected;
    c    tables smoke5*socio5 / chisq expected;
    d    tables medaid5*socio5 / chisq expected;
       run;

    Note: In (b), (c), and (d) there are many low expected frequencies. Combining socioeconomic categories 3 and 4 into a single category may help. Fisher's exact test is a better solution, but the computations can take a long time if the sample size is large.
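Both remedies mentioned in the note can be sketched as follows. This is an illustration rather than part of the original solution. It assumes the btt data set from 20.8 exists, and the names btt2, socio_c, and socio5_c are hypothetical. The FISHER option on the TABLES statement requests the exact test for tables larger than 2 x 2.

```sas
/* Sketch only. Assumes the btt data set from 20.8 already exists.  */
/* btt2, socio_c and socio5_c are hypothetical names.               */
data btt2;
  set btt;
  socio_c  = socio;
  socio5_c = socio5;
  if socio  = 4 then socio_c  = 3;   /* merge category 4 into category 3 */
  if socio5 = 4 then socio5_c = 3;
run;

proc freq data = btt2;
  /* Chi-square test on the table with merged categories */
  tables momeduc*socio_c / chisq expected;
  /* Exact test on an original table, which may be slow for large samples */
  tables smoke5*socio5 / fisher;
run;
```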


    MODULE 21: NON-PARAMETRIC TESTS

    21.1
    data one;
      infile 'taillite.dat';
      input id type group position zone resptime folltime;
    run;

    a  proc npar1way wilcoxon data = one;
         where zone = 30;
         class type;
         var resptime;
       run;

    p-value


    21.4
    proc format;
      value lsfmt 1 = 'Athletic' 2 = 'Sedentary';
    run;

    data athlete;
      infile 'athlete.dat';
      input sbp 1-3 dbp 6-7 sex $ 10 ls 13;
      label sbp = 'Systolic Blood Pressure'
            dbp = 'Diastolic Blood Pressure'
            ls = 'Lifestyle';
      format ls lsfmt.;
    run;

    proc npar1way wilcoxon data = athlete;
      class sex;
      var sbp dbp;
    run;

    SBP: p-value = 0.0366. SBP is significantly different between males and females.
    DBP: p-value < 0.0001. DBP is significantly different between males and females.

    proc sort data = athlete;
      by sex;
    run;

    * Check normality assumption that would be needed for t-test;
    proc univariate normal data = athlete;
      var sbp dbp;
      by sex;
    run;

    SBP: Shapiro-Wilk p-values: Female = 0.0455, Male = 0.0170. SBP is not quite normal.
    DBP: Shapiro-Wilk p-values: Female = 0.5903, Male = 0.4913. DBP is normal.
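Because the normality check does not reject for DBP in either group, a two-sample t-test is a reasonable companion to the Wilcoxon test for that variable. The sketch below is not part of the original solution and assumes the athlete data set from 21.4 exists.

```sas
/* Sketch only: two-sample t-test for DBP by sex.        */
/* Assumes the athlete data set from 21.4 already exists. */
proc ttest data = athlete;
  class sex;
  var dbp;
run;
```

For SBP, where normality is doubtful, the Wilcoxon result above remains the safer summary.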

    21.6
    data btt;
      infile 'btt.dat';
      input childid 1-4 bweight 8-11 momeduc 29;
    run;

    proc npar1way wilcoxon anova data = btt;
      class momeduc;
      var bweight;
    run;

    Kruskal-Wallis p-value = 0.0881. ANOVA p-value = 0.0931.


    MODULE 22: ANALYSIS OF COVARIANCE

    22.1
    data one;
      infile 'gas.dat';
      input @45 mileage 4. @43 trans $1. @25 speeds $1. @38 car_wt 4. @11 torque 3.;
    run;

    a  proc glm;
         class trans speeds;
         model mileage = trans speeds car_wt / solution;
       run;

    b  proc glm;
         class trans speeds;
         model mileage = trans speeds torque / solution;
       run;

    22.2
    data two;
      infile 'dummy.dat';
      input species $ 1 impactor $ 3-5 stiff1 stiff2 calcium magnesium;
    run;

    a  proc glm;
         class species impactor;
         model stiff1 = species impactor calcium / solution;
       run;

    b  proc glm;
         class species impactor;
         model stiff1 = species impactor magnesium;
       run;

    22.4
    proc format;
      value sfmt 1 = 'Male' 2 = 'Female';
    run;

    data btt;
      infile 'btt.dat';
      input childid 1-4 sex 6 bweight 8-11 gestage 13-14 mmedaid 31;
      format sex sfmt.;
    run;

    proc glm data = btt;
      class sex mmedaid;
      model bweight = sex mmedaid sex*mmedaid gestage;
    run;


    MODULE 23: LOGISTIC REGRESSION

    23.1
    proc format;
      value fsfmt 0 = 'Student' 1 = 'Faculty/Staff';
      value yn 1= 'Yes' 2 = 'No';
      value yndk 1= 'Yes' 2 = 'No' 3 = 'Dont Know';
    run;

    data park;
      infile 'parking.dat';
      input id miles bus_convenient carpool years status bus;
      if id = 400 then fac_staff = 0;
      if carpool = 99 then carpool = .;
      if years = 99 then years = .;
      if bus = 99 then bus = .;
      if bus_convenient = 99 then bus_convenient = .;
      format fac_staff fsfmt. bus yn. bus_convenient yndk.;
    run;

    a  proc logistic data = park;
         model bus_convenient = fac_staff years;
       run;

    b  proc logistic data = park;
         model bus = fac_staff years;
       run;

    c  proc logistic data = park;
         model carpool = fac_staff years;
       run;

    23.4
    proc format;
      value sfmt 1 = 'Male' 2 = 'Female';
    run;

    data btt;
      infile 'btt.dat';
      input childid 1-4 sex 6 bweight 8-11 gestage 13-14 momage 16-17
            parity 19 mdbp 21-23 msbp 25-27 mmedaid 31;
      format sex sfmt.;
    run;

    a  proc logistic data = btt;
         model sex = bweight gestage parity;
       run;

       proc logistic data = btt;
         model sex = parity;
       run;

    b  proc logistic data = btt;
         model mmedaid = bweight gestage momage parity mdbp msbp;
       run;

       proc logistic data = btt;
         model mmedaid = momage;
       run;


    MODULE 24: MATRIX COMPUTATIONS

    24.1 to 24.3 require the following initial creation of matrices A, B, and C.

    proc iml;
      A = { 2 1 0 3, -1 0 2 4, 4 -2 7 0};
      B = {-4 3 5 1, 2 2 1 -1, 3 2 -4 5};
      C = {5, 4, 8};
      print A B C;

    24.1
    a  D = A+B;
    b  E = A-B;
    c  F = A#B;
    d  G = A/B;
       print D E F G;

    24.2
    a  H = A//B;
    b  I = A||B;
    c  J = A(|,3|);
    d  K = B(|2,|);
    e  L = B(|1:2,3:4|);
       print H I J K L;

    24.3
    a  M=T(B);
    b  D=A*t(B);
    c  N=det(D);
    d  O=trace(D);
    e  P = diag(D); * Note diag produces a diagonal matrix;
       Q = vecdiag(D);
    f  R = solve(D, C);
       print M D, N O, P Q R;
    quit;


    24.4
    data a;
      input x1 x2 x3;
      datalines;
    1 4 0.2
    1 5 0.2
    1 6 0.2
    1 7 0.2
    1 4 0.3
    1 5 0.3
    1 6 0.3
    1 7 0.3
    1 4 0.4
    1 5 0.4
    1 6 0.4
    1 7 0.4
    run;

    proc iml;
      use A; * To make data set A available within proc iml;
      read all var {x1 x2 x3} into X;
      Y = {4.3, 5.5, 6.8, 8.0, 4.0, 5.2, 6.6, 7.5, 2.0, 4.0, 5.7, 6.5};
      I12=I(12);
      J12=J(12, 12, 1);
      print X Y I12 J12;

    a  B=inv(X`*X)*X`*Y;
    b  A=X*B;
    c  C=Y`*Y-Y`*J12*Y/12;
    d  D=Y`*Y-B`*X`*Y;
    e  E=Y-X*B;
    f  F=C-D;
    g  G=D/9;
    h  H=X*inv(X`*X)*X`;
    k  K=Y`*(I12-H)*Y;
    l  L=Y`*(H-J12/12)*Y;
    m  M=G*inv(X`*X);
    n  N=sqrt(diag(M));
    o  O=(I12-H)*Y;
       print B A C E, D F G, H, K L M N O;

      * Create a SAS data set containing a matrix for use in 24.5;
      create ydata from y[colname={y}];
      append from y;
    quit;
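The quantities computed in 24.4 correspond to the standard pieces of a least-squares regression of Y on X, where x1 is a column of ones (the intercept), n = 12 observations, and p = 3 coefficients. The labels below are the usual textbook interpretations and are not stated in the original solution. I and J denote the 12 x 12 identity and all-ones matrices (I12 and J12 in the code).

```latex
\begin{align*}
B &= (X^{\top}X)^{-1}X^{\top}Y && \text{least-squares coefficient estimates} \\
A &= XB && \text{fitted values} \\
C &= Y^{\top}Y - \tfrac{1}{12}\,Y^{\top}JY && \text{corrected total sum of squares (SST)} \\
D &= Y^{\top}Y - B^{\top}X^{\top}Y && \text{error sum of squares (SSE)} \\
E &= Y - XB && \text{residuals} \\
F &= C - D && \text{regression sum of squares (SSR)} \\
G &= D/9 && \text{mean squared error, } \mathrm{SSE}/(n-p) \\
H &= X(X^{\top}X)^{-1}X^{\top} && \text{hat matrix} \\
K &= Y^{\top}(I-H)Y && \text{SSE again, so } K = D \\
L &= Y^{\top}\bigl(H - \tfrac{1}{12}J\bigr)Y && \text{SSR again, so } L = F \\
M &= G\,(X^{\top}X)^{-1} && \text{estimated covariance matrix of } B \\
N &= \sqrt{\operatorname{diag}(M)} && \text{standard errors of } B \text{ on the diagonal} \\
O &= (I-H)Y && \text{residuals again, so } O = E
\end{align*}
```

These values can be checked against the PROC REG output produced in 24.5.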

    24.5
    data reg;
      merge a ydata;
    run;

    proc reg data = reg;
      model y = x2 x3 / p r influence;
    run;


    MODULE 25: MACRO VARIABLES AND PROGRAMS

    25.1
    goptions csymbol = black htext = 2;

    proc format;
      value lsfmt 1 = "Athletic" 2 = "Sedentary";
      value $sfmt 'M' = 'Male' 'F'= 'Female';
    run;

    data athlete;
      infile 'athlete.dat';
      input sbp 1-3 dbp 6-7 sex $ 10 ls 13;
      label sbp = 'Systolic Blood Pressure'
            dbp = 'Diastolic Blood Pressure'
            ls = 'Lifestyle';
      format ls lsfmt. sex $sfmt.;
    run;

    %macro boxt(data, y, x);
          proc sort data = &data;
            by &x;
          run;
    (i)   proc boxplot data = &data;
            plot &y*&x / boxstyle=schematic;
          run;
    (ii)  proc ttest data = &data;
            class &x;
            var &y;
          run;
    %mend boxt;

    a  %boxt(athlete, sbp, sex);
    b  %boxt(athlete, dbp, sex);
    c  %boxt(athlete, sbp, ls);
    d  %boxt(athlete, dbp, ls);


    25.2
    goptions csymbol = black htext = 2;
    symbol1 value = dot;
    symbol2 value = square;

    data elec;
      infile 'electric.dat';
      input hs 1-3 fi 6-11 acc 14-16 ai 19-23 fm 26-28 phl 31-35;
      label hs = 'House Size'
            fi = 'Family Income'
            acc = 'Air Conditioning Capacity'
            phl = 'Peak Hour Load';
    run;

    %MACRO simplereg(data, yvar, xvar);
    (i)    proc gplot data = &data;
             plot &yvar * &xvar;
             title "Plot of &yvar vs. &xvar";
           run;
    (ii)   proc corr data = &data;
             var &yvar &xvar;
             title "Correlation of &yvar vs. &xvar";
           run;
    (iii)  proc reg data = &data;
             model &yvar = &xvar;
    (iv)     plot &yvar * p. p.*p. / overlay;
             plot student.*p.;
             plot student.* &xvar;
             title "Regression of &yvar vs. &xvar and model-checking plots";
           run;
    %MEND simplereg;

    a  %simplereg(elec, phl, hs);
    b  %simplereg(elec, phl, fi);
    c  %simplereg(elec, phl, acc);

    Contents
    1: The Basics
    2: More SAS Basics
    3: Data Management
    4: SAS Functions
    5: Descriptive Statistics I
    6: Proc gchart
    7: Descriptive Statistics II
    8: Generating Random Observations
    9: X-Y Plots
    10: One Sample Tests
    11: Two Sample T-Tests
    12: One-Way ANOVA
    13: Two-Way ANOVA and More
    14: Model Checking in ANOVA
    15: Correlations
    16: Simple Linear Regression
    17: Model Checking in Regression
    18: Multiple Linear Regression
    19: Multiple Regression Choosing the Best Model
    20: Tests for Categorical Data
    21: Non-Parametric Tests
    22: Analysis of Covariance
    23: Logistic Regression
    24: Matrix Computations
    25: Macro Variables and Programs