Tips and Tricks Group... · 2016-03-11 · Tips & Tricks With lots of help from other SUG and...

Tips & Tricks

With lots of help from other SUG and SUGI presenters

SAS CALGARY USER GROUP

OCTOBER 12, 2012

Sorting

THREADS

Multi-threading is available if your computer has more than one processor (CPU)

Divide and conquer: calculations are split up and run in parallel on the processors
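
A minimal sketch of how threading might be checked and requested. It assumes a SAS 9 session; sashelp.class is the sample data set shipped with SAS, and work.class_by_age is a hypothetical output name.

```sas
/* Show the current threading settings in the log */
proc options option=threads;
run;
proc options option=cpucount;
run;

/* Let threaded procedures use every CPU that SAS detects */
options threads cpucount=actual;

/* THREADS can also be requested for a single sort */
proc sort data=sashelp.class out=work.class_by_age threads;
    by age;
run;
```

On a data set this small the threaded sort is unlikely to be any faster; the point is only to show where the options go.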

SORTING

• Performed in memory or using a utility file

– The ‘details’ option reports which one was used

• Up to three copies of the data exist during the sort

– Original data set

– Temporary sorted files

– Sorted version of the data set

• The original file is not overwritten until the sort is complete, unless you specify the ‘overwrite’ option
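
The two options above could be tried as follows. This is a sketch only: work.class_by_age and work.class_copy are hypothetical names, and sashelp.class stands in for a real table.

```sas
/* DETAILS: the log notes whether the sort ran in memory
   or needed utility files */
proc sort data=sashelp.class out=work.class_by_age details;
    by age;
run;

/* Make a scratch copy so the OVERWRITE demo cannot damage real data */
data work.class_copy;
    set sashelp.class;
run;

/* OVERWRITE (no OUT=): the input may be deleted before the sorted
   copy is written, which saves disk space */
proc sort data=work.class_copy overwrite;
    by age;
run;
```

Because ‘overwrite’ gives up the safety net of keeping the original until the sort completes, it is probably best reserved for data that can be recreated.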

SORTING - TAGSORT

• Useful when there is not enough disk space to sort a large data set.

• Only the sort keys (listed on the by statement) and the observation number (= tags) are sorted.

• Tags are used to retrieve records from the input data set in sorted order.

• If the sort keys are small relative to the total record length, temporary disk use is reduced.

• Requires reading each observation twice.

• Single-threaded.

• Considerably slower.
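
For reference, ‘tagsort’ is simply added to the proc sort statement. A sketch using the sashelp.class sample table (work.class_tagsorted is a hypothetical output name):

```sas
/* TAGSORT: only the BY variable and the observation number go into
   the temporary files; full records are fetched afterwards */
proc sort data=sashelp.class out=work.class_tagsorted tagsort;
    by name;
run;
```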

THE ‘NOEQUALS’ OPTION AND SQL

Default sorting behavior (‘equals’) is to retain the original order of records within each ‘by’ group

‘noequals’ does not guarantee that order

Faster, unless you are relying on the order being retained

SQL does the equivalent of ‘noequals’
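
A sketch of the two cases, using the sashelp.class sample table and hypothetical output names. In neither result should the order of records within each value of sex be relied on.

```sas
/* NOEQUALS: records with the same BY value may come out in any order */
proc sort data=sashelp.class out=work.by_sex noequals;
    by sex;
run;

/* PROC SQL: ORDER BY likewise gives no guarantee within ties */
proc sql;
    create table work.by_sex_sql as
    select *
    from sashelp.class
    order by sex;
quit;
```

If the order within a group matters, one option is to add more variables to the by statement (or the order by clause) until the order is fully determined.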

COMPRESSION AND SORTING

• SAS compress removes repeating blanks, characters, and numbers from each observation

• Adds a tag containing the information needed to uncompress the observation

• Sorting may be faster on a compressed data set (less I/O)

• Warning: compress can result in a data set that is larger than the original (even in v9)

– The size of the tag may be larger than the space saved

SAVING DISK SPACE

COMPRESS (AGAIN)

options compress = yes;

• Requires CPU time to uncompress each observation prior to use

• Data sets may grow rather than shrink

• Probably not a good idea to apply compression automatically

data mydata (compress = yes);

• Check the log to see whether the file shrank
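
A complete step that could be run to see the log message. The sample table sashelp.class and the output name work.class_compressed are assumptions for illustration.

```sas
/* Compress one data set only, rather than setting OPTIONS COMPRESS=YES
   for the whole session */
data work.class_compressed (compress=yes);
    set sashelp.class;
run;
```

After the step, the log should carry a note about the effect of compression on size. For a table this small, SAS may report that compression would increase the size, or disable it.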

DROP VARIABLES AND OBSERVATIONS

• Drop unnecessary variables ASAP (use the keep= and drop= data set options in the data step)

• Get rid of unnecessary observations ASAP (use the where= data set option on the input data set to avoid wasting CPU time on observations you don’t need to process)

data females; set all (where = (sex = 'F')); run;

• Murphy’s law of SAS datasets

– Any variable that is dropped from a dataset will be required two procs later.
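
Both ideas can be combined on the input data set. A sketch against the sashelp.class sample table (work.females is a hypothetical output name):

```sas
/* KEEP= and WHERE= on the SET statement: unwanted variables and
   observations are never read into the data step */
data work.females;
    set sashelp.class (keep=name sex age where=(sex = 'F'));
run;
```

Note that sex has to be in the keep= list here, because the where= condition is applied to the variables that survive keep=.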

LENGTH STATEMENT

•Each character (including spaces!) requires 1 byte.

data bigword;

length word $10;

length flag 3;

To change the length of a variable after the fact, place the length statement before the set statement

data smallword;

length word $4;

set mydata;
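
A runnable variant of the same idea, assuming the sashelp.class sample table (where name is stored as $8). The names work.short_names and flag are hypothetical.

```sas
data work.short_names;
    length name $4;      /* declared before SET, so this length is used */
    set sashelp.class;
    length flag 3;       /* new numeric variable stored in 3 bytes */
    flag = (age >= 13);  /* 1 if true, 0 if false */
run;

/* Confirm the stored lengths */
proc contents data=work.short_names;
run;
```

Shortening a character variable truncates any longer values, and the log is likely to warn that multiple lengths were specified for name.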

NUMBERS

Length in bytes    Largest integer represented exactly    Exponential notation

3                  8,192                                   2^13

4                  2,097,152                               2^21

5                  536,870,912                             2^29

6                  137,438,953,472                         2^37

7                  35,184,372,088,832                      2^45

8                  9,007,199,254,740,992                   2^53
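
The limits in the table can be checked by storing values on either side of one. This is a sketch, assuming a Windows or UNIX host (mainframe limits differ); work.short_num is a hypothetical name.

```sas
/* Store three integers around 2**13 in a 3-byte numeric */
data work.short_num;
    length n 3;
    do n = 8191, 8192, 8193;
        output;
    end;
run;

/* Read them back: the shortened length only takes effect on disk */
data _null_;
    set work.short_num;
    put n=;
run;
```

On such hosts, 8193 would be expected to come back as 8192, since it no longer fits in the bits that are kept. Short lengths are safest for integers known to stay within the limit, and risky for fractions.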

THE DATA STEP

DATA SET LISTS

Useful for reading in like-named datasets.

Can specify libraries.

WHERE, DROP, KEEP etc. can be applied in one data step.

MULTIPLE INPUT DATASETS

Useful when you want to read in multiple like-named datasets that have an incremental numbering sequence as part of the name.

e.g. indata1, indata2, indata3 … indata50

OPTIONS

• Write a macro

– Can complicate the data step

• Type out the data set names

– Time consuming

– Prone to error

– Do you really want to do this?

data result_out;

set indata1 indata2 indata3 indata4 indata5 indata6 indata7 indata8

indata9 indata10 indata11 indata12 indata13 indata14 indata15 indata16

indata17 indata18 indata19 indata20 indata21 indata22 indata23 indata24

indata25 indata26 indata27 indata28 indata29 indata30 indata31 indata32

indata33 indata34 indata35 indata36 indata37 indata38 indata39 indata40

indata41 indata42 indata43 indata44 indata45 indata46 indata47 indata48

indata49 indata50;

run;

USE DASH LISTS

•Dash lists make it simple

data out;

set indata1 - indata50;

run;

•Can also specify library

data out;

set cmlib.indata1 - cmlib.indata50;

run;
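
A self-contained sketch of a dash list with a data set option attached. The three indata tables are built from the sashelp.class sample table purely for illustration, and work.out is a hypothetical name. My understanding is that an option placed after the list applies to every data set in it.

```sas
/* Build three numbered data sets to read back */
data work.indata1 work.indata2 work.indata3;
    set sashelp.class;
    if age <= 12 then output work.indata1;
    else if age <= 14 then output work.indata2;
    else output work.indata3;
run;

/* Dash list: reads indata1, indata2 and indata3 in sequence,
   keeping only two variables */
data work.out;
    set indata1 - indata3 (keep=name age);
run;
```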

MULTIPLE INPUT DATASETS PART 2

Useful when you want to read in multiple like-named datasets that start with the same prefix, or when not all of the datasets to be read are known in advance

e.g. Prod_region1, Prod_region4, Prod_region7, Prod_region21 … Prod_region36

OPTIONS

• Write a macro

– Can complicate the data step

• Type out the data set names

– Time consuming

– Prone to error

– Do you really want to do this?

data total_production;

set Prod_region1 Prod_region1 Prod_region4 Prod_region7

Prod_region20 Prod_region21 Prod_region24,Prod_region27

Prod_region31 Prod_region32 Prod_region33 Prod_region36;

run;

USE COLON LISTS

Colon lists make it simple

data out;

set prod_region: ; run;

Can also specify library

data out;

set CMLIB.prod_region: ; run;
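
When the inputs are not known in advance, it can help to record which data set each row came from. A sketch using the indsname= option of the set statement; the three small input tables and the names units, dsname and source are made up for illustration.

```sas
/* Build three hypothetical regional data sets */
data work.prod_region1 work.prod_region4 work.prod_region7;
    do region = 1, 4, 7;
        units = region * 10;
        select (region);
            when (1) output work.prod_region1;
            when (4) output work.prod_region4;
            otherwise output work.prod_region7;
        end;
    end;
run;

/* Colon list: reads every WORK data set whose name starts with
   prod_region. INDSNAME= holds the name of the current input. */
data work.total_production;
    length source $41;
    set work.prod_region: indsname=dsname;
    source = dsname;  /* dsname itself is not written to the output */
run;
```

A colon list picks up every data set with the prefix, so the output is best given a name that does not start with the same prefix. Otherwise a rerun would read the previous output as well.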

WHAT ELSE CAN THE DATA STEP DO?

• Lots of stuff, but…here’s one example

• You already know how to read in multiple datasets…

• What if you want to write out multiple datasets?

Data east_production (where=(region=1))

west_production (where=(region=2))

north_production (where=(region=3))

south_production (where=(region=4));

Set prod_reg:;

Run;
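
The same kind of split can also be written with explicit output statements, which some find easier to extend with an ‘everything else’ branch. A sketch on the sashelp.class sample table (work.girls and work.boys are hypothetical names):

```sas
/* One pass over the input, two output data sets */
data work.girls work.boys;
    set sashelp.class;
    if sex = 'F' then output work.girls;
    else output work.boys;
run;
```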

WHAT ABOUT EG?

ALL THIS APPLIES TO EG ALSO!

OTHER TIPS FOR EG USERS

• Autoexec in Enterprise Guide?

• Query option to limit output during development

• Use Notes node to document Project

• Organize project using multiple process flows

• Understand your data before performing analysis on it

– Frequency counts

– Charts

– Characterize data

QUESTIONS?
