Chapter 4: Introduction to Lookup Techniques
4.1 Introduction to Lookup Techniques
4.2 In-Memory Lookup Techniques
4.3 Disk Storage Techniques

4.1 Introduction to Lookup Techniques

4.01 Multiple Choice Poll
Which of these is an example of a table lookup?
a. You have the data for January sales in one data set, February sales in a second data set, and March sales in a third. You need to create a report for the entire first quarter.
b. You want to send birthday cards to employees. The employees’ names and addresses are in one data set and their birthdates are in another.
c. You need to calculate the amount each customer owes for his purchases. The price per item and the number of items purchased are stored in the same data set.

4.01 Multiple Choice Poll – Correct Answer
Which of these is an example of a table lookup?
b. You want to send birthday cards to employees. The employees’ names and addresses are in one data set and their birthdates are in another.

Overview of Table Lookup Techniques
Arrays, hash objects, and formats provide an in-memory lookup table.
The DATA step MERGE statement, multiple SET statements in the DATA step, and SQL procedure joins use lookup values that are stored on disk.

4.2 In-Memory Lookup Techniques

Objectives
Describe arrays as a lookup technique.
Describe hash objects as a lookup technique.
Describe formats as a lookup technique.

4.02 Multiple Answer Poll
Which techniques do you currently use when you perform table lookups with a single data set?
a. Arrays
b. Hash object
c. Formats
d. None of the above

Overview of Arrays
An array is similar to a numbered row of buckets.
SAS puts a value in a bucket based on the bucket number.
A value is retrieved from a bucket based on the bucket number.
[Figure: a row of buckets numbered 1 2 3 4]

Overview of Arrays
General form of the ARRAY statement:
DATA data-set-name;
   ARRAY array-name { subscript } <$> <length> <array-elements> <(initial-value-list)>;
   <READ statement(s)>
   new-variable=array-name{subscript-value};
RUN;
The READ statement can be the SET, MERGE, or INFILE/INPUT statement.
The ARRAY statement associates variables or initial values to be retrieved using the array name and a subscript value.
The assignment statement retrieves values from the array based on the value of the subscript.

Overview of Arrays
data country_info;
   /* The ARRAY statement associates initial values to be retrieved */
   /* using the array name and a subscript value from 91 to 96.     */
   array Cont_Name{91:96} $ 30 _temporary_
      ('North America', ' ', 'Europe', 'Africa', 'Asia', 'Australia/Pacific');
   set orion.country;
   /* The assignment statement retrieves the value from the array */
   /* based on the value of the subscript, Continent_ID.          */
   Continent=Cont_Name{Continent_ID};
run;
p304d01
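The array lookup in p304d01 fails with a subscript-out-of-range error if Continent_ID holds a value outside 91 to 96. The following is a minimal sketch of one way to guard against that, not part of the course program. The data set work.country_demo and its values are hypothetical and stand in for orion.country.

```sas
/* Hypothetical input data that stands in for orion.country */
data work.country_demo;
   input Country $ Continent_ID;
   datalines;
AU 96
CA 91
DE 93
XX 99
;
run;

data work.country_info_demo;
   array Cont_Name{91:96} $ 30 _temporary_
      ('North America', ' ', 'Europe', 'Africa', 'Asia', 'Australia/Pacific');
   set work.country_demo;
   length Continent $ 30;
   /* Use the array only when the subscript is within its bounds; */
   /* otherwise Continent stays blank (as for the XX row here).   */
   if lbound(Cont_Name) <= Continent_ID <= hbound(Cont_Name) then
      Continent=Cont_Name{Continent_ID};
run;
```

LBOUND and HBOUND return the lower and upper bounds of the array, so the check keeps working if the subscript range in the ARRAY statement changes.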

4.03 Multiple Choice Poll
In p304d01, how many elements are in the array Cont_Name?
a. 0
b. 5
c. 6
d. unknown

4.03 Multiple Choice Poll – Correct Answer
In p304d01, how many elements are in the array Cont_Name?
c. 6 (the subscript range 91:96 defines six elements)

Overview of a Hash Object
A hash object is similar to rows of buckets that are identified by the value of a key.
SAS puts value(s) in the data bucket(s) based on the value(s) in the key bucket.
Value(s) are retrieved from the data bucket(s) based on the value(s) in the key bucket.
[Figure: rows of buckets labeled Key, Data, Data]

Overview of Hash Objects
General form of the hash object:
DATA data-set-name;
   <READ statement(s)>
   IF _N_=1 THEN DO;
      DECLARE HASH object-name(<attribute:value>);
      object-name.DEFINEKEY('key-name');
      object-name.DEFINEDATA('data-name');
      object-name.DEFINEDONE();
   END;
   return-code=object-name.FIND(<key: value>);
RUN;
The READ statement can be the SET, MERGE, or INFILE/INPUT statement.
The syntax within the DO group defines and can populate the hash object.
The FIND method retrieves the data value based on the key value.

Overview of Hash Objects
data country_info;
   length Continent_Name $ 30;
   /* The syntax within the DO group defines and populates the hash object. */
   if _N_=1 then do;
      declare hash Cont_Name(dataset:'orion.continent');
      Cont_Name.definekey('Continent_ID');
      Cont_Name.definedata('Continent_Name');
      Cont_Name.definedone();
   end;
   set orion.country;
   /* The FIND method retrieves the data value based on the key value. */
   rc=Cont_Name.find(key:Continent_ID);
   /* Keep only the observations whose key was found (return code 0). */
   if rc=0;
run;
p304d02
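p304d02 drops every observation whose key is not found, because of the subsetting IF on rc. The following is a hedged sketch of a variation that keeps unmatched observations and labels them instead. The data sets work.continent_demo and work.country_demo, their values, and the label 'Unknown' are assumptions made for illustration.

```sas
/* Hypothetical lookup table that stands in for orion.continent */
data work.continent_demo;
   infile datalines dsd;
   input Continent_ID Continent_Name :$30.;
   datalines;
91,North America
93,Europe
96,Australia/Pacific
;
run;

/* Hypothetical input data that stands in for orion.country */
data work.country_demo;
   input Country $ Continent_ID;
   datalines;
AU 96
CA 91
XX 99
;
run;

data work.country_info_demo;
   length Continent_Name $ 30;
   if _N_=1 then do;
      declare hash Cont_Name(dataset:'work.continent_demo');
      Cont_Name.definekey('Continent_ID');
      Cont_Name.definedata('Continent_Name');
      Cont_Name.definedone();
      /* Avoids an "uninitialized variable" note for Continent_Name */
      call missing(Continent_Name);
   end;
   set work.country_demo;
   rc=Cont_Name.find(key:Continent_ID);
   /* A nonzero return code means the key was not found */
   if rc ne 0 then Continent_Name='Unknown';
   drop rc;
run;
```

Whether to drop or label unmatched rows depends on the report; the subsetting IF in p304d02 behaves like an inner join, while this sketch behaves more like a left join.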

4.04 Multiple Choice Poll
In p304d02, how many times do the statements in the DO group execute?
a. only once
b. once for every observation in the data set orion.country
c. once for every observation in the data set orion.continent

4.04 Multiple Choice Poll – Correct Answer
In p304d02, how many times do the statements in the DO group execute?
a. only once (the DO group is conditioned on _N_=1)

Overview of a Format
A format is similar to rows of buckets that are identified by the data value.
SAS puts data values and label values in the buckets when the format is used in a FORMAT statement, PUT function, or PUT statement.
SAS uses a binary search on the data value bucket in order to return the value in the label bucket.
[Figure: rows of buckets labeled Data Value, Label]

Overview of a Format
General form of the user-defined format:
PROC FORMAT;
   VALUE <$>fmtname range-1=label-1
                    . . .
                    range-n=label-n;
RUN;
DATA data-set-name;
   <READ statement(s)>;
   new-variable=PUT(variable,fmtname.);
RUN;
The READ statement can be the SET, MERGE, or INFILE/INPUT statement.
The PROC FORMAT step compiles the format and stores it on disk.
When the PUT function executes, the format is loaded into memory, and a binary search is used to retrieve the format value.

Overview of a Format
/* The PROC FORMAT step compiles the format and stores it on disk. */
proc format;
   value Cont_Name
      91='North America'
      93='Europe'
      94='Africa'
      95='Asia'
      96='Australia/Pacific';
run;

data country_info;
   set orion.country;
   /* When the PUT function executes, the format is loaded into memory, */
   /* and a binary search is used to retrieve the format value.         */
   Continent=put(Continent_ID,Cont_Name.);
run;
p304d03
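When a value such as 92 has no matching range, the format in p304d03 returns the number itself as text. The following is a small sketch, under assumed names, of using the OTHER keyword in the VALUE statement to give unmatched values an explicit label. The format name Cont_Demo, the data set work.country_demo, and the label 'Unknown' are hypothetical.

```sas
proc format;
   value Cont_Demo
      91='North America'
      93='Europe'
      94='Africa'
      95='Asia'
      96='Australia/Pacific'
      other='Unknown';   /* label for any value not listed above */
run;

/* Hypothetical input data that stands in for orion.country */
data work.country_demo;
   input Country $ Continent_ID;
   datalines;
AU 96
CA 91
XX 99
;
run;

data work.country_info_demo;
   set work.country_demo;
   length Continent $ 30;
   /* PUT returns the label for Continent_ID, or 'Unknown' for 99 */
   Continent=put(Continent_ID,Cont_Demo.);
run;
```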

4.3 Disk Storage Techniques

Objectives
List methods for combining data horizontally.
Use multiple SET statements to combine data horizontally.
Compare methods for combining SAS data sets.

Combining Data Horizontally
DATA step techniques for combining data horizontally include using the following:
MERGE statement
multiple SET statements
UPDATE statement
MODIFY statement
In addition, you can use the SQL procedure with an inner or outer join.

4.05 Multiple Answer Poll
Which techniques do you currently use when you perform table lookups with multiple data sets?
a. MERGE statement
b. Joins
c. Multiple SET statements
d. UPDATE statement
e. MODIFY statement
f. None of the above

Overview of Merges and Joins
The DATA step MERGE and the SQL join operators are similar to multiple stacks of buckets that are referred to by the value of one or more common variables.
[Figure: two stacks of buckets, each labeled By Value(s), Data, Data]

DATA Step MERGE Statement
General form of the DATA step merge:
DATA data-set-name;
   MERGE SAS-data-sets;
   BY variables;
RUN;
Matches on equal values for like-named variables (in the illustration, Continent_ID in both data sets).

DATA Step MERGE Statement
proc sort data=orion.country out=country;
   by Continent_ID;
run;

data country_info;
   merge country orion.continent;
   by Continent_ID;
run;
Matches on equal values for like-named variables
p304d04
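A match-merge like p304d04 keeps observations from both data sets, including BY values that appear in only one of them. The following is a hedged sketch of using the IN= data set option to keep only the matches. The data sets work.country_demo and work.continent_demo and the variable names InCountry and InContinent are hypothetical.

```sas
/* Hypothetical input data that stands in for orion.country */
data work.country_demo;
   input Country $ Continent_ID;
   datalines;
AU 96
CA 91
XX 99
;
run;

/* Hypothetical lookup table, already in Continent_ID order */
data work.continent_demo;
   infile datalines dsd;
   input Continent_ID Continent_Name :$30.;
   datalines;
91,North America
93,Europe
96,Australia/Pacific
;
run;

proc sort data=work.country_demo out=work.country_sorted;
   by Continent_ID;
run;

data work.country_info_demo;
   /* IN= creates temporary 0/1 flags that show which data set contributed */
   merge work.country_sorted(in=InCountry)
         work.continent_demo(in=InContinent);
   by Continent_ID;
   /* Keep only BY values found in both data sets */
   if InCountry and InContinent;
run;
```

With these sample values, the observations for Continent_ID 93 (lookup only) and 99 (country only) are excluded.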

4.06 Multiple Choice Poll
In p304d04, if the data set country has seven observations and the data set orion.continent has five observations, what stops the execution of the DATA step?
a. end of file for work.country, the data set with the most observations
b. end of file for orion.continent, the last data set listed in the MERGE statement
c. end of file for the data set that contains the final value of the BY variable Continent_ID

4.06 Multiple Choice Poll – Correct Answer
In p304d04, if the data set country has seven observations and the data set orion.continent has five observations, what stops the execution of the DATA step?
c. end of file for the data set that contains the final value of the BY variable Continent_ID

The SQL Procedure
You can use an SQL procedure inner or outer join to create a SAS data set.
General form of the SQL procedure CREATE TABLE statement with an inner join:
PROC SQL;
   CREATE TABLE SAS-data-set AS
      SELECT column-1, column-2, …, column-n
         FROM table-1, table-2, …, table-n
         WHERE joining criteria
         ORDER BY sorting criteria;
QUIT;
Performs an inner join based on the WHERE criteria

The SQL Procedure
proc sql;
   create table country_info as
      select country.*, Continent_Name
         from orion.country, orion.continent
         where country.Continent_ID=continent.Continent_ID
         order by country.Continent_ID;
quit;
Performs an inner join where the Continent_ID values from both data sets are equal
p304d05
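The slides mention outer joins but show only the inner join. The following is a minimal sketch of a left outer join that keeps every country row even when no continent matches. The data sets work.country_demo and work.continent_demo and the aliases c and n are hypothetical.

```sas
/* Hypothetical input data that stands in for orion.country */
data work.country_demo;
   input Country $ Continent_ID;
   datalines;
AU 96
CA 91
XX 99
;
run;

/* Hypothetical lookup table that stands in for orion.continent */
data work.continent_demo;
   infile datalines dsd;
   input Continent_ID Continent_Name :$30.;
   datalines;
91,North America
93,Europe
96,Australia/Pacific
;
run;

proc sql;
   create table work.country_info_demo as
      select c.*, n.Continent_Name
         from work.country_demo as c
              left join work.continent_demo as n
              on c.Continent_ID=n.Continent_ID
         order by c.Continent_ID;
quit;
```

In this sketch the XX row is kept with a missing Continent_Name, whereas the inner join in p304d05 would drop it. Unlike the DATA step merge, PROC SQL does not require the input tables to be sorted first.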

4.07 Multiple Choice Poll
Which of the following is true of the SQL inner join?
a. The resulting data set contains only the observations with matching key values.
b. The resulting data set contains both the observations with matching key values and those observations where the key values do not match.

4.07 Multiple Choice Poll – Correct Answer
Which of the following is true of the SQL inner join?
a. The resulting data set contains only the observations with matching key values.

Multiple SET Statements
The DATA step with multiple SET statements combines data sets by performing one-to-one reading.
[Figure: two rows of buckets labeled Data, Data, read side by side]

Multiple SET Statements
You can use multiple SET statements to combine observations from several SAS data sets.
When you use multiple SET statements, the following occurs:
Processing stops when SAS encounters the end-of-file marker on either data set.
The variables in the PDV are not reinitialized when a second SET statement is executed.

Multiple SET Statements
General form of the DATA step with multiple SET statements:
DATA data-set-name;
   SET SAS-data-set;
   SET SAS-data-set;
RUN;

Multiple SET Statements
data country_info;
   set orion.country;
   set orion.continent;
run;
p304d06
Listing of country_info
Obs  Country  Country_Name  Population  Country_ID  Continent_ID  Country_FormerName  Continent_Name
1    AU       Australia     20,000,000  160         91                                North America
2    CA       Canada        .           260         93                                Europe
3    DE       Germany       80,000,000  394         94            East/West Germany   Africa
4    IL       Israel        5,000,000   475         95                                Asia
5    TR       Turkey        70,000,000  905         96                                Australia/Pacific

Execution
Input data sets:
one         two
X  Y        Z
1  2        A
2  3        B
3  4

data three;
   set one;
   set two;
   Total=X+Y;
run;

PDV after each step:
Step                                   X  Y  Z  Total  _N_
Iteration 1: set one;                  1  2     .      1
Iteration 1: set two;                  1  2  A  .      1
Iteration 1: Total=X+Y;                1  2  A  3      1
Implicit OUTPUT; Implicit RETURN;
Iteration 2: Initialize PDV.           1  2  A  .      2
Iteration 2: set one;                  2  3  A  .      2
Iteration 2: set two;                  2  3  B  .      2
Iteration 2: Total=X+Y;                2  3  B  5      2
Implicit OUTPUT; Implicit RETURN;
Iteration 3: Initialize PDV.           2  3  B  .      3
Iteration 3: set one;                  3  4  B  .      3
Iteration 3: set two; reaches EOF
Processing stops.

Output data set:
three
X  Y  Z  Total
1  2  A  3
2  3  B  5
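The execution trace can be reproduced with a short program. The following sketch recreates the one and two data sets with DATALINES and adds a PUT statement, which is not in the original program, so that the PDV values appear in the log at the end of each completed iteration.

```sas
data one;
   input X Y;
   datalines;
1 2
2 3
3 4
;
run;

data two;
   input Z $;
   datalines;
A
B
;
run;

data three;
   set one;
   set two;      /* the third iteration stops here at end of file */
   Total=X+Y;
   /* Added for illustration: write the PDV values to the log */
   put _N_= X= Y= Z= Total=;
run;
```

The log should show two lines, one for each observation written to three. No line appears for the third iteration because the step ends at the second SET statement before the PUT statement executes.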

Setup for the Poll
The previous example created a data set named three with two observations.
data three;
   set one;
   set two;
   Total=X+Y;
run;
Input data sets:
one         two
X  Y        Z
1  2        A
2  3        B
3  4
Using the same one and two data sets, if the SET statements were reversed, how many observations would be in the data set three?
data three;
   set two;
   set one;
   Total=X+Y;
run;

4.08 Multiple Choice Poll
Using the same one and two data sets, if the SET statements were reversed, how many observations would be in the data set three?
a. 5
b. 2
c. 3
d. 6

4.08 Multiple Choice Poll – Correct Answer
Using the same one and two data sets, if the SET statements were reversed, how many observations would be in the data set three?
b. 2 (processing stops at the first end of file, and data set two has only two observations)

DATA Step Methods for Reading SAS Data
For each program: which variables are reinitialized to missing at the top of the DATA step, and what stops the DATA step?

Code:
data two;
   set one;
   New_Var=Value;
run;
Reinitialized: variables created in the DATA step
Stops at: end of the file for data set one

Code:
data three;
   merge one two;
   by Var;
   New_Var=Value;
run;
Reinitialized: variables created in the DATA step; all variables when the BY value changes
Stops at: the last end of file that is encountered

Code:
data three;
   set one two;
   New_Var=Value;
run;
Reinitialized: variables created in the DATA step; all variables when SAS finishes reading data set one and starts reading data set two
Stops at: end of the file for data set two

Code:
data three;
   set one;
   set two;
   New_Var=Value;
run;
Reinitialized: variables created in the DATA step
Stops at: the first end of file that is encountered

Chapter Review
1. What are the three types of in-memory table lookups?
2. What are three types of disk storage table lookups?
3. When multiple SET statements are executed, when does execution stop?

Chapter Review – Correct Answers
1. What are the three types of in-memory table lookups?
arrays, hash objects, and formats
2. What are three types of disk storage table lookups?
PROC SQL, the DATA step with a MERGE statement, or the DATA step with multiple SET statements
3. When multiple SET statements are executed, when does execution stop?
Execution stops when the first end of file is encountered.