Conditional processing
There are several places where conditional processing can be used and is useful. Last week, we talked about (1) using IF statements to conditionally write things to a file in order to generate a report, and (2) using IF and IF-ELSE statements to create ranges for a variable, such as low, medium, and high.
There are lots of other ways that conditional processing can be useful, and we’ll go over some of them.
SAS Programming
Uses of IF statements
A common way to use an IF statement is to conditionally read in data. This can be useful for large datasets where you only want a subset of the data.
In this case, there is no need to read in all of the data initially and create a huge dataset which will take a lot of memory. Instead, you can create a smaller dataset to begin with.
Dataset
To work with a new, and slightly larger, dataset, we’ll use some crime data from a book called Crowding and Behavior by J. L. Freedman (1975). An example from this book uses crime data on 110 cities in the US and lists their percentage of nonwhite residents, population density, and crime rate per 100,000 residents.
The book argues that there is not a significant correlation between population density and crime rate. For now, we’ll be more interested in manipulating the data than using it for statistical analysis. It is often a good idea to look at the text file for the data to see if there is anything weird or problematic to begin with, and also to see how variables are delimited.
City crime example
In this case, all variables appear to be numeric except for city, but there are some missing values. The city names can be long, longer than 8 characters, such as for ALBUQUERQUE. Here you might think of two strategies:
(1) either worry about the longest city name, or
(2) (maybe a slightly lazy way...) just use the first 8 characters as an abbreviation for the city name.
Will this work?
Dataset
A more thorough look at the data reveals that some more care might be needed.
Dataset
A look at the next few cities reveals that the first 8 characters do not uniquely determine the name of the city, since there are two cities whose first 8 characters are identical:
CHARLESTON.SC
CHARLESTON.W.VA
One feature of the data which is nice is that spaces in city names are replaced by periods, e.g.,
LOS.ANGELES
But you might need to be cautious in assuming that this convention is applied throughout.
Dataset
The main question of interest for this data is the relationship between population density and crime rate, but first we will do some processing just to describe the data. Here are some questions of interest. For some of them we will use conditional processing, while others will require more ideas later on.
1. What proportion of observations have at least one missing value? What proportion are missing for
2. Is there a relationship between missing values and the values of any other variables? (Does the data appear to be missing at random?)
3. Create a dataset that only has cities with crime rates above 4000 per 100,000 people.
4. How does Albuquerque rank on the list?
5. How do we create a variable that ranks the crime rate for each city?
6. How can we print out the city in each state with the highest (or lowest) crime rate?
Dataset
Some of these questions will be addressed a bit later. For now, we can address the first few questions.
First, how can we count how many observations have missing values?
Describing missing values
There are a couple of ways of counting missing values, partly depending on whether they are missing for character versus numeric values. The most straightforward is to use an IF statement:
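A minimal sketch of such an IF statement, assuming the data have already been read into a dataset called crimeRates and that the numeric variable holding population density is named density (the indicator name missing is illustrative):

```sas
data crimeRates;
  set crimeRates;
  /* numeric missing values are stored as a period (.) */
  if density = . then missing = 1;
run;
```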
Describing missing values
The previous approach sets the missing variable to missing for all nonmissing observations. It might be better to code these as 0. Then the proportion of 1s in the missing column is the proportion of observations that are missing.
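As a sketch of the 0/1 coding, again assuming a dataset crimeRates with a numeric variable density (both names are assumptions), the indicator can be set in both branches and its mean then gives the proportion missing:

```sas
data crimeRates;
  set crimeRates;
  if density = . then missing = 1;
  else missing = 0;
run;

/* the mean of a 0/1 indicator is the proportion of 1s */
proc means data=crimeRates mean;
  var missing;
run;
```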
Describing missing values
You could also just read in the values that are non-missing, but then you lose the ability to find patterns in the missingness.
Conditional reading-in of the data
Note that the syntax is a little weird on the previous slide. Instead of writing
if condition then input .....
you put the IF statement after the action, the INPUT statement.
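As a sketch of this subsetting IF, assume a hypothetical raw file crime.txt whose columns are city, population, nonwhite, density, and crime in that order, with missing numeric values recorded as periods. The file name, variable names, and column order are all assumptions for illustration:

```sas
data crimeComplete;
  infile "crime.txt";                                   /* hypothetical file name */
  input city :$15. population nonwhite density crime;   /* assumed layout */
  /* subsetting IF: keep the observation only if density is present */
  if density ne .;
run;
```

Observations that fail the condition are read by INPUT but never written to crimeComplete.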
Counting missing values
An alternative way of counting missing values is to use the cmiss function in SAS.
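A short sketch of cmiss, which counts missing arguments whether they are numeric or character. The variable names below are assumed, as are the new variables nMissing and anyMissing:

```sas
data crimeRates;
  set crimeRates;
  /* number of missing values among the listed variables in this observation */
  nMissing = cmiss(city, population, nonwhite, density, crime);
  /* 1 if at least one value is missing, 0 otherwise */
  anyMissing = (nMissing > 0);
run;
```

Listing the variables explicitly avoids counting the new variable itself, which would happen with a list such as of _all_.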
Using proc means and proc freq to count missing data
The procedures proc means and proc freq can also be used to count missing data for numeric and character variables, respectively. These procedures provide summaries, which aren’t enough to find patterns in the missing data.
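One way this might look, assuming numeric variables named population, nonwhite, density, and crime and a character variable city. The format name $missfmt is made up for this sketch:

```sas
/* n = number of nonmissing values, nmiss = number of missing values */
proc means data=crimeRates n nmiss;
  var population nonwhite density crime;
run;

/* collapse a character variable into missing versus not missing */
proc format;
  value $missfmt ' ' = 'Missing' other = 'Not missing';
run;

proc freq data=crimeRates;
  format city $missfmt.;
  tables city / missing;
run;
```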
Patterns in the missing data
To look at patterns in the missing data, you could create a subset containing only the observations that have missing data. This will be useful if there isn’t too much missing data and not too many variables. In this case, it is revealing.
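A minimal sketch of such a subset using a subsetting IF, assuming the numeric variables are named population, density, nonwhite, and crime (crimeMissing is an illustrative dataset name):

```sas
data crimeMissing;
  set crimeRates;
  /* keep only observations with at least one missing value */
  if cmiss(population, density, nonwhite, crime) > 0;
run;

proc print data=crimeMissing;
run;
```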
Patterns in the missing data using WHERE
So a clear pattern in the data is that population size and density are always missing together. We might also be concerned with whether missing density values are associated with either the percentage of nonwhite residents or crime rates of the cities. In other words, do cities with missing density values tend to have higher crime rates?
Similarly to the subsetting IF, you can run proc means or other procedures conditional on certain values in the data set. However, instead of using IF, you use WHERE.
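A sketch of proc means restricted with WHERE, assuming missing3 is a 0/1 indicator of a missing density value and that crime and nonwhite are the variable names:

```sas
/* summary statistics for cities with missing density */
proc means data=crimeRates;
  where missing3 = 1;
  var crime nonwhite;
run;

/* the same summary for cities with density recorded */
proc means data=crimeRates;
  where missing3 = 0;
  var crime nonwhite;
run;
```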
Looking for correlations
To see if there is an association between missingness and either the crime or nonwhite variables, we can use the proc corr procedure, which generates a correlation matrix.
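A minimal sketch, assuming missing3 is the 0/1 missingness indicator and crime and nonwhite are the variable names:

```sas
/* correlation matrix, with p-values, for the three variables */
proc corr data=crimeRates;
  var missing3 crime nonwhite;
run;
```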
Plotting the data
It’s always a good idea to plot your data as well. Associations between variables might show up if the two groups of observations (with missing data versus no missing data) have different distributions. The correlations are pretty weak and nonsignificant, but it is still good to look at the plots.
The older SAS graphics were purely text based, and have a certain charm. It also means you can create plots that take very little memory, since they can be saved as plain text files. The procedure for text plots is proc plot (older); proc gplot produces higher-resolution graphics.
/* also try gplot instead of plot */
proc plot data=crimeRates;
plot crime*missing3;
run;
Results of proc plot, saved as pdf
[Text plot of crime (vertical axis, 0 to 6000) against missing3 (horizontal axis, values 0 and 1). Legend: A = 1 obs, B = 2 obs, etc.]
Results of proc gplot, saved as pdf
[Plot of crime (vertical axis, 0 to 6000) against missing3 (horizontal axis, 0.0 to 1.0).]
Questions about the data
From the questions about the data earlier on in the slides, we have found that there were 10 cities with missing density and population values, and that these values were always missing together when either was missing. The crime and nonwhite variables were never missing. Based on correlations and plots, there is no obvious relationship between missingness and the nonwhite or crime variables. There is no way to assess whether missingness is related to population density or population size, however, based on the data.
Creating a subset of the data
To create a subset of the data with cities that have crime rates above 4,000 per 100,000 people, we can do the following:
data crimeHigh;
set crimeRates;
if crime > 4000;
run;
proc print;
run;
Sorting by crime
From looking at the previous subset of the data with the highest crime cities, it is easy to pick out Albuquerque’s rank among the 110 cities. Albuquerque has the second highest crime rate, with San Francisco being the highest.
To answer the general question of how cities rank, you would need to sort them by crime rate. This can be done in SAS using proc sort:
proc sort data=crimeRates out=crimeRanked;
by descending crime;
run;
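Once the data are sorted, one simple way to create a rank variable is to use the automatic variable _n_, which holds the row number in the data step. The sketch below assumes the sorted dataset crimeRanked from the proc sort step; the variable name crimeRank is illustrative, and tied crime rates receive distinct consecutive ranks under this approach:

```sas
data crimeRanked;
  set crimeRanked;
  /* row number in the sorted data serves as the rank */
  crimeRank = _n_;
run;

/* print only the 10 cities with the highest crime rates */
proc print data=crimeRanked(obs=10);
  var crimeRank city crime;
run;
```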
Sorting data for proc means BY processing
Sorting data is also needed to analyze data separately for different levels of a categorical variable. For example, instead of doing proc means separately for different values of missing3, I could create a categorical variable indicating missingness and run proc means on both values within a single procedure.
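A sketch of this BY processing, assuming missing3 is the 0/1 missingness indicator (the output dataset name crimeByMissing is made up for illustration):

```sas
/* BY processing requires the data to be sorted by the BY variable */
proc sort data=crimeRates out=crimeByMissing;
  by missing3;
run;

/* one table of summary statistics for each value of missing3 */
proc means data=crimeByMissing;
  by missing3;
  var crime nonwhite;
run;
```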
BY command
Using the BY command is a powerful feature of SAS. It allows you to generate reports, tables, graphics, or data analysis on different subsets of your data automatically. Here we did it for just levels of a missingness variable, but you could also do separate analyses for different states, different cities, etc. depending on the data.
We also just used the BY command in proc means, but it can be used in many other procedures as well.
Unfortunately, to do the BY command, you have to sort the data first based on the variables you are using BY with. Other procedures, such as the histogram, also require data to be sorted first.
Sorting issues
In the example of the sort we did earlier, a SAS dataset was sorted and saved as a second SAS dataset, which is a common way to sort data, so that the original order of the data is preserved. If you do not specify a new dataset for the sort procedure to output to, it will sort your original data, which you might or might not want.
Sorting speed
SAS is fast at sorting. I attempted to do a little experiment. I generated 4 million uniform random variables (in R, I have to admit, it is so easy there!) and saved them all to a file. This was about as much as I could do to stay within my 50Mb quota on my UNM linux account. Nevertheless I transferred as much of the file as I could using
scp uniform.txt [email protected]
To test how fast SAS can sort this, I use the time command in linux (also works in OS X):
Sorting speed: SAS on UNM computer
[jamdeg@polaris SAS]$ cat sort.sas
data uniform;
infile "uniform.txt";
input u;
run;
proc sort data=uniform;
by u;
run;
[jamdeg@polaris SAS]$ time sas sort.sas
Starting SAS ....
The http://Fastinfo.unm.edu entry for SAS is:
http://unm.custhelp.com/cgi-bin/unm.cfg/php/enduser/std_adp.php?p_faqid=6341
real 0m0.002s
user 0m0.001s
sys 0m0.001s
Sorting speed: R on my laptop
To run R code in batch mode, you type R CMD BATCH myfile.r at a linux/OS X prompt. This is EXTREMELY useful for simulations. To see how long it takes, you can put the linux command time in front. Here’s how long it took on my MacBook Pro to read in and sort the same data set:
Jamess-MacBook-Pro:superjames$ time R CMD BATCH sort.r
real 0m56.772s
user 0m56.084s
sys 0m0.603s
This isn’t a fair comparison because I’ve used different computers. Unfortunately SAS is available on linux.unm.edu but R isn’t, and R is available on my laptop, but SAS isn’t. C’est la vie!
sorting code
SAS:
data uniform;
infile "uniform.txt";
input u;
run;
proc sort data=uniform;
by u;
run;
R:
Jamess-MacBook-Pro:Music superjames$ cat sort.r
x <- read.table("uniform.txt")
# read.table returns a data frame, so sort its single column (named V1 by default)
y <- sort(x$V1)
SAS boxplot using sorting on missing
Note that the boxplot command requires that the data already be sorted by the missing2 variable.
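A sketch of what this might look like, assuming missing2 is a missingness indicator in crimeRates and using proc boxplot (part of SAS/STAT); the dataset name crimeSorted is illustrative:

```sas
/* proc boxplot expects the data sorted by the group variable */
proc sort data=crimeRates out=crimeSorted;
  by missing2;
run;

/* one box of crime rates for each value of missing2 */
proc boxplot data=crimeSorted;
  plot crime*missing2;
run;
```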
Finding the longest character string
Here I used the length function to determine the length of the string, created a new variable based on this, and sorted by length. So that I didn’t have to print the entire data set, I used an option in proc print to only print the first 10 observations.
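The steps just described can be sketched as follows, assuming the character variable is named city (the names cityLength and nameLength are made up for this example):

```sas
data cityLength;
  set crimeRates;
  /* length() ignores trailing blanks */
  nameLength = length(city);
run;

proc sort data=cityLength;
  by descending nameLength;
run;

/* obs=10 restricts printing to the first 10 observations */
proc print data=cityLength(obs=10);
  var city nameLength;
run;
```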
String functions in SAS
SAS has MANY string functions which are useful for manipulating text. I think of data as fundamentally being strings rather than numbers. This is more a point of view than anything else. Obviously, we can think of it all as 0s and 1s. Often strings can be interpreted as numbers, but often it is unnatural to force them to be. Data can be any kind of information – image files, text files, html files, lists of email addresses, etc. So it is useful to have string functions that can manipulate strings in different ways.
Example. Suppose you have a list of 100,000 email addresses, and your job is to determine how many addresses are .edu versus .com versus .org, etc. What would you need to determine this? You’d have to be careful about weird email addresses such as [email protected], [email protected], and so forth.
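One possible approach, sketched on four made-up addresses, is to take the piece of each address after the last period with the scan function and then tabulate it. The addresses and variable names are invented for illustration, and real data would need more checks for malformed addresses:

```sas
data emails;
  length email $ 60 domain $ 10;
  input email $;
  /* a count of -1 asks scan for the last piece, using '.' as the delimiter */
  domain = lowcase(scan(email, -1, '.'));
  datalines;
alice@example.edu
bob@mail.example.com
carol@example.ORG
dave@dept.example.edu
;
run;

/* frequency table of endings: edu, com, org, ... */
proc freq data=emails;
  tables domain;
run;
```

Applying lowcase treats ORG and org as the same ending, and scanning from the right handles addresses with several periods.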
Ranking cities by state
The last question is the hardest. Some states have only one city on the list (e.g., New Mexico), while others have several cities (California, Texas, Ohio, etc.). Suppose we want to create a variable that has the state for each city. Then we might want to sort by state, then sort cities within states.
Furthermore, we could sort alphabetically by state, then cities alphabetically within states; or alphabetically by state, then by crime rate within each state.
Ranking cities by state
First, how do we get the states for each city? Often we might have separate data files that when combined have the information we need. We might have one file with cities with demographic and crime statistics, and another file with states for each city.
For the current dataset, you could probably manually enter the state for each city either by modifying the data file or by having a long IF statement like this
data crime;
...
if city = "ALBUQUERQUE" then state = "NM";
else if city = "BOSTON" then state = "MA";
and so on.
Ranking cities by state
Although this would work for this data set (it doesn’t take too long to enter in 110 states, even if you have to look most of them up), it would be better to be able to MERGE the data from a list of cities in another data set. If you had more cities, or county level data (there are about 3000 counties in the US), then doing it by hand would not be feasible.
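A minimal sketch of such a merge, assuming a second, hypothetical dataset cityStates with variables city and state, where city is spelled exactly as in crimeRates. All new dataset names here are illustrative:

```sas
/* both datasets must be sorted by the matching variable */
proc sort data=crimeRates out=crimeSorted;
  by city;
run;

proc sort data=cityStates out=statesSorted;
  by city;
run;

data crimeWithState;
  merge crimeSorted(in=inCrime) statesSorted;
  by city;
  /* keep only cities that appear in the crime data */
  if inCrime;
run;
```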
Sorting on two variables
Supposing you do have a variable that lists the state for every city, you can then sort on two variables.
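For instance, assuming a hypothetical dataset crimeWithState that contains state, city, and crime, the following sketch sorts alphabetically by state and then by decreasing crime rate within each state:

```sas
proc sort data=crimeWithState out=crimeByState;
  /* descending applies only to the variable that follows it */
  by state descending crime;
run;

proc print data=crimeByState;
  var state city crime;
run;
```

Replacing descending crime with city would instead sort cities alphabetically within states.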
The IN operator
For now we’ll assume that we have the state variable in our data set as well. We could also create variables to indicate the region of a certain state. For example, suppose we want a list of cities for states on the East Coast. We’ll define the East Coast as being the following states:
- Florida (FL)
- South Carolina (SC)
- North Carolina (NC)
- Virginia (VA)
- Delaware (DE)
- Pennsylvania (PA)
- New Jersey (NJ)
- New York (NY)
- Connecticut (CT)
- Rhode Island (RI)
- Massachusetts (MA)
- New Hampshire (NH)
- Maine (ME)
The IN operator
Clearly you could do something like this:
data crimeRegions;
set crimeRates;
if state="FL" or state="SC" or state="NC" ... or state = "ME"
then region = "East Coast";
run;
where the ... is filled in.
The IN operator
A slightly more compact way is to write
data crimeRegions;
set crimeRates;
if state in
("FL","SC","NC","VA","DE","PA","NJ","NY","CT","RI","MA","NH","ME")
then region = "East Coast";
run;
The IN operator
For the IN operator, SAS checks whether the character variable matches one of the items in a list.
The IN operator can also be used with numeric variables, such as
proc print data=crimeRates;
where crime in (0:2000,4000:6000);
run;
which will print the observations with the lowest and highest crime rates. Note that a range such as 0:2000 matches only the integers from 0 to 2000.
IN operator
There are usually many ways to do things. Usually, you will find ways that you like and stick with them, which will become part of your coding style. The previous example could also be achieved using
proc print data=crimeRates;
where 0 <= crime <= 2000 | 4000 <= crime <= 6000;
run;
IN operator
The IN operator on numeric data could also be used to pick out unusual observations in a data set called mydata. For example, suppose observations 7, 518, 1009, and 2112 have been flagged as being unusual. It is difficult to look at them simultaneously in the original file because they are widely separated. You could then do this:
data _null_;
set mydata;
file "outliers.txt";
if _n_ in (7,518,1009,2112) then put _all_;
run;
This code prints the weird observations to a separate file so that they can be examined more easily together. The _all_ keyword means that SAS will also print out the _ERROR_ variable (which is 0 if there is no error) and the _n_ variable itself along with the user-defined variables. (Try this yourself on your own data.)
WHERE statement
The WHERE statement works much like the subsetting IF but can be used in procedures (IF cannot be used in procedures). WHERE specifies a condition that tells SAS to only apply the procedure to those observations meeting the condition.
As an example, you could print only those cities where crime was greater than 4000:
proc print data=crimeRates;
where crime>4000;
var city crime;
run;
WHERE statement
WHERE has many built-in operators that make it behave in special ways:
Operator      Description                Example
IS MISSING    Matches a missing value    where Subj is missing
IS NULL       Equivalent to IS MISSING   where Subj is null
BETWEEN AND   An inclusive range         where age between 20 and 40
CONTAINS      Matches a substring        where Name contains "Mac"
LIKE          Matching with wildcards    where Name like "R_n%"
=*            Phonetic matching          where Name =* "Nick"
WHERE statement with wildcards
Wildcards work with _ meaning one character, and % being a string of any length (including an empty string). You can have multiple wildcards in one statement. For example, if you were printing files that were either .dat or .txt, and name is a variable in your program, you might use
where name like "%.__t";
Here I used two underscores in a row. In this case you can also use
where name like "%.txt" or name like "%.dat";
so you can apply logical operators to the condition of a WHERE statement, similar to IF statements.
Wildcards in linux/MAC OS X/MS-DOS (Windows)
Wildcards in operating systems work much the same way as in SAS except that * is used for a string of any length and ? is used for a single character. For example
ls *.sas
lists all files ending with extension .sas in the current directory. In MS-DOS, use dir *.sas instead.
To list the contents of all subdirectories, you can do this
ls */*
ls */*/*
etc
Wildcards for listing files
Large data projects can generate directories with thousands of files, and this can crash your computer if you try to search the directory using Finder (Mac) or Explorer (Windows). Directories with lots of files can also be hard to search, so using wildcards can narrow down the possibilities. If there are too many files, you can’t even list them directly. For example, if I want to list all my mp3 files in my iTunes folder, I try the following from the music folder (in which each directory is an artist/band/composer).
Jamess-MacBook-Pro:Music superjames$ ls */*/*.mp3
-bash: /bin/ls: Argument list too long
It is too long because I probably have close to 10,000 mp3 files. So instead I settle for listing a few of them:
Jamess-MacBook-Pro:Music superjames$ ls A*/*/*.mp3
AC_DC/Back In Black/01 ’’’’’’’.mp3
AC_DC/Back In Black/02 Shoot To Thrill.mp3
AC_DC/Back In Black/03 What Do You Do For Money Honey.mp3
AC_DC/Back In Black/04 Givin’ The Dog A Bone.mp3
AC_DC/Back In Black/05 Let Me Put My Love Into You.mp3
AC_DC/Back In Black/06 Back In Black.mp3
AC_DC/Back In Black/07 You Shook Me All Night Long.mp3
AC_DC/Back In Black/08 Have A Drink On Me.mp3
AC_DC/Back In Black/09 Shake A Leg.mp3
AC_DC/Back In Black/10 Rock And Roll Ain’t Noise Pollution.mp3
Al Di Meola/Elegant Gypsy/01 Track 1.mp3
Al Di Meola/Elegant Gypsy/02 Track 2.mp3
Al Di Meola/Elegant Gypsy/03 Track 3.mp3
Al Di Meola/Elegant Gypsy/04 Track 4.mp3
Al Di Meola/Elegant Gypsy/05 Track 5.mp3
Al Di Meola/Elegant Gypsy/06 Track 6.mp3
WHERE vs IF
To process a subset of the observations in a PROC, you should use the WHERE statement. The IF statement is only available in the data step.
However, in the data step, you could use WHERE instead of IF in many cases. For many cases, the difference will be transparent to the user. Technically, a difference is that with the IF statement, data is read into SAS, into what is called the Program Data Vector (PDV), then tested. For a WHERE statement, the data is tested first, then read into the PDV.
WHERE vs IF
Because the data is read in first with the IF statement, it can be less efficient than WHERE, which does not have to (temporarily) store the data in the PDV for testing. This can make more of a difference on larger data sets.
However, there are cases where the IF statement is still needed and a WHERE statement doesn’t work. The subsetting IF is required instead of WHERE if
- the selection is based on anything other than what already exists in a SAS dataset, such as if you are reading in data from an external file
- the selection is made later in the data step rather than at the beginning
- the selection is made conditionally (e.g., nested IF statements)
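The contrast can be sketched as follows, assuming the crimeRates dataset with a crime variable. The derived variable crimePer1000 and the output dataset names are made up for illustration:

```sas
/* WHERE works here because crime already exists in crimeRates */
data crimeHighWhere;
  set crimeRates;
  where crime > 4000;
run;

/* crimePer1000 is created inside this data step, so it is not yet
   available to WHERE; a subsetting IF placed after the assignment
   is needed instead */
data crimeHighIf;
  set crimeRates;
  crimePer1000 = crime / 100;
  if crimePer1000 > 40;
run;
```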
Student questions
I’ll show a couple of slides that answer some student questions that came up and that I think are instructive.
First, what happens if one variable is set equal to another variable that has a format?
Format question
[jamdeg@mizar SAS]$ cat date.txt
09/03/2014
[jamdeg@mizar SAS]$ cat date.sas
data date;
informat time mmddyy10.;
infile "date.txt";
input time;
format time mmddyy8.;
time2 = time;
run;
proc print data=date;
run;
Format question
[jamdeg@mizar SAS]$ cat date.lst
The SAS System 16:43 Wednesday, September 3, 2014 1
Obs time time2
1 09/03/14 19969
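The listing shows that an assignment copies the stored value but not the format: time2 holds the same number as time (19969, the number of days from January 1, 1960 to September 3, 2014) but prints as a raw number. A sketch of one way to make time2 display as a date is to give it its own FORMAT statement:

```sas
data date;
  informat time mmddyy10.;
  infile "date.txt";
  input time;
  format time mmddyy8.;
  time2 = time;
  /* formats are not inherited through assignment, so set one explicitly */
  format time2 mmddyy10.;
run;

proc print data=date;
run;
```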
infile "/home/james/data.txt"; versus infile "~/home/james/data.txt";
Several students have gotten errors by trying to put the tilde symbol at the beginning of their path. When using SAS Studio, your files are uploaded to the directory /home/[name]/. If you specify the full path for the file, you don’t need to use the tilde. The tilde is shorthand for whatever your home directory is.
If you do use the tilde successfully, the log in SAS will show you the full path it used to find the file.
infile "/home/james/data.txt"; versus infile "~/home/james/data.txt";
As an example, if I’m in SAS Studio and I type
infile "~/home/james/data.txt";
then I might get an error in my log file like this:
58 data address;
59 infile "~/home/james/data.txt";
60 input name :$41. address :$41. city :$41.; run;
ERROR: Physical file does not exist,
/home/james/home/james/data.txt
What has happened is that SAS substituted /home/james/ for the tilde symbol, but this is redundant: you DON’T need both the tilde and the full path. The tilde approach is mostly useful on linux-type systems, so it might be safer to use the full path (but then the path might change depending on the computer you are on).
Finding the path for your file
To find the path for a file, look up the file in your system (for example using My Computer in Windows) and click on properties, and hopefully you’ll be able to find the path. For more details try googling “finding the path to a file windows”. There are slightly different ways depending on the version of Windows you are using.
In Mac OS X, you can go to a file in the Finder, click on the file while holding down the Control key, and click on “Get Info”, and this will show information about the file including the path (under Where:). Alternatively, if you are in the terminal program on Mac OS X or linux, you can type pwd which shows the path to the current directory.
Question 3
What if you have multiple observations per row? This can happen when saving a data file from Mac’s program Pages. E.g., you have two variables and they are saved like this in the data file.
patient time
24358 09/03/2001 24661 09/01/2003
25663 08/02/2002 27779 08/31/2002
Multiple observations per row
The previous data can be read in using a special bit of SAS syntax, @@:
data times;
informat time mmddyy10.;
infile "patientTimes.txt";
input ID $ time @@;
format time mmddyy10.;
run;
proc print data=times;
run;
The double at-signs tell SAS that after it reads in two values for the two variables, it should keep reading without skipping to the next line.
The @@ approach
A neat thing that you can do with the @@ approach is read in arbitrary text, one word at a time. So, I tried this with Romeo and Juliet, the Shakespeare play (older literature like this is often available online in plain text format). After very little searching, I found the plain text file at
http://sydney.edu.au/engineering/it/ matty/Shakespeare/texts/tragedies/romeoandjuliet
Romeo and Juliet as data
To read in the play as data, a first stab is just to read in every word and see what comes up. After reading in every word, we could make a frequency distribution of the words to see which words were the most common or which words occurred uniquely (occurred once but were never repeated). Here is code for reading in every word this way
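A minimal sketch of reading every word with @@, assuming the play is saved as Romeoandjuliet.txt in the working directory. The dataset name words and the 40-character maximum word length are assumptions:

```sas
data words;
  infile "Romeoandjuliet.txt";
  /* set the length before INPUT so long words are not cut at 8 characters */
  length word $ 40;
  /* @@ holds the line so each blank-separated word becomes an observation */
  input word $ @@;
run;

/* order=freq lists the most common words first */
proc freq data=words order=freq;
  tables word;
run;
```

Because blanks are the only delimiters here, punctuation stays attached to words, so "love" and "love," are counted separately.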
Romeo and Juliet as data: sorted word frequencies
The original file Romeoandjuliet.txt was modified to remove punctuation and tabs using several global search and replaces. The new file is Romeoandjuliet2.txt
Romeo and Juliet as data: sorted word frequencies
You can also output the word frequencies to a new dataset for further analyses.
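One way to do this is the out= option of the TABLES statement in proc freq, which writes each distinct value with its COUNT and PERCENT. The sketch assumes a hypothetical dataset words with one word per observation in a variable named word:

```sas
/* noprint suppresses the (very long) printed table */
proc freq data=words noprint;
  tables word / out=wordFreq;
run;

/* sort so the most frequent words come first */
proc sort data=wordFreq;
  by descending count;
run;

proc print data=wordFreq(obs=20);
run;
```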
Romeo and Juliet as data: reading in sentences
Based on previous output, it looks like sentences are tab-delimited, so I tried using tabs as a delimiter on the original input.
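A sketch of reading tab-delimited pieces of text, assuming the original file Romeoandjuliet.txt. The dataset name, variable name, and 200-character length are assumptions:

```sas
data sentences;
  /* '09'x is the tab character; with dlm= set, blanks no longer split fields */
  infile "Romeoandjuliet.txt" dlm='09'x;
  length sentence $ 200;
  input sentence $ @@;
run;

proc print data=sentences(obs=10);
run;
```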
Romeo and Juliet as data
What can you do with this type of information?
There are many questions you could ask with this type of data. Here are some examples:
- What is the correlation between word length and word frequency?
- If we read in two of Shakespeare’s plays, we could compare distributions of words and/or sentence lengths to see if there are differences. This could be used to investigate stylistic changes in Shakespeare’s plays over time (especially if we read in several of his plays).
- It has sometimes been hypothesized that some of Shakespeare’s plays weren’t written by Shakespeare, but by someone else. These distributions could be used to look for outliers in Shakespeare’s plays that suggest stylistic inconsistency.
Romeo and Juliet as data
Other questions:
- We could look for stylistic differences between types of plays, categorizing plays as comedies, histories, and tragedies. How often do certain words, such as “love” and “death”, occur in tragedies versus comedies?
- We could define pairwise distances between plays based on Euclidean distances between word frequencies. Let $w_{i1}, \dots, w_{in}$ be the frequencies of words $1, \dots, n$ in play $i$, where we have a list of all $n$ words occurring in all of Shakespeare’s plays. Then define
\[ d(\mathrm{play}_i, \mathrm{play}_j) = \sqrt{\sum_{k=1}^{n} (w_{ik} - w_{jk})^2} \]
This distance can be used to cluster Shakespeare’s plays using multidimensional scaling to see if there are clusters based on style (comedy, tragedy, history) or historical period.
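As a sketch of the distance computation, the example below uses proc distance (part of SAS/STAT) on made-up frequencies for four words in three hypothetical plays. The data values and all names are invented purely for illustration:

```sas
/* one row per play, one column per word frequency (made-up numbers) */
data playFreq;
  input play $ w1-w4;
  datalines;
playA 120 35 10 4
playB 100 50 12 0
playC 30 90 2 25
;
run;

/* pairwise Euclidean distances between the rows */
proc distance data=playFreq method=euclid out=playDist;
  var interval(w1-w4);
  id play;
run;

proc print data=playDist;
run;
```

The output dataset holds the matrix of pairwise distances, which could then be passed to a multidimensional scaling or clustering procedure.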
Textual analysis
Other types of questions and data you could try to address by analyzing word or sentence frequencies include the following:
- Can you distinguish between male and female authors in a particular genre (or say in e-mail, facebook posts, etc.)?
- You could try to track decline or increase of usage of certain words, such as “thee”, if you have documents from different time periods.
- Can you identify something about an author’s geographic information or age based on their use of particular words, slang, etc.? (E.g., in NM, people say “soda”, but in the Midwest, they often say “pop” or “soda pop”.)
- Instead of looking at individual words, you could look at pairs of adjacent words, so that certain pairs might be common for some authors but not others.