7/30/2019 R Guide - S. Mills
Data Mining 2011
SECTION 1
Software
1.1 Introduction to R http://cran.r-project.org
R is an interpreted statistical programming language similar to the commercial products S and Matlab. These languages are based on matrices/vectors and enable us to get results in a concise manner.
One of the best ways to develop an ability to program in R is to cut and paste sample code and then see what happens when you change parts of the code.
A good source for such code is in the examples supplied in the R documentation. The documentation can be obtained by going to http://cran.r-project.org/ and, in the left hand margin, clicking on Manuals under Documentation. This brings up the following:
Figure 1.
If we select An Introduction to R we will find a good manual for learning R.
While there are many functions for doing statistical calculations, most problems require only a small subset of them. For this reason the functions (and data sets) are broken into libraries or packages. If we select Packages we will see all the packages that are installed in our version of R. It is likely that there is only a subset of all the possible packages installed and if we want to see what else is available we can find many others at http://cran.r-project.org. These packages are the work of statisticians around the world (as is R itself) and are referred to as Contributed packages.
If we are not sure which of our packages might contain the functions that we need, we can type, for
example
??logistic
and the output
Help files with alias or concept or title matching logistic using
fuzzy matching:
MASS::polr Ordered Logistic or Probit Regression
nnet::multinom Fit Multinomial Log-linear Models
stats::glm Fitting Generalized Linear Models
stats::Logistic The Logistic Distribution
stats::SSfpl Self-Starting Nls Four-Parameter Logistic Model
stats::SSlogis Self-Starting Nls Logistic Model
survival::clogit Conditional logistic regression
Type ?PKG::FOO to inspect entries PKG::FOO, or TYPE?PKG::FOO for
entries like PKG::FOO-TYPE.
tells us the name of the package, the function within the package, and a brief description of what the function does.
1.1.1 Libraries (Packages)
Once we have found which library we need, we must load it before we use it. For example, if we wish to do Principal Component Analysis, we would find it in the stats package. In order to use it we type
library(stats)
and we can use any function in that library.
If we know the name of the function that we wish to use (and the library is loaded) we can type
?prcomp
This will bring up a window with a description of the function, its usage, its arguments, return values, and (usually) an example (or examples). Cutting and pasting these examples gives you an opportunity to explore the behaviour of the function. The documentation may also give references for the concepts behind the function and point to other functions that are related to it. Starting with version 2.4, the help window is a mini-browser for the entire package rather than just a text page for the requested function.
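As a small illustration of loading a library and then calling one of its functions, the following sketch runs prcomp on USArrests, a data set that ships with R. The choice of data set and the scale. = TRUE setting are assumptions made for this example and are not part of the notes.

```r
# stats is attached by default in a standard R session, so this call is harmless.
library(stats)

# Principal components of a built-in data set (50 rows, 4 numeric columns).
pc <- prcomp(USArrests, scale. = TRUE)

names(pc)      # the components of the returned list
summary(pc)    # proportion of variance explained by each component
```

The return value is a list, so its parts (for example pc$sdev or pc$rotation) can be picked out by name.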
1.1.2 Assignments, sequences
To get a start on using R, we can look at some simple examples.
If you wish to assign a number (or more complicated object) to a variable the usual method is to use <- (although in most contexts it is also possible to use =) as in
a <- 5
R gives no response, but if we type
a
[1] 5
R shows that a has been assigned the value 5.
A simpler method is to enclose the expression in parentheses
(a <- 5)
[1] 5
We can assign a vector to a variable. (In fact there are different ways to do this depending on the nature of the vector.)
(b <- c(1, 3, 2, 6, 5, 3, 2))
[1] 1 3 2 6 5 3 2
In this example, c concatenates the set of comma-delimited numbers into a vector.
In the following, the vector is created by repeating a number or set of numbers
(c <- rep(5, 7))
[1] 5 5 5 5 5 5 5
(c.1 <- rep(c(1,3,2),4))
[1] 1 3 2 1 3 2 1 3 2 1 3 2
The above shows a way of creating variable names with the use of the . character.
We can create vectors by sequencing operations
(d <- 1:6)
[1] 1 2 3 4 5 6
(e <- 6:1)
[1] 6 5 4 3 2 1
(f <- 1:10/10)
[1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
(g <- seq(10, 9, -.2))
[1] 10.0 9.8 9.6 9.4 9.2 9.0
We can perform operations on these vectors
a - b
[1] 4 2 3 -1 0 2 3
a*b
[1] 5 15 10 30 25 15 10
(No parentheses are needed because there is no assignment.)
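The arithmetic on a and b works because R recycles the shorter vector to the length of the longer one. The sketch below makes that explicit; the names diff.ab, prod.ab and same are hypothetical and are used only for this illustration.

```r
a <- 5
b <- c(1, 3, 2, 6, 5, 3, 2)

diff.ab <- a - b    # a is recycled to the length of b
prod.ab <- a * b

# Writing the recycling out by hand gives the same result.
same <- identical(a - b, rep(a, length(b)) - b)
```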
1.1.3 Matrices
a%*%b
Error in a %*% b : non-conformable arguments
The %*% represents matrix multiplication and the above tried to multiply a 1 × 1 and a 7 × 1 vector together.
We can use the t operator (transpose) to give a 1 × 1 and a 1 × 7.
a%*%t(b)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 5 15 10 30 25 15 10
sum(a%*%t(b))
[1] 110
We can also assign a matrix to a variable.
Simple matrices could be created as
(m.1 <- matrix(0, 3, 2))
[,1] [,2]
[1,] 0 0
[2,] 0 0
[3,] 0 0
(m.2 <- matrix(1:12, nrow = 3))
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
as well as some special matrices such as
(I.3 <- diag(1,3))
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
We could create a matrix from the vector b with
(h <- matrix(b, nrow = 1))
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 1 3 2 6 5 3 2
We can multiply matrices in different ways
h*h
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 1 9 4 36 25 9 4
(the same as b*b).
h%*%t(h)
[,1]
[1,] 88
(h.m <- t(h)%*%h)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 1 3 2 6 5 3 2
[2,] 3 9 6 18 15 9 6
[3,] 2 6 4 12 10 6 4
[4,] 6 18 12 36 30 18 12
[5,] 5 15 10 30 25 15 10
[6,] 3 9 6 18 15 9 6
[7,] 2 6 4 12 10 6 4
To access an entry (or submatrix)
h.m[5, 2]
[1] 15
h.m[1:3, 5:4] # Note the reverse order of the columns
[,1] [,2]
[1,] 5 6
[2,] 15 18
[3,] 10 12
In the above input, the # indicates that the remainder of the line is a comment.
h.m[c(3,5,1), c(6,2,7)]
[,1] [,2] [,3]
[1,] 6 6 4
[2,] 15 15 10
[3,] 3 3 2
It is also possible to change the values
h.m[c(3,5,1), c(6,2,7)] <- -10
h.m
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 1 -10 2 6 5 -10 -10
[2,] 3 9 6 18 15 9 6
[3,] 2 -10 4 12 10 -10 -10
[4,] 6 18 12 36 30 18 12
[5,] 5 -10 10 30 25 -10 -10
[6,] 3 9 6 18 15 9 6
[7,] 2 6 4 12 10 6 4
and determine some values
sum(h.m)
[1] 330
mean(h.m)
[1] 6.734694
apply(h.m, 1, sum) # Row sums
[1] -16 66 -2 132 40 66 44
apply(h.m, 1, mean) # Row means
[1] -2.2857143 9.4285714 -0.2857143 18.8571429 5.7142857 9.4285714 6.2857143
apply(h.m, 2, sum) # Column sums
[1] 22 12 44 132 110 12 -2
apply(h.m, 2, mean) # Column means
[1] 3.1428571 1.7142857 6.2857143 18.8571429 15.7142857 1.7142857 -0.2857143
etc.
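For sums and means in particular, base R also provides rowSums, colSums, rowMeans and colMeans, which give the same values as the apply calls. The sketch below rebuilds the modified matrix with outer(b, b), which is an assumption made to keep the example self-contained (it has the same entries as t(h) %*% h); the names row.tot and col.tot are hypothetical.

```r
b <- c(1, 3, 2, 6, 5, 3, 2)
h.m <- outer(b, b)                    # same entries as t(h) %*% h
h.m[c(3, 5, 1), c(6, 2, 7)] <- -10    # overwrite the 3 x 3 selection

row.tot <- rowSums(h.m)     # same values as apply(h.m, 1, sum)
col.tot <- colSums(h.m)     # same values as apply(h.m, 2, sum)
row.avg <- rowMeans(h.m)    # same values as apply(h.m, 1, mean)
```

apply remains the more general tool, since it accepts any function of a row or column.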
The elements of a matrix are not restricted to numerical values. For example,
matrix(letters[1:6], ncol = 3)
[,1] [,2] [,3]
[1,] "a" "c" "e"
[2,] "b" "d" "f"
Suppose we have a system of equations Ax = b with
(A <- matrix(c(3, 2, 5, 4, 1, 9, -1, 6, 8), 3, 3))
[,1] [,2] [,3]
[1,] 3 4 -1
[2,] 2 1 6
[3,] 5 9 8
and
(bT <- c(-1, 3, 2))
[1] -1 3 2
we can create the augmented matrix by binding bT to A
cbind(A, bT)
bT
[1,] 3 4 -1 -1
[2,] 2 1 6 3
[3,] 5 9 8 2
(We could also rbind.)
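The notes stop at the augmented matrix. If we actually want the solution of Ax = b, one option is the base R function solve; the sketch below is an addition to the notes, and the names x.sol and check are hypothetical.

```r
A  <- matrix(c(3, 2, 5, 4, 1, 9, -1, 6, 8), 3, 3)
bT <- c(-1, 3, 2)

x.sol <- solve(A, bT)            # solves A %*% x = bT for x

# Multiplying back should reproduce bT (up to rounding error).
check <- as.vector(A %*% x.sol)
```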
The entries in the matrix can be of different types
(Mixed <- matrix(c("Height", "Width", 25, 30), 2, 2))
[,1] [,2]
[1,] "Height" "25"
[2,] "Width" "30"
but the entries are all made the same type - in this case strings.
If we try to multiply the entries in the second column
Mixed[1,2]*Mixed[2,2]
Error in Mixed[1, 2] * Mixed[2, 2] : non-numeric argument to binary operator
It is possible to convert a string to a number with
as.numeric(Mixed[1,2])*as.numeric(Mixed[2,2])
[1] 750
1.1.4 Lists
A matrix can only be used for rectangular arrays. The list is a data structure that is more flexible.
We can create a simple list
(L.1 <- list(first.name = "John", last.name = "Smith", sn = 345678, mark = "A-"))
$first.name
[1] "John"
$last.name
[1] "Smith"
$sn
[1] 345678
$mark
[1] "A-"
We can refer to the components of the list by
L.1[2]
$last.name
[1] "Smith"
L.1[[2]]
[1] "Smith"
(Notice that the first form gives the name of the component.)
L.1$last.name
[1] "Smith"
It is better to refer to the components by name because that means that if the order is changed to list(last.name = "Smith", first.name = "John", sn = 345678, mark = "A-") we still get the correct values.
Many functions have return values in the form of lists.
We can also build a list by appending components. This is often useful in situations in which we are iterating
L.2 <- list()
L.2 <- c(L.2, list(1))
L.2 <- c(L.2, list("x"))
L.2 <- c(L.2, list("x^2/2!"))
(L.2 <- c(L.2, list("x^3/3!")))
[[1]]
[1] 1
[[2]]
[1] "x"
[[3]]
[1] "x^2/2!"
[[4]]
[1] "x^3/3!"
It is sometimes useful to remove things from the list structure and that can be done by
unlist(L.2)
[1] "1" "x" "x^2/2!" "x^3/3!"
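The advantage of referring to components by name, and the idea of growing a list while iterating, can be sketched as follows. The names reordered, by.name, by.position and terms are hypothetical and serve only this example.

```r
L.1 <- list(first.name = "John", last.name = "Smith", sn = 345678, mark = "A-")

# Reorder the components: access by name is unaffected, access by position is not.
reordered   <- L.1[c("last.name", "first.name", "sn", "mark")]
by.name     <- reordered$last.name    # still the last name
by.position <- reordered[[2]]         # now the first name

# Grow a list inside a loop, one component per iteration.
terms <- list()
for (k in 0:3) {
  terms[[k + 1]] <- k^2
}
```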
1.1.5 paste
In the above, we use c to concatenate a set of numbers into a vector. If we wish to concatenate strings (numbers get converted to strings), we use paste as in
paste("John", "Smith", 345678)
[1] "John Smith 345678"
Notice that there is a space between the names. We can change that with
(str.1 <- paste("John", "Smith", 345678, sep = ","))
[1] "John,Smith,345678"
(str.2 <- paste("Jane", "Jones", 234567, sep = ","))
[1] "Jane,Jones,234567"
In other words, sep controls what separates the quantities that are being pasted together (it defaults to a space but can be more than a single character). If we do not want the space, we can remove it with sep = "". On the other hand we may wish to insert some other character
(str.3 <- paste("D:", "DATA", "Data Mining R-code", sep = "/"))
[1] "D:/DATA/Data Mining R-code"
If we have a vector of strings, paste does nothing unless we tell it to collapse the vector (and what to put between the elements).
(str.4 <- paste(unlist(L.2), collapse = "+"))
[1] "1+x+x^2/2!+x^3/3!"
1.1.6 stringsplit
There may be times when we need to unpaste a string. We do this with
strsplit(str.1, ",")
[[1]]
[1] "John" "Smith" "345678"
which produces a list of one element, or
strsplit(rbind(str.1, str.2), ",")
[[1]]
[1] "John" "Smith" "345678"
[[2]]
[1] "Jane" "Jones" "234567"
which produces a list element for each row of the matrix.
If we try the same thing on
strsplit(str.4, "+")
[[1]]
[1] "1+x+x^2/2!+x^3/3!"
nothing happens. The reason is that + is a metacharacter in regular expressions - along with . \ | ( ) [ { ^ $ * ? - and we need to change it to an ordinary character with
strsplit(str.4, "\\+")
[[1]]
[1] "1" "x" "x^2/2!" "x^3/3!"
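An alternative to escaping a metacharacter is to ask strsplit to treat the separator as a literal string with fixed = TRUE. The sketch below assumes a string whose pieces are joined by +; the names by.regex, by.fixed and rejoined are hypothetical.

```r
str.4 <- "1+x+x^2/2!+x^3/3!"    # assumed form of the collapsed string

by.regex <- strsplit(str.4, "\\+")[[1]]               # escaped metacharacter
by.fixed <- strsplit(str.4, "+", fixed = TRUE)[[1]]   # literal separator

# paste with collapse undoes the split.
rejoined <- paste(by.fixed, collapse = "+")
```

The [[1]] picks the character vector out of the one-element list that strsplit returns.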
1.1.7 Control Structures
R has the usual control structures.
If we let
a <- 1
b <- 2
if (a < b) print("a < b")
[1] "a < b"
if (a < b)
  print("a < b")
else
Error: syntax error in else
  print("a >= b")
tells us that there is a syntax error in else. (At the top level the if statement is already complete at the end of its second line, so else cannot begin a new statement.)
The correct form is
if (a < b) {
  print("a < b")
} else {
  print("a >= b")
}
[1] "a < b"
Note that the { and } are used as the beginning and ending of blocks of code.
for (i in 1:5) {
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
Instead of 1:5 we could have things such as i in c(5, 3, 7, 2, -4, -9).
n <- 1
f <- 1
while (n < 5) {
  f <- f * n
  n <- n + 1
  print(f)
}
[1] 1
[1] 2
[1] 6
[1] 24
R also has repeat, break, and next.
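A minimal sketch of repeat, break and next follows. The task (collecting the squares of the odd numbers below 10) and the name odd.squares are assumptions chosen only to exercise the three keywords.

```r
odd.squares <- c()
i <- 0
repeat {
  i <- i + 1
  if (i > 9) break          # leave the loop entirely
  if (i %% 2 == 0) next     # skip even values and go to the next pass
  odd.squares <- c(odd.squares, i^2)
}
```

repeat has no condition of its own, so a break is needed somewhere in the body to stop it.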
1.1.8 Functions
In the preceding, we have used several of the built-in functions of R (paste, strsplit, print). We will often need to write our own functions, so we need to look at the structure of functions.
Suppose that we have a random set of numbers and wish to find the mean (there is, of course, a built-in function for this)
(num <- runif(20, -1, 1)) # 20 random numbers from a uniform distribution
[1] -0.439628823 0.233594691 -0.266024740 0.007379845 0.466801666
[6] -0.092201637 0.683620457 -0.003777758 0.040208154 0.276043486
[11] -0.236744290 0.258244825 0.262341596 -0.518254284 -0.584123774
[16] -0.300963114 0.129197590 0.941738119 0.073712909 -0.991614198
We could write the function as
my.mean <- function (x) {
  len <- length(x)
  sum(x)/len
}
Note that this says that the name my.mean has the function assigned to it. We can see what the function is if we enter the name
my.mean
function (x) {
  len <- length(x)
  sum(x)/len
}
We call the function in the usual manner
my.mean(num)
[1] -0.003022464
A type of function that we will often need is the recursive function. A common example that is used in recursion is the factorial function (it is NOT the best way to find the factorial, but it is easy to program and to understand).
If you are not familiar with recursion in programming, the following might help.
fact <- function (m) {
  if (m > 1) {
    f <- fact(m - 1) * m
  } else {
    return(1)
  }
  f
}
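One way to gain confidence in a recursive function is to compare it with a known answer. The sketch below repeats the definition of fact so that it is self-contained and checks it against the built-in factorial; the names ours and builtin are hypothetical.

```r
fact <- function(m) {
  if (m > 1) {
    f <- fact(m - 1) * m    # recursive step
  } else {
    return(1)               # base case stops the recursion
  }
  f
}

ours    <- sapply(1:6, fact)    # apply fact to each of 1, ..., 6
builtin <- factorial(1:6)
all.equal(ours, builtin)
```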
1.1.9 Debugging
We will use
debug(fact)
to enable us to trace through the factorial function.
(We can display the values of the variables within the function by typing the name of the variable. If we have a variable called n we need to type print(n), because n on its own is the debugger command for the next step - this is why I use m as the variable. If you have seen enough, you can use c to continue without debugging or Q to quit.)
The text in ( ) are comments on the process.
fact(5)
debugging in: fact(5) (tells us that we have just entered fact)
debug: {
  if (m > 1) { (the next block of the function to be evaluated)
    f <- fact(m - 1) * m
  }
  else {
    return(1)
  }
  f
}
Browse[1]> m;f;n (a command to print the value of m & f and take the next step)
[1] 5 (m)
[1] 0 (f)
debug: if (m > 1) { (next block - we stepped past the {)
  f <- fact(m - 1) * m
} else {
  return(1)
}
Browse[1]> m;f;n
[1] 5
[1] 0
debug: f <- fact(m - 1) * m (m > 1 so we execute this)
Browse[1]> m;f;n
[1] 5
[1] 0
debugging in: fact(m - 1) (we have stepped into fact again - with m = 4 - see below)
debug: {
  if (m > 1) {
    f <- fact(m - 1) * m
  }
  else {
    return(1)
  }
  f
}
Browse[1]> m;f;n
[1] 4 (m = 4)
[1] 0
debug: if (m > 1) {
  f <- fact(m - 1) * m
} else {
  return(1)
}
Browse[1]> m;f;n
[1] 4
[1] 0
debug: f <- fact(m - 1) * m (m > 1 so we execute this)
Browse[1]> m;f;n
[1] 4
[1] 0
debugging in: fact(m - 1) (we have stepped into fact again - with m = 3 - see below)
debug: {
  if (m > 1) {
    f <- fact(m - 1) * m
  }
  else {
    return(1)
  }
debug: return(1) (this time m = 1 so we do not take the fact path)
Browse[1]> n
exiting from: fact(m - 1) (this is the first time that we have done this
we return a 1 to the function that called this
and use that value as the multiplier of m = 2)
debug: f
Browse[1]> m;f;n
[1] 2
[1] 2
exiting from: fact(m - 1) (return from fact with the value 2 = 2 × 1
and use this as the multiplier of m = 3)
debug: f
Browse[1]> m;f;n
[1] 3
[1] 6
exiting from: fact(m - 1) (return from fact with the value 6 = 3 × 2 × 1
and use this as the multiplier of m = 4)
debug: f
Browse[1]> m;f;n
[1] 4
[1] 24
exiting from: fact(m - 1) (return from fact with the value 24 = 4 × 3 × 2 × 1
and use this as the multiplier of m = 5)
debug: f
Browse[1]> m;f;n
[1] 5
[1] 120
exiting from: fact(5) (return from fact with the value 120 = 5 × 4 × 3 × 2 × 1)
[1] 120
If you have a long complicated function with a small section that you wish to investigate, it is possible to insert the command browser() into the code. In this case, the use of c (continue) will allow the code to be executed until you reach the browser() command again.
This gives a brief look at some of the concepts in R. We will look at others as we need them.
1.2 Data Visualization in R
It is very important to gain a feel for the data that we are investigating.
One way to do this is by visualization.
We will do this by starting with a simple dataset that has some nice features.
Flea Beetles
This data is from a paper by A. A. Lubischew, "On the Use of Discriminant Functions in Taxonomy", Biometrics, Dec 1962, pp. 455-477.
There are three species of flea-beetles: C. concinna, Hp. heptapotamica, and Hk. heikertingeri, and 6 measurements on each.
tars1 - width of the first joint of the first tarsus in microns (the sum of measurements for both tarsi).
tars2 - the same for the second joint.
head - the maximal width of the head between the external edges of the eyes in 0.01 mm.
aede1 - the maximal width of the aedeagus in the fore-part in microns.
aede2 - the front angle of the aedeagus (1 unit = 7.5 degrees).
aede3 - the aedeagus width from the side in microns.
1.2.1 Reading Data
The first thing we have to do is get the data (for this data set we will read in a text file). The following illustrates how to read the file(s). (Note the use of the UNIX type path separator with / rather than \.)
drive <- "D:"
code.dir <- paste(drive, "DATA", "Data Mining R-Code", sep = "/")
data.dir <- paste(drive, "DATA", "Data Mining Data", sep = "/")
# Set the files to be read
(d.file <- paste(data.dir, "fleas", "flea.dat", sep = "/"))
[1] "D:/DATA/Data Mining Data/fleas/flea.dat"
(d.col <- paste(data.dir, "fleas", "flea.col", sep = "/"))
[1] "D:/DATA/Data Mining Data/fleas/flea.col"
We now have paths for two files: d.file points to the data and d.col points to the column headers (variable names).
The function scan can be used to read in the data. If the data has characters in it, we need to indicate that with "" as the second argument
(headers <- scan(d.col))
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'a real', got 'tars1'
(headers <- scan(d.col, ""))
Read 6 items
[1] "tars1" "tars2" "head" "aede1" "aede2" "aede3"
(n.var <- length(headers)) # The vector length gives the number of variables
[1] 6
For reading in the data we can do a scan to read the data into a vector, and then convert the vector to a matrix with n.var columns. Because data files are typically stored by rows, and R does the conversion to a matrix by column, we need to indicate that with byrow = T.
d.flea.s <- matrix(scan(d.file), ncol = n.var, byrow = T)
Read 444 items
d.flea.s[1:5,] # This displays the first 5 rows and all the columns.
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 191 131 53 150 15 104
[2,] 185 134 50 147 13 105
[3,] 200 137 52 144 14 102
[4,] 173 127 50 144 16 97
[5,] 171 118 49 153 13 106
Note that if the file contains a mixture of numbers and text, we need to use the "" in the scan
d.flea.str <- matrix(scan(d.file, ""), ncol = n.var, byrow = T)
Read 444 items
d.flea.str[1:5,]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "191" "131" "53" "150" "15" "104"
[2,] "185" "134" "50" "147" "13" "105"
[3,] "200" "137" "52" "144" "14" "102"
[4,] "173" "127" "50" "144" "16" "97"
[5,] "171" "118" "49" "153" "13" "106"
We can see that the numbers are read in as strings. In order to do arithmetic on them, they have to be converted to numbers.
The command as.numeric(...) will convert a string to a number
as.numeric(d.flea.str[1,1])
[1] 191
but when it is applied to an array it makes the array into a vector.
To correct this we could try
d.flea.s <- matrix(as.numeric(d.flea.str), ncol = n.var)
d.flea.s[1:5,]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 191 131 53 150 15 104
[2,] 185 134 50 147 13 105
[3,] 200 137 52 144 14 102
[4,] 173 127 50 144 16 97
[5,] 171 118 49 153 13 106
which appears to correct the problem.
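The scan steps can be tried without the flea files by writing a few rows to a temporary file. In the sketch below the temporary file is a hypothetical stand-in for flea.dat and holds only the first three rows shown above; the names tmp, by.row, as.text and back are likewise assumptions.

```r
# Hypothetical stand-in for flea.dat: the first three rows shown in the text.
tmp <- tempfile(fileext = ".dat")
writeLines(c("191 131 53 150 15 104",
             "185 134 50 147 13 105",
             "200 137 52 144 14 102"), tmp)
n.var <- 6

# Numeric read: scan returns one long vector, filled into the matrix row by row.
by.row <- matrix(scan(tmp, quiet = TRUE), ncol = n.var, byrow = TRUE)

# Character read: the same file, with "" as the second argument.
as.text <- matrix(scan(tmp, "", quiet = TRUE), ncol = n.var, byrow = TRUE)

# as.numeric drops the dimensions and returns the entries in column order,
# which is the order matrix() fills by default, so no byrow is needed here.
back <- matrix(as.numeric(as.text), ncol = n.var)

unlink(tmp)    # remove the temporary file
```

This is why the second conversion in the text works without byrow: the character matrix is already in the right shape, and it is unrolled and refilled column by column.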
A better way in many cases is to read the data as a table
d.flea <- read.table(d.file)
d.flea[1:5,]
V1 V2 V3 V4 V5 V6
1 191 131 53 150 15 104
2 185 134 50 147 13 105
3 200 137 52 144 14 102
4 173 127 50 144 16 97
5 171 118 49 153 13 106
Note the different form for the row and column headers.
The latter form is better in that it gives names to the rows and columns - not just positions.
Consider taking a sub-matrix
d.flea.s[c(5,10,15,20), c(2,4)]
[,1] [,2]
[1,] 118 153
[2,] 115 142
[3,] 130 147
[4,] 121 147
We have no way of identifying from where the entries came. On the other hand
d.flea[c(5,10,15,20), c(2,4)]
V2 V4
5 118 153
10 115 142
15 130 147
20 121 147
retains the information as to the rows and columns.
It is possible to improve on the first case by assigning row and column name information
dimnames(d.flea.s) <- list(1:dim(d.flea.s)[1], headers)
To see what is happening in this consider
dim(d.flea.s)
[1] 74 6
This gives the dimension of the matrix so 1:dim(d.flea.s)[1] creates a vector of integers from 1 to the number of rows in the matrix. dimnames assigns the values in the list as row and column labels.
d.flea.s[c(5,10,15,20), c(2,4)]
tars2 aede1
5 118 153
10 115 142
15 130 147
20 121 147
This is much more useful and is similar to the information displayed by the data frame version. In fact it is more informative because it uses the true header information. To improve the data frame we can replace the generic column headers by the correct values (the row headers are good) by
colnames(d.flea) <- headers
d.flea[c(5,10,15,20), c(2,4)]
tars2 aede1
5 118 153
10 115 142
15 130 147
20 121 147
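Once a matrix carries row and column labels, entries can be selected by label as well as by position. The sketch below builds a small matrix from the eight values displayed above; the names m and by.label are hypothetical.

```r
# The 4 x 2 sub-matrix shown above, filled column by column.
m <- matrix(c(118, 115, 130, 121, 153, 142, 147, 147), ncol = 2)
dimnames(m) <- list(c(5, 10, 15, 20), c("tars2", "aede1"))

by.label <- m["15", "aede1"]    # row labelled 15, column aede1
rownames(m)                     # the labels are stored as character strings
```

Note that the row labels are strings, so "15" selects the row labelled 15 whereas the number 15 would ask for the fifteenth row.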
Now that we have all the data, we can use further information to specify the species for the cases. For some purposes we may find it best to have characters to represent the species while for others, numerical values may be best. We will create both
flea.species <- c(rep("C",21),rep("Hp",22),rep("Hk",31))
species <- c(rep(1,21),rep(2,22),rep(3,31))
Here we have used information that was not contained in the data to set the species. This information is found in
(d.row <- paste(d.data.dir, d.basename, ".row", sep = ""))
(row.headers <- noquote(scan(d.row, "")))
Read 74 items
[1] Concinna Concinna Concinna Concinna Concinna Concinna Concinna
[8] Concinna Concinna Concinna Concinna Concinna Concinna Concinna
[15] Concinna Concinna Concinna Concinna Concinna Concinna Concinna
[22] Heptapot. Heptapot. Heptapot. Heptapot. Heptapot. Heptapot. Heptapot.
[29] Heptapot. Heptapot. Heptapot. Heptapot. Heptapot. Heptapot. Heptapot.
[36] Heptapot. Heptapot. Heptapot. Heptapot. Heptapot. Heptapot. Heptapot.
[43] Heptapot. Heikert. Heikert. Heikert. Heikert. Heikert. Heikert.
[50] Heikert. Heikert. Heikert. Heikert. Heikert. Heikert. Heikert.
[57] Heikert. Heikert. Heikert. Heikert. Heikert. Heikert. Heikert.
[64] Heikert. Heikert. Heikert. Heikert. Heikert. Heikert. Heikert.
[71] Heikert. Heikert. Heikert. Heikert.
A further refinement is to bind things together in what is called a data frame. As it happens the table version is a data frame.
is.data.frame(d.flea)
[1] TRUE
Many functions require the use of a data frame. (It might be best to also bind in the species information, but which one depends on what we are doing.)
df.flea <- data.frame(d.flea.s)
We can read in some functions that we will use.
source(paste(d.code.dir, "DispStr.r", sep = ""))
source(paste(d.code.dir, "pairs_ext.r", sep = ""))
source(paste(d.code.dir, "MakeStereo.r", sep = ""))
source reads code from a file just as though it had been typed or pasted into R. Now we can look at the data as something other than just numbers - i.e. visualization of the data.
1.2.2 Scatterplot Matrices
[Figures 2 and 3: scatterplot matrices of tars1, tars2, head, aede1, aede2 and aede3]
Figure 2. The first command gives the standard pairs plot. It shows duplicate plots.
Figure 3. The second command also shows histograms and correlations (the size of the number relates to the degree of correlation).
It should be noted that high correlations (visual or numeric) indicate some linear relationship between pairs of variables but low correlations tell us nothing - the data may be related in a nonlinear fashion or related in combination with other variables.
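A small made-up example shows how a low correlation can hide an exact relationship. The data below are synthetic and are not taken from the flea data set; the names y.linear, y.quad, r.linear and r.quad are hypothetical.

```r
x <- seq(-3, 3, by = 0.5)    # values symmetric about 0

y.linear <- 2 * x + 1        # exactly linear in x
y.quad   <- x^2              # exactly determined by x, but not linearly

r.linear <- cor(x, y.linear) # perfect linear relationship
r.quad   <- cor(x, y.quad)   # essentially zero despite the exact dependence
```

Here y.quad is completely determined by x, yet its correlation with x is zero up to rounding error, because the relationship is symmetric rather than linear.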
At this point we might consider how we can see what is going on within these functions. We will consider two closely related methods - debug(fun) and browser().
Consider
debug(panel.cor)
pairs(d.flea, upper.panel = panel.cor, diag.panel = panel.hist)
debugging in: lower.panel(as.vector(x[, j]), as.vector(x[, i]), ...)
debug: {
  usr <- par("usr")
  on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste(prefix, txt, sep = "")
  if (missing(cex.cor))
    cex.cor <- 0.8/strwidth(txt) * r
  text(0.5, 0.5, txt, cex = cex.cor)
}
(On start-up, the function that is being debugged is displayed. We can step through by typing n or next which executes the current command and displays the next one.)
Browse[1]> n
debug: usr <- par("usr")
Browse[1]> n
debug: on.exit(par(usr))
Browse[1]> n
debug: par(usr = c(0, 1, 0, 1))
Browse[1]> n
debug: r <- abs(cor(x, y))
Browse[1]> n
debug: txt <- format(c(r, 0.123456789), digits = digits)[1]
Browse[1]> x
[1] 131 134 137 127 118 118 134 129 131 115 143 131 130 133 130 131 127 126 140
[20] 121 136 141 119 130 113 121 115 127 123 119 120 131 127 116 123 135 132 131
[39] 116 121 146 119 127 107 122 114 131 108 118 122 127 125 124 129 126 122 116
[58] 123 122 123 109 124 114 120 114 119 111 112 130 120 119 114 110 124
Browse[1]> y
[1] 191 185 200 173 171 160 188 186 174 163 190 174 201 190 182 184 177 178 210
[20] 182 186 158 146 151 122 138 132 131 135 125 130 130 138 130 143 154 147 141
[39] 131 144 137 143 135 186 211 201 242 184 211 217 223 208 199 211 218 203 192
[58] 195 211 187 192 223 188 216 185 178 187 187 201 187 210 196 195 187
Browse[1]> r
[1] 0.02634653
Browse[1]> n
debug: txt <- paste(prefix, txt, sep = "")
Browse[1]> n
debug: if (missing(cex.cor)) cex.cor <- 0.8/strwidth(txt) * r
Browse[1]> txt
[1] "0.026"
Browse[1]> Q
To stop the debugging, we can type Q.
The next time we call a function that has been set for debugging, it will again be debugged. To turn the debugging off, we type
undebug(panel.cor)
If we wish to debug a function that we have written, we can put a browser() statement inside the function body.
The advantage of this is that when we have several spots at which we want to look at the behaviour, we can type c to continue the execution until we hit the next browser() statement - quite useful with loops.
In this case we know the species corresponding to each case so it is instructive to look at the relationship among the variables and species by the use of colour, col. The colour can be given by a number or by a name such as "red".
pairs(d.flea, col = species + 1)
[Figure 4: scatterplot matrix of tars1, tars2, head, aede1, aede2 and aede3, coloured by species]
Figure 4.
We notice that in several plots the species are not mixed together e.g. tars1 vs. aede2, tars1
vs. aede1 etc.
1.2.3 Conditional Plots
To investigate some of the more complicated relationships we can look at conditional plotting.
There is more than one version of this. The first one is part of the standard package. Note the arguments. The aede3 ~ tars1 | aede1 says that we are plotting aede3 against tars1 conditioned on aede1. That is, we will get several plots corresponding to different ranges of aede1. Note that ~ is frequently used to indicate a formula. The data = df.flea allows us to use the variable names in the formula because those names are part of the data frame. (The overlap = 0.1 will be explained later.)
coplot(aede3 ~ tars1 | aede1, data = df.flea)
coplot(aede3 ~ tars1 | aede1, data = df.flea, overlap = 0.1)
[Figures 5 and 6: coplots of aede3 against tars1, given aede1]
Figure 5.
Figure 6.
In both the above figures, the lower six panels show the pairwise plots for aede3 against tars1 for different ranges of aede1 as shown in the upper panel. The defaults for this function are to select 6 different subsets of the third variable with an equal number of cases in each. In addition an overlap of 0.5 is allowed. The second example has reduced the overlap to 0.1. We get a different view if we colour our points by species.
coplot(aede3 ~ tars1 | aede1, data = df.flea, overlap = 0.1, col = species + 1, pch = 16)
[Figure 7: coplot of aede3 against tars1, given aede1, coloured by species]
Figure 7.
This further illustrates the relationships noted earlier.
Another version of this is found in the lattice package, but before using this we might consider a function that will allow us to condition on fixed interval lengths rather than fixed count.
library(lattice)
equal.space <- function(data, count) {
  # range(data) gives the max and min of the variable data.
  # diff takes the difference between the two values so
  # diffs gives the width of each interval.
  diffs <- diff(range(data))/count
  # min(data) + diffs*(0:(count-1)) gives the starting values
  # for the intervals.
  # min(data) + diffs*(1:count) gives the ending values
  # for the intervals.
  # cbind treats two (or more) vectors as column vectors
  # and binds them as columns of a matrix.
  intervals <- cbind(min(data) + diffs*(0:(count-1)),
                     min(data) + diffs*(1:count))
  # shingle takes the interval structure and the data
  # and breaks the data into the appropriate groups.
  return (shingle(data, intervals))
}
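The interval arithmetic inside equal.space can be checked on its own, without lattice, using a few made-up numbers. The values in x and the names width, lower and upper below are assumptions chosen for this sketch.

```r
x <- c(2, 4, 7, 10, 12)    # made-up data with range 2 to 12
count <- 5

width <- diff(range(x)) / count    # width of each interval

# Starting and ending values of the equal-width intervals, as two columns.
intervals <- cbind(lower = min(x) + width * (0:(count - 1)),
                   upper = min(x) + width * (1:count))
intervals
```

Each interval ends where the next begins, so the intervals cover the whole range of the data without gaps.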
The following uses the conditional plotting from the lattice package with
a) equal cases in each grouping and
b) equal spacing in each grouping.
C1 <- equal.count(df.flea$aede1, number = 6, overlap = 0.1)
xyplot(aede3 ~ tars1 | C1, data = df.flea, pch = 19)
C2 <- equal.space(df.flea$aede1, 6)
xyplot(aede3 ~ tars1 | C2, data = df.flea, pch = 19)
[Figures 8 and 9: lattice plots of aede3 against tars1, conditioned on C1 and on C2]
Figure 8. Equal cases in each grouping.
Figure 9. Equal spacing in each grouping.
This version does not show the values of the conditioning variable.
It is also possible to condition against two variables, but before doing that we will create a synthetic data set. For now we will not go into detail about the nature of the data.
source(paste(d.code.dir, "ellipseOutline.r", sep = ""))
ec.t1 <- NULL
for (t in -20:20) {
  ec.t1 <- rbind(ec.t1, cbind(ellipse.outline(20,20,10,5,t,0,(200-t^2)/10),t))
}
ec.t1 <- data.frame(ec.t1[sample(dim(ec.t1)[1], dim(ec.t1)[1]),])
We can plot the scatterplot matrix
pairs(ec.t1, upper.panel = panel.cor, diag.panel = panel.hist)
[Figure 10: scatterplot matrix of x, y, z and t for the synthetic data, with histograms and correlations]
Figure 10.
In the lines that follow, note the use of the $ symbol. In this case it is used to reference the columnsof a data frame; in other cases it references parts of other objects.
X - equal.space(ec.t1$x, 25)
Y - equal.space(ec.t1$y, 25)Z - equal.space(ec.t1$z, 25)
T - equal.space(ec.t1$t, 25)
In the following, note the use of aspect. R, like many other languages, tries to use as much of the
plotting region as possible. While this works well if the data has no intrinsic shape, it is a severeproblem in other situations. For example, if you try to plot an ellipse you will get a circle. To avoidthis you need to force the plot routine to use equal scales along the axes. This is often done by use of
R & Data Visualization 24 Mills 2011
7/30/2019 R Guide - S. Mills
25/48
Data Mining 2011
the aspect ratio (as below) but different plot routines use different methods (and for some it is up to you to find a way to force the appropriate scaling). Note the use of x11() as a method of creating another plot window rather than plotting over the current one.
xyplot(z ~ x | Y, data = ec.t1, pch = ".", main = "z ~ x | Y",
       aspect = diff(range(ec.t1$z))/diff(range(ec.t1$x)))
x11()
xyplot(y ~ x | Z, data = ec.t1, pch = ".", main = "y ~ x | Z",
       aspect = diff(range(ec.t1$y))/diff(range(ec.t1$x)))
x11()
xyplot(z ~ y | X, data = ec.t1, pch = ".", main = "z ~ y | X",
       aspect = diff(range(ec.t1$z))/diff(range(ec.t1$y)))
x11()
xyplot(z ~ x | T, data = ec.t1, pch = ".", main = "z ~ x | T",
       aspect = diff(range(ec.t1$z))/diff(range(ec.t1$x)))
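The aspect value used above is simply the ratio of the vertical data range to the horizontal data range. A minimal sketch with a made-up ellipse (semi-axes 40 and 20 are assumptions chosen for illustration) shows the calculation; the plotting calls use base graphics, where `asp = 1` plays the corresponding role.

```r
# Points on an ellipse with semi-axes 40 (x) and 20 (y)
theta <- seq(0, 2 * pi, length.out = 361)
x <- 40 * cos(theta)
y <- 20 * sin(theta)

# Height/width ratio that gives one unit on y the same length as one unit on x
asp_ratio <- diff(range(y)) / diff(range(x))

if (interactive()) {
  plot(x, y, type = "l")            # default scaling: shape depends on the window
  plot(x, y, type = "l", asp = 1)   # base graphics: asp = 1 forces equal scales
}
```

Here the ratio comes out at one half, so a lattice panel that is half as tall as it is wide would show the ellipse in its true shape.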
Figure 11. z ~ x | Y Figure 12. y ~ x | Z
Figure 13. z ~ y | X Figure 14. z ~ x | T
Z5 <- equal.space(ec.t1$z, 5)
T5 <- equal.space(ec.t1$t, 5)
xyplot(z ~ x | T5*Z5, data = ec.t1, main = "z ~ x | T5*Z5", pch = ".",
       aspect = diff(range(ec.t1$z))/diff(range(ec.t1$x)))
Figure 15. z ~ x | T5*Z5
r <- 1
c <- 1
for (i in -20:15) {              # Loop through i from -20 to 15
  ind <- ec.t1$t == i            # Get the cases for which the t value = i
  X <- ec.t1$x[ind]              # And the corresponding x,y,z values
  Y <- ec.t1$y[ind]
  Z <- ec.t1$z[ind]
  # In the following - (see ?cloud)
  #   print    - displays the plot
  #   cloud    - a function that creates a cloud of points,
  #              with xlim, ylim, zlim (the range of values on the axes)
  #              set to the maximum range (x) to give proper scaling.
  #   subpanel - the function used to plot the points.
  #   groups   - allows classes to be identified.
  #   screen   - sets the viewpoint.
  #   split    - c(col, row, cols, rows)
  #   more     - TRUE if more plots are to follow on the same page
  print(cloud(Z ~ X*Y, xlim = range(ec.t1$x),
              ylim = range(ec.t1$x), zlim = range(ec.t1$x),
              subpanel = panel.superpose, groups = rep(1, dim(ec.t1)[1]),
              screen = list(z = 10, x = -80, y = 0), data = ec.t1),
        split = c(c, r, 6, 6), more = TRUE)
  c <- c + 1
  if (c %% 6 == 1) {             # Remainder mod 6
    c <- 1
    r <- r + 1
  }
}
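The column and row counters in the loop above walk through a 6 by 6 grid, one cell per value of i (there are 36 values in -20:15). The sketch below reproduces only that bookkeeping, without any plotting, and checks it against a direct calculation; the names `col_i`, `row_i`, `pos` and `pos2` are introduced for this illustration.

```r
n_col <- 6
col_i <- 1
row_i <- 1
pos <- matrix(NA_real_, nrow = 36, ncol = 2,
              dimnames = list(NULL, c("col", "row")))
for (k in 1:36) {
  pos[k, ] <- c(col_i, row_i)     # cell used for the k-th panel
  col_i <- col_i + 1
  if (col_i %% n_col == 1) {      # col_i has just passed the last column
    col_i <- 1
    row_i <- row_i + 1
  }
}

# The same positions from integer arithmetic on the panel index
k <- 1:36
pos2 <- cbind(col = (k - 1) %% n_col + 1, row = (k - 1) %/% n_col + 1)
```

The two versions agree; the counter form matches the loop in the text, while the arithmetic form is convenient when the panel index is already available.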
Figure 16.
1.3 Data Visualization in Ggobi
For high dimensional data, dynamic graphics will reveal more relationships.
For that purpose we will use Ggobi. This is a package that may be called from R or used alone.
library(rggobi)
g <- ggobi(d.flea)
Figure 17. GGobi console Figure 18. Scatterplot
1.3.1 Scatterplot Matrix
Ggobi starts with a console and a simple scatterplot as shown, although we can also have a scatterplot matrix display.
display(g[1], "Scatterplot Matrix")
or [Display][New scatterplot matrix], from the Ggobi console.
Figure 19. Scatterplot Matrix
1.3.2 Grand Tour
In order to investigate the data, we will start with a grand tour
display(g[1], "2D Tour")
or [View][2D Tour]
Figure 20. Figure 21. 2D Tour
This shows the console and the opening display. Note the circle with the lines. This represents the
projection of the six axes on the two dimensional display. The process used in the grand tour is that a projection direction is selected and then a new direction is selected and the projection is changed smoothly in that direction. This allows the user to see the data from all directions (although it is
possible to move the projected direction by use of the mouse).
This gives a 2D tour of the 6 dimensional data. The portion of each variable in the view is shown by the representation of the axis in the bottom corner (and on the console).
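The idea of moving smoothly from one 2D projection to another can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not the algorithm Ggobi uses: the data are synthetic, the frames are random, and the intermediate frame is obtained by blending and re-orthonormalising rather than by Ggobi's own interpolation. The function names `random_frame` and `step_frame` are introduced here.

```r
set.seed(42)
p <- 6
X <- matrix(rnorm(50 * p), ncol = p)     # 50 synthetic cases in 6 dimensions

# Random orthonormal 2-D frame: QR-orthogonalise a random p x 2 matrix
random_frame <- function(p) qr.Q(qr(matrix(rnorm(p * 2), nrow = p)))

F0 <- random_frame(p)                    # current projection
F1 <- random_frame(p)                    # target projection

# One intermediate step: blend the frames, then re-orthonormalise
step_frame <- function(F0, F1, s) qr.Q(qr((1 - s) * F0 + s * F1))

Fs   <- step_frame(F0, F1, 0.25)
proj <- X %*% Fs                         # 50 x 2 coordinates to plot
axes <- Fs                               # row j = projection of axis j
```

Row j of `axes` gives the two screen coordinates of the j-th variable's axis, which corresponds to the lines inside the circle described above.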
1.3.3 Brushing
As the tour runs, 3 clusters will appear. When they do, you can click [Pause] and apply brushing ([Interaction][Brush]) to group cases.
Figure 22. Brushing
You can change the colour and glyph (symbol) of data points.
The process involves selecting the colour and glyph and moving the brush over the points (we can
select [Persistent] - if not selected, the brushing is transient).
We can or [Choose color & glyph] as shown below.
Figure 23. Select a red (3rd smallest size)
Figure 24. One cluster brushed
- red box for point brushing
Figure 25. Other clusters brushed
In Figure 24, one apparent cluster is set to a red plus, while in Figure 25 another cluster is a yellow cross and the third is a green circle.
We can now return to a grand tour and see if the points of the same colour move together. Notice
that for much of the time, the projections of the clusters are mixed together.
If we feel that we have the clusters properly coloured, we can again [Pause] and from R find out
what colours the points are
(old.col <- glyph_colour(g[1]))
F F F F F F F F F F C F F F F F F F F F F C C C C C C C C C C C C C C C C C C C C C C A
9 9 9 9 9 9 9 9 9 9 5 9 9 9 9 9 9 9 9 9 9 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 3
A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
It turns out that we know the species of the flea beetles so we can compare the clustering that we
observed with the true classification.
(noquote(rbind(flea.species, old.col)))
F F F F F F F F F F C F F F F F F F F F F C C C C C C C C C C C
flea.species C C C C C C C C C C C C C C C C C C C C C Hp Hp Hp Hp Hp Hp Hp Hp Hp Hp Hp
old.col 9 9 9 9 9 9 9 9 9 9 5 9 9 9 9 9 9 9 9 9 9 5 5 5 5 5 5 5 5 5 5 5
C C C C C C C C C C C A A A A A A A A A A A A A A
flea.species Hp Hp Hp Hp Hp Hp Hp Hp Hp Hp Hp Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk
old.col 5 5 5 5 5 5 5 5 5 5 5 3 3 3 3 3 3 3 3 3 3 3 3 3 3
A A A A A A A A A A A A A A A A A
flea.species Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk
old.col 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
(notice that the 11th case is of class C but has the same colour as the Hp cases, while all the others have the
colour corresponding to their class of flea beetle) and then reset the colours and glyphs for the next part.
glyph_colour(g[1]) <- rep(2, 74)   # 74 purple
glyph_type(g[1]) <- rep(4, 74)     # 74 circles
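The comparison of brushed colours with the known species can also be summarised as a cross-tabulation. The sketch below does not need Ggobi: it rebuilds the two vectors from the values printed above (21 C, 22 Hp and 31 Hk cases, with colours 9, 5 and 3), so it is an illustration of the comparison rather than output from a live session.

```r
# Species labels and brushed colours as printed above
flea.species <- rep(c("C", "Hp", "Hk"), times = c(21, 22, 31))
old.col <- c(rep(9, 10), 5, rep(9, 10), rep(5, 22), rep(3, 31))

# Rows: true species; columns: colour assigned by brushing
tab <- table(species = flea.species, colour = old.col)
print(tab)

# Cases whose colour differs from the majority colour of their species
majority <- colnames(tab)[apply(tab, 1, which.max)]
names(majority) <- rownames(tab)
mismatch <- unname(which(as.character(old.col) != majority[flea.species]))
```

A table with a single off-pattern count, as here, indicates that the visual clustering recovered the species almost perfectly; `mismatch` gives the case numbers that were brushed with another group's colour.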
1.3.4 Parallel Coordinates Plot
Another type of plot is the Parallel Coordinates Plot. This can be a good method for investigating high dimensional data. Consider the point (3, 1, 2) as shown in Figure 26 below. In parallel coordinates, it would be as shown in Figure 27.
Figure 26. The point (3, 1, 2) Figure 27. The point (3, 1, 2) in parallel coordinates
If we want 4 (or more) dimensions, we need only add another (or more) parallel line(s).
display(g[1], "Parallel Coordinates Display")
or [Display][New parallel coordinates plot].
Figure 28. Parallel Coordinates Plot
Each data point has a value on each of the axes which are plotted vertically rather than at right
angles to each other.
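The construction can be sketched in base R without Ggobi. In the example below the first row is the point (3, 1, 2) discussed above; the other two rows and the column names are made up for illustration.

```r
# Three points in 3 dimensions; the first row is the point (3, 1, 2)
pts <- rbind(c(3, 1, 2),
             c(1, 2, 3),
             c(2, 3, 1))
colnames(pts) <- c("x1", "x2", "x3")

# In parallel coordinates, axis j is the vertical line at horizontal position j,
# and a case becomes the polyline through (j, value on axis j)
axis_pos <- seq_len(ncol(pts))
polyline <- cbind(axis = axis_pos, value = pts[1, ])

if (interactive()) {
  # matplot draws one line per column, so transpose to get one line per case
  matplot(axis_pos, t(pts), type = "b", pch = 1, lty = 1, xaxt = "n",
          xlab = "", ylab = "value")
  axis(1, at = axis_pos, labels = colnames(pts))
}
```

Adding a fourth variable only adds a fourth column to `pts` and a fourth vertical axis, which is why the display scales to high dimensions.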
1.3.4.1 Parallel Coordinates Brushing
The value of brushing is greatly enhanced when we have more than one display. Below we see a scatterplot with 5 points being brushed and we see that the parallel coordinates display has 5 points (lines) coloured yellow. This shows which points correspond.
Figure 29. Brushing
1.3.4.2 Parallel Coordinates Linked Brushing
Figure 30. Linked brushing
For the next part, we set the colours to correspond to the species
glyph_colour(g[1]) <- c(rep(6,21), rep(4,22), rep(9,31))
It is also possible to get information about the clustering from parallel coordinates. We will start by moving the axes: put the mouse on the white frame and drag aede3 to the first position (a corner will appear as the cursor).
Figure 31.
Repeat until we have the axes arranged as below.
Figure 32.
When we look at this we see that there seem to be values of aede3 and tars1 that split the data.
1.3.4.3 Parallel Coordinates Identification
Before we find the values, we will introduce [Tools][DataViewer] which allows us to look at our
data.
Figure 33.
Next we will use [Interaction][Identify]
Figure 34. Linked identification
Figure 35. Linked identification
Figure 36. Identification using record label Figure 37. Linked identification
Figure 38. Identification using data value (tars1)
Figure 39. Linked identification using data value (tars1)
In the above we see that if tars1 < 160 we have one group (blue) split from the other two.
Figure 40. Linked identification using data value (aede3)
In the above we see that if aede3 < 95 we have one group (yellow) almost split from the other two.
It appears that we can do an almost perfect split with this information. (This forms the basis for a recursive splitting process that we will see later.)
We can do similar things from R.
cols <- rep(6, 74)
cols[which(d.flea[,6] < 95)] <- 9
cols[which(d.flea[,1] < 160)] <- 4
glyph_colour(g[1]) <- cols
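The same two-rule colouring can be tried on a small made-up data frame. The measurements below are hypothetical values chosen only to exercise the rules, and the comparisons are assumed to be "less than"; the thresholds 95 and 160 are the ones read off the displays above.

```r
# Hypothetical measurements for six beetles (illustrative values only)
demo <- data.frame(tars1 = c(190, 175, 140, 130, 200, 210),
                   aede3 = c(105, 102, 108, 110,  80,  85))

cols <- rep(6, nrow(demo))                 # default colour
cols[which(demo$aede3 < 95)]  <- 9         # rule 1: small aede3
cols[which(demo$tars1 < 160)] <- 4         # rule 2: small tars1 (applied last, so it wins)
```

Because the rules are applied in sequence, a case satisfying both conditions ends up with the colour of the last rule; this ordering is what a recursive splitting procedure makes explicit.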
1.3.5 Stereo
An interesting view of the data can be obtained by looking at the data from slightly shifted viewpoints
make.Stereo(d.flea[,c(1,5,6)], species, Main = "Flea beetles", asp = "F",
            Xlab = "tars1", Ylab = "aede2", Zlab = "aede3")
Figure 41. Stereo projection
make.Stereo(d.flea[,c(6, 5, 1)], species, Main = "Flea beetles", asp = "F",
            Zlab = "tars1", Ylab = "aede2", Xlab = "aede3")
Figure 42. Stereo projection
1.3.6 RGL
A relatively new package is
library(rgl)
which allows interactive visualization.
Consider
plot3d(d.flea[,1], d.flea[,5], d.flea[,6], xlab = "tars1", ylab = "aede2", zlab = "aede3",
       col = species + 1, size = 0.5, type = "s")
Now, apply the following code
for (j in seq(0, 90, 10)) {
  for (i in 0:360) {
    rgl.viewpoint(i, j)
  }
}
The rgl.viewpoint(i, j) changes the point from which you view the object, making it
appear that the object is rotating. The first argument (i in this case) is the spherical coordinates angle θ, while the second is φ. Hence, this code rotates the object about the z-axis (i in
0:360) for φ from 0 to π/2.
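The sequence of viewpoints generated by the nested loops can be listed without opening an rgl window. This is a small sketch using base R only; the data frame name `views` is introduced here, and the angles are taken to be in degrees, as in the loop.

```r
# All (theta, phi) pairs visited by the nested loops, in the order visited
# (expand.grid varies its first argument fastest, like the inner loop over i)
views <- expand.grid(theta = 0:360, phi = seq(0, 90, 10))

# The same angles in radians
views$theta_rad <- views$theta * pi / 180
views$phi_rad   <- views$phi   * pi / 180
```

There are 361 values of θ for each of the 10 values of φ, so the animation consists of 3610 views, with φ rising from 0 to π/2.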
It is also possible to hold the left mouse button down to rotate the image, or to hold the right mouse button down (or use the mouse wheel) to zoom in/out.
1.4 Examples
1.4.1 Randu
Randu is a random number generator that had a slight flaw.
d.file <- paste(data.dir, "randu.dat", sep = "/")
d.randu <- read.table(d.file)
pairs(d.randu, upper.panel = panel.cor, diag.panel = panel.hist)
Figure 43. Scatterplot matrix for randu
Looking at the scatterplot matrix everything seems random but in Ggobi...
g <- ggobi(d.randu)
with [View][2D tour]
Figure 44. randu projection Figure 45. An interesting projection
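The flaw can also be checked numerically. The sketch below generates numbers with the RANDU recurrence x[n+1] = 65539 x[n] mod 2^31 and forms consecutive triples; whether randu.dat was produced in exactly this way (seed, number of triples) is an assumption, and the function name `randu_seq` is introduced here.

```r
# RANDU: x[n+1] = 65539 * x[n] mod 2^31, started from an odd seed.
# Doubles are exact here because 65539 * 2^31 < 2^53.
randu_seq <- function(n, seed = 1) {
  x <- numeric(n)
  x[1] <- seed
  for (k in 2:n) x[k] <- (65539 * x[k - 1]) %% 2^31
  x
}

x <- randu_seq(3000)
u <- x / 2^31                                   # values in (0, 1)
triples <- matrix(u, ncol = 3, byrow = TRUE)    # consecutive triples as rows

# Since 65539 = 2^16 + 3, x[k+2] = 6 x[k+1] - 9 x[k] (mod 2^31), so
# 9*u1 - 6*u2 + u3 is always an integer and the triples lie on a few planes
plane_id <- 9 * triples[, 1] - 6 * triples[, 2] + triples[, 3]
```

Every value of `plane_id` is a whole number, so the points fall on a small number of parallel planes in three dimensions. The pairwise scatterplots cannot show this, but a suitable projection in the tour lines the planes up edge-on.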
1.4.2 Prim7
Prim7 contains 500 observations taken from a high energy particle physics scattering experiment which yields four particles. The reaction can be described completely by seven (7) independent
measurements. The important features of the data are short-lived intermediate reaction stages which appear as protuberant arms in the point cloud.
prim <- read.table(paste(data.dir, "prim7.dat", sep = "/"))
g <- ggobi(prim)
Figure 46. Prim7
We will look at this by looking at clusters that have been found in the past. It would be possible to
do this by careful brushing and observation but the following sets the appropriate colours for the data.
new.col <- rep(1, 500)
col.2 <- c(2,3,4,14,15,16,17,18,21,23,30,34,37,41,43,46,49,50,53,54,55,
57,58,63,65,66,69,70,72,73,74,75,77,78,79,85,86,88,90,91,92,
93,94,95,99,100,102,104,105,106,107,109,110,113,114,116,120,
121,124,125,126,127,129,130,133,139,140,141,143,145,147,150,
152,153,157,158,159,160,161,164,166,169,172,175,176,177,178,
180,185,194,195,198,200,203,204,209,210,211,212,218,219,220,
222,223,226,228,229,233,234,236,238,240,242,244,245,246,248,
249,252,253,257,259,263,264,265,266,267,269,270,273,277,278,
280,281,282,283,284,286,292,294,296,297,300,305,310,311,314,
315,317,323,331,332,333,334,335,341,342,343,346,351,356,359,
360,361,362,365,370,372,374,375,377,378,379,380,383,386,388,
389,390,391,393,397,398,400,402,403,405,407,408,413,414,415,
417,418,419,420,425,427,428,429,430,432,433,434,436,437,438,
440,444,445,447,448,452,453,454,455,456,463,465,467,470,471,
473,476,477,478,480,481,482,484,485,487,488,489,490,491,494,497)
col.3 <- c(11,20,27,33,47,51,60,61,62,98,115,118,119,132,155,186,191,
193,202,205,207,208,213,225,230,231,232,235,239,243,250,251,
268,272,295,312,316,338,339,345,349,354,358,364,366,376,381,
395,401,421,422,446,460,496)
col.5 <- c(5,8,13,19,26,32,39,48,56,71,81,96,111,136,137,144,149,156,
162,165,188,199,201,216,255,262,274,279,289,291,301,320,322,
326,327,329,344,348,353,363,367,369,384,399,404,406,411,423,
441,442,443,469,474,479,483,495,499,500)
col.8 <- c(7,29,31,36,89,101,117,131,138,154,173,187,190,192,196,197,
206,247,254,256,258,287,290,298,299,309,324,325,385,387,464)
col.9 <- c(1,12,22,24,25,44,45,52,64,83,103,108,122,123,134,135,146,151,
167,168,170,174,179,181,184,221,224,237,261,271,285,293,304,
306,307,308,319,328,337,352,355,357,368,396,410,424,426,435,
439,449,451,458,461,462,466,472,475,493)
Now set all of the points belonging to one cluster to the colour with value 2.
new.col[col.2] <- 2
glyph_colour(g[1]) <- new.col
Figure 47. Prim7 with first cluster brushed Figure 48. Prim7 with 2 clusters brushed
Now run through the other clusters.
new.col[col.3] <- 3
glyph_colour(g[1]) <- new.col
new.col[col.5] <- 5
glyph_colour(g[1]) <- new.col
Figure 49. Prim7 with 3 clusters brushed Figure 50. Prim7 with 4 clusters brushed
new.col[col.8] <- 8
glyph_colour(g[1]) <- new.col
new.col[col.9] <- 9
glyph_colour(g[1]) <- new.col
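The colouring above relies on assigning to a vector by position. A minimal sketch with ten cases and two made-up clusters (`grp.a` and `grp.b` are hypothetical index vectors, not taken from the Prim7 lists) shows the mechanism, together with a check that no case is claimed by two clusters.

```r
n <- 10
new.col <- rep(1, n)               # every case starts with colour 1

grp.a <- c(2, 3, 7)                # hypothetical cluster memberships
grp.b <- c(1, 9)

# Each case should belong to at most one cluster
stopifnot(length(intersect(grp.a, grp.b)) == 0)

new.col[grp.a] <- 2                # recolour by position
new.col[grp.b] <- 3
```

Cases that appear in none of the index vectors keep the starting colour, which is how unclustered points remain visually distinct.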
Figure 51. Prim7 with all clusters brushed Figure 52.
It is also possible to put lines on the data to outline possible structures. The lines are those determined by researchers
prim.lin <- read.table(paste(data.dir, "prim7.lines", sep = "/"))
edges(g[1]) <- prim.lin
We can add lines by use of [Interaction][Edit Edges] and dragging the cursor from one point to another.
Figure 53. Line added
With careful exploration in a grand tour, and using projection pursuit, it is possible to discover
structure in the data.
1.4.3 6-Dimensional Cube
The following shows how we might investigate the nature of the projection of a 6-dimensional cube.
g <- ggobi(paste(data.dir, "cube6.xml", sep = "/"))
When Ggobi starts up, it will show 4 points in xy Plot mode. On the Scatterplot window select
[Edges] and [Attach edge set ...] to show the edges.
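To see why only 4 points appear at first, the cube can be built directly in R. This is a sketch under an assumption: it takes the vertices to be all 0/1 combinations in six coordinates named D1 to D6, which may differ from how cube6.xml stores them.

```r
d <- 6
# All 2^6 = 64 vertices: every combination of 0/1 in six coordinates
verts <- as.matrix(expand.grid(rep(list(0:1), d)))
colnames(verts) <- paste0("D", 1:d)

# Two vertices are joined by an edge when they differ in exactly one coordinate
# (Manhattan distance 1 between 0/1 vectors)
dist1 <- as.matrix(dist(verts, method = "manhattan"))
edges <- which(dist1 == 1 & upper.tri(dist1), arr.ind = TRUE)

# Coordinate along which each edge runs
edge_dim <- apply(edges, 1, function(e) which(verts[e[1], ] != verts[e[2], ]))

n_xy <- nrow(unique(verts[, 1:2]))   # distinct points seen in the D1-D2 plot
```

The 64 vertices collapse onto 4 distinct points when only the first two coordinates are plotted, and each of the six directions carries 32 of the 192 edges; many of these overlap in a projection until the corresponding dimensions are switched on.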
Figure 54. Figure 55.
Now on the console select [View][2D tour] which will show the projection. To investigate, we turn
off the last 3 dimensions (D4, D5, D6) by clicking the selection box for each.
Figure 56. Figure 57.
Next, after pausing, select [Interaction][Brush], set [Point brushing] to Off, [Edge brushing] to
Color only and select [Persistent]. Next move the brush to colour all the lines, return to the
[Interaction][2D tour], and deselect the [Pause].
Figure 58. Figure 59. Yellow for edge brushing
We now add in the D4, D5, D6 dimensions one at a time, each time brushing the new edges that
arise (in a different colour obtained by selecting [Choose color & glyph]).
Figure 60. Showing 5 of the 8 4th-dimension lines brushed
Figure 61. Showing 10 of the 16 5th-dimension lines brushed
Figure 62. All 6 dimensions projected Figure 63. All 6 dimensions projected