R Guide - S. Mills


  • 7/30/2019 R Guide - S. Mills


    Data Mining 2011

    SECTION 1

    Software

    1.1 Introduction to R http://cran.r-project.org

    R is an interpreted statistical programming language similar to the commercial products S and Matlab. These languages are based on matrices/vectors and enable us to get results in a concise manner.

    One of the best ways to develop an ability to program in R is to cut and paste sample code and then see what happens when you change parts of the code.

    A good source for such code is the examples supplied in the R documentation. The documentation can be obtained by going to http://cran.r-project.org/ and, in the left-hand margin, clicking on Documentation Manuals. This brings up the following:

    Figure 1.

    If we select An Introduction to R we will find a good manual for learning R.

    While there are many functions for doing statistical calculations, most problems require only a small subset of them. For this reason the functions (and data sets) are broken into libraries or packages. If we select Packages we will see all the packages that are installed in our version of R. It is likely that only a subset of all the possible packages is installed; if we want to see what else is available, we can find many others at http://cran.r-project.org. These packages are the work of statisticians around the world (as is R itself) and are referred to as Contributed packages.

    If we are not sure which of our packages might contain the functions that we need, we can type, for example

    ??logistic

    and the output

    Help files with alias or concept or title matching 'logistic' using fuzzy matching:

    MASS::polr          Ordered Logistic or Probit Regression
    nnet::multinom      Fit Multinomial Log-linear Models
    stats::glm          Fitting Generalized Linear Models
    stats::Logistic     The Logistic Distribution
    stats::SSfpl        Self-Starting Nls Four-Parameter Logistic Model
    stats::SSlogis      Self-Starting Nls Logistic Model
    survival::clogit    Conditional logistic regression

    Type '?PKG::FOO' to inspect entries 'PKG::FOO', or 'TYPE?PKG::FOO' for entries like 'PKG::FOO-TYPE'.

    tells us the name of the package, the function within the package, and a brief description of what the function does.

    1.1.1 Libraries (Packages)

    Once we have found which library we need, we must load it before we use it. For example, if we wish to do Principal Component Analysis, we would find it in the stats package. In order to use it we type

    library(stats)

    and we can use any function in that library.

    If we know the name of the function that we wish to use (and the library is loaded) we can type

    ?prcomp

    This will bring up a window with a description of the function, its usage, its arguments, return values, and (usually) an example (or examples). Cutting and pasting these examples gives you an opportunity to explore the behaviour of the function. The documentation may also give references for the concepts behind the function and point to other functions that are related to it. Starting with version 2.4, the help window is a mini-browser for the entire package rather than just a text page for the requested function.

    1.1.2 Assignments, sequences

    To get a start on using R, we can look at some simple examples.

    If you wish to assign a number (or more complicated object) to a variable, the usual method is to use <- (although in most contexts it is also possible to use =), as in

    a <- 5

    R gives no response, but if we type

    a
    [1] 5

    this shows that a has been assigned the value 5.

    A simpler method is to enclose the expression in parentheses

    (a <- 5)
    [1] 5

    We can assign a vector to a variable. (In fact there are different ways to do this depending on the nature of the vector.)

    (b <- c(1, 3, 2, 6, 5, 3, 2))
    [1] 1 3 2 6 5 3 2

    In this example, c concatenates the set of comma-delimited numbers into a vector.

    In the following, the vector is created by repeating a number or set of numbers

    (c <- rep(5, 7))
    [1] 5 5 5 5 5 5 5
    (c.1 <- rep(c(1,3,2),4))
    [1] 1 3 2 1 3 2 1 3 2 1 3 2

    The above shows a way of creating variable names with the use of the period (.).

    We can create vectors by sequencing operations

    (d <- 1:6)
    [1] 1 2 3 4 5 6
    (e <- 6:1)
    [1] 6 5 4 3 2 1
    (f <- 1:10/10)
    [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
    (g <- seq(10, 9, -.2))
    [1] 10.0  9.8  9.6  9.4  9.2  9.0

    We can perform operations on these vectors

    a - b
    [1]  4  2  3 -1  0  2  3
    a*b
    [1]  5 15 10 30 25 15 10

    (No parentheses are needed because there is no assignment.)

    1.1.3 Matrices

    a%*%b
    Error in a %*% b : non-conformable arguments

    The %*% represents matrix multiplication and the above tried to multiply a 1 × 1 and a 7 × 1 vector together.

    We can use the t function (transpose) to give a 1 × 1 and a 1 × 7.

    a%*%t(b)
         [,1] [,2] [,3] [,4] [,5] [,6] [,7]
    [1,]    5   15   10   30   25   15   10
    sum(a%*%t(b))
    [1] 110
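    The conformability rule can be checked with explicit matrices. The following is a minimal sketch, not part of the original notes: the names A1, B71, res and ok are introduced here for illustration, and tryCatch is used only so that the failing product does not stop the script.

```r
a <- 5
b <- c(1, 3, 2, 6, 5, 3, 2)

A1  <- matrix(a, nrow = 1, ncol = 1)   # explicit 1 x 1 matrix
B71 <- matrix(b, ncol = 1)             # explicit 7 x 1 matrix

# (1 x 1) %*% (7 x 1): the inner dimensions (1 and 7) differ, so this fails.
# tryCatch captures the error message instead of halting.
res <- tryCatch(A1 %*% B71, error = function(e) conditionMessage(e))
print(res)

# (1 x 1) %*% (1 x 7): the inner dimensions agree, giving a 1 x 7 matrix.
ok <- A1 %*% t(B71)
print(dim(ok))
print(sum(ok))
```

    The number of columns of the left factor must equal the number of rows of the right factor; the result takes its shape from the outer dimensions.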

    We can also assign a matrix to a variable.

    Simple matrices could be created as

    (m.1 <- matrix(0, 3, 2))
         [,1] [,2]
    [1,]    0    0
    [2,]    0    0
    [3,]    0    0
    (m.2 <- matrix(1:12, nrow=3))
         [,1] [,2] [,3] [,4]
    [1,]    1    4    7   10
    [2,]    2    5    8   11
    [3,]    3    6    9   12

    as well as some special matrices such as

    (I.3 <- diag(1,3))
         [,1] [,2] [,3]
    [1,]    1    0    0
    [2,]    0    1    0
    [3,]    0    0    1

    We could create a matrix from the vector b with

    (h <- matrix(b, nrow=1))
         [,1] [,2] [,3] [,4] [,5] [,6] [,7]
    [1,]    1    3    2    6    5    3    2

    We can multiply matrices in different ways

    h*h
         [,1] [,2] [,3] [,4] [,5] [,6] [,7]
    [1,]    1    9    4   36   25    9    4

    (the same as b*b).

    h%*%t(h)
         [,1]
    [1,]   88
    (h.m <- t(h)%*%h)
         [,1] [,2] [,3] [,4] [,5] [,6] [,7]
    [1,]    1    3    2    6    5    3    2
    [2,]    3    9    6   18   15    9    6
    [3,]    2    6    4   12   10    6    4
    [4,]    6   18   12   36   30   18   12
    [5,]    5   15   10   30   25   15   10
    [6,]    3    9    6   18   15    9    6
    [7,]    2    6    4   12   10    6    4

    To access an entry (or submatrix)

    h.m[5, 2]
    [1] 15
    h.m[1:3, 5:4] # Note the reverse order of the columns
         [,1] [,2]
    [1,]    5    6
    [2,]   15   18
    [3,]   10   12

    In the above input, the # indicates that the remainder of the line is a comment.

    h.m[c(3,5,1), c(6,2,7)]
         [,1] [,2] [,3]
    [1,]    6    6    4
    [2,]   15   15   10
    [3,]    3    3    2

    It is also possible to change the values

    h.m[c(3,5,1), c(6,2,7)] <- -10
    h.m
         [,1] [,2] [,3] [,4] [,5] [,6] [,7]
    [1,]    1  -10    2    6    5  -10  -10
    [2,]    3    9    6   18   15    9    6
    [3,]    2  -10    4   12   10  -10  -10
    [4,]    6   18   12   36   30   18   12
    [5,]    5  -10   10   30   25  -10  -10
    [6,]    3    9    6   18   15    9    6
    [7,]    2    6    4   12   10    6    4

    and determine some values

    sum(h.m)
    [1] 330
    mean(h.m)
    [1] 6.734694
    apply(h.m, 1, sum) # Row sums
    [1] -16  66  -2 132  40  66  44
    apply(h.m, 1, mean) # Row means
    [1] -2.2857143  9.4285714 -0.2857143 18.8571429  5.7142857  9.4285714  6.2857143
    apply(h.m, 2, sum) # Column sums
    [1]  22  12  44 132 110  12  -2
    apply(h.m, 2, mean) # Column means
    [1]  3.1428571  1.7142857  6.2857143 18.8571429 15.7142857  1.7142857 -0.2857143

    etc.
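    For sums and means specifically, base R also provides rowSums, colSums, rowMeans and colMeans, which give the same values as the apply calls. The sketch below rebuilds the modified matrix with the outer product operator %o% (a shortcut used here for brevity, not the construction in the notes); row.tot and col.avg are illustrative names.

```r
b <- c(1, 3, 2, 6, 5, 3, 2)

# Outer product of b with itself: the same values as t(h) %*% h.
h.m <- b %o% b
h.m[c(3, 5, 1), c(6, 2, 7)] <- -10

row.tot <- rowSums(h.m)    # same values as apply(h.m, 1, sum)
col.avg <- colMeans(h.m)   # same values as apply(h.m, 2, mean)

print(row.tot)
print(all.equal(row.tot, apply(h.m, 1, sum)))
print(all.equal(col.avg, apply(h.m, 2, mean)))
```

    apply remains the general tool, since its third argument can be any function, not only sum or mean.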

    The elements of a matrix are not restricted to numerical values. For example,

    matrix(letters[1:6], ncol=3)
         [,1] [,2] [,3]
    [1,] "a"  "c"  "e"
    [2,] "b"  "d"  "f"

    Suppose we have a system of equations Ax = b with

    (A <- matrix(c(3, 2, 5, 4, 1, 9, -1, 6, 8), 3, 3))
         [,1] [,2] [,3]
    [1,]    3    4   -1
    [2,]    2    1    6
    [3,]    5    9    8

    and

    (bT <- c(-1, 3, 2))
    [1] -1  3  2

    we can create the augmented matrix by binding bT to A

    cbind(A, bT)
                bT
    [1,] 3 4 -1 -1
    [2,] 2 1  6  3
    [3,] 5 9  8  2

    (We could also rbind.)
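    The notes stop at the augmented matrix. As a small addition, the system itself can be solved with the base function solve; this sketch assumes A is non-singular (which holds for these values), and x is a name introduced here.

```r
A  <- matrix(c(3, 2, 5, 4, 1, 9, -1, 6, 8), 3, 3)
bT <- c(-1, 3, 2)

# solve(A, bT) returns the x that satisfies A %*% x = bT.
x <- solve(A, bT)
print(x)

# Check the solution by substituting it back into the system.
print(all.equal(as.vector(A %*% x), bT))
```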

    The entries in the matrix can be of different types

    (Mixed <- matrix(c("Height", "Width", 25, 30), 2, 2))
         [,1]     [,2]
    [1,] "Height" "25"
    [2,] "Width"  "30"

    but the entries are all made the same type - in this case strings.

    If we try to multiply the entries in the second column

    Mixed[1,2]*Mixed[2,2]
    Error in Mixed[1, 2] * Mixed[2, 2] : non-numeric argument to binary operator

    It is possible to convert a string to a number with

    as.numeric(Mixed[1,2])*as.numeric(Mixed[2,2])
    [1] 750

    1.1.4 Lists

    A matrix can only be used for rectangular arrays. The list is a data structure that is more flexible.

    We can create a simple list

    (L.1 <- list(first.name="John", last.name="Smith", sn=345678, mark="A-"))
    $first.name
    [1] "John"

    $last.name
    [1] "Smith"

    $sn
    [1] 345678

    $mark
    [1] "A-"

    We can refer to the components of the list by

    L.1[2]
    $last.name
    [1] "Smith"

    L.1[[2]]
    [1] "Smith"

    (Notice that the first form gives the name of the component.)

    L.1$last.name
    [1] "Smith"

    It is better to refer to the components by name because that means that if the order is changed to list(last.name="Smith", first.name="John", sn=345678, mark="A-") we still get the correct values.

    Many functions have return values in the form of lists.

    We can also build a list by appending components. This is often useful in situations in which we are iterating

    L.2 <- {}
    L.2 <- c(L.2, list(1))
    L.2 <- c(L.2, list("x"))
    L.2 <- c(L.2, list("x^2/2!"))
    (L.2 <- c(L.2, list("x^3/3!")))
    [[1]]
    [1] 1

    [[2]]
    [1] "x"

    [[3]]
    [1] "x^2/2!"

    [[4]]
    [1] "x^3/3!"

    It is sometimes useful to remove things from the list structure and that can be done by

    unlist(L.2)
    [1] "1"      "x"      "x^2/2!" "x^3/3!"
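    The same appending pattern works inside a loop. This is a minimal sketch under the assumption that the terms are held as strings; terms, term and k are names introduced here, and list() is used as the empty starting value.

```r
terms <- list()               # start from an empty list
for (k in 0:3) {
  # Build the k-th term of the series as a string.
  term <- if (k == 0) {
    "1"
  } else if (k == 1) {
    "x"
  } else {
    paste("x^", k, "/", k, "!", sep = "")
  }
  terms <- c(terms, list(term))   # append one component per iteration
}

print(length(terms))
print(unlist(terms))
```

    Growing a list one component at a time is convenient when the number of iterations is not known in advance.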

    1.1.5 paste

    In the above, we used c to concatenate a set of numbers into a vector. If we wish to concatenate strings (numbers get converted to strings), we use paste as in

    paste("John", "Smith", 345678)
    [1] "John Smith 345678"

    Notice that there is a space between the names. We can change that with

    (str.1 <- paste("John", "Smith", 345678, sep=","))
    [1] "John,Smith,345678"
    (str.2 <- paste("Jane", "Jones", 234567, sep=","))
    [1] "Jane,Jones,234567"

    In other words, sep="," controls what separates the quantities that are being pasted together (it defaults to a space but can be more than a single character). If we do not want the space, we can remove it with sep="". On the other hand we may wish to insert some other character

    (str.3 <- paste("D:", "DATA", "Data Mining R-code", sep="/"))
    [1] "D:/DATA/Data Mining R-code"

    If we have a vector of strings, paste does nothing unless we tell it to collapse the vector (and what to put between the elements).

    (str.4 <- paste(unlist(L.2), collapse="+"))
    [1] "1+x+x^2/2!+x^3/3!"

    1.1.6 strsplit

    There may be times when we need to unpaste a string. We do this with

    strsplit(str.1, ",")
    [[1]]
    [1] "John"   "Smith"  "345678"

    which produces a list of one element, or

    strsplit(rbind(str.1, str.2), ",")
    [[1]]
    [1] "John"   "Smith"  "345678"

    [[2]]
    [1] "Jane"   "Jones"  "234567"

    which produces a list element for each row of the matrix.

    If we try the same thing on

    strsplit(str.4, "+")
    [[1]]
    [1] "1+x+x^2/2!+x^3/3!"

    nothing happens. The reason is that + is a metacharacter in regular expressions (along with \ | ( ) [ { ^ $ * ?), and we need to change it to an ordinary character with

    strsplit(str.4, "\\+")
    [[1]]
    [1] "1"      "x"      "x^2/2!" "x^3/3!"
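    An alternative to escaping the metacharacter is to ask strsplit to treat the separator as a literal string with the fixed argument. The sketch below compares the two approaches; s, by.regex and by.fixed are names introduced here.

```r
s <- "1+x+x^2/2!+x^3/3!"

# Escape the metacharacter so the regular expression matches a literal "+".
by.regex <- strsplit(s, "\\+")[[1]]

# Or switch off regular-expression matching altogether.
by.fixed <- strsplit(s, "+", fixed = TRUE)[[1]]

print(by.fixed)
print(identical(by.regex, by.fixed))
```

    fixed = TRUE is often the simpler choice when the separator is a known literal such as a path or field delimiter.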

    1.1.7 Control Structures

    R has the usual control structures.

    If we let

    a <- 1
    b <- 2
    if (a < b) print("a < b")
    [1] "a < b"

    if (a > b)
    print("a > b")
    else
    Error: syntax error in else
    print("a < b")

    tells us that there is a syntax error in else.

    The correct form is

    if (a > b) {
        print("a > b")
    } else {
        print("a < b")
    }
    [1] "a < b"

    Note that the { and } are used as the beginning and ending of blocks of code.

    for (i in 1:5) {
        print(i)
    }
    [1] 1
    [1] 2
    [1] 3
    [1] 4
    [1] 5

    Instead of 1:5 we could have things such as i in c(5, 3, 7, 2, -4, -9).

    n <- 1
    f <- 1
    while (n < 5) {
        f <- f * n
        n <- n + 1
        print(f)
    }
    [1] 1
    [1] 2
    [1] 6
    [1] 24

    R also has repeat, break, and next.
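    A short sketch of how these three fit together: repeat loops until a break is reached, and next skips the remainder of the current iteration. The variable names i and odd.squares are introduced here for illustration.

```r
odd.squares <- c()
i <- 0
repeat {
  i <- i + 1
  if (i > 9) break            # leave the loop entirely
  if (i %% 2 == 0) next       # skip even values and start the next iteration
  odd.squares <- c(odd.squares, i^2)
}
print(odd.squares)
```

    Because repeat has no condition of its own, the break is what guarantees that the loop terminates.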

    1.1.8 Functions

    In the preceding, we have used several of the built-in functions of R (paste, strsplit, print). We will often need to write our own functions, so we need to look at the structure of functions.

    Suppose that we have a random set of numbers and wish to find the mean (there is, of course, a built-in function for this)

    (num <- runif(20, -1, 1)) # 20 random numbers from a uniform distribution
     [1] -0.439628823  0.233594691 -0.266024740  0.007379845  0.466801666
     [6] -0.092201637  0.683620457 -0.003777758  0.040208154  0.276043486
    [11] -0.236744290  0.258244825  0.262341596 -0.518254284 -0.584123774
    [16] -0.300963114  0.129197590  0.941738119  0.073712909 -0.991614198

    We could write the function as

    my.mean <- function (x) {
        len <- length(x)
        sum(x)/len
    }

    Note that this says that the name my.mean has the function assigned to it. We can see what the function is if we enter the name

    my.mean
    function (x) {
        len <- length(x)
        sum(x)/len
    }

    We call the function in the usual manner

    my.mean(num)
    [1] -0.003022464

    A type of function that we will often need is the recursive function. A common example that is used in recursion is the factorial function (it is NOT the best way to find the factorial, but it is easy to program and to understand).

    If you are not familiar with recursion in programming, the following might help.

    fact <- function (m) {
        if (m > 1) {
            f <- fact(m - 1) * m
        } else {
            return(1)
        }
        f
    }
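    One quick way to gain confidence in a recursive function is to compare it with a known reference. The sketch below repeats the definition of fact so that it runs on its own and checks it against the built-in factorial; recursive and builtin are names introduced here.

```r
fact <- function(m) {
  if (m > 1) {
    f <- fact(m - 1) * m   # recursive case: m! = (m - 1)! * m
  } else {
    return(1)              # base case stops the recursion
  }
  f
}

recursive <- sapply(1:10, fact)   # apply fact to each of 1, ..., 10
builtin   <- factorial(1:10)      # built-in reference values

print(fact(5))
print(all.equal(recursive, builtin))
```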

    1.1.9 Debugging

    We will use

    debug(fact)

    to enable us to trace through the factorial function.

    (We can display the values of the variables within the function by typing the name of the variable. If we have a variable called n we need to type print(n) - this is why I use m as the variable. If you have seen enough, you can use c to continue without debugging or Q to quit.)

    The text in ( ) is commentary on the process.

    fact(5)

    debugging in: fact(5)              (tells us that we have just entered fact)
    debug: {
        if (m > 1) {                   (the next block of the function to be evaluated)
            f <- fact(m - 1) * m
        }
        else {
            return(1)
        }
        f
    }
    Browse[1]> m;f;n                   (a command to print the value of m & f and take the next step)
    [1] 5                              (m)
    [1] 0                              (f)
    debug: if (m > 1) {                (next block - we stepped past the {)
        f <- fact(m - 1) * m
    } else {
        return(1)
    }
    Browse[1]> m;f;n
    [1] 5
    [1] 0
    debug: f <- fact(m - 1) * m        (m > 1 so we execute this)
    Browse[1]> m;f;n
    [1] 5
    [1] 0
    debugging in: fact(m - 1)          (we have stepped into fact again - with m = 4 - see below)
    debug: {
        if (m > 1) {
            f <- fact(m - 1) * m
        }
        else {
            return(1)
        }
        f
    }
    Browse[1]> m;f;n
    [1] 4                              (m = 4)
    [1] 0
    debug: if (m > 1) {
        f <- fact(m - 1) * m
    } else {
        return(1)
    }
    Browse[1]> m;f;n
    [1] 4
    [1] 0
    debug: f <- fact(m - 1) * m        (m > 1 so we execute this)
    Browse[1]> m;f;n
    [1] 4
    [1] 0
    debugging in: fact(m - 1)          (we have stepped into fact again - with m = 3 - see below)
    debug: {
        if (m > 1) {
            f <- fact(m - 1) * m
        }
        else {
            return(1)
        }


    debug: return(1)                   (this time m = 1 so we do not take the fact path)
    Browse[1]> n
    exiting from: fact(m - 1)          (this is the first time that we have done this;
                                        we return a 1 to the function that called this
                                        and use that value as the multiplier of m = 2)
    debug: f
    Browse[1]> m;f;n
    [1] 2
    [1] 2
    exiting from: fact(m - 1)          (return from fact with the value 2 = 2 × 1
                                        and use this as the multiplier of m = 3)
    debug: f
    Browse[1]> m;f;n
    [1] 3
    [1] 6
    exiting from: fact(m - 1)          (return from fact with the value 6 = 3 × 2 × 1
                                        and use this as the multiplier of m = 4)
    debug: f
    Browse[1]> m;f;n
    [1] 4
    [1] 24
    exiting from: fact(m - 1)          (return from fact with the value 24 = 4 × 3 × 2 × 1
                                        and use this as the multiplier of m = 5)
    debug: f
    Browse[1]> m;f;n
    [1] 5
    [1] 120
    exiting from: fact(5)              (return from fact with the value 120 = 5 × 4 × 3 × 2 × 1)
    [1] 120

    If you have a long complicated function with a small section that you wish to investigate, it is possible to insert the command browser() into the code. In this case, the use of continue (c) will allow the code to be executed until you reach the browser() command again.

    This gives a brief look at some of the concepts in R. We will look at others as we need them.


    1.2 Data Visualization in R

    It is very important to gain a feel for the data that we are investigating.

    One way to do this is by visualization.

    We will do this by starting with a simple dataset that has some nice features.

    Flea Beetles

    This data is from a paper by A. A. Lubischew, "On the Use of Discriminant Functions in Taxonomy", Biometrics, Dec 1962, pp. 455-477.

    There are three species of flea-beetles: C. concinna, Hp. heptapotamica, and Hk. heikertingeri, and 6 measurements on each.

    tars1 - width of the first joint of the first tarsus in microns (the sum of measurements for both tarsi).

    tars2 - the same for the second joint.

    head - the maximal width of the head between the external edges of the eyes in 0.01 mm.

    aede1 - the maximal width of the aedeagus in the fore-part in microns.

    aede2 - the front angle of the aedeagus (1 unit = 7.5 degrees).

    aede3 - the aedeagus width from the side in microns.

    1.2.1 Reading Data

    The first thing we have to do is get the data (for this data set we will read in a text file). The following illustrates how to read the file(s). (Note the use of the UNIX type path separator with / rather than \.)

    drive <- "D:"
    code.dir <- paste(drive, "DATA", "Data Mining R-Code", sep="/")
    data.dir <- paste(drive, "DATA", "Data Mining Data", sep="/")
    # Set the files to be read
    (d.file <- paste(data.dir, "fleas", "flea.dat", sep="/"))
    [1] "D:/DATA/Data Mining Data/fleas/flea.dat"
    (d.col <- paste(data.dir, "fleas", "flea.col", sep="/"))
    [1] "D:/DATA/Data Mining Data/fleas/flea.col"

    We now have paths for two files: d.file points to the data and d.col points to the column headers (variable names).

    The function scan can be used to read in the data. If the data has characters in it, we need to indicate that with "" as the second argument

    (headers <- scan(d.col))
    Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
    scan() expected 'a real', got 'tars1'
    (headers <- scan(d.col, ""))
    Read 6 items
    [1] "tars1" "tars2" "head"  "aede1" "aede2" "aede3"
    (n.var <- length(headers)) # The vector length gives the number of variables
    [1] 6

    For reading in the data we can do a scan to read the data into a vector, and then convert the vector to a matrix with n.var columns. Because data files are typically stored by rows, and R does the conversion to a matrix by column, we need to indicate that with byrow=T.

    d.flea.s <- matrix(scan(d.file), ncol=n.var, byrow=T)
    Read 444 items
    d.flea.s[1:5,] # This displays the first 5 rows and all the columns.
         [,1] [,2] [,3] [,4] [,5] [,6]
    [1,]  191  131   53  150   15  104
    [2,]  185  134   50  147   13  105
    [3,]  200  137   52  144   14  102
    [4,]  173  127   50  144   16   97
    [5,]  171  118   49  153   13  106
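    The effect of byrow can be seen on a tiny file. The sketch below writes two rows of three numbers to a temporary file (a stand-in for flea.dat, created here only for illustration), reads them with scan, and fills a matrix both ways; tmp, vals, by.col and by.row are names introduced here.

```r
# Write a small two-row file to a temporary location.
tmp <- tempfile(fileext = ".dat")
writeLines(c("191 131 53", "185 134 50"), tmp)

# scan returns the numbers as one long vector, in file (row) order.
vals <- scan(tmp, quiet = TRUE)

by.col <- matrix(vals, ncol = 3)                 # default: fill column by column
by.row <- matrix(vals, ncol = 3, byrow = TRUE)   # fill row by row, matching the file

print(by.col)
print(by.row)
unlink(tmp)   # remove the temporary file
```

    Only the byrow = TRUE version reproduces the layout of the file; the default fill scrambles the rows and columns.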

    Note that if the file contains a mixture of numbers and text, we need to use the "" in the scan

    d.flea.str <- matrix(scan(d.file, ""), ncol=n.var, byrow=T)
    Read 444 items
    d.flea.str[1:5,]
         [,1]  [,2]  [,3] [,4]  [,5] [,6]
    [1,] "191" "131" "53" "150" "15" "104"
    [2,] "185" "134" "50" "147" "13" "105"
    [3,] "200" "137" "52" "144" "14" "102"
    [4,] "173" "127" "50" "144" "16" "97"
    [5,] "171" "118" "49" "153" "13" "106"

    We can see that the numbers are read in as strings. In order to do arithmetic on them, they have to be converted to numbers.

    The command as.numeric(...) will convert a string to a number

    as.numeric(d.flea.str[1,1])
    [1] 191

    but when it is applied to an array it makes the array into a vector.

    To correct this we could try

    d.flea.s <- matrix(as.numeric(d.flea.str), ncol=n.var)
    d.flea.s[1:5,]
         [,1] [,2] [,3] [,4] [,5] [,6]
    [1,]  191  131   53  150   15  104
    [2,]  185  134   50  147   13  105
    [3,]  200  137   52  144   14  102
    [4,]  173  127   50  144   16   97
    [5,]  171  118   49  153   13  106

    which appears to correct the problem.

    A better way in many cases is to read the data as a table

    d.flea <- read.table(d.file)
    d.flea[1:5,]
       V1  V2 V3  V4 V5  V6
    1 191 131 53 150 15 104
    2 185 134 50 147 13 105
    3 200 137 52 144 14 102
    4 173 127 50 144 16  97
    5 171 118 49 153 13 106

    Note the different form for the row and column headers.

    The latter form is better in that it gives names to the rows and columns - not just positions.

    Consider taking a sub-matrix

    d.flea.s[c(5,10,15,20), c(2,4)]
         [,1] [,2]
    [1,]  118  153
    [2,]  115  142
    [3,]  130  147
    [4,]  121  147

    We have no way of identifying from where the entries came. On the other hand

    d.flea[c(5,10,15,20), c(2,4)]
        V2  V4
    5  118 153
    10 115 142
    15 130 147
    20 121 147

    retains the information as to the rows and columns.

    It is possible to improve on the first case by assigning row and column name information

    dimnames(d.flea.s) <- list(1:dim(d.flea.s)[1], headers)

    To see what is happening in this consider

    dim(d.flea.s)
    [1] 74  6

    This gives the dimension of the matrix so 1:dim(d.flea.s)[1] creates a vector of integers from 1 to the number of rows in the matrix. dimnames assigns the values in the list as row and column labels.

    d.flea.s[c(5,10,15,20), c(2,4)]
       tars2 aede1
    5    118   153
    10   115   142
    15   130   147
    20   121   147

    This is much more useful and is similar to the information displayed by the data frame version. In fact it is more informative because it uses the true header information. To improve the data frame we can replace the generic column headers by the correct values (the row headers are good) by

    colnames(d.flea) <- headers
    d.flea[c(5,10,15,20), c(2,4)]
       tars2 aede1
    5    118   153
    10   115   142
    15   130   147
    20   121   147

    Now that we have all the data, we can use further information to specify the species for the cases. For some purposes we may find it best to have characters to represent the species while for others, numerical values may be best. We will create both

    flea.species <- c(rep("C",21),rep("Hp",22),rep("Hk",31))
    species <- c(rep(1,21),rep(2,22),rep(3,31))

    Here we have used information that was not contained in the data to set the species. This information is found in

    (d.row <- paste(d.data.dir, d.basename, ".row", sep=""))
    (row.headers <- noquote(scan(d.row, "")))
    Read 74 items
     [1] Concinna  Concinna  Concinna  Concinna  Concinna  Concinna  Concinna
     [8] Concinna  Concinna  Concinna  Concinna  Concinna  Concinna  Concinna
    [15] Concinna  Concinna  Concinna  Concinna  Concinna  Concinna  Concinna
    [22] Heptapot. Heptapot. Heptapot. Heptapot. Heptapot. Heptapot. Heptapot.
    [29] Heptapot. Heptapot. Heptapot. Heptapot. Heptapot. Heptapot. Heptapot.
    [36] Heptapot. Heptapot. Heptapot. Heptapot. Heptapot. Heptapot. Heptapot.
    [43] Heptapot. Heikert.  Heikert.  Heikert.  Heikert.  Heikert.  Heikert.
    [50] Heikert.  Heikert.  Heikert.  Heikert.  Heikert.  Heikert.  Heikert.
    [57] Heikert.  Heikert.  Heikert.  Heikert.  Heikert.  Heikert.  Heikert.
    [64] Heikert.  Heikert.  Heikert.  Heikert.  Heikert.  Heikert.  Heikert.
    [71] Heikert.  Heikert.  Heikert.  Heikert.

    A further refinement is to bind things together in what is called a data frame. As it happens the table version is a data frame.

    is.data.frame(d.flea)
    [1] TRUE

    Many functions require the use of a data frame. (It might be best to also bind in the species information, but which one depends on what we are doing.)

    df.flea <- data.frame(d.flea.s)

    We can read in some functions that we will use.

    source(paste(d.code.dir, "DispStr.r", sep=""))
    source(paste(d.code.dir, "pairs_ext.r", sep=""))
    source(paste(d.code.dir, "MakeStereo.r", sep=""))

    source reads code from a file just as though it had been typed or pasted into R. Now we can look at the data as something other than just numbers - i.e. visualization of the data.

    1.2.2 Scatterplot Matrices

    Figure 2. The first command gives the standard pairs plot (panels: tars1, tars2, head, aede1, aede2, aede3). It shows duplicate plots.

    Figure 3. The second command also shows histograms and correlations (the size of the number relates to the degree of correlation).

    It should be noted that high correlations (visual or numeric) indicate some linear relationship between pairs of variables but low correlations tell us nothing - the data may be related in a nonlinear fashion or related in combination with other variables.
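    A small synthetic illustration of the last point (the data here are made up for the purpose and are not the flea data): y is completely determined by x, yet the correlation is essentially zero because the relationship is not linear.

```r
x <- seq(-1, 1, by = 0.1)
y <- x^2              # y is an exact (nonlinear) function of x

r.quad <- cor(x, y)          # close to 0: no linear relationship
r.lin  <- cor(x, 2 * x + 1)  # 1: an exact linear relationship

print(r.quad)
print(r.lin)
```

    A scatterplot of y against x would show the parabola immediately, which is one reason to look at the plots as well as the numbers.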

    At this point we might consider how we can see what is going on within these functions. We will consider two closely related methods - debug(fun) and browser().

    Consider

    debug(panel.cor)
    pairs(d.flea, upper.panel=panel.cor, diag.panel=panel.hist)
    debugging in: lower.panel(as.vector(x[, j]), as.vector(x[, i]), ...)
    debug: {
        usr <- par("usr")
        on.exit(par(usr))
        par(usr = c(0, 1, 0, 1))
        r <- abs(cor(x, y))
        txt <- format(c(r, 0.123456789), digits = digits)[1]
        txt <- paste(prefix, txt, sep = "")
        if (missing(cex.cor))
            cex.cor <- 0.8/strwidth(txt) * r
        text(0.5, 0.5, txt, cex = cex.cor)
    }

    (On start-up, the function that is being debugged is displayed. We can step through by typing n or next which executes the current command and displays the next one.)

    Browse[1]> n
    debug: usr <- par("usr")
    Browse[1]> n
    debug: on.exit(par(usr))
    Browse[1]> n
    debug: par(usr = c(0, 1, 0, 1))
    Browse[1]> n
    debug: r <- abs(cor(x, y))
    Browse[1]> n
    debug: txt <- format(c(r, 0.123456789), digits = digits)[1]
    Browse[1]> x
     [1] 131 134 137 127 118 118 134 129 131 115 143 131 130 133 130 131 127 126 140
    [20] 121 136 141 119 130 113 121 115 127 123 119 120 131 127 116 123 135 132 131
    [39] 116 121 146 119 127 107 122 114 131 108 118 122 127 125 124 129 126 122 116
    [58] 123 122 123 109 124 114 120 114 119 111 112 130 120 119 114 110 124
    Browse[1]> y
     [1] 191 185 200 173 171 160 188 186 174 163 190 174 201 190 182 184 177 178 210
    [20] 182 186 158 146 151 122 138 132 131 135 125 130 130 138 130 143 154 147 141
    [39] 131 144 137 143 135 186 211 201 242 184 211 217 223 208 199 211 218 203 192
    [58] 195 211 187 192 223 188 216 185 178 187 187 201 187 210 196 195 187
    Browse[1]> r
    [1] 0.02634653
    Browse[1]> n
    debug: txt <- paste(prefix, txt, sep = "")
    Browse[1]> n
    debug: if (missing(cex.cor)) cex.cor <- 0.8/strwidth(txt) * r
    Browse[1]> txt
    [1] "0.026"
    Browse[1]> Q

    To stop the debugging, we can type Q.

    The next time we call a function that has been set for debugging, it will again be debugged. To turn the debugging off, we type

    undebug(panel.cor)

    If we wish to debug a function that we have written, we can put a browser() statement inside the function body.

    The advantage of this is that when we have several spots at which we want to look at the behaviour, we can type c to continue the execution until we hit the next browser() statement - quite useful with loops.

    In this case we know the species corresponding to each case so it is instructive to look at the relationship among the variables and species by the use of colour (col). The colour can be given by a number or by a name such as "red".

    pairs(d.flea, col = species + 1)

    Figure 4. (Pairs plot of tars1, tars2, head, aede1, aede2 and aede3 with points coloured by species.)

    We notice that in several plots the species are not mixed together e.g. tars1 vs. aede2, tars1

    vs. aede1 etc.

    1.2.3 Conditional Plots

    To investigate some of the more complicated relationships we can look at conditional plotting.

    There is more than one version of this. The first one is part of the standard package. Note the arguments. The aede3 ~ tars1 | aede1 says that we are plotting aede3 against tars1 conditioned on aede1. That is, we will get several plots corresponding to different ranges of aede1. Note that ~ is frequently used to indicate a formula. The data = df.flea allows us to use the variable names in the formula because those names are part of the data frame. (The overlap = 0.1 will be explained later.)

    coplot(aede3 ~ tars1 | aede1, data = df.flea)
    coplot(aede3 ~ tars1 | aede1, data = df.flea, overlap = 0.1)

    Figure 5. (coplot of aede3 against tars1, given aede1.)

    Figure 6. (The same coplot with overlap = 0.1.)

    In both the above figures, the lower six panels show the pairwise plots for aede3 against tars1 for different ranges of aede1 as shown in the upper panel. The defaults for this function are to select 6 different subsets of the third variable with an equal number of cases in each. In addition an overlap of 0.5 is allowed. The second example has reduced the overlap to 0.1. We get a different view if we colour our points by species.

    coplot(aede3 ~ tars1 | aede1, data = df.flea, overlap = 0.1, col = species + 1, pch = 16)

    Figure 7. aede3 against tars1, given aede1, coloured by species

    This further illustrates the relationships noted earlier.
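
    The conditioning ranges that coplot draws in its upper panel can be inspected directly with co.intervals from the graphics package. The following is a minimal sketch on simulated values (the vector x is an assumption standing in for aede1, not the flea data). It suggests how reducing overlap leaves fewer cases in each of the six ranges.

    ```r
    # Simulated stand-in for the conditioning variable (not the flea data).
    set.seed(1)
    x <- rnorm(200, mean = 135, sd = 8)

    # Six ranges with roughly equal counts, first with the coplot default
    # overlap of 0.5 and then with the reduced overlap of 0.1.
    iv.default <- co.intervals(x, number = 6, overlap = 0.5)
    iv.narrow  <- co.intervals(x, number = 6, overlap = 0.1)

    # Count the cases that fall inside each range.
    count.in <- function(iv) apply(iv, 1, function(r) sum(x >= r[1] & x <= r[2]))

    print(cbind(iv.default, count = count.in(iv.default)))
    print(cbind(iv.narrow,  count = count.in(iv.narrow)))
    ```

    Each row of the printed matrices is one conditioning range (lower limit, upper limit) followed by the number of cases it contains.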

    Another version of this is found in the lattice package, but before using this we might consider a function that will allow us to condition on fixed interval lengths rather than fixed count.


    library(lattice)

    equal.space <- function(data, count) {
      # range(data) gives the min and max of the variable data.
      # diff takes the difference between the two values, so
      # diffs gives the width of each interval.
      diffs <- diff(range(data))/count
      # min(data) + diffs*(0:(count-1)) gives the starting values
      # for the intervals.
      # min(data) + diffs*(1:count) gives the ending values
      # for the intervals.
      # cbind treats two (or more) vectors as column vectors
      # and binds them as columns of a matrix.
      intervals <- cbind(min(data) + diffs*(0:(count-1)),
                         min(data) + diffs*(1:count))
      # shingle takes the interval structure and the data
      # and breaks the data into the appropriate groups.
      return (shingle(data, intervals))
    }
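
    The interval arithmetic inside equal.space can be checked on a small vector without any plotting. This is a sketch with made-up values (the vector v and the choice of 5 intervals are assumptions), using only base R:

    ```r
    # Made-up values; any numeric vector would do.
    v <- c(2, 3, 5, 8, 9, 12)
    count <- 5

    # Width of each interval: (12 - 2) / 5 = 2.
    width <- diff(range(v)) / count

    # Start and end of each interval, bound together as two columns.
    intervals <- cbind(start = min(v) + width * (0:(count - 1)),
                       end   = min(v) + width * (1:count))
    print(intervals)

    # Number of values falling in each closed interval. A value that sits
    # exactly on a boundary, such as 8, is counted in both neighbours.
    counts <- apply(intervals, 1, function(r) sum(v >= r[1] & v <= r[2]))
    print(counts)
    ```

    Unlike the equal-count ranges, the widths here are all the same and the counts differ from interval to interval.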

    The following uses the conditional plotting from the lattice package with

    a) equal cases in each grouping and

    b) equal spacing in each grouping.

    C1 <- equal.count(df.flea$aede1, number = 6, overlap = 0.1)
    xyplot(aede3 ~ tars1 | C1, data = df.flea, pch = 19)

    C2 <- equal.space(df.flea$aede1, 6)
    xyplot(aede3 ~ tars1 | C2, data = df.flea, pch = 19)

    Figure 8. Equal cases in each grouping

    Figure 9. Equal spacing in each grouping

    This version does not show the values of the conditioning variable.


    It is also possible to condition against two variables, but before doing that we will create a synthetic data set. For now we will not go into detail about the nature of the data.

    source(paste(d.R <- code.dir, "ellipseOutline.r", sep = ""))
    ec.t1 <- NULL   # start empty; rows are added in the loop
    for (t in -20:20) {
      ec.t1 <- rbind(ec.t1, cbind(ellipse.outline(20, 20, 10, 5, t, 0, (200 - t^2)/10), t))
    }
    # Put the rows in random order and convert to a data frame.
    ec.t1 <- data.frame(ec.t1[sample(dim(ec.t1)[1], dim(ec.t1)[1]), ])

    We can plot the scatterplot matrix:

    pairs(ec.t1, upper.panel = panel.cor, diag.panel = panel.hist)

    Figure 10. Scatterplot matrix of x, y, z and t

    In the lines that follow, note the use of the $ symbol. In this case it is used to reference the columns of a data frame; in other cases it references parts of other objects.

    X <- equal.space(ec.t1$x, 25)
    Y <- equal.space(ec.t1$y, 25)
    Z <- equal.space(ec.t1$z, 25)
    T <- equal.space(ec.t1$t, 25)
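
    As a small illustration of the two uses of $ mentioned above, the following sketch uses a made-up data frame and a fitted model (all names here are hypothetical and unrelated to ec.t1):

    ```r
    # A small data frame: $ extracts a column by name.
    df <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))
    print(df$x)

    # A fitted model is a list-like object: $ extracts a named component.
    fit <- lm(y ~ x, data = df)
    print(fit$coefficients)

    # $ on a plain list works the same way.
    info <- list(name = "example", size = nrow(df))
    print(info$size)
    ```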

    In the following, note the use of aspect. R, like many other languages, tries to use as much of the plotting region as possible. While this works well if the data has no intrinsic shape, it is a severe problem in other situations. For example, if you try to plot an ellipse you will get a circle. To avoid this you need to force the plot routine to use equal scales along the axes. This is often done by use of the aspect ratio (as below), but different plot routines use different methods (and for some it is up to you to find a way to force the appropriate scaling). Note the use of x11() as a method of creating another plot window rather than plotting over the current one.

    xyplot(z ~ x | Y, data = ec.t1, pch = ".", main = "z ~ x | Y",
           aspect = diff(range(ec.t1$z))/diff(range(ec.t1$x)))
    x11()
    xyplot(y ~ x | Z, data = ec.t1, pch = ".", main = "y ~ x | Z",
           aspect = diff(range(ec.t1$y))/diff(range(ec.t1$x)))
    x11()
    xyplot(z ~ y | X, data = ec.t1, pch = ".", main = "z ~ y | X",
           aspect = diff(range(ec.t1$z))/diff(range(ec.t1$y)))
    x11()
    xyplot(z ~ x | T, data = ec.t1, pch = ".", main = "z ~ x | T",
           aspect = diff(range(ec.t1$z))/diff(range(ec.t1$x)))
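
    The aspect calculation can be tried on a shape whose proportions are known. The sketch below draws an ellipse with semi-axes 10 and 5 (made-up values, not the ec.t1 data) so that the computed ratio can be checked by eye:

    ```r
    library(lattice)

    # Points on an ellipse with semi-axes 10 (along x) and 5 (along y).
    theta <- seq(0, 2 * pi, length.out = 361)
    ell <- data.frame(x = 10 * cos(theta), y = 5 * sin(theta))

    # Height-to-width ratio that gives one data unit the same length on
    # both axes: (5 - (-5)) / (10 - (-10)) = 0.5.
    asp <- diff(range(ell$y)) / diff(range(ell$x))
    print(asp)

    # With aspect = asp the panel is half as tall as it is wide, so the
    # ellipse keeps its shape instead of being stretched to fill the device.
    p <- xyplot(y ~ x, data = ell, type = "l", aspect = asp)
    print(p)
    ```

    In lattice, aspect = "iso" is an alternative that asks for equal scales directly.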

    Figure 11. z ~ x | Y

    Figure 12. y ~ x | Z


    Figure 13. z ~ y | X

    Figure 14. z ~ x | T

    Z5 <- equal.space(ec.t1$z, 5)
    T5 <- equal.space(ec.t1$t, 5)
    xyplot(z ~ x | T5*Z5, data = ec.t1, main = "z ~ x | T5*Z5", pch = ".",
           aspect = diff(range(ec.t1$z))/diff(range(ec.t1$x)))

    Figure 15. z ~ x | T5*Z5

    r <- 1
    c <- 1
    for (i in -20:15) {        # Loop through i from -20 to 15
      ind <- ec.t1$t == i      # Get the cases for which the t value equals i
      X <- ec.t1$x[ind]        # And the corresponding x, y, z values
      Y <- ec.t1$y[ind]
      Z <- ec.t1$z[ind]
      # In the following (see ?cloud):
      #   print    - displays the plot.
      #   cloud    - a function that creates a cloud of points,
      #              with xlim, ylim, zlim (the range of values on the axes)
      #              set to the maximum range (x) to give proper scaling.
      #   subpanel - the function used to plot the points.
      #   groups   - allows classes to be identified.
      #   screen   - sets the viewpoint.
      #   split    - c(col, row, cols, rows)
      #   more     - TRUE if further plots follow on the same page.
      print(cloud(Z ~ X*Y, xlim = range(ec.t1$x),
                  ylim = range(ec.t1$x), zlim = range(ec.t1$x),
                  subpanel = panel.superpose, groups = rep(1, dim(ec.t1)[1]),
                  screen = list(z = 10, x = -80, y = 0), data = ec.t1),
            split = c(c, r, 6, 6), more = TRUE)
      c <- c + 1
      if (c%%6 == 1) {         # Remainder mod 6
        c <- 1
        r <- r + 1
      }
    }
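
    The column and row counters in the loop above decide where each of the 36 plots lands on the 6 x 6 page. The following sketch reproduces only that bookkeeping, without any plotting. The names col.k, row.k and pos are hypothetical and chosen to avoid masking the function c.

    ```r
    # Walk through 36 panels and record the (column, row) slot each one gets.
    n.col <- 6L
    pos <- matrix(NA_integer_, nrow = 36, ncol = 2,
                  dimnames = list(NULL, c("col", "row")))
    row.k <- 1L
    col.k <- 1L
    for (k in 1:36) {
      pos[k, ] <- c(col.k, row.k)
      col.k <- col.k + 1L
      if (col.k %% n.col == 1L) {   # just stepped past the last column
        col.k <- 1L
        row.k <- row.k + 1L
      }
    }
    print(head(pos, 8))

    # The same slots in closed form.
    k <- 1:36
    closed <- cbind(col = (k - 1L) %% n.col + 1L, row = (k - 1L) %/% n.col + 1L)
    print(all(pos == closed))
    ```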

    Figure 16. Cloud plots of Z against X and Y, one panel for each value of t from -20 to 15


    1.3 Data Visualization in Ggobi

    For high dimensional data, dynamic graphics will reveal more relationships.

    For that purpose we will use Ggobi. This is a package that may be called from R or used alone.

    library(rggobi)
    g <- ggobi(d.flea)

    Figure 17. GGobi console Figure 18. Scatterplot

    1.3.1 Scatterplot Matrix

    Ggobi starts with a console and a simple scatterplot as shown, although we can also have a scatterplot matrix display:

    display(g[1], "Scatterplot Matrix")

    or [Display][New scatterplot matrix] from the Ggobi console.


    Figure 19. Scatterplot Matrix

    1.3.2 Grand Tour

    In order to investigate the data, we will start with a grand tour:

    display(g[1], "2D Tour")

    or [View][2D Tour].

    Figure 20. Figure 21. 2D Tour

    This shows the console and the opening display. Note the circle with the lines. This represents the projection of the six axes on the two dimensional display. In the grand tour a projection direction is selected, then a new direction is selected, and the projection is changed smoothly towards it. This allows the user to see the data from all directions (although it is possible to move the projected direction by use of the mouse).

    This gives a 2D tour of the 6 dimensional data. The portion of each variable in the view is shown by the representation of the axis in the bottom corner (and on the console).
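
    The idea of moving smoothly from one projection to another can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the data are simulated, the helper names (orthonormalise, random.frame, step.frame) are hypothetical, and the straight-line blend between frames is a simplification rather than the interpolation Ggobi itself performs.

    ```r
    set.seed(42)

    # Simulated stand-in for a 74 x 6 data matrix (not the flea data).
    X <- scale(matrix(rnorm(74 * 6), ncol = 6))

    # Turn a 6 x 2 matrix into an orthonormal frame (two perpendicular unit
    # directions). The sign fix keeps the result continuous in M.
    orthonormalise <- function(M) {
      qr.M <- qr(M)
      sweep(qr.Q(qr.M), 2, sign(diag(qr.R(qr.M))), `*`)
    }

    # A randomly chosen projection plane.
    random.frame <- function(p) orthonormalise(matrix(rnorm(p * 2), ncol = 2))

    # Move a fraction s of the way from frame A to frame B.
    step.frame <- function(A, B, s) orthonormalise((1 - s) * A + s * B)

    A <- random.frame(6)
    B <- random.frame(6)
    for (s in seq(0, 1, by = 0.25)) {
      P <- step.frame(A, B, s)   # rows of P: where each of the 6 axes points
      proj <- X %*% P            # 74 x 2 coordinates shown on screen
      plot(proj, xlab = "", ylab = "", main = paste("s =", s), asp = 1)
    }
    ```

    The six rows of P play the role of the axis lines drawn in the circle: a long row means that variable contributes strongly to the current view.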

    1.3.3 Brushing

    As the tour runs, 3 clusters will appear. When they do, you can click [Pause] and apply brushing ([Interaction][Brush]) to group cases.

    Figure 22. Brushing

    You can change the colour and glyph (symbol) of data points.

    The process involves selecting the colour and glyph and moving the brush over the points (we can select [Persistent]; if it is not selected, the brushing is transient).

    The colour and glyph are selected with [Choose color & glyph], as shown below.


    Figure 23. Select a red + (3rd smallest size)

    Figure 24. One cluster brushed

    - red box for point brushing

    Figure 25. Other clusters brushed

    In Figure 24, one apparent cluster is set to a red plus, while in Figure 25 another cluster is a yellow cross and the third is a green circle.

    We can now return to a grand tour and see if the points of the same colour move together. Notice that for much of the time, the projections of the clusters are mixed together.

    If we feel that we have the clusters properly coloured, we can again [Pause] and from R find out what colours the points are:

    (old.col <- glyph_colour(g[1]))

    F F F F F F F F F F C F F F F F F F F F F C C C C C C C C C C C C C C C C C C C C C C A

    9 9 9 9 9 9 9 9 9 9 5 9 9 9 9 9 9 9 9 9 9 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 3

    A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A

    3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

    It turns out that we know the species of the flea beetles so we can compare the clustering that we observed with the true classification.

    (noquote(rbind(flea.species, old.col)))

                 F  F  F  F  F  F  F  F  F  F  C  F  F  F  F  F  F  F  F  F  F  C  C  C  C  C  C  C  C  C  C  C
    flea.species C  C  C  C  C  C  C  C  C  C  C  C  C  C  C  C  C  C  C  C  C  Hp Hp Hp Hp Hp Hp Hp Hp Hp Hp Hp
    old.col      9  9  9  9  9  9  9  9  9  9  5  9  9  9  9  9  9  9  9  9  9  5  5  5  5  5  5  5  5  5  5  5

                 C  C  C  C  C  C  C  C  C  C  C  A  A  A  A  A  A  A  A  A  A  A  A  A  A
    flea.species Hp Hp Hp Hp Hp Hp Hp Hp Hp Hp Hp Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk
    old.col      5  5  5  5  5  5  5  5  5  5  5  3  3  3  3  3  3  3  3  3  3  3  3  3  3

                 A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A  A
    flea.species Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk
    old.col      3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3

    (Notice that the 11th case is of class C but has the same colour as the Hp cases; all the others have the colour corresponding to their class of flea beetle.) We then reset the colours and glyphs for the next part.

    glyph_colour(g[1]) <- rep(2, 74)   # 74 purple
    glyph_type(g[1]) <- rep(4, 74)     # 74 circles
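
    The comparison between brushed colours and species can also be summarised as a cross-tabulation. The sketch below rebuilds the two vectors from the printed output above (21 C, 22 Hp and 31 Hk cases, with the colours as listed) instead of reading them from Ggobi, so it runs without rggobi; the vector majority is a hypothetical helper.

    ```r
    # True classes and brushed colours, transcribed from the output above.
    flea.species <- c(rep("C", 21), rep("Hp", 22), rep("Hk", 31))
    old.col <- c(rep(9, 10), 5, rep(9, 10), rep(5, 22), rep(3, 31))

    # Rows are species, columns are brush colours.
    print(table(flea.species, old.col))

    # Which cases do not carry the majority colour of their species?
    majority <- c(C = 9, Hp = 5, Hk = 3)
    odd.cases <- unname(which(old.col != majority[flea.species]))
    print(odd.cases)
    ```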

    1.3.4 Parallel Coordinates Plot

    Another type of plot is the Parallel Coordinates Plot. This can be a good method for investigating high dimensional data. Consider the point (3, 1, 2) as shown in Figure 26 below. In parallel coordinates, it would be drawn as in Figure 27.

    Figure 26. The point (3, 1, 2)    Figure 27. The point (3, 1, 2) in parallel coordinates

    If we want 4 (or more) dimensions, we need only add another (or more) parallel line(s).

    display(g[1], "Parallel Coordinates Display")

    or [Display][New parallel coordinates plot].

    Figure 28. Parallel Coordinates Plot

    Each data point has a value on each of the axes which are plotted vertically rather than at right

    angles to each other.
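
    A parallel coordinates display can be imitated in base R with matplot, which may help in reading the Ggobi version. This is a minimal sketch: the point (3, 1, 2) comes from the text, while the second point and the fourth coordinate are made up for illustration.

    ```r
    # Two points: (3, 1, 2) from the text and a second, made-up one.
    pts <- rbind(c(3, 1, 2),
                 c(1, 2, 3))
    colnames(pts) <- c("x1", "x2", "x3")

    # One vertical axis per variable; each point becomes a polyline.
    matplot(t(pts), type = "b", pch = 16, lty = 1, xaxt = "n",
            xlab = "", ylab = "value")
    axis(1, at = seq_len(ncol(pts)), labels = colnames(pts))
    abline(v = seq_len(ncol(pts)), col = "grey")

    # A fourth dimension only needs one more column (one more axis).
    pts4 <- cbind(pts, x4 = c(2.5, 0.5))
    matplot(t(pts4), type = "b", pch = 16, lty = 1, xaxt = "n",
            xlab = "", ylab = "value")
    axis(1, at = seq_len(ncol(pts4)), labels = colnames(pts4))
    abline(v = seq_len(ncol(pts4)), col = "grey")
    ```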

    1.3.4.1 Parallel Coordinates Brushing

    The value of brushing is greatly enhanced when we have more than one display. Below we see a scatterplot with 5 points being brushed and we see that the parallel coordinates display has 5 points (lines) coloured yellow. This shows which points correspond.

    Figure 29. Brushing

    1.3.4.2 Parallel Coordinates Linked Brushing


    Figure 30. Linked brushing

    For the next part, we set the colours to correspond to the species:

    glyph_colour(g[1]) <- c(rep(6, 21), rep(4, 22), rep(9, 31))

    It is also possible to get information about the clustering from parallel coordinates. We will start by moving the axes: put the mouse on the white frame and drag aede3 to the first position (a corner will appear as the cursor).

    Figure 31.

    Repeat until we have the axes arranged as below.


    Figure 32.

    When we look at this we see that there seem to be values of aede3 and tars1 that split the data.

    1.3.4.3 Parallel Coordinates Identification

    Before we find the values, we will introduce [Tools][DataViewer], which allows us to look at our data.

    Figure 33.

    Next we will use [Interaction][Identify]


    Figure 34. Linked identification

    Figure 35. Linked identification


    Figure 36. Identification using record label Figure 37. Linked identification


    Figure 38. Identification using data value (tars1)

    Figure 39. Linked identification using data value (tars1)

    In the above we see that if tars1 < 160 we have one group (blue) split from the other two.


    Figure 40. Linked identification using data value (aede3)

    In the above we see that if aede3 < 95 we have one group (yellow) almost split from the other two.

    It appears that we can do an almost perfect split with this information. (This forms the basis for a recursive splitting process that we will see later.)

    We can do similar things from R.

    cols <- rep(6, 74)
    cols[which(d.flea[, 6] < 95)] <- 9
    cols[which(d.flea[, 1] < 160)] <- 4
    glyph_colour(g[1]) <- cols
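
    The two-threshold rule can be written as a small function and tried on a few values. This is a sketch that assumes the cut-offs are read as tars1 < 160 and aede3 < 95; the function name split.colour and the three test cases are made up.

    ```r
    # Colour codes follow the text: 6 by default, 9 where aede3 < 95,
    # 4 where tars1 < 160 (applied last, so it takes precedence).
    split.colour <- function(tars1, aede3) {
      cols <- rep(6, length(tars1))
      cols[aede3 < 95] <- 9
      cols[tars1 < 160] <- 4
      cols
    }

    # Three made-up cases, one for each branch of the rule.
    tars1 <- c(190, 130, 200)
    aede3 <- c(105, 100,  80)
    print(split.colour(tars1, aede3))
    ```

    Because the tars1 test is applied second, a case that satisfies both conditions ends up with colour 4, which matches the order of the assignments in the code above.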

    1.3.5 Stereo

    An interesting view of the data can be obtained by looking at the data from slightly shifted viewpoints:

    make.Stereo(d.flea[, c(1, 5, 6)], species, Main = "Flea beetles", asp = "F",
                Xlab = "tars1", Ylab = "aede2", Zlab = "aede3")


    Figure 41. Stereo projection

    make.Stereo(d.flea[, c(6, 5, 1)], species, Main = "Flea beetles", asp = "F",
                Zlab = "tars1", Ylab = "aede2", Xlab = "aede3")

    Figure 42. Stereo projection

    1.3.6 RGL

    A relatively new package is

    library(rgl)

    which allows interactive visualization.

    Consider

    plot3d(d.flea[, 1], d.flea[, 5], d.flea[, 6], xlab = "tars1", ylab = "aede2", zlab = "aede3",
           col = species + 1, size = 0.5, type = "s")


    Now, apply the following code:

    for (j in seq(0, 90, 10)) {
      for (i in 0:360) {
        rgl.viewpoint(i, j)
      }
    }

    The rgl.viewpoint(i, j) changes the point from which you view the object, making it appear that the object is rotating. The first argument (i in this case) is the spherical coordinates angle θ, while the second is φ. Hence, this code rotates the object about the z-axis (i in 0:360) for φ from 0 to π/2.

    It is also possible to

    hold the left mouse button down to rotate the image;

    hold the right mouse button down (or use the mouse wheel) to zoom in/out.

    1.4 Examples

    1.4.1 Randu


    Randu is a random number generator that had a slight flaw.

    d.file <- paste(data.dir, "randu.dat", sep = "/")
    d.randu <- read.table(d.file)
    pairs(d.randu, upper.panel = panel.cor, diag.panel = panel.hist)

    Figure 43. Scatterplot matrix for randu

    Looking at the scatterplot matrix everything seems random, but in Ggobi...

    g <- ggobi(d.randu)

    with [View][2D tour]

    Figure 44. randu projection Figure 45. An interesting projection
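
    The structure that the tour reveals can also be checked numerically. The sketch below does not read randu.dat; it generates values from the standard RANDU recurrence (multiplier 65539, modulus 2^31), with the seed and the sample size chosen arbitrarily as assumptions. Consecutive triples then satisfy an exact linear relation, which is why the points fall on a small number of parallel planes.

    ```r
    # RANDU recurrence: v[k+1] = 65539 * v[k] mod 2^31 (seed must be odd).
    randu.seq <- function(n, seed = 1) {
      v <- numeric(n)
      x <- seed
      for (k in seq_len(n)) {
        x <- (65539 * x) %% 2^31   # products stay below 2^53, so this is exact
        v[k] <- x
      }
      v
    }

    v <- randu.seq(3002)
    n <- length(v)
    x <- v[1:(n - 2)]
    y <- v[2:(n - 1)]
    z <- v[3:n]

    # Because 65539^2 = 6 * 65539 - 9 (mod 2^31), every consecutive triple
    # satisfies 9x - 6y + z = m * 2^31 for some integer m.
    m <- (9 * x - 6 * y + z) / 2^31
    print(table(m))
    ```

    Since x, y and z all lie between 0 and 2^31, m can only take the integer values -5 to 9, so there are at most 15 planes.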

    1.4.2 Prim7

    Prim7 contains 500 observations taken from a high energy particle physics scattering experiment which yields four particles. The reaction can be described completely by seven (7) independent measurements. The important features of the data are short-lived intermediate reaction stages which appear as protuberant arms in the point cloud.

    prim <- read.table(paste(data.dir, "prim7.dat", sep = "/"))
    g <- ggobi(prim)

    Figure 46. Prim7

    We will look at this by looking at clusters that have been found in the past. It would be possible to do this by careful brushing and observation, but the following sets the appropriate colours for the data.

    new.col <- rep(1, 500)

    col.2 <- c(2,3,4,14,15,16,17,18,21,23,30,34,37,41,43,46,49,50,53,54,55,

    57,58,63,65,66,69,70,72,73,74,75,77,78,79,85,86,88,90,91,92,

    93,94,95,99,100,102,104,105,106,107,109,110,113,114,116,120,

    121,124,125,126,127,129,130,133,139,140,141,143,145,147,150,

    152,153,157,158,159,160,161,164,166,169,172,175,176,177,178,

    180,185,194,195,198,200,203,204,209,210,211,212,218,219,220,

    222,223,226,228,229,233,234,236,238,240,242,244,245,246,248,

    249,252,253,257,259,263,264,265,266,267,269,270,273,277,278,

    280,281,282,283,284,286,292,294,296,297,300,305,310,311,314,

    315,317,323,331,332,333,334,335,341,342,343,346,351,356,359,

    360,361,362,365,370,372,374,375,377,378,379,380,383,386,388,

    389,390,391,393,397,398,400,402,403,405,407,408,413,414,415,

    417,418,419,420,425,427,428,429,430,432,433,434,436,437,438,

    440,444,445,447,448,452,453,454,455,456,463,465,467,470,471,

    473,476,477,478,480,481,482,484,485,487,488,489,490,491,494,497)

    col.3 <- c(11,20,27,33,47,51,60,61,62,98,115,118,119,132,155,186,191,

    193,202,205,207,208,213,225,230,231,232,235,239,243,250,251,

    268,272,295,312,316,338,339,345,349,354,358,364,366,376,381,

    395,401,421,422,446,460,496)

    col.5 <- c(5,8,13,19,26,32,39,48,56,71,81,96,111,136,137,144,149,156,

    162,165,188,199,201,216,255,262,274,279,289,291,301,320,322,

    326,327,329,344,348,353,363,367,369,384,399,404,406,411,423,

    441,442,443,469,474,479,483,495,499,500)


    col.8 <- c(7,29,31,36,89,101,117,131,138,154,173,187,190,192,196,197,

    206,247,254,256,258,287,290,298,299,309,324,325,385,387,464)

    col.9 <- c(1,12,22,24,25,44,45,52,64,83,103,108,122,123,134,135,146,151,

    167,168,170,174,179,181,184,221,224,237,261,271,285,293,304,

    306,307,308,319,328,337,352,355,357,368,396,410,424,426,435,

    439,449,451,458,461,462,466,472,475,493)

    Now set all of the points belonging to one cluster to the colour with value 2.

    new.col[col.2] <- 2
    glyph_colour(g[1]) <- new.col

    Figure 47. Prim7 with first cluster brushed    Figure 48. Prim7 with 2 clusters brushed

    Now run through the other clusters.

    new.col[col.3] <- 3
    glyph_colour(g[1]) <- new.col

    new.col[col.5] <- 5
    glyph_colour(g[1]) <- new.col


    Figure 49. Prim7 with 3 clusters brushed    Figure 50. Prim7 with 4 clusters brushed

    new.col[col.8] <- 8
    glyph_colour(g[1]) <- new.col

    new.col[col.9] <- 9
    glyph_colour(g[1]) <- new.col

    Figure 51. Prim7 with all clusters brushed Figure 52.

    It is also possible to put lines on the data to outline possible structures. The lines are those determined by researchers.

    prim.lin <- read.table(paste(data.dir, "prim7.lines", sep = "/"))
    edges(g[1]) <- prim.lin

    We can add lines by use of [Interaction][Edit Edges] and dragging the cursor from one point to another.


    Figure 53. Line added

    With careful exploration in a grand tour, and using projection pursuit, it is possible to discover

    structure in the data.

    1.4.3 6-Dimensional Cube

    The following shows how we might investigate the nature of the projection of a 6-dimensional cube.

    g <- ggobi(paste(data.dir, "cube6.xml", sep = "/"))

    When Ggobi starts up, it will show 4 points in xy Plot mode. On the Scatterplot window select [Edges] and [Attach edge set ...] to show the edges.

    Figure 54. Figure 55.

    Now on the console select [View][2D tour] which will show the projection. To investigate, we turn off the last 3 dimensions (D4, D5, D6) by clicking the selection box for each.


    Figure 56. Figure 57.

    Next, after pausing, select [Interaction][Brush], set [Point brushing] to Off, [Edge brushing] to

    Color only and select [Persistent]. Next move the brush to colour all the lines, return to the

    [Interaction][2D tour], and deselect the [Pause].

    Figure 58. Figure 59. Yellow for edge brushing

    We now add in the D4, D5, D6 dimensions one at a time, each time brushing the new edges that

    arise (in a different colour obtained by selecting [Choose color & glyph]).
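
    The number of new edges to expect at each step can be worked out in R. The sketch below builds the vertices and edges of a unit cube directly and does not read cube6.xml; the helper names are hypothetical. It counts the edges that run along the most recently added dimension.

    ```r
    # Vertices of the d-dimensional unit cube: all 0/1 combinations.
    cube.vertices <- function(d) {
      unname(as.matrix(expand.grid(rep(list(0:1), d))))
    }

    # Edges join vertices that differ in exactly one coordinate.
    cube.edges <- function(v) {
      pairs <- t(combn(nrow(v), 2))
      n.diff <- rowSums(v[pairs[, 1], , drop = FALSE] !=
                        v[pairs[, 2], , drop = FALSE])
      pairs[n.diff == 1, , drop = FALSE]
    }

    # Number of edges that run along the last dimension of a d-cube.
    edges.along.last <- function(d) {
      v <- cube.vertices(d)
      e <- cube.edges(v)
      sum(v[e[, 1], d] != v[e[, 2], d])
    }

    v6 <- cube.vertices(6)
    e6 <- cube.edges(v6)
    print(c(vertices = nrow(v6), edges = nrow(e6)))

    # New edges that appear when D4, D5 and D6 are switched on in turn.
    new.edges <- sapply(4:6, edges.along.last)
    print(new.edges)
    ```

    The first two counts agree with the 8 fourth-dimension lines and 16 fifth-dimension lines mentioned in the captions of Figures 60 and 61.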


    Figure 60. Showing 5 of the 8 4th-dimension lines brushed

    Figure 61. Showing 10 of the 16 5th-dimension lines brushed

    Figure 62. All 6 dimensions projected Figure 63. All 6 dimensions projected