R Guide - S. Mills

7/30/2019 R Guide - S. Mills


Data Mining 2011

SECTION 1

Software

1.1 Introduction to R http://cran.r-project.org

R is an interpreted statistical programming language similar to the commercial products S and Matlab. These languages are based on matrices/vectors and enable us to get results in a concise manner.

One of the best ways to develop an ability to program in R is to cut and paste sample code and then see what happens when you change parts of the code.

A good source for such code is in the examples supplied in the R documentation. The documentation can be obtained by going to http://cran.r-project.org/ and, in the left hand margin, clicking on Documentation Manuals. This brings up the following:

Figure 1.

If we select An Introduction to R we will find a good manual for learning R.

While there are many functions for doing statistical calculations, most problems require only a small subset of them. For this reason the functions (and data sets) are broken into libraries or packages. If we select Packages we will see all the packages that are installed in our version of R. It is likely that there is only a subset of all the possible packages installed and if we want to see what else is available we can find many others at http://cran.r-project.org. These packages are the work of statisticians around the world (as is R itself) and are referred to as Contributed packages.

If we are not sure which of our packages might contain the functions that we need, we can type, for example

??logistic

and the output

Help files with alias or concept or title matching 'logistic' using fuzzy matching:

MASS::polr Ordered Logistic or Probit Regression

nnet::multinom Fit Multinomial Log-linear Models

stats::glm Fitting Generalized Linear Models

stats::Logistic The Logistic Distribution

stats::SSfpl Self-Starting Nls Four-Parameter Logistic Model

stats::SSlogis Self-Starting Nls Logistic Model

survival::clogit Conditional logistic regression

Type ?PKG::FOO to inspect entries PKG::FOO, or TYPE?PKG::FOO for

entries like PKG::FOO-TYPE.

tells us the name of the package, the function within the package, and a brief description of what the function does.

1.1.1 Libraries (Packages)

Once we have found which library we need, we must load it before we use it. For example, if we wish to do Principal Component Analysis, we would find it in the stats package. In order to use it we type

library(stats)

and we can use any function in that library. (The stats package happens to be loaded by default, but the same call loads any installed package.)

If we know the name of the function that we wish to use (and the library is loaded) we can type

?prcomp

This will bring up a window with a description of the function, its usage, its arguments, return values, and (usually) an example (or examples). Cutting and pasting these examples gives you an opportunity to explore the behaviour of the function. The documentation may also give references for the concepts behind the function and point to other functions that are related to it. Starting with version 2.4, the help window is a mini-browser for the entire package rather than just a text page for the requested function.

1.1.2 Assignments, sequences

To get a start on using R, we can look at some simple examples.

If you wish to assign a number (or more complicated object) to a variable the usual method is to use <- (although in most contexts it is also possible to use =) as in

a <- 5

R gives no response, but typing

a

[1] 5

shows that a has been assigned the value 5.

A simpler method is to enclose the expression in parentheses

(a <- 5)

[1] 5

We can assign a vector to a variable. (In fact there are different ways to do this depending on the nature of the vector.)

(b <- c(1, 3, 2, 6, 5, 3, 2))

[1] 1 3 2 6 5 3 2

In this example, c concatenates the set of comma-delimited numbers into a vector.

In the following, the vector is created by repeating a number or set of numbers

(c <- rep(5, 7))

[1] 5 5 5 5 5 5 5

(c.1 <- rep(c(1,3,2),4))

[1] 1 3 2 1 3 2 1 3 2 1 3 2

The above shows a way of creating variable names with the use of the . character (as in c.1).

We can create vectors by sequencing operations

(d <- 1:6)

[1] 1 2 3 4 5 6

(e <- 6:1)

[1] 6 5 4 3 2 1

(f <- 1:10/10)

[1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

(g <- seq(10, 9, -.2))

[1] 10.0 9.8 9.6 9.4 9.2 9.0

We can perform operations on these vectors

a - b

[1] 4 2 3 -1 0 2 3

a*b

[1] 5 15 10 30 25 15 10

(No parentheses are needed because there is no assignment.)
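The element-wise results above rely on R recycling the shorter operand. The following is a minimal sketch of that behaviour; the names scaled and shifted are illustrative and do not appear elsewhere in this guide.

```r
a <- 5
b <- c(1, 3, 2, 6, 5, 3, 2)

# The length-1 vector a is recycled against every element of b
scaled <- a * b

# A length-2 vector is recycled three times against a length-6 vector
shifted <- 1:6 + c(0, 10)

print(scaled)
print(shifted)
```

Recycling is silent when the longer length is a multiple of the shorter one, so it is worth checking lengths when a result looks surprising.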

1.1.3 Matrices

a%*%b

Error in a %*% b : non-conformable arguments



The %*% represents matrix multiplication and the above tried to multiply a 1 × 1 and a 7 × 1 vector together.

We can use the t operator (transpose) to give a 1 × 1 and a 1 × 7.

a%*%t(b)

[,1] [,2] [,3] [,4] [,5] [,6] [,7]

[1,] 5 15 10 30 25 15 10

sum(a%*%t(b))

[1] 110
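When a matrix product fails or gives an unexpected shape, checking dimensions with dim usually explains why. This is a small sketch using the same a and b as above; outer.ab and inner.bb are names chosen for illustration.

```r
a <- 5
b <- c(1, 3, 2, 6, 5, 3, 2)

outer.ab <- a %*% t(b)   # 1 x 1 times 1 x 7 gives a 1 x 7 matrix
inner.bb <- t(b) %*% b   # 1 x 7 times 7 x 1 gives a 1 x 1 matrix

print(dim(outer.ab))
print(dim(inner.bb))
```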

We can also assign a matrix to a variable.

Simple matrices could be created as

(m.1 <- matrix(0, 3, 2))

[,1] [,2]

[1,] 0 0

[2,] 0 0

[3,] 0 0

(m.2 <- matrix(1:12, nrow = 3))

[,1] [,2] [,3] [,4]

[1,] 1 4 7 10

[2,] 2 5 8 11

[3,] 3 6 9 12

as well as some special matrices such as

(I.3 <- diag(1,3))

[,1] [,2] [,3]

[1,] 1 0 0

[2,] 0 1 0

[3,] 0 0 1

We could create a matrix from the vector b with

(h <- matrix(b, nrow = 1))

[,1] [,2] [,3] [,4] [,5] [,6] [,7]

[1,] 1 3 2 6 5 3 2

We can multiply matrices in different ways

h*h

[,1] [,2] [,3] [,4] [,5] [,6] [,7]

[1,] 1 9 4 36 25 9 4

(the same as b*b).

h%*%t(h)

[,1]

[1,] 88

(h.m <- t(h)%*%h)

[,1] [,2] [,3] [,4] [,5] [,6] [,7]

[1,] 1 3 2 6 5 3 2

[2,] 3 9 6 18 15 9 6

[3,] 2 6 4 12 10 6 4

[4,] 6 18 12 36 30 18 12

[5,] 5 15 10 30 25 15 10

[6,] 3 9 6 18 15 9 6




[7,] 2 6 4 12 10 6 4

To access an entry (or submatrix)

h.m[5, 2]

[1] 15

h.m[1:3, 5:4] # Note the reverse order of the columns

[,1] [,2]

[1,] 5 6

[2,] 15 18

[3,] 10 12

In the above input, the # indicates that the remainder of the line is a comment.

h.m[c(3,5,1), c(6,2,7)]

[,1] [,2] [,3]

[1,] 6 6 4

[2,] 15 15 10

[3,] 3 3 2

It is also possible to change the values

h.m[c(3,5,1), c(6,2,7)] <- -10

h.m

[,1] [,2] [,3] [,4] [,5] [,6] [,7]

[1,] 1 -10 2 6 5 -10 -10

[2,] 3 9 6 18 15 9 6

[3,] 2 -10 4 12 10 -10 -10

[4,] 6 18 12 36 30 18 12

[5,] 5 -10 10 30 25 -10 -10

[6,] 3 9 6 18 15 9 6

[7,] 2 6 4 12 10 6 4

and determine some values -

sum(h.m)

[1] 330

mean(h.m)

[1] 6.734694

apply(h.m, 1, sum) # Row sums

[1] -16 66 -2 132 40 66 44

apply(h.m, 1, mean) # Row means

[1] -2.2857143 9.4285714 -0.2857143 18.8571429 5.7142857 9.4285714 6.2857143

apply(h.m, 2, sum) # Column sums

[1] 22 12 44 132 110 12 -2

apply(h.m, 2, mean) # Column means

[1] 3.1428571 1.7142857 6.2857143 18.8571429 15.7142857 1.7142857 -0.2857143

etc.

The elements of a matrix are not restricted to numerical values. For example,

matrix(letters[1:6], ncol = 3)

[,1] [,2] [,3]

[1,] "a" "c" "e"

[2,] "b" "d" "f"

Suppose we have a system of equations Ax = b with

(A <- matrix(c(3, 2, 5, 4, 1, 9, -1, 6, 8), 3, 3))

[,1] [,2] [,3]

[1,] 3 4 -1

[2,] 2 1 6

[3,] 5 9 8

and

(bT <- c(-1, 3, 2))

[1] -1 3 2

we can create the augmented matrix by binding bT to A

cbind(A, bT)

bT

[1,] 3 4 -1 -1

[2,] 2 1 6 3

[3,] 5 9 8 2

(We could also rbind.)
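The guide builds the augmented matrix but does not go on to solve the system. As a hedged aside, base R's solve can be used for that step; the sketch below assumes A is non-singular (it is for these values) and uses the illustrative names x and residual.

```r
A  <- matrix(c(3, 2, 5, 4, 1, 9, -1, 6, 8), 3, 3)
bT <- c(-1, 3, 2)

# solve(A, bT) solves A x = bT without forming the inverse explicitly
x <- solve(A, bT)

# The residual should be zero up to rounding error
residual <- A %*% x - bT
print(x)
print(max(abs(residual)))
```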

The entries supplied to a matrix can be of different types

(Mixed <- matrix(c("Height", "Width", 25, 30), 2, 2))

[,1] [,2]

[1,] "Height" "25"

[2,] "Width" "30"

but the entries are all made the same type - in this case strings.

If we try to multiply the entries in the second column

Mixed[1,2]*Mixed[2,2]

Error in Mixed[1, 2] * Mixed[2, 2] : non-numeric argument to binary operator

It is possible to convert a string to a number with

as.numeric(Mixed[1,2])*as.numeric(Mixed[2,2])

[1] 750
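If the numbers are meant to stay numeric, one alternative (not used in the guide at this point) is to keep each type in its own column of a data frame. A minimal sketch, with Mixed.df and area as illustrative names:

```r
# Each column of a data frame keeps its own type
Mixed.df <- data.frame(name = c("Height", "Width"),
                       value = c(25, 30),
                       stringsAsFactors = FALSE)

# No conversion is needed because value is still numeric
area <- Mixed.df$value[1] * Mixed.df$value[2]
print(area)
```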

1.1.4 Lists

A matrix can only be used for rectangular arrays. The list is a data structure that is more flexible.

We can create a simple list

(L.1 <- list(first.name = "John", last.name = "Smith", sn = 345678, mark = "A-"))

$first.name

[1] "John"

$last.name

[1] "Smith"

$sn

[1] 345678

$mark

[1] "A-"

We can refer to the components of the list by

L.1[2]

$last.name

[1] "Smith"

L.1[[2]]

[1] "Smith"

(Notice that the first form gives the name of the component.)

L.1$last.name

[1] "Smith"

It is better to refer to the components by name because that means that if the order is changed to list(last.name = "Smith", first.name = "John", sn = 345678, mark = "A-") we still get the correct values.
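The point about names can be checked directly. In this sketch L.a and L.b are two illustrative lists holding the same components in different orders.

```r
L.a <- list(first.name = "John", last.name = "Smith", sn = 345678, mark = "A-")
L.b <- list(last.name = "Smith", first.name = "John", sn = 345678, mark = "A-")

# Access by position depends on the order of the components
by.position <- c(L.a[[2]], L.b[[2]])

# Access by name does not
by.name <- c(L.a$last.name, L.b$last.name)

print(by.position)
print(by.name)
```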

Many functions have return values in the form of lists.

We can also build a list by appending components. This is often useful in situations in which we are iterating

L.2 <- {}

L.2 <- c(L.2, list(1))

L.2 <- c(L.2, list("x"))

L.2 <- c(L.2, list("x^2/2!"))

(L.2 <- c(L.2, list("x^3/3!")))

[[1]]

[1] 1

[[2]]

[1] "x"

[[3]]

[1] "x^2/2!"

[[4]]

[1] "x^3/3!"

It is sometimes useful to remove things from the list structure and that can be done by

unlist(L.2)

[1] "1" "x" "x^2/2!" "x^3/3!"

1.1.5 paste

In the above, we use the c to concatenate a set of numbers into a vector. If we wish to concatenate strings (numbers get converted to strings), we use paste as in

paste("John", "Smith", 345678)

[1] "John Smith 345678"

Notice that there is a space between the names. We can change that with

(str.1 <- paste("John", "Smith", 345678, sep = ","))

[1] "John,Smith,345678"

(str.2 <- paste("Jane", "Jones", 234567, sep = ","))

[1] "Jane,Jones,234567"

In other words, sep = "," controls what separates the quantities that are being pasted together (it defaults to a space but can be more than a single character). If we do not want the space, we can remove it with sep = "". On the other hand we may wish to insert some other character

(str.3 <- paste("D:", "DATA", "Data Mining R-code", sep = "/"))

[1] "D:/DATA/Data Mining R-code"

If we have a vector of strings, paste does nothing unless we tell it to collapse the vector (and what to put between the elements).

(str.4 <- paste(unlist(L.2), collapse = " + "))

[1] "1 + x + x^2/2! + x^3/3!"

1.1.6 strsplit

There may be times when we need to unpaste a string. We do this with

strsplit(str.1, ",")

[[1]]

[1] "John" "Smith" "345678"

which produces a list of one element, or

strsplit(rbind(str.1, str.2), ",")

[[1]]

[1] "John" "Smith" "345678"

[[2]]

[1] "Jane" "Jones" "234567"

which produces a list element for each row of the matrix.

If we try the same thing on

strsplit(str.4, " + ")

[[1]]

[1] "1 + x + x^2/2! + x^3/3!"

nothing happens. The reason is that + is a metacharacter in regular expressions - along with . \ | ( ) [ { ^ $ * ? - and we need to change it to an ordinary character with

strsplit(str.4, " \\+ ")

[[1]]

[1] "1" "x" "x^2/2!" "x^3/3!"
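An alternative to escaping a metacharacter is to ask strsplit to treat the separator literally with fixed = TRUE. The sketch below uses a stand-alone string, str.demo, introduced only for this illustration.

```r
str.demo <- "1 + x + x^2/2! + x^3/3!"

# Escape the plus sign so the regular expression matches it literally
parts.regex <- strsplit(str.demo, " \\+ ")[[1]]

# Or switch regular expressions off for the separator
parts.fixed <- strsplit(str.demo, " + ", fixed = TRUE)[[1]]

print(parts.regex)
print(parts.fixed)
```

The fixed form is often easier to read when the separator is known exactly and no pattern matching is needed.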

1.1.7 Control Structures

R has the usual control structures.

If we let

a <- 1

b <- 2

if (a < b) print("a < b")

[1] "a < b"

then

if (a < b)

print("a < b")

else

Error: syntax error in else

print("a >= b")

tells us that there is a syntax error in else.

The correct form is

if (a < b) {

print("a < b")

} else {

print("a >= b")

}

[1] "a < b"

Note that the { and } are used as the beginning and ending of blocks of code.

for (i in 1:5) {

print(i)

}

[1] 1

[1] 2

[1] 3

[1] 4

[1] 5

Instead of 1:5 we could have things such as i in c(5, 3, 7, 2, -4, -9).

n <- 1

f <- 1

while (n < 5) {

f <- f * n

n <- n + 1

print(f)

}

[1] 1

[1] 2

[1] 6

[1] 24

R also has repeat, break, and next.
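Since repeat, break, and next are only mentioned, here is a minimal sketch of how they fit together. The variable odd.squares is illustrative.

```r
odd.squares <- c()
i <- 0
repeat {
  i <- i + 1
  if (i > 9) break         # leave the loop entirely
  if (i %% 2 == 0) next    # skip the rest of this pass for even i
  odd.squares <- c(odd.squares, i^2)
}
print(odd.squares)
```

A repeat loop has no condition of its own, so it needs a break somewhere in its body to terminate.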

1.1.8 Functions

In the preceding, we have used several of the built-in functions of R (paste, strsplit, print). We will often need to write our own functions, so we need to look at the structure of functions.

Suppose that we have a random set of numbers and wish to find the mean (there is, of course, a built-in function for this)

(num <- runif(20, -1, 1)) # 20 random numbers from a uniform distribution

[1] -0.439628823 0.233594691 -0.266024740 0.007379845 0.466801666

[6] -0.092201637 0.683620457 -0.003777758 0.040208154 0.276043486

[11] -0.236744290 0.258244825 0.262341596 -0.518254284 -0.584123774

[16] -0.300963114 0.129197590 0.941738119 0.073712909 -0.991614198

We could write the function as

my.mean <- function (x) {

len <- length(x)

sum(x)/len

}

Note that this says that the name my.mean has the function assigned to it. We can see what the function is if we enter the name

my.mean

function (x) {

len <- length(x)

sum(x)/len

}

We call the function in the usual manner

my.mean(num)

[1] -0.003022464

A type of function that we will often need is the recursive function. A common example that is used in recursion is the factorial function (it is NOT the best way to find the factorial, but it is easy to program and to understand).

If you are not familiar with recursion in programming, the following might help.

fact <- function (m) {

if (m > 1) {

f <- fact(m - 1) * m

} else {

return(1)

}

f

}
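One way to gain confidence in a recursive function is to compare it with a non-recursive version and with the built-in factorial. The sketch below repeats fact so that it runs on its own; fact.loop is an illustrative helper that is not part of the guide.

```r
fact <- function(m) {
  if (m > 1) {
    f <- fact(m - 1) * m
  } else {
    return(1)
  }
  f
}

# Iterative version: multiply 1 * 2 * ... * m in a loop
fact.loop <- function(m) {
  f <- 1
  for (k in seq_len(m)) f <- f * k
  f
}

recursive <- sapply(0:10, fact)
iterative <- sapply(0:10, fact.loop)
print(recursive)
```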

1.1.9 Debugging

We will use the

debug(fact)

to enable us to trace through the factorial function.

(We can display the values of the variables within the function by typing the name of the variable. If we have a variable called n we need to type print(n) - this is why I use m as the variable. If you have seen enough, you can use c to continue without debugging or Q to quit.)

The text in ( ) are comments on the process.

fact(5)

debugging in: fact(5) (tells us that we have just entered fact)

debug: {

    if (m > 1) { (the next block of the function to be evaluated)

        f <- fact(m - 1) * m

    }

    else {

        return(1)

    }

    f

}

Browse[1]> m;f;n (a command to print the value of m & f and take the next step)

[1] 5 (m)

[1] 0 (f)

debug: if (m > 1) { (next block - we stepped past the {)

    f <- fact(m - 1) * m

} else {

    return(1)

}

Browse[1]> m;f;n

[1] 5

[1] 0

debug: f <- fact(m - 1) * m (m > 1 so we execute this)

Browse[1]> m;f;n

[1] 5

[1] 0

debugging in: fact(m - 1) (we have stepped into fact again - with m = 4 - see below)

debug: {

    if (m > 1) {

        f <- fact(m - 1) * m

    }

    else {

        return(1)

    }

    f

}

Browse[1]> m;f;n

[1] 4 (m = 4)

[1] 0

debug: if (m > 1) {

    f <- fact(m - 1) * m

} else {

    return(1)

}

Browse[1]> m;f;n

[1] 4

[1] 0

debug: f <- fact(m - 1) * m (m > 1 so we execute this)

Browse[1]> m;f;n

[1] 4

[1] 0

debugging in: fact(m - 1) (we have stepped into fact again - with m = 3 - see below)

debug: {

    if (m > 1) {

        f <- fact(m - 1) * m

    }

    else {

        return(1)

    }



debug: return(1) (this time m = 1 so we do not take the fact path)

Browse[1]> n

exiting from: fact(m - 1) (this is the first time that we have done this -

we return a 1 to the function that called this

and use that value as the multiplier of m = 2)

debug: f

Browse[1]> m;f;n

[1] 2

[1] 2

exiting from: fact(m - 1) (return from fact with the value 2 = 2 * 1

and use this as the multiplier of m = 3)

debug: f

Browse[1]> m;f;n

[1] 3

[1] 6

exiting from: fact(m - 1) (return from fact with the value 6 = 3 * 2 * 1)

debug: f

Browse[1]> m;f;n

[1] 4

[1] 24

exiting from: fact(m - 1) (return from fact with the value 24 = 4 * 3 * 2 * 1)

debug: f

Browse[1]> m;f;n

[1] 5

[1] 120

exiting from: fact(5) (return from fact with the value 120 = 5 * 4 * 3 * 2 * 1)

[1] 120

If you have a long complicated function with a small section that you wish to investigate, it is possible to insert the command browser() into the code. In this case, typing c (continue) will allow the code to be executed until you reach the browser() command again.
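A browser() call can also be made conditional, so that execution only pauses in the case of interest. The following is a sketch under stated assumptions: running.total is a made-up function, and the interactive() guard is my addition so that the code runs unattended in a script.

```r
running.total <- function(x) {
  total <- 0
  for (value in x) {
    # Pause only in an interactive session, and only for a negative value
    if (interactive() && value < 0) browser()
    total <- total + value
  }
  total
}

result <- running.total(c(3, -1, 4))
print(result)
```

In an interactive session, typing c at the Browse prompt resumes execution until the condition is met again.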

This gives a brief look at some of the concepts in R. We will look at others as we need them.




1.2 Data Visualization in R

It is very important to gain a feel for the data that we are investigating.

One way to do this is by visualization.

We will do this by starting with a simple dataset that has some nice features.

Flea Beetles

This data is from a paper by A. A. Lubischew, "On the Use of Discriminant Functions in Taxonomy", Biometrics, Dec 1962, pp. 455-477.

There are three species of flea-beetles: C. concinna, Hp. heptapotamica, and Hk. heikertingeri, and 6 measurements on each.

tars1 - width of the first joint of the first tarsus in microns (the sum of measurements for both tarsi).

tars2 - the same for the second joint.

head - the maximal width of the head between the external edges of the eyes in 0.01 mm.

aede1 - the maximal width of the aedeagus in the fore-part in microns.

aede2 - the front angle of the aedeagus (1 unit = 7.5 degrees).

aede3 - the aedeagus width from the side in microns.

1.2.1 Reading Data

The first thing we have to do is get the data (for this data set we will read in a text file). The following illustrates how to read the file(s). (Note the use of the UNIX type path separator with / rather than \.)

drive <- "D:"

code.dir <- paste(drive, "DATA", "Data Mining R-Code", sep = "/")

data.dir <- paste(drive, "DATA", "Data Mining Data", sep = "/")

# Set the files to be read

(d.file <- paste(data.dir, "fleas", "flea.dat", sep = "/"))

[1] "D:/DATA/Data Mining Data/fleas/flea.dat"

(d.col <- paste(data.dir, "fleas", "flea.col", sep = "/"))

[1] "D:/DATA/Data Mining Data/fleas/flea.col"

We now have paths for two files: d.file points to the data and d.col points to the column headers (variable names).

The function scan can be used to read in the data. If the data has characters in it, we need to indicate that with "" as the second argument.



(headers <- scan(d.col))

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :

scan() expected 'a real', got 'tars1'

(headers <- scan(d.col, ""))

Read 6 items

[1] "tars1" "tars2" "head" "aede1" "aede2" "aede3"

(n.var <- length(headers)) # The vector length gives the number of variables

[1] 6

For reading in the data we can do a scan to read the data into a vector, and then convert the vector to a matrix with n.var columns. Because data files are typically stored by rows, and R does the conversion to a matrix by column, we need to indicate that with byrow = T.

d.flea.s <- matrix(scan(d.file), ncol = n.var, byrow = T)

Read 444 items

d.flea.s[1:5,] # This displays the first 5 rows and all the columns.

[,1] [,2] [,3] [,4] [,5] [,6]

[1,] 191 131 53 150 15 104

[2,] 185 134 50 147 13 105

[3,] 200 137 52 144 14 102

[4,] 173 127 50 144 16 97

[5,] 171 118 49 153 13 106

Note that if the file contains a mixture of numbers and text, we need to use the "" in the scan

d.flea.str <- matrix(scan(d.file, ""), ncol = n.var, byrow = T)

Read 444 items

d.flea.str[1:5,]

[,1] [,2] [,3] [,4] [,5] [,6]

[1,] "191" "131" "53" "150" "15" "104"

[2,] "185" "134" "50" "147" "13" "105"

[3,] "200" "137" "52" "144" "14" "102"

[4,] "173" "127" "50" "144" "16" "97"

[5,] "171" "118" "49" "153" "13" "106"

We can see that the numbers are read in as strings. In order to do arithmetic on them, they have to be converted to numbers.

The command as.numeric(...) will convert a string to a number

as.numeric(d.flea.str[1,1])

[1] 191

but when it is applied to an array it makes the array into a vector.

To correct this we could try

d.flea.s <- matrix(as.numeric(d.flea.str), ncol = n.var)

d.flea.s[1:5,]

[,1] [,2] [,3] [,4] [,5] [,6]

[1,] 191 131 53 150 15 104

[2,] 185 134 50 147 13 105

[3,] 200 137 52 144 14 102

[4,] 173 127 50 144 16 97

[5,] 171 118 49 153 13 106



which appears to correct the problem.

A better way in many cases is to read the data as a table

d.flea <- read.table(d.file)

d.flea[1:5,]

V1 V2 V3 V4 V5 V6

1 191 131 53 150 15 104

2 185 134 50 147 13 105

3 200 137 52 144 14 102

4 173 127 50 144 16 97

5 171 118 49 153 13 106

Note the different form for the row and column headers.

The latter form is better in that it gives names to the rows and columns - not just positions.

Consider taking a sub-matrix

d.flea.s[c(5,10,15,20), c(2,4)]

[,1] [,2]

[1,] 118 153

[2,] 115 142

[3,] 130 147

[4,] 121 147

We have no way of identifying from where the entries came. On the other hand

d.flea[c(5,10,15,20), c(2,4)]

V2 V4

5 118 153

10 115 142

15 130 147

20 121 147

retains the information as to the rows and columns.
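The two reading approaches can be compared on a small stand-in file. This sketch writes three rows (the first three columns of the flea data shown above) to a temporary file; the names tmp, m.scan, and df.table are illustrative, and quiet = TRUE is used only to suppress the "Read n items" message.

```r
tmp <- tempfile(fileext = ".dat")
writeLines(c("191 131 53", "185 134 50", "200 137 52"), tmp)

# scan returns a plain vector, so the matrix shape must be restored by row
m.scan <- matrix(scan(tmp, quiet = TRUE), ncol = 3, byrow = TRUE)

# read.table returns a data frame with default names V1, V2, V3
df.table <- read.table(tmp)

same.values <- all(m.scan == as.matrix(df.table))
unlink(tmp)   # remove the temporary file
print(same.values)
```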

It is possible to improve on the first case by assigning row and column name information

dimnames(d.flea.s) <- list(1:dim(d.flea.s)[1], headers)

To see what is happening in this consider

dim(d.flea.s)

[1] 74 6

This gives the dimension of the matrix so 1:dim(d.flea.s)[1] creates a vector of integers from 1 to the number of rows in the matrix. dimnames assigns the values in the list as row and column labels.

d.flea.s[c(5,10,15,20), c(2,4)]

tars2 aede1

5 118 153

10 115 142

15 130 147

20 121 147

This is much more useful and is similar to the information displayed by the data frame version. In fact it is more informative because it uses the true header information. To improve the data frame we can replace the generic column headers by the correct values (the row headers are good) by

colnames(d.flea) <- headers

d.flea[c(5,10,15,20), c(2,4)]

tars2 aede1

5 118 153

10 115 142

15 130 147

20 121 147

Now that we have all the data, we can use further information to specify the species for the cases. For some purposes we may find it best to have characters to represent the species while for others, numerical values may be best. We will create both

flea.species <- c(rep("C",21),rep("Hp",22),rep("Hk",31))

species <- c(rep(1,21),rep(2,22),rep(3,31))
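A factor can carry both representations at once: integer codes underneath and character labels on top. The following sketch is an alternative to the two vectors above, not what the guide itself uses; counts and species.f are illustrative names.

```r
counts <- c(21, 22, 31)

# rep with a vector of times repeats each value the matching number of times
species <- rep(1:3, times = counts)

# The factor stores the codes 1, 2, 3 and shows the labels C, Hp, Hk
species.f <- factor(species, levels = 1:3, labels = c("C", "Hp", "Hk"))

print(table(species.f))
```

as.integer(species.f) recovers the numeric codes and as.character(species.f) the labels.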

Here we have used information that was not contained in the data to set the species. This information is found in

(d.row <- paste(d.data.dir, d.basename, ".row", sep = ""))

(row.headers <- noquote(scan(d.row, "")))

Read 74 items

[1] Concinna Concinna Concinna Concinna Concinna Concinna Concinna

[8] Concinna Concinna Concinna Concinna Concinna Concinna Concinna

[15] Concinna Concinna Concinna Concinna Concinna Concinna Concinna

[22] Heptapot. Heptapot. Heptapot. Heptapot. Heptapot. Heptapot. Heptapot.



[43] Heptapot. Heikert. Heikert. Heikert. Heikert. Heikert. Heikert.

[50] Heikert. Heikert. Heikert. Heikert. Heikert. Heikert. Heikert.



[71] Heikert. Heikert. Heikert. Heikert.

A further refinement is to bind things together in what is called a data frame. As it happens the table version is a data frame.

is.data.frame(d.flea)

[1] TRUE

Many functions require the use of a data frame. (It might be best to also bind in the species information, but which one depends on what we are doing.)

df.flea <- data.frame(d.flea.s)

We can read in some functions that we will use.

source(paste(d.code.dir, "DispStr.r", sep = ""))

source(paste(d.code.dir, "pairs_ext.r", sep = ""))

source(paste(d.code.dir, "MakeStereo.r", sep = ""))

source reads code from a file just as though it had been typed or pasted into R. Now we can look at the data as something other than just numbers - i.e. visualization of the data.

1.2.2 Scatterplot Matrices



Figure 2. Scatterplot matrix of tars1, tars2, head, aede1, aede2 and aede3. The first command gives the standard pairs plot. It shows duplicate plots.

Figure 3. The same scatterplot matrix with histograms on the diagonal and correlations in the upper panels. The second command also shows histograms and correlations (the size of the number relates to the degree of correlation).

It should be noted that high correlations (visual or numeric) indicate some linear relationship between pairs of variables but low correlations tell us nothing - the data may be related in a nonlinear fashion or related in combination with other variables.
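A small constructed example (not taken from the flea data) shows how a perfect nonlinear relationship can have a correlation of essentially zero. The names x, y, and r are used only in this sketch.

```r
# y is completely determined by x, but the relationship is not linear
x <- seq(-1, 1, length.out = 201)
y <- x^2

r <- cor(x, y)
print(r)
```

Because x is symmetric about zero, the positive and negative halves cancel in the covariance, so r is zero up to rounding error.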

At this point we might consider how we can see what is going on within these functions. We will consider two closely related methods - debug(fun) and browser().

Consider

debug(panel.cor)

pairs(d.flea, upper.panel = panel.cor, diag.panel = panel.hist)

debugging in: lower.panel(as.vector(x[, j]), as.vector(x[, i]), ...)

debug: {

    usr <- par("usr")

    on.exit(par(usr))

    par(usr = c(0, 1, 0, 1))

    r <- abs(cor(x, y))

    txt <- format(c(r, 0.123456789), digits = digits)[1]

    txt <- paste(prefix, txt, sep = "")

    if (missing(cex.cor))

        cex.cor <- 0.8/strwidth(txt) * r

    text(0.5, 0.5, txt, cex = cex.cor)

}

(On start-up, the function that is being debugged is displayed. We can step through by typing n or next which executes the current command and displays the next one.)

Browse[1]> n

debug: usr <- par("usr")

Browse[1]> n

debug: on.exit(par(usr))

Browse[1]> n

debug: par(usr = c(0, 1, 0, 1))

Browse[1]> n

debug: r <- abs(cor(x, y))

Browse[1]> n

debug: txt <- format(c(r, 0.123456789), digits = digits)[1]

Browse[1]> x

[1] 131 134 137 127 118 118 134 129 131 115 143 131 130 133 130 131 127 126 140

[20] 121 136 141 119 130 113 121 115 127 123 119 120 131 127 116 123 135 132 131

[39] 116 121 146 119 127 107 122 114 131 108 118 122 127 125 124 129 126 122 116

[58] 123 122 123 109 124 114 120 114 119 111 112 130 120 119 114 110 124

Browse[1]> y

[1] 191 185 200 173 171 160 188 186 174 163 190 174 201 190 182 184 177 178 210

[20] 182 186 158 146 151 122 138 132 131 135 125 130 130 138 130 143 154 147 141

[39] 131 144 137 143 135 186 211 201 242 184 211 217 223 208 199 211 218 203 192

[58] 195 211 187 192 223 188 216 185 178 187 187 201 187 210 196 195 187

Browse[1]> r

[1] 0.02634653

Browse[1]> n

debug: txt <- paste(prefix, txt, sep = "")

Browse[1]> n

debug: if (missing(cex.cor)) cex.cor <- 0.8/strwidth(txt) * r

Browse[1]> txt

[1] "0.026"

Browse[1]> Q

To stop the debugging, we can type Q.

The next time we call a function that has been set for debugging, it will again be debugged. To turn the debugging off, we type

undebug(panel.cor)

If we wish to debug a function that we have written, we can put a browser() statement inside the function body.

The advantage of this is that when we have several spots at which we want to look at the behaviour, we can type c to continue the execution until we hit the next browser() statement - quite useful with loops.

In this case we know the species corresponding to each case so it is instructive to look at the relationship among the variables and species by the use of colour (col). The colour can be given by a number or by a name such as "red".

pairs(d.flea, col = species + 1)



Figure 4. Scatterplot matrix of tars1, tars2, head, aede1, aede2 and aede3, coloured by species.

We notice that in several plots the species are not mixed together e.g. tars1 vs. aede2, tars1

vs. aede1 etc.

1.2.3 Conditional Plots

To investigate some of the more complicated relationships we can look at conditional plotting.

There is more than one version of this. The first one is part of the standard package. Note the arguments. The aede3 ~ tars1 | aede1 says that we are plotting aede3 against tars1 conditioned against aede1. That is, we will get several plots corresponding to different ranges of aede1. Note that ~ is frequently used to indicate a formula. The data = df.flea allows us to use the variable names in the formula because those names are part of the data frame. (The overlap = 0.1 will be explained later.)

coplot(aede3 ~ tars1 | aede1, data = df.flea)

coplot(aede3 ~ tars1 | aede1, data = df.flea, overlap = 0.1)



Figure 5. Figure 6. (Conditioning plots of aede3 against tars1, given aede1.)

In both the above figures, the lower six panels show the pairwise plots for aede3 against tars1 for different ranges of aede1 as shown in the upper panel. The defaults for this function are to select 6 different subsets of the third variable with an equal number of cases in each. In addition an overlap of 0.5 is allowed. The second example has reduced the overlap to 0.1. We get a different view if we colour our points by species.

coplot(aede3 ~ tars1 | aede1, data = df.flea, overlap = 0.1, col = species + 1, pch = 16)

Figure 7. (Conditioning plot of aede3 against tars1, given aede1, coloured by species.)

This further illustrates the relationships noted earlier.

Another version of this is found in the lattice package, but before using this we might consider a function that will allow us to condition on fixed interval lengths rather than fixed count.

library(lattice)

equal.space <- function(data, count) {

    # range(data) gives the max and min of the variable data.
    # diff takes the difference between the two values so
    # diffs gives the width of each interval.

    diffs <- diff(range(data))/count

    # min(data)+diffs*(0:(count-1)) gives the starting values
    # for the intervals.
    # min(data)+diffs*(1:count) gives the ending values
    # for the intervals.
    # cbind treats two (or more) vectors as column vectors
    # and binds them as columns of a matrix.

    intervals <- cbind(min(data)+diffs*(0:(count-1)),
                       min(data)+diffs*(1:count))

    # shingle takes the interval structure and the data
    # and breaks the data into the appropriate groups.

    return (shingle(data, intervals))

}
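The interval arithmetic inside equal.space can be checked without lattice. This sketch reproduces only the cbind step on a small made-up vector; data and count are chosen for illustration.

```r
data  <- c(120, 126, 135, 150)
count <- 3

# Width of each interval: the full range divided into count equal pieces
diffs <- diff(range(data)) / count

# Column 1 holds the starting values, column 2 the ending values
intervals <- cbind(min(data) + diffs * (0:(count - 1)),
                   min(data) + diffs * (1:count))
print(intervals)
```

Each ending value equals the next starting value, so the intervals tile the range with no overlap.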

The following uses the conditional plotting from the lattice package with

a) equal cases in each grouping and

b) equal spacing in each grouping.

C1 <- equal.count(df.flea$aede1, number = 6, overlap = 0.1)

xyplot(aede3 ~ tars1 | C1, data = df.flea, pch = 19)

C2 <- equal.space(df.flea$aede1, 6)

xyplot(aede3 ~ tars1 | C2, data = df.flea, pch = 19)

Figure 8. Equal cases in each grouping (aede3 against tars1, conditioned on C1).

Figure 9. Equal spacing in each grouping (aede3 against tars1, conditioned on C2).

This version does not show the values of the conditioning variable.



It is also possible to condition against two variables, but before doing that we will create a synthetic data set. For now we will not go into detail about the nature of the data.

source(paste(d.code.dir, "ellipseOutline.r", sep = ""))

ec.t1 <- {}

for (t in -20:20) {

    ec.t1 <- rbind(ec.t1, cbind(ellipse.outline(20,20,10,5,t,0,(200-t^2)/10),t))

}

ec.t1 <- data.frame(ec.t1[sample(dim(ec.t1)[1], dim(ec.t1)[1]),])

We can plot the scatterplot matrix

pairs(ec.t1, upper.panel = panel.cor, diag.panel = panel.hist)

Figure 10. (Scatterplot matrix of x, y, z and t with histograms and correlations.)

In the lines that follow, note the use of the $ symbol. In this case it is used to reference the columns of a data frame; in other cases it references parts of other objects.

X <- equal.space(ec.t1$x, 25)

Y <- equal.space(ec.t1$y, 25)

Z <- equal.space(ec.t1$z, 25)

T <- equal.space(ec.t1$t, 25)

In the following, note the use of aspect. R, like many other languages, tries to use as much of the plotting region as possible. While this works well if the data has no intrinsic shape, it is a severe problem in other situations. For example, if you try to plot an ellipse you will get a circle. To avoid this you need to force the plot routine to use equal scales along the axes. This is often done by use of the aspect ratio (as below) but different plot routines use different methods (and for some it is up to you to find a way to force the appropriate scaling). Note the use of x11() as a method of creating another plot window rather than plotting over the current one.

xyplot(z ~ x | Y, data = ec.t1, pch = ".", main = "z ~ x | Y",
       aspect = diff(range(ec.t1$z))/diff(range(ec.t1$x)))

x11()

xyplot(y ~ x | Z, data = ec.t1, pch = ".", main = "y ~ x | Z",
       aspect = diff(range(ec.t1$y))/diff(range(ec.t1$x)))

x11()

xyplot(z ~ y | X, data = ec.t1, pch = ".", main = "z ~ y | X",
       aspect = diff(range(ec.t1$z))/diff(range(ec.t1$y)))

x11()

xyplot(z ~ x | T, data = ec.t1, pch = ".", main = "z ~ x | T",
       aspect = diff(range(ec.t1$z))/diff(range(ec.t1$x)))


Figure 11. z ~ x | Y

Figure 12. y ~ x | Z



(Plots: lattice displays of z ~ y | X and z ~ x | T.)
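The aspect calculation used in the xyplot calls can be tried on a shape whose proportions we know. The sketch below is only an illustration in base graphics (the ellipse and the names ex, ey and asp.ratio are made up for the example). It computes the same ratio of data ranges and then uses asp = 1, which is the base plot() way of forcing equal scales.

```r
# Points on an ellipse with semi-axes 4 (horizontal) and 1 (vertical)
theta <- seq(0, 2 * pi, length.out = 201)
ex <- 4 * cos(theta)
ey <- sin(theta)

# Ratio of the vertical data range to the horizontal data range,
# computed in the same way as the aspect argument in the xyplot calls
asp.ratio <- diff(range(ey)) / diff(range(ex))
print(asp.ratio)

# In base graphics, asp = 1 forces one unit on the y axis to have the
# same physical length as one unit on the x axis
plot(ex, ey, type = "l", asp = 1, main = "Ellipse drawn with equal scales")
```

Without asp = 1 the plot region is filled and the ellipse looks much rounder than it is.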


Z5 <- equal.space(ec.t1$z, 5)
T5 <- equal.space(ec.t1$t, 5)
xyplot(z ~ x | T5*Z5, data = ec.t1, main = "z ~ x | T5*Z5", pch = ".",
       aspect = diff(range(ec.t1$z))/diff(range(ec.t1$x)))


(Plot: lattice display of z ~ x | T5*Z5.)

Figure 15.

r <- 1
c <- 1
for (i in -20:15) {              # Loop through i from -20 to 15
  ind <- ec.t1$t == i            # Get the cases for which the t value = i
  X <- ec.t1$x[ind]              # And the corresponding x, y, z values
  Y <- ec.t1$y[ind]
  Z <- ec.t1$z[ind]
  # In the following (see ?cloud):
  # print - displays the plot
  # cloud - a function that creates a cloud of points,
  #         with xlim, ylim, zlim (the range of values on the axes)
  #         set to the maximum range (x) to give proper scaling.
  # subpanel - the function used to plot the points.
  # groups - allows classes to be identified.
  # screen - sets the viewpoint.
  # split - c(col, row, cols, rows)
  # more - TRUE if further plots are to be added to the same page
  print(cloud(Z ~ X*Y, xlim = range(ec.t1$x),
              ylim = range(ec.t1$x), zlim = range(ec.t1$x),
              subpanel = panel.superpose, groups = rep(1, dim(ec.t1)[1]),
              screen = list(z = 10, x = -80, y = 0), data = ec.t1),
        split = c(c, r, 6, 6), more = TRUE)
  c <- c + 1
  if (c %% 6 == 1) {             # Remainder mod 6
    c <- 1
    r <- r + 1
  }
}

(Plot: 6 by 6 grid of cloud plots of Z against X and Y, one panel for each value of t from -20 to 15.)

Figure 16.
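The bookkeeping for the column counter c and row counter r in the loop above can be checked on its own, without drawing anything. The following sketch records the (col, row) position that each of the 36 panels would receive and compares it with an equivalent closed form; the names pos, closed and n.panels are introduced here for illustration only.

```r
n.panels <- length(-20:15)   # 36 values of t, one panel each
pos <- matrix(NA, nrow = n.panels, ncol = 2,
              dimnames = list(NULL, c("col", "row")))
r <- 1
c <- 1
for (k in seq_len(n.panels)) {
  pos[k, ] <- c(c, r)        # position used for the k-th panel
  c <- c + 1
  if (c %% 6 == 1) {         # c has reached 7: start a new row
    c <- 1
    r <- r + 1
  }
}

# Equivalent closed form for the same positions
closed <- cbind(col = (seq_len(n.panels) - 1) %% 6 + 1,
                row = (seq_len(n.panels) - 1) %/% 6 + 1)
print(head(pos, 8))
print(all(pos == closed))
```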




1.3 Data Visualization in Ggobi

For high dimensional data, dynamic graphics will reveal more relationships.

For that purpose we will use Ggobi. This is a package that may be called from R or used alone.

library(rggobi)
g <- ggobi(d.flea)

Figure 17. GGobi console Figure 18. Scatterplot

1.3.1 Scatterplot Matrix

Ggobi starts with a console and a simple scatterplot as shown, although we can also have a scatterplot matrix display.

display(g[1], "Scatterplot Matrix")

or [Display][New scatterplot matrix], from the Ggobi console.




Figure 19. Scatterplot Matrix

1.3.2 Grand Tour

In order to investigate the data, we will start with a grand tour

display(g[1], "2D Tour")

or [View][2D Tour]

Figure 20. Figure 21. 2D Tour

This shows the console and the opening display. Note the circle with the lines. This represents the projection of the six axes on the two dimensional display. The process used in the grand tour is that a projection direction is selected and then a new direction is selected and the projection is changed smoothly in that direction. This allows the user to see the data from all directions (although it is possible to move the projected direction by use of the mouse).

This gives a 2D tour of the 6 dimensional data. The portion of each variable in the view is shown by the representation of the axis in the bottom corner (and on the console).
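A single frame of such a tour can be imitated in R. The sketch below is not how Ggobi is implemented; it only shows one random projection of made-up 6 dimensional data onto a plane (the data toy, the seed and the names basis and proj are assumptions for the example). The rows of basis play the role of the axis lines drawn in the circle.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only

# Hypothetical 6-dimensional data: 50 cases, 6 variables
toy <- matrix(rnorm(50 * 6), ncol = 6,
              dimnames = list(NULL, paste0("V", 1:6)))

# A random 2-D projection plane: orthonormalise a random 6 x 2 matrix
basis <- qr.Q(qr(matrix(rnorm(6 * 2), ncol = 2)))
rownames(basis) <- colnames(toy)

# Project the data onto the plane; each row is a point in the 2-D display
proj <- toy %*% basis

# Row j of 'basis' is the projection of the j-th variable's axis, i.e. the
# line drawn for that variable in the axis circle
print(round(basis, 2))
plot(proj, xlab = "Projection 1", ylab = "Projection 2", asp = 1)
```

A tour would move smoothly from one such basis to another rather than jumping between them.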

1.3.3 Brushing

As the tour runs, 3 clusters will appear. When they do, you can click [Pause] and apply brushing, [Interaction][Brush], to group cases.

Figure 22. Brushing

You can change the colour and glyph (symbol) of data points.

The process involves selecting the colour and glyph and moving the brush over the points (we can select [Persistent]; if it is not selected, the brushing is transient).

We can use [Choose color & glyph] as shown below.




Figure 23. Select a red plus (3rd smallest size)

Figure 24. One cluster brushed

- red box for point brushing

Figure 25. Other clusters brushed

In Figure 24, one apparent cluster is set to a red plus, while in Figure 25 another cluster is a yellow cross and the third is a green circle.

We can now return to a grand tour and see if the points of the same colour move together. Notice that for much of the time, the projections of the clusters are mixed together.

If we feel that we have the clusters properly coloured, we can again [Pause] and from R find out what colours the points are.

(old.col <- glyph_colour(g[1]))

F F F F F F F F F F C F F F F F F F F F F C C C C C C C C C C C C C C C C C C C C C C A

9 9 9 9 9 9 9 9 9 9 5 9 9 9 9 9 9 9 9 9 9 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 3

A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A

3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

It turns out that we know the species of the flea beetles so we can compare the clustering that we observed with the true classification.

(noquote(rbind(flea.species, old.col)))

F F F F F F F F F F C F F F F F F F F F F C C C C C C C C C C C

flea.species C C C C C C C C C C C C C C C C C C C C C Hp Hp Hp Hp Hp Hp Hp Hp Hp Hp Hp

old.col 9 9 9 9 9 9 9 9 9 9 5 9 9 9 9 9 9 9 9 9 9 5 5 5 5 5 5 5 5 5 5 5

C C C C C C C C C C C A A A A A A A A A A A A A A

flea.species Hp Hp Hp Hp Hp Hp Hp Hp Hp Hp Hp Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk

old.col 5 5 5 5 5 5 5 5 5 5 5 3 3 3 3 3 3 3 3 3 3 3 3 3 3

A A A A A A A A A A A A A A A A A

flea.species Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk Hk

old.col 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

(Notice that the 11th case is of class C but has the same colour as the Hp cases, while all the others have the colour corresponding to their class of flea beetle.) We then reset the colours and glyphs for the next part.

glyph_colour(g[1]) <- rep(2, 74)   # 74 purple
glyph_type(g[1]) <- rep(4, 74)     # 74 circles
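The comparison between brushed colour and true species can also be summarised as a table rather than read off case by case. In the sketch below the two vectors are typed in by hand from the output printed above (the names species.shown and col.shown are introduced here so that the real flea.species and old.col are not overwritten); with a live Ggobi session the real objects would be used instead.

```r
# Species labels and brushed colours transcribed from the printed output
species.shown <- c(rep("C", 21), rep("Hp", 22), rep("Hk", 31))
col.shown <- c(rep(9, 10), 5, rep(9, 10), rep(5, 22), rep(3, 31))

# Cross-tabulate true class against brushed colour
tab <- table(species = species.shown, colour = col.shown)
print(tab)

# Cases whose colour differs from the most common colour in their class
majority <- colnames(tab)[apply(tab, 1, which.max)]
names(majority) <- rownames(tab)
odd.cases <- unname(which(as.character(col.shown) != majority[species.shown]))
print(odd.cases)
```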

1.3.4 Parallel Coordinates Plot

Another type of plot is the Parallel Coordinates Plot. This can be a good method for investigating high dimensional data. Consider the point (3, 1, 2) as shown in Figure 26 below. In parallel coordinates, it would appear as in Figure 27.

Figure 26. The point (3, 1, 2) Figure 27. The point (3, 1, 2) in parallel coordinates

If we want 4 (or more) dimensions, we need only add another (or more) parallel line(s).

display(g[1], "Parallel Coordinates Display")




or [Display][New parallel coordinates plot].

Figure 28. Parallel Coordinates Plot

Each data point has a value on each of the axes, which are plotted vertically rather than at right angles to each other.
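The construction for the point (3, 1, 2) can be reproduced with base graphics. This is only a sketch of the idea, not the Ggobi display; the variable labels V1 to V3 and the fourth coordinate are made up for the example.

```r
# The point (3, 1, 2): one value for each of three variables
pt <- c(V1 = 3, V2 = 1, V3 = 2)

# In parallel coordinates the axes sit at horizontal positions 1, 2, 3
# and the point becomes the polyline joining (1, 3), (2, 1) and (3, 2)
axis.pos <- seq_along(pt)
polyline <- cbind(axis = axis.pos, value = pt)
print(polyline)

plot(axis.pos, pt, type = "b", xaxt = "n", ylim = c(0, 3),
     xlab = "", ylab = "value", main = "The point (3, 1, 2)")
axis(1, at = axis.pos, labels = names(pt))
abline(v = axis.pos, lty = 2)     # the parallel axes

# A fourth dimension only needs one more axis position
pt4 <- c(pt, V4 = 0.5)            # hypothetical fourth coordinate
print(seq_along(pt4))
```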

1.3.4.1 Parallel Coordinates Brushing

The value of brushing is greatly enhanced when we have more than one display. Below we see a scatterplot with 5 points being brushed and we see that the parallel coordinates display has 5 points (lines) coloured yellow. This shows which points correspond.

Figure 29. Brushing

1.3.4.2 Parallel Coordinates Linked Brushing




Figure 30. Linked brushing

For the next part, we set the colours to correspond to the species

glyph_colour(g[1]) <- c(rep(6, 21), rep(4, 22), rep(9, 31))

It is also possible to get information about the clustering from parallel coordinates. We will start by moving the axes: put the mouse on the white frame and drag aede3 to the first position (a corner will appear as the cursor).

Figure 31.

Repeat until we have the axes arranged as below.




Figure 32.

When we look at this we see that there seem to be values of aede3 and tars1 that split the data.

1.3.4.3 Parallel Coordinates Identification

Before we find the values, we will introduce [Tools][DataViewer] which allows us to look at our data.

Figure 33.

Next we will use [Interaction][Identify]




Figure 34. Linked identification

Figure 35. Linked identification




Figure 36. Identification using record label Figure 37. Linked identification




Figure 38. Identification using data value (tars1)

Figure 39. Linked identification using data value (tars1)

In the above we see that if tars1 < 160 we have one group (blue) split from the other two.




Figure 40. Linked identification using data value (aede3)

In the above we see that if aede3 < 95 we have one group (yellow) almost split from the other two.

It appears that we can do an almost perfect split with this information. (This forms the basis of a recursive splitting process that we will see later.)

We can do similar things from R.

cols <- rep(6, 74)
cols[which(d.flea[,6] < 95)] <- 9
cols[which(d.flea[,1] < 160)] <- 4
glyph_colour(g[1]) <- cols
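The same two-threshold rule can be tried without Ggobi. The sketch below uses a small made-up data frame (the values in toy are hypothetical, chosen only so that each rule is triggered) and assumes, as above, that the comparisons are "less than". It shows how the order of the assignments decides which colour a case ends up with.

```r
# Hypothetical measurements, chosen only to illustrate the two thresholds
toy <- data.frame(tars1 = c(190, 140, 200, 135, 185),
                  aede3 = c(105, 108,  80, 104,  90))

# Start with every case in the default group (colour 6)
cols <- rep(6, nrow(toy))
# Cases with aede3 below 95 form a second group (colour 9)
cols[which(toy$aede3 < 95)] <- 9
# Cases with tars1 below 160 form a third group (colour 4);
# this assignment comes last, so it takes precedence where both rules apply
cols[which(toy$tars1 < 160)] <- 4

print(cbind(toy, cols))
```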

1.3.5 Stereo

An interesting view of the data can be obtained by looking at the data from slightly shifted viewpoints:

make.Stereo(d.flea[,c(1,5,6)], species, Main = "Flea beetles", asp = "F",
            Xlab = "tars1", Ylab = "aede2", Zlab = "aede3")




Figure 41. Stereo projection

make.Stereo(d.flea[,c(6, 5, 1)], species, Main = "Flea beetles", asp = "F",
            Zlab = "tars1", Ylab = "aede2", Xlab = "aede3")

Figure 42. Stereo projection

1.3.6 RGL

A relatively new package is

library(rgl)

which allows interactive visualization.

Consider

plot3d(d.flea[,1], d.flea[,5], d.flea[,6], xlab = "tars1", ylab = "aede2", zlab = "aede3",
       col = species + 1, size = 0.5, type = "s")




Now, apply the following code

for (j in seq(0, 90, 10)) {
  for (i in 0:360) {
    rgl.viewpoint(i, j)
  }
}

The call rgl.viewpoint(i, j) changes the point from which you view the object, making it appear that the object is rotating. The first argument (i in this case) is the spherical coordinates angle θ, while the second is φ. Hence, this code rotates the object about the z-axis (i in 0:360) for φ from 0 to π/2.

It is also possible to

hold the left mouse button down to rotate the image;

hold the right mouse button down (or use the mouse wheel) to zoom in/out.
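To get a feel for what the viewpoint loop above does, the angles can be examined without opening an rgl window. The sketch below counts the viewpoint changes and converts a pair of angles to a unit viewing direction. The conversion is an assumption made for illustration (rotation angle measured about the vertical axis, elevation measured up from the horizontal plane) and the function view.dir is hypothetical; it is not part of rgl, whose internal axis convention may differ.

```r
# Elevation angles j and rotation angles i used in the loop (in degrees)
elev <- seq(0, 90, 10)
rot <- 0:360
n.frames <- length(elev) * length(rot)
print(n.frames)

# Assumed convention for this sketch: theta is the rotation angle about the
# vertical axis and phi is the elevation above the horizontal plane
view.dir <- function(theta, phi) {
  th <- theta * pi / 180
  ph <- phi * pi / 180
  c(horiz1 = cos(ph) * sin(th), horiz2 = cos(ph) * cos(th), vert = sin(ph))
}

print(round(view.dir(0, 0), 3))    # level with the object
print(round(view.dir(0, 90), 3))   # directly above the object
```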

1.4 Examples

1.4.1 Randu




Randu is a random number generator that had a slight flaw.

d.file <- paste(data.dir, "randu.dat", sep = "/")
d.randu <- read.table(d.file)
pairs(d.randu, upper.panel = panel.cor, diag.panel = panel.hist)

Figure 43. Scatterplot matrix for randu

Looking at the scatterplot matrix everything seems random but in Ggobi...

g <- ggobi(d.randu)

with [View][2D tour]

Figure 44. randu projection Figure 45. An interesting projection
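The notes do not say what the flaw is. Assuming the generator is the classic RANDU recurrence (multiplier 65539, modulus 2^31), the following sketch generates a sequence in R and checks the linear relation between successive triples that confines them to a few planes. The function randu.seq and the seed are made up for the example, and the file randu.dat is not used.

```r
# RANDU recurrence: x[k+1] = 65539 * x[k] mod 2^31, started from an odd seed
randu.seq <- function(n, seed = 1) {
  x <- numeric(n)
  x[1] <- seed
  for (k in seq_len(n - 1)) {
    # 65539 * x[k] < 2^48, so the product is exact in double precision
    x[k + 1] <- (65539 * x[k]) %% 2^31
  }
  x
}

x <- randu.seq(3000)
k <- 1:(length(x) - 2)

# Because 65539^2 = 6 * 65539 - 9 (mod 2^31), successive triples satisfy
# x[k+2] - 6 * x[k+1] + 9 * x[k] = 0 (mod 2^31)
resid <- (x[k + 2] - 6 * x[k + 1] + 9 * x[k]) %% 2^31
print(all(resid == 0))

# Scaled to (0, 1), the triples therefore lie on a small number of planes
u <- x / 2^31
planes <- unique(round(u[k + 2] - 6 * u[k + 1] + 9 * u[k]))
print(sort(planes))
```

Each pairwise scatterplot looks uniform, which is why the structure only shows up in a suitable projection of all three coordinates.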

1.4.2 Prim7

Prim7 contains 500 observations taken from a high energy particle physics scattering experiment which yields four particles. The reaction can be described completely by seven (7) independent measurements. The important features of the data are short-lived intermediate reaction stages which appear as protuberant arms in the point cloud.

prim <- read.table(paste(data.dir, "prim7.dat", sep = "/"))
g <- ggobi(prim)

Figure 46. Prim7

We will look at this by looking at clusters that have been found in the past. It would be possible to do this by careful brushing and observation but the following sets the appropriate colours for the data.

new.col <- rep(1, 500)

col.2 <- c(2,3,4,14,15,16,17,18,21,23,30,34,37,41,43,46,49,50,53,54,55,

57,58,63,65,66,69,70,72,73,74,75,77,78,79,85,86,88,90,91,92,

93,94,95,99,100,102,104,105,106,107,109,110,113,114,116,120,

121,124,125,126,127,129,130,133,139,140,141,143,145,147,150,

152,153,157,158,159,160,161,164,166,169,172,175,176,177,178,

180,185,194,195,198,200,203,204,209,210,211,212,218,219,220,

222,223,226,228,229,233,234,236,238,240,242,244,245,246,248,

249,252,253,257,259,263,264,265,266,267,269,270,273,277,278,

280,281,282,283,284,286,292,294,296,297,300,305,310,311,314,

315,317,323,331,332,333,334,335,341,342,343,346,351,356,359,

360,361,362,365,370,372,374,375,377,378,379,380,383,386,388,

389,390,391,393,397,398,400,402,403,405,407,408,413,414,415,

417,418,419,420,425,427,428,429,430,432,433,434,436,437,438,

440,444,445,447,448,452,453,454,455,456,463,465,467,470,471,

473,476,477,478,480,481,482,484,485,487,488,489,490,491,494,497)

col.3 <- c(11,20,27,33,47,51,60,61,62,98,115,118,119,132,155,186,191,

193,202,205,207,208,213,225,230,231,232,235,239,243,250,251,

268,272,295,312,316,338,339,345,349,354,358,364,366,376,381,

395,401,421,422,446,460,496)

col.5 <- c(5,8,13,19,26,32,39,48,56,71,81,96,111,136,137,144,149,156,

162,165,188,199,201,216,255,262,274,279,289,291,301,320,322,

326,327,329,344,348,353,363,367,369,384,399,404,406,411,423,

441,442,443,469,474,479,483,495,499,500)



col.8 <- c(7,29,31,36,89,101,117,131,138,154,173,187,190,192,196,197,

206,247,254,256,258,287,290,298,299,309,324,325,385,387,464)

col.9 <- c(1,12,22,24,25,44,45,52,64,83,103,108,122,123,134,135,146,151,

167,168,170,174,179,181,184,221,224,237,261,271,285,293,304,

306,307,308,319,328,337,352,355,357,368,396,410,424,426,435,

439,449,451,458,461,462,466,472,475,493)

Now set all of the points belonging to one cluster to the colour with value 2.

new.col[col.2] <- 2
glyph_colour(g[1]) <- new.col

Figure 47. Prim7 with first cluster brushed Figure 48. Prim7 with 2 clusters brushed

Now run through the other clusters.

new.col[col.3] <- 3

new.col[col.5] <- 5





Figure 49. Prim7 with 3 clusters brushed Figure 50. Prim7 with 4 clusters brushed

new.col[col.8] <- 8

new.col[col.9] <- 9


Figure 51. Prim7 with all clusters brushed Figure 52.
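The repeated assignments of the form new.col[col.k] <- k can also be written as a loop over a named list of index vectors. The sketch below shows the pattern on a made-up example of 10 cases (the list clusters and its case numbers are hypothetical and are not the Prim7 clusters).

```r
# Hypothetical example: 10 cases and two small clusters
new.col <- rep(1, 10)
clusters <- list("2" = c(2, 3, 7), "5" = c(1, 9))   # made-up case numbers

# The name of each list element is the colour value for that cluster
for (colour in names(clusters)) {
  new.col[clusters[[colour]]] <- as.numeric(colour)
}
print(new.col)

# Check that no case has been assigned to more than one cluster
print(anyDuplicated(unlist(clusters)) == 0)
```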

It is also possible to put lines on the data to outline possible structures. The lines are those determined by researchers.

prim.lin <- read.table(paste(data.dir, "prim7.lines", sep = "/"))
edges(g[1]) <- prim.lin

We can add lines by use of [Interaction][Edit Edges] and dragging the cursor from one point to another.




Figure 53. Line added

With careful exploration in a grand tour, and using projection pursuit, it is possible to discover structure in the data.

1.4.3 6-Dimensional Cube

The following shows how we might investigate the nature of the projection of a 6-dimensional cube.

g <- ggobi(paste(data.dir, "cube6.xml", sep = "/"))

When Ggobi starts up, it will show 4 points in xy Plot mode. On the Scatterplot window select [Edges] and [Attach edge set ...] to show the edges.


Now on the console select [View][2D tour] which will show the projection. To investigate, we turn off the last 3 dimensions (D4, D5, D6) by clicking the selection box for each.





Next, after pausing, select [Interaction][Brush], set [Point brushing] to Off, [Edge brushing] to Color only and select [Persistent]. Next move the brush to colour all the lines, return to [Interaction][2D tour], and deselect [Pause].

Figure 58. Figure 59. Yellow for edge brushing

We now add in the D4, D5, D6 dimensions one at a time, each time brushing the new edges that arise (in a different colour obtained by selecting [Choose color & glyph]).
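The number of new edges to expect at each step can be worked out in R by building the cube directly. The sketch below does not read cube6.xml; it constructs the 64 vertices of a unit 6-dimensional cube, joins vertices that differ in exactly one coordinate, and counts the edges that run along each newly added dimension. The names vertices, edges, edge.dim and new.edges are introduced here for illustration.

```r
n.dim <- 6
# All 2^6 vertices of the unit cube, one row per vertex
vertices <- as.matrix(expand.grid(rep(list(0:1), n.dim)))
colnames(vertices) <- paste0("D", 1:n.dim)

# Two vertices are joined by an edge when they differ in one coordinate
d <- as.matrix(dist(vertices, method = "manhattan"))
edges <- which(d == 1 & upper.tri(d), arr.ind = TRUE)

# The dimension along which each edge runs
edge.dim <- unname(apply(edges, 1, function(e)
  which(vertices[e[1], ] != vertices[e[2], ])))

# Number of edges along dimension k once only dimensions 1..k are active
# (all later coordinates held at 0)
new.edges <- function(k) {
  later <- if (k < n.dim) (k + 1):n.dim else integer(0)
  inactive.zero <- rowSums(vertices[edges[, 1], later, drop = FALSE]) == 0
  sum(edge.dim == k & inactive.zero)
}

print(c(nrow(vertices), nrow(edges)))
print(sapply(4:6, new.edges))
print(nrow(unique(vertices[, 1:2])))   # distinct points in a plot of D1 vs D2
```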




Figure 60. Showing 5 of the 8 4th-dimension lines brushed

Figure 61. Showing 10 of the 16 5th-dimension lines brushed

Figure 62. All 6 dimensions projected Figure 63. All 6 dimensions projected
