Melt Cast The reshape package for restructuring datasets Alexander Zwart CSIRO CMIS 10 June, 2009


Melt Cast

The reshape package for restructuring datasets

Alexander Zwart

CSIRO CMIS

10 June, 2009


CSIRO. The reshape package for restructuring datasets

Content

• What is the reshape package?

• Some relevant R basics

• Restructuring data – examples

• The reshape package paradigm

• The melt() function

• The cast() function

• Aggregating with cast()

• More stuff – Tips and Caveats

• Postscript – Handling NA’s


What is the reshape package?

• It is not the reshape() function in base R!

• …though it has a similar purpose:

  • Rearrange or restructure a dataset!

• Also, summarise, tabulate, “aggregate” a dataset

• Based on a simple paradigm – easier to use than the reshape() function…

• Written by Hadley Wickham (plyr, ggplot2, rggobi…)

• Mainly comprises functions melt() and cast(), but other utilities available as well

• A companion package to the newer “plyr” (not discussed here)


Some quick R Basics

• Basic types of information R can handle (data modes):

  • “numeric”: 1.2, 3.14159, 1000.2

  • “integer”: …, -3, -2, -1, 0, 1, 2, 3, …

  • “logical”: TRUE, FALSE (or T, F)

  • “character”: “a”, “A”, “My dog has fleas”

  • “complex”: (never mind…☺)

> aa <- 3.0

> mode(aa)

[1] "numeric"
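A small sketch of how these types show up when queried, using made-up values. One detail worth knowing: mode() reports "numeric" for integers as well as for ordinary decimal numbers, so class() or typeof() is needed to tell them apart.

```r
# Illustrative values only (not from the talk).
x_num <- 3.0                 # ordinary (double-precision) number
x_int <- 3L                  # the L suffix asks for an integer
x_lgl <- TRUE
x_chr <- "My dog has fleas"

mode(x_num)    # "numeric"
mode(x_int)    # "numeric"  (mode() does not separate integers from doubles)
class(x_int)   # "integer"
typeof(x_num)  # "double"
mode(x_lgl)    # "logical"
mode(x_chr)    # "character"
```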


Quick R basics (Continued)

• Some basic R structures (data classes)

  • Vector – a 1D array, which can only store elements of the same mode

  • The class of a simple vector is given as its mode:

> vv <- c(1.0, 2.0, 5.0)

> vv

[1] 1 2 5

> class(vv)

[1] "numeric"

> length(vv)

[1] 3

• “Scalar” variables are simply vectors of length 1:

> aa <- 3.55

> class(aa)

[1] "numeric"

> length(aa)

[1] 1


Quick R basics (Continued)

• Some basic R structures (classes) continued…

• Factor – a special type (class) of vector that represents a grouping – a categorical variable.

> ff <- factor(c("Med","Med","Hi","Lo","Lo","Lo"))

> ff

[1] Med Med Hi Lo Lo Lo

Levels: Hi Lo Med

> class(ff)

[1] "factor"
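A short sketch of what sits underneath a factor, using the same ff as above. The levels are sorted alphabetically unless told otherwise, and each element is stored as the position of its level. This ordering matters later in the talk, because the rows of a reshaped dataset come out in factor-level order.

```r
ff <- factor(c("Med", "Med", "Hi", "Lo", "Lo", "Lo"))

levels(ff)       # "Hi" "Lo" "Med"  (alphabetical by default)
as.integer(ff)   # 3 3 1 2 2 2      (position of each element's level)
table(ff)        # counts per level

# Supplying levels= fixes the order explicitly.
ff2 <- factor(ff, levels = c("Lo", "Med", "Hi"))
levels(ff2)      # "Lo" "Med" "Hi"
```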


Quick R basics (Continued)

• Some basic R structures (classes) continued…

  • Data frame – a “dataset” – a set of vectors of the same length but possibly different classes.

  • Can be regarded as a set of columns of the same length – each column represents a named variable, and each row a case or subject:

> data(iris) ##Famous dataset, comes with R

> head(iris) ##Print just the first few rows…

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

5 5.0 3.6 1.4 0.2 setosa

6 5.4 3.9 1.7 0.4 setosa

> sapply(iris,class) ##Show the class of each column in iris

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

"numeric" "numeric" "numeric" "numeric" "factor"


Quick R basics (Continued)

• Data frames, continued…

  • Data frames can also be read into R from a data file (scan(), read.table(), read.csv(), etc.)

• Or created using R code ( data.frame(), as.data.frame() )

• The melt and cast examples I will show you create new data frames from existing data frames, but conversions from/to other forms (lists, matrices) are possible.

• See also plyr…
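A minimal sketch of the two routes into a data frame mentioned above, with made-up values. To keep it self-contained, read.csv() is given its input through the text= argument; in practice a file path would go in its place.

```r
# Route 1: build the data frame directly in code.
df_code <- data.frame(
  Water  = factor(c("High", "Low")),
  Height = c(1.80, 1.13)
)

# Route 2: read the same table from CSV input.
csv_text <- "Water,Height
High,1.80
Low,1.13"
df_file <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Both routes give a factor column and a numeric column.
sapply(df_code, class)
sapply(df_file, class)
```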


Time series example – Want to get from this:

• A plant experiment:

  • 2 levels of watering (Low, High)

  • 3 soil types (A, B, C)

  • A response (plant height in cm?) measured at three times for each plant.


…to this :


Or, vice versa – from this :


…to this!


(…or maybe, to some other form…)


Reshape package paradigm:

• To the reshape package, a dataset consists of:

  (1) Columns containing “measurement” (numeric) variables – the measurements are the things we want to restructure/reshape;

…or…


Reshape package paradigm (continued):

• To reshape, a dataset also contains:

  (2) “ID” columns that index the measurements:


Reshape package paradigm (continued):

• Also, if a dataset contains more than one measurement variable, then:

• The names of the measurement variables also index the measurements:


Reshape package paradigm (continued):

• Note that in both of these forms of the dataset…

• …the shaded cells together uniquely identify (or index) each measurement.


Reshape package paradigm (continued):

• In the stacked form:

• …all three ID columns index the rows of the measurement layout


Reshape package paradigm (continued):

• In the unstacked form:

• …Water and Soil index the rows of the measurement layout, while the Times index the columns.

• Although “Time” as an ID variable does not appear explicitly in this form of the dataset, we can nonetheless imagine that a variable “Time” (with levels “Time1”, “Time2”, “Time3”) is indexing the columns of numbers…


Reshape package paradigm (continued):

• Specifying which ID variables (Water, Soil, Time) are to index the rows, and which are to index the columns, allows a variety of different re-arrangements of the data.

• How does reshape know which columns are to be regarded as measurement variables, and which are ID variables?

• Default behaviour: columns of class “numeric” or “integer” are assumed to be measurement variables;

• …columns of class “logical” appear to be regarded as measurement variables, and converted to 0 (FALSE), 1 (TRUE);

• … columns of class factor and character are ID variables.

• Complex numbers? Uncertain.…
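The default rule described above can be mimicked in base R to preview how a data frame's columns would be split. This is only a sketch of the rule as stated in the talk: guess_split() is a hypothetical helper, not a function from the reshape package, and the data are made up. The message that melt() prints ("Using … as id variables") remains the authoritative check.

```r
# Hypothetical helper: numeric, integer and logical columns are treated as
# measurement variables; everything else (factor, character) as ID variables.
guess_split <- function(df) {
  is_measure <- sapply(df, function(col) is.numeric(col) || is.logical(col))
  list(measure = names(df)[is_measure], id = names(df)[!is_measure])
}

demo <- data.frame(
  Water = factor(c("High", "Low")),
  Note  = c("a", "b"),
  Count = c(3L, 5L),
  Time1 = c(1.80, 1.13),
  stringsAsFactors = FALSE
)

split_guess <- guess_split(demo)
split_guess$measure   # "Count" "Time1"
split_guess$id        # "Water" "Note"
```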


Reshape package paradigm (continued):

• melt() arguments id.var= or measure.var= can be used to explicitly specify either set; the other set is then assumed to be everything else…

• …so, we can override melt()’s default choices if needed.

• But: Caution is needed if your measurement variable set includes non-numeric columns (character or factor columns)
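A typical reason to override the default is an identifier that happens to be stored as a number. The sketch below uses made-up data in which Plot is numeric, so the default rule would treat it as a measurement; naming the ID columns explicitly avoids that. It assumes the reshape package is installed.

```r
library(reshape)

# Made-up data: Plot is a label, not a measurement, but it is numeric.
plots <- data.frame(
  Plot  = c(101, 102, 103),
  Water = factor(c("High", "Low", "High")),
  Time1 = c(1.80, 1.13, 2.10),
  Time2 = c(2.35, 2.49, 2.99)
)

# Name the ID columns; all remaining columns become measurement variables.
mm_plots <- melt(plots, id.var = c("Plot", "Water"))
mm_plots
```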


Reshape package – melt() function

• The function melt() in the reshape package takes a data frame and converts it to a stacked form:

• All measurement variables are stacked into a single column (melt() names this column “value”)

• Original ID columns are duplicated and stacked as needed

• If needed, a new column (called “variable” by default) is created to store the names of the original measurement variables.

• Sometimes the melted form is exactly what you want.

• More generally, the melted form is an intermediate step – function cast() takes a previously melted dataset, and restructures it into the shape you want (essentially, a tabulation step)
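To make the stacking concrete, here is a hand-rolled base R version of the three steps listed above, applied to the values of the talk's plant example. It is a sketch of the idea only, not how melt() is implemented.

```r
# Wide form: one row per plant, one column per time.
wide <- data.frame(
  Water = factor(c("High", "Low", "High", "Low", "High", "Low")),
  Soil  = factor(c("A", "A", "B", "B", "C", "C")),
  Time1 = c(1.80, 1.13, 2.10, 1.64, 1.18, 1.64),
  Time2 = c(2.35, 2.49, 2.99, 3.24, 2.15, 2.81),
  Time3 = c(3.79, 3.74, 3.54, 3.72, 3.25, 3.12)
)
measure <- c("Time1", "Time2", "Time3")

# Stack by hand: repeat the ID columns once per measurement column,
# record which column each value came from, and concatenate the values.
long <- data.frame(
  Water    = rep(wide$Water, times = length(measure)),
  Soil     = rep(wide$Soil,  times = length(measure)),
  variable = factor(rep(measure, each = nrow(wide)), levels = measure),
  value    = unlist(wide[measure], use.names = FALSE)
)
head(long, 8)
```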


melt() - example

> my.dfr <- data.frame(

+ Water = factor(c("High","Low","High","Low","High","Low")),

+ Soil = factor(c("A", "A", "B", "B", "C", "C")),

+ Time1 = c(1.80, 1.13, 2.10, 1.64, 1.18, 1.64),

+ Time2 = c(2.35, 2.49, 2.99, 3.24, 2.15, 2.81),

+ Time3 = c(3.79, 3.74, 3.54, 3.72, 3.25, 3.12))

> my.dfr

Water Soil Time1 Time2 Time3

1 High A 1.80 2.35 3.79

2 Low A 1.13 2.49 3.74

3 High B 2.10 2.99 3.54

4 Low B 1.64 3.24 3.72

5 High C 1.18 2.15 3.25

6 Low C 1.64 2.81 3.12

> sapply(my.dfr,class) ##Print the class of each column

Water Soil Time1 Time2 Time3

"factor" "factor" "numeric" "numeric" "numeric"


melt() – example (continued)

> require(reshape) ## require() is similar to library()

Loading required package: reshape

Loading required package: plyr

> mm <- melt(my.dfr)

Using Water, Soil as id variables

> mm

Water Soil variable value

1 High A Time1 1.80

2 Low A Time1 1.13

3 High B Time1 2.10

4 Low B Time1 1.64

5 High C Time1 1.18

6 Low C Time1 1.64

7 High A Time2 2.35

8 Low A Time2 2.49

.

.

.

17 High C Time3 3.25

18 Low C Time3 3.12

> sapply(mm,class) ##Print the class of each column

Water Soil variable value

"factor" "factor" "factor" "numeric"


(The entire melted dataset…)

> mm ##Just to show what the entire melted dataset looks like…

Water Soil variable value

1 High A Time1 1.80

2 Low A Time1 1.13

3 High B Time1 2.10

4 Low B Time1 1.64

5 High C Time1 1.18

6 Low C Time1 1.64

7 High A Time2 2.35

8 Low A Time2 2.49

9 High B Time2 2.99

10 Low B Time2 3.24

11 High C Time2 2.15

12 Low C Time2 2.81

13 High A Time3 3.79


melt() – example (continued)

• Alternatively,

mm <- melt(my.dfr, id.var = c("Water","Soil"))

…or…

mm <- melt(my.dfr, id.var = 1:2) ##Column numbers

…or…

mm <- melt(my.dfr, measure.var = c("Time1", "Time2", "Time3"))

…or…

mm <- melt(my.dfr, measure.var = 3:5) ##Column numbers

…would all have achieved the same result.

• If all you want to do is stack the time data, then you


melt() – example (continued)

• BTW, column “variable” in the melted dataset can be renamed:

> names(mm)

[1] "Water" "Soil" "variable" "value"

> names(mm)[3] <- "Time"

> names(mm)

[1] "Water" "Soil" "Time" "value"

• … or you can request this in the melt() call:

> mm <- melt(my.dfr,variable="Time")

Using Water, Soil as id variables

> names(mm)

[1] "Water" "Soil" "Time" "value"

• Don’t try to rename the “value” column! (But see the cast() argument value=)


melt()

• Any questions so far?


cast()

• Function cast() takes a previously melt()’ed dataset, and reshapes it according to your specifications.

• The new layout is given via the R formula notation:

  • row_ID1 + row_ID2 + … ~ col_ID1 + col_ID2 + …

• Example – convert back to the original form:

> cc1 <- cast(mm,formula = Water + Soil ~ Time)

> cc1

Water Soil Time1 Time2 Time3

1 High A 1.80 2.35 3.79

2 High B 2.10 2.99 3.54

3 High C 1.18 2.15 3.25

4 Low A 1.13 2.49 3.74

5 Low B 1.64 3.24 3.72

6 Low C 1.64 2.81 3.12


Casting examples

• Not quite the same as the original – melt() and cast() have sorted the rows by the order of the factor levels, but otherwise the dataset is the same.

• What about some different arrangements? – Let’s change the order of Soil and Water in the formula:

> cc2 <- cast(mm, formula = Soil + Water ~ Time)

> cc2

Soil Water Time1 Time2 Time3

1 A High 1.80 2.35 3.79

2 A Low 1.13 2.49 3.74

3 B High 2.10 2.99 3.54

4 B Low 1.64 3.24 3.72

5 C High 1.18 2.15 3.25

6 C Low 1.64 2.81 3.12


Casting examples (continued)

• Now let’s use Water and Time to index the rows, Soil to index the columns:

> cc3 <- cast(mm, formula = Water + Time ~ Soil)

> cc3

Water Time A B C

1 High Time1 1.80 2.10 1.18

2 High Time2 2.35 2.99 2.15

3 High Time3 3.79 3.54 3.25

4 Low Time1 1.13 1.64 1.64

5 Low Time2 2.49 3.24 2.81

6 Low Time3 3.74 3.72 3.12


Casting examples (continued)

• …etc:

> cc4 <- cast(mm, formula = Soil + Time ~ Water)

> cc4

Soil Time High Low

1 A Time1 1.80 1.13

2 A Time2 2.35 2.49

3 A Time3 3.79 3.74

4 B Time1 2.10 1.64

5 B Time2 2.99 3.24

6 B Time3 3.54 3.72

7 C Time1 1.18 1.64

8 C Time2 2.15 2.81

9 C Time3 3.25 3.12


Casting examples (continued)

• What if we index the columns by two ID variables instead of one?

> cc5 <- cast(mm, formula = Time ~ Soil + Water)

> cc5

Time A_High A_Low B_High B_Low C_High C_Low

1 Time1 1.80 1.13 2.10 1.64 1.18 1.64

2 Time2 2.35 2.49 2.99 3.24 2.15 2.81

3 Time3 3.79 3.74 3.54 3.72 3.25 3.12

• I.e., we cannot have multiple sets of column names, so a single set is constructed by concatenating the Soil and Water labels.

• (You might want to change these names to be more meaningful to you?)

• Note: Here I have effectively “transposed” the dataset layout.


cast()

• Any questions?


Aggregation

• We might wish to produce a new dataset where the data are (say) averaged over the times; i.e., replace:

…with…


Aggregation & examples

• This is essentially a summary table, but in the context of producing a data frame from a data frame, we often use the term “aggregating the dataset”.

• Given the previous melt()’ed dataset, getting the aggregated dataset is easy:

> cc6 <- cast(mm, formula = Soil + Water ~ . , fun = mean)

> cc6

Soil Water (all)

1 A High 2.646667

2 A Low 2.453333

3 B High 2.876667

4 B Low 2.866667

5 C High 2.193333

6 C Low 2.523333

(In the formula, “.” means “no variable”.)


Aggregation – what’s going on?

• The formula “Soil + Water ~. ” means, “index the rows by Water and Soil, and index the columns with nothing” (i.e., just have a single column in the new dataset).

• Without Time indexing either the rows or the columns, the Soil and Water factors index groups of numbers (3 numbers in each group).

• In this case, cast() assumes that a summary statistic of some kind is to be applied, to reduce each group of numbers to a single number, which is presented in the resulting data frame.

• Argument fun.aggregate= (or fun= for short) can be used to specify the summary statistic function (in this case, mean() )
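As a cross-check on what the aggregation computes, the same group means can be reproduced with base R's aggregate(). The sketch below rebuilds the stacked data from the talk's example values and is meant only to illustrate the grouping, not to replace cast().

```r
# Stacked data with the talk's example values (18 rows).
long <- data.frame(
  Water = factor(rep(c("High", "Low"), times = 9)),
  Soil  = factor(rep(rep(c("A", "B", "C"), each = 2), times = 3)),
  Time  = factor(rep(c("Time1", "Time2", "Time3"), each = 6)),
  value = c(1.80, 1.13, 2.10, 1.64, 1.18, 1.64,
            2.35, 2.49, 2.99, 3.24, 2.15, 2.81,
            3.79, 3.74, 3.54, 3.72, 3.25, 3.12)
)

# One mean per Soil x Water group; Time is left out of the formula,
# so the three times in each group are averaged together.
agg <- aggregate(value ~ Soil + Water, data = long, FUN = mean)
agg
```

Each group mean is the average of three heights, for example (1.80 + 2.35 + 3.79) / 3 for Soil A with High watering.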


Aggregation examples

• Here’s the same output, but in a different layout:

> cc7 <- cast(mm, formula = Soil ~ Water , fun = mean)

> cc7

Soil High Low

1 A 2.646667 2.453333

2 B 2.876667 2.866667

3 C 2.193333 2.523333

• Again, Time is missing, so averages over the times are again produced, but this time the rows are indexed by Soil, and the columns by Water, according to the formula= argument.


Aggregation examples

• …etc:

> cast(mm, formula = Water ~ . , fun=mean)

Water (all)

1 High 2.572222

2 Low 2.614444

> cast(mm, formula = . ~ Water , fun=mean)

value High Low

1 (all) 2.572222 2.614444

> cast(mm, formula = . ~ . , fun=mean)

value (all)

1 (all) 2.593333


Aside:

• Note that cast()’s use of “.” is different from the convention normally assumed in statistical model formulae in R.

• In cast(), “.” in the formula means “no variable”.

• In statistical model formulas (for example, when fitting a model using lm(), say), “.” means “all columns in the dataset that are not already present in the formula”. In update() calls, “.” means “what was previously in this part of this formula”.

• In a cast() formula, “...” is used to mean “all columns in the dataset that are not already present in the formula” (equivalent to “.” in a statistical model formula!)
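For comparison, a short base R sketch of the two model-formula meanings of “.”, using the iris data that ships with R. The object names fit_all and fit_small are just illustrative.

```r
# "." on the right-hand side: every column of iris except the response.
fit_all <- lm(Sepal.Length ~ ., data = iris)
attr(terms(fit_all), "term.labels")

# In update(), "." stands for the corresponding part of the old formula;
# here the response is kept and Species is dropped from the predictors.
fit_small <- update(fit_all, . ~ . - Species)
attr(terms(fit_small), "term.labels")
```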


Aggregation – more than one function:

• Example – more than one summary function:

> cc8 <- cast(mm, formula = Soil ~ Water,fun=list(mean,sd))

> cc8

Soil High_mean High_sd Low_mean Low_sd

1 A 2.646667 1.027635 2.453333 1.305386

2 B 2.876667 0.726659 2.866667 1.089097

3 C 2.193333 1.035680 2.523333 0.780534


Aggregation – your own function:

• Example – use your own function:

> my.func <- function(vv) {

+ rres <- c(median(vv),mean(vv))

+ names(rres) <- c("Median","Mean")

+ return(rres)

+ }

> cc9 <- cast(mm, formula = Soil ~ Water , fun=my.func)

> cc9

Soil High_Median High_Mean Low_Median Low_Mean

1 A 2.35 2.646667 2.49 2.453333

2 B 2.99 2.876667 3.24 2.866667

3 C 2.15 2.193333 2.81 2.523333

• For R experts – you can also use an “anonymous” function…
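A sketch of the anonymous-function route, assuming the reshape package is installed. The data are the talk's example values; the summary chosen here (the range, max minus min, of the three times in each group) is just an illustration.

```r
library(reshape)

my.dfr <- data.frame(
  Water = factor(c("High", "Low", "High", "Low", "High", "Low")),
  Soil  = factor(c("A", "A", "B", "B", "C", "C")),
  Time1 = c(1.80, 1.13, 2.10, 1.64, 1.18, 1.64),
  Time2 = c(2.35, 2.49, 2.99, 3.24, 2.15, 2.81),
  Time3 = c(3.79, 3.74, 3.54, 3.72, 3.25, 3.12)
)
mm <- melt(my.dfr, id.var = c("Water", "Soil"))

# The function is written inline instead of being defined and named first.
rng <- cast(mm, Soil ~ Water, fun.aggregate = function(v) max(v) - min(v))
rng
```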


Aggregation

• Any questions?


What else?

• In aggregation, margins to the table can be produced (see the margins= argument to cast() ).

• “length” is the default function used in aggregation.

• If a data frame is already “stacked”, then you can use cast() (with argument value= ) directly, without first calling melt().

• Function recast() can do both melt() and cast() operations in one go (don’t try this until you are used to melt() and cast() ).

• There is a subset= argument to cast(), so you can subset before you reshape/aggregate.

• NA’s can be handled in sensible ways (see my Postscript)
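A sketch of three of the facilities listed above (recast(), margins= and subset=), again with the talk's example values and assuming the reshape package is installed. The object names are illustrative; the package documentation is the place to confirm the details of each argument.

```r
library(reshape)

my.dfr <- data.frame(
  Water = factor(c("High", "Low", "High", "Low", "High", "Low")),
  Soil  = factor(c("A", "A", "B", "B", "C", "C")),
  Time1 = c(1.80, 1.13, 2.10, 1.64, 1.18, 1.64),
  Time2 = c(2.35, 2.49, 2.99, 3.24, 2.15, 2.81),
  Time3 = c(3.79, 3.74, 3.54, 3.72, 3.25, 3.12)
)

# recast(): melt and cast in one call; id.var is handed on to melt().
rc <- recast(my.dfr, Soil ~ Water, id.var = c("Water", "Soil"),
             fun.aggregate = mean)

# The two-step route, with margins added to the aggregated table...
mm <- melt(my.dfr, id.var = c("Water", "Soil"))
with_margins <- cast(mm, Soil ~ Water, fun.aggregate = mean, margins = TRUE)

# ...and with the data subset to a single time before reshaping.
time1_only <- cast(mm, Soil ~ Water, subset = variable == "Time1")

rc
with_margins
time1_only
```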


Tips, Caveats

• Tip: Experiment with melt() and cast()! Get a feel for how these work…

• Tip: Check the documentation! More details, facilities, tricks…

• Caveat: Some of the published descriptions of reshape are a bit out of date – some minor differences (notably, integers are now measurement, formerly id…)

• Tip: Main paper/vignette: http://www.jstatsoft.org/v21/i12

• Tip: Main website: http://had.co.nz/reshape/

• Tip: Mailing list: http://groups.google.com.au/group/manipulatr?hl=en


Tips, Caveats (continued)

• Tip: After melting, look at the melted dataset:

  • Check that melt has used the right ID/Measurement split

  • Figure out how you want to cast() the data.

• Assuming mm is the melted dataset, then

  • head(mm) ## The first few rows of mm, or…

  • names(mm) ## Just the column names of mm

  • mm$variable ## What’s in the “variable” column? (if not renamed…), or…

  • unique(mm$variable) ## If the melted dataset is very long…

• Caveat: Don’t forget – the reshape package is cast() and melt(). The reshape() function in base R is something else!


Tips, Caveats (continued)

• Caveat: Note the assumption made in this talk – the columns to be re-organised (“measurement” variables) are assumed numeric (or integer) only.

• This has been relaxed in the newer versions of reshape, but there are practical complications:

• Measurement variables must be converted to the same class by the melting process.

• So, measurement variables that are mixed numeric and factor (or character) are converted to character by melt, then to factor by cast(). (Due to R technicalities, according to Hadley)

• Reshaping mixed classes is best avoided unless you know what you are doing…
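The underlying constraint can be seen in base R alone: a single vector (and so a single “value” column) can hold only one class. The sketch below shows plain R coercion with made-up values; it illustrates the kind of conversion involved, not melt()'s internal code.

```r
heights <- c(1.80, 1.13)              # numeric
grades  <- factor(c("good", "poor"))  # factor

# Stacking both into one vector forces everything to character.
stacked <- c(heights, as.character(grades))
class(stacked)   # "character"
stacked          # "1.8" "1.13" "good" "poor"

# The numbers survive only as text and must be converted back to be used.
as.numeric(stacked[1:2])
```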


Tips, Caveats (continued)

• Caveat: I’ve assumed that the id variables in the original melted table uniquely index each measurement. What if this is not true?

• melt() will still work fine, but:

• the id columns in the melted dataframe will no longer uniquely index each measurement, so cast() will always aggregate!
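A sketch of this situation with made-up data: two plants share each Water level, so Water alone does not identify a single measurement. With no function supplied, cast() falls back to counting the values in each cell (length is the default); supplying a function makes the aggregation explicit. It assumes the reshape package is installed.

```r
library(reshape)

# Two replicate plants per Water level, and no column to tell them apart.
reps <- data.frame(
  Water = factor(c("High", "High", "Low", "Low")),
  Time1 = c(1.80, 2.10, 1.13, 1.64),
  Time2 = c(2.35, 2.99, 2.49, 3.24)
)
mm_reps <- melt(reps, id.var = "Water")

# No aggregation function given: each cell holds a count, not a height.
counts <- cast(mm_reps, Water ~ variable)

# Stating the function gives a meaningful summary instead.
means <- cast(mm_reps, Water ~ variable, fun.aggregate = mean)

counts
means
```

Adding a replicate identifier (for example a hypothetical Plant column) to the ID variables would restore unique indexing and avoid the aggregation altogether.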

• Caveat: The data frames produced by melt() and cast() are not ordinary data frames, and may sometimes conflict with commands that should work with a data frame.

• If this problem occurs, and cc (say) is the offending data frame produced by cast() (or melt() ), then

cc1 <- as.data.frame(cc) ##use cc1 instead

…should fix the problem.


Postscript – handling NA’s (missing values)

> is.na(my.dfr$Time1[2]) <- TRUE ##Set 2nd element of Time1 to NA

> is.na(my.dfr$Time2[5]) <- TRUE ##Set 5th element of Time2 to NA

> is.na(my.dfr$Time3[5]) <- TRUE ##Set 5th element of Time3 to NA

> my.dfr

Water Soil Time1 Time2 Time3

1 High A 1.80 2.35 3.79

2 Low A NA 2.49 3.74

3 High B 2.10 2.99 3.54

4 Low B 1.64 3.24 3.72

5 High C 1.18 NA NA

6 Low C 1.64 2.81 3.12


Melting, keeping the NA’s

> mm.with.NAs <- melt(my.dfr, variable="Time")

Using Water, Soil as id variables

> mm.with.NAs ##NA's are preserved in the melted dataset:

Water Soil Time value

1 High A Time1 1.80

2 Low A Time1 NA

3 High B Time1 2.10

4 Low B Time1 1.64

5 High C Time1 1.18

6 Low C Time1 1.64

7 High A Time2 2.35

8 Low A Time2 2.49

9 High B Time2 2.99

10 Low B Time2 3.24

11 High C Time2 NA

12 Low C Time2 2.81

13 High A Time3 3.79

14 Low A Time3 3.74

15 High B Time3 3.54

16 Low B Time3 3.72

17 High C Time3 NA

18 Low C Time3 3.12


Melting, dropping the NA’s

> mm.without.NAs <- melt(my.dfr, na.rm=TRUE, variable="Time")

Using Water, Soil as id variables

> ##Or could use 'preserve.na=FALSE' as an alternative...

> mm.without.NAs ## NA's are dropped from the melted dataset:

Water Soil Time value

1 High A Time1 1.80

2 High B Time1 2.10

3 Low B Time1 1.64

4 High C Time1 1.18

5 Low C Time1 1.64

6 High A Time2 2.35

7 Low A Time2 2.49

8 High B Time2 2.99

9 Low B Time2 3.24

10 Low C Time2 2.81

11 High A Time3 3.79

12 Low A Time3 3.74

13 High B Time3 3.54

14 Low B Time3 3.72

15 Low C Time3 3.12


Casting (reshaping) from the no-NA melt()

> cast(mm.without.NAs, formula = Water + Soil ~ Time)

Water Soil Time1 Time2 Time3

1 High A 1.80 2.35 3.79

2 High B 2.10 2.99 3.54

3 High C 1.18 NA NA

4 Low A NA 2.49 3.74

5 Low B 1.64 3.24 3.72

6 Low C 1.64 2.81 3.12

> ##NA's are re-introduced as needed!


Casting (aggregating) from the no-NA melt()

> cast(mm.without.NAs, formula=Water+Soil~., fun=mean)

Water Soil (all)

1 High A 2.646667

2 High B 2.876667

3 High C 1.180000

4 Low A 3.115000

5 Low B 2.866667

6 Low C 2.523333

> ## No NA's in the data, so mean() is applied to the

> ## available data in each group - no problems...

• Low A (3.115000) is the mean of two values

• High C (1.180000) is the mean of one value


Aggregating from the with-NA melt() - Problem

> cast(mm.with.NAs, formula=Water+Soil~., fun=mean)

Water Soil (all)

1 High A 2.646667

2 High B 2.876667

3 High C NA

4 Low A NA

5 Low B 2.866667

6 Low C 2.523333

> ## NA's are present in the data, and the default behaviour

> ## of mean() is to give an NA result when an NA is

> ## present in the data being averaged…


Aggregating from the with-NA melt() - Solution

> cast(mm.with.NAs, formula=Water+Soil~., fun=mean, na.rm=TRUE)

Water Soil (all)

1 High A 2.646667

2 High B 2.876667

3 High C 1.180000

4 Low A 3.115000

5 Low B 2.866667

6 Low C 2.523333

> ## The ‘na.rm=TRUE’ option is not a cast() option.

> ## In fact, it is passed through to the mean()

> ## function…
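The pass-through works via R's "..." mechanism: arguments a function does not recognise itself can be forwarded to the function it calls. The base R sketch below shows the idea with a hypothetical stand-in, summarise_groups(), which is not part of reshape; the values are the Soil C rows of the NA example.

```r
# Hypothetical stand-in for an aggregating function: it applies `fun` to each
# group and forwards any extra named arguments through "..." unchanged.
summarise_groups <- function(values, groups, fun, ...) {
  sapply(split(values, groups), fun, ...)
}

heights    <- c(1.18, NA, NA, 1.64, 2.81, 3.12)
soil_water <- c("C_High", "C_High", "C_High", "C_Low", "C_Low", "C_Low")

# Without na.rm the group containing NA's gives NA...
res_default <- summarise_groups(heights, soil_water, mean)

# ...while na.rm = TRUE travels through "..." to mean().
res_na_rm <- summarise_groups(heights, soil_water, mean, na.rm = TRUE)

res_default
res_na_rm
```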


Aggregating from the with-NA melt() – Another solution…

> my.mean <- function(vv){mean(vv,na.rm=TRUE)}

> cast(mm.with.NAs, formula=Water+Soil~., fun=my.mean)

Water Soil (all)

1 High A 2.646667

2 High B 2.876667

3 High C 1.180000

4 Low A 3.115000

5 Low B 2.866667

6 Low C 2.523333


Contact Us

Phone: 1300 363 400 or +61 3 9545 2176

Email: [email protected]

Web: www.csiro.au

Thank you

CSIRO CMIS

Alec Zwart

Phone: 02 6216 7010

Email: [email protected]