Using the “R” Actor in Kepler for quality control


John Porter, University of Virginia, [email protected]

R Basics

R is an open source statistical language

“Atomic” types: logical, integer, real, complex, string (or character) and raw

Data in R is stored in one of several types of objects

Scalar: myVar <- 10
Vectors: myVec <- c(10,20,30)
Lists: myList <- c(10,"E",12.3)
Matrix: myMat <- cbind(myVec1,myVec2)
Data Frames: myDf <- data.frame(myVec,myList)
Factors: myFac <- as.factor(myList)

Note: c() coerces mixed values to a single type, so myList here is really a character vector; list(10,"E",12.3) builds a true list
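The object types listed above can be inspected with class(). This is a minimal sketch; the names myMixed and myVec2 are made up for illustration.

```r
# One object of each kind, then ask R what it thinks each one is
myVar   <- 10                       # a "scalar" is a vector of length 1
myVec   <- c(10, 20, 30)            # numeric vector
myMixed <- c(10, "E", 12.3)         # c() coerces mixed values to character
myList  <- list(10, "E", 12.3)      # list() keeps each element's own type
myVec2  <- myVec * 2
myMat   <- cbind(myVec, myVec2)     # 3 x 2 numeric matrix
myDf    <- data.frame(myVec, myMixed)
myFac   <- as.factor(myMixed)

print(class(myVar))
print(class(myMixed))
print(class(myList))
print(dim(myMat))
print(class(myDf))
print(levels(myFac))
```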

R Workspaces

All the variables and functions defined during a session are part of the “Workspace”

R Workspaces can be saved for later use

When you come back, everything is the same as when the workspace was saved

Most Commonly Used Object Types

Vectors – contain a single column of one of the “atomic” types

Often created using the concatenate function: myVec <- c(10,20,30)

Individual elements can be accessed using indexes: myVec[2] is 20

Data Frames

Data Frames – table-style objects that contain named vectors inside them

myDF$RAIN refers to the “RAIN” vector, as does myDF[ ,2]

myDF[135,3] is 121.8
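The access styles above can be tried on a small made-up table. The three rows below are invented values, not the Fort Monroe data.

```r
# A toy data frame with the same column layout as the rainfall example
myDF <- data.frame(
  YEAR = c(2001, 2002, 2003),
  RAIN = c(40.1, 38.5, 47.9)
)
myDF$RAIN_CM <- myDF$RAIN * 2.54    # add a third column

byName     <- myDF$RAIN             # column selected by name
byPosition <- myDF[, 2]             # the same column selected by position
oneCell    <- myDF[2, 3]            # row 2, column 3 (RAIN_CM for 2002)

print(byName)
print(byPosition)
print(oneCell)
```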

Reading Data into Data Frames

A common way of creating data frames is to read in a comma-separated-value (csv) file

myDf <- read.csv("C:/ft_monro.csv",header=TRUE)

read.csv

Note, regardless of operating system, R wants “/” – not “\”
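To try read.csv() without the exercise files, a small CSV can be written to a temporary file first. This is only a sketch: the file contents are invented, and the column names are assumed to match the rainfall file used later.

```r
# Write a tiny CSV to a temporary location, then read it back
csvPath <- tempfile(fileext = ".csv")
writeLines(
  c("YEAR,RAIN,RAIN_CM,NOTES",
    "2001,40.1,101.85,",
    "2002,38.5,97.79,estimated"),
  csvPath
)

myDf <- read.csv(csvPath, header = TRUE)   # first line supplies column names
str(myDf)                                  # show the type of each column
```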

Sample R Program for QA/QC

# Select the data file
infile1 <- file("C:/downloads/ft_monroe.csv", open="r")
# Read the data
dataTable1 <- read.csv(infile1, skip=1, sep=",", quote='"',
    col.names=c("YEAR", "RAIN", "RAIN_CM", "NOTES"),
    check.names=TRUE)

attach(dataTable1)

# Run basic summary statistics
summary(as.factor(NOTES))
summary(as.numeric(YEAR))
summary(as.numeric(RAIN))
summary(as.numeric(RAIN_CM))

Quick Exercise – Run these in R

# anything after a # sign on a line is just a COMMENT - it won't do anything
varA <- 10            # sets up a vector with one element containing a 10
varA                  # listing an object's name prints out the values
varB <- c(10,20,30)   # sets up a vector with 3 elements. c() is the concatenation function
varB
varB[2]               # now let's display ONLY the second element

# now let's do some math!
mySumAB <- varA + varB   # adding them together. Note there is only 1 value in varA
mySumAB                  # note the single value in varA repeated in the addition

R Data Structures

A lot of the “magic” in R is because of the object-oriented approach used

R objects contain a lot more than just the data values

A command that does one thing to a scalar (single value) does something else with a vector (a list of values) – all because R functions “understand” the difference!
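A short illustration of the point above, using made-up values: the same function call adapts to what it is given.

```r
single  <- 9
several <- c(9, 16, 25)

rootOne  <- sqrt(single)     # one value in, one value out
rootMany <- sqrt(several)    # three values in, three values out

# summary() also changes behaviour with the type of object
numSummary <- summary(several)                          # min, quartiles, mean, max
facSummary <- summary(as.factor(c("A", "B", "A")))      # counts per level

print(rootOne)
print(rootMany)
print(numSummary)
print(facSummary)
```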

Conversions

Conversions are possible between different modes or types of objects using conversion functions

as.numeric(varA) returns varA as a number – if it can!

as.integer(), as.character(), as.factor(), as.matrix(), as.data.frame()
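What “if it can” means in practice, as a small sketch with invented values: text that does not look like a number converts to NA (with a warning, suppressed here).

```r
varA <- "10"                                   # a character string
numA <- as.numeric(varA)                       # converts cleanly
bad  <- suppressWarnings(as.numeric("ten"))    # cannot be converted: gives NA
whole <- as.integer(12.9)                      # drops the fractional part
txt   <- as.character(20)                      # number back to text

print(numA)
print(bad)
print(whole)
print(txt)
```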

Using Data Frames

A <- c(10,20,30)
B <- c(4,6,3)
C <- c('A','B','C')    # put letters in quotes
Df <- data.frame(C,A,B)
Df                     # list whole data frame
Df$A                   # list the A vector
Df[,3]                 # list the 3rd vector (B)
Df[1,]                 # list all columns for row 1
Df[Df$A > 10,]         # list rows where A>10

Data Frames

Results of Data Frame manipulations

R Help

R has a number of ways of calling up help

??sqrt – does a “fuzzy” search for functions like “sqrt”

?sqrt – does an exact search for the function sqrt() and displays documentation

There are also manuals and extensive on-line tutorials (but Google is frequently the best way to find help)

R & Kepler

Kepler uses the “RExpression Actor” to run R code from inside Kepler

Typically run with an SDF Director with a single iteration

For most analyses you only need them done once!

Don’t forget to set the iteration count – the default is to loop forever!

The default RExpression has no inputs and two outputs: graphicsFileName & output

Typical connections for basic RExpression Actor

Adding Ports

To make RExpression actors really useful, they need to communicate with other Kepler actors beyond simply listing output or showing graphs

To allow this communication we need to add additional Input and Output ports

The names of the ports will automatically be connected to objects with the same name in the R program

Hook up some input and output actors

R Program to Test

Remember – names of ports translate into names of objects in R
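A minimal sketch of what such a test program might look like. The input port name myInValue is hypothetical; myOutValue is the output port name. Inside Kepler the first assignment would be unnecessary, because the input port would create the object.

```r
# Outside Kepler, stand in for the value an input port named myInValue
# would supply (hypothetical port name and value)
myInValue <- 21

# An object with the same name as an output port is what Kepler sends on
myOutValue <- myInValue * 2

# Listing the object also puts it in the R listing output
myOutValue
```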

Results of Running Workflow

R Listing Output

“myOutValue” displayed

R for Checking EML Data

But there are some TRICKS you should know!

Trick 1 – select the right object type for the EML actor

By default the EML actor sends only the FIRST LINE OF DATA to its output ports (“As Field”).

If you want to have an output port represent the data as a VECTOR you need to select “As Column Vector”

If you want to get a Data Frame instead of individual columns, you need to select “As ColumnBased Record”

Setting Data Output Format in EML actor

Trick 2 – Trap R errors

Normally if there is a problem with your R program you get a cryptic message from Kepler

try() and geterrmessage() in R

Runs the “errorplot()”* function and reports any error messages that occur when you run it

* There is no “errorplot()” function in R
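The pattern described above can be sketched as follows. The exact code on the slide is not reproduced here; this version assumes silent = TRUE so that the message is printed only once.

```r
# errorplot() does not exist, so this call fails;
# try() catches the failure instead of stopping the script
result <- try(errorplot(), silent = TRUE)

if (inherits(result, "try-error")) {
  msg <- geterrmessage()     # text of the most recent error
  print(msg)                 # appears in the R listing output
}
```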

Now we get an informative message

Correct the command and see the output

QA/QC – Quality Assurance and Quality Control

Error types

Errors of Commission – data contains wrong values

Errors of Omission – data that should be there is missing

We will mostly be talking today about errors of commission

Porter’s Rule of Data Quality

There is no non-trivial dataset that does not contain some errors

Goal of QA/QC: reduce errors as far as possible, or at least to the level that they don’t adversely affect the conclusions reached through analysis of the data

QA/QC – Possible Tests

Identification and removal of duplicates

Correct Domain

  Numerical Range (e.g., -20 < Temperature < 50)

  Correct Codes (e.g., HOGI, not HOG1)

Graphs

  Time-series plots

  Plots between variables

Detections of “spikes” in time series

Customized criteria (e.g., month specific range checks)
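Several of the tests listed above can be sketched on a small invented table. The column names, the spike threshold of 30 and the January limit of 15 are assumptions chosen only for illustration.

```r
obs <- data.frame(
  MONTH   = c(1, 1, 7, 7, 7),
  TEMP    = c(-5, 62, 24, 25, 24),
  SPECIES = c("HOGI", "HOG1", "HOGI", "HOGI", "HOGI"),
  stringsAsFactors = FALSE
)

# Numerical range: keep rows that FAIL -20 < TEMP < 50
badRange <- obs[!(obs$TEMP > -20 & obs$TEMP < 50), ]

# Correct codes: keep rows whose code is not in the valid list
validCodes <- c("HOGI")
badCodes <- obs[!(obs$SPECIES %in% validCodes), ]

# Spikes: absolute change from the previous observation above a threshold
jump <- c(0, abs(diff(obs$TEMP)))
spikes <- obs[jump > 30, ]     # flags both the jump up and the jump back down

# Month-specific range check: a stricter upper limit for January
badJanuary <- obs[obs$MONTH == 1 & obs$TEMP > 15, ]

print(badRange)
print(badCodes)
print(spikes)
print(badJanuary)
```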

Exercise – A succession of workflows for QA

Open your Virtual Machine

Open a Web Browser and go to: http://tinyurl.com/7po5ffb

Open the LocalData.zip file

Extract All Files to directory C:\

You should then have a C:\localData directory containing the files for this exercise


1_Ft_Monroe_simple_summary.kar

A dead-simple workflow

Kepler Stuff to Note

Annotations allow you to add titles and other useful instructions to your workflow display

Kepler Stuff to Note

Parameters let you easily show and change values that will be used elsewhere in the workflow

Kepler Parameters

Customize Name lets you set the NAME of the parameter and what should display on the screen

Remember the name – that is how you will refer to the parameter later.

Using a Parameter Value

Add a $ to the front of a parameter name in a Kepler settings box to insert the value of the parameter – so here the “Data File:” setting becomes c:/localData/ft_monro.csv

Brief Exercise

Experiment with editing connections in this workflow to display different graphs

Then open the 3_ft_monro_badData.kar workflow – it has a corrupted version of this data

R stuff to Note

This workflow uses both a Data Frame (table) and vectors (single columns)

In the dataFrame you can subset lines using: dataFrame[(dataFrame$RAIN < 0), ]

Be sure to put the trailing comma!

dataFrame$RAIN < 0 generates a logical vector of TRUE and FALSE values – one for each line
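The logical vector can be looked at on its own before it is used for subsetting. The four rows below are invented for illustration.

```r
dataFrame <- data.frame(
  YEAR = 2001:2004,
  RAIN = c(40.1, -9, 38.5, 47.9)
)

isNegative <- dataFrame$RAIN < 0     # one TRUE/FALSE per line
print(isNegative)

# The logical vector goes BEFORE the comma, so it selects rows;
# leaving the column position empty keeps every column
negRows <- dataFrame[isNegative, ]
print(negRows)
```

Without the trailing comma, R treats the index as a selection of columns rather than rows.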

QA/QC in R

summary(dataFrame)

print("Here are Duplicated Data Lines")
dataFrame[duplicated(dataFrame),]

print("now list out of range checks")
dataFrame[(dataFrame$RAIN < 0 | dataFrame$RAIN_CM < 0),]
dataFrame[(dataFrame$RAIN > 150 | dataFrame$RAIN_CM > 300),]
print("now list unit conversion errors")
dataFrame[(abs((dataFrame$RAIN*2.54)- dataFrame$RAIN_CM)>0.1),]
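To see what each check returns without opening the workflow, the same tests can be run on a small invented table. It contains one duplicated line, one negative value and one bad unit conversion, all planted deliberately.

```r
dataFrame <- data.frame(
  YEAR    = c(2001, 2002, 2002, 2003, 2004),
  RAIN    = c(40.1, 38.5, 38.5, -9, 50),
  RAIN_CM = c(101.85, 97.79, 97.79, -22.86, 12.7)   # last value should be 127
)

# Lines that repeat an earlier line
dupLines <- dataFrame[duplicated(dataFrame), ]

# Out-of-range values (negative rainfall)
negLines <- dataFrame[(dataFrame$RAIN < 0 | dataFrame$RAIN_CM < 0), ]

# Inches and centimetres that disagree by more than 0.1 cm
convErrors <- dataFrame[(abs((dataFrame$RAIN * 2.54) - dataFrame$RAIN_CM) > 0.1), ]

print(dupLines)
print(negLines)
print(convErrors)
```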

Examine the workflow on the bad data and change it!

Try setting different values for the range checks

Try different graphs (as you did for the good data)

Try listing all the data that was NOT duplicated (note in R the “not” operator is “!”)

Use R help and Google as needed

R+Kepler vs. R Alone

Given that “R” runs just fine alone, why use Kepler?

Allows use of OTHER Kepler actors, Data Turbine

E.g., EMLData, editors, graphical tools

Allows code to be segmented for easier editing in the future

Reusability – ability to copy and paste parts of Kepler workflows

Use spatial arrangement to help guide the user

Downsides

Complicates debugging

A more complex and general workflow

4_BasicEMLQA

Workflow Steps

Read an EML metadata file

Convert it using an XSLT stylesheet into an R program

Edit the R program to point to the data

Ingest the data into a data frame

Summarize the data

“Tweak” the data to add a date-time vector for time plots and fix some conversion problems and re-summarize the data

Run some plots

Passing R Workspaces

This workflow, instead of passing data from actor-to-actor, passes the name of the R Workspace

Subsequent actors re-open the R Workspace without needing to ingest the data again

This is very efficient, but this method only works for connecting R actors

R code for passing on R workspaces

Set Port Variable to the name of the workflow

Remember to save the workspace!

Saving workspace for later use

Loading the Saved Workspace

Name of Port connected to WorkingDir port (above)
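The save-then-load idea can be sketched in plain R as below. This is not the workflow's actual code: the file name qa_workspace.RData and the variable workspaceFile are hypothetical, and save() with a named object is used here where the workflow may use save.image() to store everything.

```r
# "Upstream" step: ingest data once, then save it to a workspace file
workspaceFile <- file.path(tempdir(), "qa_workspace.RData")   # hypothetical name
dataTable1 <- data.frame(YEAR = 2001:2003, RAIN = c(40.1, 38.5, 47.9))
save(dataTable1, file = workspaceFile)    # save.image(workspaceFile) would save all objects

# Only the file NAME needs to be passed on to the next step
rm(dataTable1)

# "Downstream" step: re-open the saved objects instead of re-reading the data
load(workspaceFile)
print(summary(dataTable1$RAIN))
```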

A conversion problem

Temperature and Humidity values have some severe problems reading in!

What happened?

R Factors

Factors are the way R deals with categorical or nominal data (typically non-numeric data)

Internally Factors are made up of two vectors:

Values – the distinct values stored in the factor – often referred to as “levels”

Indexes – an integer vector recording, for each data element, which of the levels it holds

DANGER – sometimes when you read in data from a file, errors or odd characteristics of the data will cause R to read a column of (mostly) numbers as a Factor instead of as a numeric vector!

Factors

This is the mean of the INDEXES, not the VALUES/Levels
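The trap can be reproduced with a few invented readings, one of which is text. Converting the factor directly gives the indexes; converting through as.character() first recovers the values.

```r
# One stray text entry is enough to make a column non-numeric
raw <- c("21.5", "22.0", "missing", "35.1")
asFactor <- as.factor(raw)

wrong <- as.numeric(asFactor)                  # the integer INDEXES (1 to 4)
right <- suppressWarnings(
  as.numeric(as.character(asFactor))           # the VALUES; "missing" becomes NA
)

print(mean(wrong))                   # 2.5, the mean of the indexes
print(mean(right, na.rm = TRUE))     # mean of the three real readings
```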

After conversion data ranges are much better!

But Max_T is still suspicious!

Your Final Challenge

As its name suggests, this data file has some corrupted data (plus the normal errors)

Edit the “Tweaks” actor to add additional checks or add additional plots to identify the problems with the data

If you don’t cause Kepler to abort the workflow due to errors at least once, you aren’t trying hard enough! So make additions in a change-test-repeat cycle
