Visualizing ORACLE Performance with R - whitepaper

COLLABORATE 14 – IOUG Forum
Database

Visualizing ORACLE performance with R
Maxym Kharchenko, Amazon.com

ABSTRACT

A picture is worth a thousand words.

This is especially true during performance problem investigations, where a well-made graph of the issue can often cut resolution time from days to mere minutes. ORACLE database provides a wealth of performance information, but unfortunately only a small part of it is currently visualized by standard tools, such as Enterprise Manager.

Enter “R”: a well-known (and free) statistical analysis and graphing framework that can create relevant and interesting visualizations of pretty much any data. Come to this presentation to learn how, with a bit of R knowledge, you can make your ASH, AWR, 10046 trace, listener log and other data come alive.

Learner will be able to:

Understand the benefits of R for performance (and other) data visualizations
Explore ORACLE visualization “prime targets”, such as listener logs, ASH or AWR dictionary views
Learn multiple ways and techniques to visualize data in R

TARGET AUDIENCE

Anybody with an interest in visualizing ORACLE database performance will benefit from this white paper. No special ‘visualization’ or R knowledge is required to read this document or execute the example code; however, some familiarity with ORACLE databases as well as scripting languages is expected. That is, readers should know how to run queries against ORACLE data dictionary views and understand the concept of functions (e.g. in Perl).

EXECUTIVE SUMMARY

BACKGROUND:

This whitepaper is designed to give you a taste of R by having you execute a few visualizations of your own, using either the actual data from your databases or prepared example data.

PREREQUISITES:

For the examples to work, you need to download and install R.

Here is a link for Windows installation: http://cran.r-project.org/bin/windows/base/
A link for Mac installation: http://cran.r-project.org/bin/macosx/
And a link for anything else: https://www.google.com/

After R is installed, you need to install a few 3rd party libraries that we will be using in this lab:

install.packages("stringr")
install.packages("sqldf")
install.packages("scales")
install.packages("plyr")
install.packages("ggplot2")

Finally, load the newly installed libraries into your session:

library(stringr)
library(sqldf)
library(scales)
library(plyr)
library(ggplot2)
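Before loading, it can help to confirm that every package is actually installed. The following is a minimal sketch and not part of the original lab; the helper name missing_packages is hypothetical.

```r
# Hypothetical helper (not part of the lab): return the names of the
# packages in 'pkgs' that cannot be found in the current library paths.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

lab_pkgs   <- c("stringr", "sqldf", "scales", "plyr", "ggplot2")
to_install <- missing_packages(lab_pkgs)

if (length(to_install) > 0) {
  message("Still missing: ", paste(to_install, collapse = ", "))
} else {
  message("All lab packages are installed.")
}
```

Anything reported as missing can be passed to install.packages() as shown earlier.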

That’s all for prerequisites, let’s do some graphing!

HANDS-ON LAB: VISUALIZE YOUR DATA IN R:

Visualization Examples:

Episode 1: What happened to my database at 23:00?
Graph: "Time series of Active Sessions"




Get the data

On your database:

Download: rlab_get_dsh.sql from: http://intermediatesql.com/wp-content/uploads/2014/02/rlab_get_dsh.sql_.txt

Then run:

export START_TIME="'2014-02-14 00:00:00'"
export END_TIME="'2014-02-15 00:00:00'"
export PRECISION="'MI'"

[[ ! -z $ORACLE_SID ]] && echo "exit" | \
  sqlplus -S -L / as sysdba @rlab_get_dsh.sql $START_TIME $END_TIME $PRECISION | \
  perl -pale 's/\s+,/,/g' | perl -pale 's/,\s+/,/g' | grep -Pv '^[\-,]+$' >${ORACLE_SID}_dsh.csv &

Using example data set:




d <- read.csv('http://intermediatesql.com/wp-content/uploads/2014/02/r_example_ash.csv', header=TRUE, stringsAsFactors=FALSE)

Transform

If you loaded data from example, you need to massage it for better graphing.

First of all, check what you have:

> str(d)
'data.frame': 17075 obs. of 10 variables:
 $ TS              : chr "2013-09-01 06:39:00" "2013-09-01 03:35:00" "2013-09-01 03:31:00" "2013-09-01 03:25:00" ...
 $ WAIT_CLASS      : chr "ON CPU" "ON CPU" "ON CPU" "ON CPU" ...
 $ EVENT           : chr "ON CPU" "ON CPU" "ON CPU" "ON CPU" ...
 $ READ_OR_WRITE   : chr "READ" "READ" "READ" "READ" ...
 $ BLOCKING_SESSION: chr "" "" "" "" ...
 $ MACHINE         : chr "app1-host" "app1-host" "app1-host" "app1-host" ...
 $ MODULE          : chr "Module4" "Module4" "Module4" "Module4" ...
 $ IN_STAGE        : chr "SQL EXEC" "SQL EXEC" "SQL EXEC" "SQL EXEC" ...
 $ SQL_ID          : chr "mnyy6bt9fgmm5" "mnyy6bt9fgmm5" "mnyy6bt9fgmm5" "mnyy6bt9fgmm5" ...
 $ N               : int 1 1 1 1 1 1 6 6 6 6 ...

Let’s adjust the data, i.e. clean data types etc:

# Convert to 'R date' data type
d$TS <- as.POSIXct(d$TS, tz="UTC")

# And now, review the data again
str(d)
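To see what the conversion buys us, here is a small self-contained sketch. The three timestamps are taken from the sample output above; the variable names are made up for illustration.

```r
# A few timestamps in the same text format as the TS column
ts_chr <- c("2013-09-01 03:25:00", "2013-09-01 03:31:00", "2013-09-01 06:39:00")

# Parse the text as date-times; tz= states which time zone the text is in
ts <- as.POSIXct(ts_chr, tz = "UTC")

class(ts_chr)   # "character"
class(ts)       # "POSIXct" "POSIXt"

# Date arithmetic now works: minutes between the first and the last sample
elapsed_min <- as.numeric(difftime(max(ts), min(ts), units = "mins"))
print(elapsed_min)   # 194
```

Once TS is a real date-time, ggplot2 can also draw it on a proper time axis instead of treating every value as a separate label.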

Now, let's say we want to see workload by WAIT_CLASS. Let's group the data by WAIT_CLASS (luckily, with the sqldf package, R supports SQL!):

d1 <- sqldf("select TS, WAIT_CLASS, sum(N) as N from d group by TS, WAIT_CLASS")
d1$TS <- as.POSIXct(d1$TS, tz="UTC")
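For comparison, the same grouping can be sketched in base R with aggregate(). This is an alternative shown on made-up toy data, not the approach the lab itself uses.

```r
# Toy data frame with the same columns that the query uses
toy <- data.frame(
  TS         = c("2013-09-01 03:25:00", "2013-09-01 03:25:00",
                 "2013-09-01 03:25:00", "2013-09-01 03:31:00"),
  WAIT_CLASS = c("ON CPU", "ON CPU", "User I/O", "ON CPU"),
  N          = c(1, 6, 2, 4),
  stringsAsFactors = FALSE
)

# Base R counterpart of:
#   select TS, WAIT_CLASS, sum(N) as N from toy group by TS, WAIT_CLASS
toy1 <- aggregate(N ~ TS + WAIT_CLASS, data = toy, FUN = sum)
print(toy1)
```

The result has one row per (TS, WAIT_CLASS) pair, which is the shape the plots below expect.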



Visualize

Basic plots:

# Area graph:
ggplot(d1, aes(x=TS, y=N, fill=WAIT_CLASS)) + geom_area()

# Bar graph:
ggplot(d1, aes(x=TS, y=N, fill=WAIT_CLASS)) + geom_bar(stat="identity")

# Line graph:
ggplot(d1, aes(x=TS, y=N, color=WAIT_CLASS)) + geom_line()

Let's beautify our graphs a bit. First, let's convert WAIT_CLASS to a factor, so that we can play with label ordering:

d1$WAIT_CLASS <- as.factor(d1$WAIT_CLASS)
# Reorder how WAIT_CLASS is shown in the graph by the (sum of) N
d1$WAIT_CLASS <- reorder(d1$WAIT_CLASS, d1$N, FUN=sum)
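To make the effect of reorder() concrete, here is a tiny sketch on made-up values. It only illustrates how the factor levels change; the numbers are not from the lab data.

```r
wait_class <- factor(c("User I/O", "ON CPU", "User I/O", "Concurrency"))
n          <- c(1, 10, 2, 5)

# The default level order is alphabetical
levels(wait_class)

# reorder() sorts the levels by sum(n) within each level, ascending:
# User I/O = 3, Concurrency = 5, ON CPU = 10
by_load <- reorder(wait_class, n, FUN = sum)
levels(by_load)
```

ggplot2 uses the level order of a factor for the legend and for stacking, which is why the reordering is done before plotting.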

Let's create a better graph with better colors, labels etc:

ggplot(d1, aes(x=TS, y=N, fill=WAIT_CLASS, order=desc(WAIT_CLASS))) +
  geom_area(stat="identity") +
  xlab("Time") +
  ylab("Active Sessions") +
  ggtitle("Active sessions by wait class") +
  theme_minimal() +
  scale_fill_brewer(palette="Spectral") +
  scale_y_continuous(labels=comma)

OK, so the Concurrency wait class seems to be responsible for the problem. Can we get more detailed information here?

First of all, let’s load some helpful functions:

source("http://intermediatesql.com/wp-content/uploads/2014/02/plot_functions.R.txt")

This brings in the explore_f() function, which can simplify data exploration. You can check what this function does by typing:

# Just parameters
str(explore_f)

# Full code
explore_f

Now let’s see if we can narrow down a problem even further:

explore_f('EVENT')

explore_f('SQL_ID')

explore_f('IN_STAGE')

explore_f('MODULE')

explore_f('MODULE', 'line')

explore_f('BLOCKING_SESSION')

Well, it is clear that the problem is related to the Module1 module: sessions are waiting for SOFT PARSING, there is no particular SQL attached, and there are only a handful of blocking sessions (which is good, since we can focus on them).

I think you will agree that the problem can be seen pretty clearly now. This is very similar to what Enterprise Manager does, by the way.

Feel free to continue exploration.

1. Look at other columns in the data frame and graph them

2. Try different colors:

# explore_f() saves previous visualization in “p” variable
# You can reuse it and just add additional elements
explore_f('BLOCKING_SESSION')
p + scale_fill_brewer(palette="Accent")
p + scale_fill_brewer(palette="Set1")

Other good color choices can be found here: http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/




3. Try percentage-wise graphs:

explore_f('EVENT', pct=TRUE)
explore_f('BLOCKING_SESSION', pct=TRUE)

4. Try facets:

# Facet view: Review each wait_class separately - real scales
ggplot(d1, aes(x=TS, y=N, fill=WAIT_CLASS, order=desc(WAIT_CLASS))) +
  geom_bar(stat="identity") +
  xlab("Time") +
  ylab("Active Sessions") +
  ggtitle("Active sessions by wait class") +
  theme_minimal() +
  scale_fill_brewer(palette="Spectral") +
  scale_y_continuous(labels=comma) +
  facet_grid(WAIT_CLASS ~ .)
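The internals of explore_f(pct=TRUE) are not shown here, so the following is only a sketch of one way a percentage-wise view could be derived: divide each N by the total for its timestamp. The toy data and column names TOTAL and PCT are made up.

```r
toy <- data.frame(
  TS         = c("03:25", "03:25", "03:31", "03:31"),
  WAIT_CLASS = c("ON CPU", "User I/O", "ON CPU", "User I/O"),
  N          = c(6, 2, 1, 3)
)

# ave() returns, for every row, the total of N within that row's TS group
toy$TOTAL <- ave(toy$N, toy$TS, FUN = sum)

# Share of each wait class within its time bucket, in percent
toy$PCT <- 100 * toy$N / toy$TOTAL
print(toy)
```

Plotting PCT instead of N shows the mix of wait classes over time, independent of how busy the database was.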

Episode 2: When to schedule backups?
Graph: "ARC log heat maps"




Get the data

On your database:

Download: rlab_get_arc.sql from: http://intermediatesql.com/wp-content/uploads/2014/02/rlab_get_arc.sql_.txt

Then run:





[[ ! -z $ORACLE_SID ]] && echo "exit" | \
  sqlplus -S -L / as sysdba @rlab_get_arc.sql $START_TIME $END_TIME $PRECISION | \
  perl -pale 's/\s+,/,/g' | perl -pale 's/,\s+/,/g' | grep -Pv '^[\-,]+$' >${ORACLE_SID}_arc.csv &

Using example data set:

d <- read.csv('http://intermediatesql.com/wp-content/uploads/2014/02/r_example_arc.csv', header=TRUE, stringsAsFactors=FALSE)

Transform

> str(d)
'data.frame': 653 obs. of 3 variables:
 $ TS    : chr "2013-07-30 03:00:00" "2013-07-30 04:00:00" "2013-07-30 05:00:00" "2013-07-30 06:00:00" ...
 $ N     : int 13 19 20 12 77 9 23 6 9 9 ...
 $ MBYTES: num 9741 14291 15049 9070 67061 ...

Let's adjust data types and clean the data:

# Remove non-conforming dates
d <- d[str_length(d$TS) == 19,]
# Convert timestamp to 'R date' data type
d$TS <- as.POSIXct(d$TS, tz="UTC")

# And now, review the data again
str(d)
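The length check works because a full "YYYY-MM-DD HH:MI:SS" timestamp is exactly 19 characters long. A small sketch on made-up input, using base R's nchar() in place of stringr's str_length():

```r
raw <- c("2013-07-30 03:00:00", "2013-07-30", "", "2013-07-30 04:00:00")

# nchar() is the base R counterpart of stringr's str_length()
ok    <- nchar(raw) == 19
clean <- raw[ok]

# Only complete timestamps are left, so the conversion yields no NA values
ts <- as.POSIXct(clean, tz = "UTC")
print(ts)
```

Note that this is a coarse filter: it removes truncated or empty values, but it does not verify that a 19-character string is a valid date.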

Visualize

Let's make the simplest ARC log plot: MBytes by Time

ggplot(d, aes(x=TS, y=MBYTES)) + geom_line()



Let's beautify our graph a bit. For example, let us color the line according to whether the day is a weekday or a weekend:

d$WEEKDAY <- ifelse(strftime(d$TS, "%a") %in% c('Sun', 'Sat'), 'Weekend', 'Weekday')
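One caveat: "%a" returns day abbreviations in the session's locale, so comparing against 'Sun' and 'Sat' assumes an English locale. A locale-independent variant could look like the sketch below, which uses the numeric weekday instead. The dates and variable names are illustrative only.

```r
ts <- as.POSIXct(c("2013-07-30 03:00:00",   # Tuesday
                   "2013-08-03 12:00:00",   # Saturday
                   "2013-08-04 23:00:00"),  # Sunday
                 tz = "UTC")

# $wday is 0 for Sunday through 6 for Saturday, regardless of locale
wday     <- as.POSIXlt(ts, tz = "UTC")$wday
day_type <- ifelse(wday %in% c(0, 6), "Weekend", "Weekday")
print(data.frame(ts, wday, day_type))
```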

ggplot(d, aes(x=TS, y=MBYTES, color=WEEKDAY, group=1)) +
  geom_line() +
  geom_point() +
  scale_color_manual(values=c("YellowGreen", "OrangeRed")) +
  theme_minimal() +
  xlab('Time') +
  ylab('Arclogs per hour (Mbytes)') +
  ggtitle("ARC logs by day") +
  theme(legend.title=element_blank())

This graph shows the data but it's a bit lame. Let's convert it into a heat map:

# Adjust the data
d$HOUR <- strftime(d$TS, "%H")
d$DAY <- strftime(d$TS, "%m-%d %a")

# And plot it as a heat map
ggplot(d, aes(x=HOUR, y=DAY, fill=MBYTES)) +
  geom_tile() +
  scale_fill_gradient(low="white", high="salmon", labels = comma) +
  theme_minimal() +
  ggtitle("ARC logs heat map")
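A heat map is essentially a DAY by HOUR grid with one colored cell per combination. The grid itself can be sketched in base R with tapply(); the toy values below are made up for illustration.

```r
toy <- data.frame(
  DAY    = c("07-30 Tue", "07-30 Tue", "07-31 Wed", "07-31 Wed"),
  HOUR   = c("03", "04", "03", "04"),
  MBYTES = c(9741, 14291, 15049, 9070)
)

# One cell per (DAY, HOUR) pair: the same layout that geom_tile() colors in
grid <- tapply(toy$MBYTES, list(toy$DAY, toy$HOUR), sum)
print(grid)

# The busiest cell is the one a heat map would color most intensely
busiest <- which(grid == max(grid), arr.ind = TRUE)
print(rownames(grid)[busiest[1, "row"]])
print(colnames(grid)[busiest[1, "col"]])
```

Reading the grid row by row answers the backup question directly: look for the hours that stay pale on every day.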

Feel free to experiment further with data layouts, colors etc., e.g.:

ggplot(d, aes(x=HOUR, y=DAY, fill=MBYTES)) +
  geom_tile() +
  scale_fill_gradient(low="yellow", high="blue", labels = comma) +
  theme_minimal() +
  ggtitle("ARC logs heat map")

Episode 3: Do I have unstable SQLs?
Graph: SQL Elapsed Time Boxplots and Violins




Get the data

On your database:

Download: rlab_get_sql_stat.sql from: http://intermediatesql.com/wp-content/uploads/2014/02/rlab_get_sql_stat.sql_.txt

Then run:


[[ ! -z $ORACLE_SID ]] && echo "exit" | \
  sqlplus -S -L / as sysdba @rlab_get_sql_stat.sql $START_TIME $END_TIME $PRECISION | \
  perl -pale 's/\s+,/,/g' | perl -pale 's/,\s+/,/g' | grep -Pv '^[\-,]+$' >${ORACLE_SID}_sql_stat.csv &

Using example data set:

d <- read.csv('http://intermediatesql.com/wp-content/uploads/2014/02/r_example_sql_stat.csv', header=TRUE, stringsAsFactors=FALSE)

Transform

> str(d)
'data.frame': 174604 obs. of 27 variables:
 $ TS                      : chr "2013-08-25 08:00:00" "2013-08-25 08:00:00" "2013-08-25 08:00:00" "2013-08-25 08:00:00" ...
 $ SQL_ID                  : chr "g5a46cnz09948" "vk4q6hr39s57g" "d0ku4grrv4575" "yy17gdxs08574" ...
 $ FETCHES                 : chr "37804" "0" "12803" "11112" ...
 $ SORTS                   : chr "0" "0" "0" "0" ...
 $ EXECUTIONS              : chr "37804" "10346" "12803" "11112" ...
 $ PARSE_CALLS             : chr "13708" "1315" "7966" "3931" ...
 $ DISK_READS              : chr "0" "10459" "0" "304" ...
 $ BUFFER_GETS             : chr "0" "466539" "66131" "33742" ...
 $ ROWS_PROCESSED          : chr "37804" "10346" "12803" "406" ...
 $ CPU_TIME                : chr "3059552" "4099356" "671899" "384945" ...
 $ ELAPSED_TIME            : chr "11402732" "59958791" "743968" "2114141" ...
 $ IOWAIT                  : chr "0" "49926497" "0" "1709069" ...
 $ CLWAIT                  : chr "0" "0" "0" "0" ...
 $ APWAIT                  : chr "0" "0" "0" "0" ...
 $ CCWAIT                  : chr "0" "0" "0" "0" ...
 $ DIRECT_WRITES           : chr "0" "0" "0" "0" ...
 $ PLSEXEC_TIME            : chr "0" "590720" "0" "0" ...
 $ JAVEXEC_TIME            : chr "0" "0" "0" "0" ...
 $ IO_OFFLOAD_ELIG_BYTES   : chr "0" "0" "0" "0" ...
 $ IO_INTERCONNECT_BYTES   : chr "0" "85590016" "0" "2490368" ...
 $ PHYSICAL_READ_REQUESTS  : chr "0" "10448" "0" "304" ...
 $ PHYSICAL_READ_BYTES     : chr "0" "85590016" "0" "2490368" ...
 $ PHYSICAL_WRITE_REQUESTS : chr "0" "0" "0" "0" ...
 $ PHYSICAL_WRITE_BYTES    : chr "0" "0" "0" "0" ...
 $ OPTIMIZED_PHYSICAL_READS: chr "0" "0" "0" "0" ...
 $ CELL_UNCOMPRESSED_BYTES : chr "0" "0" "0" "0" ...
 $ IO_OFFLOAD_RETURN_BYTES : chr "0" "0" "0" "0" ...

Let's adjust data types, clean the data and convert the timestamps to R date-time values (interpreted here as UTC):

# Remove non-conforming dates
d <- d[str_length(d$TS) == 19,]
# Convert to 'R date' data type
d$TS <- as.POSIXct(d$TS, tz="UTC")

# For whatever reason all numbers come here as "characters".
# Let's convert them back to numbers
for (x in (setdiff(names(d), c('TS', 'SQL_ID')))) {
  d[[x]] <- as.numeric(d[[x]])
}
# And now, review the data again
str(d)
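The same conversion can also be written without an explicit loop. The sketch below does it with lapply() on a two-row toy frame built from values in the sample output above; it is an alternative spelling, not a change to the lab's approach.

```r
toy <- data.frame(
  TS          = c("2013-08-25 08:00:00", "2013-08-25 08:00:00"),
  SQL_ID      = c("g5a46cnz09948", "vk4q6hr39s57g"),
  EXECUTIONS  = c("37804", "10346"),
  BUFFER_GETS = c("0", "466539"),
  stringsAsFactors = FALSE
)

# Every column except the two identifying ones should be numeric
num_cols <- setdiff(names(toy), c("TS", "SQL_ID"))
toy[num_cols] <- lapply(toy[num_cols], as.numeric)
str(toy)

# Text that is not a number becomes NA (with a warning), so it is
# worth checking for NA values after the conversion
suppressWarnings(as.numeric(c("42", "n/a")))
```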

Visualize

Let's do some data analysis. For example, let's determine whether we have SQLs that vary wildly in "buffer gets per execution" over time. I’m also going to use native R data transformation tools (such as ddply) to show that SQL is not the only way to do that.

First of all, let's clean the data:

# Do not pay attention to SQLs that are not executed a lot or do not read a lot
d1 <- d[d$EXECUTIONS > 100 & d$BUFFER_GETS > 1000, ]

# Calculate gets per execution
d1$GETS_PER_EXEC <- d1$BUFFER_GETS/d1$EXECUTIONS

# Remove NA values ("non existing data", similar to NULL in ORACLE)
d1 <- d1[!is.na(d1$GETS_PER_EXEC), ]
str(d1)

Now let's calculate the standard deviation of (a vector of) "buffer gets per execution" for each SQL.

Standard deviation is a measure of how much the data varies around the average. We will use the ddply() function from the plyr package for that. It breaks the data into groups (by SQL_ID), calculates the standard deviation per group and then combines the data back, assigning the per-group deviation to a new column: SD.

Think: analytic functions in ORACLE SQL.

d1 <- ddply(d1, c("SQL_ID"), transform, SD=sd(GETS_PER_EXEC, na.rm=TRUE))
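To see the "analytic function" behavior on something small, here is a sketch that uses base R's ave() instead of ddply(). It is an equivalent idea on made-up data, not the lab's own code.

```r
toy <- data.frame(
  SQL_ID        = c("a", "a", "a", "b", "b", "b"),
  GETS_PER_EXEC = c(10, 10, 10, 5, 50, 500)
)

# ave() computes sd() within each SQL_ID and repeats the result on every
# row of that group, much like stddev(...) over (partition by sql_id)
toy$SD <- ave(toy$GETS_PER_EXEC, toy$SQL_ID, FUN = sd)
print(toy)
```

SQL "a" always does the same amount of work, so its SD is 0; SQL "b" is the unstable one and gets a large SD on each of its rows.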

We got the results, but we still have a lot of different SQLs there:

d1$SQL_ID <- as.factor(d1$SQL_ID)
> str(d1$SQL_ID)
 Factor w/ 322 levels "01g33zp5h7qvb",..: 1 1 1 1 1 1 1 1 1 1 ...

If we graph all 322 SQLs, the plot will be pretty messy.

Let's select the top 8 SQLs by "buffer gets per exec" deviation so that our plot is not overwhelmed with data:

d1 <- add_cat_top_n('SQL_ID', 'SD', drop_others=T, top_n=8, data=d1)
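add_cat_top_n() comes from the plot_functions.R file loaded earlier, and its internals are not reproduced here. As a rough idea of what a "top N by SD" selection involves, here is a base R sketch on made-up data; it is an assumption about the general technique, not a copy of that function.

```r
toy <- data.frame(
  SQL_ID = c("a", "a", "b", "b", "c", "c", "d", "d"),
  SD     = c(1, 1, 40, 40, 7, 7, 300, 300),
  stringsAsFactors = FALSE
)

# One SD value per SQL_ID
sd_by_sql <- tapply(toy$SD, toy$SQL_ID, max)

# Names of the 2 SQL_IDs with the highest SD, most variable first
top_ids <- names(sd_by_sql)[order(sd_by_sql, decreasing = TRUE)][1:2]

# Keep only the rows that belong to those SQL_IDs
toy_top <- toy[toy$SQL_ID %in% top_ids, ]
print(top_ids)
print(nrow(toy_top))
```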

# Factor math is complicated, so add_cat_top_n() converts all factors to chars. Let's convert it back
d1$SQL_ID <- as.factor(d1$SQL_ID)

# Let's see how many SQL_IDs are now in our data set
levels(d1$SQL_ID)

[1] "6cdj64z9p6c5c" "7gpq48pasy99h" "ddqma9ku89b00" "g5wp6qf3uu12p"
[5] "kts18nwzj82hk" "tff6bbwan4jqq" "tvjkbtkswgjc3" "wmrpbrd8mm88a"

# Cool, now we are only dealing with 8 SQLs. Remember that we requested only the top_n SQLs (by "standard deviation": the 'SD' column in our data frame)

We are ready to make our first graph:

ggplot(d1, aes(x=SQL_ID, y=GETS_PER_EXEC, fill=SQL_ID)) + geom_boxplot()

A boxplot summarizes the data in terms of percentiles: the box spans the 25th to the 75th percentile, the line inside the box marks the median, and dots show outliers.
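The numbers behind a boxplot can be inspected directly in base R. The sketch below uses made-up values; boxplot.stats() is the base R helper and applies the common rule that points more than 1.5 box heights away from the box count as outliers.

```r
gets <- c(10, 12, 11, 13, 12, 11, 10, 95)

# The box edges (25th and 75th percentile) and the median line
quantile(gets, probs = c(0.25, 0.50, 0.75))

# Points more than 1.5 box heights (IQR) beyond the box are outliers
stats <- boxplot.stats(gets)
print(stats$out)   # 95
```

A single extreme execution such as the 95 above barely moves the box, which is why boxplots are handy for spotting unstable SQLs.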



Let's beautify our graph a bit:

# First of all, let's reorder SQL_ID factor levels by values of GETS_PER_EXEC, so that
# our boxplots are displayed in the order from smallest to biggest gets_per_exec.
d1$SQL_ID <- reorder(d1$SQL_ID, d1$GETS_PER_EXEC, FUN=max)

# Beautified graph:
ggplot(d1, aes(x=SQL_ID, y=GETS_PER_EXEC, fill=SQL_ID)) +
  geom_boxplot() +
  scale_y_continuous(labels=comma) +
  theme_minimal() +
  ylab("Gets per execution") +
  ggtitle("Most wildly varying SQLs by gets/per/execution") +
  coord_flip()

What about violins? (You remember the section header, right?)

A violin plot is very similar to a boxplot, but it also shows "the shape" (or: density) of the data (think of a histogram here).

To transform the latest plot to a violin plot, just replace geom_boxplot() with geom_violin():

ggplot(d1, aes(x=SQL_ID, y=GETS_PER_EXEC, fill=SQL_ID)) +
  geom_violin() +
  scale_y_continuous(labels=comma) +
  theme_minimal() +
  ylab("Gets per execution") +
  ggtitle("Most wildly varying SQLs by gets/per/execution") +
  coord_flip()

# Zooming for better picture
# Note: limits= on a scale drops observations outside the range before the violins are computed
ggplot(d1, aes(x=SQL_ID, y=GETS_PER_EXEC, fill=SQL_ID)) +
  geom_violin() +
  scale_y_continuous(labels=comma, limits=c(0, 10000)) +
  theme_minimal() +
  ylab("Gets per execution") +
  ggtitle("Most wildly varying SQLs by gets/per/execution") +
  coord_flip()

Violins are pretty, but in my mind a better way to explore data shape is to use histograms or density curves directly.

Hint: A histogram is bucketed raw data (a bunch of bars, each of which represents the frequency of data within that bar's range). A density curve is a mathematical approximation of the data frequency distribution (histogram bars smoothed into a line).
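The difference between the two can be seen without plotting anything. The sketch below uses a made-up sample with two clusters and base R's hist() and density(); the variable names are illustrative.

```r
set.seed(1)
# Made-up sample with two clusters of values
x <- c(rnorm(200, mean = 100, sd = 10), rnorm(100, mean = 400, sd = 20))

# Histogram: count of observations per bucket
h <- hist(x, breaks = 20, plot = FALSE)
print(sum(h$counts))   # 300: every observation lands in exactly one bucket

# Density: a smoothed estimate whose curve encloses an area of about 1
dens <- density(x)
area <- sum(diff(dens$x) * (head(dens$y, -1) + tail(dens$y, -1)) / 2)
print(area)
```

Because a density curve is normalized to an area of about 1, SQLs with very different execution counts can be compared by shape alone.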

ggplot(d1, aes(x=GETS_PER_EXEC, fill=SQL_ID)) +
  geom_density() +
  scale_x_continuous(labels=comma) +
  scale_y_continuous(labels=comma) +
  theme_minimal() +
  ylab('Density') +
  xlab('Buffer gets per execution') +
  ggtitle('Buffer gets per execution density') +
  facet_grid(SQL_ID ~ ., scales="free_y")

Looking at this graph, we can see that, for example, sql_id=ddqma9ku89b00 almost always has the same number of buffer gets per execution (and a small number at that), while sql_id=tff6bbwan4jqq has 3 separate "peaks" (which is very interesting!).

APPENDICES

List of downloads for this lab:

R tool:

Windows installation: http://cran.r-project.org/bin/windows/base/
Mac installation: http://cran.r-project.org/bin/macosx/

Custom R functions:

http://intermediatesql.com/wp-content/uploads/2014/02/plot_functions.R.txt

SQL scripts:

http://intermediatesql.com/wp-content/uploads/2014/02/rlab_get_dsh.sql_.txt
http://intermediatesql.com/wp-content/uploads/2014/02/rlab_get_arc.sql_.txt
http://intermediatesql.com/wp-content/uploads/2014/02/rlab_get_sql_stat.sql_.txt

Example data sets:

http://intermediatesql.com/wp-content/uploads/2014/02/r_example_ash.csv
http://intermediatesql.com/wp-content/uploads/2014/02/r_example_arc.csv
http://intermediatesql.com/wp-content/uploads/2014/02/r_example_sql_stat.csv

REFERENCES

Further R resources:

Websites:

R project for statistical computing: http://www.r-project.org/
R bloggers: http://www.r-bloggers.com/
Hadley Wickham blog: http://www.r-statistics.com/tag/hadley-wickham/
  o And personal page: http://had.co.nz/
Cookbook for R: http://www.cookbook-r.com/

Books:

The Art of R programming: http://www.amazon.com/The-Art-Programming-Statistical-Software/dp/1593273843/ref=tmm_pap_title_0?ie=UTF8&qid=1392504776&sr=8-1
R graphics cookbook: http://www.amazon.com/R-Graphics-Cookbook-Winston-Chang/dp/1449316956/ref=tmm_pap_title_0?ie=UTF8&qid=1392504949&sr=8-2
(free) Advanced R programming: http://adv-r.had.co.nz/

