Page 1

January 2013 Portland User Group MeetUp Presentation

R & Text Analytics

Daniel Fennelly

Portland R User Group Portland, Oregon 15 January 2013

Page 2

Following are some notes on the usage of the R package TopicWatchr. TopicWatchr is designed to neatly access the [LuckySort API](http://luckysort.com/products/api/docs/intro). TopicWatchr was authored by Homer Strong and is currently maintained and updated by Daniel Fennelly.

Page 3

> library(TopicWatchr)

Loading required package: RJSONIO

Loading required package: RCurl

Loading required package: bitops

Welcome to TopicWatchr!

Remember to check for updates regularly.

Found TopicWatch account file in ~/.tw

Welcome [email protected]

Credentials can be stored in `~/.tw`:

[email protected]
hunter2

Or you can authenticate in the interactive shell...

> clearCredentials()

> setCredentials()

Enter username: [email protected]

Enter password:

>

Note: Be careful about the password prompt in ESS. It seems ESS hides the password in the minibuffer before displaying it in the \*R\* buffer.
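As an aside, if you want to peek at those stored credentials yourself, a minimal sketch of reading a two-line `~/.tw` file (username on the first line, password on the second, as in the example above) could look like the following. How TopicWatchr itself parses the file isn't shown here, so treat this as an illustration only.

# Hypothetical helper: read a two-line ~/.tw file (username, then password).
# This is not TopicWatchr's own parser, just an illustration.
read.tw.file <- function(path = "~/.tw") {
  lines <- readLines(path, warn = FALSE)
  list(username = lines[1], password = lines[2])
}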

Page 4

Package Summary

1. Formulate and send API requests according to task
2. Receive and parse JSON response
3. Page through multiple requests, offer quick visualization tools, other utilities

Other end-user tools to access this data include the [TopicWatch](https://studio.luckysort.com/) web interface and the tw.py python client.
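For a sense of what steps 1 and 2 look like without the package, here is a rough sketch of a single request/parse cycle using RCurl and RJSONIO (the packages TopicWatchr loads). The URL mirrors the debug output shown on page 8; the basic-auth call and the shape of the parsed JSON are assumptions for illustration, not TopicWatchr internals.

# Rough sketch of one request/parse cycle with RCurl and RJSONIO.
# The endpoint format copies the debug URL shown later; the authentication
# details here are assumed rather than taken from the API docs.
library(RCurl)
library(RJSONIO)

url <- paste0("https://api.luckysort.com/v1/sources/twitter_sample/",
              "metrics/documentcounts",
              "?start=2012-12-01T12:00:00Z&end=2013-01-15T07:10:19Z",
              "&grams=obama,mayan,newtown,iphone",
              "&limit=300&resolution=86400&offset=0&freq=TRUE")

raw.json <- getURL(url, userpwd = "[email protected]:hunter2", httpauth = 1L)
parsed <- fromJSON(raw.json)
str(parsed)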

Page 5

The Basics

The data we work with at LuckySort, and which we'll be talking about here, have a few specific qualities:

1. Text Sources
2. Terms
3. Time

Page 6

The Basics Text Sources

• Hourly: Twitter data, StockTwits, consumer Facebook statuses, Wordpress posts and comments...
• Daily: RSS news sources, Amazon.com product reviews, Benzinga News Updates
• your data? (talk with us!)

Let's fetch our personal list of sources.

> my.sources <- getSources()

> head(my.sources)

name id

1 Wordpress Intense Debate comments wp_en_comments-id

2 StockTwits stock_twits

3 Benzinga News Updates benzinga_news_updates_1

4 AngelList angelco

5 Amazon.com Shoes best sellers reviews amzn-bestsellers-shoes

6 Amazon.com Home & Kitchen best sellers reviews amzn-bestsellers-home

> dim(my.sources)

[1] 35 2

Page 7

The Basics Text Sources

Let's get some more specific metadata.

> twitter.info <- getSourceInfo("twitter_sample")
> names(twitter.info)

[1] "metrics" "resolutions" "users"

[4] "name" "finest_resolution" "owner"

[7] "aggregate_type" "type" "id"

> twitter.info$finest_resolution

[1] 3600

> twitter.info$metrics

[1] "documentcounts"

Sources have specific resolutions available to them, given in seconds. The finest resolution for Twitter is one hour. The metrics are almost always going to just be "documentcounts", although we're working on making available numeric sources like stock market or exchange rate information.
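As a small illustration of using this metadata, you can derive a coarser resolution from the finest one and check that it is a clean multiple (whether the API strictly requires a multiple is an assumption here).

> finest <- twitter.info$finest_resolution   # 3600 seconds (one hour)
> daily.res <- finest * 24                   # 86400 seconds (one day)
> stopifnot(daily.res %% finest == 0)        # a clean multiple of the finest resolution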

Page 8

The Basics Terms in Time

Our most basic analysis is that of a term occurring within a streaming document source: how are \<term\> occurrences in \<document source\> changing over time from \<start\> to \<finish\>?

> end <- Sys.time()
> start <- ISOdate(2012, 12, 01, tz="PST")

> start; end

[1] "2012-12-01 12:00:00 PST"

[1] "2013-01-14 23:10:19 PST"

> terms <- c("obama", "mayan", "newtown", "iphone")

> resolution <- 3600 * 24

> recent.news <- metric.counts(terms, src="twitter_sample", start=start, end=end,
    resolution=resolution, freq=T, debug=T)

get: https://api.luckysort.com/v1/sources/twitter_sample/metrics/documentcounts?start=2012-12-01T12:00:00Z&end=2013-01-15T07:10:19Z&grams=obama,mayan,newtown,iphone&limit=300&resolution=86400&offset=0&freq=TRUE

Page 9

The Basics Terms in Time

Let's plot our data and see what it looks like! The function `plotSignal` just wraps some handy ggplot2 code. For anything sophisticated you'll probably want to tailor your plotting to your own needs.

> png("news.png", width=1280, height=720)

> plotSignal(recent.news)

> dev.off()

![news_plot](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/news.png)
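If you do want to roll your own plot instead of using `plotSignal`, here's a sketch assuming `recent.news` is a wide data frame with a `times` column plus one numeric column per term (the same layout as the co-occurrence example on page 14); it melts to long form with reshape2 and plots with ggplot2.

# Hand-rolled alternative to plotSignal(), assuming a wide data frame with a
# "times" column plus one column per term.
library(ggplot2)
library(reshape2)

long.news <- melt(recent.news, id.vars = "times")
ggplot(long.news, aes(x = times, y = value, colour = variable)) +
  geom_line() +
  labs(x = "time", y = "relative frequency", colour = "term")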

Page 10

The Basics Terms in Time

Of course one's choice of resolution is going to change the look of the data. At the daily resolution there's no way to disambiguate between sustained daily usage of a term and rapid usage within a short time span. Take a look at these plots of the same terms over the same time span collected at hourly and daily resolution.

Page 11

The Basics Terms in Time

![tech words daily resolution](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/techWordsDaily.png)

Page 12

The Basics Terms in Time

![tech words hourly resolution](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/techWordsHourly.png)
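One way to reproduce this comparison without issuing two separate API requests is to pull the hourly data once and aggregate it to daily resolution locally. Here's a sketch, again assuming the wide times-plus-term-columns layout, and using raw counts (freq=F) so the daily values are simple sums.

# Sketch: collapse an hourly pull to daily resolution locally.
hourly <- metric.counts(terms, src="twitter_sample", start=start, end=end,
    resolution=3600, freq=F)
day <- as.Date(hourly$times)
daily <- aggregate(hourly[, setdiff(names(hourly), "times")],
    by=list(day=day), FUN=sum)
head(daily)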

Page 13

The Basics Term Co-occurrences

Moving beyond simple word counts, we're often interested in the subset of a text source mentioning a specific term. We also might want to compact the occurrence of several related terms into a single signal. This is where *filters* and *indices* come in handy!

An index like `~bullish` is just a weighted sum of terms. For example, the terms `buy`, `upgrade`, `longterm` and `added` are all contained within the `~bullish` index. We've created several public indices like these which we feel are useful in applications like stock market or consumer sentiment analysis. (Of course, users can also create their own indices.)

Let's look at the behavior of the `~bullish` and `~bearish` indices on StockTwits, a Twitter-like community around the stock market. We filter on documents containing Apple's ticker symbol "$aapl" so that the only signals we're looking at are in some way related to Apple.
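To make the "weighted sum of terms" idea concrete, here's a do-it-yourself index built from individual term signals. The terms are the `~bullish` examples above, but the weights (and the full contents of the real index) are made up for illustration, and the column-per-term layout of the result is assumed.

# A homemade index: a weighted sum of individual term signals, assuming
# metric.counts returns one column per term. Weights are placeholders; the
# real ~bullish index's weights aren't shown in these notes.
bullish.terms <- c("buy", "upgrade", "longterm", "added")
weights <- c(1, 1, 1, 1)
sig <- metric.counts(bullish.terms, src="stock_twits", start=start, end=end,
    resolution=86400, freq=T)
homemade.bullish <- data.frame(
    times = sig$times,
    value = as.vector(as.matrix(sig[, bullish.terms]) %*% weights))
head(homemade.bullish)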

Page 14

The Basics Term Co-occurrences

> aapl.sentiment <- aggregateCooccurrences("stock_twits", "$aapl", c("~bullish", "~bearish"),
    start=start, end=end, debug=TRUE, resolution=86400)

> head(aapl.sentiment)

times ~bearish ~bullish

1 2012-12-01 16:00:00 0.1398305 0.1313559

2 2012-12-02 16:00:00 0.1944719 0.1924975

3 2012-12-03 16:00:00 0.2195296 0.2074127

4 2012-12-04 16:00:00 0.2502294 0.1945549

5 2012-12-05 16:00:00 0.1986820 0.1805601

6 2012-12-06 16:00:00 0.2187758 0.1786600

> plotSignal(aapl.sentiment)

Page 15

The Basics Term Co-occurrences

![aapl sentiment](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/aapl_sentiment.png)


Page 16

Prototyping Event Analysis

How do we identify transient spikes corresponding to real-world events? Suppose we want to use only these document-count time series, and that we have a sliding history window. We might start with example data of events and compare the performance of a couple of different algorithms.

source.id,datetime,gram,event,n
twitter_sample,2012-09-12 05:45:00 -0700,apple,true,2

twitter_sample,2012-08-24 15:45:00 -0700,patent,true,1

twitter_sample,2012-10-29 08:00:00 -0700,#sandy,true,3

stock_twits,2012-10-02 08:15:00 -0700,$CMG,true,1

stock_twits,2012-09-13 09:30:00 -0700,fed,true,2

stock_twits,2012-04-11 07:00:00 -0700,lawsuit,true,1

...

Let's look more specifically at the case of the term "fed" on StockTwits. From here on we're going to be looking at some code I used to prototype the alerts feature on TopicWatch. This prototyping code is not part of TopicWatchr, but is an example application of the package.

Page 17

Prototyping Event Analysis

> ev <- read.events("data/events.csv")

> fed.freq <- get.signal("fed", "2012-09-13 09:30:00", freq=T)

> head(fed.freq)

times vals

1 2012-09-01 11:00:00 0.0000000000

2 2012-09-01 12:00:00 0.0000000000

3 2012-09-01 13:00:00 0.0009699321

4 2012-09-01 14:00:00 0.0000000000

5 2012-09-01 15:00:00 0.0000000000

6 2012-09-01 16:00:00 0.0000000000
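`read.events` and `get.signal` here are part of the prototyping code rather than TopicWatchr. For reference, minimal versions consistent with how they are used above might look roughly like this; the actual prototype may differ in its window sizes and details.

# Not TopicWatchr functions; rough stand-ins for the prototyping helpers,
# based on the CSV format on the previous page and the output shown above.
read.events <- function(path) {
  ev <- read.csv(path, stringsAsFactors=FALSE)
  ev$datetime <- as.POSIXct(ev$datetime, format="%Y-%m-%d %H:%M:%S %z")
  ev
}

get.signal <- function(term, event.time, src="stock_twits",
                       days.before=12, days.after=2, freq=TRUE) {
  centre <- as.POSIXct(event.time)
  sig <- metric.counts(term, src=src,
                       start=centre - days.before * 86400,
                       end=centre + days.after * 86400,
                       resolution=3600, freq=freq)
  names(sig) <- c("times", "vals")
  sig
}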

Page 18

Prototyping Event Analysis

![fed signal](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/fed_signal.png)

Page 19

Prototyping Event Analysis

> compute.thresholds <- function(x, window=96, t.func=compute.threshold){
    L <- length(x) - window
    t <- rep(NA, length(x))
    for(i in 1:L){
      t[window+i] <- t.func(x[(i-1):(i+window-1)])
    }
    t
  }

> z.function <- function(theta=2.5){function(x){mean(x) + theta*sd(x)}}

> max.function <- function(theta=1.0){function(x){max(x) * theta}}

> cv.function <- function(theta=1.0){function(x){mean(x) + sd(x) * (theta + (sd(x) / mean(x)))}}

> fed.freq$z <- compute.thresholds(fed.freq$vals, t.func=z.function())

> fed.freq$max <- compute.thresholds(fed.freq$vals, t.func=max.function())

> fed.freq$cv <- compute.thresholds(fed.freq$vals, t.func=cv.function())

> long.fed <- convert.to.long(fed.freq, "times")

> ggplot(long.fed) + geom_line(aes(x=times, y=value, col=variable)) +
    scale_color_manual(values=c("black", "red", "blue", "green"))
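With a threshold series in hand, flagging candidate events is just a comparison against the observed values; continuing from the fed.freq data frame above:

> fed.freq$spike <- !is.na(fed.freq$z) & fed.freq$vals > fed.freq$z   # hours above the z-score threshold
> subset(fed.freq, spike, select=c(times, vals, z))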

Page 20

Prototyping Event Analysis

![fed signal](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/fed_signal.png)

Page 21

Source Statistics

> twitter.docs <- document.submatrix("twitter_sample", end=Sys.time(), hours=8,
    to.df=FALSE)

> length(twitter.docs)

[1] 225

> twitter.docs[[1]]

best reality best reality show reality love

1 1 1 1

flava best flava of love show

1 1 1 1

reality show

1

> twitter.docterm <- submatrix.to.dataframe(twitter.docs, max.n=1)

> dim(twitter.docterm)

[1] 225 1280

> term.sums <- colSums(twitter.docterm)

> mean(term.sums)


[1] 1.283594

> max(term.sums)

[1] 14

Now we have some information about our sampling of Twitter documents. We have 225 documents, with 1280 unique terms. Right now the above function is simply grabbing 25 Twitter documents per hour over the past 8 hours.
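A quick follow-up check on the same matrix is its sparsity, i.e. the fraction of zero entries:

> sparsity <- mean(as.matrix(twitter.docterm) == 0)
> sparsity   # fraction of zero entries; expect a value close to 1 for short documents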

Page 22

Source Statistics

[Zipf's Law](http://en.wikipedia.org/wiki/Zipf's_law) is a classic finding in the field of lexical analysis.

> term.sums <- sort(term.sums, decreasing=TRUE)

> qplot(x=log(1:length(term.sums)), y=log(term.sums))

![twitter zipf](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/twitter_zipf.png)
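A quick way to check how closely the counts follow a power law is to fit a line on the log-log scale; on a sample this small the slope is only a rough indication.

> zipf.fit <- lm(log(term.sums) ~ log(seq_along(term.sums)))
> coef(zipf.fit)   # under Zipf's law the slope should be roughly -1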

Page 23

Feeling Adventurous?

Last time at LuckySort HQ: We're looking for beta testers for the R package! In Shackleton's words, what to expect:

**...BITTER COLD, LONG MONTHS OF COMPLETE DARKNESS, CONSTANT DANGER, SAFE RETURN DOUBTFUL...**

This time around we're in a slightly more stable place. There's more data, more options, and more opportunities to maybe discover some cool stuff! (Expect some darkness, minimal danger, and a shrinking population of software bugs.)

prug-topicwatchr

See also the notes used at this Portland R User Group meeting on 15 January 2013, on GitHub: https://github.com/danielfennelly/prug-topicwatchr. They cover basic usage of the TopicWatchr package to pull time-series text data from the LuckySort API, with some examples of prototyping event-detection heuristics in R.

Talk with me about it, or get in touch later at [email protected]