R: Twitter/Text Mining
Setting up an API (Twitter Side)
Twitter uses an Application Programming Interface (API) that allows developers to pull Twitter data.
To setup an API, go to http://apps.twitter.com, sign in using an existing Twitter account. Once logged in, select Create New App.
Twitter has recently increased its requirements for accessing its API. You will need to complete an application for the creation of a new app, specifying the information below. Your application will go under review before being approved; in my case, this took only about 20 minutes.
Purpose of the app, e.g. advertising, research, etc.
Documentation of how the app will be used by you and by other users
Whether this work is for a government/organization or for yourself
Below is a listing of my current Twitter API connections. For the sake of discussion, we will use the DSCI210_TwitterPull API here.
After a successful application has been setup, the following information is needed to successfully access Twitter data from R.
App Name
Consumer Key
Consumer Secret
Access Token
Access Token Secret
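These values are eventually passed to create_token(). One precaution worth considering (my suggestion, not something rtweet requires) is to keep the credentials out of your scripts by storing them in environment variables; the variable name below is hypothetical.

```r
# Illustrative only: store a credential in an environment variable rather
# than hardcoding it in a script (the variable name here is hypothetical)
Sys.setenv(TWITTER_CONSUMER_KEY = "<consumer key provided by Twitter>")

# Read the value back later, e.g. when building the call to create_token()
key <- Sys.getenv("TWITTER_CONSUMER_KEY")
```

In practice you would set the variable once (for example in your .Renviron file) rather than in every script.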
Working with Twitter Pulls in R
The following packages/libraries will be used to analyze Twitter data in R.
# install rtweet from CRAN
install.packages("rtweet")

# load rtweet package
library(rtweet)
The following information is obtained from the DSCI210_TwitterPull application on Twitter side.
# Setting global values in R for connection to Twitter
create_token(
  app = "DSCI210_TwitterPull",
  consumer_key = "<consumer key provided by Twitter>",
  consumer_secret = "<consumer secret provided by Twitter>",
  access_token = "<access token provided by Twitter>",
  access_secret = "<access secret provided by Twitter>"
)
From the rtweet package, the search_tweets function can be used to obtain a Twitter pull from Twitter’s API.
Getting a Twitter pull for a recent event, say #venezuela.
# Getting a pull for #venezuela
pull <- search_tweets(
  "#venezuela",
  n = 5000,
  include_rts = FALSE
)
The object returned by this function is a data.frame.
# What is the type of this object
class(pull)
A snippet of the data…
What are the fields returned by the API?
# What are the field names returned by the API
names(pull)
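Only a handful of these fields are used in what follows (screen_name, text, source). A minimal sketch of pulling out just the columns of interest, using a toy stand-in data frame (not real Twitter data) so the idea is self-contained:

```r
# Toy stand-in for a Twitter pull; column names mirror those used below
pull_demo <- data.frame(
  screen_name = c("userA", "userB"),
  text        = c("first tweet", "second tweet"),
  source      = c("Twitter Web Client", "Twitter for iPhone"),
  stringsAsFactors = FALSE
)

# Keep just the columns of interest (base R subsetting)
pull_demo[, c("screen_name", "text")]
```

The same subsetting works on the real pull object, since it is a data.frame.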
Getting Basic Summaries
Getting a top 10 list of screen names …
# Getting dplyr library for some simple analyses
library(dplyr)

# Top 10 screen names by tweet count
(pull %>%
   group_by(screen_name) %>%
   summarise(Count = n()) %>%
   arrange(desc(Count)) %>%
   head(10)
)
A summary of source, keeping only sources that occur more than 10 times.
# What are the sources for #venezuela, keeping Count > 10
(pull %>%
   dplyr::group_by(source) %>%
   dplyr::summarise(Count = n()) %>%
   dplyr::arrange(desc(Count)) %>%
   dplyr::filter(Count > 10)
)
Next, let us consider some basic mapping…
############################################
# Prepare for mapping Twitter data
# Get maps package

## install maps from CRAN
install.packages("maps")

## load maps package
library(maps)
Obtain a Twitter search for #venezuela originating from the United States.
## search for 1,000 tweets sent from the US
pull_usa <- search_tweets(
  "#venezuela AND lang:en",
  geocode = lookup_coords("usa"),
  n = 1000
)
Some code to prepare for mapping…
# Create latitude and longitude variables using all available tweet
# and profile geo-location data
pull_usa <- lat_lng(pull_usa)

# Plotting a blank United States map
# plot state boundaries
par(mar = c(0, 0, 0, 0))
maps::map("state", lwd = .25)

# Plot lat and lng points onto the state map
with(pull_usa, points(lng, lat, pch = 20, cex = 2, col = rgb(0, .3, .7, .75)))
Creating a plot over time using the ts_plot() function. This function uses ggplot2 syntax, so ggplot2 layers can be added to it. Here, tweets are aggregated into 5-minute intervals.
#############################################
# Doing some plotting over time with the rtweet package

# Creating a time series plot using the ts_plot() function
ts_plot(pull, "5 mins") +
  ggplot2::theme_minimal() +
  ggplot2::theme(plot.title = ggplot2::element_text(face = "bold")) +
  ggplot2::labs(
    x = NULL, y = NULL,
    title = "Frequency of #venezuela tweets over last few minutes",
    subtitle = "Tweet Counts in 5 min intervals",
    caption = "\nSource: Data collected from Twitter's REST API via rtweet"
  )
Application of Basic Text Mining Procedures
First, installing some packages for basic text mining
###############################################
# Basic Text Mining Procedures
# Installing some packages for text mining

install.packages("tidytext")
install.packages("stringr")
install.packages("wordcloud2")

# Adding these libraries
library(tidytext)
library(stringr)
library(wordcloud2)
Getting a pull from Twitter for #venezuela and tweets in English…
###############################################
# Getting a pull for #venezuela in English
pull <- search_tweets(
  "#venezuela AND lang:en",
  n = 5000,
  include_rts = FALSE
)
Splitting the text of each tweet into tokens (i.e. words), so that each token is contained in its own row. The resulting table will have only one word per row.
####################################################################
# Split each token (i.e. word) of the text column into its own row
# This gives one token per row
(pull %>%
   unnest_tokens(word, text)
) -> TweetTable
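By default, unnest_tokens() also lowercases the text before splitting it into words. The same general idea can be sketched in base R (a rough approximation for intuition only, not what tidytext does internally):

```r
# Rough base-R sketch of tokenization: lowercase, then split on
# runs of characters that are not letters, digits, or apostrophes
text <- "Venezuela Crisis: protests continue"
tokens <- unlist(strsplit(tolower(text), "[^a-z0-9']+"))
tokens <- tokens[tokens != ""]  # drop any empty strings from the split
tokens
# "venezuela" "crisis" "protests" "continue"
```

unnest_tokens() does this for every row of the input table at once and stacks the results, which is why TweetTable ends up with one word per row.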
Getting rid of common English words from consideration. These are known as stop words. The data() function loads the stop_words dataset into the current R session.
# Load the stop_words dataset (list of common English words)
data("stop_words")
The following code will eliminate the stop words from the TweetTable.
# Keep all words that are *not* stop words, i.e. an anti-join is being done here
(TweetTable %>%
   anti_join(stop_words) %>%
   select(word)
) -> TweetTable_NoStopWords
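An anti-join keeps the rows of the left table that have no match in the right table. The same filtering can be sketched in base R with a toy word table and a tiny stand-in for stop_words:

```r
# Toy word table and a tiny stand-in for the stop_words dataset
words <- data.frame(word = c("crisis", "the", "maduro", "and"),
                    stringsAsFactors = FALSE)
stops <- data.frame(word = c("the", "and"),
                    stringsAsFactors = FALSE)

# Base-R equivalent of anti_join(words, stops): keep words not in stops
kept <- words[!words$word %in% stops$word, , drop = FALSE]
kept$word
# "crisis" "maduro"
```

anti_join() does the same thing, matching on the shared word column, and scales to the full stop_words list.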
Getting a word count for each of the remaining words.
#################################################
# Do a word count, keeping only the top 100 words
(TweetTable_NoStopWords %>%
   count(word, sort = TRUE) %>%
   head(100)
) -> WordCounts
Creating a simple wordcloud for this data…
################################################
# Creating the word cloud
wordcloud2(WordCounts, size = 0.7)
Additional words should be removed from this wordcloud: for example "venezuela", "https", and "t.co", as well as "amp" (an artifact of HTML-encoded ampersands).
#################################################
# Getting rid of words that dominate this wordcloud,
# i.e. words specific to this example that should be removed
(WordCounts %>%
   filter(!word %in% c('venezuela', 't.co', 'https', 'amp'))
) -> FinalWordCounts
Creating a final wordcloud for this example…
################################################
# Creating the final word cloud
wordcloud2(FinalWordCounts, size = 0.7)