Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

Dan Sullivan

Big Data TechCon Boston 2015

* Emerging Demand for Text Analytics

* Text Mining Techniques

*Sentiment Analysis

*Topic Modeling

*Classification

*Named Entity Recognition

*Event Extraction

* Workflows

* Performance Considerations

*

* First commercial work in natural language processing in late 1980s

* Document Warehousing and Text Mining, 2001

* Most recent and current text mining work in life sciences area

* Classification

* Named Entity Recognition

* Event Extraction

* Contact

* [email protected]

* @dsapptech

* Linkedin.com/in/dansullivanpdx


*

Discount Code: DATA35

• Available as book & eBook

• FREE shipping in the U.S.

• EPUB, PDF, and MOBI eBook formats provided

• Also available at booksellers and online retailers; the 35% discount is only good at informit.com

* Emerging Demand for Text Analytics

*Large volumes of accessible and relevant texts:

*Social media

*Email

*Patents and research

*Customer communications

* Use Cases

*Market research

*Brand monitoring

*e-Discovery

* Intellectual property management

* Manual procedures are time-consuming and costly

* Volume of literature continues to grow

* Commonly used search techniques, such as keyword search, similarity search, and metadata filtering, can still yield more literature than can be analyzed manually

* Popular tools have had some success but also have limitations

* Sentiment Analysis

* Analysis of tone or opinion of a communication

* Polarity: text → {positive, neutral, negative}

* Categorization: text → {angry, pleased, confused …}

* Scale: text → -10 … +10

* Metadata about context essential

* subject area

* communication medium

*

*Keywords

*Lexical Affinity

* Affective Norms for English Words (ANEW)

* Emotional Dimensions

* Arousal

* Dominance

* Valence

*Statistical Classification

*Semantic or Concept-based Classification
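The keyword approach listed above can be sketched in a few lines of Python. This is a minimal illustration only: the lexicon and its weights are made up for the example (a real system might draw on a resource such as ANEW), and the function names are hypothetical.

```python
import re

# Hypothetical mini-lexicon (assumption); weights are illustrative only.
LEXICON = {"love": 2, "great": 1, "pleased": 1,
           "slow": -1, "hate": -2, "broken": -2}


def sentiment_score(text):
    """Sum the lexicon weights of the words found in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)


def polarity(text):
    """Map the numeric score onto {positive, neutral, negative}."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

The numeric score corresponds to the "scale" output and the label to the "polarity" output. Keyword scoring ignores negation and context, which is one reason metadata about subject area and medium matters.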

*

* Use Cases

* Brand monitoring

* Competitive intelligence

* Demographic modeling

* Campaign analysis

* Tools

* RapidMiner

* ViralHeat Sentiment Analysis API

* Python NLTK

* Python TextBlob

* R sentiment package

* Topic Modeling

* Technique for identifying dominant themes in documents

* Does not require training

* Multiple Algorithms

* Probabilistic Latent Semantic Indexing (PLSI)

* Latent Dirichlet allocation (LDA)

* Assumptions

* Each document is about a mixture of topics

* Each word used in a document is attributable to one of the document's topics

Source: http://www.keepcalm-o-matic.co.uk/p/keep-calm-theres-no-training-today/

Example topics (New York Times business page, April 27, 2015): “Debt, Law, Graduation”; “Debt, EU, Greece, Euro”; “EU, Greece, Negotiations, Varoufakis”

Source: http://www.nytimes.com/pages/business/index.html

*

* Topics represented by words; documents about a set of topics

*Doc 1: 50% politics, 50% presidential

*Doc 2: 25% CPU, 30% memory, 45% I/O

*Doc 3: 30% cholesterol, 40% arteries, 30% heart

* Learning Topics

*Assign each word to a topic

*For each word and topic, compute

* Probability of topic given a document P(topic|doc)

* Probability of word given a topic P(word|topic)

* Reassign word to new topic with probability P(topic|doc) * P(word|topic)

* Reassignment based on probability that topic T generated use of word W
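The learning steps above resemble a Gibbs-style sampler for LDA, and a toy version might look like the sketch below. It is an illustration under assumptions, not the method of any particular tool: the smoothing constants `alpha` and `beta`, the random initial assignment, and all names are choices made for this example.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy sampler. docs is a list of token lists; alpha/beta are assumed smoothing."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]       # words in doc d on topic k
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                     # total words per topic
    assign = []

    # Assign each word to a topic (at random to start).
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the word's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight each topic by (a value proportional to) P(topic|doc)
                # times P(word|topic).
                weights = []
                for t in range(num_topics):
                    p_topic_doc = doc_topic[d][t] + alpha
                    p_word_topic = (topic_word[t][w] + beta) / (
                        topic_total[t] + vocab_size * beta)
                    weights.append(p_topic_doc * p_word_topic)
                # Reassign the word by sampling from those weights.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return assign, doc_topic
```

After enough iterations, each row of `doc_topic` divided by the document length gives that document's approximate topic mixture.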

[Figure: topics. Image source: David Blei, “Probabilistic Topic Models”, http://yosinski.com/mlss12/MLSS-2012-Blei-Probabilistic-Topic-Models/]

*

* Use Cases

* Data exploration in large corpus

* Pre-classification analysis

* Identify dominant themes

* Tools

*Stanford Topic Modeling Toolbox

*Mallet (UMass Amherst)

*R package: topicmodels

*Python package: Gensim

*

* Sentiment Analysis

* Topic Modeling

* Classification


* 3 Key Components

* Data

* Representation scheme

* Algorithms

* Data

* Positive examples: examples from a representative corpus

* Negative examples: randomly selected from the same publications

* Representation

* TF-IDF

* Vector space representation

* Cosine of the angle between vectors as the measure of similarity

* Algorithms

* Supervised learning

* SVMs

* Ridge Classifier

* Perceptrons

* kNN

* SGD Classifier

* Naïve Bayes

* Random Forest

* AdaBoost
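As one concrete instance of the supervised algorithms listed above, a perceptron can be sketched in plain Python. This is a minimal illustration, assuming documents have already been turned into numeric feature vectors and labels are +1 or -1; the function names are hypothetical.

```python
def train_perceptron(examples, labels, epochs=10):
    """examples: list of feature vectors; labels: +1 or -1 for each example."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:
                # Misclassified (or on the boundary): move the boundary toward y.
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
    return weights, bias


def predict(weights, bias, x):
    """Return +1 or -1 depending on which side of the boundary x falls."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else -1
```

In practice the feature vectors would come from the representation step (for example TF-IDF weights), and a library implementation would normally be preferred over a hand-written loop.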

*

*

*

Source: Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. http://www.nltk.org/book/

A Support Vector Machine (SVM) is a large-margin classifier

Commonly used in text classification

Initial results based on a life sciences sentence classifier

Image Source: http://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png

*

* Term Frequency (TF)

tf(t,d) = # of occurrences of t in d

t is a term

d is a document

* Inverse Document Frequency (IDF)

idf(t,D) = log(N / |{d in D : t in d}|)

D is the set of documents

N is the number of documents in D

* TF-IDF = tf(t,d) * idf(t,D)

* TF-IDF is

* large when the term is frequent in the document and appears in few documents overall

* small when the term appears in many documents
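The definitions above translate directly into code. The sketch below is a minimal illustration that assumes documents are already tokenized into lists of terms and uses the natural logarithm; the function name is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf(t, d) = number of occurrences of t in d
        weights.append(
            {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights
```

A term that appears in every document gets idf = log(1) = 0, so its weight is zero regardless of how often it occurs.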

*

* Bag-of-words model

* Ignores structure (syntax) and meaning (semantics) of sentences

* Representation vector length is the size of set of unique words in corpus

* Stemming used to remove morphological differences

* Each word is assigned an index in the representation vector, V

* The value V[i] is non-zero if word appears in sentence represented by vector

* The non-zero value is a function of the frequency of the word in the sentence and the frequency of the term in the corpus
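A bag-of-words vector as described above might be built as in the sketch below. It is simplified on purpose: it splits on whitespace, skips stemming, and stores raw counts rather than a corpus-weighted value. All names are illustrative.

```python
import math


def build_vocabulary(sentences):
    """Assign each unique word an index in the representation vector."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def to_vector(sentence, vocab):
    """V[i] is non-zero only if word i appears in the sentence."""
    vector = [0] * len(vocab)
    for word in sentence.lower().split():
        if word in vocab:
            vector[vocab[word]] += 1
    return vector


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Because word order is discarded, two sentences with the same words in a different order map to the same vector.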

*

Non-VF, Predicted VF:

“Collectively, these data suggest that EPEC 30-5-1(3) translocates reduced levels of EspB into the host cell.”

“Data were log-transformed to correct for heterogeneity of the variances where necessary.”

“Subsequently, the kanamycin resistance cassette from pVK4 was cloned into the PstI site of pMP3, and the resulting plasmid pMP4 was used to target a disruption in the cesF region of EHEC strain 85-170.”

VF, Predicted Non-VF:

“Here, it is reported that the pO157-encoded Type V-secreted serine protease EspP influences the intestinal colonization of calves.”

“Here, we report that intragastric inoculation of a Shiga toxin 2 (Stx2)-producing E. coli O157:H7 clinical isolate into infant rabbits led to severe diarrhea and intestinal inflammation but no signs of HUS.”

“The DsbLI system also comprises a functional redox pair”

Adding more examples is not likely to substantially improve results, as the error curve shows

[Figure “All”: Training Error and Validation Error curves; y-axis from 0 to 0.5, x-axis from 0 to 10,000]

8 Alternative Algorithms

Select 10,000 most important features using chi-square
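Chi-square feature selection can be sketched as follows. This is a minimal illustration of the standard 2x2 chi-square statistic for a term and a binary class label, not the exact procedure behind the slide's results; the function names and the toy data layout (documents as sets of terms) are assumptions.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square for a 2x2 term/class table.

    n11: positive docs containing the term, n10: negative docs containing it,
    n01: positive docs without it,          n00: negative docs without it.
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator if denominator else 0.0


def select_features(docs, labels, k):
    """docs: list of term sets; labels: list of bools. Returns the top-k terms."""
    vocab = set().union(*docs)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    scores = {}
    for term in vocab:
        n11 = sum(1 for d, y in zip(docs, labels) if y and term in d)
        n10 = sum(1 for d, y in zip(docs, labels) if not y and term in d)
        scores[term] = chi_square(n11, n10, n_pos - n11, n_neg - n10)
    # Keep the k terms whose presence is most strongly associated with the label.
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

With `k=10000` this mirrors the idea of keeping the 10,000 most important features; terms that occur equally often in both classes score zero and drop out first.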

*

* SAS Text Miner

* IBM Text Analytics

* Smartlogic

* Python: scikit-learn

* R: RTextTools

* R: tm

* Named Entity Recognition

* Process of identifying words and phrases that refer to objects in specific categories. Also known as:

*Entity identification

*Entity extraction

*Chunking

* Two steps:

* Detect entities

* Classify entities

* Common classes of entities:

* Persons

* Organizations

* Geographic locations

* Dates

* Monetary amounts

*

*

* Four Broad Techniques

* Linguistic: utilize the structure of the sentence

* Statistical: detect patterns in training examples

* Custom patterns: regular expressions

* Dictionaries

*Challenges

*Creating training corpus

*Granularity
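Two of the techniques above, custom patterns and dictionaries, are simple enough to sketch directly. The regular expressions and the organization list below are hypothetical and deliberately narrow; they illustrate the detect-then-classify idea rather than a production recognizer.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Hypothetical custom patterns (assumption): each one detects and classifies.
PATTERNS = {
    "DATE": re.compile(r"\b(?:" + MONTHS + r") \d{1,2}, \d{4}\b"),
    "MONEY": re.compile(
        r"\$\d+(?:,\d{3})*(?:\.\d+)?(?: (?:million|billion))?"),
}

# Hypothetical dictionary of known organization names (assumption).
ORGANIZATIONS = {"Stanford", "New York Times"}


def find_entities(text):
    """Return (entity text, entity class) pairs found in the text."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append((match.group(), label))
    for name in ORGANIZATIONS:
        if name in text:
            entities.append((name, "ORGANIZATION"))
    return entities
```

Patterns and dictionaries are precise but brittle, which is where the linguistic and statistical techniques, and the cost of creating a training corpus, come in.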

*

*

*

*Use Cases

* Name normalization

* Entity correlation

*Quantified metrics based on texts

*Building block for event extraction

*Tools

* Stanford CoreNLP

* OpenNLP

* Mallet

* Basis Technology

* Lexalytics

* NetOwl

* Cogito API

* Event Extraction

* Entities and relations between entities

* Company A acquires Company B

* Engineer A filed patent application on Topic B on Date C

* Politician P announces A on Twitter on Date B

* Assign roles to entities

* Assign subtypes

* Link to semantic data
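The "Company A acquires Company B" example can be sketched as a pattern that finds the two entities and assigns each a role. This is a toy illustration: the regular expression, the role names, and the event type label are assumptions for the example, and real systems rely on NER and parsing rather than capitalization.

```python
import re

# Capitalized word sequences stand in for company entities (assumption).
COMPANY = r"[A-Z]\w+(?: [A-Z]\w+)*"
ACQUISITION = re.compile(
    r"(?P<acquirer>" + COMPANY + r") acquires (?P<target>" + COMPANY + r")")


def extract_acquisitions(text):
    """Return one event record per match, with a role assigned to each entity."""
    events = []
    for match in ACQUISITION.finditer(text):
        events.append({
            "type": "Acquisition",
            "acquirer": match.group("acquirer"),
            "target": match.group("target"),
        })
    return events
```

Each record could then be linked to structured data about the companies involved, as the last bullet suggests.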

*

* Brenden’s Twitter NLP Tools - https://github.com/aritter/twitter_nlp

* Alchemy API

* Turku BioNLP Event Extraction Software

* Stanford Biomedical Event Parser

Source: Turku Event Extraction System, http://jbjorne.github.io/TEES/

*

* Classification

* Named Entity Recognition

* Event Extraction

* Workflows

* Document Collection

* Text Extraction

* Pre-processing

* Case conversion

* Punctuation removal

* Stemming

* Normalization

* N-gram analysis

* Analysis

* Term Frequency – Inverse Document Frequency

* Conditional Probabilities and Topic Models

* NER and Entity Extraction

* Integration

* Link to Structured Data

* Augment with additional semantic information

* Utilization

* Improve information retrieval

* Identify brand perception problems

* Assess likelihood of customer churn

* Predict likelihood of …
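Several of the pre-processing steps in the workflow above (case conversion, punctuation removal, n-gram analysis) fit in a few lines of Python. This is a minimal sketch with hypothetical function names; stemming and normalization are left out.

```python
import string


def preprocess(text):
    """Lower-case the text, strip ASCII punctuation, and split on whitespace."""
    lowered = text.lower()
    stripped = lowered.translate(str.maketrans("", "", string.punctuation))
    return stripped.split()


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

The token lists produced here are the kind of input the analysis stage (TF-IDF, topic models, NER) would consume.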

Collect → Extract & Pre-Process → Analyze → Integrate → Utilize

*Source: https://uima.apache.org/

* Performance Considerations

* Scalability

* Multiple language support

* Quality

*Precision

*Recall

* Algorithm selection

* Reliability and timeliness of sources

* Integration rules
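Precision and recall, the two quality measures named above, can be computed from predicted and actual labels as in this short sketch. The function name and the boolean label encoding are assumptions for the example.

```python
def precision_recall(predicted, actual):
    """predicted, actual: equal-length lists of booleans (True = positive class)."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if not p and a)
    # Precision: share of predicted positives that are correct.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: share of actual positives that were found.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall
```

Tracking both matters because a classifier can trade one for the other, for example by predicting the positive class more liberally.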

* Increase quantity of data (not always helpful; see error curves)

* Improve quality of data

* Utilize multiple supervised algorithms, ensemble and non-ensemble

* Use unlabeled data and semi-supervised techniques

* Feature Selection

* Parameter Tuning

* Feature Engineering

* Given:

* High quality data in sufficient quantity

* State of the art machine learning algorithms

* How to improve results: Change Representation?

*

*TF-IDF

*Loss of syntactic and semantic information

*No relation between term index and meaning

*No support for disambiguation

*Feature engineering extends the vector representation or substitutes more general terms for specific ones, a crude way to capture semantic properties

*

* Ideal Representation

◦ Captures semantic similarity of words

◦ Does not require feature engineering

◦ Minimal pre-processing, e.g. no mapping to ontologies

◦ Improves precision and recall

* Words represented as a set of weights in a vector

* Useful properties

* Semantically similar words in close proximity

* Methods for capturing phrases, e.g. “Secretion system”

* Captures some semantic features

* Trained with

* Skip-gram or CBOW algorithms

* Text, such as PubMed abstracts and open access papers

*T. Mikolov et al. “Efficient Estimation of Word Representations in Vector Space.” 2013. http://arxiv.org/pdf/1301.3781.pdf
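To give a feel for how skip-gram training data is formed, the sketch below generates (center word, context word) pairs from a token sequence. It covers only the pair-generation step, not the training of the vectors; the function name and the `window` parameter are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Pair each word with every neighbor within `window` positions of it."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```

A skip-gram model is trained to predict the context word from the center word, which is why words that occur in similar contexts tend to end up with nearby vectors.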

*

*

* “Characterization of the Affective Norms for English Words by discrete emotional categories” http://indiana.edu/~panlab/papers/SraMjaJtw_ANEW.pdf

* “New Avenues in Opinion Mining and Sentiment Analysis” http://sentic.net/new-avenues-in-opinion-mining-and-sentiment-analysis.pdf

* “Empirical Study of Topic Modeling in Twitter” http://snap.stanford.edu/soma2010/papers/soma2010_12.pdf

* “Open Domain Event Extraction from Twitter” http://turing.cs.washington.edu/papers/kdd12-ritter.pdf
