Dan Sullivan, Big Data TechCon Boston 2015

Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property


Page 1: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

Dan Sullivan

Big Data TechCon Boston 2015

*

Page 2: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* Emerging Demand for Text Analytics

* Text Mining Techniques

*Sentiment Analysis

*Topic Modeling

*Classification

*Named Entity Recognition

*Event Extraction

* Workflows

* Performance Considerations

Page 3: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* First commercial work in natural language processing in late 1980s

* Document Warehousing and Text Mining, 2001

* Most recent and current text mining work in life sciences area

* Classification

* Named Entity Recognition

* Event Extraction

* Contact

* [email protected]

* @dsapptech

* Linkedin.com/in/dansullivanpdx

Page 4: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

Discount Code: DATA35

• Available as book & eBook

• FREE shipping in the U.S.

• EPUB, PDF, and MOBI eBook formats provided

Also available at booksellers and online retailers – 35% off discount only good at informit.com

Page 5: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* Emerging Demand for Text Analytics

* Text Mining Techniques

*Sentiment Analysis

*Topic Modeling

*Classification

*Named Entity Recognition

*Event Extraction

* Workflows, Procedures and Governance

* Performance Considerations

Page 6: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

*Large volumes of accessible and relevant texts:

*Social media

*Email

*Patents and research

*Customer communications

* Use Cases

*Market research

*Brand monitoring

*e-Discovery

* Intellectual property management

Page 7: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

Manual procedures are time consuming and costly

Volume of literature continues to grow

Commonly used search techniques, such as keyword search, similarity search, and metadata filtering, can still yield volumes of literature that are too large to analyze manually

Popular tools have had some success but also have limitations

Page 8: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* Emerging Demand for Text Analytics

* Text Mining Techniques

*Sentiment Analysis

*Topic Modeling

*Classification

*Named Entity Recognition

*Event Extraction

* Workflows

*Performance Considerations

Page 9: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* Analysis of tone or opinion of a communication

* Polarity: text → {positive, neutral, negative}

* Categorization: text → {angry, pleased, confused …}

* Scale: text → -10 … +10

* Metadata about context essential

* subject area

* communication medium

Page 10: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

*Keywords

*Lexical Affinity

* Affective Norms for English Words (ANEW)

* Emotional Dimensions

* Arousal

* Dominance

* Valence

*Statistical Classification

*Semantic or Concept-based Classification
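A minimal sketch of the keyword and lexical-affinity idea listed above, assuming a tiny hand-made lexicon. The words and scores in `TOY_LEXICON` are hypothetical and for illustration only; a real lexicon such as ANEW is far larger and also rates arousal and dominance.

```python
import re

# Hypothetical toy lexicon: word -> valence score (illustrative values only).
TOY_LEXICON = {
    "great": 3, "love": 3, "good": 2,
    "slow": -1, "bad": -2, "terrible": -3,
}

def score_text(text, lexicon=TOY_LEXICON):
    """Sum the valence of every lexicon word found in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)

def polarity(text, lexicon=TOY_LEXICON):
    """Map the summed score to positive, neutral, or negative."""
    score = score_text(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("I love this phone, great screen"))
print(polarity("Terrible battery and bad support"))
```

A scorer like this ignores negation and context, which is one reason the statistical and concept-based approaches exist.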

Page 11: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* Use Cases

* Brand monitoring

* Competitive intelligence

* Demographic modeling

* Campaign analysis

* Tools

* RapidMiner

* ViralHeat Sentiment Analysis API

* Python NLTK

* Python TextBlob

* R sentiment package

Page 12: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* Emerging Demand for Text Analytics

* Text Mining Techniques

*Sentiment Analysis

*Topic Modeling

*Classification

*Named Entity Recognition

*Event Extraction

* Workflows, Procedures and Governance

* Performance Considerations

Page 13: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* Technique for identifying dominant themes in documents

* Does not require training

* Multiple Algorithms

* Probabilistic Latent Semantic Indexing (PLSI)

* Latent Dirichlet allocation (LDA)

*Assumptions

*Each document is about a mixture of topics

*Each word used in a document is attributable to a topic

Source: http://www.keepcalm-o-matic.co.uk/p/keep-calm-theres-no-training-today/

Page 14: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

[Figure: example topic labels: "Debt, Law, Graduation"; "Debt, EU, Greece, Euro"; "EU, Greece, Negotiations, Varoufakis"]

Source: http://www.nytimes.com/pages/business/index.html April 27, 2015

Page 15: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* Topics represented by words; documents about a set of topics

*Doc 1: 50% politics, 50% presidential

*Doc 2: 25% CPU, 30% memory, 45% I/O

*Doc 3: 30% cholesterol, 40% arteries, 30% heart

* Learning Topics

*Assign each word to a topic

*For each word and topic, compute

* Probability of topic given a document P(topic|doc)

* Probability of word given a topic P(word|topic)

* Reassign word to new topic with probability P(topic|doc) * P(word|topic)

* Reassignment based on probability that topic T generated use of word W
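The reassignment loop above can be sketched as a small collapsed Gibbs sampler. This is an illustrative sketch, not the implementation used by any of the tools named in this deck. The function name `lda_gibbs` and the smoothing constants `alpha` and `beta` are assumptions added so that no probability is ever zero.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """docs is a list of token lists; returns (assignments, doc_topic counts)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]        # words in doc d assigned to topic k
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Assign each word to a randomly chosen topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the word's current assignment from the counts.
                k_old = assignments[d][i]
                doc_topic[d][k_old] -= 1
                topic_word[k_old][w] -= 1
                topic_total[k_old] -= 1
                # Weight of each topic is proportional to
                # P(topic|doc) * P(word|topic), with smoothing.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                k_new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k_new
                doc_topic[d][k_new] += 1
                topic_word[k_new][w] += 1
                topic_total[k_new] += 1
    return assignments, doc_topic

docs = [["cpu", "memory", "cpu", "io"],
        ["heart", "arteries", "cholesterol", "heart"]]
assignments, doc_topic = lda_gibbs(docs, num_topics=2, iterations=20)
print(doc_topic)  # per-document topic counts
```

Dividing each row of `doc_topic` by the document length gives topic proportions of the kind shown in the Doc 1 to Doc 3 examples.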


Page 16: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

Image Source: David Blei, “Probabilistic Topic Models”

http://yosinski.com/mlss12/MLSS-2012-Blei-Probabilistic-Topic-Models/

Page 17: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* Use Cases

* Data exploration in large corpus

* Pre-classification analysis

* Identify dominant themes

* Tools

*Stanford Topic Modeling Toolbox

*Mallet (UMass Amherst)

*R package: topicmodels

*Python package: Gensim

Page 18: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* Sentiment Analysis

* Topic Modeling

Page 19: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* Emerging Demand for Text Analytics

* Text Mining Techniques

*Sentiment Analysis

*Topic Modeling

*Classification

*Named Entity Recognition

*Event Extraction

* Workflows

* Performance Considerations

Page 20: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

* 3 Key Components

* Data

* Representation scheme

* Algorithms

* Data

* Positive examples – Examples from representative corpus

* Negative examples – Randomly selected from same publications

* Representation

* TF-IDF

* Vector space representation

* Cosine of the angle between vectors as a measure of similarity

* Algorithms

* Supervised learning

* SVMs

* Ridge Classifier

* Perceptrons

* kNN

* SGD Classifier

* Naïve Bayes

* Random Forest

* AdaBoost
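To make the supervised learning step concrete, here is a minimal perceptron, one of the algorithms listed above, trained on sparse bag-of-words features. The function names and the four training examples are hypothetical. In practice a library such as scikit-learn would be used instead of hand-written code.

```python
def train_perceptron(examples, epochs=10):
    """examples: list of (features, label) pairs; features maps term -> value,
    label is +1 (positive example) or -1 (negative example)."""
    weights = {}
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            activation = bias + sum(weights.get(f, 0.0) * v for f, v in features.items())
            # Update only when the example is misclassified (or on the boundary).
            if label * activation <= 0:
                for f, v in features.items():
                    weights[f] = weights.get(f, 0.0) + label * v
                bias += label
    return weights, bias

def predict(features, weights, bias):
    activation = bias + sum(weights.get(f, 0.0) * v for f, v in features.items())
    return 1 if activation > 0 else -1

# Hypothetical training data: term counts per sentence.
examples = [
    ({"toxin": 1, "secreted": 1}, 1),
    ({"data": 1, "variance": 1}, -1),
    ({"toxin": 1, "protease": 1}, 1),
    ({"data": 1, "log": 1}, -1),
]
weights, bias = train_perceptron(examples)
print(predict({"toxin": 1}, weights, bias))
```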

Page 23: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

Source: Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python:

Analyzing Text with Natural Language Toolkit. http://www.nltk.org/book/

Page 24: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

A Support Vector Machine (SVM) is a large-margin classifier

Commonly used in text classification

Initial results are based on a life sciences sentence classifier

Image Source: http://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png

*

Page 25: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

* Term Frequency (TF)

tf(t,d) = # of occurrences of t in d

t is a term

d is a document

* Inverse Document Frequency (IDF)

idf(t,D) = log(N / |{d in D : t in d}|)

D is the set of documents

N is the number of documents in D

* TF-IDF = tf(t,d) * idf(t,D)

* TF-IDF is

* large when a term occurs frequently in the document but appears in few documents of the corpus

* small when the term appears in many documents
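The definitions above translate almost directly into code. In this sketch documents are assumed to be lists of tokens, and returning 0.0 for a term that appears in no document is an added assumption, since the formula is undefined in that case.

```python
import math

def tf(term, doc):
    """Number of occurrences of term in doc (a list of tokens)."""
    return doc.count(term)

def idf(term, docs):
    """log(N / number of documents that contain term)."""
    containing = sum(1 for d in docs if term in d)
    if containing == 0:
        return 0.0  # assumption: the formula is undefined for unseen terms
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [
    ["gene", "protein", "gene"],
    ["protein", "cell"],
    ["protein", "toxin"],
]
print(tf_idf("gene", docs[0], docs))     # frequent here, rare elsewhere
print(tf_idf("protein", docs[0], docs))  # appears in every document
```

Here "gene" scores 2 * log(3) because it occurs twice in the first document and in no other, while "protein" scores 0 because it appears in every document.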

*

Page 26: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

* Bag-of-words model

* Ignores structure (syntax) and meaning (semantics) of sentences

* Representation vector length is the number of unique words in the corpus

* Stemming used to remove morphological differences

* Each word is assigned an index in the representation vector, V

* The value V[i] is non-zero if the word with index i appears in the sentence represented by the vector

* The non-zero value is a function of the frequency of the word in the sentence and the frequency of the term in the corpus
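A small sketch of the bag-of-words representation described above, with two simplifying assumptions: sentences are already tokenized, and raw counts stand in for the weighted value. Stemming is omitted, and a TF-IDF weight could replace the count.

```python
def build_vocabulary(sentences):
    """Assign each unique word in the corpus an index in the vector."""
    vocab = sorted({word for sentence in sentences for word in sentence})
    return {word: i for i, word in enumerate(vocab)}

def to_vector(sentence, index):
    """Vector of word counts; word order in the sentence is ignored."""
    vector = [0] * len(index)
    for word in sentence:
        if word in index:
            vector[index[word]] += 1
    return vector

sentences = [
    ["the", "toxin", "binds"],
    ["the", "cell", "binds", "the", "toxin"],
]
index = build_vocabulary(sentences)   # binds=0, cell=1, the=2, toxin=3
print(to_vector(sentences[0], index))
print(to_vector(sentences[1], index))
```

Reordering the words of a sentence produces the same vector, which is the loss of syntactic information the slide refers to.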

*

Page 27: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

Non-VF, Predicted VF:

“Collectively, these data suggest that EPEC 30-5-1(3) translocates reduced levels of EspB into the host cell.”

“Data were log-transformed to correct for heterogeneity of the variances where necessary.”

“Subsequently, the kanamycin resistance cassette from pVK4 was cloned into the PstI site of pMP3, and the resulting plasmid pMP4 was used to target a disruption in the cesF region of EHEC strain 85-170.”

VF, Predicted Non-VF:

“Here, it is reported that the pO157-encoded Type V-secreted serine protease EspP influences the intestinal colonization of calves.”

“Here, we report that intragastric inoculation of a Shiga toxin 2 (Stx2)-producing E. coli O157:H7 clinical isolate into infant rabbits led to severe diarrhea and intestinal inflammation but no signs of HUS.”

“The DsbLI system also comprises a functional redox pair”

Page 28: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

Adding additional examples is not likely to substantially improve results, as seen in the error curve

[Figure: error curves for "All": Training Error and Validation Error (y-axis 0 to 0.5, x-axis 0 to 10,000)]

Page 29: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

8 Alternative Algorithms

Select 10,000 most important features using chi-square
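One common way to score features with chi-square is the 2x2 contingency formulation sketched below, which tests whether a term's presence is independent of the class label. This is a textbook version given as an assumption; the slide does not say which implementation was used, and library versions may compute the statistic differently.

```python
def chi_square(term, docs):
    """docs: list of (set_of_terms, label) pairs, label 1 (positive) or 0 (negative)."""
    a = sum(1 for terms, y in docs if term in terms and y == 1)      # present, positive
    b = sum(1 for terms, y in docs if term in terms and y == 0)      # present, negative
    c = sum(1 for terms, y in docs if term not in terms and y == 1)  # absent, positive
    d = sum(1 for terms, y in docs if term not in terms and y == 0)  # absent, negative
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # term is in every document or none, so it carries no signal
    return n * (a * d - b * c) ** 2 / denom

def select_top_k(docs, k):
    """Return the k terms with the highest chi-square score."""
    vocab = sorted({t for terms, _ in docs for t in terms})
    return sorted(vocab, key=lambda t: chi_square(t, docs), reverse=True)[:k]

# Hypothetical labeled documents.
docs = [
    ({"toxin", "cell"}, 1),
    ({"toxin", "gene"}, 1),
    ({"cell", "data"}, 0),
    ({"data", "gene"}, 0),
]
print(select_top_k(docs, 2))
```

In this toy corpus "toxin" and "data" each appear in only one class, so they score highest, while "cell" and "gene" are split evenly and score 0.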

Page 30: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* SAS Text Miner

* IBM Text Analytics

* Smartlogic

* Python: scikit-learn

* R: RTextTools

* R: tm

Page 31: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* Emerging Demand for Text Analytics

* Text Mining Techniques

*Sentiment Analysis

*Topic Modeling

*Classification

*Named Entity Recognition

*Event Extraction

* Workflows

* Performance Considerations

Page 32: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* Process of identifying words and phrases that refer to objects in specific categories. Also known as:

*Entity identification

*Entity extraction

*Chunking

* Two steps:

* Detect entities

* Classify entities

* Common classes of entities:

* Persons

* Organizations

* Geographic locations

* Dates

* Monetary amounts

Page 34: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* Four Broad Techniques

*Linguistic - utilize structure of sentence

* Statistical – detect patterns in training examples

* Custom patterns – regular expressions

* Dictionaries

*Challenges

*Creating training corpus

*Granularity
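Of the four techniques above, custom patterns are the easiest to show in a few lines. The two regular expressions below are hypothetical examples that cover only ISO-style dates and US-dollar amounts. Real pattern sets are much broader and usually combined with dictionaries or statistical models.

```python
import re

# Hypothetical patterns for two of the common entity classes.
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
}

def find_entities(text):
    """Detect entities and classify them by the pattern that matched,
    returned in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(found)]

print(find_entities("Acme paid $1,250,000.50 on 2015-04-27."))
```

Detection and classification happen in one step here because each pattern belongs to exactly one class; statistical approaches often treat them as separate steps.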

Page 37: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

*Use Cases

* Name normalization

* Entity correlation

*Quantified metrics based on texts

*Building block for event extraction

*Tools

* Stanford Core NLP

* OpenNLP

* Mallet

* Basis Technology

* Lexalytics

* NetOwl

* Cogito API

Page 38: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* Emerging Demand for Text Analytics

* Text Mining Techniques

*Sentiment Analysis

*Topic Modeling

*Classification

*Named Entity Recognition

*Event Extraction

* Workflows

* Performance Considerations

Page 39: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* Entities and relations between entities

* Company A acquires Company B

* Engineer A filed patent application on Topic B on Date C

*Politician P announces A on Twitter on Date B

* Assign roles to entities

* Assign subtypes

* Link to semantic data
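As a rough illustration of assigning roles to entities, the sketch below extracts the "Company A acquires Company B" pattern with a regular expression. The `Event` class and the capitalization-based entity pattern are assumptions made for this example. A real system would take entities from a named entity recognizer rather than guess from capital letters.

```python
import re
from dataclasses import dataclass

@dataclass
class Event:
    event_type: str
    roles: dict  # role name -> entity text

# Assumption: an entity is a run of capitalized words.
ENTITY = r"[A-Z]\w*(?: [A-Z]\w*)*"
ACQUISITION = re.compile(rf"(?P<acquirer>{ENTITY}) acquires (?P<target>{ENTITY})")

def extract_acquisitions(text):
    """Return one Event per 'X acquires Y' match, with roles assigned."""
    return [Event("acquisition", m.groupdict()) for m in ACQUISITION.finditer(text)]

for event in extract_acquisitions("Reports say Company A acquires Company B."):
    print(event.event_type, event.roles)
```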

Page 40: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* Brenden’s Twitter NLP Tools -

https://github.com/aritter/twitter_nlp

* Alchemy API

* Turku BioNLP Event Extraction Software

* Stanford Biomedical Event Parser

Source: Turku Event Extraction System, http://jbjorne.github.io/TEES/

Page 41: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* Classification

* Named Entity Recognition

* Event Extraction

Page 42: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* Emerging Demand for Text Analytics

* Text Mining Techniques

*Sentiment Analysis

*Topic Modeling

*Classification

*Named Entity Recognition

*Event Extraction

* Workflows

*Performance Considerations

Page 43: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* Document Collection

* Text Extraction

* Pre-processing

* Case conversion

* Punctuation removal

* Stemming

* Normalization

* N-gram analysis

* Analysis

* Term Frequency – Inverse Document Frequency

* Conditional Probabilities and Topic Models

* NER and Entity Extraction

* Integration

* Link to Structured Data

* Augment with additional semantic information

* Utilization

* Improve information retrieval

* Identify brand perception problems

* Assess likelihood of customer churn

* Predict likelihood of …
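Several of the pre-processing steps in the workflow above fit in a few lines of standard-library Python. This sketch covers case conversion, punctuation removal, and n-gram generation. Stemming and normalization are left out because they need a stemmer or domain-specific rules. Note that `string.punctuation` covers ASCII punctuation only.

```python
import string

def preprocess(text):
    """Lowercase the text, remove ASCII punctuation, and split on whitespace."""
    lowered = text.lower()
    stripped = lowered.translate(str.maketrans("", "", string.punctuation))
    return stripped.split()

def ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = preprocess("The Toxin binds, then enters the cell!")
print(tokens)
print(ngrams(tokens, 2))
```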

Collect → Extract & Pre-Process → Analyze → Integrate → Utilize

Page 44: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*Source: https://uima.apache.org/

Page 45: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* Emerging Demand for Text Analytics

* Text Mining Techniques

*Sentiment Analysis

*Topic Modeling

*Classification

*Named Entity Recognition

*Event Extraction

* Workflows

*Performance Considerations

Page 46: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* Scalability

* Multiple language support

* Quality

*Precision

*Recall

* Algorithm selection

* Reliability and timeliness of sources

* Integration rules
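The two quality measures named above can be computed from true positives, false positives, and false negatives. In this sketch, returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the slides.

```python
def precision_recall(actual, predicted, positive=1):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

actual = [1, 1, 1, 1, 0, 0]
predicted = [1, 1, 0, 0, 1, 0]
print(precision_recall(actual, predicted))
```

With these labels the classifier finds 2 of the 4 positives (recall 0.5), and 2 of its 3 positive predictions are correct (precision 2/3).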

Page 47: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

* Increase quantity of data (not always helpful; see error curves)

* Improve quality of data

* Utilize multiple supervised algorithms, ensemble and non-ensemble

* Use unlabeled data and semi-supervised techniques

* Feature Selection

* Parameter Tuning

* Feature Engineering

* Given:

* High quality data in sufficient quantity

* State of the art machine learning algorithms

* How to improve results: Change Representation?

*

Page 48: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*TF-IDF

*Loss of syntactic and semantic information

*No relation between term index and meaning

*No support for disambiguation

*Feature engineering extends the vector representation or substitutes more general terms for specific ones – a crude way to capture semantic properties

*

Ideal Representation

◦ Capture semantic similarity of words

◦ Does not require feature engineering

◦ Minimal pre-processing, e.g. no mapping to ontologies

◦ Improves precision and recall

Page 49: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*Words represented as a set of weights in a vector

*Useful properties

* Semantically similar words in close proximity

* Methods for capturing phrases, e.g. “Secretion system”

* Captures some semantic features

* Trained with

* Skip-gram or CBOW algorithms

* Text, such as PubMed abstracts and open access papers

*T. Mikolov et al. “Efficient Estimation of Word Representations in Vector Space.” 2013. http://arxiv.org/pdf/1301.3781.pdf
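"Close proximity" between word vectors is commonly measured with cosine similarity. The three-dimensional vectors below are made up for illustration. Trained skip-gram or CBOW vectors typically have a few hundred dimensions and come from a corpus, not from hand-picked numbers.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors):
    """The other word whose vector has the highest cosine similarity."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))

# Hypothetical toy vectors, not output of a trained model.
TOY_VECTORS = {
    "toxin": [0.9, 0.1, 0.0],
    "poison": [0.8, 0.2, 0.1],
    "calf": [0.0, 0.3, 0.9],
}
print(most_similar("toxin", TOY_VECTORS))
```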

Page 51: Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property

*

* “Characterization of the Affective Norms for English Words by discrete emotional categories” http://indiana.edu/~panlab/papers/SraMjaJtw_ANEW.pdf

* “New Avenues in Opinion Mining and Sentiment Analysis” http://sentic.net/new-avenues-in-opinion-mining-and-sentiment-analysis.pdf

* “Empirical Study of Topic Modeling in Twitter” http://snap.stanford.edu/soma2010/papers/soma2010_12.pdf

* “Open Domain Event Extraction from Twitter” http://turing.cs.washington.edu/papers/kdd12-ritter.pdf