Big Data Text Analytics
Victoria Loewengart and Michael Covert
August 25, 2015
© Analytics Inside 2014-2015
Advanced. Analytical. Intelligence.




Agenda

• Introductions
• What is Text Analytics / Natural Language Processing
• Why Text Analytics is a Big Data problem
• Need for TA/NLP
• Basic TA/NLP Concepts
• TA big data implementation with traditional TA technologies
• Advanced TA/NLP Concepts
  – Semantic relationships and ontologies
  – Sentiment
  – Clustering and topic extraction
• Big data topic extraction algorithms bake-off
• Summary and Conclusion


Text Analytics and Natural Language Processing

• Text analytics is another “old but now new again” trend
  – Reading and understanding text
  – Heavily reliant on machine learning
  – Areas of focus:
    • Sentiment analysis
    • Extraction of “named entities”
      – Connecting named entities through references, actions, etc.
    • Grouping documents with similar characteristics
    • Assigning documents to “topics”
    • Clustering (similarity / trending)


Definitions

• Natural Language Processing (NLP) is the understanding, analysis, manipulation, and/or generation of natural (human) languages.

• Computational Linguistics is the study of the applications of computers in processing and analyzing language, as in automatic machine translation and text analysis.

• Text Analytics is the process of deriving high-quality information from text.

• Text Mining is the process of using text analytics to discover information that was previously unknown.


Big Data and TA/NLP

• What does Big Data have to do with Text Analytics and Natural Language Processing?
  – There are now a million words in the English language and about 3.4 billion combinations (N-grams)!
    • This is clearly a Big Data problem
  – Refining and improving language recognition benefits from massive amounts of data
    • Original datasets were relatively small – 42,000 sentences
    • Newer datasets are huge – 1,000,000 sentences and more
    • Language processing is a classic “long tail” distribution
    • Google “billion word” project
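To make the N-gram idea concrete, here is a minimal Python sketch that counts word N-grams in a tiny text. The function name `count_ngrams` and the sample sentence are illustrative assumptions, not part of the original slides; real corpora would need proper tokenization.

```python
from collections import Counter


def count_ngrams(text, n):
    """Count word n-grams in text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    # Slide a window of length n across the token list.
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


sample = "the cat sat on the mat and the cat slept"  # hypothetical sample text
bigrams = count_ngrams(sample, 2)
print(bigrams.most_common(1))
```

Even in this ten-word sample most bigrams occur only once, which hints at the "long tail" behaviour mentioned above.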


Need for TA/NLP

• The need keeps growing – free text data is exploding due to social media, voice recognition, and other “automated” systems
• The belief is therefore that Big Data will provide better capabilities for understanding language.
• Machine learning has become key. Rule-based NLP is still used, but most new science is statistical.
  – 14.7 words per day are added! Rules cannot be updated fast enough.


Using Big Data Technologies

• Extracting, ingesting, digitizing, and preparing the text for mining
  – Connectivity to a broad spectrum of data sources.
  – Text ingestion and conversion.
  – Text preprocessing and preparation.
• Mapping your use cases to linguistic, statistical, trained, and unsupervised techniques
  – Text processing using linguistic rules.
  – Statistical text analysis.
  – Supervised and unsupervised techniques.
• Enriching the data and analyzing the findings
  – Post-processing and data enrichment with domain knowledge.
  – A UI for browsing, refining, and analysis.


Basic Concepts

• Information Retrieval (IR) refers to the human-computer interaction (HCI) that happens when we use a machine to search a body of information for information objects (content) that match our search query. Depending on the sophistication of the algorithm, a person's query is matched against a set of documents to find a subset of 'relevant' documents.

• Information Extraction (IE) is extraction of specific information such as Named Entities, Events, and Facts.

• Metrics are Precision, Recall, and F-Measure
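As a rough illustration of these metrics: precision is the share of retrieved items that are relevant, recall is the share of relevant items that were retrieved, and the balanced F-measure (F1) is their harmonic mean. The sketch below assumes set-based retrieval and uses hypothetical document IDs; the function name `precision_recall_f1` is not from the slides.

```python
def precision_recall_f1(relevant, retrieved):
    """Compute precision, recall, and balanced F-measure (F1) for two collections of IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    true_positives = len(relevant & retrieved)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


relevant_docs = {"d1", "d2", "d3", "d4"}  # hypothetical ground truth
retrieved_docs = {"d1", "d2", "d5"}       # hypothetical search result
p, r, f = precision_recall_f1(relevant_docs, retrieved_docs)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```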


Named Entity Extraction

• Named Entities applicable to most domains:
  – People names
  – Organization names
  – Dates
  – Locations (Countries, Cities, Continents/geographic terms)
  – Currency
• Domain specific named entities:
  – Diseases, diagnoses, procedures, body parts
  – Drugs, dosages, and usage
  – Identifiers – SSN, Driver’s license, Claim number, Domain name, URL
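Identifier-style entities such as SSNs, URLs, and currency amounts are often regular enough to catch with patterns. The following is a minimal sketch using Python's `re` module; the patterns, the `extract_entities` name, and the sample sentence are illustrative assumptions, and names of people or organizations would typically need trained models rather than regular expressions.

```python
import re

# Illustrative patterns only; production extractors need far more robust rules.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "URL": re.compile(r"https?://[^\s,;]+"),
    "CURRENCY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
}


def extract_entities(text):
    """Return (entity_type, matched_text) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(found)]


sample = "Claimant 123-45-6789 paid $1,250.00; see http://www.example.com/claim for details."
print(extract_entities(sample))
```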


The National Information Exchange Model (NIEM)

• Person, Name, Address, Phone, Identification, License, Company, Vehicle, Patient, Medical Provider, Hospital or Facility, Pharmaceutical, Diagnosis / Injury, Procedure, Pharmacy, Medical Report, Biometrics, Police Report, Coroner Report, Arrest Record, Charge, Conviction, Enforcement Agency, Alias, Observation, Weapon, Criminal Method, …
• User ID, IP Address, Network Origination, Online postings, Social Media Pages, Email, Text Messages, …
• Info Bearing Entity, Document, URL, Term, Concept, Sentiment, Security log, Web log, Asset, Asset class, HR Report, Encryption Method, …
• Financial Instrument, Event, Task, Language, Prediction, Inference, Account, Credit Card, Policy, Claim, Lien, Title


Simple TA example using MapReduce
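The slide's original illustration is not preserved in this transcript. As a stand-in, here is a minimal single-process Python sketch of the MapReduce pattern applied to word counting, a common introductory text analytics task. The function names and sample documents are assumptions; on Hadoop the map and reduce phases would run in parallel across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    return [(word.strip(".,;:!?\"'()").lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        if key:  # skip tokens that were only punctuation
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


docs = ["Big data needs text analytics.", "Text analytics needs big data!"]  # hypothetical input
word_counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in docs)))
print(word_counts)
```

Because each mapper only sees one document and each reducer only sees one key, both phases can be distributed without shared state.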


Parallelize NLP Operations


Relationships

• Relationships may occur through communication, friendship, advice, influence, or exchange. The two basic elements of a relationship network are links and nodes.

• Relationship analysis is the mapping and measuring of relationships and flows between people, groups, organizations, computers or other information/knowledge processing entities.

• Semantic relations among words can be extracted from their textual context in natural languages.

• Graphs allow us to store the relationships between entities, and algorithms allow us to interrogate these connections.


Simple Relationships - Techniques

• Simple relationships are identified through co-location.

• Co-location is the occurrence of entities together within a unit of text:
  • Sentence
  • Paragraph
  • Document

• Metadata is relevant too – coauthors

• Topics are words that are assigned to a document that relate “concepts.”
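The co-location technique described on this slide can be sketched as counting how often pairs of known entities share a sentence. This is a minimal illustration under stated assumptions: the entity list is given in advance, sentences are split naively on punctuation, and matching is by case-insensitive substring. The name `cooccurrence_counts` and the sample text are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_counts(text, entities):
    """Count how often each pair of known entities shares a sentence."""
    pair_counts = Counter()
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        present = sorted(e for e in entities if e.lower() in sentence.lower())
        pair_counts.update(combinations(present, 2))
    return pair_counts


text = ("Alice met Bob in Paris. Bob later called Carol. "
        "Alice and Bob returned to Paris together.")
pairs = cooccurrence_counts(text, ["Alice", "Bob", "Carol", "Paris"])
print(pairs.most_common())
```

Swapping the sentence splitter for a paragraph or document splitter changes the unit of text without changing the rest of the logic.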


Semantic Relationships

• Nouns are parsed into sentence structures
  – Yields <subject> <verb> <object> relationships
  – Can usually detect compound subjects and various verb inflective forms
  – Captures modifiers (adjectives and adverbs) that can be used in sentiment or inversion
• Graph analysis and graph theory now come into play
  – Processing documents and document sets typically creates a very large graph

[Figure: graph diagram. Labels: Clusters of terms; Graph structures; Central terms]
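One simple way to find "central terms" is to load the <subject> <verb> <object> triples into a graph and rank nodes by degree centrality. The sketch below assumes the triples have already been produced by a parser; the triples, function names, and the choice of degree centrality (rather than another centrality measure) are illustrative assumptions.

```python
from collections import defaultdict


def build_graph(triples):
    """Build an undirected adjacency map from (subject, verb, object) triples."""
    graph = defaultdict(set)
    for subject, _verb, obj in triples:
        graph[subject].add(obj)
        graph[obj].add(subject)
    return graph


def degree_centrality(graph):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


# Hypothetical triples, as a parser might emit them.
triples = [
    ("committee", "reviews", "threat"),
    ("committee", "briefs", "leaders"),
    ("leaders", "discuss", "security"),
    ("committee", "funds", "security"),
]
centrality = degree_centrality(build_graph(triples))
most_central = max(centrality, key=centrality.get)
print(most_central, centrality[most_central])
```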


Relationships - Example


Ontologies

• An Ontology is “a description of things that exist and how they relate to each other” (Chris Welty).
• An Ontology Model is:
  – the classification of entities and
  – modeling the relationships between those entities.


Sentiment

• An opinion is a binary expression that consists of two key components:

– A target (which we shall call “topic”, as referred to by most social analytics tools);

– A sentiment on the target/topic, often accompanied by a probability.

• Sentiment analysis on content means discerning the opinions in content and picking the mood (attitude) within those opinions.

• A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level — whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral.
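Polarity classification at the sentence level can be sketched with a small word lexicon plus a simple negation rule. This is only a toy illustration: the word lists, the `polarity` function, and the "negation flips the next sentiment word" rule are assumptions, and practical systems usually rely on trained statistical models.

```python
POSITIVE = {"good", "great", "excellent", "helpful", "love"}  # toy lexicon (assumption)
NEGATIVE = {"bad", "poor", "terrible", "slow", "hate"}        # toy lexicon (assumption)
NEGATIONS = {"not", "never", "no"}


def polarity(sentence):
    """Classify a sentence as positive, negative, or neutral by summing word scores."""
    score = 0
    negate = False
    for raw in sentence.lower().split():
        word = raw.strip(".,;:!?\"'")
        if word in NEGATIONS:
            negate = True  # invert the next sentiment-bearing word
            continue
        value = (word in POSITIVE) - (word in NEGATIVE)  # +1, -1, or 0
        if value:
            score += -value if negate else value
            negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for s in ["The service was great!", "The app is not good.", "The meeting is on Tuesday."]:
    print(s, "->", polarity(s))
```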


Classification and Clustering

• Classification / Categorization

– The task is to assign a document to one or more classes or categories. This may be done "manually" or algorithmically.

• Clustering

– Document clustering is a more specific technique for unsupervised document organization, automatic topic extraction and fast information retrieval or filtering.

• Machine learning is used

– Supervised learning trains on examples with “known results” (labels)

– Unsupervised learning finds structure in data without known results


Extracting topics

[Figure: topic extraction diagram. Labels: Documents; Clustering (LDA and CVB); Cluster ID; Probability; Topic; Term; Summary]


Extracting topics

[Figure: topic extraction output. Labels: Probability; Summary]


Document Term Matrix


Space reduction, Latent Semantic Indexing, and eigenvectors

Reveals the most important terms in a set of documents

Note that this looks just like a graph adjacency matrix!
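A small sketch may help show the connection to graphs. A document-term matrix A can be read as the adjacency matrix of a bipartite graph between documents and terms, and the product AᵀA gives term-to-term weights based on shared documents. The code below uses plain Python lists and hypothetical three-word documents; the function names are assumptions, and it does not perform the eigenvector-based space reduction mentioned on the slide.

```python
def document_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[d][t] counts term t in document d."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    index = {term: i for i, term in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for row, tokens in zip(matrix, tokenized):
        for term in tokens:
            row[index[term]] += 1
    return vocab, matrix


def term_cooccurrence(matrix):
    """Compute A^T A: entry [i][j] sums count_i * count_j over all documents."""
    n_terms = len(matrix[0]) if matrix else 0
    return [[sum(row[i] * row[j] for row in matrix) for j in range(n_terms)]
            for i in range(n_terms)]


docs = ["patient chest pain", "patient skin disease", "homeland security threat"]  # hypothetical
vocab, dtm = document_term_matrix(docs)
adjacency = term_cooccurrence(dtm)
patient = vocab.index("patient")
print(vocab)
print(adjacency[patient])
```

In the resulting term-term matrix, "patient" is linked to the healthcare terms but not to the security terms, which is the kind of structure topic extraction tries to surface.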


Open Source NLP Libraries

• OpenNLP
  – The Apache OpenNLP library is a machine learning based Java toolkit for the processing of natural language text.
• NLTK
  – A Python library that provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
• Stanford NLP
  – Java libraries for statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems.


Open Source NLP Libraries

• All provide trained machine learning models for NLP processing
• Low level building blocks that can be wrapped in “Big Data Technologies”
• Run NLP operations in parallel


Topic Extraction Algorithm Bake-Off

• We ran Mahout CVB and MLLib LDA topic extraction algorithms against the same set of 11 documents describing 1) terrorism and 2) healthcare

– Mahout runs on Hadoop MapReduce

– MLLib runs on Spark

• Documents are copied into HDFS

– Stop list is employed


Mahout – An example

Running a CVB example

# Create sequencefiles from the text files
mahout seqdirectory -i docs -o sequencefiles/ -c UTF-8 -chunk 5

# Generate vectors from sequence files and calculate the weights of the terms
mahout seq2sparse -i sequencefiles/ -o vectors/ -ow -wt tfidf -x 4800 -nv

# create matrix
mahout rowid -i vectors/tfidf-vectors -o matrix

# run cvb
mahout cvb -i matrix/matrix -o lda_output -mt lda_output/models -dt lda_output/docTopics -k 2 -nt --maxIter 10 --num_terms 10000

# dump results
mahout vectordump -i lda_output/final -d vectors/dictionary.file-0 -dt sequencefile --vectorSize 10 -sort TRUE


MLLib – example

• Running MLLib LDA example
  – ./bin/run-example mllib.LDAExample --stopwordFile stoplist/stopwords.txt docs --k 2
• Other options include
  – --maxIterations <value> – number of iterations of learning. default: 10
  – --docConcentration <value> – amount of topic smoothing to use (> 1.0) (-1=auto). default: -1.0
  – --topicConcentration <value> – amount of term (word) smoothing to use (> 1.0) (-1=auto). default: -1.0
  – --vocabSize <value> – number of distinct word types to use, chosen by frequency. (-1=all) default: 10000
  – --checkpointDir <value> – Directory for checkpointing intermediate results. Checkpointing helps with recovery and eliminates temporary shuffle files on disk. default: None
  – --checkpointInterval <value> – Iterations between each checkpoint. Only used if checkpointDir is set. default: 10


Comparison

MLLib (59.962s)
Topic 0          Topic 1
homeland         patient
committee        pain
threat           history
somalia          disease
minneapolis      chest
qaeda            upper
american         skin
leaders          sickle
radicalization   normal
security         years

Mahout (32 minutes)
Topic 0          Topic 1
al               patient
shabaab          pain
muslim           she
homeland         disease
qaeda            chest
committee        upper
american         her
u.s.             normal
our              skin
threat           pulmonary


Summary

– Text Analytics is a “new again”, important science for understanding the meaning of unstructured text

– Text Analytics is a Big Data problem

– Traditional TA techniques can be used with Big Data technologies

– Machine learning is at the core of Text Analytics

– Major Big Data technologies (Spark, Hadoop) support ML libraries for clustering and topic extraction


Questions and Answers


[email protected]@AnalyticsInside.us

http://www.AnalyticsInside.us