
August 25, 2015

© Analytics Inside 2014-2015

Advanced.

Analytical.

Intelligence.

Big Data Text Analytics

Victoria Loewengart and Michael Covert

Agenda

• Introductions

• What is Text Analytics / Natural Language Processing

• Why Text Analytics is a Big Data problem

• Need for TA/NLP

• Basic TA/NLP Concepts

• TA big data implementation with traditional TA technologies

• Advanced TA/NLP Concepts

– Semantic relationships and ontologies

– Sentiment

– Clustering and topic extraction

• Big data topic extraction algorithms bake-off

• Summary and Conclusion


• Text analytics is another “old but now new again” trend

– Reading and understanding text

– Heavily reliant on machine learning

– Areas of focus:

• Sentiment analysis

• Extraction of “named entities”

– Connecting named entities through references, actions, etc.

• Grouping documents with similar characteristics

• Assigning documents to “topics”

• Clustering (similarity / trending)


Text Analytics and Natural Language Processing

Definitions

• Natural Language Processing (NLP) is understanding, analysis, manipulation, and/or generation of natural (spoken) languages.

• Computational Linguistics is the study of the applications of computers in processing and analyzing language, as in automatic machine translation and text analysis.

• Text Analytics is the process of deriving high-quality information from text.

• Text Mining is the process of discovering previously unknown information through text analytics.


• What does Big Data have to do with Text Analytics and Natural Language Processing?

– There are now a million words in the English language and about 3.4 billion combinations (N-grams)!

• This is clearly a Big Data problem

– Refining and improving language recognition benefits from massive amounts of data

• Original datasets were relatively small – 42,000 sentences

• Newer datasets are huge – 1,000,000 sentences and more

• Language processing is a classic “long tail” distribution

• Google “billion word” project
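To make the N-gram and "long tail" points concrete, here is a minimal sketch that counts bigrams in a toy sentence. The sentence, the whitespace tokenizer, and the helper name `ngrams` are illustrative assumptions, not part of the original deck.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "the cat sat on the mat and the cat slept"
tokens = text.split()  # naive whitespace tokenizer, for illustration only

bigram_counts = Counter(ngrams(tokens, 2))

# A few bigrams repeat, while most occur exactly once (the long tail).
singletons = sum(1 for count in bigram_counts.values() if count == 1)
print(bigram_counts.most_common(1), singletons)
```

Even in this tiny sample most bigrams appear only once, which is why larger corpora are needed to observe rare combinations at all.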


Big Data and TA/NLP

• More and more need exists – free text data is exploding due to social media, voice recognition, and other “automated” systems

• The belief is therefore that Big Data will provide better capabilities for understanding language.

• Machine learning has become key. Rule-based NLP is still used, but most new science is statistical.

– 14.7 words per day are added! Rules cannot be updated fast enough.


Need for TA/NLP

Using Big Data Technologies

• Extracting, ingesting, digitizing, and preparing the text for mining

– Connectivity to a broad spectrum of data sources.

– Text ingestion and conversion.

– Text preprocessing and preparation.

• Mapping your use cases to linguistic, statistical, trained, and unsupervised techniques

– Text processing using linguistic rules.

– Statistical text analysis.

– Supervised and unsupervised techniques.

• Enriching the data and analyzing the findings

– Post-processing and data enrichment with domain knowledge.

– A UI for browsing, refining, and analysis.


Basic Concepts

• Information Retrieval (IR) refers to the human-computer interaction (HCI) that happens when we use a machine to search a body of information for information objects (content) that match our search query. Depending on the sophistication of the algorithm, a person's query is matched against a set of documents to find a subset of 'relevant' documents.

• Information Extraction (IE) is extraction of specific information such as Named Entities, Events, and Facts.

• Metrics are Precision, Recall, and F-Measure
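A minimal sketch of how these three metrics are computed for an extraction task, using the standard definitions. The entity sets and the function name `precision_recall_f1` are made up for illustration.

```python
def precision_recall_f1(predicted, actual):
    """Compare a set of extracted items against a gold-standard set."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Precision: what fraction of the extracted items are correct?
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: what fraction of the correct items were extracted?
    recall = true_positives / len(actual) if actual else 0.0
    # F-measure (F1): harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

extracted = {"Alice", "Acme Corp", "Paris", "Monday"}     # hypothetical system output
gold = {"Alice", "Acme Corp", "Paris", "Bob", "London"}   # hypothetical gold standard
p, r, f = precision_recall_f1(extracted, gold)
print(p, r, f)
```

Here three of the four extracted items are correct (precision 0.75) and three of the five gold items were found (recall 0.6).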



• Named Entities applicable to most domains:

– People names

– Organization names

– Dates

– Locations (Countries, Cities, Continents/geographic terms)

– Currency

• Domain-specific named entities:

– Diseases, diagnoses, procedures, body parts

– Drugs, dosages, and usage

– Identifiers – SSN, Driver’s license, Claim number, Domain name, URL
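Identifier-style entities are often regular enough to be found with patterns rather than trained models. The following is a rough sketch under that assumption. The three regular expressions are simplified illustrations (US-style SSN and date formats, a loose URL pattern), and `extract_entities` is a hypothetical helper, not an API of any library named in this deck.

```python
import re

# Simplified, illustrative patterns; real-world formats vary far more.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "URL": re.compile(r"https?://[^\s,;]+"),
}

def extract_entities(text):
    """Return (label, matched text) pairs for every pattern hit."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group()))
    return found

sample = ("Claimant SSN 123-45-6789 filed on 8/25/2015; "
          "see http://www.example.com/claim for details.")
entities = extract_entities(sample)
print(entities)
```

People, organization, and disease names generally need the statistical or dictionary-based techniques discussed elsewhere in the deck; patterns like these cover only the rigidly formatted identifiers.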

Named Entity Extraction

The National Information Exchange Model (NIEM)

[Figure: entity types shown in groups]

• Person, Name, Address, Phone, Identification, License, Company, Vehicle, …

• Patient, Medical Provider, Hospital or Facility, Pharmaceutical, Diagnosis / Injury, Procedure, Pharmacy, Medical Report, Biometrics, …

• Police Report, Coroner Report, Arrest Record, Charge, Conviction, Enforcement Agency, Alias, Observation, Weapon, Criminal Method, …

• User ID, IP Address, Network Origination, Online postings, Social Media Pages, Email, Text Messages, …

• Info Bearing Entity, Document, URL, Term, Concept, Sentiment, …

• Security log, Web log, Asset, Asset class, HR Report, Encryption Method, …

• Financial Instrument, Event, Task, Language, Prediction, Inference, …

• Account, Credit Card, Policy, Claim, Lien, Title, …


Simple TA example using MapReduce
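A minimal sketch of what a simple MapReduce text analytics job might look like, assuming term counting as the task. The task choice, the toy documents, and the function names are illustrative assumptions, and the three phases run in a single Python process here rather than on Hadoop.

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc_id, text):
    """Mapper: emit a (term, 1) pair for every token in one document."""
    return [(token.lower(), 1) for token in text.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(term, counts):
    """Reducer: sum the partial counts for one term."""
    return term, sum(counts)

docs = {
    "d1": "Big data text analytics",
    "d2": "Text analytics at scale",
    "d3": "big data",
}
mapped = chain.from_iterable(map_phase(d, t) for d, t in docs.items())
term_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(term_counts)
```

Because each mapper call sees only one document and each reducer call sees only one term, both phases can be spread across many machines.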


Parallelize NLP Operations

Relationships

• Relationships may occur through communication, friendship, advice, influence, or exchange. The two basic elements of a relationship network are links and nodes.

• Relationship analysis is the mapping and measuring of relationships and flows between people, groups, organizations, computers or other information/knowledge processing entities.

• Semantic relations among words can be extracted from their textual context in natural languages.

• Graphs allow us to store the relationships between entities, and algorithms allow us to interrogate these connections.


Simple Relationships - Techniques

• Simple relationships are identified through co-location.

• Co-location is occurrence together within a unit of text:

• Sentence

• Paragraph

• Document

• Metadata is relevant too – coauthors

• Topics are words assigned to a document that relate it to “concepts.”
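Co-location at the sentence level can be sketched as counting which entities appear in the same sentence. The entity lists below are assumed to come from an earlier extraction step, and both the data and the function name `cooccurrences` are illustrative.

```python
from collections import Counter
from itertools import combinations

def cooccurrences(units):
    """Count unordered entity pairs that appear in the same unit of text."""
    pairs = Counter()
    for entities in units:
        # sorted(set(...)) makes each pair unique and order-independent
        for a, b in combinations(sorted(set(entities)), 2):
            pairs[(a, b)] += 1
    return pairs

# One list of already-extracted entities per sentence (hypothetical data).
sentences = [
    ["Alice", "Bob", "Acme"],
    ["Alice", "Acme"],
    ["Bob", "Carol"],
]
pair_counts = cooccurrences(sentences)
print(pair_counts.most_common(1))
```

Swapping sentences for paragraphs or whole documents changes only the unit passed in; the counting logic stays the same.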



• Nouns are parsed into sentence structures

– Yields <subject> <verb> <object> relationships

– Can usually detect compound subjects and various inflected verb forms

– Captures modifiers (adjectives and adverbs) that can be used in sentiment or inversion

• Graph analysis and graph theory now come into play

– Processing documents and document sets typically creates a very large graph
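Once <subject> <verb> <object> triples exist, they can be loaded into a graph and queried. A minimal sketch, assuming the triples have already been produced by a parser; the triples are invented, and degree centrality is used here as one simple way to surface central terms.

```python
from collections import defaultdict

# Hypothetical parser output: (subject, verb, object) triples.
triples = [
    ("Alice", "advises", "Bob"),
    ("Bob", "emails", "Carol"),
    ("Alice", "emails", "Carol"),
    ("Dave", "funds", "Alice"),
]

# Undirected adjacency sets: nodes are entities, links are relationships.
graph = defaultdict(set)
for subject, verb, obj in triples:
    graph[subject].add(obj)
    graph[obj].add(subject)

def degree_centrality(adjacency):
    """Fraction of the other nodes each node is directly linked to."""
    n = len(adjacency)
    if n < 2:
        return {node: 0.0 for node in adjacency}
    return {node: len(neighbors) / (n - 1) for node, neighbors in adjacency.items()}

centrality = degree_centrality(graph)
most_central = max(centrality, key=centrality.get)
print(most_central, centrality[most_central])
```

At Big Data scale the same idea would run on a distributed graph engine rather than in an in-memory dictionary.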

Semantic Relationships

Relationships - Example

[Figure: relationship graph annotated with clusters of terms, graph structures, and central terms]



• An Ontology is “a description of things that exist and how they relate to each other” (Chris Welty).

• An Ontology Model is:

– the classification of entities and

– modeling the relationships between those entities.
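The two parts of an ontology model can be sketched with plain data structures: an is-a hierarchy for classification and a list of typed relationships. The class names and relationship names below are illustrative assumptions, not taken from NIEM or any published ontology.

```python
# Classification: each class points to its parent class.
IS_A = {
    "Hospital": "Organization",
    "Pharmacy": "Organization",
    "Patient": "Person",
    "Organization": "Entity",
    "Person": "Entity",
}

# Relationships between classes: (subject, predicate, object).
RELATIONS = [
    ("Patient", "treated_at", "Hospital"),
    ("Patient", "fills_prescription_at", "Pharmacy"),
]

def ancestors(cls):
    """Walk up the is-a hierarchy and return every parent class in order."""
    chain = []
    while cls in IS_A:
        cls = IS_A[cls]
        chain.append(cls)
    return chain

def is_a(cls, candidate):
    """True if cls is the candidate class or one of its descendants."""
    return cls == candidate or candidate in ancestors(cls)

def related(subject):
    """Return the (predicate, object) pairs defined for a class."""
    return [(p, o) for s, p, o in RELATIONS if s == subject]

print(ancestors("Hospital"), related("Patient"))
```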

Ontologies

Sentiment

• An opinion is a binary expression that consists of two key components:

– A target (which we shall call “topic”, as referred to by most social analytics tools);

– A sentiment on the target/topic, often accompanied by a probability.

• Sentiment analysis on content means discerning the opinions in content and picking the mood (attitude) within those opinions.

• A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level — whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral.
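Sentence-level polarity classification can be sketched with a small word list. This is a deliberately simple rule-based stand-in, not the statistical approach most tools use. The lexicon, the negation words, and the one-word negation scope are all assumptions made for illustration.

```python
# Tiny illustrative lexicon: +1 for positive words, -1 for negative words.
LEXICON = {"good": 1, "great": 1, "excellent": 1,
           "bad": -1, "poor": -1, "terrible": -1}
NEGATIONS = {"not", "never", "no"}

def polarity(sentence):
    """Classify a sentence as positive, negative, or neutral."""
    score = 0
    negate = False
    for token in sentence.lower().split():
        word = token.strip(".,!?;:")
        if word in NEGATIONS:
            negate = True   # invert the polarity of the next word only
            continue
        if word in LEXICON:
            score += -LEXICON[word] if negate else LEXICON[word]
        negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("The service was great!"))
print(polarity("The food was not good."))
```

The negation handling only inverts the word immediately after "not", so a phrase like "not very good" would be misread; handling such cases is where parsed modifiers and trained models help.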


Classification and Clustering

• Classification / Categorization

– The task is to assign a document to one or more classes or categories. This may be done "manually" or algorithmically.

• Clustering

– Document clustering is a more specific technique for unsupervised document organization, automatic topic extraction and fast information retrieval or filtering.

• Machine learning is used

– Supervised uses “known results”

– Unsupervised finds results from the unknown
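The supervised case can be sketched as a nearest-centroid classifier: labeled examples ("known results") define one term-count profile per class, and a new document goes to the most similar profile. The training sentences, labels, and function names are invented for illustration, and this is only one of many possible classifiers.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def train(labeled_docs):
    """Sum the term counts per label; cosine similarity ignores the scale."""
    centroids = {}
    for text, label in labeled_docs:
        centroids.setdefault(label, Counter()).update(vectorize(text))
    return centroids

def classify(text, centroids):
    """Assign the label whose centroid is most similar to the document."""
    vec = vectorize(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

training = [  # hypothetical labeled examples
    ("patient reports chest pain", "healthcare"),
    ("patient history of skin disease", "healthcare"),
    ("committee hearing on homeland security", "terrorism"),
    ("threat to homeland security", "terrorism"),
]
centroids = train(training)
label = classify("chest pain and skin rash", centroids)
print(label)
```

An unsupervised method would start from the same vectors but without the labels, grouping documents purely by their similarity to each other.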


Extracting topics


[Figure: Documents are processed by Clustering and by LDA and CVB; the outputs shown are Cluster IDs and Topics, each with a Probability, feeding a Summary.]

Extracting topics

[Figure: Terms, each with a Probability, feeding a Summary.]

Document Term Matrix


Space reduction, Latent Semantic Indexing, and eigenvectors

Reveals the most important terms in a set of documents

Note that this looks just like a graph adjacency matrix!
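A minimal sketch of building a document-term matrix and one way to read the adjacency-matrix remark: multiplying the matrix by its transpose gives term-by-term co-occurrence counts, which is the adjacency matrix of a term graph. The three toy documents are invented, and the eigenvector-based space reduction itself is not shown.

```python
docs = [
    "patient chest pain",
    "patient skin disease",
    "homeland security threat",
]

# Columns: the sorted vocabulary. Rows: one per document.
vocab = sorted({term for doc in docs for term in doc.split()})
dtm = [[doc.split().count(term) for term in vocab] for doc in docs]

# Term-term matrix (A^T A): entry [i][j] counts how often terms i and j
# occur in the same document, weighted by their counts.
n = len(vocab)
term_term = [[sum(row[i] * row[j] for row in dtm) for j in range(n)]
             for i in range(n)]

patient, chest, homeland = (vocab.index(t) for t in ("patient", "chest", "homeland"))
print(term_term[patient][chest], term_term[patient][homeland])
```

Terms from the healthcare documents link to each other and never to the terrorism terms, so the two subject areas already appear as separate components of the term graph.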

• OpenNLP

– The Apache OpenNLP library is a machine learning based Java toolkit for the processing of natural language text.

• NLTK

– A Python library. Provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

• Stanford NLP

– Java libraries for statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems.


Open Source NLP Libraries

• All provide trained machine learning models for NLP processing

• Low level building blocks that can be wrapped in “Big Data Technologies”

• Run NLP operations in parallel
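The idea of wrapping a low-level NLP building block so that it runs in parallel can be sketched on a single machine. Here a thread pool stands in for a Hadoop or Spark cluster, and the trivial `tokenize` function stands in for a call into one of the libraries above; both are assumptions made to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor

def tokenize(text):
    """Stand-in for a per-document NLP operation from a library."""
    return [t.strip(".,!?").lower() for t in text.split()]

docs = [
    "Big Data, big problem.",
    "Text analytics at scale!",
    "Run NLP operations in parallel.",
]

# Each document is processed independently, so the work can be fanned out.
with ThreadPoolExecutor(max_workers=2) as pool:
    token_lists = list(pool.map(tokenize, docs))  # results keep input order

print(token_lists[0])
```

The property that matters is that no document depends on another, which is what lets a framework distribute the same function across many nodes. In CPython, threads illustrate the pattern but do not by themselves speed up CPU-bound work.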


Open Source NLP Libraries

Topic Extraction Algorithm Bake-Off

• We ran the Mahout CVB and MLlib LDA topic extraction algorithms against the same set of 11 documents describing 1) terrorism and 2) healthcare

– Mahout is Hadoop MapReduce

– MLlib is Spark

• Documents are copied into HDFS

– Stop list is employed


Mahout – An example

Running a CVB example

# Create sequencefiles from the text files
mahout seqdirectory -i docs -o sequencefiles/ -c UTF-8 -chunk 5

# Generate vectors from sequence files and calculate the weights of the terms
mahout seq2sparse -i sequencefiles/ -o vectors/ -ow -wt tfidf -x 4800 -nv

# create matrix
mahout rowid -i vectors/tfidf-vectors -o matrix

# run cvb
mahout cvb -i matrix/matrix -o lda_output -mt lda_output/models -dt lda_output/docTopics -k 2 -nt --maxIter 10 --num_terms 10000

# dump results
mahout vectordump -i lda_output/final -d vectors/dictionary.file-0 -dt sequencefile --vectorSize 10 -sort TRUE


MLlib – example

• Running MLlib LDA example

– ./bin/run-example mllib.LDAExample --stopwordFile stoplist/stopwords.txt docs --k 2

• Other options include

– --maxIterations <value>: number of iterations of learning. default: 10

– --docConcentration <value>: amount of topic smoothing to use (> 1.0) (-1=auto). default: -1.0

– --topicConcentration <value>: amount of term (word) smoothing to use (> 1.0) (-1=auto). default: -1.0

– --vocabSize <value>: number of distinct word types to use, chosen by frequency. (-1=all) default: 10000

– --checkpointDir <value>: Directory for checkpointing intermediate results. Checkpointing helps with recovery and eliminates temporary shuffle files on disk. default: None

– --checkpointInterval <value>: Iterations between each checkpoint. Only used if checkpointDir is set. default: 10


Comparison

MLlib: 59.962s

Topic 0           Topic 1
homeland          patient
committee         pain
threat            history
somalia           disease
minneapolis       chest
qaeda             upper
american          skin
leaders           sickle
radicalization    normal
security          years

Mahout: 32 minutes

Topic 0           Topic 1
al                patient
shabaab           pain
muslim            she
homeland          disease
qaeda             chest
committee         upper
american          her
u.s.              normal
our               skin
threat            pulmonary


Summary

– Text Analytics is a “new again” and important science for understanding the meaning of unstructured text

– Text Analytics is a Big Data problem

– Traditional TA techniques can be used with Big Data technologies

– Machine learning is at the core of Text Analytics

– Major Big Data technologies (Spark, Hadoop) support ML libraries for clustering and topic extraction


Questions and Answers



http://www.AnalyticsInside.us
