Cohan Sujay Carlos, CEO, Aiaioo Labs. Fun with Text: Hacking Text Analytics

Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Page 1: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Cohan Sujay Carlos
CEO, Aiaioo Labs

Fun with Text
Hacking Text Analytics

Page 2: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

What I am going to talk about.

Text Analytics
1. Examine 3 kinds of opportunities
2. Discuss 3 text analytics problems
3. Touch upon 3 things to watch out for and 3 things to embrace.

Page 5: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

What if we can master “text”? What do we get from it?

There are opportunities in every vertical:

1. Aerospace / Defense / Automotive: Filing of various routine documents / Technical specification standardization / Competitive intelligence and customer feedback management

2. Healthcare / Life sciences: Reporting / Storing relevant patents and publications / Analysis of research and competitive intelligence

3. Legal and Government: Legal and administrative filings / Case document and administrative record management / Analysis of legal and administrative documents (land records, case files)

Page 6: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

What if we can master “text”? What do we get from it?

Do you observe a pattern?

In every vertical …

Output Text / Store and Transform Text / Ingest and Analyze Text

Page 7: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

How do we unlock the value in “text”?

Output Text / Store and Transform Text / Ingest and Analyze Text

Natural Language Generation / Natural Language Understanding

Natural Language Processing (aka Text Analytics)

Page 8: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 1: Customer Service

Let’s say you have some text …

“John Chambers of Springfield, MA reported a problem with the clutch on his Ford Ranger purchased in Boston, MA in 2005.”

… and a database or spreadsheet with columns

Reporter        Location (of Reporter)        Product

… and you have to fill in the database fields from the information in the text …

Page 10: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 1: Land Records

Let’s say you have some text …

“Property K45L234 (lot 23-24) in Wake County of 3000 sq ft was sold to James Fischer on 3-30-1997 …”

… and a database or spreadsheet with columns

Title Number        Lot        County

… and you have to fill in the database fields from the information in the text …

Page 12: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 1: M&A Transactions

Let’s say you have some text …

“Acme Financials, a subsidiary of Lehman Sisters, was acquired by John Doe Corp on 5/26/2001.”

… and you have to fill in the database fields from the information in the text …

… and a database or spreadsheet with columns

Acquirer        Acquired        Date
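
A minimal Python sketch of filling those three columns from the example sentence. The regular expression and the function name `extract_acquisition` are illustrative assumptions, not something from the talk, and the pattern only covers sentences shaped like the one on the slide.

```python
import re

# Hypothetical pattern: "<Acquired>, ... was acquired by <Acquirer> on <m/d/yyyy>"
ACQUISITION = re.compile(
    r"(?P<acquired>[A-Z][\w&. ]*?),.*?was acquired by "
    r"(?P<acquirer>[A-Z][\w&. ]*?) on (?P<date>\d{1,2}/\d{1,2}/\d{4})"
)


def extract_acquisition(text):
    """Return one database row as a dict, or None if the pattern does not match."""
    match = ACQUISITION.search(text)
    if match is None:
        return None
    return {
        "Acquirer": match.group("acquirer"),
        "Acquired": match.group("acquired"),
        "Date": match.group("date"),
    }


sentence = ("Acme Financials, a subsidiary of Lehman Sisters, "
            "was acquired by John Doe Corp on 5/26/2001.")
row = extract_acquisition(sentence)
print(row)
```

A single hand-written pattern like this breaks as soon as the wording changes, which is why the later slides turn to rule frameworks and machine learning.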

Page 14: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 1: Customer Service [ Information Extraction ]

Let’s say you have some text … … and a database or spreadsheet with columns

“John Chambers of Springfield, MA reported a problem with the clutch on his Ford Ranger purchased in Boston, MA in 2005.”

Entities are pieces of text that could go into the fields in the database.

Identifying entities and the relations between them

Reporter        Location          Product
John Chambers   Springfield, MA   Ford Ranger

Page 15: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 1: Customer Service [ Information Extraction ]

Relations tell you about the connections between entities.

“John Chambers of Springfield, MA reported a problem with the clutch on his Ford Ranger purchased in Boston, MA in 2005.”

Entities are pieces of text that could go into the fields in the database.

Relations connect the entities that belong in a row.

Identifying entities and the relations between them

Reporter        Location          Product
John Chambers   Springfield, MA   Ford Ranger

Relation: Location of Reporter

Page 16: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 1: Customer Service [ Information Extraction ]

“John Chambers of Springfield, MA reported a problem with the clutch on his Ford Ranger purchased in Boston, MA in 2005.”

Information extraction converts unstructured information into structured information.

Identifying entities and the relations between them

Reporter        Location          Product
John Chambers   Springfield, MA   Ford Ranger

Page 17: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 1: Customer Service [ Information Extraction ]

“John Chambers of Springfield, MA reported a problem with the clutch on his Ford Ranger purchased in Boston, MA in 2005.”

Information extraction can improve efficiencies in processes where humans read text and copy fields into databases.

Identifying entities and the relations between them

Reporter        Location          Product
John Chambers   Springfield, MA   Ford Ranger

Page 18: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 1: Customer Service [ Information Extraction ]

How can text analytics methods be used to automate entity and relation extraction?

Rule based methods / Machine learning methods

Aiaioo Labs aiaioo.com

Page 19: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 1: Customer Service [ Information Extraction ]

Rule-based frameworks for entity and relation extraction?

http://services.gate.ac.uk/annie/

Page 21: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 1: Customer Service [ Information Extraction ]

How does GATE/Annie identify entities and the relations?

It uses lists of first names and last names of persons, and names of places … and matches them in the text …

“John Chambers of Springfield, MA reported a problem with the clutch on his Ford Ranger purchased in Boston, MA in 2005.”

First names: “Jack”, “Jill”, “John”
Last names: “Chambers”, “Miller”, “Farnsworth”
Places: “Springfield”, “Boston”, “Cambridge”
States: “MA”, “CA”, “MD”
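
The list-matching idea can be sketched in a few lines of plain Python. This is not GATE's API; it only mimics the lookup step with the four lists from the slide, and the two combination rules (first name followed by last name, place followed by state) are assumptions made for the illustration.

```python
import re

# Word lists copied from the slide.
FIRST_NAMES = {"Jack", "Jill", "John"}
LAST_NAMES = {"Chambers", "Miller", "Farnsworth"}
PLACES = {"Springfield", "Boston", "Cambridge"}
STATES = {"MA", "CA", "MD"}


def alternation(words):
    """Build a regex alternation such as 'Jack|Jill|John' from a word list."""
    return "|".join(sorted(re.escape(word) for word in words))


# Assumed rules: a person is "<first name> <last name>",
# a location is "<place>, <state>".
PERSON = re.compile(
    rf"\b(?:{alternation(FIRST_NAMES)}) (?:{alternation(LAST_NAMES)})\b")
LOCATION = re.compile(
    rf"\b(?:{alternation(PLACES)}), (?:{alternation(STATES)})\b")


def find_entities(text):
    """Return every person and location that the lists can match."""
    return {"Person": PERSON.findall(text), "Location": LOCATION.findall(text)}


sentence = ("John Chambers of Springfield, MA reported a problem with the "
            "clutch on his Ford Ranger purchased in Boston, MA in 2005.")
entities = find_entities(sentence)
print(entities)
```

Note that both "Springfield, MA" and "Boston, MA" are found; deciding which one is the reporter's location is the job of relation extraction.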

Page 22: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 1: Customer Service [ Information Extraction ]

Machine learning frameworks for entity and relation extraction?

https://opennlp.apache.org/

Apache OpenNLP

Page 23: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 1: Customer Service [ Information Extraction ]

Machine learning frameworks need training data.

https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html

Page 24: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 1: Customer Service [ Information Extraction ]

How does OpenNLP identify entities and the relations?

From examples such as:

“<START:reporter>John Archer<END> of <START:location>Maryland<END> reported a problem with his <START:product>Figo<END>.”
“<START:reporter>Vince Chambers<END> of <START:location>Denver, CO<END> had trouble with his <START:product>Focus<END>.”

It learns to recognize:

“John Chambers of Springfield, MA reported a problem with the clutch on his Ford Ranger purchased in Boston, MA in 2005.”
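
To see what such an annotated example contains, here is a small Python sketch that reads the `<START:type> … <END>` markup and lists the labelled entities. It is a stand-alone illustration, not part of OpenNLP; `parse_annotated` is a hypothetical helper. As far as I know, OpenNLP's own training files put whitespace around the tags, so the pattern accepts both spellings.

```python
import re

# Matches <START:label>entity text<END>, with or without spaces inside the tags.
TAG = re.compile(r"<START:(?P<label>\w+)>\s*(?P<text>.*?)\s*<END>")


def parse_annotated(sentence):
    """Return (plain_text, [(label, entity_text), ...]) for one training sentence."""
    entities = [(m.group("label"), m.group("text")) for m in TAG.finditer(sentence)]
    plain = TAG.sub(lambda m: m.group("text"), sentence)
    return plain, entities


example = ("<START:reporter>John Archer<END> of <START:location>Maryland<END> "
           "reported a problem with his <START:product>Figo<END>.")
plain, entities = parse_annotated(example)
print(plain)
print(entities)
```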

Page 25: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 1: Customer Service [ Information Extraction ]

How to choose between text analytics methods for entity and relation extraction?

Rule based methods
  • 3 months to reasonably performing model
  • Typically higher precision
  • Typically less flexibility
  • Typically less recall

Machine learning methods
  • 1+ years to reasonably performing model
  • Typically lower precision
  • Typically more flexibility
  • Typically higher recall + overall performance

Page 26: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Can you classify these door heights as: Short / Tall ?

Door heights: 5’11”, 5’8”, 5’8”, 5’11”, 6’2”, 6’6”, 5’2”, 6’8”, 6’9”, 6’10”

Page 27: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

In analytics, an analyst comes up with a rule.

Door heights: 5’11”, 5’8”, 5’8”, 5’11”, 6’2”, 6’6”, 5’2”, 6’8”, 6’9”, 6’10”

If door_height < 6’ then Short else Tall
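
The analyst's rule, written out as a short Python sketch. Representing a height as a (feet, inches) pair and the function names are choices made for this illustration.

```python
def to_inches(feet, inches):
    """Convert a height given in feet and inches to inches."""
    return feet * 12 + inches


def classify_door(feet, inches):
    """Hand-written rule from the slide: under 6 feet is Short, otherwise Tall."""
    return "Short" if to_inches(feet, inches) < to_inches(6, 0) else "Tall"


# Door heights from the slide, as (feet, inches).
doors = [(5, 11), (5, 8), (5, 8), (5, 11), (6, 2),
         (6, 6), (5, 2), (6, 8), (6, 9), (6, 10)]
labels = [classify_door(feet, inches) for feet, inches in doors]
print(list(zip(doors, labels)))
```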

Page 28: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

In machine learning, the computer comes up with a rule from examples.

Door heights: 5’11”, 5’8”, 5’8”, 5’11”, 6’2”, 6’6”, 5’2”, 6’8”, 6’9”, 6’10”
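
One very simple way a computer can come up with such a rule is to try every cut-off between neighbouring heights and keep the one that makes the fewest mistakes on labelled examples. The Python sketch below does that. The heights are the ones on the slide (in inches), but the Short / Tall labels attached to them are assumed for the illustration, since the slide does not give any.

```python
def learn_threshold(examples):
    """Pick the cut-off (in inches) with the fewest errors on the examples.

    examples: list of (height_in_inches, label), label is "Short" or "Tall".
    """
    heights = sorted({height for height, _ in examples})
    # Candidate cut-offs: midpoints between neighbouring distinct heights.
    candidates = [(low + high) / 2 for low, high in zip(heights, heights[1:])]

    def errors(threshold):
        return sum(predict(threshold, height) != label for height, label in examples)

    return min(candidates, key=errors)


def predict(threshold, height):
    """Apply the learned rule."""
    return "Short" if height < threshold else "Tall"


# Assumed labels for the slide's heights (62 in = 5'2", 82 in = 6'10").
examples = [(62, "Short"), (68, "Short"), (71, "Short"), (74, "Tall"),
            (78, "Tall"), (80, "Tall"), (81, "Tall"), (82, "Tall")]
threshold = learn_threshold(examples)
print(threshold, predict(threshold, 70), predict(threshold, 75))
```

With these labels the learned cut-off falls between 5’11” and 6’2”, close to the analyst's 6’ rule, but it came from the examples rather than from a person.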

Page 29: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

How do we unlock the value in “text”?

The first use case …

Output Text / Store and Transform Text / Ingest and Analyze Text

Information Extraction: Identifying entities and the relations between them

Page 30: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

How do we unlock the value in “text”?

The second use case …

Output Text / Store and Transform Text / Ingest and Analyze Text

Text Categorization: Labeling text with one or more category labels

Page 31: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 2: Organizing Text for Storage

Let’s say you have some text …

“John Chambers of Springfield, MA reported a problem with the clutch on his Ford Ranger purchased in Boston, MA in 2005.”

… and you want to mark it as one of …

Report / Inquiry

Page 32: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 2: Organizing Text [ Text Categorization ]

Start by collecting some samples of documents of each of your categories

Report                     Inquiry
I have a problem           Where can I buy a
This complaint is about    Do you sell furniture

Page 33: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 2: Organizing Text [ Text Categorization ]

Train a classifier with them.

Report                     Inquiry
I have a problem           Where can I buy a
This complaint is about    Do you sell furniture

Page 34: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 2: Organizing Text [ Text Categorization ]

Start by collecting some samples of documents of each of your categories

Politics                 Sports
The United Nations       Manchester United
The United States and    Manchester and Barca

Page 35: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 2: Organizing Text [ Text Categorization ]

Train a classifier with them.

Politics                 Sports
The United Nations       Manchester United
The United States and    Manchester and Barca

Page 36: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 2: Organizing Text [ Text Categorization ]

Run the classifier on a new piece of text.

New text: Nations and States

The classifier will return a label.

Label: Politics
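
The collect / train / run steps can be tried end to end with a tiny classifier. The slides do not say which classifier to use; the Python sketch below assumes a multinomial naive Bayes with add-one smoothing, which is one common choice, and the class name `NaiveBayes` is made up for the illustration. The training samples are the four snippets from the slides.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Minimal multinomial naive Bayes over lower-cased, whitespace-split words."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of samples
        self.vocab = set()

    def train(self, samples):
        for text, label in samples:
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())

        def score(label):
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocab)
            log_prob = math.log(self.doc_counts[label] / total_docs)
            for token in text.lower().split():
                # Add-one smoothing so unseen words do not zero out the score.
                log_prob += math.log((counts[token] + 1) / denominator)
            return log_prob

        return max(self.doc_counts, key=score)


samples = [
    ("The United Nations", "Politics"),
    ("The United States and", "Politics"),
    ("Manchester United", "Sports"),
    ("Manchester and Barca", "Sports"),
]
classifier = NaiveBayes()
classifier.train(samples)
label = classifier.classify("Nations and States")
print(label)
```

Four samples are enough to show the mechanics, nothing more; a usable classifier needs many more examples per category.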

Page 37: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 2: Organizing Text [ Text Categorization ]

How can text analytics methods be used to automate organization/categorization?

Rule based methods / Machine learning methods

Page 38: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 2: Organizing Text [ Text Categorization ]

But rule-based methods work for classification too.

Rule-based text categorization is often used in: Social media sentiment classification

Page 39: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 2: Organizing Text [ Text Categorization ]

How do we use rules to identify sentiment?

We use lists of negative and positive words (usually adjectives) (available in the AFINN gazetteer) … and match them in the text …

“I am sad that Steve Jobs died.”

Negative words: “sad”, “bad”, “evil”, “distraught”, “dead”, “died”
Positive words: “thrilled”, “excited”, “amazed”, “happy”, “love”, “joy”
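
A minimal Python sketch of word-list sentiment matching, using only the twelve words on the slide. Every word counts as plus or minus one here, which is a simplification assumed for the illustration; the real AFINN list assigns graded scores to each word.

```python
import re

# Word lists copied from the slide.
NEGATIVE = {"sad", "bad", "evil", "distraught", "dead", "died"}
POSITIVE = {"thrilled", "excited", "amazed", "happy", "love", "joy"}


def sentiment(text):
    """Count positive and negative word matches and return a label."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


label = sentiment("I am sad that Steve Jobs died.")
print(label)
```

Word matching labels the example sentence as negative because "sad" and "died" both match, even though the writer is not expressing a negative opinion of Steve Jobs.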

Page 40: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 2: Organizing Text [ Text Categorization ]

Can we use entity and relation extraction to do better?

“I am sad that [Steve Jobs died].”

The –ve entity ‘sad’ is related to the –ve event ‘Steve Jobs died’.

Analysis: This person holds a positive opinion of Steve Jobs

Page 41: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 2: Organizing Text [ Text Categorization ]

How to choose between text analytics methods for text categorization?

Rule based methods
  • Typically higher precision
  • Typically less flexibility
  • Typically less recall

Machine learning methods
  • Typically lower precision
  • Typically more flexibility
  • Typically higher recall + overall performance

Page 44: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

How do we unlock the value in “text”?

The third use case …

Output Text / Store and Transform Text / Ingest and Analyze Text

Question Answering: Generating a response to an inquiry

Page 45: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 3: Answering Questions

Let’s say you get a question …

“Do you ship your cars to Boston, MA?”

… and you want the answer to be one of …

Yes / No

Page 46: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 3: Answering Questions

First you classify the question into one of 3 types… and these are…

  • Yes/No questions: “Do you ship your cars to Boston, MA?”
  • Factoid questions: “Who is the CEO of Apple?”
  • Non-factoid questions: “Why is the sky blue?”
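
A rule-based first cut at this classification step, as a Python sketch. Deciding the type from the first word of the question is an assumption made for the illustration, and the two word lists are hypothetical and far from complete.

```python
# Hypothetical starter-word lists; real questions need far richer rules.
YES_NO_STARTERS = {"do", "does", "did", "is", "are", "was", "were",
                   "can", "could", "will", "would", "should", "have", "has"}
FACTOID_STARTERS = {"who", "what", "when", "where", "which"}


def question_type(question):
    """Classify a question as Yes/No, Factoid or Non-factoid by its first word."""
    words = question.strip().lower().split()
    if not words:
        return "Non-factoid"
    if words[0] in YES_NO_STARTERS:
        return "Yes/No"
    if words[0] in FACTOID_STARTERS:
        return "Factoid"
    return "Non-factoid"


questions = ["Do you ship your cars to Boston, MA?",
             "Who is the CEO of Apple?",
             "Why is the sky blue?"]
types = [question_type(q) for q in questions]
print(types)
```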

Page 47: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

Use Case 3: Answering Questions

Look for answers in databases that you created using entity / relationship extraction

“Do you ship your cars to Boston, MA?”

Product   Ships To
Cars      USA

“Who is the CEO of Apple?”

CEO        Firm
Tim Cook   Apple

“Why is the sky blue?”
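
A Python sketch of the lookup step, with the two tables from the slide held as dictionaries. The question patterns and the `STATE_COUNTRY` helper table (needed to connect "Boston, MA" to "USA") are assumptions added for the illustration; the slide shows neither.

```python
import re

# Tables from the slide, as they might look after information extraction.
SHIPS_TO = {"cars": {"USA"}}      # Product -> regions shipped to
CEO_OF = {"Apple": "Tim Cook"}    # Firm -> CEO

# Assumed helper table, not on the slide: which country a US state code is in.
STATE_COUNTRY = {"MA": "USA"}

CEO_QUESTION = re.compile(r"Who is the CEO of (?P<firm>[\w ]+)\?")
SHIP_QUESTION = re.compile(
    r"Do you ship your (?P<product>\w+) to (?P<city>[\w ]+), (?P<state>[A-Z]{2})\?")


def answer(question):
    """Answer the two question shapes on the slide, else return 'Unknown'."""
    match = CEO_QUESTION.match(question)
    if match:
        return CEO_OF.get(match.group("firm"), "Unknown")
    match = SHIP_QUESTION.match(question)
    if match:
        country = STATE_COUNTRY.get(match.group("state"))
        regions = SHIPS_TO.get(match.group("product").lower(), set())
        return "Yes" if country in regions else "No"
    return "Unknown"


print(answer("Do you ship your cars to Boston, MA?"))
print(answer("Who is the CEO of Apple?"))
print(answer("Why is the sky blue?"))
```

The non-factoid question has no row to look up, so the sketch gives up on it, which matches the slide: only the first two questions have tables.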

Page 48: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

To watch out for:

Text Analytics Traps
1. Testing on Training Data
2. Using US Training Data for India
3. Treating all Data Sources as One
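
The first trap is easy to demonstrate with a deliberately bad model. The Python sketch below uses a hypothetical `Memorizer` that only remembers its training samples. The training snippets are the Report / Inquiry samples from the earlier slide; the two held-out sentences are made up for the illustration.

```python
class Memorizer:
    """A deliberately weak model: remembers training texts, guesses otherwise."""

    def __init__(self, default):
        self.seen = {}
        self.default = default

    def train(self, samples):
        self.seen = dict(samples)

    def classify(self, text):
        return self.seen.get(text, self.default)


def accuracy(model, samples):
    """Fraction of samples the model labels correctly."""
    return sum(model.classify(text) == label for text, label in samples) / len(samples)


training = [("I have a problem", "Report"),
            ("This complaint is about", "Report"),
            ("Where can I buy a", "Inquiry"),
            ("Do you sell furniture", "Inquiry")]
# Hypothetical held-out sentences the model has never seen.
held_out = [("My clutch has a problem", "Report"),
            ("Do you sell cars", "Inquiry")]

model = Memorizer(default="Report")
model.train(training)
train_accuracy = accuracy(model, training)
test_accuracy = accuracy(model, held_out)
print(train_accuracy, test_accuracy)
```

Measured on its own training data the model looks perfect; on text it has not seen it does no better than always answering "Report".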

Page 49: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

To embrace:

Text Analytics Tricks
1. UI Compensation for AI Inaccuracy
2. Raising Precision at the Cost of Recall
3. Domain Specific Rules
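
One common way to raise precision at the cost of recall is to keep only the outputs the system is most confident about. The Python sketch below assumes the extractor returns a confidence score with each candidate; the scores and correctness flags are invented for the illustration.

```python
def precision_recall(scored, threshold):
    """Precision and recall when only candidates at or above threshold are kept.

    scored: list of (confidence, is_correct) pairs, one per candidate.
    """
    kept = [is_correct for confidence, is_correct in scored if confidence >= threshold]
    true_positives = sum(kept)
    all_correct = sum(is_correct for _, is_correct in scored)
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / all_correct if all_correct else 1.0
    return precision, recall


# Hypothetical candidates: (confidence, whether the candidate is really correct).
scored = [(0.95, True), (0.90, True), (0.80, False),
          (0.70, True), (0.60, False), (0.40, True)]

loose = precision_recall(scored, threshold=0.5)
strict = precision_recall(scored, threshold=0.85)
print(loose, strict)
```

Moving the threshold from 0.5 to 0.85 lifts precision from 0.6 to 1.0 on this toy data while recall drops from 0.75 to 0.5.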

Page 50: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

About Aiaioo Labs

AI Research Lab
1. http://aiaioo.com
2. http://aiaioo.com/publications
3. http://aiaioo.wordpress.com

Page 51: Nasscom Big Data and Analytics Summit 2016:Technology Track: Fun with Text - Hacking Text Analytics

THANK YOU
