Jeff Fried CTO BA Insight @jefffried #tbc2016 Rules-Based vs. Document-Based Bake-off

AutoClassification - Rules versus Machine Learning

About Jeff Fried

Focused on Search and SharePoint since 2004

Longtime Search Nerd

• CTO, BA Insight

• Senior PM, Microsoft

• VP, FAST

• SVP, LingoMotors

Passionate About

• Search

• SharePoint

• Search-driven applications

• Information Strategy

Blog: BAinsight.com/blog

TechNet column: “A View from the Crawlspace”

[email protected]

About BA Insight

– Connectivity

– Applications

– Classification

– Analytics

Metadata Drives Great User Experiences

Documents from many sources: All client or matter-relevant documents are integrated.

Rich metadata: Content annotated automatically – concepts, categories, citations, matters, clients, etc.

Navigation controls: Explore, Discover, Drill-down

Manual tagging is impractical and remarkably inconsistent

Automation

Also called: AutoClassification, AutoTagging, Metadata Generation, Text Analytics, …

Complicators

Common Techniques across Applications

Rules-based Approach

Enhanced Content: Enriched with Metadata and Content Types

Search, Visualization, Workflow

Rule-based Classifier (Example)

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
human          warm        yes         no       no             mammals
python         cold        no          no       no             reptiles
salmon         cold        no          no       yes            fishes
whale          warm        yes         no       yes            mammals
frog           cold        no          no       sometimes      amphibians
komodo         cold        no          no       no             reptiles
bat            warm        yes         yes      no             mammals
pigeon         warm        no          yes      no             birds
cat            warm        yes         no       no             mammals
leopard shark  cold        yes         no       yes            fishes
turtle         cold        no          no       sometimes      reptiles
penguin        warm        no          no       sometimes      birds
porcupine      warm        yes         no       no             mammals
eel            cold        no          no       yes            fishes
salamander     cold        no          no       sometimes      amphibians
gila monster   cold        no          no       no             reptiles
platypus       warm        no          no       no             mammals
owl            warm        no          yes      no             birds
dolphin        warm        yes         no       yes            mammals
eagle          warm        no          yes      no             birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
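
A minimal Python sketch of how rules R1 to R5 might be applied to rows of the table. The field names, the helper `animal`, and the "first matching rule wins" policy are assumptions made for illustration; the slide does not say how conflicts between rules are resolved.

```python
# Ordered rule list transcribed from R1-R5: (conditions, class).
RULES = [
    ({"give_birth": "no", "can_fly": "yes"}, "birds"),         # R1
    ({"give_birth": "no", "live_in_water": "yes"}, "fishes"),  # R2
    ({"give_birth": "yes", "blood_type": "warm"}, "mammals"),  # R3
    ({"give_birth": "no", "can_fly": "no"}, "reptiles"),       # R4
    ({"live_in_water": "sometimes"}, "amphibians"),            # R5
]


def animal(blood_type, give_birth, can_fly, live_in_water):
    """Hypothetical helper: build one table row as a dict."""
    return {
        "blood_type": blood_type,
        "give_birth": give_birth,
        "can_fly": can_fly,
        "live_in_water": live_in_water,
    }


def matching_rules(record):
    """Return the class of every rule whose conditions all hold."""
    return [
        cls
        for conditions, cls in RULES
        if all(record.get(key) == value for key, value in conditions.items())
    ]


def classify(record):
    """Assumed policy: first matching rule wins; None if no rule fires."""
    matches = matching_rules(record)
    return matches[0] if matches else None


human = animal("warm", "yes", "no", "no")
turtle = animal("cold", "no", "no", "sometimes")
leopard_shark = animal("cold", "yes", "no", "yes")

print(classify(human))          # mammals (R3)
print(matching_rules(turtle))   # ['reptiles', 'amphibians'] (R4 and R5 both fire)
print(classify(leopard_shark))  # None (no rule covers this row)
```

The turtle and leopard shark rows show two practical issues with rule sets: rules can overlap, and they can leave records uncovered.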

Example Rules Engine UI

Examples of Rules

Boolean

• “IT” OR “Information Technology” OR “MIS”

• (“Expert” OR “Witness”) NOT “police”

• “New York” AND “environmental policy”

• *work

• "legal" -briefs

• "Legal" NEAR(5) "issue"

Property-based

• filetype:docx

• title:"2029 L.P" or title:2030

• footer="BA Insight Confidential" or footer:proprietary or footer:BA*

Overriding/changing Linguistics

• NOSTEM("illumination")

• CASE("prerequisites")

• SOUNDLIKE("prerech")

Regular expressions

• title:REGEX([0-4])

• REGEX("\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))")
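
To make the semantics of a few of these rules concrete, here is a rough Python sketch. It is not the rules engine's implementation: the tokenizer, the function names, and the reading of NEAR(5) as "within five word positions" are all assumptions made for illustration.

```python
import re


def tokens(text):
    """Lowercase word tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def has_phrase(text, phrase):
    """True if the phrase occurs as a run of whole words, ignoring case."""
    words, target = tokens(text), tokens(phrase)
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


def near(text, a, b, distance):
    """True if words a and b occur within `distance` word positions."""
    words = tokens(text)
    pos_a = [i for i, w in enumerate(words) if w == a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == b.lower()]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)


def rule_expert_witness(text):
    # ("Expert" OR "Witness") NOT "police"
    hit = has_phrase(text, "expert") or has_phrase(text, "witness")
    return hit and not has_phrase(text, "police")


def rule_ny_environment(text):
    # "New York" AND "environmental policy"
    return has_phrase(text, "New York") and has_phrase(text, "environmental policy")


def rule_legal_issue(text):
    # "Legal" NEAR(5) "issue"
    return near(text, "legal", "issue", 5)


doc = "The expert witness raised a legal and regulatory issue in New York."
print(rule_expert_witness(doc))   # True
print(rule_ny_environment(doc))   # False: "environmental policy" is absent
print(rule_legal_issue(doc))      # True: the two words are 3 positions apart
```

Because this sketch lowercases everything, a rule such as “IT” would also match the pronoun "it", which is the kind of problem a case-sensitive operator like CASE(...) addresses.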

Controlling scores & thresholds

Taxonomy Management is often included with Auto-Classification Tools

Where do you get Taxonomies?

Semantics! Machine Learning! AI!

Key Concepts

False positives vs. false negatives

Look at the impact of each in your context
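
One common way to quantify this trade-off is with precision (hurt by false positives) and recall (hurt by false negatives). The sketch below uses the standard definitions; the tag names and the example data are made up for illustration and do not come from the talk.

```python
def confusion_counts(predicted, actual, label):
    """Count true positives, false positives, and false negatives for one label."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == label and a == label for p, a in pairs)
    fp = sum(p == label and a != label for p, a in pairs)
    fn = sum(p != label and a == label for p, a in pairs)
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical tags assigned by a classifier vs. tags assigned by a reviewer.
predicted = ["legal", "legal", "other", "legal", "other"]
actual = ["legal", "other", "legal", "legal", "other"]

tp, fp, fn = confusion_counts(predicted, actual, "legal")
print(tp, fp, fn)                    # 2 1 1
print(precision_recall(tp, fp, fn))  # both values are 2/3
```

Which number matters more depends on the application: a false positive may expose a document to the wrong audience, while a false negative may hide it from search.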

Machine Learning Approach

Example: identify people as good or bad from their appearance

Decision Tree Classifier
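
The slide's own example is not preserved in this text, so the following is only a small sketch of the general idea: an ID3-style tree learned from labelled examples rather than written by hand. The training rows are borrowed from the vertebrate table used in the rules example; the function names and the choice of ID3 are assumptions, not necessarily what the talk showed.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_feature(rows, labels, features):
    """Pick the feature whose split leaves the lowest weighted entropy."""
    def remainder(feature):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[feature], []).append(label)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return min(features, key=remainder)


def build_tree(rows, labels, features):
    """Return a leaf label, or a (feature, branches, majority_label) node."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority
    feature = best_feature(rows, labels, features)
    rest = [f for f in features if f != feature]
    branches = {}
    for value in {row[feature] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[feature] == value]
        sub_rows, sub_labels = zip(*subset)
        branches[value] = build_tree(list(sub_rows), list(sub_labels), rest)
    return (feature, branches, majority)


def predict(tree, row):
    """Walk the tree; fall back to the node's majority label on unseen values."""
    while isinstance(tree, tuple):
        feature, branches, majority = tree
        tree = branches.get(row[feature], majority)
    return tree


FEATURES = ["blood", "birth", "fly", "water"]


def row(blood, birth, fly, water):
    return dict(zip(FEATURES, (blood, birth, fly, water)))


training = [
    (row("warm", "yes", "no", "no"), "mammals"),        # human
    (row("cold", "no", "no", "no"), "reptiles"),        # python
    (row("cold", "no", "no", "yes"), "fishes"),         # salmon
    (row("warm", "yes", "no", "yes"), "mammals"),       # whale
    (row("cold", "no", "no", "sometimes"), "amphibians"),  # frog
    (row("warm", "yes", "yes", "no"), "mammals"),       # bat
    (row("warm", "no", "yes", "no"), "birds"),          # pigeon
    (row("cold", "no", "no", "yes"), "fishes"),         # eel
]
rows = [r for r, _ in training]
labels = [l for _, l in training]
tree = build_tree(rows, labels, FEATURES)

dolphin = row("warm", "yes", "no", "yes")
print(predict(tree, dolphin))  # mammals (same feature values as the whale row)
```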

Building an accurate classifier

Training and Test Data
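
A minimal sketch of the usual practice, assuming a simple random hold-out: labelled examples are shuffled and split so that accuracy is measured on documents the classifier did not see during training. The 80/20 split, the seed, and the function names are illustrative choices only.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and hold out a fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(classifier, test_set):
    """Fraction of held-out (item, label) pairs the classifier gets right."""
    correct = sum(classifier(item) == label for item, label in test_set)
    return correct / len(test_set)


# Hypothetical labelled examples: (document id, tag).
examples = [(f"doc{i}", "legal" if i % 2 else "other") for i in range(10)]
train, test = train_test_split(examples)
print(len(train), len(test))  # 8 2
```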

Choosing the algorithm

Rules-based

+ Easy to get started

+ Transparent and debuggable

+ Easily controlled (when # rules not too large)

- Need taxonomies

- Rule maintenance effort

- Harder to cover domain fully and to switch domains

Machine learning

+ Don’t need taxonomies

+ Improves without manual maintenance

+ Handles new data types/domains more easily

- Need a training set

- Opaque, usually can’t debug

- Can’t specify or control specific examples

What would you use for

Case Study: Content Identification and Movement

Benchmarks

Large scale example

Combinations of Techniques usually work better

Examples of hybrid configurations
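
The configurations shown on this slide are not preserved in the text. As one hypothetical arrangement, hand-written rules can take precedence and a learned model can act as a fallback, accepted only above a score threshold. Every name below, the toy rule, the toy model, and the 0.7 threshold are assumptions for illustration.

```python
def hybrid_classify(text, rules, model, threshold=0.7):
    """Rules first; otherwise accept the model's label if it is confident enough."""
    for rule in rules:
        label = rule(text)
        if label is not None:
            return label, "rule"
    label, score = model(text)
    if score >= threshold:
        return label, "model"
    return None, "unclassified"


def confidential_rule(text):
    """Toy rule: tag anything marked confidential."""
    return "Restricted" if "confidential" in text.lower() else None


def toy_model(text):
    """Stand-in for a trained model: returns (label, confidence score)."""
    return ("Finance", 0.9) if "invoice" in text.lower() else ("Other", 0.3)


rules = [confidential_rule]
print(hybrid_classify("BA Insight Confidential memo", rules, toy_model))
print(hybrid_classify("Invoice for services", rules, toy_model))
print(hybrid_classify("Lunch menu", rules, toy_model))
```

Returning the source of each decision ("rule" or "model") keeps the rule-driven part transparent and debuggable while still letting the model cover content the rules miss.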

Example: clustering combined with rules

carrot2

Open Source & Platform packages offer an easy way to play

How to get started

Set up a metadata framework

– keep it simple

Develop or acquire managed vocabularies for critical elements

Start with rule-driven automation

Test out ML-based techniques as you grow