Jeff Fried, CTO, BA Insight
@jefffried #tbc2016
Rules-Based vs. Document-Based Bake-off
About Jeff Fried
Focused on Search and SharePoint since 2004
Longtime Search Nerd
• CTO, BA Insight
• Senior PM, Microsoft
• VP, FAST
• SVP, LingoMotors
Passionate About
• Search
• SharePoint
• Search-driven applications
• Information Strategy
Blog: BAinsight.com/blog
TechNet Column: “A View from the Crawlspace”
About BA Insight
– Connectivity
– Applications
– Classification
– Analytics
Metadata Drives Great User Experiences
Documents from many sources: All client or matter-relevant documents are integrated.
Rich Metadata: Content annotated automatically (concepts, categories, citations, matters, clients, etc.)
Navigation Controls: Explore, Discover, Drill-down
Manual Tagging is impractical
and remarkably inconsistent
Automation
Also called: AutoClassification, AutoTagging, Metadata Generation, Text Analytics, ...
Complicators
Common Techniques across Applications
Rules-based Approach
Enhanced Content: enriched with metadata and content types
Search, Visualization, Workflow
Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
human          warm        yes         no       no             mammals
python         cold        no          no       no             reptiles
salmon         cold        no          no       yes            fishes
whale          warm        yes         no       yes            mammals
frog           cold        no          no       sometimes      amphibians
komodo         cold        no          no       no             reptiles
bat            warm        yes         yes      no             mammals
pigeon         warm        no          yes      no             birds
cat            warm        yes         no       no             mammals
leopard shark  cold        yes         no       yes            fishes
turtle         cold        no          no       sometimes      reptiles
penguin        warm        no          no       sometimes      birds
porcupine      warm        yes         no       no             mammals
eel            cold        no          no       yes            fishes
salamander     cold        no          no       sometimes      amphibians
gila monster   cold        no          no       no             reptiles
platypus       warm        no          no       no             mammals
owl            warm        no          yes      no             birds
dolphin        warm        yes         no       yes            mammals
eagle          warm        no          yes      no             birds
Rule-based Classifier (Example)
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
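A minimal Python sketch of how rules R1 to R5 could be applied to a single record. The field names, the `RULES` layout, and the first-match tie-break in `classify` are assumptions made for illustration; the slides do not say how conflicts or uncovered records are resolved.

```python
# Each rule: (name, conditions that must all hold, class assigned).
# Field names such as "give_birth" are illustrative, not from the slides.
RULES = [
    ("R1", {"give_birth": "no", "can_fly": "yes"}, "birds"),
    ("R2", {"give_birth": "no", "live_in_water": "yes"}, "fishes"),
    ("R3", {"give_birth": "yes", "blood_type": "warm"}, "mammals"),
    ("R4", {"give_birth": "no", "can_fly": "no"}, "reptiles"),
    ("R5", {"live_in_water": "sometimes"}, "amphibians"),
]


def matching_rules(record):
    """Return (rule name, class) for every rule whose conditions all hold."""
    return [
        (name, label)
        for name, conditions, label in RULES
        if all(record.get(field) == value for field, value in conditions.items())
    ]


def classify(record, default=None):
    """Assumed policy: the first matching rule wins; no match gives `default`."""
    matches = matching_rules(record)
    return matches[0][1] if matches else default


pigeon = {"blood_type": "warm", "give_birth": "no", "can_fly": "yes", "live_in_water": "no"}
turtle = {"blood_type": "cold", "give_birth": "no", "can_fly": "no", "live_in_water": "sometimes"}
shark = {"blood_type": "cold", "give_birth": "yes", "can_fly": "no", "live_in_water": "yes"}

print(classify(pigeon))        # only R1 fires
print(matching_rules(turtle))  # R4 and R5 both fire: the rules overlap
print(classify(shark))         # no rule fires: the rules leave a gap
```

The turtle and leopard shark records show two practical issues with rule sets: rules can overlap (so an ordering or scoring policy is needed) and they can fail to cover a record (so a default class is needed).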
Example Rules Engine UI
Examples of Rules
Boolean
• "IT" OR "Information Technology" OR "MIS"
• ("Expert" OR "Witness") NOT "police"
• "New York" AND "environmental policy"
• *work
• "legal" -briefs
• "Legal" NEAR(5) "issue"
Property-based
• filetype:docx
• title:"2029 L.P" or title:2030
• footer="BA Insight Confidential" or
footer:proprietary or footer:BA*
Overriding/changing Linguistics
• NOSTEM("illumination")
• CASE("prerequisites")
• SOUNDLIKE("prerech")
Regular expressions
• title:REGEX([0-4])
• REGEX("\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))")
Controlling scores & thresholds
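A rough Python sketch of how a few of the Boolean rules above might be evaluated and combined with a score threshold. This is not the syntax or semantics of any particular rules engine: the tokenizer, the reading of NEAR(5) as "within five tokens", the tag names, the weights, and the threshold are all assumptions chosen for illustration.

```python
import re


def tokens(text):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())


def has(toks, phrase):
    """True if the phrase occurs as a contiguous token sequence."""
    p = tokens(phrase)
    return any(toks[i:i + len(p)] == p for i in range(len(toks) - len(p) + 1))


def near(toks, a, b, distance):
    """True if single words a and b occur within `distance` tokens of each other."""
    pos_a = [i for i, t in enumerate(toks) if t == a.lower()]
    pos_b = [i for i, t in enumerate(toks) if t == b.lower()]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)


# (tag, weight, predicate over the token list); tags and weights are hypothetical.
RULES = [
    ("expert-witness", 1.0,
     lambda t: (has(t, "expert") or has(t, "witness")) and not has(t, "police")),
    ("ny-environment", 1.0,
     lambda t: has(t, "New York") and has(t, "environmental policy")),
    ("legal-issue", 0.6, lambda t: near(t, "legal", "issue", 5)),
    ("legal-issue", 0.6, lambda t: has(t, "legal") and not has(t, "briefs")),
]


def tag(text, threshold=1.0):
    """Sum the weights of the rules that fire; keep tags that reach the threshold."""
    toks = tokens(text)
    scores = {}
    for label, weight, predicate in RULES:
        if predicate(toks):
            scores[label] = scores.get(label, 0.0) + weight
    return {label for label, score in scores.items() if score >= threshold}


doc = ("The expert reviewed a legal question; the central issue "
       "was New York environmental policy.")
print(tag(doc))
```

Splitting "legal-issue" across two weaker rules means neither one alone assigns the tag, which is one simple way a threshold can trade false positives against false negatives.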
Taxonomy Management is often included with Auto-Classification Tools
Where do you get Taxonomies?
Semantics! Machine Learning! AI!
Key Concepts
False positives vs. false negatives
Look at the impact of each in your context
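One common way to quantify false positives and false negatives is precision and recall. The sketch below computes both for a single tag; the function name, the "contract" tag, and the sample labels are made up for illustration.

```python
def precision_recall(y_true, y_pred, positive):
    """Count errors for one tag and derive precision and recall from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Hypothetical ground truth and classifier output for eight documents.
truth = ["contract", "contract", "contract", "other", "other", "other", "other", "other"]
predicted = ["contract", "contract", "other", "contract", "other", "other", "other", "other"]

result = precision_recall(truth, predicted, positive="contract")
print(result)
```

A false positive here puts the "contract" tag on a document that is not a contract; a false negative leaves a real contract untagged. Which of the two costs more depends on the application.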
Machine Learning Approach
Example: identify people as good or bad from their appearance
Decision Tree Classifier
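A small, dependency-free sketch of how a decision tree can be learned from labeled examples. It uses an ID3-style entropy split, which is one of several possible algorithms and not necessarily the one behind the slide, and it reuses the vertebrate table from earlier in the deck as training data because the slide's own example data is not available.

```python
import math
from collections import Counter

# Feature order: blood type, give birth, can fly, live in water.
ROWS = [
    (("warm", "yes", "no", "no"), "mammals"),            # human
    (("cold", "no", "no", "no"), "reptiles"),            # python
    (("cold", "no", "no", "yes"), "fishes"),             # salmon
    (("warm", "yes", "no", "yes"), "mammals"),           # whale
    (("cold", "no", "no", "sometimes"), "amphibians"),   # frog
    (("cold", "no", "no", "no"), "reptiles"),            # komodo
    (("warm", "yes", "yes", "no"), "mammals"),           # bat
    (("warm", "no", "yes", "no"), "birds"),              # pigeon
    (("warm", "yes", "no", "no"), "mammals"),            # cat
    (("cold", "yes", "no", "yes"), "fishes"),            # leopard shark
    (("cold", "no", "no", "sometimes"), "reptiles"),     # turtle
    (("warm", "no", "no", "sometimes"), "birds"),        # penguin
    (("warm", "yes", "no", "no"), "mammals"),            # porcupine
    (("cold", "no", "no", "yes"), "fishes"),             # eel
    (("cold", "no", "no", "sometimes"), "amphibians"),   # salamander
    (("cold", "no", "no", "no"), "reptiles"),            # gila monster
    (("warm", "no", "no", "no"), "mammals"),             # platypus
    (("warm", "no", "yes", "no"), "birds"),              # owl
    (("warm", "yes", "no", "yes"), "mammals"),           # dolphin
    (("warm", "no", "yes", "no"), "birds"),              # eagle
]


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def build_tree(rows, features):
    """Return a class label (leaf) or (feature index, branches, majority class)."""
    labels = [label for _, label in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority

    def split_entropy(f):
        groups = {}
        for values, label in rows:
            groups.setdefault(values[f], []).append(label)
        return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

    best = min(features, key=split_entropy)  # lowest weighted entropy after the split
    remaining = [f for f in features if f != best]
    branches = {}
    for value in {values[best] for values, _ in rows}:
        subset = [(v, l) for v, l in rows if v[best] == value]
        branches[value] = build_tree(subset, remaining)
    return (best, branches, majority)


def predict(tree, values):
    while isinstance(tree, tuple):
        feature, branches, majority = tree
        if values[feature] not in branches:
            return majority  # unseen value: fall back to the node's majority class
        tree = branches[values[feature]]
    return tree


tree = build_tree(ROWS, [0, 1, 2, 3])
correct = sum(predict(tree, values) == label for values, label in ROWS)
print(correct, "of", len(ROWS), "training rows classified correctly")
```

The one training row the tree gets wrong is the turtle: it has exactly the same four attribute values as the frog and the salamander, so no tree over these features can separate them. More or better features would be needed, not a different algorithm.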
Building an accurate classifier
Training and Test Data
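A minimal sketch of holding out test data, assuming a simple random split. The function name, the 25% test fraction, and the fixed seed are illustrative choices, not recommendations from the slides.

```python
import random


def train_test_split(items, test_fraction=0.25, seed=0):
    """Shuffle a copy of the labeled items and hold out a fraction for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled documents: (document id, tag).
labeled = [(f"doc{i}", "contract" if i % 2 else "other") for i in range(20)]
train, test = train_test_split(labeled)
print(len(train), "training items,", len(test), "test items")
```

The classifier is built only from `train` and scored only on `test`, so the measured accuracy reflects documents it has not seen.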
Choosing the algorithm
Rules-based approach
+ Easy to get started
+ Transparent and debuggable
+ Easily controlled (when # rules not too large)
- Need taxonomies
- Rule maintenance effort
- Harder to cover domain fully and to switch domains
Machine learning approach
+ Don’t need taxonomies
+ Improves without manual maintenance
+ Handles new data types/domains more easily
- Need a training set
- Opaque, usually can’t debug
- Can’t specify or control specific examples
What would you use for
Case Study: Content Identification and Movement
Benchmarks
Large scale example
Combinations of Techniques usually work better
Examples of hybrid configurations
Example: clustering combined with rules
carrot2
Open Source & Platform packages offer an easy way to play
How to get started
Set up a metadata framework
– keep it simple
Develop or acquire managed vocabularies for
critical elements
Start with rule-driven automation
Test out ML-based techniques as you grow
www.BAinsight.com
@jefffried