Marko Smiljanić, NIRI Intelligent Computing Ltd, CEO: Developing and validating a document classifier: a real-life story


Page 1: Developing and validating a document classifier: a real-life story - Marko Smiljanic

Marko Smiljanić, NIRI Intelligent Computing Ltd, CEO

Developing and validating a document classifier: a real-life story

Page 2: Developing and validating a document classifier: a real-life story - Marko Smiljanic

Developing and validating a document classifier: a real-life story

Marko Smiljanić, CEO

www.niri-ic.com

Page 3: Developing and validating a document classifier: a real-life story - Marko Smiljanic

About us.

NIRI: 10 years in Intelligent Computing
Text Mining
Knowledge Discovery and Management
All about Data Science

NIŠ

Page 4: Developing and validating a document classifier: a real-life story - Marko Smiljanic

About me.

My role

COMPANY

Page 5: Developing and validating a document classifier: a real-life story - Marko Smiljanic

The flow

Business Context
The Challenge
The Solution
Effectiveness: Laboratory measurements, Impact estimation, Reality
Wrap up

Page 6: Developing and validating a document classifier: a real-life story - Marko Smiljanic

Business context

Page 7: Developing and validating a document classifier: a real-life story - Marko Smiljanic

Business context

Largest clients include:
Public Employment Services in EU, USA, and Asia
Staffing companies in EU, USA

Page 8: Developing and validating a document classifier: a real-life story - Marko Smiljanic

ELISE Platform (diagram): Vacancies, Job seekers, Job Taxonomy, Skill Taxonomy

Page 9: Developing and validating a document classifier: a real-life story - Marko Smiljanic

The flow

Business Context
The Challenge
The Solution
Effectiveness: Laboratory measurements, Impact estimation, Reality
Wrap up

Page 10: Developing and validating a document classifier: a real-life story - Marko Smiljanic

Document Classification (diagram): Vacancies, Job Taxonomy

Page 11: Developing and validating a document classifier: a real-life story - Marko Smiljanic

Occupation Taxonomies
ISCO (International Standard Classification of Occupations)
ESCO
O*NET
and many more

ISCO level 1 (10)
ISCO level 2 (42)
ISCO level 3 (124)
ISCO level 4 (400)
ESCO level 5 (5000)

“Delivery service worker”

Challenges (for humans)
Knowing the taxonomy
Ambiguous taxonomy
Hybrid positions
Vague vacancy

Page 12: Developing and validating a document classifier: a real-life story - Marko Smiljanic

Client’s situation in 2014

Vacancy Aggregator and Classifier: 2000-4000 vacancies per day (into >2000 taxonomy classes)

no code: 12%
Correct Code? OK: 65% (Publish)
Correct Code? NO: 23% (Repair Code!)
Repair Code! OK: 9% (Publish)
Repair Code! no help: 14%

Published: %?
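The percentages in the flow above can be reconciled with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and reading "OK 9%" and "no help 14%" as the two outcomes of the 23% repair branch is an assumption based on the numbers adding up.

```python
# Shares of the daily vacancy stream, in percent, as read from the slide.
no_code = 12         # aggregator assigned no code
code_ok = 65         # code checked and found correct, published
code_not_ok = 23     # code checked and found wrong, sent to repair
repair_ok = 9        # assumed: part of the 23% that was repaired and published
repair_no_help = 14  # assumed: part of the 23% that could not be repaired

# The three first-level branches cover the whole stream.
assert no_code + code_ok + code_not_ok == 100
# The two repair outcomes cover the whole repair branch.
assert repair_ok + repair_no_help == code_not_ok

published = code_ok + repair_ok
print(f"published with a verified code: {published}%")
print(f"code present but not verified: {repair_no_help}%")
print(f"no code at all: {no_code}%")
```

Under this reading, 74% of vacancies are published with a human-verified code, which matches the "Verified 74%" share quoted for the training corpus later in the talk.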

Page 13: Developing and validating a document classifier: a real-life story - Marko Smiljanic

The flow

Business Context
The Challenge
The Solution
Effectiveness: Laboratory measurements, Impact estimation, Reality
Wrap up

Page 14: Developing and validating a document classifier: a real-life story - Marko Smiljanic

The Solution: NIRI will build you a better classifier

Vacancy Aggregator and Classifier (2000-4000 per day), NIRI Classifier, Publish

Page 15: Developing and validating a document classifier: a real-life story - Marko Smiljanic

Really?
How accurate will it be?
How will it fit our process?

Really. We will (try to):
Reduce manual effort
Increase volume
Improve final accuracy

Page 16: Developing and validating a document classifier: a real-life story - Marko Smiljanic

But you need to give us training data

> 1M vacancies

No class: 12%
Not verified: 14%
Verified: 74% (%?)

Page 17: Developing and validating a document classifier: a real-life story - Marko Smiljanic

Long tail effect

Page 18: Developing and validating a document classifier: a real-life story - Marko Smiljanic

Architecture of our solution

Vacancy Classifier (diagram)
Input: Vacancy
Components: Feature Extractor, Classifier 1, Classifier 2, …, Classifier N, Negotiator
Output: [Class, Confidence]+
External Services
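The slide names a feature extractor, N classifiers, and a negotiator that turns their outputs into a list of [Class, Confidence] pairs. A minimal sketch of that shape follows. Everything concrete in it is an assumption made for illustration: the token-count features, the keyword classifiers, the class codes, and the averaging rule in `negotiate` are hypothetical stand-ins, not NIRI's implementation.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]
Prediction = Tuple[str, float]  # (class code, confidence in [0, 1])
Classifier = Callable[[Features], List[Prediction]]


def extract_features(vacancy_text: str) -> Features:
    """Toy feature extractor: counts of lower-cased tokens."""
    features: Dict[str, float] = defaultdict(float)
    for token in vacancy_text.lower().split():
        features[token] += 1.0
    return dict(features)


def keyword_classifier(keyword: str, code: str, confidence: float) -> Classifier:
    """Hypothetical classifier that proposes `code` when `keyword` is present."""
    def classify(features: Features) -> List[Prediction]:
        return [(code, confidence)] if keyword in features else []
    return classify


def negotiate(outputs: List[List[Prediction]], top_n: int = 5) -> List[Prediction]:
    """Hypothetical negotiator: average each code's confidence over all classifiers."""
    if not outputs:
        return []
    totals: Dict[str, float] = defaultdict(float)
    for predictions in outputs:
        for code, confidence in predictions:
            totals[code] += confidence
    merged = [(code, total / len(outputs)) for code, total in totals.items()]
    return sorted(merged, key=lambda p: p[1], reverse=True)[:top_n]


def classify_vacancy(text: str, classifiers: List[Classifier]) -> List[Prediction]:
    """Vacancy in, ranked [Class, Confidence] pairs out."""
    features = extract_features(text)
    return negotiate([clf(features) for clf in classifiers])


# Made-up classifiers and codes, for illustration only.
classifiers = [
    keyword_classifier("driver", "DRIVER", 0.8),
    keyword_classifier("delivery", "DRIVER", 0.6),
    keyword_classifier("delivery", "COURIER", 0.9),
]
ranked = classify_vacancy("Delivery driver wanted", classifiers)
print(ranked)
```

Averaging over all classifiers, including those that stay silent, rewards codes that several classifiers agree on; a real negotiator could just as well weight classifiers or learn the combination.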

Page 19: Developing and validating a document classifier: a real-life story - Marko Smiljanic

What to do with confidence?

Batch Processing: a list of Vacancy, Code, Confidence records, ordered by CONFIDENCE

High confidence (high accuracy): Bulk Accept
Low confidence (low accuracy): To check manually
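The split between bulk acceptance and manual checking can be sketched as a sort followed by a threshold cut. The record type, the threshold value, and the sample batch below are hypothetical; the slides do not say how the cut-off was chosen.

```python
from typing import List, NamedTuple, Tuple


class Classified(NamedTuple):
    vacancy_id: str
    code: str
    confidence: float  # assumed to lie in [0, 1]


def split_batch(
    batch: List[Classified], threshold: float
) -> Tuple[List[Classified], List[Classified]]:
    """Sort by descending confidence, then cut the list at the threshold."""
    ranked = sorted(batch, key=lambda item: item.confidence, reverse=True)
    bulk_accept = [item for item in ranked if item.confidence >= threshold]
    to_check = [item for item in ranked if item.confidence < threshold]
    return bulk_accept, to_check


# Hypothetical batch; the IDs, codes, and scores are made up for illustration.
batch = [
    Classified("v1", "CODE_A", 0.97),
    Classified("v2", "CODE_B", 0.41),
    Classified("v3", "CODE_A", 0.88),
    Classified("v4", "CODE_C", 0.63),
]
bulk_accept, to_check = split_batch(batch, threshold=0.85)
print("bulk accept:", [item.vacancy_id for item in bulk_accept])
print("check manually:", [item.vacancy_id for item in to_check])
```

Raising the threshold shrinks the bulk-accepted share but should raise its accuracy, which is the trade-off the confidence axis on the slide illustrates.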

Page 20: Developing and validating a document classifier: a real-life story - Marko Smiljanic

Using confidence

Page 21: Developing and validating a document classifier: a real-life story - Marko Smiljanic

The flow

Business Context
The Challenge
The Solution
Effectiveness: Laboratory measurements, Impact estimation, Reality
Wrap up

Page 22: Developing and validating a document classifier: a real-life story - Marko Smiljanic

Measuring accuracy in the laboratory

Corpus: No class 12%, Not verified 14%, Verified 74%

Split: Train 80%, Test 20% (x 5)

Vacancy Classifier output on the test set: Correct, Incorrect, No class
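An 80/20 train/test split repeated five times is the usual shape of 5-fold cross-validation. The sketch below assumes that reading and shows the two pieces involved: producing the five splits and tallying each prediction as Correct, Incorrect, or No class. The function names are illustrative, and `None` standing for "no class returned" is an assumption.

```python
import random
from typing import Dict, Iterator, List, Optional, Sequence, Tuple


def k_fold_splits(
    n_items: int, k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each item is in exactly one test fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def outcome_shares(
    predicted: Sequence[Optional[str]], actual: Sequence[str]
) -> Dict[str, float]:
    """Share of Correct / Incorrect / No class predictions on one test fold."""
    counts = {"Correct": 0, "Incorrect": 0, "No class": 0}
    for guess, truth in zip(predicted, actual):
        if guess is None:
            counts["No class"] += 1
        elif guess == truth:
            counts["Correct"] += 1
        else:
            counts["Incorrect"] += 1
    total = len(actual)
    return {name: count / total for name, count in counts.items()}


# With 10 items and k=5, every split is 8 train / 2 test.
splits = list(k_fold_splits(10, k=5))
for train, test in splits:
    print(len(train), len(test))

# Hypothetical predictions for a four-item test fold.
shares = outcome_shares(["a", None, "b", "c"], ["a", "x", "b", "d"])
print(shares)
```

In a full run, a classifier would be trained on each `train` list, asked to predict the matching `test` list, and the five sets of shares averaged.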

Page 23: Developing and validating a document classifier: a real-life story - Marko Smiljanic

Measuring accuracy in the laboratory

One of many Laboratory Measurements

            Corpus   Classifier   Classifier 100   Classifier 1000
Correct     74%      78%          80%              85%
Incorrect   14%      13%          12%              10%
No class    12%      9%           8%               5%

Does this make any sense?

Yes, but…

Page 24: Developing and validating a document classifier: a real-life story - Marko Smiljanic

Measuring accuracy in the laboratory

Corpus: No class 12%, Not verified 14%, Verified 74%

Vacancy Classifier: No class 9%, Incorrect 13%, Correct 78%

Original Classifier

This is not reality
Biased train/test set
Accuracy of test set unknown
Inability to test against the 26% (no class and not verified)

Page 25: Developing and validating a document classifier: a real-life story - Marko Smiljanic

The flow

Business Context
The Challenge
The Solution
Effectiveness: Laboratory measurements, Impact estimation, Reality
Wrap up

Page 26: Developing and validating a document classifier: a real-life story - Marko Smiljanic

Remember the process?

Vacancy Aggregator and Classifier

no code: 12%
Correct Code? OK: 65% (Publish)
Correct Code? NO: 23% (Repair Code!)
Repair Code! OK: 9% (Publish)
Repair Code! no help: 14%

Page 27: Developing and validating a document classifier: a real-life story - Marko Smiljanic

This is what it actually looks like.

Check, Repair

We will:
Reduce manual effort
Increase volume
Improve final accuracy

Page 28: Developing and validating a document classifier: a real-life story - Marko Smiljanic

And we proposed this one.

Bulk Accept, Check, Repair

Page 29: Developing and validating a document classifier: a real-life story - Marko Smiljanic

Best/worst case analysis, some manual validation, careful assumptions:

Bulk Accept

Check Repair

Page 30: Developing and validating a document classifier: a real-life story - Marko Smiljanic

Impact estimation showed that:

Step 1 effort reduction 60% (due to bulk acceptance)
Step 2 effort reduction 11% (due to bulk acceptance and top 5 offers)
Significant published volume increase (almost to 100%)
Accuracy slightly larger (+1%, to around 92%)

Does this make any sense?

Yes, but…

Page 31: Developing and validating a document classifier: a real-life story - Marko Smiljanic

The flow

Business Context
The Challenge
The Solution
Effectiveness: Laboratory measurements, Impact estimation, Reality
Wrap up

Page 32: Developing and validating a document classifier: a real-life story - Marko Smiljanic

How can we measure production accuracy?

No class: 12%
Not verified: 14%
Verified: 74% (%?)

We cannot, unless…

Page 33: Developing and validating a document classifier: a real-life story - Marko Smiljanic

Golden Test Set

Page 34: Developing and validating a document classifier: a real-life story - Marko Smiljanic

How was it built?

Check & Repair, 4-eye principle

Diagram: Published vacancies, Vacancy Classifier, and for each vacancy the Original Code & Top 5 VC codes

Every single classification was marked as either Correct, Acceptable, or Wrong.
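Once every classification carries one of the three marks, the headline rates are simple counts. The sketch below is illustrative only: the function name and the sample marks are hypothetical, and reporting "Acceptable" cumulatively (Correct or Acceptable) is an assumption about how the results were presented.

```python
from collections import Counter
from typing import Sequence, Tuple

VALID_MARKS = {"Correct", "Acceptable", "Wrong"}


def golden_set_rates(marks: Sequence[str]) -> Tuple[float, float]:
    """Return (share marked Correct, share marked Correct or Acceptable)."""
    if not marks:
        raise ValueError("no marks given")
    unknown = set(marks) - VALID_MARKS
    if unknown:
        raise ValueError(f"unknown marks: {sorted(unknown)}")
    counts = Counter(marks)
    total = len(marks)
    correct = counts["Correct"] / total
    acceptable_or_better = (counts["Correct"] + counts["Acceptable"]) / total
    return correct, acceptable_or_better


# Made-up marks for ten classifications, for illustration only.
sample_marks = ["Correct"] * 6 + ["Acceptable"] * 1 + ["Wrong"] * 3
correct_rate, acceptable_rate = golden_set_rates(sample_marks)
print(f"Correct: {correct_rate:.0%}, Correct or Acceptable: {acceptable_rate:.0%}")
```

The same function can be applied separately to the original codes and to the classifier's codes, giving two pairs of rates measured on the same vacancies.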

Page 35: Developing and validating a document classifier: a real-life story - Marko Smiljanic

Results

Golden Test Set Results

             Current   NIRI VC   Current (HQ source)   NIRI VC (HQ source)
Correct      63.05%    73.91%    72.06%                74.38%
Acceptable   65.98%    77.56%    76.25%                78.69%

HQ source: Highest Quality Source (Training)
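The gap between the current classifier and NIRI VC can be read off as percentage-point differences. The sketch below assumes the flattened figures belong to four columns (Current, NIRI VC, Current on the HQ source, NIRI VC on the HQ source) and two rows (Correct, Acceptable); that column assignment is a reconstruction, not something the slide text states explicitly.

```python
# Golden test set results in percent, under the assumed column layout.
results = {
    "Current": {"Correct": 63.05, "Acceptable": 65.98},
    "NIRI VC": {"Correct": 73.91, "Acceptable": 77.56},
    "Current (HQ source)": {"Correct": 72.06, "Acceptable": 76.25},
    "NIRI VC (HQ source)": {"Correct": 74.38, "Acceptable": 78.69},
}


def gain(baseline: str, candidate: str, metric: str) -> float:
    """Percentage-point difference between two systems on one metric."""
    return round(results[candidate][metric] - results[baseline][metric], 2)


for metric in ("Correct", "Acceptable"):
    overall = gain("Current", "NIRI VC", metric)
    hq_only = gain("Current (HQ source)", "NIRI VC (HQ source)", metric)
    print(f"{metric}: {overall:+.2f} points overall, {hq_only:+.2f} points on the HQ source")
```

Under this reading the improvement is roughly 11 points over all sources and a little over 2 points on the highest-quality source.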

Page 36: Developing and validating a document classifier: a real-life story - Marko Smiljanic

The flow

Business Context
The Challenge
The Solution
Effectiveness: Laboratory measurements, Impact estimation, Reality
Wrap up

Page 37: Developing and validating a document classifier: a real-life story - Marko Smiljanic

Wrap up

Clean semantic data, in real life, can only be a myth. We are looking into data cleansing approaches.
Measuring usefulness can be hard and expensive, but it can/must be monitored after the system is deployed. It changes over time.
Continuous learning, where possible, is a great thing.

1) Implementing a state-of-the-art machine learning algorithm is one thing.
2) Making it useful is another.
3) Explaining that to the end-user is the third.

NIRI is a very cool company to work with!

I hope you liked the story, and I thank you for your attention.
