
NAMED ENTITY RECOGNITION

JACOB SU WANG OJO LABS INC.

WE EXPLORE …

IN THIS TALK

• WHAT IS NER, WHAT ARE ITS APPLICATIONS

• WHAT ARE THE METHODS USED IN VARIOUS CONDITIONS

• WHAT MODEL TO USE WHEN?

• HOW DO THE MODELS WORK?

WHAT IS A NAMED ENTITY?

WORDS/PHRASES OF INTEREST IN TEXT

NAMED ENTITIES

• NATURAL NAMED ENTITIES

• PROPER NOUNS

• E.G. PERSON NAME (Steve Jobs), ORGANIZATION (OJO Labs), LOCATION (Austin), ETC.

• DEFINED NAMED ENTITIES

• NON-PROPER-NOUN WORDS/PHRASES WE DEFINE TO BE INFORMATIVE

• E.G. TIME (9pm, 1945), INDICATORS (which, where, etc.), CONTEXTUALLY-SIGNIFICANT TERMS (buffalo hunter, horse trading in Lonesome Dove).

NAMED ENTITY RECOGNITION TASK

pic1: https://www.semanticengine.ws/namedentityrecognition
pic2: https://www.ravn.co.uk/named-entity-recognition-ravn-part-1/

IDENTIFICATION OF WORDS/PHRASES OF INTEREST IN TEXT

APPLICATIONS

QUESTION-ANSWERING: LOCATE INFORMATIVE TEXT

WHY?

Q: WHEN DID ADOLF HITLER COME TO POWER?


INFO SUPPLEMENTATION: AMAZON X -RAY

WHY?

pic1: http://www.blogher.com/kindle-paperwhite-smart-bitches-review
pic2: https://adelightfulspace.wordpress.com/2015/09/14/review-amazon-kindle-

TEXT NORMALIZATION: PROPER LEVEL OF ABSTRACTION

WHY?

WHAT DO PEOPLE CARE ABOUT?

WHAT DO COMPANIES CARE ABOUT?

COOL! HOW?

LOOK UP A GAZETTEER?

HOW?

STRING MATCHING ALONE IS NOT GOING TO CUT IT, MAINLY BECAUSE OF WEAK GENERALIZATION!

LOOKING-UP AIN'T GONNA CUT IT!

HOW?

ENGLAND: COUNTRY_NAME OR LOCATION?

1945: NUMBER OR TIME?

CHASE: PERSON_NAME OR ORGANIZATION?

ISSUE 1: AMBIGUITY

LOOKING-UP AIN'T GONNA CUT IT!

HOW?

STREET, ST., STRT, …

UNIVERSITY OF TEXAS, UNIV TX, UT, …

NAMED ENTITY RECOGNITION, NER, …

ISSUE 2: VARIANTS

LOOKING-UP AIN'T GONNA CUT IT!

HOW?

MOST ITEMS WILL BE OOVS!

ISSUE 3: OUT-OF-VOCAB ITEMS

ZIPF’S LAW

SOLUTION: FEATURIZATION

HOW?

… W W W NAMED ENTITY W W W …

MUCH RICHER INFORMATION THAN HAVING ONLY THE ITEM ITSELF!

SEMANTIC

SYNTACTIC

LEXICAL

MORPHOLOGICAL

FEATURE VECTOR OF WORD 1

HOW? SOLUTION: FEATURIZATION

• E.G. DISAMBIGUATION

1945 (NUMBER):
- PARSER LIKELY TO GIVE A NUM/ADJ POS TAG.
- APPEARS IN CONTEXTS MORE SIMILAR TO ARBITRARY-LENGTH DIGIT SEQUENCES THAN TO "YYYY"-FORMAT SEQUENCES.

1945 (TIME):
- MORE LIKELY TO BE THE LAST ITEM BEFORE A DELIMITER (COMMA OR PERIOD).
- MORE LIKELY TO HAVE A PRECEDING "IN".

HOW? SOLUTION: FEATURIZATION

• E.G. VARIANTS

UT, UNIVERSITY OF TEXAS, UNIV TX, …

- SIMILAR COOCCURRENCE VECTORS (OVER THE VOCAB).
- MORE SIMILAR TO EACH OTHER THAN TO OTHER NON-WORD ABBREVIATIONS (UT, UNIV TX).

HOW? SOLUTION: FEATURIZATION

• E.G. OUT-OF-VOCAB ITEMS

BARFKNECHT, THORUP, PECKENPAUGH, … (RARE SURNAMES, LESS THAN 0.15% IN 100,000 PEOPLE)

- MORE SIMILAR CONTEXTUAL DISTRIBUTION TO "JOHNSON" OR "SMITH" THAN TO RANDOM WORDS.
- LIKELY TO BE TAGGED AS A PROPER NOUN BY A PARSER.

FEATURIZATION

METHODS

FEATURIZATION

• FEATURE ENGINEERING

• DOMAIN EXPERT KNOWLEDGE

• LINGUISTIC KNOWLEDGE

• FEATURE ABSTRACTION

• “ALMOST FROM SCRATCH” WITH AUTOMATIC FEATURE DISCOVERY

FEATURE ENGINEERING

… W W W NAMED ENTITY W W W …

• MORPHOLOGY: PREFIX, SUFFIX, STEM, ETC. • IN ENGLISH: anti-, con-, dis-, re-, …, -ly, -ness, …

• LEXICAL: GAZETTEER, SPELLING • {city_names}, …, capitalized, …

• SEMANTICS: COOCCURRENCE PATTERN, HEARST PATTERNS • cooccurrence counts, CITIES such as …

• SYNTAX: DEPENDENCY CONTEXT, SYNTACTIC PATH • (NE, dobj, kill), …, V->NP->N, …
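Below is a minimal sketch (not from the slides) of what hand-engineered features for one token might look like; the feature names, the toy gazetteer, and the one-word window are illustrative assumptions.

```python
# A toy gazetteer and window-based features; names are illustrative.
CITY_GAZETTEER = {"austin", "dallas", "houston"}

def token_features(tokens, i):
    """Morphological, lexical, and contextual features for tokens[i]."""
    w = tokens[i]
    feats = {
        "word.lower": w.lower(),                  # lexical
        "word.istitle": w.istitle(),              # spelling / capitalization
        "word.isdigit": w.isdigit(),
        "prefix3": w[:3],                         # morphology
        "suffix3": w[-3:],
        "in_city_gazetteer": w.lower() in CITY_GAZETTEER,
    }
    # contextual cues from the neighbouring "W" slots
    if i > 0:
        feats["prev.lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True
    if i < len(tokens) - 1:
        feats["next.lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True
    return feats

print(token_features("He moved to Austin in 1999".split(), 3))
```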

FEATURE ABSTRACTION

… W W W NAMED ENTITY W W W …

• MORPHOLOGY: CHARACTER EMBEDDINGS • from sequences of characters in words (CNN)

• LEXICAL & SEMANTICS: WORD EMBEDDINGS • from sequences of words in sentences (word2vec with RNN/FNN)

• SYNTAX: WORD EMBEDDING + PHRASE EMBEDDINGS • from sequences of words in sentences (vectors with RecNN)

FEATURE ENGINEERING VS. FEATURE ABSTRACTION

A BIG DIFFERENCE WE CARE ABOUT: INTERPRETABILITY

THEY WORK IN THE CLASSIFICATION TASK BUT I DON’T KNOW WHAT THEY MEAN!!

WHAT MODEL TO USE?

NER: WHAT TO USE?

ALL DEPENDS ON DATA AVAILABILITY & KNOWLEDGE STATE

WHAT MODEL TO USE?

SUPERVISED SEMI-SUPERVISED UNSUPERVISED

ANNOTATION

LABELING SCHEME

WHAT MODEL TO USE? OUR GENERAL OPINION

PURPOSE | PERFORMANCE EXPECTATION (ACCURACY/F1)

SUPERVISED | WEAPONIZED, PRODUCTION-LEVEL MODELS | 95%+

SEMI-SUPERVISED | EXPLORATORY MODELS, PATTERN DISCOVERY | ~75%

UNSUPERVISED | EXPLORATORY MODELS, PATTERN DISCOVERY | ~65%

OVERVIEW

| FEATURE ENGINEERING | FEATURE ABSTRACTION
SUPERVISED | CONDITIONAL RANDOM FIELDS (CRF) | RECURRENT NEURAL NETS (RNN)
SEMI-SUPERVISED | BOOTSTRAPPING, LABEL PROPAGATION | -
UNSUPERVISED | HEARST PATTERN, EXTERNAL TAXONOMY | -


CRF: LINEAR-CHAIN GRAPHICAL MODEL

STATES: LATENT VARIABLES THAT TAKE LABELS AS VALUES

OBSERVATION: WORDS

CRF: EXAMPLE

ADOLF HITLER CAME TO POWER IN 1933

OBJECTIVE: FINDING THE BEST PATH!

COMPUTING THE BEST SEQUENCE OF LABELS (Y's)? EXPENSIVE!

CRF OBJECTIVE: FINDING THE BEST PATH!

COMPLEXITY OF BRUTE-FORCE COMPUTATION

E.G. ATIS DATASET

- TYPICAL SENTENCE: ~15 WORDS
- SIZE OF LABEL SET: 127
- POSSIBLE PATHS / LABEL SEQUENCES: 127^15

CRF: BRUTE FORCE?

WE WANT TO BREAK THE GRAPH INTO SUBCOMPONENTS (FACTORS)

CRF: FACTORIZATION

Ψ(X_t, Y_t): FACTOR AT TIME t

HITLER CAME TO POWER IN


<0, 1, 1, 0, 0, 0, … >

FEATURE INDICATOR FUNCTION

f1 f2 f3 f4 f5 f6, …

CRF: FEATURE FUNCTIONS (INDICATOR FUNCTIONS)

f_k(X_t, Y_t) = 1 if Ψ(X_t, Y_t) has feature k, 0 otherwise

E.G.
- X_{t-1} IS THE WORD "TO"
- X_{t+1} HAS POS TAG "PREP"
- Y_{t+1} IS LABEL "B-PER"
- ETC.

Ψ(X_t, Y_t)

HITLER CAME TO POWER IN
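A minimal sketch of such binary feature (indicator) functions; the simple (tokens, labels, t) signature and the specific checks are illustrative assumptions, not the slides' exact formulation.

```python
def f_prev_word_is_to(tokens, labels, t):
    # fires when X_{t-1} is the word "to"
    return 1 if t > 0 and tokens[t - 1].lower() == "to" else 0

def f_label_is_time_and_digits(tokens, labels, t):
    # fires when Y_t is "B-TIME" and X_t is a digit sequence
    return 1 if labels[t] == "B-TIME" and tokens[t].isdigit() else 0

tokens = "Adolf Hitler came to power in 1933".split()
labels = ["B-PER", "I-PER", "O", "O", "O", "O", "B-TIME"]
print([f(tokens, labels, 6) for f in (f_prev_word_is_to, f_label_is_time_and_digits)])
```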

CRF: BEST SEQUENCE

$$Y^* = \arg\max_{Y} \frac{\prod_{t=1}^{T} \exp\left(\sum_{k=1}^{K} w_k f_k(X_t, Y_t)\right)}{\sum_{Y'} \prod_{t=1}^{T} \exp\left(\sum_{k=1}^{K} w_k f_k(X_t, Y'_t)\right)}$$

w_k: WEIGHT OF FEATURE FUNCTION f_k

THE PROBABILITY OF THE ENTIRE SEQUENCE IS PROPORTIONAL TO THE PRODUCT OF THE "WEIGHTED FEATURE SCORES" OF ITS FACTORS

THIS COULD BE FORMULATED INTO AN OPTIMIZATION PROBLEM AND SOLVED WITH VARIOUS ALGORITHMS!
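For decoding, linear-chain models avoid enumerating all 127^15 paths with dynamic programming; the sketch below shows Viterbi decoding over per-position and transition scores (assumed to be precomputed), which runs in O(T·K^2) instead of O(K^T). The toy scores are assumptions.

```python
import numpy as np

def viterbi(emission, transition):
    """Best label path for a linear chain.
    emission:   (T, K) score of each label at each position
    transition: (K, K) score of moving from label i to label j
    Runs in O(T * K^2) instead of the O(K^T) brute force."""
    T, K = emission.shape
    score = np.empty((T, K))
    back = np.zeros((T, K), dtype=int)
    score[0] = emission[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + transition + emission[t][None, :]
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0)
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# toy example: 3 positions, 2 labels
print(viterbi(np.array([[1., 0.], [0., 2.], [1., 1.]]),
              np.array([[0.5, 0.], [0., 0.5]])))
```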

FACTORIZATION: AS YOU LIKE IT

Sutton & McCallum (2011)

WE FOUND THESE MODELS LESS APPROPRIATE THAN THE CRF:

HIDDEN MARKOV MODEL (HMM)

• EMPIRICALLY WEAKER PERFORMANCE • TROUBLE CAPTURING LONG-DISTANCE DEPENDENCY

MAXIMUM ENTROPY MARKOV MODEL (MEMM)

• CRF IS AN IMPROVED VERSION OF MEMM

SUPPORT VECTOR MACHINE (SVM)

• SVM, AS A BINARY CLASSIFIER, AGGREGATES ERROR!

Lafferty et al. (2001) Sutton & McCallum (2011)

RECAP

SUPERVISED + FEATURE ENGINEERING

IN: ADOLF HITLER CAME TO POWER IN 1933.
OUT: B-PER I-PER O O O O B-TIME

NER: LEARNING A MAPPING

TOOLBOX

SUPERVISED + FEATURE ENGINEERING

LIBRARIES

CRF
• pycrfsuite: https://python-crfsuite.readthedocs.io/en/latest/
• crf++: https://taku910.github.io/crfpp/

HMM
• seqlearn: https://github.com/larsmans/seqlearn
• hmmlearn: https://github.com/hmmlearn/hmmlearn

MEMM • nltk: http://www.nltk.org/_modules/nltk/classify/maxent.html

SVM • sklearn: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
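A minimal usage sketch for pycrfsuite (linked above); the toy training pair, the model filename, and the hyperparameter values are assumptions.

```python
import pycrfsuite  # pip install python-crfsuite

# toy data: one sentence of per-token feature dicts and its label sequence
X_train = [[{"word.lower": "adolf", "word.istitle": True},
            {"word.lower": "hitler", "word.istitle": True},
            {"word.lower": "came", "word.istitle": False}]]
y_train = [["B-PER", "I-PER", "O"]]

trainer = pycrfsuite.Trainer(verbose=False)
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)
trainer.set_params({"c1": 0.1, "c2": 0.01, "max_iterations": 50})
trainer.train("ner.crfsuite")          # writes the model file

tagger = pycrfsuite.Tagger()
tagger.open("ner.crfsuite")
print(tagger.tag(X_train[0]))          # predicted label sequence
```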

| FEATURE ENGINEERING | FEATURE ABSTRACTION
SUPERVISED | CONDITIONAL RANDOM FIELDS (CRF) | RECURRENT NEURAL NETS (RNN)
SEMI-SUPERVISED | BOOTSTRAPPING, LABEL PROPAGATION | -
UNSUPERVISED | HEARST PATTERN, EXTERNAL TAXONOMY | -

LEARNING TYPE & OBJECTIVE

SUPERVISED VS. SEMI-SUPERVISED NER

LEARNING TYPE

SUPERVISED: INDUCTIVE LEARNING

LEARNING CLASSIFIER THAT INCORPORATES GENERAL RULES

SEMI-SUPERVISED: TRANSDUCTIVE LEARNING

CLASSIFY UNLABELED DATA AS SPECIFIC CASES BY THEIR SIMILARITY TO LABELED DATA

SUPERVISED LEARNING: INDUCTIVE

SUPERVISED VS. SEMI-SUPERVISED NER

WE ARE LEARNING THESE PARAMETERS!

classifier(label sequences | word sequences; Θ)

SEMI-SUPERVISED LEARNING: TRANSDUCTIVE

SUPERVISED VS. SEMI-SUPERVISED NER

SIMILAR WORDS SHOULD HAVE SAME LABELS

BOOTSTRAPPING: ALTERNATING/MUTUAL BOOTSTRAPPING

BY LEXICAL FEATURES

EXTRACTION

• AUTHOR

{[A-Z][A-Za-z .,&]; [A-Za-z.]; ...}

• TITLE

{[A-Z0-9][A-Za-z0-9 .,:’#!?;&]; [A-Za-z0-9?!]}

• ...

E.G. REGEX CHARACTERIZATION

E.G. LEXICAL RULES

Brin (1999)

Collins & Singer (1999)

BY CONTEXTUAL FEATURES

EXTRACTION

Riloff & Jones (1999)

BY DISTRIBUTIONAL SIMILARITY

EXTRACTION

Pasca et al. (2006)

BOOTSTRAPPING: ALTERNATING/MUTUAL BOOTSTRAPPING

THE SET OF NAMED ENTITIES!
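A minimal sketch of alternating bootstrapping in the spirit of the references above: seed entities yield extraction contexts, contexts yield new candidate entities. The toy corpus and the single regex patterns are assumptions; real systems score and filter candidates to limit noise.

```python
import re

corpus = [
    "He moved to Austin last year .",
    "She moved to Paris in 2015 .",
    "They traveled to Paris for work .",
    "He traveled to Houston for a game .",
]

entities = {"Austin"}          # seed set
patterns = set()               # extraction contexts

for _ in range(3):             # a few alternating rounds
    # 1) known entities -> contexts ("moved to X", "traveled to X")
    for sent in corpus:
        for ent in list(entities):
            m = re.search(r"(\w+ to) " + re.escape(ent), sent)
            if m:
                patterns.add(m.group(1))
    # 2) learned contexts -> new candidate entities
    for sent in corpus:
        for pat in patterns:
            m = re.search(re.escape(pat) + r" ([A-Z]\w+)", sent)
            if m:
                entities.add(m.group(1))

print(patterns, entities)
```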

LABEL PROPAGATION

ITEMS

SIMILARITIES

(TOKENS)

*THE ACTUAL GRAPH IS USUALLY FULLY CONNECTED, BUT NOT NECESSARILY SO IN SOME VARIANTS OF LP

LABEL PROPAGATION

L_1

L_2

<0,1,1,0,0, …> <1,1,0,1,1, …>

<0,1,1,1,0, …>

LABELED NODES: LABEL + FEATURE VEC UNLABELED NODES: FEATURE VEC ONLY

LABEL PROPAGATION

Network graph from Mejova (2015), interpretation differs here.

LABELED ITEMS

UNLABELED ITEMS

ITEMS

SIMILARITIES

PROPAGATION

FULLY LABELED!

PROCEDURE

LABEL PROPAGATION

SIMILARITY MATRIX: (l+u) × (l+u). SOFT LABEL ASSIGNMENT DISTRIBUTION: (l+u) × C.

Zhu & Ghahramani (2002)

STEP 1

LABEL PROPAGATION

NEW DISTRIBUTION MATRIX <= SIMILARITY MATRIX * DISTRIBUTION MATRIX


Zhu & Ghahramani (2002)

STEP 1

LABEL PROPAGATION

AN INTUITIVE EXPLANATION WHY THE UPDATE WORKS


Zhu & Ghahramani (2002)

INFLUENCE OF LABELED DATA ON UNLABELED DATA PROPORTIONAL TO THEIR SIMILARITY!

WORD X_i, LABELED WORD X_j

STEP 2

LABEL PROPAGATION

ROW-NORMALIZE THE DISTRIBUTION MATRIX

EACH ROW IS A PROBABILITY DISTRIBUTION OVER CLASSES!


Zhu & Ghahramani (2002)

STEP 3

LABEL PROPAGATION

CLAMP / "REPLENISH" THE LABELED DATA


Zhu & Ghahramani (2002)

CONVERGENCE

LABEL PROPAGATION

UNTIL ALL THE ITEMS ARE LABELED …


Zhu & Ghahramani (2002)
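A minimal numpy sketch of the three steps above (propagate, row-normalize, clamp) iterated to convergence; the toy feature vectors and the RBF similarity are assumptions.

```python
import numpy as np

X = np.array([[0., 1., 1., 0., 0.],      # labeled, class 0
              [1., 1., 0., 1., 1.],      # labeled, class 1
              [0., 1., 1., 1., 0.]])     # unlabeled
labels = np.array([0, 1, -1])            # -1 = unlabeled
n, n_classes = len(X), 2

# similarity matrix, row-normalized so each row sums to 1
dist2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
T = np.exp(-dist2)                        # RBF similarity
T /= T.sum(axis=1, keepdims=True)

# soft label distribution: one-hot for labeled rows, uniform otherwise
Y = np.full((n, n_classes), 1.0 / n_classes)
labeled = labels >= 0
Y[labeled] = np.eye(n_classes)[labels[labeled]]

for _ in range(100):
    Y_new = T @ Y                                 # step 1: propagate
    Y_new /= Y_new.sum(axis=1, keepdims=True)     # step 2: row-normalize
    Y_new[labeled] = Y[labeled]                   # step 3: clamp labeled rows
    if np.allclose(Y_new, Y, atol=1e-6):
        break
    Y = Y_new

print(Y.argmax(axis=1))   # the unlabeled item inherits the closest label
```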

PROS & CONS

BOOTSTRAPPING VS. LABEL PROPAGATION

BOOTSTRAPPING
PROS: • EASY IMPLEMENTATION • FAST EXTRACTION
CONS: • LABOR-INTENSIVE (HEAVY HUMAN SUPERVISION TO GUARANTEE QUALITY) • PRONE TO INTRODUCING NOISE

LABEL PROPAGATION
PROS: • CONVERGENCE GUARANTEED • MORE AUTOMATED
CONS: • PARAMETERS DIFFICULT TO TUNE (PERFORMANCE DEPENDS HEAVILY ON PARAMETER SETTING) • SLOW WITH LARGE GRAPHS (SOPHISTICATED VARIANTS)

RECAP

SEMI-SUPERVISED + FEATURE ENGINEERING

SIMILAR WORDS SHOULD HAVE SAME LABELS

TOOLBOX

SEMI-SUPERVISED + FEATURE ENGINEERING

LIBRARIES

BOOTSTRAPPING: NONE NEEDED

LABEL PROPAGATION

• sklearn: http://scikit-learn.org/stable/modules/label_propagation.html
• MAD: https://github.com/psorianom/modified_adsorption

Talukdar & Crammer (2009)
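A minimal usage sketch for sklearn's LabelPropagation (linked above): unlabeled items are marked with -1; the toy vectors and parameter values are assumptions.

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

X = np.array([[0., 1., 1., 0., 0.],
              [1., 1., 0., 1., 1.],
              [0., 1., 1., 1., 0.]])
y = np.array([0, 1, -1])                  # -1 marks the unlabeled item

model = LabelPropagation(kernel="rbf", gamma=1.0, max_iter=1000)
model.fit(X, y)
print(model.transduction_)                # labels for all items, incl. unlabeled
```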

| FEATURE ENGINEERING | FEATURE ABSTRACTION
SUPERVISED | CONDITIONAL RANDOM FIELDS (CRF) | RECURRENT NEURAL NETS (RNN)
SEMI-SUPERVISED | BOOTSTRAPPING, LABEL PROPAGATION | -
UNSUPERVISED | HEARST PATTERN, EXTERNAL TAXONOMY | -

OBJECTIVE

UNSUPERVISED LEARNING VS. OTHER

SUPERVISED SEMI-SUPERVISED UNSUPERVISED

ANNOTATION

LABELING SCHEME

LEARNING OBJECTIVE: LABELING STUFF (SUPERVISED, SEMI-SUPERVISED) VS. ALSO INDUCING A LABELING SCHEME (UNSUPERVISED)

GENERAL IDEA

UNSUPERVISED + FEATURE ENGINEERING

INDUCING LABELING SCHEME


UNSUPERVISED LEARNING

HYPERNYMS

HYPONYMS

UNSUPERVISED LEARNING: INDUCING LABELING SCHEME

WHAT IS A HEARST PATTERN?

HEARST PATTERN BASED EXTRACTION

PARADIGM

Y such as X (LABEL CANDIDATE, NE CANDIDATE)

E.G.

CITIES such as Austin, Dallas, and Houston

Hearst (1992)
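A minimal regex sketch of the "Y such as X" extraction; the pattern and the toy sentence are assumptions and would need broadening for real text.

```python
import re

text = "They visited CITIES such as Austin, Dallas, and Houston last summer."

pattern = re.compile(r"(\w+) such as ((?:[A-Z]\w+(?:, and | and |, )?)+)")
for label, group in pattern.findall(text):
    candidates = re.split(r", and |, | and ", group.strip())
    print(label, "->", candidates)   # CITIES -> ['Austin', 'Dallas', 'Houston']
```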

HEARST PATTERN BASED EXTRACTION: STEP 1

STEP 2

HEARST PATTERN BASED EXTRACTION

LABEL(X) = argmax_Y SCORE(X, Y)

Evans (2003)

THIS IS A SELECTION PROCESS FOR LABELS

STEP 3

HEARST PATTERN BASED EXTRACTION

Etzioni et al. (2005)

PMI(X, Y_i[X]) = PMI(Austin, "CITY such as Austin"), where

X = Austin
Y = CITY
Y_i[X] = "CITY such as Austin"

CANDIDATE LABELS Y_1, Y_2, Y_3, …, Y_k FOR X; E.G. (Austin, CITY) VIA "Y such as X"
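A minimal sketch of PMI-style scoring for picking a label: count how often the pattern "Y such as X" fires relative to how often X appears at all. Etzioni et al. (2005) use web hit counts; the toy corpus and the simplified ratio here are assumptions.

```python
corpus = [
    "CITIES such as Austin attract new residents .",
    "COMPANIES such as Dell are based near Austin .",
    "Austin hosts many festivals .",
]

def pmi_score(candidate, label):
    pattern_hits = sum(f"{label} such as {candidate}" in s for s in corpus)
    candidate_hits = sum(candidate in s for s in corpus)
    return pattern_hits / candidate_hits if candidate_hits else 0.0

labels = ["CITIES", "COMPANIES"]
scores = {y: pmi_score("Austin", y) for y in labels}
print(max(scores, key=scores.get), scores)   # argmax over candidate labels
```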

GROUNDING

EXTERNAL TAXONOMY BASED EXTRACTION (EXAMPLE: WORDNET)

HYPERNYM (MORE GENERAL)

HYPONYM (MORE SPECIFIC)

EXTERNAL TAXONOMY BASED EXTRACTION. OBJECTIVE: FIND THE SWEET SPOT

FIND ALL CAPITALIZED WORDS / PHRASES

(OPTION: BOOTSTRAP FROM A GAZETTEER)

MANUALLY FIND A SET OF HYPERNYMS THAT COVERS ALL THE NE CANDIDATES!

EXTERNAL TAXONOMY BASED EXTRACTION. OBJECTIVE: FIND THE SWEET SPOT

TOPIC SIGNATURE

SIG(X) = {(word, freq_word) | word in X's context}
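A minimal sketch of the signature comparison: SIG(X) as a bag of context words, compared with candidate hypernym signatures by a simple overlap score. The toy contexts and candidate signatures are assumptions; in practice they come from corpora and WordNet nodes.

```python
from collections import Counter

def signature(contexts):
    """SIG(X): bag of words appearing in X's contexts."""
    return Counter(w.lower() for sent in contexts for w in sent.split())

def similarity(sig_a, sig_b):
    # simple overlap score: sum of shared counts
    return sum(min(sig_a[w], sig_b[w]) for w in sig_a if w in sig_b)

sig_hobbiton = signature([
    "Hobbiton is a small village in the Shire",
    "travellers rest in the village of Hobbiton",
])

# assumed toy signatures for three WordNet-style candidate hypernyms
candidates = {
    "ENTITY":   Counter({"thing": 3, "object": 2}),
    "LOCATION": Counter({"place": 4, "region": 2, "village": 1}),
    "VILLAGE":  Counter({"village": 5, "small": 2, "settlement": 3}),
}

best = max(candidates, key=lambda y: similarity(sig_hobbiton, candidates[y]))
print(best)   # the walk stops at the most similar node: the "sweet spot"
```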

EXTERNAL TAXONOMY BASED EXTRACTION. OBJECTIVE: FIND THE SWEET SPOT

CURRENT NODE = ENTITY

argmax_Y SIM(SIG(X), SIG(Y)) = LOCATION

(candidate SIM scores: 230, 410, 140)

EXTERNAL TAXONOMY BASED EXTRACTION. OBJECTIVE: FIND THE SWEET SPOT

CURRENT NODE = LOCATION

argmax_Y SIM(SIG(X), SIG(Y)) = VILLAGE

(candidate SIM scores: 410, 251, 533)

EXTERNAL TAXONOMY BASED EXTRACTION. OBJECTIVE: FIND THE SWEET SPOT

SWEET SPOT FOUND!!

CURRENT NODE = VILLAGE

argmax_Y SIM(SIG(X), SIG(Y)) = VILLAGE (SIM = 533)

EXTERNAL TAXONOMY BASED EXTRACTION. OBJECTIVE: FIND THE SWEET SPOT

NE CANDIDATES: MORDOR, HOBBITON, HOBBIT, WIZARD, DWARF, FAIRY

Alfonseca & Manandhar (2002)

RECAP

UNSUPERVISED + FEATURE ENGINEERING

INDUCING LABELING SCHEME

| FEATURE ENGINEERING | FEATURE ABSTRACTION
SUPERVISED | CONDITIONAL RANDOM FIELDS (CRF) | RECURRENT NEURAL NETS (RNN)
SEMI-SUPERVISED | BOOTSTRAPPING, LABEL PROPAGATION | -
UNSUPERVISED | HEARST PATTERN, EXTERNAL TAXONOMY | -

FEATURE ENGINEERING VS. FEATURE ABSTRACTION

FEATURE ENGINEERING: MANUAL FEATURIZATION

FEATURE ABSTRACTION: AUTOMATIC FEATURE ABSTRACTION THROUGH JOINT OPTIMIZATION

• WHAT ARE EMBEDDINGS?

FEATURE ABSTRACTION. REPRESENTATION: EMBEDDINGS

REPRESENTATIONS THAT LIVE IN A HIGH-DIMENSIONAL SPACE WHERE THE DISTANCE BETWEEN ITEMS REFLECTS SIMILARITY OF SOME SORT…

FEATURE ABSTRACTION. E.G. WORD EMBEDDINGS

FEATURE ABSTRACTION. PIPELINE: HOW ARE EMBEDDINGS LEARNED?

ONE-HOT INPUT → PREDICTION → RECALIBRATION → ONE-HOT OUTPUT
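A minimal sketch of learning word embeddings by prediction with gensim's word2vec (gensim ≥ 4, linked in the toolbox later); the toy corpus is an assumption and far too small for useful vectors.

```python
from gensim.models import Word2Vec   # gensim >= 4

sentences = [
    "adolf hitler came to power in 1933".split(),
    "the president came to power in 1945".split(),
    "she moved to austin in 2015".split(),
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=200)

print(model.wv["1933"][:5])                    # dense vector replacing the one-hot
print(model.wv.most_similar("1933", topn=3))   # neighbours in the embedding space
```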

FEATURE ABSTRACTION: JOINT OPTIMIZATION


MULTICHANNEL EMBEDDINGS

EMBEDDINGS COULD DRAW ON INFORMATION FROM MULTIPLE SOURCES!

MORPHOLOGICAL LEXICAL SEMANTIC SYNTACTIC

dos Santos & Guimaraes (2014a)

MULTICHANNEL EMBEDDINGS. EXAMPLE: CHAR-WORD JOINT FEATURIZATION

dos Santos & Guimaraes (2015)

PROJECTION → EMBEDDING LV1; PROJECTION → EMBEDDING LV2
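A minimal Keras sketch of the two channels: a character-level convolution produces a per-word vector that is concatenated with the word embedding. All sizes are toy assumptions, and this shows only the joint representation, not the full tagger of dos Santos & Guimaraes (2015).

```python
from tensorflow.keras import layers, models

CHAR_VOCAB, WORD_VOCAB, MAX_WORD_LEN = 60, 5000, 15

# channel 1: characters of a word -> convolution -> per-word vector
char_in = layers.Input(shape=(MAX_WORD_LEN,), dtype="int32")
char_emb = layers.Embedding(CHAR_VOCAB, 25)(char_in)
char_vec = layers.GlobalMaxPooling1D()(
    layers.Conv1D(30, 3, padding="same")(char_emb))

# channel 2: the word id itself -> word embedding
word_in = layers.Input(shape=(1,), dtype="int32")
word_vec = layers.Flatten()(layers.Embedding(WORD_VOCAB, 100)(word_in))

# multichannel representation: concatenate the two channels
joint = layers.Concatenate()([char_vec, word_vec])
model = models.Model([char_in, word_in], joint)
model.summary()   # (None, 130): 30 char-level + 100 word-level dimensions
```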

MULTICHANNEL EMBEDDINGS

ARCHITECTURE

RECURRENT NEURAL NETS

ADOLF HITLER CAME TO POWER IN 1933

B-PER I-PER O O O O B-TIME

TIME DISTRIBUTED PREDICTION

OUTPUT SEQUENCE: LABELS

INPUT SEQUENCE: WORDS

THE MODEL "REMEMBERS" WHAT HAPPENED IN THE 3 PREVIOUS TIME STEPS

PROCESS

RECURRENT NEURAL NETS

ADOLF

B-PER

PROJECTION TO EMBEDDING SPACE

PROJECTION TO LABEL SPACE

PROCESS

RECURRENT NEURAL NETS

ADOLF HITLER

B-PER I-PER

THE PARAMETERS "REMEMBER" THE TRANSITION HISTORY!

THE SAME HIDDEN LAYER AT DIFFERENT TIME POINTS

PROCESS

RECURRENT NEURAL NETS

ADOLF HITLER CAME

B-PER I-PER O

THE PARAMETERS "REMEMBER" THE TRANSITION HISTORY!

RESULT

RECURRENT NEURAL NETS

ADOLF HITLER CAME TO POWER IN 1933

B-PER I-PER O O O O B-TIME

AT EACH TIME POINT, THE PREVIOUS HISTORY IS ENCODED IN PARAMETERS

RECURRENT NEURAL NETS. STATE-OF-THE-ART: BIDIRECTIONAL LSTM-CRF

INPUT → ENCODER → CLASSIFIER (JOINT TRAINING)

THE INPUT CAN ALSO BE MULTICHANNEL EMBEDDINGS!

Lample et al. (2016)

TOOLBOX

SUPERVISED NER + FEATURE ABSTRACTION

LIBRARIES

RNN

• word embeddings
  - pre-trained: spacy (https://spacy.io/)
  - create new: https://radimrehurek.com/gensim/models/word2vec.html
• neural nets
  - Keras: https://keras.io/layers/recurrent/
  - Tensorflow: https://www.tensorflow.org/tutorials/recurrent/
  - Theano: http://deeplearning.net/tutorial/rnnslu.html
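A minimal Keras sketch of an RNN sequence labeler: embed word ids, run a bidirectional LSTM, and predict one label per time step. The CRF output layer of the Lample et al. (2016) architecture is omitted, and the vocabulary/label sizes and the random toy batch are assumptions.

```python
import numpy as np
from tensorflow.keras import layers, models

VOCAB, N_LABELS, MAX_LEN, EMB_DIM = 5000, 8, 20, 100

model = models.Sequential([
    layers.Embedding(VOCAB, EMB_DIM),                        # word id -> vector
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(N_LABELS, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# toy batch: 2 sentences of word ids and one label id per token
X = np.random.randint(0, VOCAB, size=(2, MAX_LEN))
y = np.random.randint(0, N_LABELS, size=(2, MAX_LEN))
model.fit(X, y, epochs=1, verbose=0)
print(model.predict(X).shape)   # (2, MAX_LEN, N_LABELS): label scores per step
```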

COMPARISON

FEATURE ENGINEERING VS. FEATURE ABSTRACTION

| FEATURIZATION | INTERPRETABILITY
FEATURE ENGINEERING | MANUAL | INTERPRETABLE
FEATURE ABSTRACTION | AUTOMATIC | NOT INTERPRETABLE

ARE DEEP LEARNING BASED MODELS NECESSARILY BETTER?

FEATURE ENGINEERING VS. FEATURE ABSTRACTION

• CRF CONVERGES FAST
• CRF IS GOOD IN LOW-DATA SETTINGS
• CRF IS MORE INTERPRETABLE
• PERFORMANCE DIFFERENCE IS ~1%

LIME (LOCAL INTERPRETABLE MODEL-AGNOSTIC EXPLANATIONS) https://github.com/marcotcr/lime

• OPTION 1 (ABSENT DOMAIN KNOWLEDGE)

• 1) UNSUPERVISED EXPLORATION

• 2) PAID LABELING, THEN SUPERVISED MODEL

• OPTION 2 (EXPERT DOMAIN KNOWLEDGE AVAILABLE)

• 1) PAID LABELING ON SMALL SET

• 2) SEMI-SUPERVISED EXPLORATION

• 3) PAID LABELING, THEN SUPERVISED MODEL

SUGGESTIONS ON MODELING: NEW DOMAIN + UNLABELED DATA

PAID LABELING: TOOLS

CONFIDENT IN DOMAIN KNOWLEDGE:

• MECHANICAL TURK (https://www.mturk.com/mturk)

LESS CONFIDENT IN DOMAIN KNOWLEDGE:

• CROWDFLOWER (https://www.crowdflower.com/)

• SPARE5 (https://app.spare5.com/fives)

THANK YOU!
