
Page 1: Natural Language Processing

Verbatim Text Coding and Data Mining Report Generation

Josef S.W. Leung (j.leung@ieee.org)

Ching-Long Yeh (chingyeh@cse.ttit.edu.tw)

NLP: One of the Top Priority Funding Items in Computer Science Research -- National Natural Science Foundation, China

Page 2: Language

Listen (Understand)

Speak (Generate)

Page 3: Natural Language Processing

Natural Language -> (Analysis/Understanding) -> Internal Representations

Internal Representations -> (Generation) -> Natural Language

Page 4: Outline of Presentation

• NLP Introduction
  – Natural Language Analysis/Understanding
  – Natural Language Generation

• Case 1: Verbatim Text Coding
  – May need NL analysis techniques

• Case 2: Data Mining Report Generation
  – May need NL generation techniques

Page 5: Modules of NL Understanding

Input sentence
-> Pre-processing -> Tokens
-> Parsing -> Syntactic structure
-> Semantic Interpretation -> Semantic representation
-> Contextual Interpretation -> Knowledge representation

Page 6: Parsing for Syntactic Analysis

Grammar Rules:

S -> NP + VP
NP -> ART + N
VP -> V + NP

Lexicon:

N -> dog
N -> cat
V -> chased
ART -> the

Page 7: Syntactic Structure

S
  NP
    ART: the
    N: dog
  VP
    V: chased
    NP
      ART: the
      N: cat
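The grammar rules and lexicon on the two slides above are small enough to run directly. The following is a minimal sketch of a top-down (recursive-descent) parser for them; the function names `parse` and `parse_sentence` and the nested-tuple tree format are illustrative assumptions, not part of the original slides.

```python
# Grammar and lexicon are taken from the slides; the parser is illustrative.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["ART", "N"]],
    "VP": [["V", "NP"]],
}
LEXICON = {"dog": "N", "cat": "N", "chased": "V", "the": "ART"}


def parse(symbol, tokens, pos=0):
    """Try to derive `symbol` starting at tokens[pos].

    Returns (tree, next_pos) on success, or None on failure.
    """
    if symbol not in GRAMMAR:  # lexical category such as N, V, ART
        if pos < len(tokens) and LEXICON.get(tokens[pos]) == symbol:
            return (symbol, tokens[pos]), pos + 1
        return None
    for rhs in GRAMMAR[symbol]:
        children, cursor = [], pos
        for child_symbol in rhs:
            result = parse(child_symbol, tokens, cursor)
            if result is None:
                break
            subtree, cursor = result
            children.append(subtree)
        else:  # every symbol on the right-hand side matched
            return (symbol, *children), cursor
    return None


def parse_sentence(sentence):
    """Return a parse tree covering the whole sentence, or None."""
    tokens = sentence.lower().split()
    result = parse("S", tokens)
    if result is None or result[1] != len(tokens):
        return None
    return result[0]


print(parse_sentence("the dog chased the cat"))
```

The printed nested tuple mirrors the tree on the slide: an S node with an NP subtree (the dog) and a VP subtree (chased the cat). A sentence outside the grammar, such as one missing an article, yields `None`.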

Page 8: Structural Ambiguity

• Time flies like an arrow.

• The passage of time is as quick as an arrow.

• A species of flies called ‘time flies’ enjoy an arrow.

Page 9: Structural Ambiguity

• The man saw the girl with the telescope.

• The man saw the girl who possessed the telescope.

• The man saw the girl with the aid of the telescope.
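The two readings of the telescope sentence correspond to two different parse trees. The sketch below enumerates every parse of a sentence under a small binary grammar; the grammar with PP (prepositional phrase) rules, the lexicon, and the helper names `all_parses` and `pp_attachment` are assumptions added for illustration and do not appear on the slides.

```python
from functools import lru_cache

# Hypothetical grammar: the PP rules let "with the telescope" attach
# either to the noun phrase or to the verb phrase.
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("NP", "ART", "N"),
    ("NP", "NP", "PP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("PP", "P", "NP"),
]
LEXICON = {
    "the": "ART", "man": "N", "girl": "N", "telescope": "N",
    "saw": "V", "with": "P",
}


def all_parses(sentence):
    """Return every parse tree of the sentence as nested tuples."""
    tokens = tuple(sentence.lower().rstrip(".").split())

    @lru_cache(maxsize=None)
    def parses(symbol, start, end):
        trees = []
        # A single word matches its lexical category.
        if end - start == 1 and LEXICON.get(tokens[start]) == symbol:
            trees.append((symbol, tokens[start]))
        # Otherwise try every binary rule at every split point.
        for parent, left, right in BINARY_RULES:
            if parent != symbol:
                continue
            for split in range(start + 1, end):
                for left_tree in parses(left, start, split):
                    for right_tree in parses(right, split, end):
                        trees.append((symbol, left_tree, right_tree))
        return tuple(trees)

    return list(parses("S", 0, len(tokens)))


def pp_attachment(tree):
    """Classify a parse: 'verb' if the PP modifies the VP, else 'noun'."""
    _, _, vp = tree  # S -> NP VP
    return "verb" if vp[1][0] == "VP" else "noun"


for tree in all_parses("The man saw the girl with the telescope."):
    print(pp_attachment(tree), tree)
```

Under these assumed rules the sentence receives exactly two parses: noun attachment (the girl who possessed the telescope) and verb attachment (seeing with the aid of the telescope). Choosing between them requires information beyond syntax.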

Page 10: Natural Language Generation

User’s Goal -> Strategic Component (Text Planning) -> Tactical Component (Linguistic Realization) -> Surface Sentences

Knowledge sources: Domain KB, Planning Operators, User Model, Discourse Model, Linguistic Rules & Lexicon

Page 11: Unification Grammar

Example: the man sees a sheep

S [numb=X, tense=T] -> NP [numb=X] VP [numb=X, tense=T]
VP [numb=N, tense=M] -> V [numb=N, tense=M] NP
NP [numb=Y] -> det [numb=Y] noun [numb=Y]

man : noun [numb = sing]
a : det [numb = sing]
the : det
sheep : noun
sees : [tense = pres, numb = sing]
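The shared variables in the rules above (X, Y, N, M) mean that the features of the parts must be compatible. A minimal sketch of that check, using flat Python dictionaries as feature structures, might look as follows. The function names, the `verb` category assigned to "sees" (the slide gives no category for it), and the plural entry "men" are assumptions added for illustration.

```python
def unify(a, b):
    """Merge two flat feature dicts; return None if any feature clashes."""
    merged = dict(a)
    for feature, value in b.items():
        if feature in merged and merged[feature] != value:
            return None
        merged[feature] = value
    return merged


LEXICON = {
    "man": ("noun", {"numb": "sing"}),
    "a": ("det", {"numb": "sing"}),
    "the": ("det", {}),
    "sheep": ("noun", {}),
    "sees": ("verb", {"tense": "pres", "numb": "sing"}),  # category assumed
    "men": ("noun", {"numb": "plur"}),  # hypothetical entry, not on the slide
}


def noun_phrase(det, noun):
    """NP[numb=Y] -> det[numb=Y] noun[numb=Y]"""
    det_cat, det_feats = LEXICON[det]
    noun_cat, noun_feats = LEXICON[noun]
    if (det_cat, noun_cat) != ("det", "noun"):
        return None
    return unify(det_feats, noun_feats)


def verb_phrase(verb, obj_det, obj_noun):
    """VP[numb=N, tense=M] -> V[numb=N, tense=M] NP"""
    verb_cat, verb_feats = LEXICON[verb]
    if verb_cat != "verb" or noun_phrase(obj_det, obj_noun) is None:
        return None
    return dict(verb_feats)  # the object NP's number is not shared with the VP


def sentence(words):
    """S[numb=X, tense=T] -> NP[numb=X] VP[numb=X, tense=T]

    Expects exactly five words: det noun verb det noun.
    """
    subj_det, subj_noun, verb, obj_det, obj_noun = words
    np_feats = noun_phrase(subj_det, subj_noun)
    vp_feats = verb_phrase(verb, obj_det, obj_noun)
    if np_feats is None or vp_feats is None:
        return None
    return unify(np_feats, vp_feats)


print(sentence("the man sees a sheep".split()))
print(sentence("the men sees a sheep".split()))
```

The first call succeeds because the singular subject unifies with the singular verb; the second returns `None` because `numb=plur` clashes with `numb=sing`. Underspecified entries such as "the" and "sheep" unify with either value.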

Page 12: Migraine abortive treatment is used to abort migraine.

((cat clause)
 (process ((lex "use") (type material)))
 (partic ((affected ((cat proper)
                     (lex "migraine abortive treatment")))
          (agent none)))
 (circum ((purpose ((cat clause)
                    (keep-in-order no) (keep-for no)
                    (position end)
                    (process ((lex "abort")
                              (effect-type creative)
                              (type material)))
                    (partic ((created ((lex "migraine")
                                       (countable no)
                                       (cat common))))))))))

Page 13: Verbatim Text Coding

• A text content classification problem.

• Group semantically similar answer items.

• Develop a code list/tree to represent the answer item groups.

• Simple NL analysis techniques may help.

• Details will be given in the first example of NLP application.

Page 14: Data Mining Report Generation

• Data mining results are usually in rule or tree formats with obscure notations.

• NL generation techniques may help translate the data mining results into plain natural language.

• Details will be given in the second example of NLP application.

Page 15: Codia for Verbatim Text Coding

[Figure labels: Answer Items, Code Tree]

• Small screen/window/text

• Long list of answer items

• Difficult to browse/view

• Worse than paper form

Page 16: Codia for Verbatim Text Coding

[Figure label: Key Terms]

Page 17: Ranking Answers by Similarity

[Figure label: Items with similar meaning]

Page 18: Text Similarity Measures

String, Semantics, Coverage -> Text Similarity Score
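The slides name three component measures (string, semantics, coverage) but do not say how each is computed or how they are combined. The sketch below is one plausible reading under clearly stated assumptions: string similarity via `difflib.SequenceMatcher`, semantic similarity as overlap of thesaurus categories, coverage as the share of the reference item's words found in the candidate, and an equal-weight average as the final score. The mini-thesaurus, the weights, and all function names are hypothetical and are not Codia's actual method.

```python
from difflib import SequenceMatcher

# Hypothetical mini-thesaurus: word -> semantic category.
THESAURUS = {
    "cheap": "PRICE", "inexpensive": "PRICE", "price": "PRICE",
    "tasty": "TASTE", "delicious": "TASTE", "flavour": "TASTE",
}


def tokens(text):
    """Lowercase words with surrounding punctuation stripped."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w]


def string_score(a, b):
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def categories(text):
    return {THESAURUS[w] for w in tokens(text) if w in THESAURUS}


def semantic_score(a, b):
    """Jaccard overlap of the thesaurus categories found in each text."""
    cats_a, cats_b = categories(a), categories(b)
    if not cats_a or not cats_b:
        return 0.0
    return len(cats_a & cats_b) / len(cats_a | cats_b)


def coverage_score(a, b):
    """Share of the words in `a` that also occur in `b`."""
    words_a, words_b = set(tokens(a)), set(tokens(b))
    if not words_a:
        return 0.0
    return len(words_a & words_b) / len(words_a)


def similarity(a, b, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted average of the three measures (equal weights assumed)."""
    parts = (string_score(a, b), semantic_score(a, b), coverage_score(a, b))
    return sum(w * p for w, p in zip(weights, parts))


def rank_by_similarity(reference, answers):
    """Order answer items from most to least similar to the reference."""
    return sorted(answers, key=lambda ans: similarity(reference, ans), reverse=True)


answers = ["Very tasty", "It is inexpensive", "It is cheap"]
print(rank_by_similarity("It is cheap", answers))
```

With this toy data, "It is inexpensive" ranks directly below the exact match because the thesaurus places "cheap" and "inexpensive" in the same category, even though the strings differ. This is the kind of grouping a coder would otherwise have to find by browsing.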

Page 19: Codia for Verbatim Text Coding

• A user interface for classifying answer items by drag-and-drop actions.

• NLP reduces time and effort in searching, browsing, and selecting multiple answer items for classification.

• There are still limitations, and the process is not fully automated.

Page 20: Technical Issues of Codia

• Improve user interface.

• Use only simple NLP techniques.

• Ambiguity resolution by human.

• Limited by thesaurus.

• Still cannot handle negatives such as ‘not’.

• Knowledge engineering is tedious.

Page 21: Limitations and Future Improvements

Limitations:

• Thesaurus has only 60,000 terms classified into 3,900 semantic categories.

• Manual operation (ambiguity resolution relies on human).

• Similarity measures are too mechanical.

Future improvements:

• Need to update and incorporate frequently used terms/categories.

• Towards automation by using more AI such as NLP, GA and NN.

• More adaptive by rule-based or case-based reasoning.

Page 22: Data Mining and Knowledge Discovery

Knowledge Discovery: Data -> (Data Mining) -> Patterns -> (Interpretation) -> Knowledge

Page 23:

If q12 = 4 and
   q31 = 6 and
   q35 = 3
then q38 = 3

Page 24:

If h/h_income = 4 and
   city = 6 and
   car_owner = 3
then user = 3
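The step from the first rule to the second is a relabeling of questionnaire codes with variable names. A minimal sketch of that step is shown below; the `CODEBOOK` mapping is inferred from the position of the variables on the two slides, and the rule representation and function names are assumptions.

```python
# Code-to-label mapping inferred from the two rule slides (an assumption).
CODEBOOK = {
    "q12": "h/h_income",
    "q31": "city",
    "q35": "car_owner",
    "q38": "user",
}


def relabel(rule, codebook=CODEBOOK):
    """Replace coded variable names with codebook labels; unknown codes stay."""
    conditions = [(codebook.get(var, var), value) for var, value in rule["if"]]
    target_var, target_value = rule["then"]
    return {
        "if": conditions,
        "then": (codebook.get(target_var, target_var), target_value),
    }


def format_rule(rule):
    """Render a rule in the If ... then ... notation used on the slides."""
    conditions = " and ".join(f"{var} = {value}" for var, value in rule["if"])
    var, value = rule["then"]
    return f"If {conditions} then {var} = {value}"


raw_rule = {"if": [("q12", 4), ("q31", 6), ("q35", 3)], "then": ("q38", 3)}
print(format_rule(raw_rule))
print(format_rule(relabel(raw_rule)))
```

Even after relabeling, the values (4, 6, 3) remain numeric codes, so the rule is still hard to read without the questionnaire at hand.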

Page 25:

say(feature, [r1]).

Page 26:

r1: say(feature, [r1]).

The segment of respondents who are product X users is characterized by
residence in Shanghai,
consumption of brand Y cigarettes,
overseas travel in the past twelve months,
ownership of imported cars, and
high monthly household income.

Page 27:

say(general, [r1]).

say(likely, [r1]).

say(reason, [r1]).

Page 28:

r1: say(general, [r1]).

Basically, the respondents who are product X users have
residence in Shanghai,
consumption of brand Y cigarettes,
overseas travel in the past twelve months,
ownership of imported cars, and
high monthly household income.

Page 29:

r1: say(reason, [r1]).

The respondents who are product X users because they have
residence in Shanghai,
consumption of brand Y cigarettes,
overseas travel in the past twelve months,
ownership of imported cars, and
high monthly household income.

Page 30:

r1: say(likely, [r1]).

It is likely that the people who have
residence in Shanghai,
consumption of brand Y cigarettes,
overseas travel in the past twelve months,
ownership of imported cars, and
high monthly household income
are product X users.
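The four outputs above differ only in their sentence frame, which suggests a template-style realization step. The sketch below imitates the `say(Mode, [Rule])` queries in Python; the original system's implementation is not shown on the slides, so the rule store, the template wording (the `reason` template is lightly reworded into a complete sentence), and the function names are all assumptions.

```python
# Hypothetical rule store; the feature phrases are copied from the slides.
RULES = {
    "r1": {
        "segment": "product X users",
        "features": [
            "residence in Shanghai",
            "consumption of brand Y cigarettes",
            "overseas travel in the past twelve months",
            "ownership of imported cars",
            "high monthly household income",
        ],
    }
}

# One sentence frame per presentation mode (wording adapted from the slides).
TEMPLATES = {
    "feature": "The segment of respondents who are {segment} "
               "is characterized by {features}.",
    "general": "Basically, the respondents who are {segment} have {features}.",
    "reason": "The respondents are {segment} because they have {features}.",
    "likely": "It is likely that the people who have {features} are {segment}.",
}


def join_features(features):
    """Join phrases as 'a, b, and c' (or 'a and b' for two items)."""
    if len(features) <= 2:
        return " and ".join(features)
    return ", ".join(features[:-1]) + ", and " + features[-1]


def say(mode, rule_ids, rules=RULES):
    """Realize each rule as one sentence in the requested mode."""
    sentences = []
    for rule_id in rule_ids:
        rule = rules[rule_id]
        sentences.append(
            TEMPLATES[mode].format(
                segment=rule["segment"],
                features=join_features(rule["features"]),
            )
        )
    return " ".join(sentences)


for mode in ("feature", "general", "reason", "likely"):
    print(say(mode, ["r1"]))
```

Because each rule maps to exactly one sentence and the frames are fixed, a sketch like this involves no text planning: it cannot relate several rules to each other or vary the wording by context.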

Page 31: Limitations and Future Improvements

Limitations:

• Pre-defined syntactic category of code labels.

• Single sentence for each rule.

• Lack visualization.

• Almost no text planning.

• English only.

• Lack knowledge of explanation.

Future improvements:

• Automatic recognition of the syntax.

• Describe rule relationship in multiple coherent sentences.

• Text + graphics or even multimedia generation.

• Implement text planning.

• Multilingual.

• Implement NL techniques for explanation.

Page 32: Concluding Remarks

• NLP techniques are found useful in:
  – Verbatim text coding and
  – Data mining report generation.

• Group similar answer items.

• Write simple natural language text.

• A pricey technology because few tools are available.

Page 33: Natural Language Processing

Josef Siu-Wai Leung (j.leung@ieee.org)

Ching-Long Yeh (chingyeh@cse.ttit.edu.tw)