Natural Language Processing Verbatim Text Coding and Data Mining Report Generation Josef S.W. Leung...


Natural Language Processing

Verbatim Text Coding and Data Mining Report Generation

Josef S.W. Leung (j.leung@ieee.org)

Ching-Long Yeh (chingyeh@cse.ttit.edu.tw)

NLP: One of the Top Priority Funding Items in Computer Science Research -- National Natural Science Foundation, China

Natural Language Processing

[Figure: Language use as Listen (Understand) and Speak (Generate). Analysis/Understanding maps Natural Language to Internal Representations; Generation maps Internal Representations back to Natural Language.]

Outline of Presentation

• NLP Introduction
– Natural Language Analysis/Understanding
– Natural Language Generation

• Case 1: Verbatim Text Coding
– May need NL analysis techniques

• Case 2: Data Mining Report Generation
– May need NL generation techniques

Modules of NL Understanding

Input sentence → Pre-processing → Tokens → Parsing → Syntactic structure → Semantic Interpretation → Semantic representation → Contextual Interpretation → Knowledge representation

Parsing for Syntactic Analysis

Grammar Rules:

S → NP + VP
NP → ART + N
VP → V + NP

Lexicon:

N → dog
N → cat
V → chased
ART → the

Syntactic Structure of "the dog chased the cat":

(S (NP (ART the) (N dog)) (VP (V chased) (NP (ART the) (N cat))))
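The grammar rules and lexicon above are small enough to run directly. The following is a minimal sketch of a top-down (recursive-descent) parser in Python, not the parser used by the authors. The tuple-based tree format and the function names `parse` and `parse_sentence` are assumptions made for illustration.

```python
# Toy top-down parser for the three grammar rules and four lexicon entries above.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["ART", "N"]],
    "VP": [["V", "NP"]],
}
LEXICON = {"dog": "N", "cat": "N", "chased": "V", "the": "ART"}


def parse(symbol, tokens, pos):
    """Try to parse `symbol` starting at tokens[pos].

    Returns (tree, next_pos) on success, or None on failure.
    """
    if symbol not in GRAMMAR:
        # Lexical category: the next word must belong to it.
        if pos < len(tokens) and LEXICON.get(tokens[pos]) == symbol:
            return (symbol, tokens[pos]), pos + 1
        return None
    for rhs in GRAMMAR[symbol]:
        children, cur = [], pos
        for child in rhs:
            result = parse(child, tokens, cur)
            if result is None:
                break
            subtree, cur = result
            children.append(subtree)
        else:
            # Every symbol on the right-hand side was matched.
            return (symbol, *children), cur
    return None


def parse_sentence(sentence):
    """Return the syntactic structure of `sentence`, or None if it is not an S."""
    tokens = sentence.lower().split()
    result = parse("S", tokens, 0)
    if result is None or result[1] != len(tokens):
        return None
    return result[0]


tree = parse_sentence("the dog chased the cat")
print(tree)
```

The printed tuple mirrors the tree on the slide: an S made of an NP ("the dog") and a VP whose object is a second NP ("the cat"). Strings the grammar does not cover, such as "the dog chased", return None.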

Structural Ambiguity

• Time flies like an arrow.
– The passage of time is as quick as an arrow.
– A species of flies called ‘time flies’ enjoy an arrow.

Structural Ambiguity

• The man saw the girl with the telescope.
– The man saw the girl who possessed the telescope.
– The man saw the girl with the aid of the telescope.
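The two readings of the telescope sentence correspond to two parse trees: the prepositional phrase attaches either to "the girl" or to "saw". As a rough sketch, a CKY-style chart can count the trees. The grammar below is an assumption made for illustration (the slides give no rules for prepositional phrases), and `count_parses` is a hypothetical helper name.

```python
from collections import defaultdict

# Assumed binary grammar; the PP rules are not on the slides.
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("NP", "ART", "N"),
    ("NP", "NP", "PP"),  # "the girl with the telescope"
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),  # "saw ... with the telescope"
    ("PP", "P", "NP"),
]
LEXICON = {
    "the": "ART", "man": "N", "girl": "N", "telescope": "N",
    "saw": "V", "with": "P",
}


def count_parses(sentence):
    """Count the distinct S trees over the whole sentence with a CKY chart."""
    tokens = sentence.lower().rstrip(".").split()
    n = len(tokens)
    # chart[(i, j)][cat] = number of trees of category `cat` spanning tokens[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        chart[(i, i + 1)][LEXICON[word]] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    chart[(i, j)][parent] += chart[(i, k)][left] * chart[(k, j)][right]
    return chart[(0, n)]["S"]


print(count_parses("The man saw the girl with the telescope."))
print(count_parses("The man saw the girl."))
```

Under this grammar the first sentence has two parses and the second has one. Choosing between the two readings needs semantic or contextual knowledge that a purely syntactic parser does not have.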

Natural Language Generation

[Figure: generation architecture. The User’s Goal enters the Strategic Component (Text Planning); its output passes to the Tactical Component (Linguistic Realization), which produces Surface Sentences. Knowledge sources shown: Domain KB, Planning Operators, User Model, Discourse Model, Linguistic Rules & Lexicon.]

Unification Grammar

Example: the man sees a sheep

Rules:

S [numb=X, tense=T] → NP [numb=X] VP [numb=X, tense=T]
VP [numb=N, tense=M] → V [numb=N, tense=M] NP
NP [numb=Y] → det [numb=Y] noun [numb=Y]

Lexicon:

man : noun [numb = sing]
a : det [numb = sing]
the : det
sheep : noun
sees : [tense = pres, numb = sing]
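Number agreement in the rules above can be illustrated with a very small unifier. This is a sketch under simplifying assumptions: feature structures are flat Python dicts, the three rules are hard-coded as functions, "sees" is assumed to have category V (the slide omits its category), and "men" is a hypothetical extra entry added only to show a failed unification.

```python
def unify(a, b):
    """Unify two flat feature dicts; return the merged dict, or None on a clash."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            return None
        merged[key] = value
    return merged


def only(feats, *keys):
    """Project a feature dict onto the listed feature names."""
    return {k: feats[k] for k in keys if k in feats}


LEXICON = {
    "man": ("noun", {"numb": "sing"}),
    "a": ("det", {"numb": "sing"}),
    "the": ("det", {}),    # unspecified number: unifies with anything
    "sheep": ("noun", {}),
    "sees": ("V", {"tense": "pres", "numb": "sing"}),  # category V is assumed
    "men": ("noun", {"numb": "plur"}),                 # hypothetical, not on the slide
}


def noun_phrase(det, noun):
    """NP[numb=Y] -> det[numb=Y] noun[numb=Y]"""
    det_cat, det_feats = LEXICON[det]
    noun_cat, noun_feats = LEXICON[noun]
    if (det_cat, noun_cat) != ("det", "noun"):
        return None
    return unify(only(det_feats, "numb"), only(noun_feats, "numb"))


def verb_phrase(verb, object_np):
    """VP[numb=N, tense=M] -> V[numb=N, tense=M] NP (the object NP is unconstrained)"""
    verb_cat, verb_feats = LEXICON[verb]
    if verb_cat != "V" or object_np is None:
        return None
    return only(verb_feats, "numb", "tense")


def sentence(words):
    """S[numb=X, tense=T] -> NP[numb=X] VP[numb=X, tense=T] for det-noun-V-det-noun input."""
    if len(words) != 5 or any(w not in LEXICON for w in words):
        return None
    subject = noun_phrase(words[0], words[1])
    vp = verb_phrase(words[2], noun_phrase(words[3], words[4]))
    if subject is None or vp is None:
        return None
    return unify(subject, vp)


print(sentence("the man sees a sheep".split()))
print(sentence("the men sees a sheep".split()))
```

The first call succeeds with numb=sing and tense=pres on S. The second fails because the plural subject cannot unify with the singular verb. Entries with no numb feature, such as "the" and "sheep", are compatible with either value.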

Migraine abortive treatment is used to abort migraine.

((cat clause)
 (process ((lex “use”) (type material)))
 (partic ((affected ((cat proper)
                     (lex “migraine abortive treatment”)))
          (agent none)))
 (circum ((purpose ((cat clause)
                    (keep-in-order no) (keep-for no)
                    (position end)
                    (process ((lex “abort”)
                              (effect-type creative)
                              (type material)))
                    (partic ((created ((lex “migraine”)
                                       (countable no)
                                       (cat common))))))))))

Verbatim Text Coding

• A text content classification problem.

• Group semantically similar answer items.

• Develop a code list/tree to represent the answer item groups.

• Simple NL analysis techniques may help.

• Details will be given in the first example of NLP application.

Data Mining Report Generation

• Data mining results are usually in rule or tree formats with obscure notations.

• NL generation techniques may help translate the data mining results into plain natural language.

• Details will be given in the second example of NLP application.

Codia for Verbatim Text Coding

[Screenshot labels: Answer Items; Code Tree]

• Small screen/window/text

• Long list of answer items

• Difficult to browse/view

• Worse than paper form

Codia for Verbatim Text Coding

[Screenshot labels: Key Terms; Ranking Answers by Similarity; Items with similar meaning]

Text Similarity Measures

[Figure: three measures (String, Semantics, Coverage) combine into a Text Similarity Score.]
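The slides do not say how the three measures are computed or combined. The sketch below is one plausible reading, not Codia's actual method: string similarity from Python's `difflib`, semantic similarity as overlap of thesaurus categories, and coverage as the fraction of key terms found in an answer. The mini-thesaurus, the equal weights, and all function names are assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical mini-thesaurus: term -> semantic category.
THESAURUS = {
    "cheap": "PRICE", "inexpensive": "PRICE", "price": "PRICE",
    "tasty": "TASTE", "delicious": "TASTE", "taste": "TASTE",
}


def words(text):
    """Lower-case tokens with surrounding punctuation stripped."""
    stripped = (w.strip(".,!?;:").lower() for w in text.split())
    return [w for w in stripped if w]


def string_score(a, b):
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def semantic_score(a, b):
    """Jaccard overlap of the semantic categories found in each text."""
    cats_a = {THESAURUS[w] for w in words(a) if w in THESAURUS}
    cats_b = {THESAURUS[w] for w in words(b) if w in THESAURUS}
    if not cats_a or not cats_b:
        return 0.0
    return len(cats_a & cats_b) / len(cats_a | cats_b)


def coverage_score(key_terms, answer):
    """Fraction of the key terms that occur in the answer."""
    if not key_terms:
        return 0.0
    present = set(words(answer))
    return sum(term in present for term in key_terms) / len(key_terms)


def similarity(reference, answer, key_terms, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted combination of the three measures (equal weights are assumed)."""
    parts = (
        string_score(reference, answer),
        semantic_score(reference, answer),
        coverage_score(key_terms, answer),
    )
    return sum(w * p for w, p in zip(weights, parts))


def rank(reference, answers, key_terms):
    """Order answer items from most to least similar to the reference."""
    return sorted(answers, key=lambda a: similarity(reference, a, key_terms), reverse=True)


answers = ["It is inexpensive", "Very tasty", "The price is cheap"]
for item in rank("cheap price", answers, ["cheap", "price"]):
    print(item)
```

With the reference "cheap price", the answer that shares both key terms ranks first. "It is inexpensive" ranks second on the strength of the shared PRICE category alone, which is the kind of match a pure string comparison would miss.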

Codia for Verbatim Text Coding

• A user interface for classifying answer items by drag-and-drop actions.

• NLP reduces time and effort in searching, browsing, and selecting multiple answer items for classification.

• There are still limitations, and the process is not fully automated.

Technical Issues of Codia

• Improve user interface.

• Use only simple NLP techniques.

• Ambiguity resolution by human.

• Limited by thesaurus.

• Still cannot handle negation (‘not’).

• Knowledge engineering is tedious.

Limitations and Future Improvements

Limitations:

• Thesaurus has only 60,000 terms classified into 3900 semantic categories.

• Manual operation (ambiguity resolution relies on human).

• Similarity measures are too mechanical.

Future improvements:

• Need to update and incorporate frequently used terms/categories.

• Towards automation by using more AI such as NLP, GA and NN.

• More adaptive by rule-based or case-based reasoning.

Data Mining and Knowledge Discovery

[Figure: Data → Data Mining → Patterns → Interpretation → Knowledge; the overall process is labelled Knowledge Discovery.]

If q12 = 4 and q31 = 6 and q35 = 3 then q38 = 3

If h/h_income = 4 and city = 6 and car_owner = 3 then user = 3
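The step from the first rule to the second is a relabelling of questionnaire codes with variable names. A minimal sketch, assuming the rule is held as a Python dict (the slides show only its text form); `RULE`, `LABELS` and `format_rule` are illustrative names:

```python
# Variable labels as paired on the slide: questionnaire code -> readable name.
LABELS = {"q12": "h/h_income", "q31": "city", "q35": "car_owner", "q38": "user"}

# Hypothetical in-memory form of the mined rule.
RULE = {"if": [("q12", 4), ("q31", 6), ("q35", 3)], "then": ("q38", 3)}


def format_rule(rule, labels=None):
    """Render a rule as text, replacing codes with labels where one is known."""
    labels = labels or {}

    def term(pair):
        name, value = pair
        return f"{labels.get(name, name)} = {value}"

    conditions = " and ".join(term(p) for p in rule["if"])
    return f"If {conditions} then {term(rule['then'])}"


print(format_rule(RULE))          # raw codes
print(format_rule(RULE, LABELS))  # readable variable names
```

The values (4, 6, 3) remain coded at this stage. Turning them into phrases such as "high monthly household income" would need a second table of value labels, which the slides do not show.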

say(feature, [r1]).

The segment of respondents who are product X users is characterized by
residence in Shanghai,
consumption of brand Y cigarettes,
overseas travel in the past twelve months,
ownership of imported cars, and
high monthly household income.

say(general, [r1]).

Basically, the respondents who are product X users have
residence in Shanghai,
consumption of brand Y cigarettes,
overseas travel in the past twelve months,
ownership of imported cars, and
high monthly household income.

say(reason, [r1]).

The respondents who are product X users because they have
residence in Shanghai,
consumption of brand Y cigarettes,
overseas travel in the past twelve months,
ownership of imported cars, and
high monthly household income.

say(likely, [r1]).

It is likely that the people who have
residence in Shanghai,
consumption of brand Y cigarettes,
overseas travel in the past twelve months,
ownership of imported cars, and
high monthly household income
are product X users.
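The four outputs differ only in their sentence frame, so a template-based realizer is enough to reproduce them. The sketch below is an illustration in Python, not the authors' generator. The rule representation, the template strings (copied from the sample outputs as worded on the slides) and the helper names are assumptions.

```python
# Hypothetical representation of rule r1, using the phrases from the sample outputs.
RULES = {
    "r1": {
        "features": [
            "residence in Shanghai",
            "consumption of brand Y cigarettes",
            "overseas travel in the past twelve months",
            "ownership of imported cars",
            "high monthly household income",
        ],
        "segment": "product X users",
    }
}

# One sentence frame per style; wording follows the slides.
TEMPLATES = {
    "feature": "The segment of respondents who are {segment} is characterized by {features}.",
    "general": "Basically, the respondents who are {segment} have {features}.",
    "reason": "The respondents who are {segment} because they have {features}.",
    "likely": "It is likely that the people who have {features} are {segment}.",
}


def join_features(items):
    """Join phrases as 'a', 'a and b', or 'a, b, and c'."""
    if len(items) <= 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


def say(style, rule_ids):
    """Return one sentence per rule id, rendered in the requested style."""
    sentences = []
    for rule_id in rule_ids:
        rule = RULES[rule_id]
        sentences.append(
            TEMPLATES[style].format(
                segment=rule["segment"],
                features=join_features(rule["features"]),
            )
        )
    return sentences


for style in ("feature", "general", "reason", "likely"):
    print(say(style, ["r1"])[0])
```

Each call yields exactly one sentence per rule and fixes the word class of every phrase in advance, which matches the limitations listed next: pre-defined syntactic categories, a single sentence for each rule, and almost no text planning.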

Limitations and Future Improvements

Limitations:

• Pre-defined syntactic category of code labels.

• Single sentence for each rule.

• Lack visualization.

• Almost no text planning.

• English only.

• Lack knowledge of explanation.

Future improvements:

• Automatic recognition of the syntax.

• Describe rule relationship in multiple coherent sentences.

• Text + graphics or even multimedia generation.

• Implement text planning.

• Multilingual.

• Implement NL techniques for explanation.

Concluding Remarks

• NLP techniques are found useful in:
– Verbatim text coding and
– Data mining report generation.

• Group similar answer items.

• Write simple natural language text.

• A pricey technology because few tools are available.

Natural Language Processing

Josef Siu-Wai Leung (j.leung@ieee.org)

Ching-Long Yeh (chingyeh@cse.ttit.edu.tw)
