ISP 433/633 Week 9
NLP in IR
Natural Language Processing
• Simple Definition:
  – A study of how to use computers to do things with human languages.
• What to process?
  – Many levels
Levels of NLP
• Phonetics and Phonology
  – How do sounds map to words? (Speech recognition)
• Morphology
  – How do parts of words make up words? (Lexical analysis: friend+ly)
• Syntax
  – How do words and phrases combine to make bigger structures (sentences)? (Grammars, parsing)
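The morphology example (friend+ly) can be illustrated with a naive suffix splitter. This is only a minimal sketch: the suffix list and the `split_suffix` name are assumptions made for illustration, and a real lexical analyser would rely on a lexicon and rules rather than blind string matching.

```python
# Hypothetical, hand-picked suffix list; it is not taken from the slides.
SUFFIXES = ["ness", "ment", "ing", "ly", "ed", "s"]

def split_suffix(word, min_stem=3):
    """Split a word into (stem, suffix) using the first listed suffix that fits."""
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        # Require a minimum stem length so that short words such as "bus" stay whole.
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem, suffix
    return word, ""

print("+".join(split_suffix("friendly")))  # friend+ly
```

The minimum stem length is a crude guard against over-splitting; it is exactly the kind of decision a proper morphological lexicon would make instead.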
Levels of NLP (II)
• Semantics
  – What do words and their combinations mean? (Semantic analysis)
• Pragmatics
  – What goes on beyond semantics? (Sentence meaning + context, discourse analysis)
  – Example: “Can you open the window?”
Ambiguity in Natural Language
• Lexical ambiguity
  – For example, the English word ‘bank’ is ambiguous between ‘side of the river’ and ‘financial institution’.
• Syntactic ambiguity: the ambiguity in sentence structure
  – “John saw a woman with a telescope.”
• Semantic ambiguity
  – “Visiting relatives can be boring.”
Use of NLP in IR
• Phrase usage vs. individual terms
• Search expansion using related terms/concepts
  – Language reuse
• Information Extraction
• Question Answering
• User Interface
  – Speech recognition
  – Text-to-speech
Phrase for indexing
• Single terms are often inadequate for accurate discrimination
• Phrases offer an alternative
• For huge databases, use of phrasal terms becomes necessary
  – E.g.: “joint” and “venture” may each get a small weight separately
How to select a phrase
• Co-occurrence selection leads to high error rates
  – E.g.: The former Soviet President has been a local hero ever since a Russian tank invaded Wisconsin.
• Syntactic analysis is needed
  – TAGGER → PARSER → TERMS
Tagging
INPUT SENTENCE
The former Soviet President has been a local hero ever since a Russian tank invaded Wisconsin.
TAGGED SENTENCE
The/dt former/jj Soviet/jj President/nn has/vbz been/vbn a/dt local/jj hero/nn ever/rb since/in a/dt Russian/jj tank/nn invaded/vbd Wisconsin/np ./per
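To make the tagged format concrete, the sketch below reads the `word/tag` tokens and pairs each adjective with the noun it precedes, producing head+modifier terms. The pairing rule is an assumption used here as a simplified stand-in for the parser stage, and the function names are hypothetical.

```python
TAGGED = ("The/dt former/jj Soviet/jj President/nn has/vbz been/vbn a/dt "
          "local/jj hero/nn ever/rb since/in a/dt Russian/jj tank/nn "
          "invaded/vbd Wisconsin/np ./per")

def read_tagged(text):
    """Turn 'word/tag' tokens into (word, tag) pairs."""
    return [tuple(token.rsplit("/", 1)) for token in text.split()]

def head_modifier_pairs(tokens):
    """Pair every adjective (jj) with the next noun (nn) that follows it directly."""
    pairs, pending = [], []
    for word, tag in tokens:
        if tag == "jj":
            pending.append(word.lower())
        elif tag == "nn":
            pairs += [f"{word.lower()}+{adj}" for adj in pending]
            pending = []
        else:
            # Any other tag breaks the adjective run.
            pending = []
    return pairs

print(head_modifier_pairs(read_tagged(TAGGED)))
# ['president+former', 'president+soviet', 'hero+local', 'tank+russian']
```

Because the rule looks at tags rather than raw adjacency, it never proposes pairs such as "president+has", which a purely co-occurrence-based selection could.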
Parsing
[assert
[[perf [have]][[verb[BE]]
[subject [np[n PRESIDENT][t_pos THE]
[adj[FORMER]][adj[SOVIET]]]]
[adv EVER]
[sub_ord[SINCE [[verb[INVADE]]
[subject [np [n TANK][t_pos A]
[adj [RUSSIAN]]]]
[object [np [name [WISCONSIN]]]]]]]]]
Term Weighting
Term               Weight
president          2.623519
soviet             5.416102
president+soviet   11.556747
president+former   14.594883
hero               7.896426
hero+local         14.314775
invade             8.435012
tank               6.848128
tank+invade        17.402237
tank+russian       16.030809
russian            7.383342
wisconsin          7.785689
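The slides do not say how these weights were computed. Taking them as given, the sketch below shows why compound terms matter for ranking: a document that matches a phrase term scores well above one that matches only the separate words. The `score` function and its sum-of-weights scheme are assumptions for illustration, not the method used in the lecture.

```python
# Weights from the Term Weighting slide, with terms lowercased.
WEIGHTS = {
    "president": 2.623519, "soviet": 5.416102,
    "president+soviet": 11.556747, "president+former": 14.594883,
    "hero": 7.896426, "hero+local": 14.314775,
    "invade": 8.435012, "tank": 6.848128,
    "tank+invade": 17.402237, "tank+russian": 16.030809,
    "russian": 7.383342, "wisconsin": 7.785689,
}

def score(query_terms, doc_terms):
    """Sum the weights of the query terms that also occur in the document."""
    shared = set(query_terms) & set(doc_terms)
    return sum(WEIGHTS.get(term, 0.0) for term in shared)

query = ["tank", "russian", "tank+russian"]
doc_a = ["tank", "russian", "tank+russian"]  # contains the phrase "Russian tank"
doc_b = ["tank", "russian"]                  # both words occur, but not as a phrase

print(score(query, doc_a) > score(query, doc_b))  # True
```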
Language Reuse
NP: Yugoslav President Slobodan Milosevic
  – [description] Yugoslav President (phrase to be reused)
  – [entity] Slobodan Milosevic
Named entities
Richard Butler met Tareq Aziz Monday after rejecting Iraqi attempts to set deadlines for finishing his work.
Yitzhak Mordechai will meet Mahmoud Abbas at 7 p.m. (1600 GMT) in Tel Aviv after a 16-month-long impasse in peacemaking.
Sinn Fein deferred a vote on Northern Ireland's peace deal Sunday.
Hundreds of troops patrolled Dili on Friday during the anniversary of Indonesia's 1976 annexation of the territory.
Entities + Descriptions
Chief U.N. arms inspector Richard Butler met Iraq’s Deputy Prime Minister Tareq Aziz Monday after rejecting Iraqi attempts to set deadlines for finishing his work.
Israel's Defense Minister Yitzhak Mordechai will meet senior Palestinian negotiator Mahmoud Abbas at 7 p.m. (1600 GMT) in Tel Aviv after a 16-month-long impasse in peacemaking.
Sinn Fein, the political wing of the Irish Republican Army, deferred a vote on Northern Ireland's peace deal Sunday.
Hundreds of troops patrolled Dili, the Timorese capital, on Friday during the anniversary of Indonesia's 1976 annexation of the territory.
Multiple descriptions per entity
Profile for Bill Clinton
• U.S. President
• President
• An Arkansas native
• Democratic presidential candidate
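A profile of this kind can be thought of as a mapping from an entity to the descriptions seen next to it. The sketch below collects descriptions from appositives of the form "Name, the ..., "; the regular expression is an assumed, deliberately narrow pattern, and the two sentences are shortened versions of the examples above.

```python
import re
from collections import defaultdict

# Assumed pattern: a capitalised name followed by an appositive "the ... ,".
APPOSITIVE = re.compile(r"([A-Z]\w+(?: [A-Z]\w+)*), (the [^,]+),")

def build_profiles(sentences):
    """Map each entity to the list of distinct descriptions found next to it."""
    profiles = defaultdict(list)
    for sentence in sentences:
        for entity, description in APPOSITIVE.findall(sentence):
            if description not in profiles[entity]:
                profiles[entity].append(description)
    return dict(profiles)

sentences = [
    "Sinn Fein, the political wing of the Irish Republican Army, deferred a vote.",
    "Hundreds of troops patrolled Dili, the Timorese capital, on Friday.",
]
print(build_profiles(sentences))
```

Pre-modifier descriptions such as "Chief U.N. arms inspector Richard Butler" would need a different pattern, or the tagger and parser output, which this sketch does not attempt.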
Information Extraction (IE)
• To extract information that fits pre-defined database schemas or templates
• Given a text, answer:
  – What happened?
  – When?
  – Where?
  – What was the outcome?
  – Who?
  – …
IE example
Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan.
The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.
ACTIVITY-1
  Activity: PRODUCTION
  Company: “Bridgestone Sports Taiwan Co.”
  Product: “iron and ‘metal wood’ clubs”
  Start Date: DURING: January 1990
TIE-UP-1
  Relationship: TIE-UP
  Entities: “Bridgestone Sports Co.”, “a local concern”, “a Japanese trading house”
  Joint Venture Company: “Bridgestone Sports Taiwan Co.”
  Activity: ACTIVITY-1
  Amount: NT$20000000
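The two filled templates can be represented as linked records. The sketch below uses Python dataclasses; the class names `Activity` and `TieUp` are assumptions, while the field names follow the slots of the templates above. The values are entered by hand here, whereas an IE system would fill them from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Activity:
    activity: str
    company: str
    product: str
    start_date: str

@dataclass
class TieUp:
    entities: List[str]
    joint_venture_company: str
    activity: Optional[Activity] = None
    amount: Optional[str] = None
    relationship: str = "TIE-UP"

activity_1 = Activity(
    activity="PRODUCTION",
    company="Bridgestone Sports Taiwan Co.",
    product="iron and 'metal wood' clubs",
    start_date="DURING: January 1990",
)
tie_up_1 = TieUp(
    entities=["Bridgestone Sports Co.", "a local concern", "a Japanese trading house"],
    joint_venture_company="Bridgestone Sports Taiwan Co.",
    activity=activity_1,      # TIE-UP-1 points at ACTIVITY-1
    amount="NT$20000000",     # 20 million new Taiwan dollars, per the article text
)
print(tie_up_1.activity.start_date)  # DURING: January 1990
```

Fixing the slots in advance is what makes the output database-ready, and it is also why a template written for joint ventures cannot capture a different kind of event.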
Another text
………. Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the second floor of his Nanjing home early on Sunday. The deputy general manager of Yaxing Benz, a Sino-German joint venture that makes buses and bus chassis in nearby Yangzhou, was hacked to death with 45 cm watermelon knives. ……….
Name of the Venture: Yaxing Benz
Products: buses and bus chassis
Location: Yangzhou, China
Companies involved:
  (1) Name: X? Country: Germany
  (2) Name: Y? Country: China
Need another template
Crime-Type: Murder
Type: Stabbing
The killed:
  Name: Jurgen Pfrang
  Age: 51
  Profession: Deputy general manager
Location: Nanjing, China
Limitations of IE
• Templates are hand-crafted by human experts
• Templates are domain dependent and not easily portable
• Templates = pre-defined questions
Q: When did Nelson Mandela become president of South Africa?
A: 10 May 1994
Q: How tall is the Matterhorn?
A: The institute revised the Matterhorn's height to 14,776 feet 9 inches.
Q: How tall is the replica of the Matterhorn at Disneyland?
A: In fact he has climbed the 147-foot Matterhorn at Disneyland every weekend for the last 3 1/2 years.
Question answering
• TREC Question answering track
• Eight years
• Corpus: texts and questions
Example systems
• AskJeeves is probably the best-known example
• AnswerBus is an open-domain question answering system
• IONAUT, EasyAsk, AnswerLogic, AnswerFriend, START, LCC, QuASM, Mulder, Webclopedia, etc.
Type of Answer
QA System     Output
AnswerBus     Sentences
AskJeeves     Documents
IONAUT        Passages
LCC           Sentences
Mulder        Extracted answers
QuASM         Document blocks
START         Mixture
Webclopedia   Sentences
Steps
• Usually IR methods are used to retrieve candidate documents
• NLP techniques are used to extract the likely answers from the text of the documents
  – Question type
  – Named entity extraction
  – Co-reference resolution
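The two steps can be sketched end to end on a toy scale. Below, word overlap stands in for the IR step and a regular expression per question type stands in for the NLP extraction step. Everything here is an assumption for illustration: the patterns, the function names, and the candidate sentences, which are adapted from the Matterhorn examples except for the Mandela sentence, which was constructed for this sketch.

```python
import re

CANDIDATES = [
    "The institute revised the Matterhorn's height to 14,776 feet 9 inches.",
    "He has climbed the 147-foot Matterhorn at Disneyland every weekend.",
    "Nelson Mandela became president of South Africa on 10 May 1994.",  # constructed
]

# Hypothetical answer patterns keyed by the opening words of the question.
ANSWER_PATTERNS = {
    "how tall": re.compile(r"\d[\d,]*[ -](?:feet|foot)"),
    "when": re.compile(r"\d{1,2} [A-Z][a-z]+ \d{4}"),
}

def words(text):
    """Lowercased set of alphabetic words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def answer(question):
    """Step 1: rank candidates by word overlap. Step 2: extract a typed answer."""
    q = question.lower()
    pattern = next((p for key, p in ANSWER_PATTERNS.items() if q.startswith(key)), None)
    if pattern is None:
        return None
    ranked = sorted(CANDIDATES,
                    key=lambda s: len(words(question) & words(s)),
                    reverse=True)
    for sentence in ranked:
        match = pattern.search(sentence)
        if match:
            return match.group(0)
    return None

print(answer("How tall is the replica of the Matterhorn at Disneyland?"))  # 147-foot
```

The Disneyland question retrieves the second sentence because "at" and "Disneyland" raise its overlap, which mirrors how the extra context in a question steers retrieval before extraction starts.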
Type of Question
QA-Token    Question type        Example
PLACE$      Where                In the Rocky Mountains
COUNTRY$    Where/What country   United Kingdom
STATE$      Where/What state     Massachusetts
PERSON$     Who                  Albert Einstein
ROLE$       Who                  Doctor
NAME$       Who/What/Which       The Shakespeare Festival
ORG$        Who/What             The US Post Office
DURATION$   How long             For 5 centuries
AGE$        How old              30 years old
YEAR$       When/What year       1999
TIME$       When                 In the afternoon
DATE$       When/What date       July 4th, 1776
VOLUME$     How big              3 gallons
AREA$       How big              4 square inches
LENGTH$     How big/long/high    3 miles
WEIGHT$     How big/heavy        25 tons
NUMBER$     How many             1,234.5
METHOD$     How                  By rubbing
RATE$       How much             50 per cent
MONEY$      How much             4 million dollars
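Read as a lookup, the table maps the opening words of a question to one or more candidate QA tokens. The sketch below transcribes it as an ordered rule list; the `candidate_tokens` name and the prefix-matching strategy are assumptions, and a real system would need further evidence to choose among several candidates.

```python
# Question openers and candidate QA tokens, transcribed from the table above.
# Longer openers come first so that "how long" is tried before plain "how".
RULES = [
    ("what country", ["COUNTRY$"]),
    ("what state", ["STATE$"]),
    ("what year", ["YEAR$"]),
    ("what date", ["DATE$"]),
    ("how long", ["DURATION$", "LENGTH$"]),
    ("how old", ["AGE$"]),
    ("how big", ["VOLUME$", "AREA$", "LENGTH$", "WEIGHT$"]),
    ("how high", ["LENGTH$"]),
    ("how heavy", ["WEIGHT$"]),
    ("how many", ["NUMBER$"]),
    ("how much", ["RATE$", "MONEY$"]),
    ("where", ["PLACE$", "COUNTRY$", "STATE$"]),
    ("when", ["YEAR$", "TIME$", "DATE$"]),
    ("who", ["PERSON$", "ROLE$", "NAME$", "ORG$"]),
    ("which", ["NAME$"]),
    ("what", ["NAME$", "ORG$"]),
    ("how", ["METHOD$"]),
]

def candidate_tokens(question):
    """Return the QA tokens whose question type matches the question's opener."""
    q = question.lower().strip()
    for opener, tokens in RULES:
        if q.startswith(opener):
            return tokens
    return []

print(candidate_tokens("How much does it cost?"))  # ['RATE$', 'MONEY$']
```

Several openers remain ambiguous ("Where" alone allows three tokens), so the answer type is only narrowed at this stage, not decided.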
AnswerBus
Automatic Speech Recognition
• Hands-free input of query
• Usually need to train the ASR engine
  – User speaks known text to engine
• Type
  – Grammar based
    • Controlled vocabulary, defined sentence structure, only for small domains
    • More accurate
  – Free dictation
    • Less accurate
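One way to see why grammar-based recognition is more accurate in small domains: the recogniser only has to choose among the sentences the grammar can generate. The sketch below is not a speech recogniser. It assumes a noisy text hypothesis is already available and snaps it to the closest legal sentence using `difflib` from the standard library. The vocabulary, the three-slot sentence structure, and the `constrain` name are all hypothetical.

```python
import difflib
from itertools import product

# Hypothetical controlled vocabulary and sentence structure: VERB + FIELD + VALUE.
VERBS = ["find", "show"]
FIELDS = ["papers about", "authors of"]
VALUES = ["parsing", "tagging", "speech recognition"]

# Every sentence the toy grammar can generate (2 * 2 * 3 = 12 sentences).
LEGAL = [" ".join(parts) for parts in product(VERBS, FIELDS, VALUES)]

def constrain(hypothesis):
    """Return the legal sentence closest to a noisy hypothesis, or None."""
    matches = difflib.get_close_matches(hypothesis.lower(), LEGAL, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(constrain("fine papers about passing"))
```

Free dictation has no such list to fall back on, so every misheard word survives into the query.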
Text To Speech
• Usually for vision-impaired users
• Demo: http://www.research.att.com/projects/tts/demo.html