LING 001 Introduction to Linguistics, Spring 2010
Syntactic parsing and Part-of-Speech tagging
Apr. 5
Computational linguistics
Computational linguistics
• Syntax, semantics, grammar, and the lexicon; lexical semantics and ontologies; phonology/morphology, word segmentation, and tagging; summarization; language generation; paraphrasing and textual entailment; parsing and chunking; spoken language processing, understanding, and speech-to-speech translation; linguistic, psychological, and mathematical models of language; computational pragmatics; dialogue and conversational agents; computational models of discourse
• Information retrieval; question answering; word sense disambiguation; information extraction and text mining; semantic role labeling; sentiment analysis and opinion mining; corpus-based modeling of language; machine translation and translation aids; multilingual processing; multimodal systems and representations; statistical and machine learning methods; applications; corpus development and language resources; evaluation methods and user studies
Computational linguistics
• Emphasis on integrating linguistic and other knowledge to produce working systems.
• System performance is important: computational linguistics deals with language as it is actually used.
  • Little need to worry about rare constructions and distinctions;
  • need to worry about fragments, typos, false starts, ambiguities, non-native speakers, etc.
• Ambiguity in natural language is pervasive, which makes computational linguistics hard.
Ambiguity
• Lexical:
  • bank (financial institution vs. river bank)
  • unlockable ([un-lock]-able, “able to be unlocked”, vs. un-[lockable], “unable to be locked”)
• Syntactic:
  • I shot an elephant in my pajamas. (How he got in my pajamas, I'll never know.)
  • I forgot how good beer tastes.
  • I met Mary and Elena’s mother at the mall yesterday.
• Semantic:
  • Every cat chases a mouse. (a single shared mouse, or a possibly different mouse per cat?)
  • The police refused the demonstrators a permit because ...
    ... they feared violence. (they = the police)
    ... they advocated violence. (they = the demonstrators)
Parsing
• Parsing: taking an input and producing some sort of structure for it.
• A syntactic parser is a device (or algorithm) that takes a phrase or sentence as input, and uses a grammar (including a lexicon) to produce the syntactic structure(s) appropriate for that phrase or sentence (often called parse trees or just trees).
Context-free grammar
• The type of grammar often applied in parsing is known as a context-free grammar.
• A context-free grammar is a set of rules/productions (and a lexicon) that specify how a syntactic constituent can be composed of smaller constituents. (The term “context-free” means that how a constituent is expanded does not depend on what other constituents are around it.)
Context-free grammar
• The symbols for constituents (e.g., phrases and sentences) are called non-terminal symbols. Those representing words are called terminal symbols.
• Each rule has a single non-terminal symbol on the left hand side of the arrow. This symbol is expanded into the symbols (non-terminal or terminal) on the right hand side. The non-terminal symbols on the right hand side can then be expanded by other rules.
• The vertical stroke | is just a shorthand for alternative expansions.
• The grammar “accepts” a sentence if there is a way of expanding S (the start symbol), then expanding all the sub-constituents, and so on, until the leaves of the tree match the words in the sentence (which are terminal symbols). If we want to accept noun phrases instead, we can treat NP as the start symbol.
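As a concrete illustration, a context-free grammar can be written down as a small Python table of expansions; the rules and words below are invented toy examples, not a grammar from the lecture. The recognizer simply expands the start symbol in every possible way and checks whether any expansion matches the input, which is exactly the “accepts” criterion above (and far too slow for anything beyond toy grammars):

```python
# A toy context-free grammar, sketched as a Python dict (hypothetical rules
# and lexicon).  The | alternatives become separate right-hand sides.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"], ["V"]],
    "Det": [["the"], ["a"]],
    "N":   [["flight"], ["meal"]],
    "V":   [["includes"], ["departs"]],
}

def expand(symbol, depth=6):
    """Yield every terminal string derivable from `symbol` within `depth` steps."""
    if symbol not in GRAMMAR:          # terminal symbol: a word
        yield [symbol]
        return
    if depth == 0:
        return
    for rhs in GRAMMAR[symbol]:
        # Expand each symbol on the right-hand side and combine the results.
        partials = [[]]
        for sym in rhs:
            partials = [p + s for p in partials for s in expand(sym, depth - 1)]
        yield from partials

def accepts(words, start="S"):
    """The grammar accepts a sentence if some expansion of S matches it."""
    return any(s == words for s in expand(start))
```

With this grammar, `accepts("the flight includes a meal".split())` holds because S can be expanded all the way down to exactly those words.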
Parsing
• Parsing runs a grammar “backwards”: given a sentence, find the structure(s) the grammar assigns to it. Parsing can be viewed as a search problem.
• Top-down strategy: All the expansions of the start symbol are considered, then expansions of each of those constituents, and so on, until we reach expansions that match all the words in the sentence. (What are the problems?)
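A minimal top-down (recursive-descent) recognizer can be sketched as follows, with a hypothetical toy grammar and lexicon. It also hints at the strategy's problems: expansions are proposed before any words are checked, so many dead ends get explored, and a left-recursive rule such as NP -> NP PP would loop forever:

```python
# Top-down recognition: keep expanding the leftmost non-terminal and only
# consult the input words at the leaves.  Grammar and lexicon are invented.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"]],
}
LEXICON = {"the": "Det", "a": "Det", "flight": "N", "meal": "N", "includes": "V"}

def parse(symbols, words):
    """Can the symbol sequence derive exactly this word sequence?"""
    if not symbols:
        return not words                # success iff all words are consumed
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:                # non-terminal: try each expansion
        return any(parse(rhs + rest, words) for rhs in GRAMMAR[first])
    # pre-terminal: must match the category of the next word
    return bool(words) and LEXICON.get(words[0]) == first and parse(rest, words[1:])

print(parse(["S"], "the flight includes a meal".split()))  # True
```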
Parsing
• Bottom-up strategy: The words are examined and all the small constituents that might contain them are postulated; then we see which of those can be fitted together into larger constituents, and so on, until we reach a tree. (What are the problems?)
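One concrete bottom-up algorithm (not necessarily the one the slides picture) is CYK, sketched below for a hypothetical toy grammar in Chomsky normal form (every rule rewrites to exactly two constituents, or a word). It first tags the words, then builds ever-wider constituents out of adjacent smaller ones:

```python
# CYK recognition over a toy grammar (invented rules and lexicon).
RULES = {("NP", "VP"): "S", ("Det", "N"): "NP", ("V", "NP"): "VP"}
LEXICON = {"the": "Det", "a": "Det", "flight": "N", "meal": "N", "includes": "V"}

def cyk(words, start="S"):
    n = len(words)
    # chart[i][j] = set of constituent labels spanning words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):          # smallest constituents: the words
        chart[i][i + 1].add(LEXICON[w])
    for width in range(2, n + 1):          # then wider and wider spans
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):      # every split point
                for left in chart[i][k]:
                    for right in chart[k][j]:
                        if (left, right) in RULES:
                            chart[i][j].add(RULES[(left, right)])
    return start in chart[0][n]
```

One weakness this makes visible: the chart postulates many small constituents (e.g. every Det N pair becomes an NP) that never make it into the final tree.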
Parsing
• The left-corner strategy (top-down prediction with bottom-up verification): Make the left-most expansion (top-down), find rules that handle the left-most words (bottom-up), repeat the procedure.
• Does this flight include a meal?
Probabilistic CFGs and Statistical Parsing
• Attach probabilities to context-free grammar rules (PCFG): the probabilities of the expansions for a given non-terminal sum to 1.
• Goal: find a single parse tree (the max probability tree) for a sentence instead of all possible parse trees.
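Under a PCFG, the probability of a parse tree is simply the product of the probabilities of the rules used in it, so comparing candidate trees reduces to comparing products. A sketch with invented rule probabilities (note that each non-terminal's expansions sum to 1):

```python
# Rule probabilities for a toy PCFG (made-up numbers, for illustration).
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.6, ("NP", ("N",)): 0.4,
    ("VP", ("V", "NP")): 0.7, ("VP", ("V",)): 0.3,
}

def tree_prob(tree):
    """Tree = (label, child, ...); a pre-terminal has a single word string as child."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return 1.0      # pre-terminal -> word; lexical probabilities omitted here
    rhs = tuple(child[0] for child in children)
    p = RULE_PROB[(label, rhs)]
    for child in children:
        p *= tree_prob(child)
    return p

t = ("S",
     ("NP", ("Det", "the"), ("N", "flight")),
     ("VP", ("V", "departs")))
print(tree_prob(t))     # 1.0 * 0.6 * 0.3, i.e. approximately 0.18
```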
Probabilistic CFGs and Statistical Parsing
[Figure: two candidate parse trees for the same sentence, with probabilities .15 × .40 × .05 × .05 × … = 1.5 × 10^-6 and .15 × .40 × .40 × .05 × … = 1.7 × 10^-6; the second, higher-probability tree is preferred.]
Probabilistic CFGs and Statistical Parsing
• Probabilities can be computed from an annotated database (a Treebank).
• An example is the Penn Treebank, a large corpus of hand-annotated parse trees.
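The maximum-likelihood estimate is just relative frequency: P(A -> beta) = Count(A -> beta) / Count(A). A sketch with invented counts (not real Penn Treebank figures):

```python
from collections import Counter

# Hypothetical treebank counts of how NP was expanded.
rule_counts = Counter({
    ("NP", ("Det", "N")): 300,
    ("NP", ("N",)): 150,
    ("NP", ("NP", "PP")): 50,
})

# Total count of each left-hand-side non-terminal.
lhs_totals = Counter()
for (lhs, rhs), c in rule_counts.items():
    lhs_totals[lhs] += c

# P(A -> beta) = Count(A -> beta) / Count(A)
rule_prob = {rule: c / lhs_totals[rule[0]] for rule, c in rule_counts.items()}
print(rule_prob[("NP", ("Det", "N"))])   # 300 / 500 = 0.6
```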
Human parsing
• While most sentences are ambiguous in some way, people rarely notice these ambiguities. Instead, they only seem to see one interpretation for a sentence.
• Lexical subcategorization preferences:
  • The women kept the dogs on the beach.
    The women kept the dogs which were on the beach. (5%)
    The women kept them (the dogs) on the beach. (95%)
  • The women discussed the dogs on the beach.
    The women discussed the dogs which were on the beach. (90%)
    The women discussed them (the dogs) while on the beach. (10%)
  (keep has a preference for VP -> V NP PP; discuss has a preference for VP -> V NP)
• Part-of-speech preferences:
  • The complex houses married and single students and their families. (houses is more likely to be a noun, but here it is the verb, which is why the sentence reads like a garden path)
Head lexicalization of PCFGs
• The head word of a phrase gives a good representation of the phrase’s structure and meaning.
• Puts the properties of words back into a PCFG.
• Lexicalized Probabilistic Context-Free Grammars perform much better than PCFGs (88% vs. 73% accuracy).
Part of Speech tagging
• Part-of-Speech (POS) tagging:
  Input: the lead paint is unsafe
  Output: the/Det lead/N paint/N is/V unsafe/Adj
• Uses of POS tagging:
  • Text-to-speech: how do we pronounce “lead”? (http://www.ivona.com/) Which words bear a pitch accent?
  • It can differentiate word senses that involve part-of-speech differences (what is the meaning of “interest”?).
  • Tagged text helps linguists find interesting syntactic constructions in texts (“google”, “ssh”, etc. used as a verb).
• POS tagging is not parsing. It is highly accurate: the state of the art is about 97% accuracy. But the baseline is already 90%: (1) tag every word with its most frequent tag; (2) tag unknown words as nouns.
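That baseline can be sketched in a few lines; the tiny “training corpus” below is invented for illustration:

```python
from collections import Counter, defaultdict

# Invented (word, tag) training data; real taggers train on a treebank.
training = [("the", "Det"), ("lead", "N"), ("paint", "N"), ("lead", "V"),
            ("is", "V"), ("unsafe", "Adj"), ("the", "Det"), ("lead", "N")]

# counts[word][tag] = how often word carried tag in training
counts = defaultdict(Counter)
for word, tag in training:
    counts[word][tag] += 1

def baseline_tag(words):
    """Most-frequent-tag baseline; unknown words default to noun."""
    return [counts[w].most_common(1)[0][0] if w in counts else "N" for w in words]

print(baseline_tag("the lead paint is unsafe".split()))
# ['Det', 'N', 'N', 'V', 'Adj']
```

Note that this baseline tags “lead” as a noun everywhere, which is right here but wrong in “they lead the way”; closing that gap is what context-sensitive taggers are for.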
Part of Speech tagging
• The Penn Treebank tagset, with 45 different POS tags, is the most widely used.
Part of Speech tagging
[Figure: percentage of words accented (“stressed”) under each part-of-speech category in different speech genres.]
Hidden Markov Model POS tagger
[Figure: an HMM for POS tagging, drawn as two rows of nodes.]
• HMMs are widely used in many fields: natural language processing, speech synthesis and recognition, computer vision, biology, economics, climatology, etc.
• The top row of the figure is the unobserved (hidden) states, interpreted as POS tags; the bottom row is the observed output (the words).
• Tagging means finding the most likely hidden state sequence (POS tag sequence) given an observation sequence (word sequence).
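The standard way to find that most likely sequence is the Viterbi algorithm, sketched below with invented toy probabilities (not trained values): each step keeps, for every tag, only the best path ending in that tag.

```python
# Toy HMM: start, transition, and emission probabilities are made up.
TAGS = ["Det", "N", "V"]
START = {"Det": 0.8, "N": 0.1, "V": 0.1}
TRANS = {"Det": {"Det": 0.05, "N": 0.9, "V": 0.05},
         "N":   {"Det": 0.1,  "N": 0.3, "V": 0.6},
         "V":   {"Det": 0.6,  "N": 0.3, "V": 0.1}}
EMIT = {"Det": {"the": 0.9},
        "N":   {"dog": 0.4,  "walks": 0.1},
        "V":   {"walks": 0.5, "dog": 0.05}}

def viterbi(words):
    """Return the most probable tag sequence for the words."""
    # best[t] = (probability of best path ending in tag t, that path)
    best = {t: (START[t] * EMIT[t].get(words[0], 0.0), [t]) for t in TAGS}
    for w in words[1:]:
        best = {t: max(((p * TRANS[prev][t] * EMIT[t].get(w, 0.0), path + [t])
                        for prev, (p, path) in best.items()),
                       key=lambda x: x[0])
                for t in TAGS}
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi("the dog walks".split()))  # ['Det', 'N', 'V']
```

Because only the best path per tag survives each step, the search is linear in sentence length rather than exponential in the number of possible tag sequences.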
Hidden Markov Model POS tagger
• Representation for Paths (hidden state sequences): Trellis
HAL