Natural Language Processing
According to research at an Elingsh uinervtisy, it deosn’t mttaer in what oredr the ltteers in a wrod are, the olny iprmoetnt tihng is that the frist and lsat ltteer are in the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by islelf but the wrod as a wlohe. Ceehiro. That’ll sutff the splelchekcer.
(Excerpt from Sideswipe, NZ Herald, 22/09/03)
1970s Example: SHRDLU
Person: PICK UP A BIG RED BLOCK.
Computer: OK. (does it)
Person: GRASP THE PYRAMID.
Computer: I DON'T UNDERSTAND WHICH PYRAMID YOU MEAN.
Person: FIND A BLOCK WHICH IS TALLER THAN THE ONE YOU ARE HOLDING AND PUT IT INTO THE BOX.
Computer: BY "IT", I ASSUME YOU MEAN THE BLOCK WHICH IS TALLER THAN THE ONE I AM HOLDING.
Computer: OK. (does it)
Person: WHAT DOES THE BOX CONTAIN?
Computer: THE BLUE PYRAMID AND THE BLUE BLOCK.
Person: WHAT IS THE PYRAMID SUPPORTED BY?
Computer: THE BOX.
Person: HOW MANY BLOCKS ARE NOT IN THE BOX?
Computer: FOUR OF THEM.
Person: IS AT LEAST ONE OF THEM NARROWER THAN THE ONE WHICH I TOLD YOU TO PICK UP?
Computer: YES, THE RED CUBE.
Terry Winograd. 1971. MIT Ph.D. Thesis.
Pomegranade
Natural Language Processing (NLP), also known as Human Language Technology (HLT) or Natural Language Engineering (NLE)
• NLP is considered a sub-field of artificial intelligence and has significant overlap with the field of computational linguistics. It is concerned with the interactions between computers and human (natural) languages.
• Natural language generation systems convert information from computer databases into readable human language.
• Natural language understanding systems convert human language into representations that are easier for computer programs to manipulate.
• The term natural language is used to distinguish human languages (e.g. English, Persian, Swedish) from formal or computer languages (e.g. C++, Prolog).
• NLP encompasses both text and speech, but work on speech processing has evolved into a separate field.
Where does it fit in the CS taxonomy?
Computers
    Artificial Intelligence
        Robotics
        Search
        Natural Language Processing
            Information Retrieval
            Machine Translation
            Language Analysis
                Semantics
                Parsing
                …
            …
    Algorithms
    Databases
    Networking
Applications
• Yahoo, Google, Microsoft: Information Retrieval
• Monster.com, HotJobs.com (job finders): Information Extraction & Information Retrieval
• Systran powers Babelfish, Google: Machine Translation
• Ask Jeeves: Question Answering
• Myspace, Facebook, Blogspot: Processing of User-Generated Content
• Tools for “business intelligence”
• All “Big Companies” have (several) strong NLP research labs: IBM, Microsoft, AT&T, Xerox, Sun, etc.
• Academia: research in a university environment
What is NLP?
• Combination of computational linguistics, artificial intelligence & cognitive science.
• Concentrates on interpreting text using a combination of lexical, syntactic, semantic and real world knowledge.
• Applications include intelligent translators, speech recognition software, information management tools and other types of communication software.
Grammar
• The grammar of a language is a description of the structure of that language.
• Grammars provide a scheme for specifying the structure of sentences and rules for combining words into correct phrases and clauses.
English Grammar
• English word order follows a Subject-Verb-Object (SVO) linguistic typology.
• The subject of a verb is the “doer” of the verb, and the object is the “doee”.
The cat is drinking the milk.
Subject: The cat | Verb: is drinking | Object: the milk
Syntax
• Syntax is the study of the rules, or patterns, that govern the way the words in a sentence come together.
• Syntax deals with how words, which are categorised into “parts of speech” (nouns, adjectives, verbs etc.), are combined into phrases and clauses, which in turn combine into sentences.
Syntactic Analysis
• Syntactic analysis involves breaking a sentence down into phrases arranged in a hierarchical structure, allowing the study of its constituents.
• For example the sentence “the big cat is drinking milk” can be broken up into the following constituents:
Sentence: The big cat is drinking milk
    Noun Phrase: The big cat
        Determiner: The
        Adjective Phrase: big
        Noun: cat
    Verb Phrase: is drinking milk
        Auxiliary: is
        Verb: drinking
        Noun Phrase: milk
A Grammar for a very small fragment of English
sentence --> noun_phrase, verb_phrase.
noun_phrase --> determiner, noun.
noun_phrase --> proper_noun.
determiner --> [the].
determiner --> [a].
proper_noun --> [pedro].
noun --> [man].
noun --> [apple].
verb_phrase --> verb, noun_phrase.
verb_phrase --> verb.
verb --> [eats].
verb --> [sings].
Implementation in Prolog
?- phrase(sentence, [the, man, eats]).
yes
?- phrase(sentence, [the, man, eats, the, apple]).
yes
?- phrase(sentence, [the, apple, eats, a, man]).
yes
?- phrase(sentence, [pedro, sings, the, pedro]).
no
?- phrase(sentence,[eats, apple, man]).
no
?- phrase(sentence,L).
L = [the, man, eats, the, man] ;
L = [the, man, eats, the, apple] ;
L = [the, man, eats, a, man] ;
L = [the, man, eats, a, apple] ;
L = [the, man, eats, pedro] ;
L = [the, man, sings, the, man] ;
L = [the, man, sings, the, apple] ;
L = [the, man, sings, a, man] ;
L = [the, man, sings, a, apple] ;
L = [the, man, sings, pedro] ;
L = [the, man, eats] ;
L = [the, man, sings] ;
L = [the, apple, eats, the, man] ;
L = [the, apple, eats, the, apple] ;
L = [the, apple, eats, a, man] ;
L = [the, apple, eats, a, apple] ;
L = [the, apple, eats, pedro] ;
L = [the, apple, sings, the, man] ;
L = [the, apple, sings, the, apple] ;
L = [the, apple, sings, a, man] ;
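For readers without a Prolog system, the same toy grammar can be sketched in Python. This is only an illustration, not the Prolog implementation: the names GRAMMAR, expand and accepts are made up here, and the brute-force enumeration works only because this grammar has no recursive rules and therefore produces a finite set of sentences.

```python
from itertools import product

# The toy grammar as a dictionary: each non-terminal maps to a list of
# alternative right-hand sides. Any symbol that is not a key is a word.
GRAMMAR = {
    "sentence": [["noun_phrase", "verb_phrase"]],
    "noun_phrase": [["determiner", "noun"], ["proper_noun"]],
    "verb_phrase": [["verb", "noun_phrase"], ["verb"]],
    "determiner": [["the"], ["a"]],
    "proper_noun": [["pedro"]],
    "noun": [["man"], ["apple"]],
    "verb": [["eats"], ["sings"]],
}


def expand(symbol):
    """Yield every word list that `symbol` can produce, in rule order."""
    if symbol not in GRAMMAR:  # terminal: a word produces itself
        yield [symbol]
        return
    for rhs in GRAMMAR[symbol]:
        # Expand each part of the right-hand side, then combine the parts.
        parts = [list(expand(part)) for part in rhs]
        for combo in product(*parts):
            yield [word for part in combo for word in part]


def accepts(words):
    """Return True if `words` is a sentence of the toy grammar."""
    return list(words) in list(expand("sentence"))


if __name__ == "__main__":
    print(accepts(["the", "man", "eats"]))
    print(accepts(["pedro", "sings", "the", "pedro"]))
    print(next(expand("sentence")))
```

As with the Prolog queries, the grammar accepts "the apple eats a man": it checks word order only and knows nothing about meaning.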
Issues in Syntax
• “the dog ate my homework” - Who did what?
• Identify the part of speech (POS)
– Dog = noun ; ate = verb ; homework = noun
– English POS tagging
• Identify collocations
– mother in law, hot dog
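The two tasks above can be sketched with simple table lookups. This is a minimal illustration under stated assumptions: LEXICON and COLLOCATIONS are hypothetical hand-written tables, and the functions merge_collocations and tag are invented for this example. Real taggers use the surrounding context to choose a tag, which a plain lookup cannot do.

```python
# Hypothetical mini-lexicon: word -> part of speech.
LEXICON = {
    "the": "determiner",
    "my": "determiner",
    "dog": "noun",
    "homework": "noun",
    "ate": "verb",
}

# Hypothetical list of multi-word units to treat as single tokens.
COLLOCATIONS = {("mother", "in", "law"), ("hot", "dog")}


def merge_collocations(words):
    """Join known collocations into single tokens, longest match first."""
    longest = max(len(c) for c in COLLOCATIONS)
    merged, i = [], 0
    while i < len(words):
        for size in range(longest, 1, -1):
            candidate = tuple(words[i:i + size])
            if candidate in COLLOCATIONS:
                merged.append(" ".join(candidate))
                i += size
                break
        else:
            # No collocation starts here: keep the word as it is.
            merged.append(words[i])
            i += 1
    return merged


def tag(words):
    """Look up each word; words missing from the lexicon get 'unknown'."""
    return [(word, LEXICON.get(word, "unknown")) for word in words]


if __name__ == "__main__":
    print(tag("the dog ate my homework".split()))
    print(merge_collocations("my mother in law ate a hot dog".split()))
```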
Chomsky’s Grammars
• Chomsky introduced transformational grammars (also called transformational generative grammars or generative grammars).
• He introduced the idea of “deep structures” which provide a syntactic base of language and consist of:
– a series of phrase-structure (rewrite) rules
– a series of (possibly universal) rules that generates the underlying phrase-structure of a sentence
– a series of transformations that act upon the phrase-structure, producing more complex sentences
– a series of morphophonemic rules controlling pronunciation.
Chomsky’s Lexicon
• The lexicon, which can be thought of as a dictionary of the language in a particular form, lists all of the vocabulary words in the language and associates them with their syntactic, semantic and phonological information.
• This information is represented in terms of “features”.
Chomsky’s Feature Terms
• For example, the entry for “cat” might have the following syntactic features:
Cat: [+ Noun], [+ Count], [+ Common], [+ Animate]
• These features are used to fill “slots” in a set of phrase markers. For example, a phrase marker requiring an animate noun ([+ Animate]) would find “cat” eligible for lexical substitution into that slot, as it fulfils the requirement of being an animate noun.
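The slot-filling idea can be sketched as a subset test on feature sets. This is an illustration only: the "cat" entry follows the example above, while the "milk" entry and the function names eligible and fillers are assumptions added for contrast.

```python
# Hypothetical lexicon: each word maps to the set of features it carries.
LEXICON = {
    "cat": {"Noun", "Count", "Common", "Animate"},
    "milk": {"Noun", "Common"},  # assumed entry, added for contrast
}


def eligible(word, required):
    """True if `word` carries every feature that the slot requires."""
    return set(required) <= LEXICON.get(word, set())


def fillers(required):
    """All lexicon words that may be substituted into the slot."""
    return sorted(word for word in LEXICON if eligible(word, required))


if __name__ == "__main__":
    print(eligible("cat", {"Noun", "Animate"}))
    print(eligible("milk", {"Noun", "Animate"}))
    print(fillers({"Noun"}))
```

Here a slot requiring [+ Noun] and [+ Animate] admits "cat" but rejects the assumed "milk" entry, which lacks the Animate feature.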
Syntax vs Semantics
• One of the most controversial topics in the development of transformational grammar is the relationship between syntax and semantics.
• There is a considerable degree of interdependence between the two, and the problem is how to formalise this relationship.
Phrase Structure Grammars
• Phrase-structure rules are used to describe a given language's syntax by attempting to break language down into its constituent parts (also known as syntactic categories), namely phrasal categories and lexical categories (parts of speech).
• There are many kinds of phrase-structure rules, which themselves can be combined to generate additional phrase-structure rules.
• In particular, phrase-structure rules must account for the following characteristics:
1. All languages combine nouns (N) and verbs (V) to express ideas about the universe.
2. All languages have rules determining how these are combined into meaningful units.
3. All languages have recursion, i.e. at least one rule that can be repeated ad infinitum:
– An example of this is the English use of "and", which can link any series of two or more nouns or two or more verbs:
• "His and hers and theirs and Mary's and John's ... etc."
• "He ran and jumped and played and skipped and danced and ... etc."
– This would be described in Transformational Grammar as:
• A noun phrase (NP) consists of a N or NP, the word ‘and’, and another N or NP.
• A verb phrase (VP) consists of a V or VP, the word ‘and’, and another V or VP.
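The noun-phrase rule can be written as a recursive check. This is a small sketch, assuming a hypothetical word list NOUNS that treats the possessives from the example as N; the function name is_noun_phrase is also invented here. The same pattern would apply to the verb-phrase rule.

```python
# Hypothetical word list: the items treated as N in the example above.
NOUNS = {"his", "hers", "theirs", "mary's", "john's"}


def is_noun_phrase(words):
    """Check the rule NP -> N | NP 'and' NP by recursion."""
    if len(words) == 1:
        return words[0] in NOUNS  # base case: a single N
    # Recursive case: split at some 'and' so that both sides are NPs.
    return any(
        word == "and"
        and is_noun_phrase(words[:i])
        and is_noun_phrase(words[i + 1:])
        for i, word in enumerate(words)
    )


if __name__ == "__main__":
    print(is_noun_phrase("his and hers and theirs".split()))
    print(is_noun_phrase("his and".split()))
```

Because the rule refers to itself, there is no upper limit on the length of a phrase it accepts, which is the recursion property described above.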
Phrase Structure Tree
Sentence
    Noun Phrase
        Determiner: A
        Noun: monkey
    Verb Phrase
        Verb: climbs
        Noun Phrase
            Determiner: the
            Noun: trees
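A phrase structure tree for "A monkey climbs the trees" can be held in a program as nested tuples. The encoding below is one possible sketch, not a standard format: the names TREE, leaves and depth are chosen for this example. Reading the leaves from left to right recovers the original sentence.

```python
# (label, child, child, ...) for phrases; (label, word) for leaves.
TREE = (
    "Sentence",
    ("Noun Phrase", ("Determiner", "A"), ("Noun", "monkey")),
    ("Verb Phrase",
        ("Verb", "climbs"),
        ("Noun Phrase", ("Determiner", "the"), ("Noun", "trees"))),
)


def leaves(node):
    """Return the words at the leaves of the tree, left to right."""
    _label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]  # a leaf: (label, word)
    return [word for child in children for word in leaves(child)]


def depth(node):
    """Count the levels of labels from this node down to the deepest word."""
    _label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return 1
    return 1 + max(depth(child) for child in children)


if __name__ == "__main__":
    print(" ".join(leaves(TREE)))
    print(depth(TREE))
```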
Problems with Traditional Grammars
• They are grammar-based, whereas natural language isn’t strictly grammar-based.
• Most don’t take into account language variations and dialects.
• Humans have a built-in natural language processor that can handle things machine natural language processors cannot.
Yoda
• “When 900 years old you reach, look as good you will not.”
• “With you the force is.”
• “A brave man your father was.”
• Yoda (typically) uses the OSV linguistic typology, which is characteristic of some of the Brazilian languages.
Inherent Complexity
• To understand a sentence you must do more than combine the dictionary meanings of its constituents.
• A large amount of human knowledge is assumed and communication takes place between complex agents in complex environments.
Statistical approach
• Statistical Machine Translation