Kivik 2013 NLP. Corpus processing, Ontologies 1

The contribution of NLP

Corpus processing

Ontologies and terminologies

What is NLP?

• Natural Language Processing
  – natural language vs. computer languages
• Other names
  – Computational Linguistics
    • emphasizes the scientific, not the technological
  – Language Engineering
  – Language Technology

NLP and linguistics

LING: supply ideas, interpret results

NLP: test theories, expose gaps, plus turn into technology

Example: regular morphology

LINGUISTICS:
– Rules: stems -> inflected forms

NLP:
– program the rules
– apply rules to a lexicon of stems
– Is the output correct? Errors?

LINGUISTICS:
– refine the theory

Needed for: web search, spell-checkers, machine translation, speech recognition systems etc.
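
The loop above (program the rules, apply them to a lexicon of stems, check the output) can be sketched in a few lines of Python. This is a minimal illustration, not a real morphological analyser: the rule set, the feature names (`3sg`, `past`, `ing`) and the three-stem lexicon are all assumptions made for the example.

```python
def inflect(stem: str) -> dict[str, str]:
    """Apply regular English verb rules to one stem (toy rule set)."""
    if stem.endswith("e"):
        # Stems in -e: add -s and -d, drop the e before -ing.
        return {"3sg": stem + "s", "past": stem + "d", "ing": stem[:-1] + "ing"}
    return {"3sg": stem + "s", "past": stem + "ed", "ing": stem + "ing"}


# Hypothetical lexicon of stems; "go" is irregular on purpose.
LEXICON = ["help", "arrive", "go"]

forms = {stem: inflect(stem) for stem in LEXICON}
for stem, table in forms.items():
    print(stem, table)
```

For "go" these rules produce "gos" and "goed", which is the kind of error that sends the work back to linguistics to refine the theory.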

Applications

• web search
  – Basic search
  – Filtering results
• spelling and grammar checking
• machine translation (MT)
• talk to computers
  – speech processing as well
• information extraction
  – finding facts in a database of documents
  – answering questions

How can NLP make better dictionaries?

By pre-processing a corpus:

• tokenization

• sentence splitting

• lemmatization

• POS-tagging

• parsing

Each step builds on its predecessors

Tokenization

“identifying the words”

from: he didn't arrive.

to:
He
did
n’t
arrive
.

Automatic tokenization

• Western writing systems
  – easy! space is separator
• Chinese, Japanese, some other writing systems
  – do not use word-separator
  – hard
    • like POS-tagging (below)

Why isn't space=separator enough (even for English)?

• what is a space
  – linebreaks, paragraph breaks, tabs
• Punctuation
  – characters do not form parts of words but may be attached to words (with no spaces)
    • brackets, quotation marks
• Hyphenation
  – is co-op one word or two? is well-managed?
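
A regular expression can handle the cases listed above. The sketch below is one possible set of decisions, not a standard: it treats any whitespace (spaces, tabs, linebreaks) as a separator, splits punctuation off as its own token, keeps hyphenated forms such as co-op together, and splits the clitic n't from its verb. The pattern and the function name `tokenize` are assumptions for illustration.

```python
import re

TOKEN = re.compile(
    r"\w+?(?=n['’]t)"   # word part before the clitic: did | n't
    r"|n['’]t"          # the clitic itself
    r"|\w+(?:-\w+)*"    # words, keeping internal hyphens: co-op
    r"|[^\w\s]"         # any single punctuation character
)


def tokenize(text: str) -> list[str]:
    """Return the tokens of text; whitespace of any kind is discarded."""
    return TOKEN.findall(text)


print(tokenize("He didn't arrive."))
print(tokenize("(a well-managed\tco-op)"))
```

Treating co-op and well-managed as single tokens is a choice; a different project could equally decide to split at the hyphen.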

Sentence splitting

“identifying the sentences”

from: he didn't arrive.

to:
<s>
He
did
n’t
arrive
.
</s>
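
A naive splitter over an already tokenized text might look like the sketch below. It assumes that every ".", "!" or "?" token ends a sentence, which is wrong for abbreviations such as "Dr."; the function names are hypothetical.

```python
SENTENCE_END = {".", "!", "?"}


def split_sentences(tokens: list[str]) -> list[list[str]]:
    """Group tokens into sentences, closing one at each end-of-sentence mark."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # trailing tokens with no final punctuation
        sentences.append(current)
    return sentences


def mark_up(tokens: list[str]) -> str:
    """Render one token per line, wrapping each sentence in <s> ... </s>."""
    lines = []
    for sentence in split_sentences(tokens):
        lines += ["<s>", *sentence, "</s>"]
    return "\n".join(lines)


print(mark_up(["He", "did", "n't", "arrive", "."]))
```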

Lemmatization

Mapping from text-word to lemma: help (verb)

text-word   lemma
help        help (v)
helps       help (v)
helping     help (v)
helped      help (v)

Lemmatization

Mapping from text-word to lemma: help (verb), help (noun), helping (noun)

text-word   lemma
help        help (v), help (n)
helps       help (v), help (n)**
helping     help (v), helping (n)
helped      help (v)
helpings    helping (n)

** help (n): usually a mass noun, but part of the compound "home help", which is a count noun and takes the "s" ending.

Lemmatization

Dictionary entries are for lemmas

Match between text-word and dictionary-word: lemmatization

Lemmatization

• Searching by lemma
  – English: little inflection
  – French: 36 forms per verb
  – Finno-Ugric: 2000.
• Not always wanted:
  – English royalty
    • singular: kings and queens
    • plural royalties: payments to authors

Automatic lemmatization

• Write rules:
  – if word ends in "ing", delete "ing";
  – if the remainder is verb lemma, add to list of possible lemmas
• If detailed grammar available, use it
• full lemma list is also required
  – Often available from dictionary companies
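
The "ing" rule above, written out literally, shows why the full lemma list matters. This is a minimal sketch: the function name `ing_rule` and the three-item lemma list are assumptions, and only the single rule from the slide is implemented.

```python
def ing_rule(word: str, verb_lemmas: set[str]) -> list[str]:
    """If word ends in "ing", delete it; keep the remainder only if it is a verb lemma."""
    candidates = []
    if word.endswith("ing"):
        remainder = word[: -len("ing")]
        if remainder in verb_lemmas:
            candidates.append(remainder)
    return candidates


# Toy stand-in for the full lemma list.
VERB_LEMMAS = {"help", "arrive", "sing"}

for word in ["helping", "sing", "arriving"]:
    print(word, ing_rule(word, VERB_LEMMAS))
```

The lemma list stops "sing" from being reduced to "s", while "arriving" finds no lemma because the remainder "arriv" is not one, so further rules (or a detailed grammar) are needed.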

Part-of-speech (POS) tagging

“identifying parts of speech”

from: he didn't arrive.

to:
<s>
He       PNP    personal pronoun
did      VVD    past tense verb
n’t      XNOT   not
arrive   VV     base form of verb
.        C      punctuation
</s>

Tagsets

• The set of part-of-speech tags to choose between
  – Basic: noun, verb, pronoun …
  – Advanced: examples - CLAWS English tagset
    • NN2 plural noun
    • VVG -ing form of lexical verb

• Based on linguistics of the language.

POS-tagging: why?

• Use grammar when searching
  – Nouns modified by buckle
  – Verbs that buckle is object of
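
With POS tags in place, both searches become simple pattern checks over (word, lemma, tag) triples. The sketch below uses a tiny hand-tagged corpus and simplified tags (NOUN, VERB, DET and so on) invented for the example; adjacency stands in for real grammatical relations, so it is only an approximation.

```python
# Toy hand-tagged corpus: (word, lemma, tag). All data is made up.
CORPUS = [
    ("She", "she", "PRON"), ("fastened", "fasten", "VERB"),
    ("the", "the", "DET"), ("buckle", "buckle", "NOUN"), (".", ".", "PUNCT"),
    ("These", "this", "DET"), ("buckle", "buckle", "NOUN"),
    ("fastenings", "fastening", "NOUN"), ("broke", "break", "VERB"), (".", ".", "PUNCT"),
    ("Knees", "knee", "NOUN"), ("buckle", "buckle", "VERB"), (".", ".", "PUNCT"),
]


def nouns_modified_by(lemma, corpus):
    """Lemmas of nouns that directly follow `lemma` used as a noun."""
    return [nxt[1] for cur, nxt in zip(corpus, corpus[1:])
            if cur[1] == lemma and cur[2] == "NOUN" and nxt[2] == "NOUN"]


def verbs_with_object(lemma, corpus):
    """Lemmas of verbs directly before `lemma` as a noun, skipping one determiner."""
    hits = []
    for i, (_, lem, tag) in enumerate(corpus):
        if lem != lemma or tag != "NOUN":
            continue
        j = i - 1
        if j >= 0 and corpus[j][2] == "DET":
            j -= 1
        if j >= 0 and corpus[j][2] == "VERB":
            hits.append(corpus[j][1])
    return hits


print(nouns_modified_by("buckle", CORPUS))
print(verbs_with_object("buckle", CORPUS))
```

Without the tags, the verb use in "Knees buckle" could not be told apart from the noun uses.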

POS-tagging: how?

• Big topic for computational linguistics
  – well understood
  – taggers available for major languages
• Some taggers use lemmatized input, others do not
• Methods
  – constraint-based: set of rules of the form
    if previous word is "the" and VERB is one of the possibilities, delete VERB
  – Statistical:
    • Machine learning from tagged corpus
    • Various methods
• Ref: Manning and Schütze, Foundations of Statistical Natural Language Processing, MIT Press 1999.
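
The constraint-based rule quoted above can be sketched as follows. Each word starts with a set of possible tags and the rule removes VERB after "the". The function name, the simplified tag names and the guard that never deletes a word's last remaining tag are assumptions added for the example.

```python
def apply_the_rule(words: list[str], candidates: list[set[str]]) -> list[set[str]]:
    """If the previous word is "the" and VERB is one of the possibilities, delete VERB."""
    result = [set(tags) for tags in candidates]  # copy, leave the input untouched
    for i in range(1, len(words)):
        after_the = words[i - 1].lower() == "the"
        # Assumed guard: never leave a word with no tag at all.
        if after_the and "VERB" in result[i] and len(result[i]) > 1:
            result[i].discard("VERB")
    return result


words = ["the", "buckle", "broke"]
candidates = [{"DET"}, {"NOUN", "VERB"}, {"VERB"}]
print(apply_the_rule(words, candidates))
```

A full constraint-based tagger would apply many such rules in turn until each word has as few candidate tags as possible.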

Parsing

• Find the structure:
  – Phrase structure (trees)
    The cat sat on the mat
  – Dependency structure (links)
    The cat sat on the mat

Automatic parsing

• Big topic
  – see Jurafsky and Martin or other NLP textbook
• Many methods too slow for large corpora
• Sketch Engine usually uses “shallow parsing”
  – Patterns of POS-tags
  – Regular expressions
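
One way to picture shallow parsing with patterns of POS-tags is to encode each tag as a single letter and run an ordinary regular expression over the resulting string. The sketch below finds simple noun phrases; it is not Sketch Engine's actual grammar formalism, and the tag names, letter codes and pattern are assumptions for illustration.

```python
import re

# One letter per tag so that string positions equal token positions.
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N", "VERB": "V", "PREP": "P"}

# Optional determiner, any number of adjectives, one or more nouns.
NOUN_PHRASE = re.compile(r"D?A*N+")


def noun_chunks(tagged: list[tuple[str, str]]) -> list[str]:
    """Return the word sequences whose tags match the noun-phrase pattern."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in NOUN_PHRASE.finditer(codes)
    ]


sentence = [("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
            ("on", "PREP"), ("the", "DET"), ("mat", "NOUN")]
print(noun_chunks(sentence))
```

Because this is plain regular-expression matching, it stays fast enough for large corpora, at the cost of missing structure that a full parser would find.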

Summary

• What is NLP?

• How can it help?
  – Tokenizing
  – Sentence splitting
  – Lemmatizing
  – POS-tagging
  – Parsing

Ontologies and Terminology

and how they relate to lexicography

Terminology

• Contains terms
  – for the objects and concepts in a domain
  – organized according to relations between objects
  – Different language
    • Same objects, so
    • Same organization
    • Different terms

Ontology

• Artificial Intelligence
• Like terminology with reasoning
  • Tweety is-a swallow
  • A swallow is-a bird
  • Birds fly
  Inference
  -----------------------
  • Tweety flies

the rationalist dream of automated reasoning

Bird (flies)
  swallow   robin   …
    Tweety
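
The Tweety inference can be reproduced by following is-a links upwards and collecting properties along the way. This is a minimal sketch with hypothetical names (`IS_A`, `PROPERTIES`, `inferred_properties`); real ontology reasoners handle far more than a single inheritance chain.

```python
# is-a links: child -> parent (single inheritance assumed for simplicity).
IS_A = {"Tweety": "swallow", "swallow": "bird", "robin": "bird"}

# Properties stated directly on an item.
PROPERTIES = {"bird": {"flies"}}


def inferred_properties(item: str) -> set[str]:
    """Collect properties of item and of everything it is-a, walking up the hierarchy."""
    found, seen = set(), set()
    while item is not None and item not in seen:
        seen.add(item)  # guards against accidental cycles
        found |= PROPERTIES.get(item, set())
        item = IS_A.get(item)
    return found


print(inferred_properties("Tweety"))
```

"Tweety flies" is never stated; it follows from the three facts above.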

Ontology

• Chris is-a dentist
• Chris has-practice in Lancing
• Chris works 9am-3pm Mon-Fri
• …
• You live-near Lancing
• You want-to-visit dentist
• You are-available …
Inference
---------------------------------------------------------
Appointment, you, Chris, Lancing, 10am, Thursday

Items in an ontology

• Defined by relations in ontology

• Labelled (only) by words/phrases in various languages

X1
  EN: bird
  FR: oiseau

X2
  EN: swallow
  FR: hirondelle

• Ontology/things: language independent

Mismatches and gaps

Y1   EN: body parts   SP: …
  Y2   SP: dedo
    Y3   EN: finger
    Y4   EN: toe
  Y5   EN: arm   SP: brazo
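
Language-independent items with per-language labels can be modelled as two small tables, which makes the mismatches and gaps visible when translating through the ontology. In the sketch below the item ids and labels follow the examples used so far, but the parent links (finger and toe under dedo) are an assumed reading of the diagram, and `translate` is a hypothetical helper.

```python
# Item -> broader item (assumed reading of the diagrams).
PARENT = {"X2": "X1", "Y3": "Y2", "Y4": "Y2"}

# Item -> labels per language; a missing language is a gap.
LABELS = {
    "X1": {"EN": "bird", "FR": "oiseau"},
    "X2": {"EN": "swallow", "FR": "hirondelle"},
    "Y2": {"SP": "dedo"},
    "Y3": {"EN": "finger"},
    "Y4": {"EN": "toe"},
}


def translate(word, source, target):
    """Return (label, exact). exact is False when only a broader item has a label."""
    for item, labels in LABELS.items():
        if labels.get(source) != word:
            continue
        node, exact = item, True
        while node is not None:
            if target in LABELS[node]:
                return LABELS[node][target], exact
            node, exact = PARENT.get(node), False  # fall back to the broader item
    return None, False


print(translate("swallow", "EN", "FR"))
print(translate("finger", "EN", "SP"))
print(translate("dedo", "SP", "EN"))
```

Both finger and toe fall back to the broader dedo, while dedo itself has no single English label in this table.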

Thesaurus (eg Roget)

• Looks like a simple ontology
  – hierarchy only
  – supports inference?
    • usually fudged

• Language independent?

WordNet

• Princeton Univ project, from ca 1990

• Thesaurus
  – Synonym sets or synsets
  – Hyponyms/hyperonyms, antonyms, part-of, other lexical relations
• Free, online and available for download
  – Very widely used
  – Replicated for many languages, Global WordNet Association

Lexicon/dictionary

• About words

• Organized by words

• Language specific

Rationalists       Empiricists

• Structure        • Data
• Depth            • Breadth
• Logic            • Statistics
• Semantic Web     • Google
• Terminology      • Lexicography

Terminology

• What is the thing called
  – in languages x, y, z
• What kind of thing is it
  – Is-a link
  – Its place in ontology
• Well-structured hierarchy

Lexicography

• How does the word behave?
  – what does it denote?
• Where does it occur?

Synthesis

• Thesis
  – Ontology, terminology, taxonomical lexicography
    • Semantic web, Roget, WordNets
• Antithesis
  – Corpus lexicography
• Synthesis: integrating
  • language-independent structure
  • language-specific word/phrase behaviour
  – Corpus-based terminology
  – FrameNet

Summary

words

Lexicon

Thesaurus/Terminology

Ontology

things