Ideas for 100K Word Data Set for Human and Machine Learning

Lori Levin, Alon Lavie, Jaime Carbonell
Language Technologies Institute
Carnegie Mellon University



The data set should support

Machine learning
  Machine learning from small data can work if the data is structured.

Analysis by humans
  Humans can learn a lot from a small data set if the form-function mappings are clear.


Concrete Suggestions

1. Hand align a portion of the corpus.
2. Include parse trees and feature structures for a portion of the corpus.
3. Include a representative sample of diversity of phrase structures.
4. Include a representative sample of diversity in function/meaning.
5. Include some simple, single sentences.
6. Include some full texts.
7. Look for well-known divergences.
8. Conduct an evaluation to be sure that the corpus elicits what you want it to elicit.


Hand align a portion of the corpus

Automatic alignment algorithms can be bootstrapped from the hand alignments.

A lexicon can be created from the alignments.

Humans can study word usage.
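
As a rough illustration of how a lexicon might be read off hand alignments, the sketch below counts aligned word pairs. The data layout (token lists plus index pairs), the function names, and the toy Spanish examples are assumptions made for illustration; they are not a format proposed in these slides.

```python
from collections import Counter, defaultdict


def build_lexicon(aligned_pairs):
    """Count how often each source word is hand-aligned to each target word.

    aligned_pairs: iterable of (source_tokens, target_tokens, links), where
    links is a list of (source_index, target_index) pairs (hypothetical layout).
    """
    counts = defaultdict(Counter)
    for src_tokens, tgt_tokens, links in aligned_pairs:
        for i, j in links:
            counts[src_tokens[i]][tgt_tokens[j]] += 1
    return counts


def best_translations(counts):
    """Pick the most frequently aligned target word for each source word."""
    return {src: tgt.most_common(1)[0][0] for src, tgt in counts.items()}


# Toy, invented data; Spanish stands in for the minor language.
pairs = [
    (["the", "dog", "sleeps"], ["el", "perro", "duerme"], [(0, 0), (1, 1), (2, 2)]),
    (["the", "dog", "runs"], ["el", "perro", "corre"], [(0, 0), (1, 1), (2, 2)]),
]
lexicon = build_lexicon(pairs)
print(best_translations(lexicon))
```

The same counts could also serve as seed statistics for an automatic aligner, although how that bootstrapping would be done is not specified here.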


Provide parse trees for a portion of the corpus

Parse trees plus alignments can be input to
  Avenue-style rule learning
  Automatic treebanking of the minor language

Humans can study the translation of specific structures.

There should be semantic and functional information in addition to structural information. See below.
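
One way parse trees and alignments can be combined is to project English constituent spans onto the other language through the alignment links. The sketch below shows that idea only; it is an assumption about how such input might be used, not a description of the Avenue rule learner, and the span format and names are hypothetical.

```python
def project_span(span, links):
    """Map an English constituent span onto target positions via alignment links.

    span  : (label, start, end) over English tokens, end exclusive
    links : iterable of (english_index, target_index) pairs
    Returns (label, target_start, target_end), or None if no word in the
    span is aligned.
    """
    label, start, end = span
    targets = [j for i, j in links if start <= i < end]
    if not targets:
        return None
    return (label, min(targets), max(targets) + 1)


# Toy, invented example: adjective-noun order differs between the languages.
english = ["the", "red", "house"]
target = ["la", "casa", "roja"]
links = [(0, 0), (1, 2), (2, 1)]
spans = [("NP", 0, 3), ("ADJ", 1, 2), ("N", 2, 3)]

projected = [project_span(s, links) for s in spans]
print(projected)
```

Taking the minimum and maximum aligned positions is a simplification: it can produce overlapping or discontinuous constituents when the alignment crosses span boundaries.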


Include a representative example of structural diversity

Part of the corpus can be structured to include simple, common sub-trees from the English Penn TreeBank.
  Learn a collection of structural mappings that is compositional.
  A lot of mileage from small data.

Preliminary work with Katharina Probst
  Raw WSJ data requires editing.
  Need redundant examples of each structure.
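
The need for redundant examples of each structure can be checked mechanically. The following is a minimal sketch that reads Penn-style bracketed trees, counts phrase-level expansions, and reports those seen fewer than a chosen number of times. The threshold, the function names, and the toy trees are assumptions for illustration.

```python
from collections import Counter


def parse_tree(text):
    """Parse a Penn-style bracketed string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a word under a part-of-speech tag
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()


def productions(tree):
    """Yield 'PARENT -> CHILD ...' for every node that has subtree children."""
    label, children = tree
    subtrees = [c for c in children if isinstance(c, tuple)]
    if subtrees:
        yield f"{label} -> {' '.join(c[0] for c in subtrees)}"
        for c in subtrees:
            yield from productions(c)


def under_represented(trees, min_examples):
    """Return expansions that occur fewer than min_examples times, sorted."""
    counts = Counter(p for t in trees for p in productions(parse_tree(t)))
    return sorted(rule for rule, n in counts.items() if n < min_examples)


# Toy, invented trees.
trees = [
    "(S (NP (DT the) (NN dog)) (VP (VBZ sleeps)))",
    "(S (NP (DT a) (NN cat)) (VP (VBZ runs)))",
    "(S (NP (NNP Kim)) (VP (VBZ sees) (NP (DT the) (NN dog))))",
]
sparse = under_represented(trees, min_examples=2)
print(sparse)
```

Expansions in the returned list are candidates for adding more example sentences to the corpus.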


Include a representative example of function or meaning

Finding out how English structures translate into minor language structures is not enough.
  For example, finding out how to translate English auxiliary verbs is not useful because they have many functions: tense, aspect, epistemics, evidentials, etc.
  Finding out how to express tense, aspect, epistemics, evidentials, etc. is useful.


Include some multi-sentence texts

In order to observe
  Temporal sequencing of events
  Causation
  Rhetorical relations
    Contrast, elaboration, etc.
  Given and new information
  Co-reference


Look for well-known divergences

E.g., run across the street vs cross the street running

But see below for our view of divergences.


Include some simple sentences

So that the form-function mapping is clear to a human without confounding factors

As a seed for machine learning


Evaluation

Test the corpus on a few languages to be sure that the intended structures and functions are elicited.
  Need to watch out for idiosyncrasies, lexical gaps, special constructions, etc.
  For example, if you want to elicit a noun modified by a prepositional phrase, "the person in the room" will work better than "a bottle of wine".


Hard problems

Body of common phenomena with a tail of phenomena that are individually rare, but collectively massive.
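
A small calculation can make "individually rare, but collectively massive" concrete. The sketch below assumes, purely for illustration, that the frequency of the phenomenon at rank r is proportional to 1/r; the slides do not state any particular distribution, and the numbers of types and the head size are invented.

```python
def tail_share(num_types, head_size):
    """Share of all occurrences outside the head_size most common types,
    assuming the type at rank r has frequency proportional to 1/r."""
    weights = [1.0 / rank for rank in range(1, num_types + 1)]
    return sum(weights[head_size:]) / sum(weights)


def rarest_share(num_types):
    """Share of all occurrences contributed by the single rarest type."""
    total = sum(1.0 / rank for rank in range(1, num_types + 1))
    return (1.0 / num_types) / total


# Invented sizes: 10,000 phenomenon types, of which 100 form the common body.
share = tail_share(10_000, 100)
rare = rarest_share(10_000)
print(f"tail share: {share:.2f}, rarest type: {rare:.6f}")
```

Under this assumption the tail accounts for a little under half of all occurrences, even though each rare phenomenon on its own contributes almost nothing.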


Extra slides

Our view of translation divergences
Elaboration on the different roles of structure and function


Our view of divergences which is divergent from some other views of divergences

Divergences arise when the same function is expressed by a different structure.

Many functions are expressed by specialized constructions that do not translate literally into other languages.

Divergences cannot be neatly grouped into a few classes.

Typological differences between languages are relevant:
  Embedding vs serialization
  Synthetic vs analytic causative constructions


Coverage: Structure and Function

Structural Diversity
  Appositives, adjuncts, embedded clauses, coordinate structures, ellipsis, etc.

Functional/Meaning Diversity
  Temporal relations, rhetorical relations, modality, negation, tense, aspect, etc.


Structure and Function

The way you understand a text is by knowing which structure has which function.

The same function is expressed by different structures in different languages.


What a human needs to know (function)

Who did what to who when?
What happened before/after what?
What caused what?
Is it first hand knowledge, hearsay, or inference?
Is it certain, probable, or improbable?
Did it happen or not?
What do these words mean?


How a human knows these things (structure/grammar)

Who did what to who when?
  Grammatical relations, coreference, time expressions, pronouns/pro-drop, nominalizations, subordinate clauses, case marking, word order, agreement, tense, aspect

What happened before/after what?
  Time expressions, temporal connectives, tense and aspect morphemes

What caused what?
  Markers of rhetorical relations between sentences

Is it first hand knowledge, hearsay, or inference? Is it certain, probable, or improbable?
  Markers of modality and epistemics

Did it happen or not?
  Markers of negation and counterfactuals

What do these words mean?
  Vocabulary

Other
  Questions, existentials, possessives, coordinate structures


How to make sure the corpus captures what a human needs to know

Organize the corpus by function and then a human can observe the corresponding structure.
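
Organizing by function could be as simple as tagging each sentence pair with the functions it was chosen to elicit and building an index over those tags. The sketch below is one hypothetical layout; the `Entry` fields, the function labels, and the toy Spanish translations are assumptions, not a schema from these slides.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Entry:
    english: str
    translation: str
    functions: tuple  # functions this sentence is meant to elicit


def index_by_function(entries):
    """Group corpus entries under every function they are tagged with."""
    index = defaultdict(list)
    for entry in entries:
        for function in entry.functions:
            index[function].append(entry)
    return index


# Toy, invented entries; Spanish stands in for the minor language.
entries = [
    Entry("It did not rain.", "No llovió.", ("negation", "past tense")),
    Entry("It will rain.", "Lloverá.", ("future tense",)),
    Entry("It may not rain.", "Puede que no llueva.", ("negation", "epistemic possibility")),
]
index = index_by_function(entries)

# A human analyst can now read all negation examples side by side.
for entry in index["negation"]:
    print(entry.english, "|", entry.translation)
```

Reading one function's examples together lets the analyst see which structures the language uses for that function.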


Coverage of data for human analysis: basics

Closed Class and Special Constructions
  Dates, names, numbers, prices, etc.
  Pronouns, prepositions, etc.

Encoding of grammatical relations and/or semantic roles
  How do you know who did what to who?
  Word order, case marking, agreement

Encoding of old and new information
  Word order, special constructions (e.g., clefts), etc.

Questions
Negation
Modification
Possession
Coordination
Indirect speech


Coverage of data for human analysis: multi-sentence and multi-clause

Rhetorical relations
  Cause, elaboration, contrast, etc.

Temporal relations
  Before, after, during, etc.

Same subject and obviation phenomena

Subordination
  As subject or object
  As complement
  As adjunct


Other grammatically encoded meanings

Modality and Epistemics
  Certainty, source of information (first hand, second hand, inference), etc.

Conditionals
Comparatives
Existentials
Tense and aspect
Definiteness