Evaluation Measure in Language...

Preview:

Citation preview

Evaluation Measure inLanguage Acquisition

Charles YangUniversity of Pennsylvania

LING50MIT

From LSLT to Aspects: A Psychological Turn

From LSLT to Aspects: A Psychological Turn

• LSLT

• “any simplification along these lines is immediately reflected in the length of the grammar” (§26: MDL)

• “defined the best analysis as the one that minimizes information per word in the generated language of grammatical discourses” (§35.4: entropy)

• 23.771 Mathematical Backgrounds for Communication (Hall/Partee)

From LSLT to Aspects: A Psychological Turn

• LSLT

• “any simplification along these lines is immediately reflected in the length of the grammar” (§26: MDL)

• “defined the best analysis as the one that minimizes information per word in the generated language of grammatical discourses” (§35.4: entropy)

• 23.771 Mathematical Backgrounds for Communication (Hall/Partee)

• Aspects

• “... theories require supplementation by an evaluation measure if language acquisition is to be accounted for ... such a measure is not given a priori, in some manner. Rather, any proposal concerning such a measure is an empirical hypothesis about the nature of language” (p37)

Three Psychological Conditions

Three Psychological Conditions

• Formal sufficiency: Does the evaluation measure choose the “correct”grammar?

• Distributional methods: Harris (1951), Fowler (1952), Holt (1953)

Three Psychological Conditions

• Formal sufficiency: Does the evaluation measure choose the “correct”grammar?

• Distributional methods: Harris (1951), Fowler (1952), Holt (1953)

• Ecological validity: Does the evaluation measure operate under reasonable assumptions about the learning data and mechanisms?

Three Psychological Conditions

• Formal sufficiency: Does the evaluation measure choose the “correct”grammar?

• Distributional methods: Harris (1951), Fowler (1952), Holt (1953)

• Ecological validity: Does the evaluation measure operate under reasonable assumptions about the learning data and mechanisms?

• Developmental compatibility: Does the evaluation measure employed by the learner produce similar developmental patterns in language acquisition?

MDL: An Evaluation Measure

DL(D,G) = |G|+ log

1

p(D|G)

MDL: An Evaluation Measure

• MDL is similar with (or equivalent to) many other approaches such as Bayesian inference

DL(D,G) = |G|+ log

1

p(D|G)

MDL: An Evaluation Measure

• MDL is similar with (or equivalent to) many other approaches such as Bayesian inference

• A method for hypothesis selection rather than hypothesis proposing

• Three conditions for choosing alternative methods, e.g., reinforcement learning, Fourier transform

DL(D,G) = |G|+ log

1

p(D|G)

MDL: An Evaluation Measure

• MDL is similar with (or equivalent to) many other approaches such as Bayesian inference

• A method for hypothesis selection rather than hypothesis proposing

• Three conditions for choosing alternative methods, e.g., reinforcement learning, Fourier transform

• The composition of data

DL(D,G) = |G|+ log

1

p(D|G)

MDL: An Evaluation Measure

• MDL is similar with (or equivalent to) many other approaches such as Bayesian inference

• A method for hypothesis selection rather than hypothesis proposing

• Three conditions for choosing alternative methods, e.g., reinforcement learning, Fourier transform

• The composition of data

• Subset Principle (Berwick 1985): the first Evaluation Measure to influence empirical work in language acquisition

DL(D,G) = |G|+ log

1

p(D|G)

Evaluation Measuring Parameters

Evaluation Measuring Parameters

• Aspects: “We want the hypotheses compatible with fixed data to be “scattered” in value, so that choice among them can be made relatively easily” (p61)

Evaluation Measuring Parameters

• Aspects: “We want the hypotheses compatible with fixed data to be “scattered” in value, so that choice among them can be made relatively easily” (p61)

• Parameters can be viewed as low dimensional description of syntactic variation, or MDL

Evaluation Measuring Parameters

• Aspects: “We want the hypotheses compatible with fixed data to be “scattered” in value, so that choice among them can be made relatively easily” (p61)

• Parameters can be viewed as low dimensional description of syntactic variation, or MDL

• CUNY CoLAG Parameter Domain (Sakas & J.D.Fodor in press): 13 parameters, 3072 grammars, 48086 distinct degree-0 sentences

• Most parameters are favorable for the learner and can be set independently (thus “scattered” well)

Evaluation Measuring Parameters

• Aspects: “We want the hypotheses compatible with fixed data to be “scattered” in value, so that choice among them can be made relatively easily” (p61)

• Parameters can be viewed as low dimensional description of syntactic variation, or MDL

• CUNY CoLAG Parameter Domain (Sakas & J.D.Fodor in press): 13 parameters, 3072 grammars, 48086 distinct degree-0 sentences

• Most parameters are favorable for the learner and can be set independently (thus “scattered” well)

• Evidence for parameters in language acquisition

MDL in Action

MDL in Action

fly

bring

blow

think

catch

draw

Add -dwalk(Regulars)

flew

brought

blew

thought

caught

drew

walked(Regulars)

Stem Past tense

fly

bring

blow

think

catch

draw

Add -dwalk(Regulars)

flew

brought

blew

thought

caught

drew

walked(Regulars)

Stem Past tense

associations

MDL in Action

MDL in Action

MDL in Action

flybringblowthinkcatchdraw

add -t & Rime !/a/

add -ø & Rime !/u/

Add -dwalk(Regulars)

flewbroughtblewthoughtcaughtdrewwalked(Regulars)

Stem Past tense

MDL in Action

flybringblowthinkcatchdraw

add -t & Rime !/a/

add -ø & Rime !/u/

Add -dwalk(Regulars)

flewbroughtblewthoughtcaughtdrewwalked(Regulars)

Stem Past tense

“ ... the acceptance of these Laws (Grimm’s and Verner’s) as historicalfact is based wholly on considerations of simplicity”

Halle (1961: On the role of simplicity in linguistic descriptions)

Evidence from Children• Largest past tense analysis to date (Gorman & Yang 2011)

• Free-rider effect: verbs belonging to larger rules learned better

Evidence from Children• Largest past tense analysis to date (Gorman & Yang 2011)

• Free-rider effect: verbs belonging to larger rules learned better

Spearman ρ Kendall τ G-K γ☞Abstract rules 0.276 0.191 0.202

Surface rules 0.267 0.180 0.190Words only 0.128 0.133 0.140

Evidence from Children• Largest past tense analysis to date (Gorman & Yang 2011)

• Free-rider effect: verbs belonging to larger rules learned better

Spearman ρ Kendall τ G-K γ☞Abstract rules 0.276 0.191 0.202

Surface rules 0.267 0.180 0.190Words only 0.128 0.133 0.140

Results and discussion • The effects of verb and pattern frequency on CUR (aggregated across all subjects) can be compared with rank correlation coefficients

• Similarly, verbs with low verb frequency and high pattern frequency verbs have significantly higher CURs than verbs with high verb frequency/low pattern frequency (two-tailed Mann-Whitney W = 193, p = 0.035)

• Both these results indicate that pattern frequency trumps verb frequency

•The relative effects of verb and pattern frequency are finally analyzed by fitting a logistic mixed effects regression model with the maximal random effects structure; as verb and pattern frequency are closely correlated (R = 0.635), log verb frequency is first residualized [14] against log pattern frequency

• Increases in both pattern and verb frequency result in fewer overregularizations

• The results are consistent with linguistic models [3, 6, 7, 15] which derive irregular pasts by stem change and suffixation, but storage models have no locus for the free-rider effect

• Future work will use this data to evaluate similarity-based models [5, 15, 16]

English irregular past tense errors and the free-rider effect

[1] S. Kuczaj. 1977. JoVL&VB 16. [2] S. Prasada, S. Pinker. 1993. L&CP 8 [3] M. Halle, K.P. Mohanan. 1985. LI 16. [4] J. Bybee, D. Slobin. 1982. Lg 58. [5] J. McClelland, K. Patterson. 2002. TiCS 6. [6] L. Stockall, A. Marantz. 2006. Mental Lexicon 1. [7] C. Yang. 2002. Knowledge and learning in natural language. Oxford: OUP. [8] R. Brown. 1973. A first language: The early stages. Cambridge: HUP. [9] K. Demuth, J. Culbertson, J. Alter. 2006. L&S 49. [10] W. Hall, W. Nagy, R. Linn. 1984. Spoken words: Effects of situation and social group on oral word usage and frequency. Hillsdale, NJ: Erlbaum. [11] G. Marcus, S. Pinker, M. Ullman, M. Hollander, J. Rosen, F. Xu. 1992. Overregularization in language acquisition. Chicago: UCP. [12] R. Maslen, A. Theakston, E. Lieven, M. Tomasello. 2004. JoSL&HR 47. [13] E. Brill. 1995. Computational Linguistics 21. [14] K. Gorman. 2010. In PWPL 16.2. [15] A. Albright and B. Hayes. 2003. Cognition 90. [16] S. Chandler. 2010. Cognitive Lingustics 21.

Kyle Gormanad • David Fabera • Elika Bergelsonbd • Charles Yangabcd

aDept. of Linguistics • bDept. of Psychology • cDept. of Computer & Information Science • dInstitute for Research in Cognitive Science

The debate • SOME BACKGROUND ON THE SUBJECT

• Storage accounts [1, 2]

• Other accounts, both symbolic [3, 4] and connectionist [5]

The free-rider effect • (definition of overregularization w/ example)

• (say we’ll ignore *broughted and similar errors)

• Yang [7] identifies the free-rider effect: children are quick to master low-frequency irregulars (e.g, wake/woke, forget/forgot) if they follow the same pattern as high-frequency irregulars (break/broke, get/got)

• This requires some representation to be shared by wake and break, and thus is antithetical to the storage account

• Recent experiments on adults also provide support for the derivational account: irregular past tense verbs (e.g., woke) fully prime their base (e.g., wake) in lexical decision [6]

• This is the largest ever study of children’s overregularization errors

Methods • The raw data are CHILDES transcripts of the spontaneous speech of 50 children ages 2-5 who are acquiring monolingual English and have no known cognitive or hearing deficits [8-12]

• A part-of-speech tagger [13] is used to locate verbs in the transcripts; correct and incorrect past tense tokens of irregular verbs are automatically tabulated and then verified by hand; however, immediate repetitions of a verb by a child are counted only one time

• A separate script aggregates the speech of parents and computes verb input frequency

• This coding is validated by comparing the resulting per-verb correct usage rate (CUR) for each of the children studied by Marcus et al. [11] using the Spearman rank correlation coefficient; the results indicates a close match (Adam: ρ = 0.92, Eve: ρ = 0.99, Sarah: ρ = 0.97)

•The procedure yields 11,735 tokens of 97 irregular verb types, and 839 overregularization tokens across 75 verb types (CUR = 92.9%)

• Irregular verbs are grouped into 22 patterns according to the stem change and/or suffix found in the past tense (relative to the present)

• Unlike prior attempts to categorize the English irregular verbs [4, 7, 11], this larger set of patterns uniquely determines the observed past tense from the present form

Pattern Example N Pattern Example N No change burst/burst 13 [ɩ] slide/slid 3

[oʊ] drive/drove 10 [ɔ] wear/wore 3

[ʌ] dig/dug 9 [oʊ...-d] sell/sold 2

[æ] sit/sat 8 [aʊ] find/found 2

[u:] blow/blew 7 [ɛ...-d] say/said 1

[ɛ] breed/bred 7 [ɔ...-t] lose/lost 1

[ɛ...-t] keep/kept 7 [ɛɚ...-d] hear/heard 1

[-t] build/built 5 was be/was 1

[eɩ] eat/ate 5 went go/went 1

rime→[ɔt] buy/bought 5 make make/made 1

[ɑ] shoot/shot 3 stood stand/stood 1

Spearman ρ Kendall τb G-K γ Pattern frequency 0.267 0.180 0.190

Verb frequency 0.128 0.133 0.144

Estimate Std. err. p-value Log pattern frequency 0.488 0.160 0.002

Resid. log verb frequency 0.413 0.081 4.2e-7

< avg. rule freq. > avg. verb freq. > avg. rule freq. < avg. verb freq.

two-tailed Mann-Whitney W=156.5, p=0.019

Yesterday he ____

Yesterday he ____

Yesterday he ____

• The forgotten Wug test (Berko 1958)

• Only one out of 86 children produced bing-bang, gling-glang

Yesterday he ____

• The forgotten Wug test (Berko 1958)

• Only one out of 86 children produced bing-bang, gling-glang

• Children over-regularize: 8-10%

Yesterday he ____

• The forgotten Wug test (Berko 1958)

• Only one out of 86 children produced bing-bang, gling-glang

• Children over-regularize: 8-10%

• Children over-irregularize: <0.2%

Yesterday he ____

• The forgotten Wug test (Berko 1958)

• Only one out of 86 children produced bing-bang, gling-glang

• Children over-regularize: 8-10%

• Children over-irregularize: <0.2%

• Children’s Evaluation Measure produces a binary outcome: productive or lexical

• probability spreading insufficient

Exceptions in Evaluation Measure

Exceptions in Evaluation Measure

“core”, “basic word order”, “default case”, “unmarked form”vs.

“periphery”, “lexical listing”, “exceptional marking”, “diacritics”

Exceptions in Evaluation Measure

• SPE: “Clearly, we must design our linguistic theory in such a way that the existence of exceptions does not prevent the systematic formulation of those regularities that remain ... Finally, an overriding consideration is that the evaluation measure must be designed in such a way that the wider and more varied the class of exceptions to a rule, the less highly valued is the grammar” (p172)

“core”, “basic word order”, “default case”, “unmarked form”vs.

“periphery”, “lexical listing”, “exceptional marking”, “diacritics”

Exceptions in Evaluation Measure

• SPE: “Clearly, we must design our linguistic theory in such a way that the existence of exceptions does not prevent the systematic formulation of those regularities that remain ... Finally, an overriding consideration is that the evaluation measure must be designed in such a way that the wider and more varied the class of exceptions to a rule, the less highly valued is the grammar” (p172)

• But majority doesn’t rule: 90% of English words in speech are stress initial (Cutler & Carter 1987); Legate & Yang poster

“core”, “basic word order”, “default case”, “unmarked form”vs.

“periphery”, “lexical listing”, “exceptional marking”, “diacritics”

Measuring Rules

Measuring Rules

Measuring Rules

• Optimization again: instead of space, it’s time

Measuring Rules

• Optimization again: instead of space, it’s time

• Exceptions = Real time processing slowdown

• “kicked the bucket” faster than “lifted the bucket” by 51ms (Swinney & Cutler 1979)

• Production: German irregular past participle (-n) faster than regular (-t) by 38ms (Clahsen & Fleischhauer 2011)

• Lexical decision: English irregular verbs faster than regulars by 19ms (English Lexicon Project; Lignos 2011)

Measuring Rules

• Optimization again: instead of space, it’s time

• Exceptions = Real time processing slowdown

• “kicked the bucket” faster than “lifted the bucket” by 51ms (Swinney & Cutler 1979)

• Production: German irregular past participle (-n) faster than regular (-t) by 38ms (Clahsen & Fleischhauer 2011)

• Lexical decision: English irregular verbs faster than regulars by 19ms (English Lexicon Project; Lignos 2011)

• Exceptions delay rule computation

Measuring Rules

• Optimization again: instead of space, it’s time

• Exceptions = Real time processing slowdown

• “kicked the bucket” faster than “lifted the bucket” by 51ms (Swinney & Cutler 1979)

• Production: German irregular past participle (-n) faster than regular (-t) by 38ms (Clahsen & Fleischhauer 2011)

• Lexical decision: English irregular verbs faster than regulars by 19ms (English Lexicon Project; Lignos 2011)

• Exceptions delay rule computation

•Exception 1•Exception 2 •Exception 3•...•Rule

Tolerance Principle

Tolerance Principle

N

lnN

To be productive, the maximum exceptions to a rule/process applicable to N items is

Tolerance Principle

N

lnN

To be productive, the maximum exceptions to a rule/process applicable to N items is

• If English has 150 irregular verbs, we need 900 regulars to have a productive -ed rule: 1050/ln(1050) = 150

Tolerance Principle

N

lnN

To be productive, the maximum exceptions to a rule/process applicable to N items is

• If English has 150 irregular verbs, we need 900 regulars to have a productive -ed rule: 1050/ln(1050) = 150

• Children start over-regularization when they reach the tipping point

Tolerance Principle

N

lnN

To be productive, the maximum exceptions to a rule/process applicable to N items is

• If English has 150 irregular verbs, we need 900 regulars to have a productive -ed rule: 1050/ln(1050) = 150

• Children start over-regularization when they reach the tipping point

• N (e.g., vocabulary size) and the number of exceptions may vary from speaker to speaker, accounting for certain individual patterns in language acquisition and sociolinguistic variation

A Birth-er Problem

A Birth-er Problem

• The suffix -er is productive and segmented in real time even for broth-er (Rastle, Davis & New 2004) resulting in slowdown (Lignos 2011)

A Birth-er Problem

• The suffix -er is productive and segmented in real time even for broth-er (Rastle, Davis & New 2004) resulting in slowdown (Lignos 2011)

• While some -er’s are real (hunt-hunter), some are not (corn-corner, cent-center, sock-soccer): children need to learn -er despite exceptions

A Birth-er Problem

• The suffix -er is productive and segmented in real time even for broth-er (Rastle, Davis & New 2004) resulting in slowdown (Lignos 2011)

• While some -er’s are real (hunt-hunter), some are not (corn-corner, cent-center, sock-soccer): children need to learn -er despite exceptions

• English Lexicon Project (Balota et al. 2007)

• hunt-hunter type: 94, cent-center type: 18

• The suffix -er is productive: 18 < 112/ln(112)=24

A Birth-er Problem

• The suffix -er is productive and segmented in real time even for broth-er (Rastle, Davis & New 2004) resulting in slowdown (Lignos 2011)

• While some -er’s are real (hunt-hunter), some are not (corn-corner, cent-center, sock-soccer): children need to learn -er despite exceptions

• English Lexicon Project (Balota et al. 2007)

• hunt-hunter type: 94, cent-center type: 18

• The suffix -er is productive: 18 < 112/ln(112)=24

• The suffix -th fails to reach productivity: warmth, width, depth etc. overwhelmed by tooth, booth, filth, forth, ...

stride-strode-??? N

lnN

stride-strode-???

• Mere majority is not sufficient; filibuster proof majority required

N

lnN

stride-strode-???

• Mere majority is not sufficient; filibuster proof majority required

• [-lexical insertion] (Halle 1972, esp. fn1)

• gaps only arise in unproductive corners of morphology

N

lnN

stride-strode-???

• Mere majority is not sufficient; filibuster proof majority required

• [-lexical insertion] (Halle 1972, esp. fn1)

• gaps only arise in unproductive corners of morphology

• 102 out of 161 irregular verbs (36%) show preterite and past participle syncretism

• Tolerance Principle only allows 1/ln(161)=20% exceptions

• *forwent, *sightsaw, *stridden

N

lnN

Evaluation Metrics

Evaluation Metrics

• What not to do: Computer chess

Evaluation Metrics

• What not to do: Computer chess

• Resource bounded optimization

Evaluation Metrics

• What not to do: Computer chess

• Resource bounded optimization

• Convergence of methods and disciplines

Evaluation Metrics

• What not to do: Computer chess

• Resource bounded optimization

• Convergence of methods and disciplines

• Simple theories are usually right ones

Thank you, to my teachers

• Bob Berwick

• Noam Chomsky

• Morris Halle

Recommended