Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
Evaluation Measure inLanguage Acquisition
Charles YangUniversity of Pennsylvania
LING50MIT
From LSLT to Aspects: A Psychological Turn
From LSLT to Aspects: A Psychological Turn
• LSLT
• “any simplification along these lines is immediately reflected in the length of the grammar” (§26: MDL)
• “defined the best analysis as the one that minimizes information per word in the generated language of grammatical discourses” (§35.4: entropy)
• 23.771 Mathematical Backgrounds for Communication (Hall/Partee)
From LSLT to Aspects: A Psychological Turn
• LSLT
• “any simplification along these lines is immediately reflected in the length of the grammar” (§26: MDL)
• “defined the best analysis as the one that minimizes information per word in the generated language of grammatical discourses” (§35.4: entropy)
• 23.771 Mathematical Backgrounds for Communication (Hall/Partee)
• Aspects
• “... theories require supplementation by an evaluation measure if language acquisition is to be accounted for ... such a measure is not given a priori, in some manner. Rather, any proposal concerning such a measure is an empirical hypothesis about the nature of language” (p37)
Three Psychological Conditions
Three Psychological Conditions
• Formal sufficiency: Does the evaluation measure choose the “correct”grammar?
• Distributional methods: Harris (1951), Fowler (1952), Holt (1953)
Three Psychological Conditions
• Formal sufficiency: Does the evaluation measure choose the “correct”grammar?
• Distributional methods: Harris (1951), Fowler (1952), Holt (1953)
• Ecological validity: Does the evaluation measure operate under reasonable assumptions about the learning data and mechanisms?
Three Psychological Conditions
• Formal sufficiency: Does the evaluation measure choose the “correct”grammar?
• Distributional methods: Harris (1951), Fowler (1952), Holt (1953)
• Ecological validity: Does the evaluation measure operate under reasonable assumptions about the learning data and mechanisms?
• Developmental compatibility: Does the evaluation measure employed by the learner produce similar developmental patterns in language acquisition?
MDL: An Evaluation Measure
DL(D,G) = |G|+ log
1
p(D|G)
MDL: An Evaluation Measure
• MDL is similar with (or equivalent to) many other approaches such as Bayesian inference
DL(D,G) = |G|+ log
1
p(D|G)
MDL: An Evaluation Measure
• MDL is similar with (or equivalent to) many other approaches such as Bayesian inference
• A method for hypothesis selection rather than hypothesis proposing
• Three conditions for choosing alternative methods, e.g., reinforcement learning, Fourier transform
DL(D,G) = |G|+ log
1
p(D|G)
MDL: An Evaluation Measure
• MDL is similar with (or equivalent to) many other approaches such as Bayesian inference
• A method for hypothesis selection rather than hypothesis proposing
• Three conditions for choosing alternative methods, e.g., reinforcement learning, Fourier transform
• The composition of data
DL(D,G) = |G|+ log
1
p(D|G)
MDL: An Evaluation Measure
• MDL is similar with (or equivalent to) many other approaches such as Bayesian inference
• A method for hypothesis selection rather than hypothesis proposing
• Three conditions for choosing alternative methods, e.g., reinforcement learning, Fourier transform
• The composition of data
• Subset Principle (Berwick 1985): the first Evaluation Measure to influence empirical work in language acquisition
DL(D,G) = |G|+ log
1
p(D|G)
Evaluation Measuring Parameters
Evaluation Measuring Parameters
• Aspects: “We want the hypotheses compatible with fixed data to be “scattered” in value, so that choice among them can be made relatively easily” (p61)
Evaluation Measuring Parameters
• Aspects: “We want the hypotheses compatible with fixed data to be “scattered” in value, so that choice among them can be made relatively easily” (p61)
• Parameters can be viewed as low dimensional description of syntactic variation, or MDL
Evaluation Measuring Parameters
• Aspects: “We want the hypotheses compatible with fixed data to be “scattered” in value, so that choice among them can be made relatively easily” (p61)
• Parameters can be viewed as low dimensional description of syntactic variation, or MDL
• CUNY CoLAG Parameter Domain (Sakas & J.D.Fodor in press): 13 parameters, 3072 grammars, 48086 distinct degree-0 sentences
• Most parameters are favorable for the learner and can be set independently (thus “scattered” well)
Evaluation Measuring Parameters
• Aspects: “We want the hypotheses compatible with fixed data to be “scattered” in value, so that choice among them can be made relatively easily” (p61)
• Parameters can be viewed as low dimensional description of syntactic variation, or MDL
• CUNY CoLAG Parameter Domain (Sakas & J.D.Fodor in press): 13 parameters, 3072 grammars, 48086 distinct degree-0 sentences
• Most parameters are favorable for the learner and can be set independently (thus “scattered” well)
• Evidence for parameters in language acquisition
MDL in Action
MDL in Action
fly
bring
blow
think
catch
draw
Add -dwalk(Regulars)
flew
brought
blew
thought
caught
drew
walked(Regulars)
Stem Past tense
fly
bring
blow
think
catch
draw
Add -dwalk(Regulars)
flew
brought
blew
thought
caught
drew
walked(Regulars)
Stem Past tense
associations
MDL in Action
MDL in Action
MDL in Action
flybringblowthinkcatchdraw
add -t & Rime !/a/
add -ø & Rime !/u/
Add -dwalk(Regulars)
flewbroughtblewthoughtcaughtdrewwalked(Regulars)
Stem Past tense
MDL in Action
flybringblowthinkcatchdraw
add -t & Rime !/a/
add -ø & Rime !/u/
Add -dwalk(Regulars)
flewbroughtblewthoughtcaughtdrewwalked(Regulars)
Stem Past tense
“ ... the acceptance of these Laws (Grimm’s and Verner’s) as historicalfact is based wholly on considerations of simplicity”
Halle (1961: On the role of simplicity in linguistic descriptions)
Evidence from Children• Largest past tense analysis to date (Gorman & Yang 2011)
• Free-rider effect: verbs belonging to larger rules learned better
Evidence from Children• Largest past tense analysis to date (Gorman & Yang 2011)
• Free-rider effect: verbs belonging to larger rules learned better
Spearman ρ Kendall τ G-K γ☞Abstract rules 0.276 0.191 0.202
Surface rules 0.267 0.180 0.190Words only 0.128 0.133 0.140
Evidence from Children• Largest past tense analysis to date (Gorman & Yang 2011)
• Free-rider effect: verbs belonging to larger rules learned better
Spearman ρ Kendall τ G-K γ☞Abstract rules 0.276 0.191 0.202
Surface rules 0.267 0.180 0.190Words only 0.128 0.133 0.140
Results and discussion • The effects of verb and pattern frequency on CUR (aggregated across all subjects) can be compared with rank correlation coefficients
• Similarly, verbs with low verb frequency and high pattern frequency verbs have significantly higher CURs than verbs with high verb frequency/low pattern frequency (two-tailed Mann-Whitney W = 193, p = 0.035)
• Both these results indicate that pattern frequency trumps verb frequency
•The relative effects of verb and pattern frequency are finally analyzed by fitting a logistic mixed effects regression model with the maximal random effects structure; as verb and pattern frequency are closely correlated (R = 0.635), log verb frequency is first residualized [14] against log pattern frequency
• Increases in both pattern and verb frequency result in fewer overregularizations
• The results are consistent with linguistic models [3, 6, 7, 15] which derive irregular pasts by stem change and suffixation, but storage models have no locus for the free-rider effect
• Future work will use this data to evaluate similarity-based models [5, 15, 16]
English irregular past tense errors and the free-rider effect
[1] S. Kuczaj. 1977. JoVL&VB 16. [2] S. Prasada, S. Pinker. 1993. L&CP 8 [3] M. Halle, K.P. Mohanan. 1985. LI 16. [4] J. Bybee, D. Slobin. 1982. Lg 58. [5] J. McClelland, K. Patterson. 2002. TiCS 6. [6] L. Stockall, A. Marantz. 2006. Mental Lexicon 1. [7] C. Yang. 2002. Knowledge and learning in natural language. Oxford: OUP. [8] R. Brown. 1973. A first language: The early stages. Cambridge: HUP. [9] K. Demuth, J. Culbertson, J. Alter. 2006. L&S 49. [10] W. Hall, W. Nagy, R. Linn. 1984. Spoken words: Effects of situation and social group on oral word usage and frequency. Hillsdale, NJ: Erlbaum. [11] G. Marcus, S. Pinker, M. Ullman, M. Hollander, J. Rosen, F. Xu. 1992. Overregularization in language acquisition. Chicago: UCP. [12] R. Maslen, A. Theakston, E. Lieven, M. Tomasello. 2004. JoSL&HR 47. [13] E. Brill. 1995. Computational Linguistics 21. [14] K. Gorman. 2010. In PWPL 16.2. [15] A. Albright and B. Hayes. 2003. Cognition 90. [16] S. Chandler. 2010. Cognitive Lingustics 21.
Kyle Gormanad • David Fabera • Elika Bergelsonbd • Charles Yangabcd
aDept. of Linguistics • bDept. of Psychology • cDept. of Computer & Information Science • dInstitute for Research in Cognitive Science
The debate • SOME BACKGROUND ON THE SUBJECT
• Storage accounts [1, 2]
• Other accounts, both symbolic [3, 4] and connectionist [5]
•
The free-rider effect • (definition of overregularization w/ example)
• (say we’ll ignore *broughted and similar errors)
• Yang [7] identifies the free-rider effect: children are quick to master low-frequency irregulars (e.g, wake/woke, forget/forgot) if they follow the same pattern as high-frequency irregulars (break/broke, get/got)
• This requires some representation to be shared by wake and break, and thus is antithetical to the storage account
• Recent experiments on adults also provide support for the derivational account: irregular past tense verbs (e.g., woke) fully prime their base (e.g., wake) in lexical decision [6]
• This is the largest ever study of children’s overregularization errors
Methods • The raw data are CHILDES transcripts of the spontaneous speech of 50 children ages 2-5 who are acquiring monolingual English and have no known cognitive or hearing deficits [8-12]
• A part-of-speech tagger [13] is used to locate verbs in the transcripts; correct and incorrect past tense tokens of irregular verbs are automatically tabulated and then verified by hand; however, immediate repetitions of a verb by a child are counted only one time
• A separate script aggregates the speech of parents and computes verb input frequency
• This coding is validated by comparing the resulting per-verb correct usage rate (CUR) for each of the children studied by Marcus et al. [11] using the Spearman rank correlation coefficient; the results indicates a close match (Adam: ρ = 0.92, Eve: ρ = 0.99, Sarah: ρ = 0.97)
•The procedure yields 11,735 tokens of 97 irregular verb types, and 839 overregularization tokens across 75 verb types (CUR = 92.9%)
• Irregular verbs are grouped into 22 patterns according to the stem change and/or suffix found in the past tense (relative to the present)
• Unlike prior attempts to categorize the English irregular verbs [4, 7, 11], this larger set of patterns uniquely determines the observed past tense from the present form
Pattern Example N Pattern Example N No change burst/burst 13 [ɩ] slide/slid 3
[oʊ] drive/drove 10 [ɔ] wear/wore 3
[ʌ] dig/dug 9 [oʊ...-d] sell/sold 2
[æ] sit/sat 8 [aʊ] find/found 2
[u:] blow/blew 7 [ɛ...-d] say/said 1
[ɛ] breed/bred 7 [ɔ...-t] lose/lost 1
[ɛ...-t] keep/kept 7 [ɛɚ...-d] hear/heard 1
[-t] build/built 5 was be/was 1
[eɩ] eat/ate 5 went go/went 1
rime→[ɔt] buy/bought 5 make make/made 1
[ɑ] shoot/shot 3 stood stand/stood 1
Spearman ρ Kendall τb G-K γ Pattern frequency 0.267 0.180 0.190
Verb frequency 0.128 0.133 0.144
Estimate Std. err. p-value Log pattern frequency 0.488 0.160 0.002
Resid. log verb frequency 0.413 0.081 4.2e-7
< avg. rule freq. > avg. verb freq. > avg. rule freq. < avg. verb freq.
two-tailed Mann-Whitney W=156.5, p=0.019
Yesterday he ____
Yesterday he ____
Yesterday he ____
• The forgotten Wug test (Berko 1958)
• Only one out of 86 children produced bing-bang, gling-glang
Yesterday he ____
• The forgotten Wug test (Berko 1958)
• Only one out of 86 children produced bing-bang, gling-glang
• Children over-regularize: 8-10%
Yesterday he ____
• The forgotten Wug test (Berko 1958)
• Only one out of 86 children produced bing-bang, gling-glang
• Children over-regularize: 8-10%
• Children over-irregularize: <0.2%
Yesterday he ____
• The forgotten Wug test (Berko 1958)
• Only one out of 86 children produced bing-bang, gling-glang
• Children over-regularize: 8-10%
• Children over-irregularize: <0.2%
• Children’s Evaluation Measure produces a binary outcome: productive or lexical
• probability spreading insufficient
Exceptions in Evaluation Measure
Exceptions in Evaluation Measure
“core”, “basic word order”, “default case”, “unmarked form”vs.
“periphery”, “lexical listing”, “exceptional marking”, “diacritics”
Exceptions in Evaluation Measure
• SPE: “Clearly, we must design our linguistic theory in such a way that the existence of exceptions does not prevent the systematic formulation of those regularities that remain ... Finally, an overriding consideration is that the evaluation measure must be designed in such a way that the wider and more varied the class of exceptions to a rule, the less highly valued is the grammar” (p172)
“core”, “basic word order”, “default case”, “unmarked form”vs.
“periphery”, “lexical listing”, “exceptional marking”, “diacritics”
Exceptions in Evaluation Measure
• SPE: “Clearly, we must design our linguistic theory in such a way that the existence of exceptions does not prevent the systematic formulation of those regularities that remain ... Finally, an overriding consideration is that the evaluation measure must be designed in such a way that the wider and more varied the class of exceptions to a rule, the less highly valued is the grammar” (p172)
• But majority doesn’t rule: 90% of English words in speech are stress initial (Cutler & Carter 1987); Legate & Yang poster
“core”, “basic word order”, “default case”, “unmarked form”vs.
“periphery”, “lexical listing”, “exceptional marking”, “diacritics”
Measuring Rules
Measuring Rules
Measuring Rules
• Optimization again: instead of space, it’s time
Measuring Rules
• Optimization again: instead of space, it’s time
• Exceptions = Real time processing slowdown
• “kicked the bucket” faster than “lifted the bucket” by 51ms (Swinney & Cutler 1979)
• Production: German irregular past participle (-n) faster than regular (-t) by 38ms (Clahsen & Fleischhauer 2011)
• Lexical decision: English irregular verbs faster than regulars by 19ms (English Lexicon Project; Lignos 2011)
Measuring Rules
• Optimization again: instead of space, it’s time
• Exceptions = Real time processing slowdown
• “kicked the bucket” faster than “lifted the bucket” by 51ms (Swinney & Cutler 1979)
• Production: German irregular past participle (-n) faster than regular (-t) by 38ms (Clahsen & Fleischhauer 2011)
• Lexical decision: English irregular verbs faster than regulars by 19ms (English Lexicon Project; Lignos 2011)
• Exceptions delay rule computation
Measuring Rules
• Optimization again: instead of space, it’s time
• Exceptions = Real time processing slowdown
• “kicked the bucket” faster than “lifted the bucket” by 51ms (Swinney & Cutler 1979)
• Production: German irregular past participle (-n) faster than regular (-t) by 38ms (Clahsen & Fleischhauer 2011)
• Lexical decision: English irregular verbs faster than regulars by 19ms (English Lexicon Project; Lignos 2011)
• Exceptions delay rule computation
•Exception 1•Exception 2 •Exception 3•...•Rule
Tolerance Principle
Tolerance Principle
N
lnN
To be productive, the maximum exceptions to a rule/process applicable to N items is
Tolerance Principle
N
lnN
To be productive, the maximum exceptions to a rule/process applicable to N items is
• If English has 150 irregular verbs, we need 900 regulars to have a productive -ed rule: 1050/ln(1050) = 150
Tolerance Principle
N
lnN
To be productive, the maximum exceptions to a rule/process applicable to N items is
• If English has 150 irregular verbs, we need 900 regulars to have a productive -ed rule: 1050/ln(1050) = 150
• Children start over-regularization when they reach the tipping point
Tolerance Principle
N
lnN
To be productive, the maximum exceptions to a rule/process applicable to N items is
• If English has 150 irregular verbs, we need 900 regulars to have a productive -ed rule: 1050/ln(1050) = 150
• Children start over-regularization when they reach the tipping point
• N (e.g., vocabulary size) and the number of exceptions may vary from speaker to speaker, accounting for certain individual patterns in language acquisition and sociolinguistic variation
A Birth-er Problem
A Birth-er Problem
• The suffix -er is productive and segmented in real time even for broth-er (Rastle, Davis & New 2004) resulting in slowdown (Lignos 2011)
A Birth-er Problem
• The suffix -er is productive and segmented in real time even for broth-er (Rastle, Davis & New 2004) resulting in slowdown (Lignos 2011)
• While some -er’s are real (hunt-hunter), some are not (corn-corner, cent-center, sock-soccer): children need to learn -er despite exceptions
A Birth-er Problem
• The suffix -er is productive and segmented in real time even for broth-er (Rastle, Davis & New 2004) resulting in slowdown (Lignos 2011)
• While some -er’s are real (hunt-hunter), some are not (corn-corner, cent-center, sock-soccer): children need to learn -er despite exceptions
• English Lexicon Project (Balota et al. 2007)
• hunt-hunter type: 94, cent-center type: 18
• The suffix -er is productive: 18 < 112/ln(112)=24
A Birth-er Problem
• The suffix -er is productive and segmented in real time even for broth-er (Rastle, Davis & New 2004) resulting in slowdown (Lignos 2011)
• While some -er’s are real (hunt-hunter), some are not (corn-corner, cent-center, sock-soccer): children need to learn -er despite exceptions
• English Lexicon Project (Balota et al. 2007)
• hunt-hunter type: 94, cent-center type: 18
• The suffix -er is productive: 18 < 112/ln(112)=24
• The suffix -th fails to reach productivity: warmth, width, depth etc. overwhelmed by tooth, booth, filth, forth, ...
stride-strode-??? N
lnN
stride-strode-???
• Mere majority is not sufficient; filibuster proof majority required
N
lnN
stride-strode-???
• Mere majority is not sufficient; filibuster proof majority required
• [-lexical insertion] (Halle 1972, esp. fn1)
• gaps only arise in unproductive corners of morphology
N
lnN
stride-strode-???
• Mere majority is not sufficient; filibuster proof majority required
• [-lexical insertion] (Halle 1972, esp. fn1)
• gaps only arise in unproductive corners of morphology
• 102 out of 161 irregular verbs (36%) show preterite and past participle syncretism
• Tolerance Principle only allows 1/ln(161)=20% exceptions
• *forwent, *sightsaw, *stridden
N
lnN
Evaluation Metrics
Evaluation Metrics
• What not to do: Computer chess
Evaluation Metrics
• What not to do: Computer chess
• Resource bounded optimization
Evaluation Metrics
• What not to do: Computer chess
• Resource bounded optimization
• Convergence of methods and disciplines
Evaluation Metrics
• What not to do: Computer chess
• Resource bounded optimization
• Convergence of methods and disciplines
• Simple theories are usually right ones
Thank you, to my teachers
• Bob Berwick
• Noam Chomsky
• Morris Halle