11
Proceedings of the Society for Computation in Linguistics Volume 1 Article 18 2018 e Organization of Lexicons: a Cross-Linguistic Analysis of Monosyllabic Words Shiying Yang Brown University, [email protected] Chelsea Sanker Brown University, [email protected] Uriel Cohen Priva Brown University, [email protected] Follow this and additional works at: hps://scholarworks.umass.edu/scil Part of the Computational Linguistics Commons , and the Phonetics and Phonology Commons is Paper is brought to you for free and open access by ScholarWorks@UMass Amherst. It has been accepted for inclusion in Proceedings of the Society for Computation in Linguistics by an authorized editor of ScholarWorks@UMass Amherst. For more information, please contact [email protected]. Recommended Citation Yang, Shiying; Sanker, Chelsea; and Cohen Priva, Uriel (2018) "e Organization of Lexicons: a Cross-Linguistic Analysis of Monosyllabic Words," Proceedings of the Society for Computation in Linguistics: Vol. 1 , Article 18. DOI: hps://doi.org/10.7275/R58P5XPZ Available at: hps://scholarworks.umass.edu/scil/vol1/iss1/18

The Organization of Lexicons: a Cross-Linguistic Analysis

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: The Organization of Lexicons: a Cross-Linguistic Analysis

Proceedings of the Society for Computation in Linguistics

Volume 1 Article 18

2018

The Organization of Lexicons: a Cross-LinguisticAnalysis of Monosyllabic WordsShiying YangBrown University, [email protected]

Chelsea SankerBrown University, [email protected]

Uriel Cohen PrivaBrown University, [email protected]

Follow this and additional works at: https://scholarworks.umass.edu/scil

Part of the Computational Linguistics Commons, and the Phonetics and Phonology Commons

This Paper is brought to you for free and open access by ScholarWorks@UMass Amherst. It has been accepted for inclusion in Proceedings of theSociety for Computation in Linguistics by an authorized editor of ScholarWorks@UMass Amherst. For more information, please [email protected].

Recommended CitationYang, Shiying; Sanker, Chelsea; and Cohen Priva, Uriel (2018) "The Organization of Lexicons: a Cross-Linguistic Analysis ofMonosyllabic Words," Proceedings of the Society for Computation in Linguistics: Vol. 1 , Article 18.DOI: https://doi.org/10.7275/R58P5XPZAvailable at: https://scholarworks.umass.edu/scil/vol1/iss1/18

Page 2: The Organization of Lexicons: a Cross-Linguistic Analysis

The organization of lexicons: A cross-linguistic analysis ofmonosyllabic words

Shiying Yang, Chelsea Sanker, and Uriel Cohen Privashiying_yang,chelsea_sanker,[email protected]

Abstract

Lexicons utilize a fraction of licit struc-tures. Different theories predict either thatlexicons prioritize contrastiveness or struc-tural economy. Study 1 finds that themonosyllabic lexicon of Mandarin is nomore distinctive than a randomly sampledbaseline using the phonological inventory.Study 2 finds that the lexicons of Mandarinand American English have fewer phono-tactically complex words than the randombaseline: Words tend not to have mul-tiple low-probability components. Thissuggests that phonological constraints canhave superadditive penalties for combinedviolations, consistent with e.g. Albright(ms.).

1 Introduction

Lexicons can be considered mappings betweenword meanings and phonotactically-valid se-quences of phonemes. There are several dimen-sions of forces shaping lexicons, based on thefrequency of each item and its phonetic distinc-tiveness from similar items, as well as the phono-tactic probability of the phonological sequenceswithin each item. For instance, underlying pres-sures on the lexicon influence the frequency dis-tribution of items within a lexicon (Zipf, 1929;Piantadosi et al., 2009); Zipf’s law predicts thatfrequent words should be preferentially mappedto shorter segmental sequences.

In the absence of other pressures, syllablesand words should be maximally distinct fromone another, in order to minimize ambiguity

and potential for confusion. This pressure hasbeen demonstrated within phonological inven-tories; vowel systems tend to maximize the dis-tance between vowels (Flemming, 2004), thoughother work has found a tendency for economy,in which each feature tends to be used for mul-tiple contrasts, particularly among consonants(Clements, 2003; Dautriche et al., 2017). Wedelet al. (2013) demonstrate that contrastivenessis important in shaping lexicons; phonologicalmergers are less likely when more lexical con-trasts depend on the phonological contrast. Thepressure for contrastiveness has been demon-strated in various experiments, in which wordswith higher neighborhood density are identifiedmore slowly than words with lower neighbor-hood density (Luce and Pisoni, 1998). If lex-icons are not maximizing how distinct lexicalitems are, there must be other pressures out-weighing contrastiveness.

Using two computational studies, we exam-ine some of the factors influencing the shapesof items within lexicons, by comparing actuallexicons to generated lexicons given the samephonotactic restrictions.

1.1 Competing pressures in a lexicon

Zipf (1929) proposed the principle of least effortas a primary force shaping phonological inven-tories, claiming that the frequencies of soundswithin a language are negatively correlated withtheir articulatory and perceptual complexity,given a set number of contrasts. Thus, the prob-ability of a sound would reflect its overall per-

164Proceedings of the Society for Computation in Linguistics (SCiL) 2018, pages 164-173.

Salt Lake City, Utah, January 4-7, 2018

Page 3: The Organization of Lexicons: a Cross-Linguistic Analysis

ceptual and articulatory cost. Consistent withthis proposal is the strong correlation betweenthe cross-linguistic frequency of phonemes (i.e.what percentage of languages in UPSID havethem) and their frequency within particular lan-guages (Sanker, 2016).

In line with this functional view, emphasizingthe communicative goal of language, Flemming’s(2004) Dispersion Theory of contrast translatedthe trade-off between speaker and listener intothree conflicting goals: “maximizing the dis-tinctiveness of contrasts,”“minimizing articula-tory effort” and“maximizing the number of con-trasts.” He proposed that a phonological inven-tory would strike a balance between these goals,providing the most distinctive vowel system pos-sible with a given number of contrasts, with ar-ticulatory effort only as motivated by achiev-ing distinctiveness. This principle should alsoextend to lexicons: All else being equal, lexi-cons should be maximally distinct. This is addi-tionally supported by perceptual evidence thatdense lexical neighborhoods slow down process-ing (Luce and Pisoni, 1998).

However, lexicons seem to be less dispersedthan would be expected from the pressureof maximizing contrastiveness. Dautriche etal. (2017) looked at the lexicons of fourIndo-European languages and found that theywere more regular (“clumpy”) than expectedby chance. Words were more similar toeach other in these languages than in gen-erated phonotactically-controlled baseline lexi-cons. This result parallels some work in thesegmental domain, which shows that languagestend to reuse phonological features (Clements,2003). However, one potential limitation of thisstudy is that the phonotactic restrictions weretightly controlled, with environments extendingout to four segments, which could have con-strained the generated lexicons beyond just cap-turing the intended phonological constraints; inlong words, it can be unclear what segmentalrange best captures the inherent phonotacticpatterns.

In order to expand the data into an unrelatedlanguage and in particular address whether thelexicon would pattern differently in a language

with shorter words and a denser lexicon, we com-pared generated phonotactically-controlled lexi-cons to the real lexicon of Mandarin Chinese inStudy 1.

1.2 An explanation of phonologicallyclustered lexicons

In contrast to the dispersion account whichbases the drive for distinctiveness on commu-nicative efficiency, Dautriche et al. (2017) at-tributed their findings to a pressure for regular-ity that is driven by the goal of lowering cog-nitive costs in language acquisition and lexicalaccess. A different possibility is that our un-derstanding of the forces driving a language’sphonotactics are flawed.

Within phonological theories that addressgradient phenomena, models are generally mul-tiplicative. For instance, in MaxEnt, as pre-sented by Hayes and Wilson (2008), the proba-bility assigned to a phonotactic form is e raisedto the negative sum of the weighted constraintviolations. Calculated differently, this is theproduct of the probability of each individual vi-olation occurring.

Thus, MaxEnt treats constraints as being in-dependent (Hayes and Wilson, 2008). How-ever, multiple languages have constraint combi-nations which are more limited in combinationthan would be predicted from their independentprobabilities (Albright, ms; Green and Davis,2014; Shih, 2016). For example, English /æ/and coda /z/ are attested with somewhat lowfrequency, but their combination is extremelyuncommon, far below the product of their in-dependent probabilities (Kessler and Treiman,1997). Such patterns have been explained as“superadditivity”(Albright, ms) or “supercumu-lativity”(Shih, 2016), a phenomenon in whichcombinations of marked structures incur addi-tional penalties, though their co-occurrence isnot categorically disallowed.

The superadditivity effect might underliesome of the patterns of lexicons, as it wouldproduce a faster drop-off in the occurrence oflow probability forms, resulting in more clus-tering around higher probability forms than ispredicted by models in which all phonotactic

165

Page 4: The Organization of Lexicons: a Cross-Linguistic Analysis

constraints are independent. Study 2 was de-vised to test the null hypothesis of a multiplica-tive grammar, in which the probability of a cer-tain form appearing as a word is the productof the probabilities of each of its components,against a counter-hypothesis of a grammar in-cluding additional penalties for combinations oflow-probability sequences; see section 3.2.

1.3 The null hypotheses: A lexiconselected by chance

Similar to the resampling procedures used byDautriche et al. (2017), sample lexicons weregenerated to estimate statistics of a baselinepopulation distribution as predicted from thephonotactic constraints and lexicon size of Man-darin, to be tested against measurements of thereal lexicon. A lexicon can be thought of as aset of word forms drawn from a pool of all formsthat are licit within the phonotactic constraintsof a language. To draw a lexicon with k con-trasting items from a constrained pool of n licitshapes, there are

(nk

)possibilities for lexicons;

generated lexicons are drawn from this pool ofpossibilities.

If the lexicon is not under any pressure tomaximize either distinctiveness or regularity, thesampling procedure from the pool of candidateword-forms will be random; Study 1 tests thepredictions made by random sampling.

If independent phonotactic constraints aresufficient to capture well-formedness and thuspredict frequency distributions in lexicons, prob-abilities of phonemic shapes will follow fromprobabilities of their subparts (Albright, ms).Study 2 tests the predictions made by indepen-dent evaluation of constraints; if constraints areindependent, generated lexicons that are ran-domly sampled from the pool of forms based onprobabilities produced by the phonotactic con-straints of a language without any constraint in-teraction should have distributions similar to thereal lexicon.

Both Study 1 and Study 2 are based on con-structing phonotactically-constrained pools ofwords from which generated lexicons are sam-pled. The word-pools and the artificial lexiconsare generated according to the parameters laid

out in the following sections, to create baselinesfor evaluating what factors are influencing thereal lexicons. We aim to show that real lexi-cons cannot be explained by randomly samplingfrom a constrained phonological space and thatthe constraints on the phonological space call fora model that includes superadditivity.

2 Study 1: Evaluating thedistinctiveness of Mandarinmonosyllabic lexicon

2.1 Background

Study 1 investigated monosyllabic words inMandarin Chinese. Mandarin has a densephonological space and limited licit syllablestructures, which make it possible to enumerateall phonologically permissible forms with rela-tively few assumptions.

Mandarin syllables are limited to a structurewith at most four phonemes: CGVX (Li andThompson, 1987). C stands for a consonantin the onset position; G stands for a glide; Vstands for a vowel; and X can either be a nasal/n/ or /N/, or the off-glide of a diphthong. Ev-ery syllable must have a vowel, but all otherpositions can be empty (Duanmu, 2009). Inaddition, each syllable has one of four phono-logical tones. Given only these structural con-straints, the phonological inventory would allow7,600 possible syllables (Duanmu, 2009). Mostwords in Mandarin are monosyllabic or disyl-labic, so limitations in licit syllables result in arather small number of possible words.

If there is a pressure towards contrastivenesswithin the lexicon, it should be particularly ap-parent in a language with such a small numberof phonotactically licit forms. Thus, our predic-tion was that the real Mandarin lexicon wouldbe more dispersed than the randomly sampledgenerated lexicons.

2.2 Methods

For Study 1, we used the LDC Mandarin Lexi-con and the corresponding frequency data fromthe LDC Mandarin Callhome training tran-scripts (Huang et al., 1997). Words which in-clude the 5th tone (‘neutral tone’) or lack a nu-

166

Page 5: The Organization of Lexicons: a Cross-Linguistic Analysis

clear vowel were excluded from analysis, to avoidclitics (Chao, 1968), which would be outside thescope of this analysis.

The crucial aspect of lexical contrast is phono-logical form, so we based perceptual distinctive-ness on phonemic representations rather thanphonetic measurements, using features to calcu-late distance between consonants and distancebetween formants to calculate distance betweenvowels. Based on misperception studies, percep-tual distance between the presence and absenceof a segment is highly sensitive to the segmentand its environment (Tang, 2015; Sanker, 2016);such differences do not clearly fit into the samesystem as contrasts between phonemes, so wordswith different syllable structures were consideredseparately. Focusing just on the monosyllabiclexicon of Mandarin, we looked at CV (open syl-lable) and CVX (closed syllable) structures.

2.2.1 Defining the licit structures

In order to sample generated lexicons of Man-darin from the hypothesized phonological spacedescribed in 1.3, a list of well-formed syllableswas generated for CV and CVX forms, to rep-resent candidate word-forms. In order to gener-ate such lists, all combinations of CV and CVXstructures were laid out, based on the phonologi-cal segment inventory of Mandarin; then the licitword-forms of the two structures were filteredthrough phonotactic models, using n-grams forphonological sequences (Jurafsky and Martin,2008).

For CV words, well-formedness was deter-mined using a phonological bi-gram (bi-phone)model, in which the probability of a word wasdefined as the product of the individual proba-bilities for all segments given the phoneme im-mediately preceding each; the probability of thetone was conditioned on the vowel. CVX wordswere evaluated similarly, but with a tri-phonemodel instead of a bi-phone model due to the ex-tra degree of freedom induced by the coda. Theprobability of a word was defined as the prod-uct of the individual probabilities for all seg-ments given the two phonemes preceding each,and tone was still conditioned on the vowel. Be-cause only monosyllabic words were considered,

there is no possibility of long-distance dependen-cies. Segment probabilities were based on all at-tested syllables in the LDC lexicon. Under thismodel, words with probabilities higher than 0were considered well-formed. Beyond that, theprobabilities produced by this model were notused for Study 1.

The resulting lists of forms contain 304 CVsyllables (out of 360 structurally possible com-binations) and 544 CVX syllables (out of 1440structurally possible combinations), which rep-resent the number of phonotactically licit syl-lables of these shapes. Of these, there are 187monosyllabic words with CV structure attestedin the LDC Mandarin Lexicon and 327 wordswith CVX structure. The two filtered lists ofwords serve as phonologically licit pools of wordsfor the sampling procedure described in 2.2.3.

2.2.2 Defining the distinctiveness oflexicons

In evaluating dispersion within lexicons, thedistinctiveness between any two segments σk

and σv is denoted as d(σk, σv). Comparisonswere conducted with corresponding segmentsfrom the syllables being compared, e.g. com-paring onsets to onsets.

In order to reflect the perceptual differencesbetween segments, the metrics for distinctive-ness differed for consonants and for vowels. Forconsonants, the distinctiveness between eachpair of sounds was determined by the numberof featural differences, which has been shownto correlate with perceptual measures of dis-tinctiveness (Bailey and Hahn, 2005; White andMorgan, 2008). For example, d(/ph/, /th/) = 1,because the two phonemes differ only in place of

articulation; d(/f/, />úùh/) = 4, because the two

phonemes differ in place, continuance, delayedrelease and aspiration.

For vowels, the distinctiveness was based onthe Manhattan distance between each vowelpair in the three-dimensional vowel space de-fined in Flemming (2004), where F1, F2 andF3 values are mapped onto a set of integersin each dimension, given the number of cross-linguistically possible contrasts making use of

167

Page 6: The Organization of Lexicons: a Cross-Linguistic Analysis

each dimension.1 This choice of metric, ratherthan a feature-based metric, was due to per-ception studies suggesting that acoustic differ-ences provide a better model for vowel percep-tion than a feature model does (Ettlinger andJohnson, 2009). In order to have equal weight-ing of contrasts between consonants and con-trasts between vowels, measurements of voweldistinctiveness were scaled down by 1/3.2

Each tone was treated like a distinct segment,but with a binary measure of distinctiveness: 1(different) or 0 (the same). This decision wasbased on the paucity of available data on tonemisperception patterns among native speakersof Mandarin and based on the variation in whatdistinctiveness patterns are suggested by resultsfrom different tasks (Huang and Johnson, 2010).

The distinctiveness of a word from each otherword was measured with the log-transformedsum of each segment’s distinctiveness from thecorresponding segment in the other word. Forexample, for a word of CVX structure Sr, a seg-ment σk in position Φn of the syllable is denotedby σr,Φn . The distinctiveness between 2 syllablesSr and St, as denoted by d(Sr, St), is the log-transformed sum of d(σr,Φpos , σt,Φpos) for all posi-tions of the syllable POS (2.1). In addition, a 1was added to the sum before log-transformationso that minimal pairs would have a distinctive-ness score larger than 0.

(2.1) The distinctiveness between word Sr andSt

d(Sr, St) = log[∑

pos ∈ POS

d(σr,Φpos , σt,Φpos) + 1]

An average distinctiveness of all pairs of wordsin the given phonological system M’ was cal-culated for each generated lexicon (2.2). Thehigher this number is, the more distinctions inthe possible phonological space the lexicon hasused.

1For example, /u/ is represented in the vowel spaceas [F1: 1, F2: 1, F3: 1] and /i/ is represented as [F1: 1,F2: 6, F3: 3], so /u/ and /i/ differ by 7 units in total.

2This scale was based on aligning the featurally-defined distinctiveness of the three Mandarin glides (/w/,/j/, and /4/) with the formant-based distinctiveness ofthe three corresponding vowels (/u/, /i/ and /y/).

(2.2) The average distinctiveness (by word pairs)of a size N lexicon from a given system M’

DM ′ =

∑w ∈ WM′

∑w ∈ WM′

d(Sw, Sw)/2

P (N, 2)/2

2.2.3 Generating baseline inventories

Baseline inventories for each syllable struc-ture were generated in three steps. First, thesummed frequency of CV (or CVX) words in thereal monosyllabic lexicon was used to generatea random set of words within the phonologicallylicit space defined in 2.2.1. The generated lexi-cons were then optimized to minimize differencesfrom the real Mandarin lexicon in lexicon size,word frequency distribution, and individual seg-ment frequencies. By minimizing the differencesin these parameters, we ensured that the gener-ated baseline lexicons would be comparable tothe Mandarin lexicon. Finally, the generatedlexicons were filtered to further ensure a closematch with these parameters, limiting the gen-erated lexicons to those with a size within 5%of the original lexicon size and a correlation ofat least 0.95 between their segment frequenciesand the segment frequencies of Mandarin, andbetween their word frequency distribution andthat of Mandarin.

These parameters served to hold articulatoryeffort constant, with variation only in distinc-tiveness, based on the assumption that the over-all effort of a language is the mean of the effortneeded for all words of the language and the ef-fort associated with each word is the sum of theeffort associated with all of its segments.

2.3 Results

Consistent with the central limit theorem andthe independent sampling process, the distri-bution of distinctiveness scores of the gener-ated lexicons of both CV and CVX struc-tures conform to normality, as confirmed by theKolmogorov-Smirnov test (CV: p = 0.963, CVX:p = 0.881).

168

Page 7: The Organization of Lexicons: a Cross-Linguistic Analysis

Figure 1: Standardized D·(distinctiveness) of 389 gen-

erated CV monosyllabic lexicons.

The shaded area in Figure 1 demonstratesthe distribution of standardized distinctivenessscores of generated lexicons of monosyllabic CVwords. Standardized distinctiveness of the realCV monosyllabic lexicon of Mandarin (indicatedby the heavy dashed line) was greater thanroughly 67.6% of generated counterparts (indi-cated by the light grey area). While the reallexicon is above the mean, this result is not con-clusive evidence that the real lexicon differs fromlexicons drawn randomly from the phonologicalspace, given that the real lexicon is not an out-lier or at all close to the top or bottom 2.5% ofthe distribution.

Figure 2: Standardized D·(distinctiveness) of 627 gen-

erated CVX monosyllabic lexicons.

As indicated by Figure 2, the distinctivenessof the real monosyllabic Mandarin CVX lexicon

is better than only 17.7% of baselines. However,the real lexicon is not enough far enough towardsthe edge of the distribution to demonstrate thatit differs from lexicons drawn randomly fromthe phonological space, because while it is lowerthan the mean, it is not an outlier.

2.4 Discussion

Mandarin words of both CV structure and CVXstructure displayed similarly inconclusive pat-terns. Compared to randomly generated lexi-cons following the same parameters of phonol-ogy, word frequency, and size, the real lexiconwas not an outlier in distinctiveness either inCV or CVX syllables, though the real CV lexi-con was slightly better than average among thegenerated lexicons and the real CVX lexicon wasworse. Because the CV phonological space issmaller and more saturated, it is not surpris-ing that CV portion of Mandarin monosyllabiclexicon would be relatively more efficient thanits CVX counterpart. However, in general, theresults did not support the hypothesis that dis-persion plays a large role in shaping lexicons,and are more consistent with the opposite pat-tern of clumping, as seen in Dautriche et al.’s(2017) results.

The inconclusive results might in part be dueto issues with the metrics used for distinctive-ness, as perceptual data suggests that differentpositions in a syllable are not equally salient. Atleast within English, listeners are most sensitiveto mispronunciations in onsets, less so in codas,and least sensitive in nuclei (Franklin and Mor-gan, 2017), and are more accurate in perceivingonsets than codas, though this can vary depend-ing on listeners’ native language, even for thesame stimuli (Sanker, 2016). Given such find-ings, different syllable positions might best begiven different weights in distinctiveness whengenerating sample lexicons. Further researchinto Mandarin speakers’ patterns of mispercep-tions at the word level and the segment levelwould further help in accurately quantifying dis-tinctiveness.

169

Page 8: The Organization of Lexicons: a Cross-Linguistic Analysis

3 Study 2: Evaluating thewell-formedness of generatedMandarin and Englishmonosyllabic lexicons

3.1 Background

Study 1 shows that a lexicon might not be asdispersed as the functional goal of communica-tive clarity would predict. In Study 2, we ex-amine whether this lack of dispersion can bepartially explained by gradient well-formednessconstraints shaping the lexicon, disproportion-ately favoring words with high-probability se-quences.

3.2 Methods

In Study 2, English and Mandarin were used aslanguages for a preliminary cross-linguistic in-vestigation. The same LDC Mandarin Lexiconfrom Study 1 was used for Mandarin and theCMU Dictionary (Weide, 2008) was used for thephonemic representations for American English.CMU Dictionary entries were spell-checked withGNU Aspell to exclude rare names and borrow-ings from other languages. Function words andwords with the rarest 1% of onsets and codaswere also excluded, due to the uniqueness oftheir phonological structure, as many functionwords are clitics and can be reduced more thanother words, and words with highly unusual se-quences are likely to have unique etymologiesthat do not reflect the overall pressures of thelanguage.

Only monosyllabic words were used. As dis-cussed in 2.2.1, this limitation meant there wereno long-distance dependencies that needed to beaccounted for. Phonotactics were represented bya tri-phone model of sound sequences (as intro-duced in 2.2.1), with the predictability of eachsound based on the two preceding phonemes.Study 2 focused on the word probabilities as-signed to forms within the generated lexicons.Frequencies, as captured by n-gram models inthis study, were used to approximate distribu-tional markedness (Albright, ms). The distinc-tion between frequency and markedness is be-yond the scope of this paper.

Sampling followed a sampling procedure sim-

ilar to that of Study 1. First, all combina-tions of segments in all possible syllable posi-tions were laid out, producing lists of potentialwords. Then the real monosyllabic lexicons ofMandarin and English were used to train thetri-phone phonotactic models for each language,assigning log probabilities to all forms in theword lists based on the sum of log probabilitiesof each word’s components. The wordlists werethen filtered, only retaining forms with probabil-ities larger than 0, meaning that they were well-formed within the tri-phone model. Finally, inorder to generate the artificial lexicons for En-glish, words were randomly taken from the fil-tered English list, with the number of words ofdifferent lengths kept consistent with the realEnglish lexicon. The same was done for Man-darin. Thus, the generated baseline lexicons hadthe same distribution of word lengths and thesame size as the real lexicons.

Distributions of log probabilities of the base-line lexicons were compared to the real lexicons,to test whether the probability distributions ofreal lexicons differ from randomly generated lex-icons based on phonotactic models which assumeindependence of subparts more than one seg-ment apart. Logarithmic scales for probability,with probabilities of subparts combined multi-plicatively, have been found previously to havea strong positive correlation with gradient well-formedness ratings and decisions about accept-ability of nonce words (Frisch et al., 2000; Cole-man and Pierrehumbert, 1997), though thesestudies did not look for patterns in where thedata deviated from the model.

3.3 Results

Both in English and Mandarin, the real lex-icons exhibited over-representation of high-probability forms and under-representation oflow-probability forms.

The independent sampling process was essen-tially producing replications which could be usedto bootstrap variance estimation for the esti-mators of interest, so standard errors and con-fidence intervals of estimators other than themean were constructed with the bootstrap dis-tributions calculated using the generated sample

170

Page 9: The Organization of Lexicons: a Cross-Linguistic Analysis

lexicons.

Figure 3: Probability density distributions of the origi-

nal and 100 generated Mandarin monosyllabic lexicons

Figure 3 illustrates the probability density dis-tributions of the original and generated sampleMandarin monosyllabic lexicons over word prob-abilities as defined in the phonotactic model.The figure demonstrates that the real mono-syllabic Mandarin lexicon (indicated by thedashed line) is more clustered around the higher-probability types than the sample lexicons (in-dicated by the light solid lines).

Three statistics were used to test whether thedistribution of log probabilities in the real lexi-con is likely to come from the population distri-bution. The sample means were tested againstthe mean of the original lexicon in a two-tailedtest (t = −219.463, p ≈ 0.000). This showsthat the mean of word probabilities in the actualMandarin monosyllabic lexicon was significantlyhigher than the true mean of the population dis-tribution generated from the null hypothesis, asmeasured from the generated sample lexicons.

In addition to the mean, the bootstrap per-centile confidence intervals (CI) of variance andskewness of the hypothesized Mandarin popu-lation distribution were approximated. The re-sults show that the variance of the probabilitydistribution of the real lexicon is significantlylower than the variance of the generated popula-

tion (σ2original = 0.26, 95% CI: (0.93, 1.18)). The

probability distribution of the original lexiconand the estimated population distribution areboth left-skewed (negative skewness), but theabsolute value of the skewness of the original lex-icon is significantly smaller (skoriginal = −1.38,95% CI: (−2.01, −1.70)).

Figure 4: Probability density distributions of the origi-

nal and 100 generated English monosyllabic lexicons

The English data (illustrated in Figure 4)were analyzed with the same statistical meth-ods as the Mandarin data. As in the Mandarinresults, the mean of the probability distribu-tion of the real English monosyllabic lexicon issignificantly higher than that of the populationdistribution generated from the null hypothesis(t = −561.967, p ≈ 0.000).

The comparison in variance and skewness be-tween the original English lexicon and the hy-pothesized English population distribution alsoexhibit results similar to the Mandarin data.The variance of the real lexicon is significantlylower than the variance of the generated popu-lation (σ2

original = 1.36, 95% CI: (3.05, 3.25)),and the absolute value of the skewness of thereal lexicon is significantly smaller than thatof the skewness of the hypothesized population(skoriginal = −0.16, 95% CI: (−0.52, −0.39)).

171

Page 10: The Organization of Lexicons: a Cross-Linguistic Analysis

3.4 Discussion

If the lexicon is shaped only by local phonolog-ical constraints, as controlled for in the phono-tactic models used in the current study, samplelexicons generated by the models should followroughly the same distribution as the real Man-darin and English monosyllabic lexicons. Theresults of this study provide strong evidenceagainst this null hypothesis.

Within both English and Mandarin, the realmonosyllabic lexicons have higher means andsmaller variance than the generated baselines,which indicates that the real lexicons make useof more high-probability word types than wouldbe expected by the phonotactic models used.Additionally, the real lexicons were less skewedthan the baselines, meaning that the probabilitydistribution of the two real lexicons have thinneror shorter tails than their corresponding baselinelexicons, which also suggests that real lexiconstend towards higher probability words, with afast drop-off in the frequency of lower probabil-ity words. These results seem to suggest thatthere is a strong superadditivity effect that pe-nalizes words with multiple low-probability sub-parts, potentially in combination with a ten-dency to re-use high-probability sequences, assuggested by Dautriche et al. (2017).

4 General Discussion

Study 1 found that in Mandarin Chinese, disper-sion is not a prominent force in the shaping ofthe lexicon; evidence for a pressure towards clus-tering was somewhat more suggestive, thoughnot decisive. Study 2 showed that the lexi-cons of Mandarin and English have more wordsof higher probability and fewer words of lowerprobability than would be expected by a phono-logical model in which constraints are indepen-dent. This result reinforces the results in Study1, indicating a lack of dispersion and instead atrend towards clumping within high probabilityforms.

These results can fit into Albright’s (ms.)proposed grammar of weighted constraints, inwhich he suggests that inputs with multiplemarkedness violations have a superadditive ef-

fect that can overcome a threshold of well-formedness, resulting in forms which are unat-tested despite not being directly prohibited. Thenext step of Study 2 is to expand it to more lan-guages, to test how consistent the effect of su-peradditivity is cross-linguistically. Future workshould also investigate whether the observedpatterns in lexicons are driven by the interactionof particular constraints, or if they result from ageneral pattern in how all constraints combine.

The superadditivity account and the pre-sented evidence are consistent with Dautricheet al.’s (2017) findings that lexicons are moreregular than expected. However, a pressure for“clumpiness” and a superadditivity effect makedifferent predictions. According to Dautriche etal. (2017), regularity in the lexicon is due tore-use of phonological patterns, which shouldproduce particularly high peaks among high-probability forms, with less of an effect onthe low-probability tail. In the superadditiv-ity account, regularity is due to combinationsof markedness violations resulting in such lowprobabilities that many of them never appear,resulting in a shorter and thinner tail, with lessof an effect on the shape of the peak. Both pres-sures could also co-exist. It would be informa-tive for future work to tease apart the predic-tions made by each account.

172

Page 11: The Organization of Lexicons: a Cross-Linguistic Analysis

References

Adam Albright. ms. Cumulative violations and com-plexity thresholds.

Todd M. Bailey and Ulrike Hahn. 2005. Phonemesimilarity and confusability. Journal of Memoryand Language, 52(3):339–362.

Yuen Ren Chao. 1968. A Grammar of Spoken Chi-nese. University of California Press, Berkeley andLos Angeles.

George N. Clements. 2003. Feature economy insound systems. Phonology, 20(3):287–333.

John Coleman and Janet Pierrehumbert. 1997.Stochastic phonological grammars and acceptabil-ity. In John Coleman, editor, Computationalphonology: Third meeting of the ACL special in-terest group in computational phonology, pages 49–56, Somerset, NJ. Association for ComputationalLinguistics.

Isabelle Dautriche, Kyle Mahowald, Edward Gib-son, Anne Christophe, and Steven T. Piantadosi.2017. Words cluster phonetically beyond phono-tactic regularities. Cognition, 163:128–145.

San Duanmu. 2009. Syllable Structure: The Limitsof Variation. Oxford University Press, New York,USA.

Marc Ettlinger and Keith Johnson. 2009. Vowel dis-crimination by english, french and turkish speak-ers: Evidence for an exemplar-based approach tospeech perception. Phonetica, 66(4):222–242.

Edward Flemming, 2004. chapter Contrast and Per-ceptual Distinctiveness, pages 232 – 276. Cam-bridge University Press.

Lauren Franklin and James Morgan. 2017. On thenature of vocalic representation during lexical ac-cess. The Journal of the Acoustical Society ofAmerica, 141(5):4038–4038.

Stefan A Frisch, Nathan R Large, and David BPisoni. 2000. Perception of wordlikeness: Effectsof segment probability and length on the process-ing of nonwords. Journal of memory and language,42(4):481–496.

Christopher Green and Stuart Davis, 2014. Per-spectives on phonological theory and development,in honor of Daniel A. Dinnsen, chapter Superad-ditivity and limitations on syllable complexity inBambara words, pages 223–247. John Benjamins,Amsterdam.

Bruce Hayes and Colin Wilson. 2008. A maximumentropy model of phonotactics and phonotacticlearning. Linguistic Inquiry, 39(3):379–440.

Tsan Huang and Keith Johnson. 2010. Languagespecificity in speech perception: Perception of

mandarin tones by native and nonnative listeners.Phonetica, 67(4):243–267.

Shudong Huang, Xuejun Bian, Grace Wu, and Cyn-thia McLemore, 1997. LDC Mandarin Lexicon.University of Pennsylvania.

Daniel Jurafsky and James H. Martin. 2008. Speechand language processing: an introduction to natu-ral language processing, computational linguisticsand speech recognition. Prentice Hall. Pearson Ed-ucation, Inc., Upper Saddle River, New Jersey,2nd edition.

Brett Kessler and Rebecca Treiman. 1997. Syllablestructure and the distribution of phonemes in en-glish syllables. Journal of Memory and language,37(3):295–311.

Charles N. Li and Sandra A. Thompson, 1987. TheWorld’s Major Languages, chapter Chinese. Ox-ford University Press.

Paul A Luce and David B Pisoni. 1998. Recogniz-ing spoken words: The neighborhood activationmodel. Ear and Hearing, 19(1):1.

Steven T. Piantadosi, Harry J. Tily, and EdwardGibson. 2009. The communicative lexicon hy-pothesis. In The 31st annual meeting of the Cogni-tive Science Society (CogSci09), pages 2582–2587.

Chelsea Sanker. 2016. Patterns Of Misperception OfArabic Guttural And Non-Guttural Consonants.Ph.D. thesis.

Stephanie S. Shih. 2016. Super additive simi-larity in dioula tone harmony. In Kyeong minKim, Pocholo Umbal, Trevor Block, QueenieChan, Tanie Cheng, Kelli Finney, Mara Katz, So-phie Nickel-Thompson, and Lisa Shorten, editors,Proceedings of the 33rd West Coast Conferenceon Formal Linguistics, pages 361–370. CascadillaProceedings Project, Somerville, MA, USA.

Kevin Tang. 2015. Naturalistic speech mispercep-tion. Ph.D. thesis.

Andrew Wedel, Scott Jackson, and Abby Kaplan.2013. Functional load and the lexicon: Evidencethat syntactic category and frequency relation-ships in minimal lemma pairs predict the loss ofphoneme contrasts in language change. Languageand speech, 56(3):395–417.

Robert L. Weide, 2008. The CMU pronunciation dic-tionary. Carnegie Mellon University, 0.7a edition.

Katherine S. White and James L. Morgan. 2008.Sub-segmental detail in early lexical representa-tions. Journal of Memory and Language, 59:114–132.

George K. Zipf. 1929. Relative frequency as a de-terminant of phonetic change. Harvard Studies inClassical Philology, 40:1–95.

173