
On Chomsky and the Two Cultures of Statistical Learning

[Image: MIT 150] [Image: Noam Chomsky]

At the Brains, Minds, and Machines symposium held during MIT's 150th birthday party, Technology Review reports that Prof. Noam Chomsky

derided researchers in machine learning who use purely statistical methods to produce behavior that mimics something in the world, but who don't try to understand the meaning of that behavior.

The transcript is now available, so let's quote Chomsky himself:

It's true there's been a lot of work on trying to apply statistical models to various linguistic problems. I think there have been some successes, but a lot of failures. There is a notion of success ... which I think is novel in the history of science. It interprets success as approximating unanalyzed data.

This essay discusses what Chomsky said, speculates on what he might have meant, and tries to determine the truth and importance of his claims.

Chomsky's remarks were in response to Steven Pinker's question about the success of probabilistic models trained with statistical methods.

1. What did Chomsky mean, and is he right?
2. What is a statistical model?
3. How successful are statistical language models?
4. Is there anything like their notion of success in the history of science?
5. What doesn't Chomsky like about statistical models?

What did Chomsky mean, and is he right?

I take Chomsky's points to be the following:

A. Statistical language models have had engineering success, but that is irrelevant to science.

B. Accurately modeling linguistic facts is just butterfly collecting; what matters in science (and specifically linguistics) is the underlying principles.

C. Statistical models are incomprehensible; they provide no insight.

D. Statistical models may provide an accurate simulation of some phenomena, but the simulation is done completely the wrong way; people don't decide what the third word of a sentence should be by consulting a probability table keyed on the previous two words, rather they map from an internal semantic form to a syntactic tree-structure, which is then linearized into words. This is done without any probability or statistics.

E. Statistical models have been proven incapable of learning language; therefore language must be innate, so why are these statistical modelers wasting their time on the wrong enterprise?

Is he right? That's a long-standing debate. These are my answers:

A. I agree that engineering success is not the goal or the measure of science. But I observe that science and engineering develop together, and that engineering success shows that something is working right, and so is evidence (but not proof) of a scientifically successful model.

B. Science is a combination of gathering facts and making theories; neither can progress on its own. I think Chomsky is wrong to push the needle so far towards theory over facts; in the history of science, the laborious accumulation of facts is the dominant mode, not a novelty. The science of understanding language is no different than other sciences in this respect.

C. I agree that it can be difficult to make sense of a model containing billions of parameters. Certainly a human can't understand such a model by inspecting the values of each parameter individually. But one can gain insight by examining the properties of the model—where it succeeds and fails, how well it learns as a function of data, etc.

D. I agree that a Markov model of word probabilities cannot model all of language. It is equally true that a concise tree-structure model without probabilities cannot model all of language. What is needed is a probabilistic model that covers words, trees, semantics, context, discourse, etc. Chomsky dismisses all probabilistic models because of shortcomings of particular 50-year-old models. I understand how Chomsky arrives at the conclusion that probabilistic models are unnecessary, from his study of the generation of language. But the vast majority of people who study interpretation tasks, such as speech recognition, quickly see that interpretation is an inherently probabilistic problem: given a stream of noisy input to my ears, what did the speaker most likely mean? Einstein said to make everything as simple as possible, but no simpler. Many phenomena in science are stochastic, and the simplest model of them is a probabilistic model; I believe language is such a phenomenon and therefore that probabilistic models are our best tool for representing facts about language, for algorithmically processing language, and for understanding how humans process language.

E. In 1967, Gold's Theorem showed some theoretical limitations of logical deduction on formal mathematical languages. But this result has nothing to do with the task faced by learners of natural language. In any event, by 1969 we knew that probabilistic inference (over probabilistic context-free grammars) is not subject to those limitations (Horning showed that learning of PCFGs is possible). I agree with Chomsky that it is undeniable that humans have some innate capability to learn natural language, but we don't know enough about that capability to rule out probabilistic language representations, nor statistical learning. I think it is much more likely that human language learning involves something like probabilistic and statistical inference, but we just don't know yet.

Now let me back up my answers with a more detailed look at the remaining questions.

What is a statistical model?

A statistical model is a mathematical model which is modified or trained by the input of data points. Statistical models are often but not always probabilistic. Where the distinction is important we will be careful not to just say "statistical" but to use the following component terms:

• A mathematical model specifies a relation among variables, either in functional form that maps inputs to outputs (e.g. y = m x + b) or in relation form (e.g. the following (x, y) pairs are part of the relation).

• A probabilistic model specifies a probability distribution over possible values of random variables, e.g., P(x, y), rather than a strict deterministic relationship, e.g., y = f(x).

• A trained model uses some training/learning algorithm to take as input a collection of possible models and a collection of data points (e.g. (x, y) pairs) and select the best model. Often this is in the form of choosing the values of parameters (such as m and b above) through a process of statistical inference.
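To make these terms concrete, here is a minimal sketch (mine, not from the original essay) of a trained mathematical model: a least-squares fit that selects the parameters m and b of y = m x + b from observed (x, y) pairs.

    # A minimal sketch of a trained model: choose the parameters m and b
    # of y = m*x + b that best fit observed (x, y) data points.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 50)
    y = 3.0 * x + 1.0 + rng.normal(0, 0.5, size=x.shape)  # noisy observations

    # Least-squares fitting is the statistical inference step.
    m, b = np.polyfit(x, y, deg=1)
    print(f"trained parameters: m = {m:.2f}, b = {b:.2f}")  # near 3.0 and 1.0

This model is trained but not probabilistic: it outputs a single deterministic prediction rather than a distribution over possible values of y given x.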

[Image: Claude Shannon]

For example, a decade before Chomsky, Claude Shannon proposed probabilistic models of communication based on Markov chains of words. If you have a vocabulary of 100,000 words and a second-order Markov model in which the probability of a word depends on the previous two words, then you need a quadrillion (10^15) probability values to specify the model. The only feasible way to learn these 10^15 values is to gather statistics from data and introduce some smoothing method for the many cases where there is no data. Therefore, most (but not all) probabilistic models are trained. Also, many (but not all) trained models are probabilistic.
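As an illustration of such a trained probabilistic model (a toy sketch of my own, not Shannon's construction), a second-order Markov model can be estimated by counting trigrams and smoothing the counts:

    # Sketch of a second-order (trigram) Markov model of words,
    # trained by counting and smoothed with add-one (Laplace) smoothing.
    from collections import Counter

    corpus = "the dog ran . the dog slept . the cat ran .".split()
    vocab = set(corpus)
    trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
    bigram_counts = Counter(zip(corpus, corpus[1:]))

    def p_next(w1, w2, w3):
        """P(w3 | w1, w2), with add-one smoothing over the vocabulary."""
        return (trigram_counts[(w1, w2, w3)] + 1) / (bigram_counts[(w1, w2)] + len(vocab))

    print(p_next("the", "dog", "ran"))   # seen trigram: relatively high
    print(p_next("the", "dog", "cat"))   # unseen trigram: small but nonzero

The smoothing step is what handles "the many cases where there is no data": every possible continuation gets a small nonzero probability.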

As another example, consider the Newtonian model of gravitational attraction, which says that the force between two objects of mass m1 and m2 a distance r apart is given by

F = G m1 m2 / r^2

where G is the universal gravitational constant. This is a trained model because the gravitational constant G is determined by statistical inference over the results of a series of experiments that contain stochastic experimental error. It is also a deterministic (non-probabilistic) model because it states an exact functional relationship. I believe that Chomsky has no objection to this kind of statistical model. Rather, he seems to reserve his criticism for statistical models like Shannon's that have quadrillions of parameters, not just one or two.
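In the same spirit, here is a hedged sketch of that inference, run on simulated rather than real experiments (the apparatus numbers and noise level are invented):

    # Sketch: estimate the gravitational constant G by statistical
    # inference over simulated experiments with stochastic error.
    import numpy as np

    G_TRUE = 6.674e-11                 # used only to simulate the experiments
    m1, m2, r = 1000.0, 2000.0, 0.5    # hypothetical masses (kg) and distance (m)
    rng = np.random.default_rng(1)

    f_true = G_TRUE * m1 * m2 / r**2
    forces = f_true + rng.normal(0, 0.05 * f_true, size=100)  # 5% measurement noise

    # Given the fixed functional form F = G*m1*m2/r^2, the least-squares
    # estimate of the single parameter G is:
    G_hat = forces.mean() * r**2 / (m1 * m2)
    print(f"estimated G = {G_hat:.3e}")

The functional form stays fixed and deterministic; only the one parameter G is trained from data.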

(This example brings up another distinction: the gravitational model is continuous and quantitative whereas the linguistic tradition has favored models that are discrete, categorical, and qualitative: a word is or is not a verb, there is no question of its degree of verbiness. For more on these distinctions, see Chris Manning's article on Probabilistic Syntax.)

A relevant probabilistic statistical model is the ideal gas law, which describes the pressure P of a gas in terms of the number of molecules N, the temperature T, the volume V, and Boltzmann's constant k:

P = N k T / V.

The equation can be derived from first principles using the tools of statistical mechanics. It is an uncertain, incorrect model; the true model would have to describe the motions of individual gas molecules. This model ignores that complexity and summarizes our uncertainty about the location of individual molecules. Thus, even though it is statistical and probabilistic, even though it does not completely model reality, it does provide both good predictions and insight—insight that is not available from trying to understand the true movements of individual molecules.

Now let's consider the non-statistical model of spelling expressed by the rule "I before E except after C." Compare that to the probabilistic, trained statistical model:

P(IE) = 0.0177    P(CIE) = 0.0014    P(*IE) = 0.0163
P(EI) = 0.0046    P(CEI) = 0.0005    P(*EI) = 0.0041

This model comes from statistics on a corpus of a trillion words of English text. The notation P(IE) is the probability that a word sampled from this corpus contains the consecutive letters "IE." P(CIE) is the probability that a word contains the consecutive letters "CIE", and P(*IE) is the probability of any letter other than C followed by IE. The statistical data confirms that IE is in fact more common than EI, and that the dominance of IE lessens when following a C, but contrary to the rule, CIE is still more common than CEI. Examples of "CIE" words include "science," "society," "ancient" and "species." The disadvantage of the "I before E except after C" model is that it is not very accurate. Consider:

Accuracy("I before E") = 0.0177 / (0.0177 + 0.0046) = 0.793
Accuracy("I before E except after C") = (0.0005 + 0.0163) / (0.0005 + 0.0163 + 0.0014 + 0.0041) = 0.753
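The probabilities in that table are just normalized counts. This sketch shows the counting on a small stand-in word list (the essay's actual numbers came from a trillion-word corpus):

    # Sketch: estimate P(IE), P(CIE), P(*IE), etc. as the fraction of
    # corpus words containing each letter pattern. The word list is a
    # tiny stand-in for the trillion-word corpus used above.
    import re

    words = ["science", "believe", "receive", "their", "species",
             "ancient", "friend", "ceiling", "weird", "society"]

    def p(pattern):
        """Fraction of words containing a match for the regex pattern."""
        return sum(bool(re.search(pattern, w)) for w in words) / len(words)

    print("P(IE)  =", p("ie"))
    print("P(EI)  =", p("ei"))
    print("P(CIE) =", p("cie"))
    print("P(CEI) =", p("cei"))
    print("P(*IE) =", p("[^c]ie"))   # IE preceded by a letter other than C
    print("P(*EI) =", p("[^c]ei"))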

A more complex statistical model (say, one that gave the probability of all 4-letter sequences, and/or of all known words) could be ten times more accurate at the task of spelling, but offers little insight into what is going on. (Insight would require a model that knows about phonemes, syllabification, and language of origin. Such a model could be trained (or not) and probabilistic (or not).)

As a final example (not of statistical models, but of insight), consider the Theory of Supreme Court Justice Hand-Shaking: when the Supreme Court convenes, all attending justices shake hands with every other justice. The number of attendees, n, must be an integer in the range 0 to 9; what is the total number of handshakes, h, for a given n? Here are three possible explanations:

A. Each of n justices shakes hands with the other n - 1 justices, but that counts Alito/Breyer and Breyer/Alito as two separate shakes, so we should cut the total in half, and we end up with h = n × (n - 1) / 2.

B. To avoid double-counting, we will order the justices by seniority and only count a more-senior/more-junior handshake, not a more-junior/more-senior one. So we count, for each justice, the shakes with the more junior justices, and sum them up, giving h = Σ i = 1..n (i - 1).

C. Just look at this table:

n: 0 1 2 3 4 5 6 7 8 9
h: 0 0 1 3 6 10 15 21 28 36

Some people might prefer A, some might prefer B, and if you are slow at doing multiplication or addition you might prefer C. Why? All three explanations describe exactly the same theory — the same function from n to h, over the entire domain of possible values of n. Thus we could prefer A (or B) over C only for reasons other than the theory itself. We might find that A or B gave us a better understanding of the problem. A and B are certainly more useful than C for figuring out what happens if Congress exercises its power to add an additional associate justice. Theory A might be most helpful in developing a theory of handshakes at the end of a hockey game (when each player shakes hands with players on the opposing team) or in proving that the number of people who shook an odd number of hands at the MIT Symposium is even.
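The equivalence of the three explanations is easy to check in code (a sketch of mine, not part of the original argument):

    # Three formulations of the same handshake theory h(n).
    TABLE = {0: 0, 1: 0, 2: 1, 3: 3, 4: 6, 5: 10, 6: 15, 7: 21, 8: 28, 9: 36}

    def h_formula(n):   # Explanation A: closed form n(n-1)/2
        return n * (n - 1) // 2

    def h_sum(n):       # Explanation B: sum over justices by seniority
        return sum(i - 1 for i in range(1, n + 1))

    def h_table(n):     # Explanation C: table lookup
        return TABLE[n]

    assert all(h_formula(n) == h_sum(n) == h_table(n) for n in range(10))

Only A and B extend beyond the observed data: h_formula(10) gives 45 for a ten-justice court, while h_table(10) raises a KeyError.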

How successful are statistical language models?

Chomsky said words to the effect that statistical language models have had some limited success in some application areas. Let's look at computer systems that deal with language, and at the notion of "success" defined by "making accurate predictions about the world." First, the major application areas:

• Search engines: 100% of major players are trained and probabilistic. Their operation cannot be described by a simple function.

• Speech recognition: 100% of major systems are trained and probabilistic, mostly relying on probabilistic hidden Markov models.

• Machine translation: 100% of top competitors in competitions such as NIST use statistical methods. Some commercial systems use a hybrid of trained and rule-based approaches. Of the 4000 language pairs covered by machine translation systems, a statistical system is by far the best for every pair except Japanese-English, where the top statistical system is roughly equal to the top hybrid system.

• Question answering: this application is less well-developed, and many systems build heavily on the statistical and probabilistic approach used by search engines. The IBM Watson system that recently won on Jeopardy is thoroughly probabilistic and trained, while Boris Katz's START is a hybrid. All systems use at least some statistical techniques.

Now let's look at some components that are of interest only to the computational linguist, not to the end user:

• Word sense disambiguation: 100% of top competitors at the SemEval-2 competition used statistical techniques; most are probabilistic; some use a hybrid approach incorporating rules from sources such as Wordnet.

• Coreference resolution: The majority of current systems are statistical, although we should mention the system of Haghighi and Klein, which can be described as a hybrid system that is mostly rule-based rather than trained, and performs on par with top statistical systems.

• Part of speech tagging: Most current systems are statistical. The Brill tagger stands out as a successful hybrid system: it learns a set of deterministic rules from statistical data.

• Parsing: There are many parsing systems, using multiple approaches. Almost all of the most successful are statistical, and the majority are probabilistic (with a substantial minority of deterministic parsers).

Clearly, it is inaccurate to say that statistical models (and probabilistic models) have achieved limited success; rather they have achieved a dominant (although not exclusive) position.

Another measure of success is the degree to which an idea captures a community of researchers. As Steve Abney wrote in 1996, "In the space of the last ten years, statistical methods have gone from being virtually unknown in computational linguistics to being a fundamental given. ... anyone who cannot at least use the terminology persuasively risks being mistaken for kitchen help at the ACL [Association for Computational Linguistics] banquet."

Now of course, the majority doesn't rule -- just because everyone is jumping on some bandwagon, that doesn't make it right. But I made the switch: after about 14 years of trying to get language models to work using logical rules, I started to adopt probabilistic approaches (thanks to pioneers like Gene Charniak (and Judea Pearl for probability in general) and to my colleagues who were early adopters, like Dekai Wu). And I saw everyone around me making the same switch. (And I didn't see anyone going in the other direction.) We all saw the limitations of the old tools, and the benefits of the new.

And while it may seem crass and anti-intellectual to consider a financial measure of success, it is worth noting that the intellectual offspring of Shannon's theory create several trillion dollars of revenue each year, while the offspring of Chomsky's theories generate well under a billion.

This section has shown that one reason why the vast majority of researchers in computational linguistics use statistical models is an engineering reason: statistical models have state-of-the-art performance, and in most cases non-statistical models perform worse. For the remainder of this essay we will concentrate on scientific reasons: that probabilistic models better represent linguistic facts, and statistical techniques make it easier for us to make sense of those facts.

Is there anything like [the statistical model] notion of success in the history of science?

When Chomsky said "That's a notion of [scientific] success that's very novel. I don't know of anything like it in the history of science" he apparently meant that the notion of success of "accurately modeling the world" is novel, and that the only true measure of success in the history of science is "providing insight" — of answering why things are the way they are, not just describing how they are.

A dictionary definition of science is "the systematic study of the structure and behavior of the physical and natural world through observation and experiment," which stresses accurate modeling over insight, but it seems to me that both notions have always coexisted as part of doing science. To test that, I consulted the epitome of doing science, namely Science. I looked at the current issue and chose a title and abstract at random:

Chlorinated Indium Tin Oxide Electrodes with High Work Function for Organic Device Compatibility


In organic light-emitting diodes (OLEDs), a stack of multiple organic layers facilitates charge flow from the low work function [~4.7 electron volts (eV)] of the transparent electrode (tin-doped indium oxide, ITO) to the deep energy levels (~6 eV) of the active light-emitting organic materials. We demonstrate a chlorinated ITO transparent electrode with a work function of >6.1 eV that provides a direct match to the energy levels of the active light-emitting materials in state-of-the-art OLEDs. A highly simplified green OLED with a maximum external quantum efficiency (EQE) of 54% and power efficiency of 230 lumens per watt using outcoupling enhancement was demonstrated, as were EQE of 50% and power efficiency of 110 lumens per watt at 10,000 candelas per square meter.

It certainly seems that this article is much more focused on "accurately modeling the world" than on "providing insight." The paper does indeed fit in to a body of theories, but it is mostly reporting on specific experiments and the results obtained from them (e.g. efficiency of 54%).

I then looked at all the titles and abstracts from the current issue of Science:

• Comparative Functional Genomics of the Fission Yeasts
• Dimensionality Control of Electronic Phase Transitions in Nickel-Oxide Superlattices
• Competition of Superconducting Phenomena and Kondo Screening at the Nanoscale
• Chlorinated Indium Tin Oxide Electrodes with High Work Function for Organic Device Compatibility
• Probing Asthenospheric Density, Temperature, and Elastic Moduli Below the Western United States
• Impact of Polar Ozone Depletion on Subtropical Precipitation
• Fossil Evidence on Origin of the Mammalian Brain
• Industrial Melanism in British Peppered Moths Has a Singular and Recent Mutational Origin
• The Selaginella Genome Identifies Genetic Changes Associated with the Evolution of Vascular Plants
• Chromatin "Prepattern" and Histone Modifiers in a Fate Choice for Liver and Pancreas
• Spatial Coupling of mTOR and Autophagy Augments Secretory Phenotypes
• Diet Drives Convergence in Gut Microbiome Functions Across Mammalian Phylogeny and Within Humans
• The Toll-Like Receptor 2 Pathway Establishes Colonization by a Commensal of the Human Microbiota
• A Packing Mechanism for Nucleosome Organization Reconstituted Across a Eukaryotic Genome
• Structures of the Bacterial Ribosome in Classical and Hybrid States of tRNA Binding

and did the same for the current issue of Cell:

• Mapping the NPHP-JBTS-MKS Protein Network Reveals Ciliopathy Disease Genes and Pathways
• Double-Strand Break Repair-Independent Role for BRCA2 in Blocking Stalled Replication Fork Degradation by MRE11
• Establishment and Maintenance of Alternative Chromatin States at a Multicopy Gene Locus
• An Epigenetic Signature for Monoallelic Olfactory Receptor Expression
• Distinct p53 Transcriptional Programs Dictate Acute DNA-Damage Responses and Tumor Suppression
• An ADIOL-ERβ-CtBP Transrepression Pathway Negatively Regulates Microglia-Mediated Inflammation
• A Hormone-Dependent Module Regulating Energy Balance
• Class IIa Histone Deacetylases Are Hormone-Activated Regulators of FOXO and Mammalian Glucose Homeostasis


and for the 2010 Nobel Prizes in science:

• Physics: for groundbreaking experiments regarding the two-dimensional material graphene
• Chemistry: for palladium-catalyzed cross couplings in organic synthesis
• Physiology or Medicine: for the development of in vitro fertilization

My conclusion is that 100% of these articles and awards are more about "accurately modeling the world" than they are about "providing insight," although they all have some theoretical insight component as well. I recognize that judging one way or the other is a difficult, ill-defined task, and that you shouldn't accept my judgements, because I have an inherent bias. (I was considering running an experiment on Mechanical Turk to get an unbiased answer, but those familiar with Mechanical Turk told me these questions are probably too hard. So you the reader can do your own experiment and see if you agree.)

What doesn't Chomsky like about statistical models?

I said that statistical models are sometimes confused with probabilistic models; let's first consider the extent to which Chomsky's objections are actually about probabilistic models. In 1969 he famously wrote:

But it must be recognized that the notion of "probability of a sentence" is an entirely useless one, under any known interpretation of this term.

His main argument being that, under any interpretation known to him, the probability of a novel sentence must be zero, and since novel sentences are in fact generated all the time, there is a contradiction. The resolution of this contradiction is of course that it is not necessary to assign a probability of zero to a novel sentence; in fact, with current probabilistic models it is well-known how to assign a non-zero probability to novel occurrences, so this criticism is invalid, but was very influential for decades. Previously, in Syntactic Structures (1957) Chomsky wrote:

I think we are forced to conclude that ... probabilistic models give no particular insight into some of the basic problems of syntactic structure.

In the footnote to this conclusion he considers the possibility of a useful probabilistic/statistical model, saying "I would certainly not care to argue that ... is unthinkable, but I know of no suggestion to this effect that does not have obvious flaws." The main "obvious flaw" is this: Consider:

1. I never, ever, ever, ever, ... fiddle around in any way with electrical equipment.
2. She never, ever, ever, ever, ... fiddles around in any way with electrical equipment.
3. * I never, ever, ever, ever, ... fiddles around in any way with electrical equipment.
4. * She never, ever, ever, ever, ... fiddle around in any way with electrical equipment.

No matter how many repetitions of "ever" you insert, sentences 1 and 2 are grammatical and 3 and 4 are ungrammatical. A probabilistic Markov-chain model with n states can never make the necessary distinction (between 1 or 2 versus 3 or 4) when there are more than n copies of "ever." Therefore, a probabilistic Markov-chain model cannot handle all of English.

This criticism is correct, but it is a criticism of Markov-chain models—it has nothing to do with probabilistic models (or trained models) at all. Moreover, since 1957 we have seen many types of probabilistic language models beyond the Markov-chain word models. Examples 1-4 above can in fact be distinguished with a finite-state model that is not a chain, but other examples require more sophisticated models. The best studied is probabilistic context-free grammar (PCFG), which operates over trees, categories of words, and individual lexical items, and has none of the restrictions of finite-state models. We find that PCFGs are state-of-the-art for parsing performance and are easier to learn from data than categorical context-free grammars. Other types of probabilistic models cover semantic and discourse structures. Every probabilistic model is a superset of a deterministic model (because the deterministic model could be seen as a probabilistic model where the probabilities are restricted to be 0 or 1), so any valid criticism of probabilistic models would have to be because they are too expressive, not because they are not expressive enough.
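As a hedged illustration of what a PCFG looks like in practice, here is a toy grammar of my own (the rules and probabilities are invented) written with the NLTK library, whose Viterbi parser returns the most probable tree:

    # Sketch: a toy probabilistic context-free grammar (PCFG) in NLTK.
    # The grammar and its probabilities are invented for illustration;
    # real PCFGs are learned from treebank data.
    import nltk

    grammar = nltk.PCFG.fromstring("""
        S   -> NP VP     [1.0]
        NP  -> Det N     [0.6]
        NP  -> N         [0.4]
        VP  -> V NP      [0.5]
        VP  -> V         [0.5]
        Det -> 'the'     [1.0]
        N   -> 'dog'     [0.5]
        N   -> 'ideas'   [0.5]
        V   -> 'chased'  [0.5]
        V   -> 'sleep'   [0.5]
    """)

    parser = nltk.ViterbiParser(grammar)   # finds the most probable parse
    for tree in parser.parse("the dog chased ideas".split()):
        print(tree, tree.prob())           # the tree and its probability

Each production carries a probability, so the grammar assigns a graded score to every parse rather than a bare grammatical/ungrammatical verdict.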

In Syntactic Structures, Chomsky introduces a now-famous example that is another criticism of finite-state probabilistic models:

Neither (a) 'colorless green ideas sleep furiously' nor (b) 'furiously sleep ideas green colorless', nor any of their parts, has ever occurred in the past linguistic experience of an English speaker. But (a) is grammatical, while (b) is not.

Chomsky appears to be correct that neither sentence appeared in the published literature before 1955. I'm not sure what he meant by "any of their parts," but certainly every two-word part had occurred, for example:

"It is neutral green, colorless green, like the glaucous water lying in a cellar." The Paris weremember, Elisabeth Finley Thomas (1942)."To specify those green ideas is hardly necessary, but you may observe Mr. [D. H.] Lawrence inthe role of the satiated aesthete." The New Republic: Volume 29 p. 184, William White (1922)."Ideas sleep in books." Current Opinion: Volume 52, (1912).

But regardless of what is meant by "part," a statistically-trained finite-state model can in fact distinguish between these two sentences. Pereira (2001) showed that such a model, augmented with word categories and trained by expectation maximization on newspaper text, computes that (a) is 200,000 times more probable than (b). To prove that this was not the result of Chomsky's sentence itself sneaking into newspaper text, I repeated the experiment, using a much cruder model with Laplacian smoothing and no categories, trained over the Google Book corpus from 1800 to 1954, and found that (a) is about 10,000 times more probable. If we had a probabilistic model over trees as well as word sequences, we could perhaps do an even better job of computing degree of grammaticality.
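For a feel of what such an experiment involves, here is a greatly reduced sketch (a toy corpus stands in for the Google Book corpus; add-one smoothing as in the cruder run described above):

    # Sketch: a Laplace-smoothed bigram model scoring the two sentences.
    # The toy corpus is a stand-in for the Google Book corpus (1800-1954).
    from collections import Counter
    from math import log

    corpus = ("colorless green . green ideas . ideas sleep . "
              "sleep furiously . the green ideas sleep .").split()
    vocab_size = len(set(corpus))
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def logprob(sentence):
        """Sum of log P(w2|w1) with add-one smoothing; never minus infinity."""
        ws = sentence.split()
        return sum(log((bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size))
                   for w1, w2 in zip(ws, ws[1:]))

    a = "colorless green ideas sleep furiously"
    b = "furiously sleep ideas green colorless"
    print(logprob(a) > logprob(b))   # True: (a)'s bigrams occur, (b)'s do not

Note that the smoothing also disposes of the "probability of a sentence is useless" objection above: both novel sentences receive small but nonzero probabilities, and they can still be compared.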

Furthermore, the statistical models are capable of delivering the judgment that both sentences are extremely improbable, when compared to, say, "Effective green products sell well." Chomsky's theory, being categorical, cannot make this distinction; all it can distinguish is grammatical/ungrammatical.

Another part of Chomsky's objection is "we cannot seriously propose that a child learns the values of 10^9 parameters in a childhood lasting only 10^8 seconds." (Note that modern models are much larger than the 10^9 parameters that were contemplated in the 1960s.) But of course nobody is proposing that these parameters are learned one-by-one; the right way to do learning is to set large swaths of near-zero parameters simultaneously with a smoothing or regularization procedure, and update the high-probability parameters continuously as observations come in. And no one is suggesting that Markov models by themselves are a serious model of human language performance. But I (and others) suggest that probabilistic, trained models are a better model of human language performance than are categorical, untrained models. And yes, it seems clear that an adult speaker of English does know billions of language facts (for example, that one says "big game" rather than "large game" when talking about an important football game). These facts must somehow be encoded in the brain.

It seems clear that probabilistic models are better for judging the likelihood of a sentence, or its degree of sensibility. But even if you are not interested in these factors and are only interested in the grammaticality of sentences, it still seems that probabilistic models do a better job at describing the linguistic facts. The mathematical theory of formal languages defines a language as a set of sentences. That is, every sentence is either grammatical or ungrammatical; there is no need for probability in this framework. But natural languages are not like that. A scientific theory of natural languages must account for the many phrases and sentences which leave a native speaker uncertain about their grammaticality (see Chris Manning's article and its discussion of the phrase "as least as"), and there are phrases which some speakers find perfectly grammatical, others perfectly ungrammatical, and still others will flip-flop from one occasion to the next. Finally, there are usages which are rare in a language, but cannot be dismissed if one is concerned with actual data. For example, the verb quake is listed as intransitive in dictionaries, meaning that (1) below is grammatical, and (2) is not, according to a categorical theory of grammar.

1. The earth quaked.
2. ? It quaked her bowels.

But (2) actually appears as a sentence of English. This poses a dilemma for the categorical theory. When (2) is observed we must either arbitrarily dismiss it as an error that is outside the bounds of our model (without any theoretical grounds for doing so), or we must change the theory to allow (2), which often results in the acceptance of a flood of sentences that we would prefer to remain ungrammatical. As Edward Sapir said in 1921, "All grammars leak." But in a probabilistic model there is no difficulty; we can say that quake has a high probability of being used intransitively, and a low probability of transitive use (and we can, if we care, further describe those uses through subcategorization).

Steve Abney points out that probabilistic models are better suited for modeling language change. He cites the example of a 15th century Englishman who goes to the pub every day and orders "Ale!" Under a categorical model, you could reasonably expect that one day he would be served eel, because the great vowel shift flipped a Boolean parameter in his mind a day before it flipped the parameter in the publican's. In a probabilistic framework, there will be multiple parameters, perhaps with continuous values, and it is easy to see how the shift can take place gradually over two centuries.

Thus it seems that grammaticality is not a categorical, deterministic judgment but rather an inherently probabilistic one. This becomes clear to anyone who spends time making observations of a corpus of actual sentences, but can remain unknown to those who think that the object of study is their own set of intuitions about grammaticality. Both observation and intuition have been used in the history of science, so neither is "novel," but it is observation, not intuition that is the dominant model for science.

Now let's consider what I think is Chomsky's main point of disagreement with statistical models: the tension between "accurate description" and "insight." This is an old distinction. Charles Darwin (biologist, 1809–1882) is best known for his insightful theories but he stressed the importance of accurate description, saying "False facts are highly injurious to the progress of science, for they often endure long; but false views, if supported by some evidence, do little harm, for every one takes a salutary pleasure in proving their falseness." More recently, Richard Feynman (physicist, 1918–1988) wrote "Physics can progress without the proofs, but we can't go on without the facts."

[Image: Butterflies]

On the other side, Ernest Rutherford (physicist, 1871–1937) disdained mere description, saying "All science is either physics or stamp collecting." Chomsky stands with him: "You can also collect butterflies and make many observations. If you like butterflies, that's fine; but such work must not be confounded with research, which is concerned to discover explanatory principles."

Acknowledging both sides is Robert Millikan (physicist, 1868–1953) who said in his Nobel acceptance speech "Science walks forward on two feet, namely theory and experiment ... Sometimes it is one foot that is put forward first, sometimes the other, but continuous progress is only made by the use of both."

[Image: The two cultures]


[Image: Leo Breiman]

After all those distinguished scientists have weighed in, I think the most relevant contribution to the current discussion is the 2001 paper by Leo Breiman (statistician, 1928–2005), Statistical Modeling: The Two Cultures. In this paper Breiman, alluding to C.P. Snow, describes two cultures:

First, the data modeling culture (to which, Breiman estimates, 98% of statisticians subscribe) holds that nature can be described as a black box that has a relatively simple underlying model which maps from input variables to output variables (with perhaps some random noise thrown in). It is the job of the statistician to wisely choose an underlying model that reflects the reality of nature, and then use statistical data to estimate the parameters of the model.

Second, the algorithmic modeling culture (subscribed to by 2% of statisticians and many researchers in biology, artificial intelligence, and other fields that deal with complex phenomena) holds that nature's black box cannot necessarily be described by a simple model. Complex algorithmic approaches (such as support vector machines or boosted decision trees or deep belief networks) are used to estimate the function that maps from input to output variables, but we have no expectation that the form of the function that emerges from this complex algorithm reflects the true underlying nature.

It seems that the algorithmic modeling culture is what Chomsky is objecting to most vigorously. It is not just that the models are statistical (or probabilistic), it is that they produce a form that, while accurately modeling reality, is not easily interpretable by humans, and makes no claim to correspond to the generative process used by nature. In other words, algorithmic modeling describes what does happen, but it doesn't answer the question of why.

Breiman's article explains his objections to the first culture, data modeling. Basically, the conclusions made by data modeling are about the model, not about nature. (Aside: I remember in 2000 hearing James Martin, the leader of the Viking missions to Mars, saying that his job as a spacecraft engineer was not to land on Mars, but to land on the model of Mars provided by the geologists.) The problem is, if the model does not emulate nature well, then the conclusions may be wrong. For example, linear regression is one of the most powerful tools in the statistician's toolbox. Therefore, many analyses start out with "Assume the data are generated by a linear model..." and lack sufficient analysis of what happens if the data are not in fact generated that way. In addition, for complex problems there are usually many alternative good models, each with very similar measures of goodness of fit. How is the data modeler to choose between them? Something has to give. Breiman is inviting us to give up on the idea that we can uniquely model the true underlying form of nature's function from inputs to outputs. Instead he asks us to be satisfied with a function that accounts for the observed data well, and generalizes to new, previously unseen data well, but may be expressed in a complex mathematical form that may bear no relation to the "true" function's form (if such a true function even exists). Chomsky takes the opposite approach: he prefers to keep a simple, elegant model, and give up on the idea that the model will represent the data well. Instead, he declares that what he calls performance data—what people actually do—is off limits to linguistics; what really matters is competence—what he imagines that they should do.

[Image: Bill O'Reilly]

In January of 2011, television personality Bill O'Reilly weighed in on more than one culture war with his statement "tide goes in, tide goes out. Never a miscommunication. You can't explain that," which he proposed as an argument for the existence of God. O'Reilly was ridiculed by his detractors for not knowing that tides can be readily explained by a system of partial differential equations describing the gravitational interaction of sun, earth, and moon (a fact that was first worked out by Laplace in 1776 and has been considerably refined since; when asked by Napoleon why the creator did not enter into his calculations, Laplace said "I had no need of that hypothesis."). (O'Reilly also seems not to know about Deimos and Phobos (two of my favorite moons in the entire solar system, along with Europa, Io, and Titan), nor that Mars and Venus orbit the sun, nor that the reason Venus has no moons is because it is so close to the sun that there is scant room for a stable lunar orbit.) But O'Reilly realizes that it doesn't matter what his detractors think of his astronomical ignorance, because his supporters think he has gotten exactly to the key issue: why? He doesn't care how the tides work, tell him why they work. Why is the moon at the right distance to provide a gentle tide, and exert a stabilizing effect on earth's axis of rotation, thus protecting life here? Why does gravity work the way it does? Why does anything at all exist rather than not exist? O'Reilly is correct that these questions can only be addressed by mythmaking, religion or philosophy, not by science.

[Image: Laplace]

Chomsky has a philosophy based on the idea that we should focus on the deep whys and that mere explanations of reality don't matter. In this, Chomsky is in complete agreement with O'Reilly. (I recognize that the previous sentence would have an extremely low probability in a probabilistic model trained on a newspaper or TV corpus.) Chomsky believes a theory of language should be simple and understandable, like a linear regression model where we know the underlying process is a straight line, and all we have to do is estimate the slope and intercept.

For example, consider the notion of a pro-drop language from Chomsky's Lectures on Government and Binding (1981). In English we say, for example, "I'm hungry," expressing the pronoun "I". But in Spanish, one expresses the same thought with "Tengo hambre" (literally "have hunger"), dropping the pronoun "Yo". Chomsky's theory is that there is a "pro-drop parameter" which is "true" in Spanish and "false" in English, and that once we discover the small set of parameters that describe all languages, and the values of those parameters for each language, we will have achieved true understanding.

[Image: Dana Carvey]

The problem is that reality is messier than this theory. Here are some dropped pronouns in English:

"Not gonna do it. Wouldn't be prudent." (Dana Carvey, impersonating GeorgeH. W. Bush)"Thinks he can outsmart us, does he?" (Evelyn Waugh, The Loved One)"Likes to fight, does he?" (S.M. Stirling, The Sunrise Lands)"Thinks he's all that." (Kate Brian, Lucky T)"Go for a walk?" (countless dog owners)"Gotcha!" "Found it!" "Looks good to me!" (common expressions)

Linguists can argue over the interpretation of these facts for hours on end, but the diversity of language seems to be much more complex than a single Boolean value for a pro-drop parameter. We shouldn't accept a theoretical framework that places a priority on making the model simple over making it accurately reflect reality.

From the beginning, Chomsky has focused on the generative side of language. From this side, it is reasonable to tell a non-probabilistic story: I know definitively the idea I want to express—I'm starting from a single semantic form—thus all I have to do is choose the words to say it; why can't that be a deterministic, categorical process? If Chomsky had focused on the other side, interpretation, as Claude Shannon did, he may have changed his tune. In interpretation (such as speech recognition) the listener receives a noisy, ambiguous signal and needs to decide which of many possible intended messages is most likely. Thus, it is obvious that this is inherently a probabilistic problem, as was recognized early on by all researchers in speech recognition, and by scientists in other fields that do interpretation: the astronomer Laplace said in 1819 "Probability theory is nothing more than common sense reduced to calculation," and the physicist James Maxwell said in 1850 "The true logic for this world is the calculus of Probabilities, which takes account of the magnitude of the probability which is, or ought to be, in a reasonable man's mind."

Finally, one more reason why Chomsky dislikes statistical models is that they tend to make linguistics an empirical science (a science about how people actually use language) rather than a mathematical science (an investigation of the mathematical properties of models of formal language). Chomsky prefers the latter, as evidenced by his statement in Aspects of the Theory of Syntax (1965):

Linguistic theory is mentalistic, since it is concerned with discovering a mental reality underlying actual behavior. Observed use of language ... may provide evidence ... but surely cannot constitute the subject-matter of linguistics, if this is to be a serious discipline.

I can't imagine Laplace saying that observations of the planets cannot constitute the subject-matter of orbital mechanics, or Maxwell saying that observations of electrical charge cannot constitute the subject-matter of electromagnetism. It is true that physics considers idealizations that are abstractions from the messy real world. For example, a class of mechanics problems ignores friction. But that doesn't mean that friction is not considered part of the subject-matter of physics.

[Image: Plato's cave]

So how could Chomsky say that observations of language cannot be the subject-matter of linguistics? It seems to come from his viewpoint as a Platonist and a Rationalist and perhaps a bit of a Mystic. As in Plato's allegory of the cave, Chomsky thinks we should focus on the ideal, abstract forms that underlie language, not on the superficial manifestations of language that happen to be perceivable in the real world. That is why he is not interested in language performance. But Chomsky, like Plato, has to answer where these ideal forms come from. Chomsky (1991) shows that he is happy with a Mystical answer, although he shifts vocabulary from "soul" to "biological endowment."

Plato's answer was that the knowledge is 'remembered' from an earlier existence. The answer calls for a mechanism: perhaps the immortal soul ... rephrasing Plato's answer in terms more congenial to us today, we will say that the basic properties of cognitive systems are innate to the mind, part of human biological endowment.

[Image: Lascaux Horse]

It was reasonable for Plato to think that the ideal of, say, a horse, was more important than any individual horse we can perceive in the world. In 400BC, species were thought to be eternal and unchanging. We now know that is not true; that the horses on another cave wall—in Lascaux—are now extinct, and that current horses continue to evolve slowly over time. Thus there is no such thing as a single ideal eternal "horse" form.

We also now know that language is like that as well: languages are complex, random, contingent biological processes that are subject to the whims of evolution and cultural change. What constitutes a language is not an eternal ideal form, represented by the settings of a small number of parameters, but rather is the contingent outcome of complex processes. Since they are contingent, it seems they can only be analyzed with probabilistic models. Since people have to continually understand the uncertain, ambiguous, noisy speech of others, it seems they must be using something like probabilistic reasoning. Chomsky for some reason wants to avoid this, and therefore he must declare the actual facts of language use out of bounds and declare that true linguistics only exists in the mathematical realm, where he can impose the formalism he wants. Then, to get language from this abstract, eternal, mathematical realm into the heads of people, he must fabricate a mystical facility that is exactly tuned to the eternal realm. This may be very interesting from a mathematical point of view, but it misses the point about what language is, and how it works.

Thanks


Thanks to Ann Farmer, Fernando Pereira, Dan Jurafsky, Hal Varian, and others for comments and suggestions on this essay.

Annotated Bibliography

1. Abney, Steve (1996) Statistical Methods and Linguistics, in Klavans and Resnik (eds.) The Balancing Act: Combining Symbolic and Statistical Approaches to Language, MIT Press.

An excellent overall introduction to the statistical approach to language processing, and covers some ground that is not addressed often, such as language change and individual differences.

2. Breiman, Leo (2001) Statistical Modeling: The Two Cultures, Statistical Science, Vol. 16, No. 3, pp. 199-231.

Breiman does a great job of describing the two approaches, explaining the benefits of his approach, and defending his points in the very interesting commentary with eminent statisticians: Cox, Efron, Hoadley, and Parzen.

3. Chomsky, Noam (1956) Three Models for the Description of Language, IRE Transactions on Information Theory (2), pp. 113-124.

Compares finite state, phrase structure, and transformational grammars. Introduces "colorless green ideas sleep furiously."

4. Chomsky, Noam (1957) Syntactic Structures, Mouton.

A book-length exposition of Chomsky's theory that was the leading exposition of linguistics for a decade. Claims that probabilistic models give no insight into syntax.

5. Chomsky, Noam (1969) Some Empirical Assumptions in Modern Philosophy of Language, in Philosophy, Science and Method: Essays in Honor of Ernest Nagel, St. Martin's Press.

Claims that the notion "probability of a sentence" is an entirely useless notion.

6. Chomsky, Noam (1981) Lectures on Government and Binding, de Gruyter.

A revision of Chomsky's theory; this version introduces Universal Grammar. We cite it for the coverage of parameters such as pro-drop.

7. Chomsky, Noam (1991) Linguistics and adjacent fields: a personal view, in Kasher (ed.), A Chomskyan Turn, Oxford.

I found the Plato quotes in this article, published by the Communist Party of Great Britain, and apparently transcribed by someone with no linguistics training whatsoever, but with a political agenda.

8. Gold, E. M. (1967) Language Identification in the Limit, Information and Control, Vol. 10, No. 5, pp. 447-474.

Gold proved a result in formal language theory that we can state (with some artistic license) as this: imagine a game between two players, guesser and chooser. Chooser says to guesser, "Here is an infinite number of languages. I'm going to choose one of them, and start reading sentences to you that come from that language. On your N-th birthday there will be a True-False quiz where I give you 100 sentences you haven't heard yet, and you have to say whether they come from the language or not." There are some limits on what the infinite set looks like and on how the chooser can pick sentences (he can be deliberately tricky, but he can't just repeat the same sentence over and over, for example). Gold's result is that if the infinite set of languages are all generated by context-free grammars then there is no strategy for guesser that guarantees she gets 100% correct every time, no matter what N you choose for the birthday. This result was taken by Chomsky and others to mean that it is impossible for children to learn human languages without having an innate "language organ." As Johnson (2004) and others show, this was an invalid conclusion; the task of getting 100% on the quiz (which Gold called language identification) really has nothing in common with the task of language acquisition performed by children, so Gold's Theorem has no relevance.

9. Horning, J. J. (1969) A Study of Grammatical Inference, Ph.D. thesis, Stanford University.

Where Gold found a negative result—that context-free languages were not identifiable from examples, Horning found a positive result—that probabilistic context-free languages are identifiable (to within an arbitrarily small level of error). Nobody doubts that humans have unique innate capabilities for understanding language (although it is unknown to what extent these capabilities are specific to language and to what extent they are general cognitive abilities related to sequencing and forming abstractions). But Horning proved in 1969 that Gold cannot be used as a convincing argument for an innate language organ that specifies all of language except for the setting of a few parameters.

10. Johnson, Kent (2004) Gold's Theorem and cognitive science, Philosophy of Science, Vol. 71, pp. 571-592.

The best article I've seen on what Gold's Theorem actually says and what has been claimed about it (correctly and incorrectly). Concludes that Gold has something to say about formal languages, but nothing about child language acquisition.

11. Lappin, Shalom and Shieber, Stuart M. (2007) Machine learning theory and practice as a source of insight into universal grammar, Journal of Linguistics, Vol. 43, No. 2, pp. 393-427.

An excellent article discussing the poverty of the stimulus, the fact that all models have bias, the difference between supervised and unsupervised learning, and modern (PAC or VC) learning theory. It provides alternatives to the model of Universal Grammar consisting of a fixed set of binary parameters.

12. Manning, Christopher (2002) Probabilistic Syntax, in Bod, Hay, and Jannedy (eds.), Probabilistic Linguistics, MIT Press.

A compelling introduction to probabilistic syntax, and how it is a better model for linguistic facts than categorical syntax. Covers "the joys and perils of corpus linguistics."

13. Norvig, Peter (2007) How to Write a Spelling Corrector, unpublished web page.

Shows working code to implement a probabilistic, statistical spelling correction algorithm.

14. Norvig, Peter (2009) Natural Language Corpus Data, in Segaran and Hammerbacher (eds.), Beautiful Data, O'Reilly.

Expands on the essay above; shows how to implement three tasks: text segmentation, cryptographic decoding, and spelling correction (in a slightly more complete form than the previous essay).

15. Pereira, Fernando (2002) Formal grammar and information theory: together again?, in Nevin and Johnson (eds.), The Legacy of Zellig Harris, Benjamins.

When I set out to write the page you are reading now, I was concentrating on the events that took place in Cambridge, Mass., 4800 km from home. After doing some research I was surprised to learn that the authors of two of the three best articles on this subject sit within a total of 10 meters from my desk: Fernando Pereira and Chris Manning. (The third, Steve Abney, sits 3700 km away.) But perhaps I shouldn't have been surprised. I remember giving a talk at ACL on the corpus-based language models used at Google, and having Fernando, then a professor at U. Penn., comment "I feel like I'm a particle physicist and you've got the only super-collider." A few years later he moved to Google. Fernando is also famous for his quote "The older I get, the further down the Chomsky Hierarchy I go." His article here covers some of the same ground as mine, but he goes farther in explaining the range of probabilistic models available and how they are useful.

16. Plato (c. 380BC) The Republic.

Cited here for the allegory of the cave.

17. Shannon, C.E. (1948) A Mathematical Theory of Communication, The Bell System Technical Journal, Vol. 27, pp. 379-423.

An enormously influential article that started the field of information theory and introduced the term "bit" and the noisy channel model, demonstrated successive n-gram approximations of English, described Markov models of language, defined entropy with respect to these models, and enabled the growth of the telecommunications industry.

Peter Norvig
