What's Happened Since the First SIGDAT Meeting?

Kenneth Ward Church

AT&T Labs-Research

[email protected]

The First SIGDAT Meeting

• WVLC-1 was held just before ACL-93

• Great turnout!

– More like a conference than a workshop

• We knew that corpora were “hot,”

– but didn't appreciate just how hot they would turn out to be.

Sister meetings have also done very well since 1993

• Information Retrieval – http://www.acm.org/sigir/

• Digital Libraries– http://fox.cs.vt.edu/DL99/

• Machine Learning– http://www.cs.cmu.edu/Web/Groups/NIPS

• Data-mining, Databases, Data Warehousing– http://www.acm.org/sigkdd/

– http://www.vldb.org/

Empiricism has a long history

• In the 1950’s, empiricism dominated a broad set of fields:

– from psychology (behaviorism)

– to electrical engineering (information theory).

• At the time, it was common practice in linguistics to classify words not only on the basis of their meanings

– but also on the basis of their co-occurrence with other words.

– “You shall know a word by the company it keeps” (Firth, 1957)

• Regrettably, interest in empiricism faded in the 1960’s:

– Chomsky's criticism of ngrams in Syntactic Structures (1957) and

– Minsky and Papert's criticism of neural nets in Perceptrons (1969).
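
The idea of classifying words by their co-occurrence with other words can be sketched in a few lines. This is a minimal illustration only, not a method from the talk: the function name `cooccurrence_counts`, the window size, and the toy sentences are all assumptions made up for the example.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each unordered word pair appears within `window` tokens.

    Hypothetical helper for illustration; real corpus work would also need
    proper tokenization and some normalization of the raw counts.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # Pair the word with each neighbor up to `window` positions to its right.
            for neighbor in tokens[i + 1 : i + 1 + window]:
                counts[tuple(sorted((word, neighbor)))] += 1
    return counts


# Toy corpus invented for this sketch; not drawn from any corpus named in the talk.
toy_corpus = [
    "strong tea with milk",
    "strong coffee with milk",
    "powerful computer with memory",
]

counts = cooccurrence_counts(toy_corpus)
for pair, n in counts.most_common(3):
    print(pair, n)
```

Words that keep similar company (here "tea" and "coffee", which both appear near "strong" and "milk") end up with similar count profiles, which is the intuition behind the Firth quote.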

1990’s Revival

• Empiricism regained a dominant position:

– Ngrams and Hidden Markov Models (HMMs) became the method of choice in Speech.

– Neural Networks (Perceptrons + Hidden Layers) helped create Machine Learning.

• Empiricism → Rationalism → Empiricism

– Oscillates about once a career

• Mark Twain: Grandparents and Grandchildren have a natural alliance.

Why the Revival?

“It was a bad idea then, and it is still a bad idea now”

• More powerful computers??

• Availability of massive quantities of data!!

– Text is available like never before.

– Not long ago, the Brown Corpus was considered large.

– But now, text is available like never before!

• First came collection efforts (www.ldc.upenn.org),

• And now everyone has access to the Web!

• Experiments are routinely carried out on gigabytes of text.

• Some researchers are even working with terabytes.

Big Changes Since 1993

• The Web, stupid!

– Demos

– Data

• Research:

– Shared resources + evaluation

– Scale: How large is very large?

– Increased breadth: Geography, Topics

• Commercial: Wall Street & Main Street

The Web, Stupid!

• If you publish a paper about neat stuff, it is expected that you will post it on the web.

• I’ll mention just a few examples of neat stuff on the web.

– Demos

– Data

– Tools

Lots of Neat Demos on the Web

• Web Searching with Machine Translation

– www.altavista.com (uses Systran)

• Cross-Language Information Retrieval (CLIR):

– www.xrce.xerox.com

• Parallel Corpora: www-rali.iro.umontreal.ca

• Latent Semantic Indexing (LSI)

– superbook.bellcore.com/~remde/lsi

– lsa.colorado.edu

• Speech Synthesis: www.bell-labs.com/project/tts

• Dotplot: www.cs.unm.edu/~jon/dotplot

Lots of Neat Data on the Web

• Wordnet: www.cogsci.princeton.edu/~wn

• Linguistic Data Consortium (LDC):

– www.ldc.upenn.org

• SIGLEX: www.clres.com/siglex.html

• Discourse Resource Initiative (DRI)

– www.georgetown.edu/luperfoy/Discourse-Treebank/dri-home.html

• The Federalist Papers:

– www.mcs.net/~knautzr/fed

More Neat Data on the Web (in Lots of Languages)

• Chinese:

– rocling.iis.sinica.edu.tw

– www.sinica.edu.tw

• Japanese: cl.aist-nara.ac.jp/lab/resource/resource.html

– Electronic Dictionary Research (EDR): www.iijnet.or.jp/edr

– Advanced Telecommunications Research (ATR): www.atr.co.jp

– www.rdt.monash.edu.au/~jwb/japanese.html

• Korean: korterm.kaist.ac.kr

• European Language Resources Association (ELRA)

– www.icp.grenet.fr/ELRA

• Parallel Text (Resnik, ACL-99)

– Canadian Hansards: WWW.Parl.GC.CA

– Turkish: www.nlp.cs.bilkent.edu.tr

– Swedish: svenska.gu.se

Lots of Neat Tools on the Web

• Penntools (links to all over the world)

– www.cis.upenn.edu/~adwait/penntools.html

• Part of Speech Taggers (see above)

• Juman/Chasen

– pine.kuee.kyoto-u.ac.jp/nl-resource/juman.html

– cl.aist-nara.ac.jp/lab/nlt/chasen.html

• Suffix Arrays

– http://cm.bell-labs.com/cm/cs/who/doug/ssort.c



• Research:

– Shared resources + evaluation

– Scale: How large is very large?

– Increased breadth: Geography, Topics


Shared Resources + Evaluation

• Common tasks:

– TREC (trec.nist.gov), Tipster, MUC

• Common benchmark corpora: Brown, Penn Treebank, Wall Street Journal, Switchboard

• Shared lexical resources: Wordnet (www.cogsci.princeton.edu/~wn/)

• Common labeling conventions/standards in all areas of NLP from Speech to Discourse

• Evaluation, evaluation, evaluation

– Required to get a paper accepted anywhere.

In 1993, it wasn’t like this...

• Invited talks at ACL-93

– “Planning Multimodal Discourse”

– “Transfers of Meaning”

– “Quantificational Domains and Recursive Contexts”

• Less sharing of resources

• Evaluation not required

Empiricism vs. Rationalism

• Pluses: Clear measurable progress

– Speech Recognition

– Part of Speech Tagging

– Parsing

• Minuses: Herd mentality, incrementalism, mindless metrics, duplicated effort

– Recall: empiricism fell out of favor in 1960s when methodology became too burdensome.



• Research:

– Shared resources + evaluation

– Scale: How large is very large?

– Increased breadth: Geography, Topics

• Commercial: Wall Street & Main Street

Main Street: Big change since 1993

• Large corpora are now having an impact on ordinary users:

– Web search engines/portals

– Managing gigabytes, not just a popular book, but something that ordinary users are beginning to take for granted.

Huge Commercial Successes (Since 1993)

• Information Retrieval & Digital Libraries

– Web search engines/portals: highly successful on both Wall Street and Main Street

• Invited talks from Lycos (1997) & Infoseek (1998)

• Machine Translation & Speech

– Available wherever software is sold

– Can’t use a phone without talking to a computer



• Research:

– Shared resources + evaluation

– Scale: How large is very large?

– Increased breadth: Geography, Topics


How Large is Very Large?

Year    Source                  Size (words)
1788    Federalist Papers       1/5 million
1982    Brown Corpus            1 million
1987    Birmingham Corpus       20 million
1988-   Associated Press (AP)   50 million (per year)
1993    MUC, TREC, Tipster

Mirror, mirror on the wall

• Who is the largest of them all?

– The Web?

– Lexis-Nexis?

– West?

• We have had invited talks from all three

– Web: Lycos (1997) & Infoseek (1998)

– Lexis-Nexis (1993)

– West (1997)



• Research:

– Shared resources + evaluation

– Scale: How large is very large?

– Increased breadth: Geography, Topics


Internationalization

• SIGDAT-93: Nearly equal participation

– America: 4 papers

– Asia: 4 papers

– Europe: 3 papers

• Great growth in activity around the world, especially Asia

• SIGDAT has met in a dozen cities (50% in America)

– America: Columbus, Cambridge, Philadelphia, Providence, Montreal, College Park

– Asia: Kyoto, Beijing, Hong Kong

– Europe: Dublin, Copenhagen, Granada

Some Topics that are Behind the International Expansion

• Classic Issues

– Machine Translation (MT) / Tools

– Input Method Editor (IME): MS-IME98

– Morphology: Juman, Chasen

• New Issues

– Cross-language Information Retrieval (CLIR)

– Browsing the Internet: integrate IME + CLIR + MT

– Parallel and comparable corpora

– Terminology Extraction & Alignment

– Suffix Arrays



• Research:

– Shared resources + evaluation

– Scale: How large is very large?

– Increased breadth: Geography, Topics


Broader (and More Applied) View of Computational Linguistics

• Data-mining, Databases, Data Warehousing

• Digital Libraries

• Information Retrieval, Categorization, Extraction

• Lexicography

• Machine Learning

• Machine Translation

• Speech

• Text Analysis

Data-Mining Issues (How Large is Very Large?)

• Similar technology to corpus-based methods

• But much larger datasets

– Newswire (AP): 1 million words per week

– Telephone calls: 1-10 billion per month

– IP packets: expected to be even larger

• Tasks: Fraud, Marketing, Operations, Care

– Identify knobs that business partners can turn

• Increase demand (buy TV ads, reduce price)

• Increase supply (buy network capacity, enhance operations)

– Target opportunities for improvement (marketing prospects)

– Track market response in real time (supply/demand by knob)

Best of SIGDAT

• Best Invited Talk

• Work of Note

• Work of Note (in Related Fields)

Best Invited Talk at a SIGDAT Meeting

• Henry Kučera and Nelson Francis

– Third Workshop on Very Large Corpora (1995)

– Massachusetts Institute of Technology (MIT)

– Cambridge, MA, USA

• Described their work on the Brown Corpus

– At a time when empiricism was out of fashion

– especially at MIT

– Personal & Touching (received standing ovation)

Work of Note

• Statistical Machine Translation / Alignment

– Brown et al.

• Statistical Parsing (In 1993, poor use of lexical info)

– Jelinek, Magerman, Charniak, Collins

• Statistical PP Attachment

– Hindle and Rooth

• Word-sense Disambiguation

– Yarowsky

• Text-tiling (Discourse Parsing)

– Hearst

Work of Note (in Related Fields)

• Learning

– Classification and Regression Trees (CART)

– Ripper

• Web Tools

– Managing Gigabytes, Harvest, SGML, XML

• Representation

– Suffix Arrays

– Latent Semantic Indexing

Summary: Reaching a Wider Audience

• Commercial Successes

– Main Street & Wall Street

• Internationalization

– Goal: equal rep from America, Asia & Europe

• More topic areas

– Information Retrieval, Speech, Machine Translation, Machine Learning, Data-mining

Self-organizing vs. EDA

• Self-organizing: Learning, HMM

– Statistics do it all

• Manual

– Wilks’ Stone Soup: Statistics don’t do nothing

• Exploratory Data Analysis (EDA)

– Hybrid of above

Time for a little controversy: Two types of Empiricism

• New Linguistic Insights vs. Methodology

• Reviewers do what reviewers do

– Safe, conservative, seek precedents, case law

– Reviewers go easy on methodology papers

• Grim historical reminder:

– Recall: empiricism fell out of favor in 1960s when methodology became too burdensome.

• Shouldn’t let the methodology get in the way of what we are here to do.
