Creating a science base to support new directions in computer science

John Hopcroft
Cornell University
Ithaca, New York

CAS May 24, 2010


Time of change

The information age is a fundamental revolution that is changing all aspects of our lives.

Those individuals and nations who recognize this change and position themselves for the future will benefit enormously.

Drivers of change

Merging of computing and communications
Data available in digital form
Networked devices and sensors
Computers becoming ubiquitous


Internet search engines are changing

When was Einstein born?

Einstein was born at Ulm, in Wurttemberg, Germany, on March 14, 1879.

List of relevant web pages


Internet queries will be different

Which car should I buy?
What are the key papers in Theoretical Computer Science?
Construct an annotated bibliography on graph theory.
Where should I go to college?
How did the field of CS develop?

Which car should I buy?

Search engine response: Which criteria below are important to you?
Fuel economy
Crash safety
Reliability
Performance
Etc.

Make            Cost    Reliability  Fuel economy  Crash safety  Links to photos/articles
Toyota Prius    23,780  Excellent    44 mpg        Fair
Honda Accord    28,695  Better       26 mpg        Excellent
Toyota Camry    29,839  Average      24 mpg        Good
Lexus 350       38,615  Excellent    23 mpg        Good
Infiniti M35    47,650  Excellent    19 mpg        Good

2010 Toyota Camry - Auto Shows

Toyota sneaks the new Camry into the Detroit Auto Show.

Usually, redesigns and facelifts of cars as significant as the hot-selling Toyota Camry are accompanied by a commensurate amount of fanfare. So we were surprised when, right about the time that we were walking by the Toyota booth, a chirp of our Blackberries brought us the press release announcing that the facelifted 2010 Toyota Camry and Camry Hybrid mid-sized sedans were appearing at the 2009 NAIAS in Detroit.

We’d have hardly noticed if they hadn’t told us—the headlamps are slightly larger, the grilles on the gas and hybrid models go their own way, and taillamps become primarily LED. Wheels are also new, but overall, the resemblance to the Corolla is downright uncanny. Let’s hear it for brand consistency!

Four-cylinder Camrys get Toyota’s new 2.5-liter four-cylinder with a boost in horsepower to 169 for LE and XLE grades, 179 for the Camry SE, all of which are available with six-speed manual or automatic transmissions. Camry V-6 and Hybrid models are relatively unchanged under the skin.

Inside, changes are likewise minimal: the options list has been shaken up a bit, but the only visible change on any Camry model is the Hybrid’s new gauge cluster and softer seat fabrics. Pricing will be announced closer to the time it goes on sale this March.


Which are the key papers in Theoretical Computer Science?

Hartmanis and Stearns, “On the computational complexity of algorithms”
Blum, “A machine-independent theory of the complexity of recursive functions”
Cook, “The complexity of theorem proving procedures”
Karp, “Reducibility among combinatorial problems”
Garey and Johnson, “Computers and Intractability: A Guide to the Theory of NP-Completeness”
Yao, “Theory and Applications of Trapdoor Functions”
Shafi Goldwasser, Silvio Micali, Charles Rackoff, “The Knowledge Complexity of Interactive Proof Systems”
Sanjeev Arora, Carsten Lund, Rajeev Motwani, Madhu Sudan, and Mario Szegedy, “Proof Verification and the Hardness of Approximation Problems”

Temporal Cluster Histograms: NIPS Results

12: chip, circuit, analog, voltage, vlsi
11: kernel, margin, svm, vc, xi
10: bayesian, mixture, posterior, likelihood, em
9: spike, spikes, firing, neuron, neurons
8: neurons, neuron, synaptic, memory, firing
7: david, michael, john, richard, chair
6: policy, reinforcement, action, state, agent
5: visual, eye, cells, motion, orientation
4: units, node, training, nodes, tree
3: code, codes, decoding, message, hints
2: image, images, object, face, video
1: recurrent, hidden, training, units, error
0: speech, word, hmm, recognition, mlp

[Figure: NIPS k-means clusters (k=13). Histogram of number of papers (0 to 180) by year (1 to 14).]

Shaparenko, Caruana, Gehrke, and Joachims


Fed Ex package tracking


NEXRAD Radar

Binghamton, Base Reflectivity, 0.50 Degree Elevation, Range 124 NMI

[Figure: radar map near Ithaca, NY]

Collective Inference on Markov Models for Modeling Bird Migration

[Figure: axes labeled space and time]

Daniel Sheldon, M. A. Saleh Elmohamed, Dexter Kozen

Science base to support activities

Track flow of ideas in scientific literature
Track evolution of communities in social networks
Extract information from unstructured data sources

Tracking the flow of ideas in scientific literature

Yookyung Jo

[Figure: topic terms including page rank, web, link, graph, retrieval, query, search, text, chord, usage, index, probabilistic, file, retrieve, discourse, word, centering, anaphora]

Original papers

Original papers cleaned up

Referenced papers

Referenced papers cleaned up. Three distinct categories of papers

Topic evolution thread

• Seed topic:
  – 648: making C programs type-safe by separating pointer types by their usage to prevent memory errors

• 3 subthreads:
  – Type
  – Garbage collection
  – Pointer analysis

Yookyung Jo, 2010


Tracking communities in social networks

Liaoruo Wang


“Statistical Properties of Community Structure in Large Social and Information Networks”, Jure Leskovec; Kevin Lang; Anirban Dasgupta; Michael Mahoney

Studied over 70 large sparse real-world networks.

Best communities approximately size 100 to 150.


Our most striking finding is that in nearly every network dataset we examined, we observe tight but almost trivial communities at very small scales, and at larger size scales, the best possible communities gradually "blend in" with the rest of the network and thus become less "community-like."

[Figure: conductance plotted against size of community; the size axis is marked at 100]
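The slides do not define conductance. A common definition, assumed here, is the number of edges leaving a vertex set divided by the smaller of the total degree inside the set and the total degree outside it, so lower values indicate a more community-like set. A minimal Python sketch under that assumption (the function name and the toy graph are illustrative, not from the talk):

```python
def conductance(adj, community):
    """Conductance of `community` in an undirected graph.

    adj: dict mapping each vertex to the set of its neighbours.
    Assumed definition: cut(S) / min(vol(S), vol(V - S)),
    where vol is the sum of vertex degrees.
    """
    S = set(community)
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(adj[u]) for u in adj if u not in S)
    denom = min(vol_S, vol_rest)
    return cut / denom if denom else float("inf")


# Toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = {v: set() for v in range(6)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

triangle = conductance(adj, {0, 1, 2})   # one cut edge, volume 7 on each side
single = conductance(adj, {0})           # every edge of vertex 0 leaves the set
print(triangle, single)
```

In this toy graph the triangle scores far lower than a single vertex, which matches the intuition that a good community has few edges leaving it relative to its size.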


Giant component

Whisker: A component with v vertices connected by e edges, where e is small relative to v

Our view of a community

[Figure: Me, surrounded by TCS, Colleagues at Cornell, Classmates, Family and friends]

More connections outside than inside

Should we remove all whiskers and search for communities in core?

Core


Should we remove whiskers?

Does there exist a core in social networks? Yes

Experimentally it appears at p=1/n in G(n,p) model

Is the core unique? Yes and No

In G(n,p) model should we require that a whisker have only a finite number of edges connecting it to the core?

Laura Wang

Algorithms

How do you find the core?

Are there communities in the core of social networks?

How do we find whiskers?

Deciding whether a graph has a whisker is NP-complete

There exist graphs with whiskers for which neither the union of two whiskers nor the intersection of two whiskers is a whisker

Graph with no unique core

[Figure: example graphs with no unique core]

What is a community?

How do you find them?

Communities

Conductance

Can we compress graph?
Rosvall and Bergstrom, “An information-theoretic framework for resolving community structure in complex networks”

Hypothesis testing

Yookyung Jo

Description of graph with community structure

Specify which vertices are in which communities

Specify the number of edges between each pair of communities

Information necessary to specify graph given community structure

m = number of communities
n_i = number of vertices in ith community
l_ij = number of edges between ith and jth communities

$$H(Z) = \log\left[\prod_{i=1}^{m}\binom{n_i(n_i-1)/2}{l_{ii}}\;\prod_{i>j}\binom{n_i n_j}{l_{ij}}\right]$$
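As a rough illustration, the quantity H(Z) can be computed directly if it is read as the logarithm of the number of graphs consistent with the community sizes n_i and the edge counts l_ij: choose l_ii of the n_i(n_i-1)/2 possible edges inside each community, and l_ij of the n_i n_j possible edges between each pair. The sketch below assumes that reading and base-2 logarithms (the slide does not state the base); the function name is illustrative.

```python
from math import comb, log2


def graph_information(sizes, links):
    """H(Z): bits needed to pick the graph given the community structure.

    sizes[i]    = n_i, number of vertices in community i
    links[i][j] = l_ij, number of edges between communities i and j
                  (links[i][i] counts edges inside community i)
    Base-2 logarithms are an assumption.
    """
    m = len(sizes)
    bits = 0.0
    for i in range(m):
        # Edges inside community i: choose l_ii of n_i(n_i - 1)/2 pairs.
        bits += log2(comb(sizes[i] * (sizes[i] - 1) // 2, links[i][i]))
        for j in range(i):
            # Edges between communities i and j: choose l_ij of n_i * n_j pairs.
            bits += log2(comb(sizes[i] * sizes[j], links[i][j]))
    return bits


# Two communities of 4 vertices, each a complete graph, joined by one edge.
h = graph_information([4, 4], [[6, 1], [1, 6]])
print(h)
```

In this example the cliques contribute nothing, because there is only one way to place 6 edges among 6 pairs, so all of the information is in which of the 16 cross pairs carries the single connecting edge.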

Description of graph consists of description of community structure plus specification of graph given structure.

Specify community for each vertex and the number of edges between each pair of communities:

$$\underbrace{n\log m + \tfrac{1}{2}m(m+1)\log l}_{\text{community structure}} + H(Z)$$

where n is the number of vertices and l is the number of edges.

Can this method be used to specify more complex community structure where communities overlap?

Hypothesis testing

Null hypothesis: All edges generated with some probability p0

Hypothesis: Edges in communities generated with probability p1, other edges with probability p0.


Massively overlapping communities

Are there a small number of massively overlapping communities that share a common core?

Are there massively overlapping communities in which one can move from one community to a totally disjoint community?


Massively overlapping communities with a common core


Massively overlapping communities

Clustering social networks

Mishra, Schreiber, Stanton, and Tarjan

Each member of the community is connected to at least a beta fraction of the community
No member outside the community is connected to more than an alpha fraction of the community
Some connectivity constraint
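The two fraction conditions can be checked directly for a candidate vertex set. The sketch below is a minimal illustration, not the authors' algorithm: it only tests the alpha and beta conditions as worded on the slide, ignores the unspecified connectivity constraint, and uses a hypothetical function name and toy graph.

```python
def is_alpha_beta_community(adj, community, alpha, beta):
    """Check the two fraction conditions for a candidate community C.

    adj: dict mapping each vertex to the set of its neighbours.
    Every member must be adjacent to at least beta * |C| members, and
    every non-member to at most alpha * |C| members. The connectivity
    constraint mentioned on the slide is not checked here.
    """
    C = set(community)
    size = len(C)
    inside_ok = all(len(adj[v] & C) >= beta * size for v in C)
    outside_ok = all(len(adj[v] & C) <= alpha * size
                     for v in adj if v not in C)
    return inside_ok and outside_ok


# Toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = {v: set() for v in range(6)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

loose = is_alpha_beta_community(adj, {0, 1, 2}, alpha=0.4, beta=0.6)
strict = is_alpha_beta_community(adj, {0, 1, 2}, alpha=0.3, beta=0.6)
print(loose, strict)
```

Checking a given set is easy; the hard part raised on the next slide is finding such sets, since the number of candidate vertex subsets is exponential.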


In sparse graphs

How do you find alpha – beta communities?

What if each person in the community is connected to more members outside the community than inside?

Sucheta Soundarajan

Transmission paths for viruses, flow of ideas, or influence

Trivial model

Half of contacts

Two thirds of contacts

[Figure: time of first item versus time of second item]

Theory to support new directions

Large graphs
Spectral analysis
High dimensions and dimension reduction
Clustering
Collaborative filtering
Extracting signal from noise

Theory of Large Graphs

Large graphs with billions of vertices
Exact edges present not critical
Invariant to small changes in definition
Must be able to prove basic theorems

Erdős-Rényi

n vertices

each of $n^2$ potential edges is present with independent probability p

binomial degree distribution: $\binom{N}{n} p^n (1-p)^{N-n}$

[Figure: number of vertices versus vertex degree]

Generative models for graphs

Vertices and edges added at each unit of time

Rule to determine where to place edges

Uniform probability
Preferential attachment - gives rise to power law degree distributions
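The difference between the two placement rules is easy to see in simulation. The sketch below is one simple instance of such a generative model, with assumptions not stated on the slide: one vertex and one edge are added per time step, and the new edge attaches either to a uniformly chosen existing vertex or to one chosen with probability proportional to its degree. Names and parameters are illustrative.

```python
import random


def grow_graph(n, preferential, seed=0):
    """Grow a graph one vertex at a time; each new vertex adds one edge.

    preferential=False: attach to an existing vertex chosen uniformly.
    preferential=True:  attach with probability proportional to degree.
    Returns the list of vertex degrees.
    """
    rng = random.Random(seed)
    degree = [1, 1]          # start from the single edge 0-1
    endpoints = [0, 1]       # each vertex appears once per incident edge
    for new in range(2, n):
        if preferential:
            target = rng.choice(endpoints)   # degree-proportional choice
        else:
            target = rng.randrange(new)      # uniform choice
        degree.append(1)
        degree[target] += 1
        endpoints.extend((new, target))
    return degree


uniform = grow_graph(20000, preferential=False)
pref = grow_graph(20000, preferential=True)
print("max degree, uniform:     ", max(uniform))
print("max degree, preferential:", max(pref))
```

Sampling from the list of edge endpoints is a standard trick for degree-proportional choice: a vertex of degree k appears k times in the list. The preferential run should produce a few very high-degree vertices, the heavy tail characteristic of a power law.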

[Figure: number of vertices versus vertex degree]

Preferential attachment gives rise to the power law degree distribution common in many graphs

Protein interactions

2730 proteins in database

3602 interactions between proteins

SIZE OF COMPONENT       1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16    …  1000
NUMBER OF COMPONENTS   48  179   50   25   14    6    4    6    1    1    1    0    0    0    0    1          0

Science 1999 July 30; 285:751-753

Only 899 proteins in components. Where are the 1851 missing proteins?

Protein interactions

2730 proteins in database

3602 interactions between proteins

SIZE OF COMPONENT       1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16    …  1851
NUMBER OF COMPONENTS   48  179   50   25   14    6    4    6    1    1    1    0    0    0    0    1          1

Science 1999 July 30; 285:751-753

Giant Component

1. Create n isolated vertices
2. Add edges randomly one by one
3. Compute number of connected components
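The three steps above can be sketched with a union-find structure, which merges components as edges arrive. This is a minimal illustration rather than the code behind the slides: the function name, the seed, and the edge counts are assumptions, and repeated edges are allowed for simplicity.

```python
import random
from collections import Counter


def component_sizes(n, num_edges, seed=0):
    """Add num_edges random edges to n isolated vertices.

    Returns a Counter mapping component size -> number of components,
    tracked with a union-find (disjoint-set) structure.
    """
    rng = random.Random(seed)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for _ in range(num_edges):
        u, v = rng.sample(range(n), 2)      # two distinct endpoints
        parent[find(u)] = find(v)           # merge the two components

    roots = Counter(find(v) for v in range(n))
    return Counter(roots.values())


for num_edges in (0, 250, 500, 750):
    table = component_sizes(1000, num_edges)
    print(num_edges, "edges, largest component:", max(table))
```

With n = 1000, the largest component should stay small while the number of edges is well below n/2 and grow to a constant fraction of the vertices once it is well above n/2, which is the giant component transition.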

SIZE OF COMPONENT       1
NUMBER OF COMPONENTS 1000

SIZE OF COMPONENT       1    2
NUMBER OF COMPONENTS  998    1

SIZE OF COMPONENT       1    2    3    4    5    6    7    8    9   10   11
NUMBER OF COMPONENTS  548   89   28   14    9    5    3    1    1    1    1

SIZE OF COMPONENT       1    2    3    4    5    6    7    8    9   10   12   13   14   20   55  101
NUMBER OF COMPONENTS  367   70   24   12    9    3    2    2    2    2    1    2    2    1    1    1

SIZE OF COMPONENT       1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18  •••  514
NUMBER OF COMPONENTS  252   39   13    6    3    6    2    1    1    0    1    0    0    0    0    0    0    0    0    1

Science base

What do we mean by science base?

Example: High dimensions


High dimension is fundamentally different from 2 or 3 dimensional space

High dimensional data is inherently unstable

Given n random points in d dimensional space essentially all $n^2$ distances are equal.

$$|x-y|^2 = \sum_{i=1}^{d}(x_i-y_i)^2$$
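A quick experiment makes the claim concrete. The sketch below draws random points and reports the ratio of the largest to the smallest pairwise distance as the dimension grows. Drawing the points uniformly from the unit cube is an assumption (the slide only says "random points"), and the sample size and dimensions are arbitrary.

```python
import math
import random


def distance_spread(n, d, seed=0):
    """Ratio of largest to smallest pairwise distance among n random points.

    Points are drawn uniformly from the unit cube [0, 1]^d (an assumption).
    """
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(d)] for _ in range(n)]
    dists = [math.dist(pts[i], pts[j])
             for i in range(n) for j in range(i + 1, n)]
    return max(dists) / min(dists)


for d in (2, 10, 100, 1000):
    print(d, round(distance_spread(50, d), 2))
```

The squared distance is a sum of d independent terms, so its relative fluctuation shrinks roughly like one over the square root of d, and the ratio should approach 1 as d grows.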


High Dimensions

Intuition from two and three dimensions not valid for high dimension

Volume of cube is one in all dimensions

Volume of sphere goes to zero
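The shrinking volume can be checked numerically with the standard closed form for the volume of a radius-1 ball in d dimensions, pi^(d/2) / Gamma(d/2 + 1). The formula is not given on the slide, and the function name below is illustrative.

```python
import math


def unit_ball_volume(d):
    """Volume of the radius-1 ball in d dimensions: pi^(d/2) / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


for d in (2, 3, 5, 10, 20, 50):
    print(d, unit_ball_volume(d))
```

The volume rises for the first few dimensions and then falls toward zero, because the Gamma function in the denominator eventually outgrows the power of pi in the numerator, while the unit cube keeps volume one.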

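Distance from the center of the unit cube to a vertex:

2 Dimensions

[Figure: unit sphere of radius 1 and unit square]

$$\sqrt{\left(\tfrac{1}{2}\right)^2 + \left(\tfrac{1}{2}\right)^2} = \frac{1}{\sqrt{2}} \approx 0.707$$

4 Dimensions

$$\sqrt{\left(\tfrac{1}{2}\right)^2 + \left(\tfrac{1}{2}\right)^2 + \left(\tfrac{1}{2}\right)^2 + \left(\tfrac{1}{2}\right)^2} = 1$$

d Dimensions

$$\sqrt{d\left(\tfrac{1}{2}\right)^2} = \frac{\sqrt{d}}{2}$$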

Almost all area of the unit cube is outside the unit sphere


Gaussian distribution

Probability mass concentrated between dotted lines

Gaussian in high dimensions

[Figure: annulus labeled with width 3 and radius √d]

Two Gaussians

[Figure: two Gaussians, labeled 3 and √d]

Distance between two random points from same Gaussian

Points on thin annulus of radius $\sqrt{d}$

Approximate by sphere of radius $\sqrt{d}$

Average distance between two points is $\sqrt{2d}$

(Place one point at the North Pole and the other at random. Almost surely the second point is near the equator.)
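Both claims, that points sit at distance about √d from the center and about √(2d) from each other, can be checked by sampling. The sketch below assumes a spherical Gaussian with unit variance in every coordinate, centered at the origin; the dimension and seed are arbitrary.

```python
import math
import random


def gaussian_point(d, rng):
    """One sample from a unit-variance spherical Gaussian centred at the origin."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]


d = 10000
rng = random.Random(0)
x, y = gaussian_point(d, rng), gaussian_point(d, rng)
origin = [0.0] * d

norm_ratio = math.dist(x, origin) / math.sqrt(d)     # distance from centre / sqrt(d)
pair_ratio = math.dist(x, y) / math.sqrt(2 * d)      # pairwise distance / sqrt(2d)
print(norm_ratio, pair_ratio)
```

Both ratios should be close to 1. The pairwise distance follows from the two points being nearly orthogonal, so the squared distance is about d + d.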

[Figure: two points at distance √d from the center, at distance √(2d) from each other]

Expected distance between pts from two Gaussians separated by δ

$$\sqrt{\delta^2 + 2d}$$

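Can separate points from two Gaussians if

$$\sqrt{\delta^2 + 2d} > \sqrt{2d} + \gamma$$

$$\sqrt{2d}\left(1 + \frac{\delta^2}{2d}\right)^{1/2} > \sqrt{2d} + \gamma$$

$$\sqrt{2d}\left(1 + \frac{1}{2}\,\frac{\delta^2}{2d}\right) > \sqrt{2d} + \gamma$$

$$\frac{\delta^2}{2\sqrt{2d}} > \gamma$$

$$\delta > \sqrt{2\sqrt{2}\,\gamma}\;d^{1/4}$$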

Dimension reduction

Project points onto subspace containing centers of Gaussians

Reduce dimension from d to k, the number of Gaussians

Centers retain separation

Average distance between points reduced by $\sqrt{d/k}$

$$(x_1, x_2, \ldots, x_d) \rightarrow (x_1, x_2, \ldots, x_k, 0, \ldots, 0)$$

$$\sqrt{d}\,x_i \rightarrow \sqrt{k}\,x_i$$

Can separate Gaussians provided

$\sqrt{\delta^2 + 2k} - \sqrt{2k}$ > some constant involving k and γ independent of the dimension

Finding centers of Gaussians

[Figure: data matrix with rows a_1, a_2, ..., a_n; the points a_1, a_2, ..., a_n; and the first singular vector v_1]

First singular vector v1 minimizes the perpendicular distance to points

The first singular vector goes through center of Gaussian and minimizes distance to points

[Figure: Gaussian with the first singular vector through its center]

Best k-dimensional space for Gaussian is any space containing the line through the center of the Gaussian


Given k Gaussians, the top k singular vectors define a k dimensional space that contains the k lines through the centers of the k Gaussians and hence contains the centers of the Gaussians
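For a single Gaussian, the claim that the first singular vector points at the center can be checked with a few lines of power iteration. The sketch below is an illustration with made-up data, not the method used in the talk: it assumes a unit-variance Gaussian in 20 dimensions whose center lies at distance 10 along the first coordinate axis, and it uses plain power iteration on A^T A in place of a library SVD routine.

```python
import math
import random


def first_singular_vector(points, iterations=50):
    """Power iteration on A^T A, where the rows of A are the points.

    Returns a unit vector v maximising sum((a . v)^2), which is the same as
    minimising the summed squared perpendicular distance from the points
    to the line through the origin along v.
    """
    d = len(points[0])
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        # w = A^T (A v)
        proj = [sum(a_i * v_i for a_i, v_i in zip(a, v)) for a in points]
        w = [sum(p * a[j] for p, a in zip(proj, points)) for j in range(d)]
        norm = math.sqrt(sum(w_j * w_j for w_j in w))
        v = [w_j / norm for w_j in w]
    return v


# Hypothetical data: 500 samples around a centre at distance 10 from the origin.
rng = random.Random(0)
d = 20
center = [10.0] + [0.0] * (d - 1)
points = [[c + rng.gauss(0.0, 1.0) for c in center] for _ in range(500)]

v1 = first_singular_vector(points)
unit_center = [c / 10.0 for c in center]
alignment = abs(sum(a * b for a, b in zip(v1, unit_center)))
print(alignment)   # cosine of the angle between v1 and the centre direction
```

An alignment close to 1 means the best-fit line through the origin passes close to the center. With k Gaussians the same idea extends to the top k singular vectors, which would normally be obtained from a full SVD.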


We have just seen what a science base for high dimensional data might look like.

What other areas do we need to develop a science base for?

Ranking is important
  Restaurants, movies, books, web pages
  Multi billion dollar industry
Collaborative filtering
  When a customer buys a product what else is he likely to buy
Dimension reduction
Extracting information from large data sources
Social networks

Conclusions

We are in an exciting time of change.

Information technology is a big driver of that change.

The computer science theory needs to be developed to support this information age.