Making Sense at Scale with Algorithms, Machines & People


Making Sense at Scale with Algorithms, Machines & People

Michael Franklin, EECS, Computer Science

UC Berkeley

Emory University, December 7, 2011

Defining the Big Data Problem

Size + Complexity = Answers that don’t meet quality, time and cost requirements.

The State of the Art

(Diagram: Algorithms, Machines, and People shown as separate capabilities, with search and Watson/IBM as examples.)

Needed: A Holistic Approach

(Diagram: the same examples, search and Watson/IBM, shown spanning Algorithms, Machines, and People together.)

AMP Team

• 8 (primary) faculty at Berkeley: Databases, Machine Learning, Networking, Security, Systems, …
• 4 partner applications:
  – Participatory Sensing: Mobile Millennium (Alex Bayen, Civil Engineering)
  – Collective Discovery: Opinion Space (Ken Goldberg, IEOR)
  – Urban Planning and Simulation: UrbanSim (Paul Waddell, Environmental Design)
  – Cancer Genomics / Personalized Medicine (Taylor Sittler, UCSF)

Big Data Opportunity

• The Cancer Genome Atlas (TCGA)
  – 20 cancer types x 500 patients each x (1 tumor genome + 1 normal genome) = 5 petabytes
  – David Haussler (UCSC): datacenter online 12/11?
  – Intel to donate an AMP Lab cluster, put it next to TCGA

Slide from David Haussler, UCSC, “Cancer Genomics,” AMP retreat, 5/24/11

Berkeley Systems Lab Model

Industrial Collaboration: the “Two Feet In” Model

Berkeley Data Analytics System (BDAS)

(Stack diagram with user roles (Infrastructure Builder, Algorithms/Tools, Data Collector, Data Analyst) and components: Storage, Resource Management, Higher Query Languages / Processing Frameworks, Analytics Libraries, Data Integration, Crowd Interface, Data Source Selector, Result Control Center / Visualization, with cross-cutting Quality Control and Monitoring/Debugging.)

A Top-to-Bottom Rethinking of the Big Data Analytics Stack, integrating Algorithms, Machines, and People

Algorithms: More Data = Better Answers

Given an inferential goal and a fixed computational budget, provide a guarantee (supported by an algorithm and an analysis) that the quality of inference will increase monotonically as data accrue (without bound).

Error bars on every answer!

(Plot: estimate vs. # of data points, converging toward the true answer.)

Towards ML/Systems Co-Design

• Some ingredients of a system that can estimate and manage statistical risk (a minimal BLB sketch follows below):
  – Distributed bootstrap (bag of little bootstraps; BLB)
  – Subsampling (stratified)
  – Active sampling (cf. crowdsourcing)
  – Bias estimation (especially with crowd-sourced data)
  – Distributed optimization
  – Streaming versions of classical ML algorithms
  – Streaming distributed bootstrap

• These all must be scalable and robust
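The following is a minimal, single-machine sketch (ours, not AMP Lab code) of the bag of little bootstraps idea for the mean, using only numpy; the function and parameter names are invented, and a real system would distribute the per-subset loop across workers.

    import numpy as np

    def blb_mean_with_error_bar(data, n_subsets=20, subset_exp=0.6,
                                n_resamples=100, alpha=0.05, seed=0):
        # Bag of Little Bootstraps (BLB): each small subset of size
        # b = n**subset_exp is resampled with multinomial counts that sum
        # to n, simulating full-size bootstrap replicates cheaply; the
        # per-subset confidence-interval half-widths are then averaged.
        rng = np.random.default_rng(seed)
        data = np.asarray(data, dtype=float)
        n = len(data)
        b = int(n ** subset_exp)
        half_widths = []
        for _ in range(n_subsets):
            subset = rng.choice(data, size=b, replace=False)
            stats = []
            for _ in range(n_resamples):
                counts = rng.multinomial(n, np.full(b, 1.0 / b))
                stats.append(np.average(subset, weights=counts))
            lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
            half_widths.append((hi - lo) / 2.0)
        return data.mean(), float(np.mean(half_widths))

    # e.g. estimate, error_bar = blb_mean_with_error_bar(np.random.exponential(size=100_000))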

Machines Agenda

• New software stack to:
  • Effectively manage cluster resources
  • Effectively extract value out of big data
• Projects:
  • “Datacenter OS”: extend the Mesos distributed resource manager
  • Common runtime: structured, unstructured, streaming, sampling, …
  • New processing frameworks & storage systems, e.g., Spark, a parallel environment for iterative algorithms (see the sketch below)
  • Example: Quicksilver query processor, which lets users navigate the tradeoff space (quality, time, and cost) for complex queries
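To make the Spark bullet concrete, here is a tiny iterative job in the RDD style, written against today's Python API (PySpark) as an assumption; the original 2011 system exposed a Scala API, and the data, step size, and iteration count below are made-up toy values. The point is the cache() call: iterative algorithms reuse the same working set, which is the workload Spark targets.

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "iterative-demo")

    # Toy labeled points (features, label); cached so every iteration of the
    # loop re-reads the working set from memory rather than from disk.
    points = sc.parallelize(
        [(np.random.randn(3), float(np.random.choice([-1.0, 1.0]))) for _ in range(10000)]
    ).cache()

    w = np.zeros(3)
    for _ in range(10):  # each iteration is one distributed map + reduce
        gradient = points.map(
            lambda p: (1.0 / (1.0 + np.exp(-p[1] * w.dot(p[0]))) - 1.0) * p[1] * p[0]
        ).reduce(lambda a, b: a + b)
        w -= 0.1 * gradient

    print("fitted logistic-regression weights:", w)
    sc.stop()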

QuickSilver: Where do We Want to Go?

Today: “simple” queries on PBs of data take hours

Ideal: sub-second arbitrary queries on PBs of data

Goal: compute complex queries on PBs of data in < x seconds with < y% error
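A hypothetical sketch of that goal in miniature: an online-aggregation-style average that keeps sampling until a normal-approximation error bound meets a requested relative error. This is our illustration of the quality/time/cost knob, not the Quicksilver design; names and thresholds are invented.

    import math
    import random

    def approximate_average(sample_stream, target_rel_error=0.01, z=1.96, check_every=10000):
        # Consume samples until the +/- z*sqrt(var/n) bound on the running
        # mean is within target_rel_error of the estimate (the "< y% error"
        # knob), then stop early instead of scanning everything.
        n, total, total_sq = 0, 0.0, 0.0
        for x in sample_stream:
            n += 1
            total += x
            total_sq += x * x
            if n % check_every == 0:
                mean = total / n
                variance = max(total_sq / n - mean * mean, 0.0)
                half_width = z * math.sqrt(variance / n)
                if mean != 0.0 and half_width / abs(mean) <= target_rel_error:
                    return mean, half_width, n
        mean = total / max(n, 1)
        variance = max(total_sq / max(n, 1) - mean * mean, 0.0)
        return mean, z * math.sqrt(variance / max(n, 1)), n

    # e.g. approximate_average(random.gauss(100, 15) for _ in range(5_000_000))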

People

• Make people an integrated part of the system!
• Leverage human activity
• Leverage human intelligence (crowdsourcing)

Use the crowd to: find missing data, integrate data, make subjective comparisons, recognize patterns, solve problems.

Machines + Algorithms

(Diagram: people contribute data and activity, and exchange questions and answers with the machines + algorithms.)

Human-Tolerant Computing

People throughout the analytics lifecycle?

Challenges:
• Inconsistent answer quality
• Incentives
• Latency & variance
• Open vs. closed world
• Hybrid human/machine design

Approaches:
• Statistical methods for error and bias (a minimal vote-aggregation sketch follows below)
• Quality-conscious interface design
• Cost (time, quality)-based optimization
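As one illustration of the statistical-methods item (a sketch of ours, not an AMP Lab component), simple redundancy plus majority vote turns noisy worker answers into an answer with an agreement score that a cost-based optimizer could act on:

    from collections import Counter

    def majority_vote(assignments):
        # assignments: {task_id: [(worker_id, answer), ...]}
        # Returns {task_id: (winning_answer, agreement)}, where agreement is
        # the fraction of workers who gave the winning answer: a crude
        # confidence signal for deciding whether to buy more replicas.
        results = {}
        for task_id, answers in assignments.items():
            counts = Counter(answer for _, answer in answers)
            winner, votes = counts.most_common(1)[0]
            results[task_id] = (winner, votes / len(answers))
        return results

    # e.g. majority_vote({"q1": [("w1", "IBM"), ("w2", "IBM"), ("w3", "HP")]})
    # -> {"q1": ("IBM", 0.6666666666666666)}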

CROWDSOURCING EXAMPLES

Citizen Science: NASA “Clickworkers,” circa 2000

Citizen Journalism/Participatory Sensing


Expert Advice

Data collection

• Freebase

One View of Crowdsourcing

From Quinn & Bederson, “Human Computation: A Survey and Taxonomy of a Growing Field”, CHI 2011.

Industry View

Participatory Culture – Explicit


Participatory Culture – Implicit


John Murrell, GM SV, 9/17/09: “…every time we use a Google app or service, we are working on behalf of the search sovereign, creating more content for it to index and monetize or teaching it something potentially useful about our desires, intentions and behavior.”

Types of Tasks

Task granularity and examples:

• Complex Tasks: build a website; develop a software system; overthrow a government?
• Simple Projects: design a logo and visual identity; write a term paper
• Macro Tasks: write a restaurant review; test a new website feature; identify a galaxy
• Micro Tasks: label an image; verify an address; simple entity resolution

Inspired by the report “Paid Crowdsourcing,” Smartsheet.com, 9/15/2009

Amazon Mechanical Turk (AMT)

A Programmable Interface

• Amazon Mechanical Turk API
• Requestors place Human Intelligence Tasks (HITs) via the “createHit()” API
  • Parameters include: # of replicas, expiration, user interface, …
• Requestors approve jobs and payment: “getAssignments()”, “approveAssignments()”
• Workers (a.k.a. “turkers”) choose jobs, do them, get paid
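The slide names the original 2011 operations (“createHit()”, “getAssignments()”, “approveAssignments()”); for reference, a roughly equivalent flow with today's boto3 client looks like the sketch below. The region, reward, timing values, and question-XML file name are placeholders of ours.

    import boto3

    mturk = boto3.client("mturk", region_name="us-east-1")

    # Question form (ExternalQuestion or HTMLQuestion XML); placeholder file name.
    question_xml = open("label_image_question.xml").read()

    hit = mturk.create_hit(
        Title="Label an image",
        Description="Choose the best label for the picture shown.",
        Reward="0.01",                   # price per assignment, in USD
        MaxAssignments=3,                # number of replicas
        LifetimeInSeconds=24 * 3600,     # expiration
        AssignmentDurationInSeconds=300,
        Question=question_xml,
    )

    # Later: fetch submitted work and approve it so workers get paid.
    submitted = mturk.list_assignments_for_hit(
        HITId=hit["HIT"]["HITId"], AssignmentStatuses=["Submitted"]
    )
    for assignment in submitted["Assignments"]:
        mturk.approve_assignment(AssignmentId=assignment["AssignmentId"])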

Worker’s View

Requestor’s View

CrowdDB: A Radical New Idea?

“The hope is that, in not too many years, human brains and computing machines will be coupled together very tightly, and that the resulting partnership will think as no human brain has ever thought and process data in a way not approached by the information-handling machines we know today.”

(J. C. R. Licklider, “Man-Computer Symbiosis,” 1960)

Problem: DB-hard Queries

SELECT Market_Cap
FROM Companies
WHERE Company_Name = “IBM”

Number of Rows: 0. Problem: Entity Resolution

Company_Name | Address | Market Cap
Google | Googleplex, Mtn. View, CA | $170Bn
Intl. Business Machines | Armonk, NY | $203Bn
Microsoft | Redmond, WA | $206Bn
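To see why the query above returns zero rows, here is a tiny illustration of ours using the same table: exact equality cannot match “IBM” to “Intl. Business Machines”, and even a naive string-similarity fallback fails on an abbreviation, which is why CrowdDB hands such comparisons to people.

    import difflib

    companies = {
        "Google": 170e9,
        "Intl. Business Machines": 203e9,
        "Microsoft": 206e9,
    }

    def exact_lookup(name):
        # Mirrors the SQL equality predicate: no fuzziness at all.
        return companies.get(name)

    def similarity_lookup(name, cutoff=0.6):
        # Pure string similarity is a weak stand-in for entity resolution:
        # an abbreviation shares almost no characters with the stored
        # spelling, so this still returns nothing.
        match = difflib.get_close_matches(name, list(companies), n=1, cutoff=cutoff)
        return companies[match[0]] if match else None

    print(exact_lookup("IBM"))       # None -> "Number of Rows: 0"
    print(similarity_lookup("IBM"))  # still None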

DB-hard Queries

SELECT Market_Cap
FROM Companies
WHERE Company_Name = “Apple”

Number of Rows: 0. Problem: Closed World Assumption

Company_Name | Address | Market Cap
Google | Googleplex, Mtn. View, CA | $170Bn
Intl. Business Machines | Armonk, NY | $203Bn
Microsoft | Redmond, WA | $206Bn

DB-hard Queries

SELECT Top_1 Image
FROM Pictures
WHERE Topic = “Business Success”
ORDER BY Relevance

Number of Rows: 0. Problem: Subjective Comparison

CrowdDB

Use the crowd to answer DB-hard queries

Where to use the crowd:
• Find missing data
• Make subjective comparisons
• Recognize patterns

But not:
• Anything the computer already does well

M. Franklin et al. CrowdDB: Answering Queries with Crowdsourcing, SIGMOD 2011

CrowdSQL

DDL Extensions:

Crowdsourced columns:
CREATE TABLE company (
  name STRING PRIMARY KEY,
  hq_address CROWD STRING);

Crowdsourced tables:
CREATE CROWD TABLE department (
  university STRING,
  department STRING,
  phone_no STRING)
PRIMARY KEY (university, department);

DML Extensions (CROWDEQUAL and CROWDORDER operators, currently UDFs):

CrowdEqual:
SELECT * FROM companies WHERE Name ~= “Big Blue”

CrowdOrder:
SELECT p FROM picture
WHERE subject = "Golden Gate Bridge"
ORDER BY CROWDORDER(p, "Which pic shows better %subject");

User Interface Generation

• A clear UI is key to response time and answer quality.

• Can leverage the SQL schema to auto-generate the UI (e.g., Oracle Forms, etc.); a toy sketch follows below.

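A toy sketch (ours, not CrowdDB's generator) of what schema-driven UI generation means: known column values are displayed for context, and each crowdsourced column becomes one input field in the worker-facing form.

    def generate_crowd_form(table, known_values, crowd_columns):
        # known_values: columns already in the database, shown for context.
        # crowd_columns: CROWD columns the worker is asked to fill in.
        shown = "".join(f"<p>{col}: {val}</p>" for col, val in known_values.items())
        inputs = "".join(
            f'<label>{col}: <input name="{col}" type="text"/></label>'
            for col in crowd_columns
        )
        return (
            f'<form action="/submit/{table}" method="post">'
            f'{shown}{inputs}<button type="submit">Submit</button></form>'
        )

    # e.g. generate_crowd_form("company", {"name": "Big Blue"}, ["hq_address"])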

Subjective Comparisons

MTFunction
• Implements the CROWDEQUAL and CROWDORDER comparisons
• Takes some description and a type (equal, order) parameter
• Quality control again based on majority vote
• Ordering can be further optimized (e.g., three-way comparisons vs. two-way comparisons); a ranking sketch follows below
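A minimal sketch of ours showing how CROWDORDER-style pairwise judgments can be turned into a ranking: majority vote decides each pair, and items are ordered by pairwise wins. For n items a full two-way design needs n*(n-1)/2 distinct pairs, which is the overhead that batched three-way comparisons reduce.

    from collections import Counter, defaultdict

    def rank_from_pairwise_votes(votes):
        # votes: list of (winner, loser) pairs, one per worker judgment.
        pair_votes = defaultdict(Counter)
        for winner, loser in votes:
            pair_votes[frozenset((winner, loser))][winner] += 1

        wins = Counter()
        for pair, counts in pair_votes.items():
            for item in pair:
                wins.setdefault(item, 0)                 # ensure losers appear too
            wins[counts.most_common(1)[0][0]] += 1       # majority vote per pair

        # Order by number of pairwise wins (a simple Copeland-style score).
        return [item for item, _ in wins.most_common()]

    # e.g. rank_from_pairwise_votes(
    #     [("a", "b"), ("a", "b"), ("b", "a"), ("a", "c"), ("b", "c")])
    # -> ["a", "b", "c"]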

Does it Work?: Picture ordering

Query:
SELECT p FROM picture
WHERE subject = "Golden Gate Bridge"
ORDER BY CROWDORDER(p, "Which pic shows better %subject");

Data size: 30 subject areas, with 8 pictures each
Batching: 4 orderings per HIT
Replication: 3 assignments per HIT
Price: 1 cent per HIT

(Results figure comparing turker votes, turker ranking, and expert ranking.)

User Interface vs. Quality

To get information about professors and their departments, three generated interfaces were compared (Department first, Professor first, and a de-normalized probe); error rates of roughly 10%, 20%, and 80% were observed across the three designs.

Can we build a “Crowd Optimizer”?

SELECT *
FROM Restaurant
WHERE city = …

Price vs. Response Time

(Figure: percentage of HITs with at least one assignment completed vs. time in minutes (0–60), for prices of $0.01, $0.02, $0.03, and $0.04 per HIT; 5 assignments, 100 HITs.)

Turker Affinity and Errors

(Figures: turker affinity and errors plotted by turker rank.)

[Franklin, Kossmann, Kraska, Ramesh, Xin: CrowdDB: Answering Queries with Crowdsourcing. SIGMOD 2011]

Can we build a “Crowd Optimizer”?

SELECT *
FROM Restaurant
WHERE city = …

I would do work for this requester again.

This guy should be shunned.

I advise not clicking on his “information about restaurants” hits.

Hmm... I smell lab rat material.

be very wary of doing any work for this requester…

Processor Relations? (requester: Tim Klas Kraska)

HIT Group » I recently did 299 HITs for this requester. … Of the 299 HITs I completed, 11 of them were rejected without any reason being given. Prior to this I only had 14 rejections, a .2% rejection rate. I currently have 8522 submitted HITs, with a .3% rejection rate after the rejections from this requester (25 total rejections). I have attempted to contact the requester and will update if I receive a response. Until then be very wary of doing any work for this requester, as it appears that they are rejecting about 1 in every 27 HITs being submitted. (posted by …; fair: 2/5, fast: 4/5, pay: 2/5, comm: 0/5)


Open World = No Semantics?

SELECT * FROM Crowd_Sourced_Table

• What does the above query return?
• In the old world, it was just a table scan.
• In the crowdsourced world:
  • Which answers are “right”?
  • When to stop?
• Biostatistics to the rescue?

Open World = No Semantics?

SELECT * FROM Crowd_Sourced_Table

Species Acquisition Curve for Data
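One classical biostatistics tool that fits the "when to stop" question is species-richness estimation; the sketch below (ours) applies the Chao1 estimator to the multiset of answers received so far, extrapolating how many distinct answers exist from how many have been seen only once or twice. When the estimate stops climbing much past the number of distinct answers already observed, the open-world query is approximately complete.

    from collections import Counter

    def chao1_estimate(observed_answers):
        # freq counts how many times each distinct answer has been received.
        freq = Counter(observed_answers)
        s_obs = len(freq)                                  # distinct answers seen
        f1 = sum(1 for c in freq.values() if c == 1)       # seen exactly once
        f2 = sum(1 for c in freq.values() if c == 2)       # seen exactly twice
        if f2 == 0:
            return s_obs + f1 * (f1 - 1) / 2.0             # bias-corrected form
        return s_obs + (f1 * f1) / (2.0 * f2)

    # e.g. chao1_estimate(["thai", "thai", "sushi", "pizza", "pizza", "bbq"])
    # -> 4 observed + 2*2/(2*2) = 5.0 estimated distinct answers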

Why the Crowd is Different

• Classical approaches don’t quite work
• Incorrect answers
  – Chicago is not a state
  – How to spell Mississippi?
• Streakers vs. samplers
  – Individuals sample without replacement
  – and worker/task affinity
• List walking
  – e.g., Google “Ice Cream Flavors”
• The above can be detected and mitigated to some extent.

How Can You Trust the Crowd?

• General techniques:
  – Approval rate / demographic restrictions
  – Qualification test
  – Gold sets / honey pots (a scoring sketch follows below)
  – Redundancy
  – Verification / review
  – Justification / automatic verification
• Query-specific techniques
• Worker relationship management
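A small sketch of ours for the gold-set / honey-pot technique: seed known-answer tasks into the job, score each worker on them, and filter or re-test workers whose accuracy falls below a threshold.

    def score_workers_on_gold(responses, gold_answers):
        # responses: iterable of (worker_id, task_id, answer)
        # gold_answers: {task_id: correct_answer} for the seeded gold tasks
        attempted, correct = {}, {}
        for worker_id, task_id, answer in responses:
            if task_id in gold_answers:
                attempted[worker_id] = attempted.get(worker_id, 0) + 1
                if answer == gold_answers[task_id]:
                    correct[worker_id] = correct.get(worker_id, 0) + 1
        return {w: correct.get(w, 0) / n for w, n in attempted.items()}

    # e.g. score_workers_on_gold(
    #     [("w1", "g1", "CA"), ("w1", "g2", "NY"), ("w2", "g1", "TX")],
    #     {"g1": "CA", "g2": "NY"})
    # -> {"w1": 1.0, "w2": 0.0}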

Making Sense at Scale

• Data size is only part of the challenge
• Balance quality, cost, and time for a given problem
• To address this, we must holistically integrate Algorithms, Machines, and People

amplab.cs.berkeley.edu
