CS 294: Big Data System Research: Trends and Challenges

Fall 2015 (MW 9:30-11:00, 310 Soda Hall)
Ion Stoica and Ali Ghodsi
(http://www.cs.berkeley.edu/~istoica/classes/cs294/15/)


Big Data

First papers:
» 2003: The Google file system paper
» 2004: The MapReduce paper

Today every major systems & networking conference has Big Data sessions


Big Data Impact

Already helped create new businesses

Already helped disrupt existing businesses
» Retail
» Rental
» Taxi
» Home appliances
» …


Big Data Stack

Data Processing Layer

Resource Management Layer

Storage Layer


Hadoop Stack

Data Processing Layer: Hadoop MR, Hive, Pig, Impala, Storm, …

Resource Management Layer: Hadoop Yarn

Storage Layer: HDFS, S3, …


The Berkeley AMPLab

January 2011 – 2017
» 8 faculty
» > 40 students
» Team of 3 software engineers

Organized for collaboration

3-day retreats (twice a year)

Algorithms, Machines, People

AMPCamp3 (August, 2013): 220 campers (100+ companies)


The Berkeley AMPLab

Governmental and industrial funding:

Goal: Next generation of open source data analytics stack for industry & academia: Berkeley Data Analytics Stack (BDAS)


BDAS Stack

Data Processing Layer: Spark, Spark Streaming, Shark SQL, BlinkDB, GraphX, MLlib, MLBase

Resource Management Layer: Mesos

Storage Layer: HDFS, S3, …, Tachyon


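BDAS & Hadoop fitting together

BDAS
Data Processing Layer: Spark, Spark Streaming, Shark SQL, BlinkDB, GraphX, MLlib, MLBase
Resource Management Layer: Mesos
Storage Layer: HDFS, S3, …, Tachyon

Hadoop
Resource Management Layer: Hadoop Yarn
Storage Layer: HDFS, S3, …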


How do BDAS & Hadoop fit together?

Data Processing Layer: Spark, Spark Streaming, Shark SQL, GraphX, ML library, BlinkDB, MLbase; Hadoop MR, Hive, Pig, Impala, Storm

Resource Management Layer: Mesos, Hadoop Yarn

Storage Layer: HDFS, S3, …, Tachyon


This Class

Learn about state-of-the-art research in Big Data

Work on an exciting project

Hopefully start the next generation of impactful projects


Grading

Project: 60%

Class presentations: 40%
» Around 2 papers per student
» See Randy’s guidelines for leading discussion on papers
•  http://bnrg.eecs.berkeley.edu/~randy/Courses/CS294.F07/LeadingPapers.pdf


Administrative Information

Class website: http://www.cs.berkeley.edu/~istoica/classes/cs294/15/

Office Hours (Soda 465D):
» TBA

Create an (anonymized) blog account for paper reviews if you don’t have one yet (e.g., www.blogger.com)
» Send me an e-mail by Monday, August 31, with your blog URL
» Preferred e-mail for the class e-mail list


Papers

Is the problem real?

What is the solution’s main idea (nugget)?

Why is the solution different from previous work?
» Are system assumptions different?
» Is the workload different?
» Is the problem new?

Does the paper (or do you) identify any fundamental/hard trade-offs?


Papers (cont’d)

Do you think the work will be influential in 10 years?
» Why or why not?

Predicting the future is hard, but worth a try
» Look at past examples for inspiration


Streaming Over TCP

Countless papers:
» Why it cannot be done…
» New protocols to do it…

Today
» Virtually all streaming is over TCP
» Trend to stream over HTTP!


Why did it Succeed?


Multicast

Countless papers:
» Why the world will come to a standstill without multicast…
» New protocols to do it…

Today
» Multicast is used only in enterprise settings at best
» Overlay multicast is widely used in the Internet
•  CDN based, e.g., World Cup, March Madness, inaugurations, ...
•  P2P, mostly popular outside US (e.g., China)


Why Did it Fail?


Shared Memory

Countless papers:
» How shared memory simplifies programming parallel computers
» Many, many systems proposed and built

Today:
» Message passing (MPI) took over as the de facto standard for writing parallel applications


Why Did it Fail?


Network Computer

Big in the 90s
» Promoted by an alliance of Sun, Oracle, Acorn

Promise: many of the advantages of cloud computing
» Easy to manage
» Application sharing
» …

Failed miserably


Why Did it Fail?


Coming Back: ChromeOS

Will it succeed this time?


What are Hard/Fundamental Tradeoffs?

Brewer’s CAP conjecture: “Consistency, Availability, Partition-tolerance”, you can have only two in a distributed system

An in-order, reliable communication protocol cannot minimize overhead and latency simultaneously

Hard to simultaneously maximize evolvability and performance
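
The CAP trade-off can be illustrated with a toy model. The sketch below is not taken from the lecture: the names `Replica`, `ReplicatedRegister`, `demo`, and the "CP"/"AP" mode labels are hypothetical, and the "network" is just a boolean flag. It assumes two replicas of one integer register and shows that, once they are partitioned, a write must either be refused (giving up availability) or accepted on one side only (giving up consistency).

```python
class Replica:
    """One copy of a single integer register (hypothetical toy model)."""

    def __init__(self):
        self.value = 0


class ReplicatedRegister:
    """Two replicas; `mode` picks what to give up while they cannot talk."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False  # assumption: the network is a simple flag

    def write(self, replica, value):
        """Try to write via `replica`; return True if the write is accepted."""
        other = self.b if replica is self.a else self.a
        if not self.partitioned:
            # No partition: update both copies, so both properties hold.
            replica.value = value
            other.value = value
            return True
        if self.mode == "CP":
            # Keep the copies identical by refusing the write (unavailable).
            return False
        # AP: stay available, accept locally, and let the copies diverge.
        replica.value = value
        return True

    def consistent(self):
        """True when both replicas hold the same value."""
        return self.a.value == self.b.value


def demo(mode):
    """Write once normally, then once during a partition."""
    reg = ReplicatedRegister(mode)
    reg.write(reg.a, 1)
    reg.partitioned = True
    accepted = reg.write(reg.a, 2)
    return accepted, reg.consistent()
```

Calling `demo("CP")` and `demo("AP")` returns a pair (write accepted, replicas consistent). In this model at most one of the two is true during a partition, which is the choice the conjecture describes.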