Sector: An Open Source Cloud for Data Intensive Computing. Robert Grossman, University of Illinois at Chicago and Open Data Group. October 20, 2009

Sector - Presentation at Cloud Computing & Its Applications 2009


DESCRIPTION

This is a presentation about Sector that I gave at the Cloud Computing and Its Applications (CCA 09) Workshop that took place in Chicago on October 20, 2009. Sector is an open source cloud computing framework designed for data intensive computing.


Page 1: Sector - Presentation at Cloud Computing & Its Applications 2009

Sector: An Open Source Cloud for Data Intensive Computing

Robert Grossman

University of Illinois at Chicago

Open Data Group

October 20, 2009

Page 2: Sector - Presentation at Cloud Computing & Its Applications 2009

Part 1. Sector


http://sector.sourceforge.net

Page 3: Sector - Presentation at Cloud Computing & Its Applications 2009

Sector Overview

Sector is the fastest open source large data cloud
– As measured by MalStone & Terasort

Sector is easy to program
– Supports UDFs, MapReduce & Python over streams

Sector is secure
– A HIPAA compliant Sector cloud is being set up

Sector is reliable
– Sector v1.24 has a backup master node server

Page 4: Sector - Presentation at Cloud Computing & Its Applications 2009

About Sector

Yunhong Gu from the Laboratory for Advanced Computing at the University of Illinois at Chicago is the Lead Developer of Sector.

Sector is open source (BSD License) and available from sector.sourceforge.net

The current version is 1.24a


Page 5: Sector - Presentation at Cloud Computing & Its Applications 2009

Target Configurations

Sector is designed to run on racks of commodity computers

Typical rack configuration today (Oct, 2009)
– Rack of 32 quad-core 1U computers
– Each computer has 4 x 1TB disks
– Each computer has 1 Gbps connection to a top-of-rack switch

Sometimes these are called Raywulf clusters
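The per-rack totals implied by this configuration can be worked out directly. The following is a minimal sketch using only the figures on the slide; the function name `rack_totals` is made up for illustration, and the bandwidth figure is just the sum of the node links, not a measured throughput.

```python
# Figures taken from the slide: one rack as of October 2009.
NODES_PER_RACK = 32
CORES_PER_NODE = 4      # quad-core
DISKS_PER_NODE = 4
TB_PER_DISK = 1
GBPS_PER_NODE = 1       # link from each node to the top-of-rack switch


def rack_totals(racks: int = 1) -> dict:
    """Aggregate raw resources for a number of identical racks (hypothetical helper)."""
    nodes = racks * NODES_PER_RACK
    return {
        "nodes": nodes,
        "cores": nodes * CORES_PER_NODE,
        "raw_disk_tb": nodes * DISKS_PER_NODE * TB_PER_DISK,
        "edge_bandwidth_gbps": nodes * GBPS_PER_NODE,
    }


if __name__ == "__main__":
    for racks in (1, 2, 3, 4):
        print(racks, rack_totals(racks))
```

One rack therefore offers 128 cores and 128 TB of raw disk, which matches the node and core counts used in the Terasort comparison later in the deck.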


Page 6: Sector - Presentation at Cloud Computing & Its Applications 2009

Google’s Large Data Cloud

Google’s Stack

Applications

Compute Services: Google’s MapReduce

Data Services: Google’s BigTable

Storage Services: Google File System (GFS)

Page 7: Sector - Presentation at Cloud Computing & Its Applications 2009

Hadoop’s Large Data Cloud

Hadoop’s Stack

Applications

Compute Services: Hadoop’s MapReduce

Data Services

Storage Services: Hadoop Distributed File System (HDFS)

Page 8: Sector - Presentation at Cloud Computing & Its Applications 2009

Sector’s Large Data Cloud

Sector’s Stack

Applications

Compute Services: Sphere’s UDFs

Data Services

Storage Services: Sector’s Distributed File System (SDFS)

Routing & Transport Services: UDP-based Data Transport Protocol (UDT)

Page 9: Sector - Presentation at Cloud Computing & Its Applications 2009

Comparing Sector and Hadoop

| | Hadoop | Sector |
|---|---|---|
| Storage Cloud | Block-based file system | File-based |
| Programming Model | MapReduce | UDF & MapReduce |
| Protocol | TCP | UDP-based protocol (UDT) |
| Replication | At write time | Periodically |
| Security | Not yet | HIPAA capable |
| Language | Java | C++ |

Page 10: Sector - Presentation at Cloud Computing & Its Applications 2009

Terasort - Sector vs Hadoop Performance

| | 1 Rack | 2 Racks | 3 Racks | 4 Racks |
|---|---|---|---|---|
| Nodes | 32 | 64 | 96 | 128 |
| Cores | 128 | 256 | 384 | 512 |
| Hadoop | 85m 49s | 37m 0s | 25m 14s | 17m 45s |
| Sector | 28m 25s | 15m 20s | 10m 19s | 7m 56s |
| Speed up | 3.0 | 2.4 | 2.4 | 2.2 |

Sector/Sphere 1.24a, Hadoop 0.20.1 with no replication on Phase 2 of Open Cloud Testbed with co-located racks.
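The speed-up row can be reproduced from the wall-clock times in the Terasort results. This is a small sketch that parses the times as printed and divides Hadoop time by Sector time; the helper names `to_seconds` and `speedups` are illustrative, not part of Sector or Hadoop.

```python
import re

# Wall-clock times copied from the Terasort results (1, 2, 3 and 4 racks).
HADOOP = ["85m 49s", "37m 0s", "25m 14s", "17m 45s"]
SECTOR = ["28m 25s", "15m 20s", "10m 19s", "7m 56s"]


def to_seconds(text: str) -> int:
    """Convert a duration such as '85m 49s' to seconds."""
    match = re.fullmatch(r"(\d+)m\s*(\d+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


def speedups(slow, fast):
    """Ratio of slow time to fast time for each configuration, to one decimal."""
    return [round(to_seconds(s) / to_seconds(f), 1) for s, f in zip(slow, fast)]


if __name__ == "__main__":
    print(speedups(HADOOP, SECTOR))
```

The computed ratios agree with the values reported on the slide (3.0, 2.4, 2.4, 2.2).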

Page 11: Sector - Presentation at Cloud Computing & Its Applications 2009

MalStone (OCC-Developed Benchmark)

| | MalStone A | MalStone B |
|---|---|---|
| Hadoop | 455m 13s | 840m 50s |
| Hadoop streaming with Python | 87m 29s | 142m 32s |
| Sector/Sphere | 33m 40s | 43m 44s |
| Speed up (Sector v Hadoop) | 13.5x | 19.2x |

Sector/Sphere 1.20, Hadoop 0.18.3 with no replication on Phase 1 of Open Cloud Testbed in a single rack. Data consisted of 500 million 100-byte records per node on 20 nodes.
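For scale, the data set size and the reported speed-ups follow from the numbers above by simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes); the function names are illustrative only.

```python
# Data set description and timings taken from the MalStone slide.
NODES = 20
RECORDS_PER_NODE = 500_000_000
BYTES_PER_RECORD = 100

# (minutes, seconds) for plain Hadoop and for Sector/Sphere.
HADOOP = {"A": (455, 13), "B": (840, 50)}
SECTOR = {"A": (33, 40), "B": (43, 44)}


def seconds(duration):
    """Convert a (minutes, seconds) pair to seconds."""
    minutes, secs = duration
    return minutes * 60 + secs


def total_bytes() -> int:
    """Total size of the benchmark data across all nodes."""
    return NODES * RECORDS_PER_NODE * BYTES_PER_RECORD


def speedup(variant: str) -> float:
    """Hadoop time divided by Sector/Sphere time for MalStone 'A' or 'B'."""
    return seconds(HADOOP[variant]) / seconds(SECTOR[variant])


if __name__ == "__main__":
    print(f"total data: {total_bytes() / 1e12:.1f} TB")
    for variant in ("A", "B"):
        print(f"MalStone {variant}: {speedup(variant):.1f}x")
```

That is 10 billion records, or about 1 TB in total (50 GB per node).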

Page 12: Sector - Presentation at Cloud Computing & Its Applications 2009

How Do You Program A Data Center?


Page 13: Sector - Presentation at Cloud Computing & Its Applications 2009

Idea 1: Support UDFs Over a Data Center

Think of MapReduce as
– Map acting on (text) records
– With fixed Shuffle and Sort
– Followed by Reduce acting on (text) records

We generalize this framework as follows:
– Support a sequence of User Defined Functions (UDFs) acting on segments (=chunks) of files.
– MapReduce is one special case consisting of a user defined Map, a system-defined shuffle and sort, and a user defined Reduce.
– In both cases, the framework takes care of assigning nodes to process data, restarting failed processes, etc.
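The generalization described above can be illustrated with a toy, single-process model. This is only a sketch and not the Sphere API: `apply_udf`, `shuffle_and_sort`, `map_udf`, `reduce_udf` and `map_reduce` are hypothetical names, segments are plain Python lists, and nothing is distributed or fault tolerant.

```python
import zlib
from collections import defaultdict


def apply_udf(udf, segments):
    """Apply one user defined function to every segment independently."""
    return [udf(segment) for segment in segments]


def shuffle_and_sort(segments, n_buckets):
    """System-defined step: route (key, value) pairs to buckets by key, then sort."""
    buckets = [[] for _ in range(n_buckets)]
    for segment in segments:
        for key, value in segment:
            buckets[zlib.crc32(key.encode()) % n_buckets].append((key, value))
    return [sorted(bucket) for bucket in buckets]


def map_udf(segment):
    """User defined Map: emit (word, 1) for every word in a segment of text lines."""
    return [(word, 1) for line in segment for word in line.split()]


def reduce_udf(segment):
    """User defined Reduce: sum the values for each key in a bucket."""
    counts = defaultdict(int)
    for key, value in segment:
        counts[key] += value
    return sorted(counts.items())


def map_reduce(segments, n_buckets=2):
    """MapReduce as a special case: UDF, fixed shuffle and sort, UDF."""
    mapped = apply_udf(map_udf, segments)
    shuffled = shuffle_and_sort(mapped, n_buckets)
    return apply_udf(reduce_udf, shuffled)


if __name__ == "__main__":
    data = [["a b a"], ["b c"]]
    print(map_reduce(data))
    # A UDF that is not a Map or a Reduce fits the same mechanism.
    print(apply_udf(lambda seg: [line.upper() for line in seg], data))
```

The point of the sketch is that `apply_udf` is the only primitive the user really needs; the shuffle is a service the framework supplies between two UDF applications.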


Page 14: Sector - Presentation at Cloud Computing & Its Applications 2009

Applying UDF using Sector/Sphere

[Diagram: Application, Sphere Client, three Sphere Processing Engines (SPEs), input stream, output stream]

1. Split data

2. Locate & schedule Sphere Processing Engines (SPEs)

3. Collect results
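The three steps on this slide (split, schedule, collect) can be mimicked on one machine. In the sketch below a thread pool stands in for the SPEs, which is an assumption made purely for illustration; the names `split` and `run_udf` are hypothetical and the real Sphere client works across nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, segment_size):
    """Step 1: cut the input stream into fixed-size segments."""
    return [records[i:i + segment_size] for i in range(0, len(records), segment_size)]


def run_udf(udf, records, segment_size=3, n_engines=3):
    """Run a UDF over all segments and gather the output stream."""
    segments = split(records, segment_size)
    # Step 2: hand each segment to a worker (threads play the role of SPEs here).
    with ThreadPoolExecutor(max_workers=n_engines) as engines:
        results = list(engines.map(udf, segments))  # map() keeps input order
    # Step 3: collect the per-segment results into one output stream.
    return [item for segment in results for item in segment]


if __name__ == "__main__":
    squares = run_udf(lambda seg: [x * x for x in seg], list(range(10)))
    print(squares)
```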

Page 15: Sector - Presentation at Cloud Computing & Its Applications 2009

Sector Programming Model

A Sector dataset consists of one or more physical files

Sphere applies User Defined Functions over streams of data consisting of data segments

Data segments can be data records, collections of data records, or files

Examples of UDFs: Map function, Reduce function, Split function for CART, etc.

Outputs of UDFs can be returned to the originating node, written to the local node, or shuffled to another node.
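The three output options named above can be pictured as a routing decision applied to each node's UDF output. The sketch below is a hypothetical model, not Sphere code: node identifiers are integers, the mode names `"return"`, `"local"` and `"shuffle"` are invented labels, and the shuffle destination is chosen by hashing the key, which is one common choice rather than a documented Sector behavior.

```python
import zlib
from collections import defaultdict


def route_outputs(node_outputs, mode, n_nodes=None):
    """Route per-node UDF outputs, given as {node_id: [(key, value), ...]}."""
    if mode == "return":
        # Everything goes back to the node that started the job.
        merged = [kv for node in sorted(node_outputs) for kv in node_outputs[node]]
        return {"origin": merged}
    if mode == "local":
        # Each node keeps what it produced.
        return {node: list(out) for node, out in node_outputs.items()}
    if mode == "shuffle":
        if not n_nodes:
            raise ValueError("shuffle needs the number of destination nodes")
        destinations = defaultdict(list)
        for node in sorted(node_outputs):
            for key, value in node_outputs[node]:
                target = zlib.crc32(str(key).encode()) % n_nodes
                destinations[target].append((key, value))
        return dict(destinations)
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    outputs = {0: [("a", 1), ("b", 2)], 1: [("a", 3)]}
    for mode in ("return", "local"):
        print(mode, route_outputs(outputs, mode))
    print("shuffle", route_outputs(outputs, "shuffle", n_nodes=2))
```

With the shuffle option, all pairs that share a key end up on the same destination, which is what a later reduce-style UDF relies on.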


Page 16: Sector - Presentation at Cloud Computing & Its Applications 2009

How Do You Move Data in a Cloud & Between Clouds?

Option 1: Use TCP and close your eyes.

Option 2: ?????

Page 17: Sector - Presentation at Cloud Computing & Its Applications 2009

Idea 2: Sector is Built on Top of UDT


UDT is a specialized network transport protocol.

UDT can take advantage of wide area high performance 10 Gbps networks

Sector is a wide area distributed file system built over UDT.

Sector is layered over the native file system (vs being a block-based file system).

Page 18: Sector - Presentation at Cloud Computing & Its Applications 2009

UDT Has Been Downloaded 25,000+ Times

Sterling Commerce

Nifty TV

Globus

Movie2Me

Power Folder

http://udt.sourceforge.net

Page 19: Sector - Presentation at Cloud Computing & Its Applications 2009

Alternatives to TCP – Decreasing Increases AIMD Protocols

Increase of packet sending rate x: x ← x + α(x)

Decrease by factor β: x ← (1 − β) x

[Figure: α(x) plotted against x for AIMD (TCP NewReno), UDT, HighSpeed TCP, and Scalable TCP]
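The two update rules, x ← x + α(x) when there is no loss and x ← (1 − β) x on loss, can be tried out in a toy simulation. The sketch below is illustrative only: the loss model (a loss whenever the rate exceeds a fixed capacity) and the particular `decreasing_alpha` function are assumptions chosen to show an increase that shrinks as x grows. They are not UDT's actual control law.

```python
def aimd_step(x, loss, alpha, beta):
    """One control interval: multiplicative decrease on loss, else additive increase."""
    return (1.0 - beta) * x if loss else x + alpha(x)


def constant_alpha(x):
    """Classic AIMD: the increase does not depend on the current rate."""
    return 1.0


def decreasing_alpha(x):
    """Hypothetical decreasing increase: smaller steps at higher rates."""
    return 10.0 / (1.0 + x)


def simulate(alpha, beta, capacity, steps, x0=1.0):
    """Trace the sending rate under a toy loss model (loss when x > capacity)."""
    x = x0
    trace = []
    for _ in range(steps):
        loss = x > capacity
        x = aimd_step(x, loss, alpha, beta)
        trace.append(x)
    return trace


if __name__ == "__main__":
    for name, alpha in (("constant", constant_alpha), ("decreasing", decreasing_alpha)):
        trace = simulate(alpha, beta=0.5, capacity=100.0, steps=500)
        print(name, f"final rate {trace[-1]:.2f}, peak {max(trace):.2f}")
```

Swapping in a different `alpha` function is all it takes to compare increase rules under the same loss model.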

Page 20: Sector - Presentation at Cloud Computing & Its Applications 2009

UDT Makes Wide Area Clouds Possible

Using UDT, Sector can take advantage of wide area high performance networks (10+ Gbps)


10 Gbps per application
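To give a sense of what 10 Gbps per application means, the ideal time to move a data set at a given line rate is a one-line calculation. The sketch below assumes a 1 TB data set (an example size, not a figure from the slide), decimal units, and a fully used link with no protocol overhead, so real transfers would take longer.

```python
def transfer_seconds(n_bytes: int, gbps: float) -> float:
    """Ideal transfer time for n_bytes at a line rate given in gigabits per second."""
    return n_bytes * 8 / (gbps * 1e9)


if __name__ == "__main__":
    one_tb = 10**12  # assumed example size: 1 TB in decimal units
    for rate in (1, 10):
        minutes = transfer_seconds(one_tb, rate) / 60
        print(f"{rate} Gbps: {minutes:.1f} minutes")
```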

Page 21: Sector - Presentation at Cloud Computing & Its Applications 2009

What About Security?


Page 22: Sector - Presentation at Cloud Computing & Its Applications 2009

Idea 3: Add Security From the Start

Security server maintains information about users and slaves.

User access control: password and client IP address.

File level access control.

Messages are encrypted over SSL. Certificate is used for authentication.

Sector is HIPAA capable.

[Diagram: Security Server, Master, Client, and Slaves; connections labeled AAA, SSL, SSL, and data]
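The checks listed on this slide (password, client IP address, file level access) can be sketched as a small in-memory model. Everything here is hypothetical: the `SecurityServer` class and its methods are invented for illustration and are not Sector's security server, and the SSL transport and certificate handling are left out entirely.

```python
import hashlib
import hmac
import ipaddress
import os


class SecurityServer:
    """Toy model of user and file level access control (not Sector's implementation)."""

    def __init__(self):
        self._users = {}  # name -> (salt, password hash, allowed networks)
        self._acl = {}    # file path -> set of user names allowed to read it

    def add_user(self, name, password, allowed_cidrs):
        salt = os.urandom(16)
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
        networks = [ipaddress.ip_network(cidr) for cidr in allowed_cidrs]
        self._users[name] = (salt, digest, networks)

    def authenticate(self, name, password, client_ip):
        """User access control: both the password and the client IP must match."""
        record = self._users.get(name)
        if record is None:
            return False
        salt, digest, networks = record
        candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
        if not hmac.compare_digest(candidate, digest):
            return False
        address = ipaddress.ip_address(client_ip)
        return any(address in network for network in networks)

    def grant(self, path, name):
        self._acl.setdefault(path, set()).add(name)

    def can_read(self, name, path):
        """File level access control."""
        return name in self._acl.get(path, set())


if __name__ == "__main__":
    server = SecurityServer()
    server.add_user("alice", "s3cret", ["10.0.0.0/8"])
    server.grant("/data/records.dat", "alice")
    print(server.authenticate("alice", "s3cret", "10.1.2.3"))
    print(server.can_read("alice", "/data/records.dat"))
```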

Page 23: Sector - Presentation at Cloud Computing & Its Applications 2009

For More Information About Sector

Yunhong Gu and Robert L Grossman, Sector and Sphere: Towards Simplified Storage and Processing of Large Scale Distributed Data, Philosophical Transactions of the Royal Society A, Volume 367, Number 1897, pages 2429-2445, 2009

http://arxiv.org/abs/0809.1181

http://rsta.royalsocietypublishing.org/content/367/1897/2429

Page 24: Sector - Presentation at Cloud Computing & Its Applications 2009

For Related Information

Related information can be found at:
– blog.rgrossman.com
– www.rgrossman.com

Page 25: Sector - Presentation at Cloud Computing & Its Applications 2009

Sector Sponsors