
- Oracle · “The Oracle Grid Engine 6.2 software has dramatically lowered for us the cost of data intensive, Hadoop centered,


Scalable Enterprise Data Processing for the Cloud With Oracle Grid Engine

Daniel Templeton, Product Manager, Oracle Grid Engine

Tom White, Engineer, Cloudera


Oracle OpenWorld Latin America 2010

December 7–9, 2010


Oracle OpenWorld Beijing 2010

December 13–16, 2010


Oracle Products Available Online

Oracle Store

Buy Oracle license and support online today at

oracle.com/store

SHOP NOW


Program Agenda

• The new data landscape
• Data-oriented computing
• Data infrastructure
• Compute infrastructure
• Data-oriented computing revisited
• Additional resources


The Data Landscape

• Structured data
  – Relational data, XML, etc.
  – Well-defined data structure, e.g. schema, DTD
    • Facilitates automated analysis, e.g. SQL, XSLT
  – Managed life cycle
• Unstructured data
  – Everything that's not structured
  – No predictable or useful structure
    • Somewhat subjective
  – Analysis requires customization and manual intervention
  – No clear life cycle because no clear classification


Unstructured Data

• Documents, logs, records, dumps, etc.
  – Distributed across files across machines across the network
• Growing rapidly
  – 85% of enterprise data
    • Growing at 61.7% compounded annually
• Expensive to store it all
  – How to decide what to keep?
• Potentially massive source of business value
  – Business value locked behind lack of structure
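
To put the 61.7% compound annual growth figure in perspective, the short sketch below works out what it implies over time. Only the rate comes from the slide; the function names and the five-year horizon are illustrative choices. Under these assumptions the volume grows roughly elevenfold in five years and doubles in a bit under a year and a half.

```python
import math

def growth_factor(annual_rate, years):
    """Factor by which a quantity grows after compounding annually for `years`."""
    return (1.0 + annual_rate) ** years

def doubling_time_years(annual_rate):
    """Years needed for the quantity to double at the given compound rate."""
    return math.log(2.0) / math.log(1.0 + annual_rate)

RATE = 0.617  # growth rate quoted on the slide

# Illustrative horizon of five years (an assumption, not from the slide).
five_year_factor = growth_factor(RATE, 5)
doubling_time = doubling_time_years(RATE)

print(f"Growth over 5 years: x{five_year_factor:.1f}")
print(f"Doubling time: {doubling_time:.2f} years")
```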


The Data Landscape

• NYSE is generating 1TB per day

• Facebook is generating 20TB per day
  – Compressed!

• CERN is generating 40TB per day

[Figure: Worldwide Enterprise Disk Storage Consumption Model, Revenue by Segment, 2005–2014 ($B). Series: traditional replicated data, traditional structured data, traditional unstructured data.]

[Figure: Worldwide Enterprise Disk Storage Consumption Model, Capacity Shipment Share by Segment, 2005–2014 (%). Series: traditional replicated, traditional structured data, traditional unstructured.]


Data-Oriented Computing

• Compute is now cheap; moving data is still expensive
  – Big change from a decade ago
  – More CPU cores than can be used effectively
  – More data than can be processed
• Do the work close to the data
  – “What to run” → “What data to process”
• Data no longer assumed to float in a SAN
  – Data locality is a core concept
  – The network is the data


Structured Versus Unstructured

[Diagram, built up over four slides: Compute and Data; Compute and Database; Compute and Database Cluster; Compute and Local Disk.]


Data-Oriented Computing and the Cloud

• Public clouds rapidly becoming the dominant storage vehicle

• Large data analytics fits well with private or public clouds
  – Mind the transfer!
• Bandwidth and latency issues make hybrid cloud solutions unfavorable

[Figure: Worldwide Enterprise Disk Storage Consumption Model, Capacity Shipment Share by Segment, 2005–2014 (%). Series: traditional replicated, traditional structured data, traditional unstructured, content depots/public cloud.]


Typical Data-Oriented Computing Use Cases

• Large data files
  – Implicitly chunked across network
  – Process massively in parallel
• Fragmented data records
  – Process in place
  – Aggregation implicit in the computation
• Hacking by determined developers
  – Now called “Data Science”
• Streaming data
  – Dump into storage and proceed as above


Basic Data-Oriented Computing Building Blocks

• Data Infrastructure
  – Massively scalable
    • Also in terms of cost
  – Network-centric
  – Data locality
• Compute Infrastructure
  – Highly scalable management of compute resources
  – Support for multi-tenancy
    • Users & applications
  – Support for accounting and billing
    • Fundamental to cloud model


Oracle Coherence

• Highly scalable in-memory data grid
  – Aggregates total memory of nodes into a single cache
    • More nodes = more cache space
  – Coherency maintained through extremely optimized protocol
  – No single point of failure
• Object oriented
  – Every object lives on a particular node
  – Objects replicated for redundancy
• Can be backed by a traditional data store
  – Write ahead, write behind, etc.


Oracle Coherence Embedded Data Grid

[Diagram labels: Master Host; four Execution Hosts; Oracle Grid Engine; Oracle Coherence.]


Apache Hadoop HDFS

• Highly scalable on-disk data grid
  – Aggregates assigned disk space of nodes into a single pool
    • More nodes = more storage space
  – Data locations maintained by a master node
• File oriented
  – Every file is broken into data blocks
  – Every block lives on a particular node
  – Blocks replicated for redundancy
• Core component of Hadoop
  – Powerful marriage of compute with data
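
The file-oriented model above can be illustrated with a toy sketch. This is not the HDFS API: the function names, the tiny block size, the node names, and the round-robin placement are all assumptions made for illustration. It only shows the idea of splitting a file into blocks, placing each block on several nodes, and keeping the block-to-node map in one place, as a master node would.

```python
def split_into_blocks(data, block_size):
    """Break a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(num_blocks, nodes, replication):
    """Return a master-style map of block id -> nodes holding a replica.

    Placement here is simple round-robin (an illustrative assumption);
    each block lands on `replication` distinct nodes.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement

# Hypothetical 10-byte file, 4-byte blocks, four nodes, two replicas per block.
blocks = split_into_blocks(b"0123456789", block_size=4)
placement = place_blocks(len(blocks), ["node1", "node2", "node3", "node4"], replication=2)

for block_id, holders in placement.items():
    print(block_id, blocks[block_id], holders)
```

Because every block has more than one holder in the map, losing a single node still leaves a copy of each block elsewhere, which is the redundancy the slide refers to.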


Oracle Grid Engine: Business-driven Workload Management

• Powerful workload manager
  – Efficiently match workload to available resources
  – Schedule according to business policies
  – Aggregate users and uses onto a set of resource pools
  – Extreme scalability
  – Full accounting
• Flexible resource broker
  – Share resources among services according to SLOs
  – Lease additional capacity from the cloud on demand
  – Set idle/underutilized machines into reduced power mode


Award-winning Sun Grid Engine: Thousands of Successful Grids

Excellence in Cluster Technology


Redefining the Enterprise Data Center

• Tear down application resource silos
  – Resource sharing according to needs and policies
• Reduce the cost of data center ownership
  – More efficient use of resources
  – Idle or underused machines powered down until needed
• On-demand scale-out to cloud resources
  – Insulates applications from cloud service providers
  – Facilitates private cloud model
• Support for data-oriented compute models
  – Apache Hadoop
  – Oracle Coherence


Common Use Cases

Modeling/Processing

Streaming

Monte Carlo

Validation


Map/Reduce

• Defined in a paper from Google in 2004
  – Apache Hadoop is the best-known implementation
• Data processing in two steps
  – Map: process input data across network
  – Reduce: assemble intermediate results into final result
• Example: counting words in a book
  – Map: for each page, emit every word into a giant hash table
  – Reduce: merge all hash tables together and count the number of values for each key
• Massively parallel processing (embarrassingly parallel)
  – Inherently data aware
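
The book example above can be sketched in a few lines of plain Python. This is a single-process illustration of the two steps, not Hadoop code; the function names and the sample pages are hypothetical. Each page is mapped to its own hash table of word → emitted values, and the reduce step merges the tables and counts the values per key.

```python
from collections import defaultdict

def map_page(page):
    """Map: emit every word on one page into a hash table of word -> list of 1s."""
    table = defaultdict(list)
    for word in page.lower().split():
        table[word].append(1)
    return table

def reduce_tables(tables):
    """Reduce: merge all hash tables and count the number of values for each key."""
    merged = defaultdict(list)
    for table in tables:
        for word, values in table.items():
            merged[word].extend(values)
    return {word: len(values) for word, values in merged.items()}

# Hypothetical two-page "book"; in a real cluster each page could be mapped
# on a different node because the map calls do not depend on one another.
pages = ["the mind and the truth", "truth and money"]
counts = reduce_tables(map_page(page) for page in pages)
print(counts)
```

The map calls share no state, which is what makes the problem embarrassingly parallel; only the reduce step needs to see results from more than one page.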


Rethinking Unstructured Data With Hadoop

• MapReduce provides unified interface
  – Rich ecosystem of tools for data analysis
    • Hive, Pig, et al. → Cloudera Distribution of Hadoop
  – Almost as accessible as structured data
• HDFS is a low-cost distributed file system
  – Adding capacity means just adding (cheap) nodes
  – Changes the economics of data storage
• Possible to extract the value from unstructured data and feasible to keep large amounts of it around
  – Tremendous opportunity for discovered knowledge


What LinkedIn Is Saying

“Hadoop is a key ingredient in allowing LinkedIn to build many of our most computationally difficult features, allowing us to harness our incredible data about the professional world for our users”
Jay Kreps, Principal Engineer


Word Count Example Revisited

[Figure: Word Count Algorithm. Store data in HDFS → Map phase: count words per data block → Shuffle → Reduce phase: aggregate counts → Extract results from HDFS.]

Sample output:
capacity: 14334
intellect: 12377
mind: 9574
money: 5967
truth: 5868
...
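
The phases named on this slide (map per data block, shuffle, reduce) can be mimicked in a single process as a rough sketch. Nothing here uses Hadoop or HDFS: the function names, the in-memory "blocks", the number of reducers, and the CRC-based partitioning are assumptions chosen for illustration.

```python
import zlib
from collections import Counter

def map_block(block):
    """Map phase: count words within a single data block."""
    return Counter(block.lower().split())

def shuffle(block_counts, num_reducers):
    """Shuffle: route each word's partial counts to one reducer partition."""
    partitions = [dict() for _ in range(num_reducers)]
    for counts in block_counts:
        for word, n in counts.items():
            # Stable hash so a given word always lands in the same partition.
            idx = zlib.crc32(word.encode("utf-8")) % num_reducers
            partitions[idx].setdefault(word, []).append(n)
    return partitions

def reduce_partition(partition):
    """Reduce phase: aggregate the partial counts for each word."""
    return {word: sum(partials) for word, partials in partition.items()}

def word_count(blocks, num_reducers=2):
    """Run map, shuffle, and reduce over in-memory blocks and merge the results."""
    mapped = [map_block(block) for block in blocks]
    result = {}
    for partition in shuffle(mapped, num_reducers):
        result.update(reduce_partition(partition))
    return result

# Hypothetical data blocks standing in for blocks stored in HDFS.
totals = word_count(["mind truth mind", "truth money", "mind"])
print(totals)
```

Since every occurrence of a word is routed to the same partition, each reducer can finish its words without consulting the others, so the reduce phase parallelizes as well.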


Unstructured Enterprise Data Analytics

• Not all necessarily Hadoop
  – MPI, Java, legacy, or even different Hadoop versions
• Grid Engine unifies the workload across the resources
  – Better efficiency
  – Lower cost of management
  – Cross-domain workflows
• Plus enterprise-class features:
  – Demand-driven cloud connectivity and power management
  – Advanced scheduling policies
    • Advance resource reservations
  – Full accounting and reporting suite


“The Oracle Grid Engine 6.2 software has dramatically lowered for us the cost of data intensive, Hadoop centered, computing. Oracle Grid Engine allows us to run Hadoop jobs within exactly the same scheduling and submission environment we use for traditional scalar and parallel loads.”
Gianluigi Zanetti, Director, Biomedical Applications, CRS4


Word Count Example Re-Revisited

[Diagram: HDFS and Oracle Grid Engine layers beneath workloads labeled OpenMPI, Java, Map/Reduce: Word Count, and Map/Reduce; the Map/Reduce workloads are drawn as MAP and MAP/REDUCE tasks.]


References For Getting Started

• Oracle Grid Engine OTN Page:
  – http://www.oracle.com/technetwork/oem/grid-engine-166852.html
• Hadoop Project Page:
  – http://hadoop.apache.org/
• Cloudera:
  – http://www.cloudera.com/
  – http://www.cloudera.com/hadoop-tutorial/
• Hadoop World 2010:
  – http://www.cloudera.com/company/press-center/hadoop-world-nyc/
