IBM Leads the Way with Hadoop and Spark, 15 May 2015

© 2015 IBM Corporation

IBM Leads the Way with Hadoop and Spark

The Keys to Getting Value out of Big Data


IBM’s Framework for Getting Value out of Big Data

All agree on Big Data’s potential, but wide divergence on how to exploit it

Pioneers who have started to harness Big Data have benefited greatly

We see Big Data adoption as a continual process – maturity levels

IBM’s approach enables faster adoption of Big Data technologies

Open source innovation (Hadoop, Spark)

Standards-based technologies (ODP, SQL, R)

Familiar interfaces and integration with established tools (IBM innovations)

Advanced analytics (IBM innovations)

IBM’s commitment for continued innovation


Hadoop and Spark Offer Significant Business Benefits

Chart: Value (low to high) plotted against Big Data Maturity (low to high)

Areas shown: Operations; Data Warehousing; Line of Business and Analytics; New Business Imperatives

Lower the Cost of Storage

Warehouse Modernization
• Data lake
• Data offload
• ETL offload
• Queryable archive and staging

Data-Informed Decision Making
• Full dataset analysis (no more sampling)
• Extract value from non-relational data
• 360° view of all enterprise data
• Exploratory analysis and discovery

Business Transformation
• Create new business models
• Risk-aware decision making
• Fight fraud and counter threats
• Optimize operations
• Attract, grow, retain customers


IBM Investing in Four Catalysts for Big Data Adoption

• Open Source Innovation
• Technical Standards
• Familiar Interfaces & Integration with Established Tools
• New Analytics Capabilities


Hadoop Advantages

Unlimited Scale
• Multiple data sources
• Multiple applications
• Multiple users

Enterprise Platform
• Reliability
• Resiliency
• Security

Wide Range of Data Formats
• Files
• Semi-structured
• Databases


Hadoop MapReduce Challenges

Ease of Development
• Need deep Java skills
• Few abstractions available for analysts

In-Memory Performance
• No in-memory framework
• Application tasks write to disk with each cycle

Combine Workflows
• Only suitable for batch workloads
• Rigid processing model
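The disk-write and rigid-model points above can be illustrated with a toy, single-process sketch in plain Python. This is not Hadoop code: `run_cycle`, `read_cycle_output`, and `word_histogram` are hypothetical names, and JSON files in a temporary directory stand in for HDFS. Every cycle is forced through map, shuffle, and reduce, and its result is written to disk before the next cycle can read it.

```python
import json
import os
import tempfile
from collections import defaultdict


def run_cycle(records, mapper, reducer, out_path):
    """Run one fixed map -> shuffle -> reduce cycle and write the result to disk."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map: emit (key, value) pairs
            groups[key].append(value)       # shuffle: group values by key
    result = {key: reducer(values) for key, values in groups.items()}  # reduce
    with open(out_path, "w") as f:          # output is materialized between cycles
        json.dump(result, f)
    return out_path


def read_cycle_output(path):
    """Load a previous cycle's output back from disk as (key, value) pairs."""
    with open(path) as f:
        return list(json.load(f).items())


def word_histogram(lines, workdir):
    """Two chained cycles: word counts, then how many words share each count."""
    first = run_cycle(
        lines,
        mapper=lambda line: [(word, 1) for word in line.lower().split()],
        reducer=sum,
        out_path=os.path.join(workdir, "cycle1.json"),
    )
    # The second cycle cannot start from memory; it re-reads cycle 1 from disk.
    second = run_cycle(
        read_cycle_output(first),
        mapper=lambda pair: [(str(pair[1]), 1)],
        reducer=sum,
        out_path=os.path.join(workdir, "cycle2.json"),
    )
    return dict(read_cycle_output(second))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(word_histogram(["big data", "big value"], tmp))
```

Even this two-step job pays for a write and a read between the steps, which is the overhead an in-memory framework is meant to avoid for iterative work.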



Ease of Development

• Easier APIs

• Python, Scala, Java

• Resilient Distributed Datasets

• Unify processing

Spark Advantages

• Batch

• Interactive

• Iterative algorithms

• Micro-batch

Combine Workflows


Spark Libraries

Apache Spark: Spark SQL, Spark Streaming, GraphX, MLlib, SparkR


Spark on Hadoop

Compute layer: Apache Spark (Spark SQL and other Spark libraries)
Resource management: Apache Hadoop YARN
Storage management: Apache Hadoop HDFS
Nodes: slave node 1, slave node 2, … slave node n


Spark on Mesos

Compute layer: Apache Spark (Spark SQL and other Spark libraries)
Resource management: Apache Mesos
Storage management: Apache Hadoop HDFS
Nodes: slave node 1, slave node 2, … slave node n


Spark as a Service

Compute layer: Apache Spark (Spark SQL and other Spark libraries)
Resource management: Apache Hadoop YARN
Storage management: Amazon S3
Nodes: Amazon EC2 node 1, Amazon EC2 node 2, … Amazon EC2 node n


Spark on the Amazon Cloud

Compute layer: Apache Spark (Spark SQL and other Spark libraries)
Resource management: Apache Hadoop YARN
Storage management: Amazon S3
Nodes: Amazon EC2 node 1, Amazon EC2 node 2, … Amazon EC2 node n


Spark Running in Standalone Mode

Compute layer: Apache Spark (Spark SQL and other Spark libraries)
Resource management and storage management: single node, with local storage
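In practice, the deployment options above differ mainly in the master URL an application hands to Spark. The helper below is a minimal sketch: `spark_master_url` and its deployment labels are hypothetical names chosen for illustration, while the URL schemes (`yarn`, `mesos://`, `spark://`, `local[*]`) and default ports follow Spark's documented master URL formats. Two assumptions are worth stating. First, older Spark 1.x releases selected YARN with `yarn-client` or `yarn-cluster` rather than plain `yarn`. Second, the single-node setup shown on the slide corresponds to Spark's local mode; in Spark's own terminology "standalone" names its built-in cluster manager, reached through `spark://`.

```python
# Default ports used by the Mesos master and by Spark's built-in cluster manager.
DEFAULT_PORTS = {"mesos": 5050, "spark": 7077}


def spark_master_url(deployment, host=None, port=None):
    """Return the master URL for a deployment label (labels are hypothetical)."""
    if deployment == "hadoop-yarn":
        # The cluster location comes from the Hadoop configuration, not the URL.
        return "yarn"
    if deployment == "single-node":
        # Local mode: run on one machine, using as many threads as cores.
        return "local[*]"
    if deployment in DEFAULT_PORTS:
        if host is None:
            raise ValueError(f"{deployment} deployments need a master host")
        return f"{deployment}://{host}:{port or DEFAULT_PORTS[deployment]}"
    raise ValueError(f"unknown deployment: {deployment}")


if __name__ == "__main__":
    for label, host in [("hadoop-yarn", None), ("mesos", "master1"),
                        ("spark", "master1"), ("single-node", None)]:
        print(label, "->", spark_master_url(label, host))
```

The resulting string is what would be passed as `--master` to `spark-submit`. The storage layer (HDFS, Amazon S3, or local files) is chosen separately, through the paths the application reads and writes.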


Spark Resilient Distributed Datasets

Spark RDD: in-memory distribution. Partitions of RDD1, RDD2, and RDD3 are spread across slave nodes 1, 2, and 3.

HDFS: on-disk distribution. File blocks are spread across the same nodes:
• Slave node 1: c3, d2, a2, b1
• Slave node 2: c2, d1, a1, b2
• Slave node 3: c1, d2, a3, b3
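A toy model in plain Python can make the RDD idea more concrete. It is only a sketch, not Spark's implementation: `ToyRDD` and its methods are hypothetical names, lists of numbers stand in for HDFS blocks, and every partition is cached on first use (in Spark, caching is opt-in). What it tries to show is that a dataset is a set of partitions held in memory plus the recipe (lineage) needed to rebuild any partition that is lost.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: in-memory partitions plus lineage."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute   # lineage: partition index -> list of records
        self._cache = {}          # partitions currently held in memory

    @classmethod
    def from_blocks(cls, blocks):
        """Create one partition per on-disk block (blocks stand in for HDFS)."""
        return cls(len(blocks), lambda i: list(blocks[i]))

    def map(self, fn):
        """Derive a new dataset; nothing is computed until a partition is needed."""
        parent = self
        return ToyRDD(self.num_partitions,
                      lambda i: [fn(x) for x in parent.partition(i)])

    def partition(self, i):
        """Return partition i, recomputing it from lineage if it is not in memory."""
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        """Simulate a node failure that drops one in-memory partition."""
        self._cache.pop(i, None)

    def collect(self):
        """Gather the records of all partitions, in partition order."""
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


if __name__ == "__main__":
    blocks = [[1, 2], [3, 4], [5, 6]]        # three blocks, one per node
    rdd1 = ToyRDD.from_blocks(blocks)
    rdd2 = rdd1.map(lambda x: x * 10)
    print(rdd2.collect())
    rdd2.lose_partition(1)                   # the partition is gone from memory...
    print(rdd2.collect())                    # ...and is rebuilt from its lineage
```

Because only the lost partition is recomputed, the rest of the dataset stays in memory, which is where the "resilient" in the name comes from.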


The Combination: The Flexibility of Spark on a Stable Hadoop Platform


Ease of Development

Combine Workflows

Unlimited Scale

Enterprise Platform

Wide Range of

Data Formats


IBM Open Platform with Apache Hadoop

100% open source code

Commitment to currency: “days, not months”

Includes Spark

Free for production use

Decoupled Apache Hadoop from IBM analytics and data science technologies

Production support offering available

Apache Open Source Components

HDFS

YARN

MapReduce

Ambari HBase

Spark

Flume

Hive Pig

Sqoop

HCatalog

Solr/Lucene



IBM is Committed to Open Source

Open source technologies are the base for IBM software and solutions

IBM’s long history of deep open source commitment

Apache Software Foundation: Founding member in 1999

Cloud Foundry: #1 contributor; Basis for Bluemix

OpenStack: #4 contributor; Basis for IBM’s IaaS

Linux: #3 contributor; IBM first enterprise backer of Linux

Hadoop/Spark: Extensive investment in open source contribution; Integration with Analytics software

Infrastructure

Systems

Application


Goal of the Apache Software Foundation: Let 1000 Flowers Bloom!

• 249 Top Level Projects, 40 Incubating
• 2 Million+ Code Commits
• IBM co-founded the ASF in 1999 and is a Gold Sponsor
• The “Apache Way” is about fostering open innovation
• Not a standards organization


Apache Hadoop Ecosystem: Rapid Innovation, Few Standards

Distributions include different projects at different version levels

“This proliferation of baskets [Hadoop distributions with different project versions] creates significant drag when it comes to building reliable applications ... makes it harder for customers to assess which basket of Hadoop that they need and harder for application developers to create solutions that work broadly.”
– Raymie Stata, CEO, Altiscale

Even though the project versions match, there are interface differences

“Setting a baseline of Hive 13 so we get access to some new syntax. Try it on one, it works great... Try it on another that says it also has Hive 13, and we get ‘syntax error’ …”
– Craig Rubendall, VP, SAS

If the industry is truly committed to developing big data technologies and solutions …, it will require an ecosystem of providers … to create a consistent framework around which everyone can develop.
– Siki Giunta, SVP, Verizon

The Hadoop ecosystem is evolving at a faster pace than is comfortable

“My personal speculation is that it comes from some who have been evaluating for a while seeing change occur so rapidly that they are dropping back for another look.”
– Merv Adrian, VP, Gartner


Certify a standard “ODP Core” set of open source Hadoop family projects with specific versions and patch levels.

Develop tools and methods to help solution providers test applications against the ODP Core.

Contribute changes and fixes in the ODP Core Hadoop family projects to the ASF using the ASF processes.

http://opendataplatform.org/


Open Data Platform Initiative

Representation across the Hadoop ecosystem…
• Hadoop distribution vendors
• Software application providers
• System integrators/consultants
• Hardware vendors
• Customers

… who all believe in the need for a community-based effort to standardize Hadoop, which will lead to improved adoption


IBM Open Platform with Apache Hadoop adopts ODP Core

BigInsights will include ODP certified Apache packages

ODP will initially target core packages of a Hadoop distribution

Packages will expand over time

First certification set expected this summer

Our goal for BigInsights on ODP

Better compatibility and less testing against ecosystem software

Enable IBM Hadoop capabilities to run on other ODP-certified Hadoop distributions

HDFS

YARN

MapReduce

Ambari HBase

Spark

Flume

Hive Pig

Sqoop

HCatalog

Solr/Lucene

ODP

* Candidate set of certified ODP modules – expected summer 2015

Apache Open Source Components



Goal of the ODP: Enable Innovation to Flourish on a Common Platform

• Complements the Apache Software Foundation’s governance model
• ODP efforts focus on integration, testing, and certifying a standard core of Apache Hadoop ecosystem projects
• Fixes for issues found in ODP testing will be contributed to the ASF projects in line with ASF processes
• The ODP will not override or replace any aspect of ASF governance


IBM BigInsights for Apache Hadoop

IBM BigInsights Analyst
• Big SQL
• BigSheets

IBM BigInsights Data Scientist
• Text Analytics
• Machine Learning with Big R
• Big R
• Big SQL
• BigSheets

IBM BigInsights Enterprise Management
• POSIX Distributed File System
• Multi-workload, Multi-tenant scheduling


IBM BigInsights for Apache Hadoop

IBM System z, IBM Power, Intel Servers, On Cloud

Your choice of infrastructure and deployment model


IBM Analytic Platform Capabilities

IBM Software Integrates and Extends Hadoop and Spark

• Data Warehousing: PureData for Analytics, Operational Analytics
• Entity Extraction and Matching: Big Match
• Security and Compliance: Optim, Guardium Audit and Encryption
• Data Integration and Governance: Information Server
• Enterprise Search: Watson Explorer
• Real-time Analytics: Streams
• Predictive Modeling and Descriptive Statistics: SPSS, Big R and Scalable Algorithms
• Analysis, Reporting, and Exploration: Watson Analytics, Cognos, BigSheets
• Fast, ANSI SQL 2011, and Secure SQL: Big SQL
• Enterprise File System: GPFS-FPO
• Cluster Resource and Workload Management: Platform Symphony
• Large Scale Text Extraction: Big Text



IBM Leads the Market and Analysts Agree

“IBM’s all-in bet on Apache Hadoop clearly has had the biggest impact among developers we polled”
– Evans Big Data Survey

Leading Hadoop Distribution Leading Streaming Analytics Solution


IBM’s Investment in the Big Data Community

Over 250,000 benefit from free Big Data skills training

http://bigdatauniversity.com


Spark Technology Center

Focal point for IBM investment in Spark

Code contributions to Apache Spark project

Build industry solutions using Spark

Evangelize Spark technology inside/outside IBM

Agile engagement across IBM divisions

Systems: contribute enhancements to Spark core, and optimized infrastructure (hardware/software) for Spark

Analytics: IBM Analytics software will exploit Spark processing

Research: build innovations above (solutions that use Spark), inside (improvements to Spark core), and below (improve systems that execute Spark) the Spark stack

Goal: To be the #1 contributor and adopter in the Spark ecosystem


The IBM Difference

IBM delivers the foundation for Big Data – now and in the future

Embraces open source

Establishes standards

Integrates with familiar interfaces and established systems

Delivers advanced analytic capabilities

Enables you to benefit from broader data and analytics capabilities

Data Integration and Governance

Predictive and Real-time Analytics

Provides expertise to help you on your journey

6,000 partners

Analytics services and solution centers