Intro to Big Data and Apache Hadoop Dr. Amr Awadallah, CTO/Founder @awadallah, [email protected]

Cw13 big data and apache hadoop by amr awadallah-cloudera


Page 1: Cw13 big data and apache hadoop by amr awadallah-cloudera

Intro to Big Data and Apache Hadoop

Dr. Amr Awadallah, CTO/Founder

@awadallah, [email protected]

Page 2: Cw13 big data and apache hadoop by amr awadallah-cloudera

Who is Cloudera?

The Leader in Big Data Management

• Deliver a revolutionary data management platform based on Apache Hadoop
• Enable organizations to improve operational efficiency and Ask Bigger Questions of all their data

What the Enterprise Requires

• The market-leading Hadoop-based platform with batch and real-time processing frameworks
• A comprehensive suite of system and data management software
• Training and certification programs
• Comprehensive support and consulting services

Extensive Partner Ecosystem

• Over 400 partners across hardware, software and services

Customers & Users Across Industries

• More production deployments than all other vendors combined

©2013 Cloudera, Inc. All Rights Reserved.

Page 3: Cw13 big data and apache hadoop by amr awadallah-cloudera

Data Has Changed in the Last 30 Years

[Figure: DATA GROWTH, 1980 to 2012. Drivers: END-USER APPLICATIONS, THE INTERNET, MOBILE DEVICES, SOPHISTICATED MACHINES. STRUCTURED DATA – 10%; UNSTRUCTURED DATA – 90%.]

Page 4: Cw13 big data and apache hadoop by amr awadallah-cloudera

What if you wanted to…

[Figure labels: Data, Question, Speed, Usage, Type/Form]

Page 5: Cw13 big data and apache hadoop by amr awadallah-cloudera

So what is Apache Hadoop?

• Self-Healing, High-Bandwidth Clustered Storage
• Byte Streams
• Fault-Tolerant Distributed Processing
• Schema-on-Read

Storage: HDFS storage distribution

Input File: blocks 1, 2, 3, 4, 5

Node A    Node B    Node C    Node D    Node E
2, 4, 5   1, 2, 5   1, 3, 4   2, 3, 5   1, 3, 4

Compute: MapReduce compute distribution

Node A    Node B    Node C    Node D    Node E
2, 4, 5   1, 2, 5   1, 3, 4   2, 3, 5   1, 3, 4

Output File: blocks 1, 2, 3, 4, 5
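The storage and compute diagrams on this slide can be approximated with a small Python sketch. It is an illustration only, not Hadoop code: the block-to-node layout is read off the diagram, while the function names, the greedy scheduling rule, and the sample block contents are all assumptions made up for this example. It shows two ideas from the slide: every block lives on three nodes (so storage survives node failures), and each map task is sent to a node that already holds its block.

```python
from collections import defaultdict

# Block layout per node, as read from the slide's diagram.
PLACEMENT = {
    "A": [2, 4, 5],
    "B": [1, 2, 5],
    "C": [1, 3, 4],
    "D": [2, 3, 5],
    "E": [1, 3, 4],
}

# Made-up block contents, for illustration only.
BLOCKS = {1: "a b", 2: "c", 3: "d e f", 4: "g", 5: "h i"}


def replicas(placement):
    """Invert node -> blocks into block -> sorted list of nodes holding a copy."""
    holders = defaultdict(list)
    for node, blocks in placement.items():
        for block in blocks:
            holders[block].append(node)
    return {block: sorted(nodes) for block, nodes in sorted(holders.items())}


def schedule_map_tasks(placement, failed=frozenset()):
    """Assign each block to a live node that already stores it (data locality).

    Hypothetical greedy rule: least-loaded live holder, ties broken by name.
    """
    load = {node: 0 for node in placement if node not in failed}
    assignment = {}
    for block, nodes in replicas(placement).items():
        live = [n for n in nodes if n not in failed]
        if not live:
            raise RuntimeError(f"block {block}: all replicas are on failed nodes")
        chosen = min(live, key=lambda n: (load[n], n))
        load[chosen] += 1
        assignment[block] = chosen
    return assignment


def run_job(block_data, assignment, map_fn, reduce_fn):
    """Apply map_fn to each assigned block, then combine the partial results."""
    partials = [map_fn(block_data[block]) for block in sorted(assignment)]
    return reduce_fn(partials)


if __name__ == "__main__":
    plan = schedule_map_tasks(PLACEMENT)
    print("map task placement:", plan)
    print("total words:", run_job(BLOCKS, plan, lambda text: len(text.split()), sum))
    # With three replicas per block, the job can still be scheduled after two nodes fail.
    print("after losing A and B:", schedule_map_tasks(PLACEMENT, failed={"A", "B"}))
```

With the layout from the diagram, each of the five blocks ends up on a different node, so all five map tasks can run in parallel without moving data.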

Page 6: Cw13 big data and apache hadoop by amr awadallah-cloudera


Next-Gen Data Management


Page 7: Cw13 big data and apache hadoop by amr awadallah-cloudera

The Key Benefit: Agility/Flexibility


Schema-on-Write (RDBMS):

• Prescriptive Data Modeling:
  • Create static DB schema
  • Transform data into RDBMS
  • Query data in RDBMS format
• New columns must be added explicitly before new data can propagate into the system.
• Good for Known Unknowns (Repetition)

Schema-on-Read (Hadoop):

• Descriptive Data Modeling:
  • Copy data in its native format
  • Create schema + parser
  • Query Data in its native format (does ETL on the fly)
• New data can start flowing any time and will appear retroactively once the schema/parser properly describes it.
• Good for Unknown Unknowns (Exploration)
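The contrast between the two models can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how an RDBMS or Hadoop is implemented: the JSON records, field names, and both function names are invented for the example. The schema-on-write path transforms records into fixed columns at load time, so an unmodelled field is dropped; the schema-on-read path keeps the raw lines and applies a schema at query time, so a field added to the schema later appears retroactively.

```python
import json

# Raw events kept in their native format. The second record carries a field
# ("device") that nobody modelled up front. All data here is made up.
RAW_LINES = [
    '{"user": "ann", "amount": "12.50"}',
    '{"user": "bob", "amount": "3.00", "device": "mobile"}',
]


def load_schema_on_write(lines, columns):
    """Transform at load time: only the pre-declared columns are stored."""
    table = []
    for line in lines:
        record = json.loads(line)
        table.append(tuple(record.get(col) for col in columns))
    return table


def query_schema_on_read(lines, schema):
    """Parse at query time: schema maps column name -> converter function."""
    for line in lines:
        record = json.loads(line)
        yield {
            col: (convert(record[col]) if col in record else None)
            for col, convert in schema.items()
        }


if __name__ == "__main__":
    # Schema-on-write: "device" was not in the static schema, so it is gone.
    print(load_schema_on_write(RAW_LINES, ["user", "amount"]))

    # Schema-on-read: the same raw lines answer a new question once the
    # schema describes the new field.
    print(list(query_schema_on_read(RAW_LINES, {"user": str, "amount": float})))
    print(list(query_schema_on_read(RAW_LINES, {"user": str, "device": str})))
```

The trade-off is the one the slide names: the write-time table is cheap to query repeatedly, while the read-time parser pays the parsing cost on every query in exchange for flexibility.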


Page 8: Cw13 big data and apache hadoop by amr awadallah-cloudera

Scalable Technology + Scalable Development


Grows without requiring developers to re-architect their algorithms/application


AUTO SCALE

Page 9: Cw13 big data and apache hadoop by amr awadallah-cloudera

Economics: Return on Byte

[Figure labels: High ROB; Low ROB (but still a ton of aggregate value)]

Page 10: Cw13 big data and apache hadoop by amr awadallah-cloudera

CDH: Cloudera Distribution incl. Apache Hadoop

• Hadoop Core Kernel: MapReduce, HDFS
• Coordination: APACHE ZOOKEEPER
• Data Integration: APACHE FLUME, APACHE SQOOP
• Fast Read/Write Access: APACHE HBASE
• Batch Processing Languages: APACHE PIG, APACHE HIVE
• Interactive SQL: Impala
• Data Processing Lib: DataFu for Pig
• Data Mining Lib: APACHE MAHOUT
• Metadata: APACHE HIVE MetaStore
• Job Workflow: APACHE OOZIE
• Web Console: HUE
• Connectivity: ODBC/JDBC/FUSE/HTTPS
• Cloud Deployment: APACHE WHIRR
• Build/Test: APACHE BIGTOP
• Cloudera Manager Free Edition (Installation Wizard)

Page 11: Cw13 big data and apache hadoop by amr awadallah-cloudera

Cloudera Enterprise


Page 12: Cw13 big data and apache hadoop by amr awadallah-cloudera

The Cloudera Solution Stack

CLOUDERA UNIVERSITY

• DEVELOPER TRAINING
• ADMINISTRATOR TRAINING
• DATA SCIENCE TRAINING
• CERTIFICATION PROGRAMS

PROFESSIONAL SERVICES

• USE CASE DISCOVERY
• NEW HADOOP DEPLOYMENT
• PROOF-OF-CONCEPT
• DEPLOYMENT CERTIFICATION
• PROCESS & TEAM DEVELOPMENT
• PRODUCTION PILOTS

MANAGEMENT SOFTWARE & TECHNICAL SUPPORT (SUBSCRIPTION)

• CM: CLOUDERA MANAGER
• CS: CLOUDERA SUPPORT

CDH

• INGEST, STORE, EXPLORE, PROCESS, ANALYZE, SERVE
• OSS: APACHE HADOOP & OPEN SOURCE SOFTWARE

Page 13: Cw13 big data and apache hadoop by amr awadallah-cloudera

Powered by Cloudera Impala

[Figure: BEFORE IMPALA and WITH IMPALA architectures, with BATCH PROCESSING, REAL-TIME ACCESS, and USER INTERFACE layers]

• With Impala:
  • Interactive ANSI-92 SQL queries
  • Native distributed query engine
  • Optimized for low-latency
• Provides:
  • Answers as fast as you can ask
  • Everyone can ask questions of all data
  • Big data storage and analytics together
• Unified storage:
  • Supports HDFS and HBase
  • Flexible file formats and schemas
• Unified Metastore
• Unified Security
• Unified Client Interfaces:
  • ODBC/JDBC
  • SQL syntax
  • Hue Beeswax Web UI

Page 14: Cw13 big data and apache hadoop by amr awadallah-cloudera

Cloudera in the Enterprise Stack


Page 15: Cw13 big data and apache hadoop by amr awadallah-cloudera

Use Case: A Major Financial Institution

The Challenge:
• Current EDW at capacity; cannot support growing data depth and width
• Performance issues in business critical apps; little room for innovation.

The Solution:
• Cloudera Enterprise offloads data storage (S), processing (T) & some analytics (Q) from the EDW.
• EDW resources can now be focused on repeatable operational analytics.
• Month data scan in 4 secs vs. 4 hours

New solution saves tens of millions by optimizing existing EDW for analytics & reducing data storage costs by 99%

[Figure: DATA WAREHOUSE workload before the offload: Operational (44%), ELT Processing (42%), Analytics (11%). After the offload: CLOUDERA takes on Analytics, Processing, and Storage; the DATA WAREHOUSE runs Operational (50%) and Analytics (50%).]

Page 16: Cw13 big data and apache hadoop by amr awadallah-cloudera

Beyond Data Warehousing

• COMMUNICATIONS: Location-based advertising
• HEALTH CARE: Patient sensors, monitoring, EHRs; Quality of care
• LAW ENFORCEMENT & DEFENSE: Threat analysis, Social media monitoring, Photo analysis
• EDUCATION & RESEARCH: Experiment sensor analysis
• FINANCIAL SERVICES: Risk & portfolio analysis; New products
• ON-LINE SERVICES / SOCIAL MEDIA: People & career matching; Website optimization
• UTILITIES: Smart Meter analysis for network capacity
• CONSUMER PACKAGED GOODS: Sentiment analysis of what’s hot, customer service
• MEDIA / ENTERTAINMENT: Viewers / advertising effectiveness
• TRAVEL & TRANSPORTATION: Sensor analysis for optimal traffic flows; Customer sentiment
• LIFE SCIENCES: Clinical trials; Genomics
• RETAIL: Consumer sentiment; Optimized marketing
• AUTOMOTIVE: Auto sensors reporting location, problems
• HIGH TECHNOLOGY / INDUSTRIAL MFG.: Mfg quality; Warranty analysis
• OIL & GAS: Drilling exploration sensor analysis

Page 17: Cw13 big data and apache hadoop by amr awadallah-cloudera


The Road Ahead

• 2006-2012: Bringing Compute to Data
• 2013-???: Bringing Applications to Data

Page 18: Cw13 big data and apache hadoop by amr awadallah-cloudera

The Cloudera Platform for Big Data

Flexibility
• Store any data
• Run any analysis
• Keeps pace with the rate of change of incoming data

Scalability
• Proven growth to PBs/1,000s of nodes
• No need to rewrite queries, automatically scales
• Keeps pace with the rate of growth of incoming data

Economics
• Cost per TB at a fraction of other options
• Keep all of your data alive in an active archive
• Powering the "data beats algorithm" movement

Page 19: Cw13 big data and apache hadoop by amr awadallah-cloudera

Dr. Amr Awadallah CTO/Founder @awadallah [email protected]