Integrated Data Warehouse with Hadoop and Oracle Database

Use cases and advice on integrating Hadoop with the enterprise data warehouse

Page 1: Integrated Data Warehouse with Hadoop and Oracle Database

With Oracle Database and Hadoop

Building the Integrated Data Warehouse

Gwen Shapira, Senior Consultant

Page 2: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Why Pythian

• Recognized Leader:

  • Global industry leader in data infrastructure managed services and consulting with expertise in Oracle, Oracle Applications, Microsoft SQL Server, MySQL, big data and systems administration

  • Work with over 200 multinational companies such as Forbes.com, Fox Sports, Nordion and Western Union to help manage their complex IT deployments

• Expertise:

  • One of the world’s largest concentrations of dedicated, full-time DBA expertise. Employ 8 Oracle ACEs/ACE Directors

  • Hold 7 Specializations under Oracle Platinum Partner program, including Oracle Exadata, Oracle GoldenGate & Oracle RAC

• Global Reach & Scalability:

  • 24/7/365 global remote support for DBA and consulting, systems administration, special projects or emergency response

Page 3: Integrated Data Warehouse with Hadoop and Oracle Database

About Gwen Shapira

• Oracle ACE Director

• 13 Years with pager

•  7 as Oracle DBA

• Senior Consultant:

• Has MacBook, will travel.

•  @gwenshap

•  http://www.pythian.com/news/author/shapira/

Page 4: Integrated Data Warehouse with Hadoop and Oracle Database

Agenda

• What is Big Data?

• Why do we care about Big Data?

• Why does your DWH need Hadoop?

•  Examples of Hadoop in the DWH

•  How to integrate Hadoop into your DWH

•  Avoiding major pitfalls

Page 5: Integrated Data Warehouse with Hadoop and Oracle Database

What is Big Data?

Page 6: Integrated Data Warehouse with Hadoop and Oracle Database

MORE DATA THAN YOU CAN HANDLE

Page 7: Integrated Data Warehouse with Hadoop and Oracle Database

MORE DATA THAN RELATIONAL DATABASES CAN HANDLE

Page 8: Integrated Data Warehouse with Hadoop and Oracle Database

MORE DATA THAN RELATIONAL DATABASES CAN HANDLE CHEAPLY

Page 9: Integrated Data Warehouse with Hadoop and Oracle Database

Data arriving at fast rates

Typically unstructured

Stored without aggregation

Analyzed in real time

For reasonable cost

Page 10: Integrated Data Warehouse with Hadoop and Oracle Database

Where does Big Data come from?

• Social media

• Enterprise transactional data

• Consumer behaviour

• Multimedia

• Sensors and embedded devices

• Network devices

Page 11: Integrated Data Warehouse with Hadoop and Oracle Database

Why all the Excitement?

Page 12: Integrated Data Warehouse with Hadoop and Oracle Database

Complex Data Architecture

Page 13: Integrated Data Warehouse with Hadoop and Oracle Database

Your DWH needs Hadoop

Page 14: Integrated Data Warehouse with Hadoop and Oracle Database

Big Problems with Big Data

•  It is:

•  Unstructured

•  Unprocessed

•  Un-aggregated

•  Un-filtered

•  Repetitive

•  And generally messy.

Oh, and there is a lot of it.

Page 15: Integrated Data Warehouse with Hadoop and Oracle Database

Technical Challenges

•  Storage capacity

•  Storage throughput

•  Pipeline throughput

•  Processing power

•  Parallel processing

•  System Integration

• Data Analysis

Scalable storage

Massive Parallel Processing

Ready to use tools

Page 16: Integrated Data Warehouse with Hadoop and Oracle Database

Hadoop Principles

• Bring Code to Data

• Share Nothing

Page 17: Integrated Data Warehouse with Hadoop and Oracle Database

Hadoop in a Nutshell

Map-Reduce: Framework for writing massively parallel jobs

HDFS: Replicated Distributed Big-Data File System
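
To make the Map-Reduce idea concrete, here is a minimal single-process sketch of the three stages (map, shuffle, reduce) using word counting. It is plain Python, not the Hadoop API; the function names are made up for illustration, and a real job would run the mappers and reducers in parallel on the nodes that hold the data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group values by key, the step the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

Because each mapper only sees its own lines and each reducer only sees its own keys, the same logic can be spread over many machines without shared state.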

Page 18: Integrated Data Warehouse with Hadoop and Oracle Database

Hadoop Benefits

•  Reliable solution based on unreliable hardware

• Designed for large files

•  Load data first, structure later

• Designed to maximize throughput of large scans

• Designed to maximize parallelism

• Designed to scale

•  Flexible development platform

•  Solution Ecosystem

Page 19: Integrated Data Warehouse with Hadoop and Oracle Database

Hadoop Limitations

• Hadoop is scalable but not fast

• Batteries not included

• Instrumentation not included either

• Well-known reliability limitations

Page 20: Integrated Data Warehouse with Hadoop and Oracle Database

Use Cases and Customer Stories

Hadoop in the Data Warehouse

Page 21: Integrated Data Warehouse with Hadoop and Oracle Database

ETL for Unstructured Data

Logs (web servers, app servers, clickstreams) → Flume → Hadoop (cleanup, aggregation, long-term storage) → DWH (BI, batch reports)
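
The cleanup and aggregation stage in this flow can be pictured with a small sketch. It assumes a simplified, hypothetical log format of four whitespace-separated fields (timestamp, method, path, status); real web server logs are richer, and in Hadoop this logic would run as a Map-Reduce job rather than in one process.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def parse_line(line: str) -> Optional[Tuple[str, int]]:
    """Cleanup: return (path, status), or None when the line is malformed."""
    parts = line.split()
    if len(parts) != 4:  # assumed format: timestamp method path status
        return None
    _timestamp, _method, path, status = parts
    if not status.isdigit():
        return None
    return path, int(status)


def hits_per_path(lines: Iterable[str]) -> Counter:
    """Aggregation: count successful (2xx) requests per path."""
    counts: Counter = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # drop garbage instead of loading it into the DWH
        path, status = parsed
        if 200 <= status < 300:
            counts[path] += 1
    return counts
```

Only the small aggregated result would then need to be loaded into the warehouse, while the raw logs stay in Hadoop.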

Page 22: Integrated Data Warehouse with Hadoop and Oracle Database

ETL for Structured Data

OLTP (Oracle, MySQL, Informix…) → Sqoop, Perl → Hadoop (transformation, aggregation, long-term storage) → DWH (BI, batch reports)

Page 23: Integrated Data Warehouse with Hadoop and Oracle Database

Bring the World into Your Datacenter

Page 24: Integrated Data Warehouse with Hadoop and Oracle Database

Rare Historical Report

Page 25: Integrated Data Warehouse with Hadoop and Oracle Database

Find Needle in Haystack

Page 26: Integrated Data Warehouse with Hadoop and Oracle Database

We are not doing SQL anymore

Page 27: Integrated Data Warehouse with Hadoop and Oracle Database

Connecting the (big) Dots

Page 28: Integrated Data Warehouse with Hadoop and Oracle Database

Sqoop Queries

Page 29: Integrated Data Warehouse with Hadoop and Oracle Database

Sqoop is Flexible (for import)

• Select <columns> from <table> where <condition>

• Or <write your own query>

• Split column

• Parallel

• Incremental

• File formats

Page 30: Integrated Data Warehouse with Hadoop and Oracle Database

Sqoop Import Examples

• sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username hr --table emp --where "start_date > '01-01-2012'"

• sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username myuser --table shops --split-by shop_id --num-mappers 16

The split-by column must be indexed or partitioned to avoid 16 full table scans
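
Why the split column matters becomes clearer when you look at what each mapper asks the database for. Sqoop reads the low and high values of the split column and hands each mapper an evenly sized slice of that range. The sketch below approximates that idea for an integer column; it is not Sqoop's code, and the function names and the exact query text are assumptions for illustration.

```python
from typing import List, Tuple


def split_ranges(lo: int, hi: int, num_mappers: int) -> List[Tuple[int, int]]:
    """Divide the inclusive key range [lo, hi] into contiguous per-mapper ranges."""
    if num_mappers < 1 or hi < lo:
        raise ValueError("need num_mappers >= 1 and lo <= hi")
    total = hi - lo + 1
    num = min(num_mappers, total)  # never create empty ranges
    base, extra = divmod(total, num)
    ranges = []
    start = lo
    for i in range(num):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def mapper_queries(table: str, column: str, lo: int, hi: int, num_mappers: int) -> List[str]:
    """Build one range-restricted query per mapper (illustrative SQL only)."""
    return [
        f"SELECT * FROM {table} WHERE {column} >= {a} AND {column} <= {b}"
        for a, b in split_ranges(lo, hi, num_mappers)
    ]
```

Each of the 16 queries filters on shop_id, so without an index or partitioning on that column every mapper would scan the whole table to find its slice.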

Page 31: Integrated Data Warehouse with Hadoop and Oracle Database

Less Flexible Export

• 100 row batch inserts

• Commit every 100 batches

• Parallel export

• Update mode

Example:

sqoop export --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --table bar --export-dir /results/bar_data
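
The batching behaviour listed above (100 rows per insert, a commit every 100 batches, so roughly 10,000 rows per transaction) can be sketched as follows. This is only an illustration of the cadence, using an in-memory SQLite database as a stand-in for Oracle; the table layout `bar (id, payload)` and the function name are hypothetical.

```python
import sqlite3
from typing import Iterable, List, Tuple

ROWS_PER_BATCH = 100      # rows sent in one batched insert
BATCHES_PER_COMMIT = 100  # batches written before each commit


def export_rows(conn: sqlite3.Connection, rows: Iterable[Tuple[int, str]]) -> int:
    """Insert rows in batches and return the number of commits issued."""
    sql = "INSERT INTO bar (id, payload) VALUES (?, ?)"
    commits = 0
    pending_batches = 0
    batch: List[Tuple[int, str]] = []
    for row in rows:
        batch.append(row)
        if len(batch) == ROWS_PER_BATCH:
            conn.executemany(sql, batch)
            batch = []
            pending_batches += 1
            if pending_batches == BATCHES_PER_COMMIT:
                conn.commit()
                commits += 1
                pending_batches = 0
    if batch:  # flush the final partial batch
        conn.executemany(sql, batch)
        pending_batches += 1
    if pending_batches:  # commit whatever is still uncommitted
        conn.commit()
        commits += 1
    return commits
```

One consequence of this cadence is that a failed export can leave earlier transactions committed, so the target table may need cleanup or a staging table before a retry.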

Page 32: Integrated Data Warehouse with Hadoop and Oracle Database

Fuse-DFS

• Mount HDFS on Oracle server:

  • sudo yum install hadoop-0.20-fuse

  • hadoop-fuse-dfs dfs://<name_node_hostname>:<namenode_port> <mount_point>

• Use external tables to load data into Oracle

• File formats may vary

• All ETL best practices apply

Page 33: Integrated Data Warehouse with Hadoop and Oracle Database

Oracle Loader for Hadoop

• Load data from Hadoop into Oracle

• Map-Reduce job inside Hadoop

• Converts data types

• Partitions and sorts

• Direct path loads

• Reduces CPU utilization on database

Page 34: Integrated Data Warehouse with Hadoop and Oracle Database

Oracle Direct Connector to HDFS

• Create external tables of files in HDFS

• PREPROCESSOR HDFS_BIN_PATH:hdfs_stream

• All the features of External Tables

• Tested (by Oracle) as 5 times faster (GB/s) than FUSE-DFS

Page 35: Integrated Data Warehouse with Hadoop and Oracle Database

Big Data Appliance and Exadata

Page 36: Integrated Data Warehouse with Hadoop and Oracle Database

How not to Fail

Page 37: Integrated Data Warehouse with Hadoop and Oracle Database

Data that belongs in RDBMS

Page 38: Integrated Data Warehouse with Hadoop and Oracle Database

Prepare for Migration

Page 39: Integrated Data Warehouse with Hadoop and Oracle Database

Use Hadoop Efficiently

• Understand your bottlenecks:

  • CPU, storage or network?

• Reduce use of temporary data:

  • All data is sent over the network

  • Written to disk in triplicate

• Eliminate unbalanced workloads

• Offload work to RDBMS

• Fine-tune optimization with Map-Reduce
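
The cost of temporary data is simple arithmetic: with HDFS's default replication factor of 3, every byte a job writes to HDFS is stored three times. A rough sketch of that calculation for a chain of jobs follows; the function names are made up, and it ignores compression and anything kept on local disk only.

```python
from typing import Iterable


def hdfs_bytes_written(logical_bytes: int, replication: int = 3) -> int:
    """Physical bytes stored for one dataset at the given replication factor."""
    return logical_bytes * replication


def chained_job_temp_cost(intermediate_sizes: Iterable[int], replication: int = 3) -> int:
    """Total physical bytes written for the intermediate outputs of a job chain."""
    return sum(hdfs_bytes_written(size, replication) for size in intermediate_sizes)
```

For example, two intermediate outputs of 100 GB and 50 GB add up to 450 GB of physical writes, which is why dropping an unnecessary intermediate step pays off quickly.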

Page 40: Integrated Data Warehouse with Hadoop and Oracle Database

Your Data is NOT as BIG as you think

Page 41: Integrated Data Warehouse with Hadoop and Oracle Database

Getting Started

• Pick a Business Problem

• Acquire Data

• Use right tool for the job

• Hadoop can start on the cheap

• Integrate the systems

• Analyze data

• Get operational

Page 42: Integrated Data Warehouse with Hadoop and Oracle Database

Thank you and Q&A

http://www.pythian.com/news/

http://www.facebook.com/pages/The-Pythian-Group/163902527671

@pythian

http://www.linkedin.com/company/pythian

1-877-PYTHIAN

[email protected]

To contact us…

To follow us…

@pythianjobs