Upload
chen-gwen-shapira
View
113
Download
3
Tags:
Embed Size (px)
DESCRIPTION
Use cases and advice on integrating Hadoop with the enterprise data warehouse
Citation preview
With Oracle Database and Hadoop
Building the Integrated Data Warehouse
Gwen Shapira, Senior Consultant
© 2012 – Pythian
Why Pythian • Recognized Leader:
• Global industry leader in data infrastructure managed services and consulting with expertise in Oracle, Oracle Applications, Microsoft SQL Server, MySQL, big data and systems administration
• Work with over 200 multinational companies such as Forbes.com, Fox Sports, Nordion and Western Union to help manage their complex IT deployments
• Expertise: • One of the world’s largest concentrations of dedicated, full-time DBA expertise. Employ 8
Oracle ACEs/ACE Directors • Hold 7 Specializations under Oracle Platinum Partner program, including Oracle Exadata,
Oracle GoldenGate & Oracle RAC
• Global Reach & Scalability: • 24/7/365 global remote support for DBA and consulting, systems administration, special
projects or emergency response
© 2012 – Pythian
About Gwen Shapira • Oracle ACE Director • 13 Years with pager
• 7 as Oracle DBA
• Senior Consultant:
• Has MacBook, will travel.
• @gwenshap
• http://www.pythian.com/news/author/shapira/
© 2012 – Pythian
Agenda • What is Big Data?
• Why do we care about Big Data?
• Why your DWH needs Hadoop?
• Examples of Hadoop in the DWH
• How to integrate Hadoop into your DWH
• Avoiding major pitfalls
What is Big Data?
© 2012 – Pythian
MORE DATA THAN YOU CAN HANDLE
© 2012 – Pythian
MORE DATA THAN RELATIONAL DATABASES CAN HANDLE
© 2012 – Pythian
MORE DATA THAN RELATIONAL DATABASES CAN HANDLE CHEAPLY
© 2012 – Pythian
Data Arriving at fast Rates Typically unstructured Stored without aggregation
Analyzed in Real Time For Reasonable Cost
© 2012 – Pythian
Where does Big Data come from?
• Social media • Enterprise transactional data • Consumer behaviour • Multimedia • Sensors and embedded devices • Network devices
© 2012 – Pythian
Why all the Excitement?
© 2012 – Pythian
Complex Data Architecture
Your DWH needs Hadoop
© 2012 – Pythian
Big Problems with Big Data • It is:
• Unstructured
• Unprocessed
• Un-aggregated
• Un-filtered
• Repetitive
• And generally messy.
Oh, and there is a lot of it.
© 2012 – Pythian
Technical Challenges • Storage capacity
• Storage throughput
• Pipeline throughput
• Processing power
• Parallel processing
• System Integration
• Data Analysis
Scalable storage
Massive Parallel Processing
Ready to use tools
© 2012 – Pythian
Hadoop Principles Bring Code to Data Share Nothing
© 2012 – Pythian
Hadoop in a Nutshell
Map-Reduce: Framework for writing massively parallel jobs
HDFS: ���Replicated Distributed Big-Data File System
© 2012 – Pythian
Hadoop Benefits • Reliable solution based on unreliable hardware
• Designed for large files
• Load data first, structure later
• Designed to maximize throughput of large scans
• Designed to maximize parallelism
• Designed to scale
• Flexible development platform
• Solution Ecosystem
© 2012 – Pythian
Hadoop Limitations • Hadoop is scalable but not fast • Batteries not included • Instrumentation not included either • Well-known reliability limitations
Use Cases and Customer Stories
Hadoop in the Data Warehouse
© 2012 – Pythian
ETL for Unstructured Data
Logs Web servers, app server, clickstreams
Flume Hadoop Cleanup,
aggregation Longterm storage
DWH BI,
batch reports
© 2012 – Pythian
ETL for Structured Data
OLTP Oracle, MySQL,
Informix…
Sqoop, Perl
Hadoop Transformation
aggregation Longterm storage
DWH BI,
batch reports
© 2012 – Pythian
Bring the World into Your Datacenter
© 2012 – Pythian
Rare Historical Report
© 2012 – Pythian
Find Needle in Haystack
© 2012 – Pythian
We are not doing SQL anymore
Connecting the (big) Dots
© 2012 – Pythian
Sqoop Queries
© 2012 – Pythian
Sqoop is Flexible (for import)
• Select <columns> from <table> where <condition> • Or <write your own query>
• Split column
• Parallel
• Incremental
• File formats
© 2012 – Pythian
Sqoop Import Examples
• Sqoop import -‐-‐connect jdbc:oracle:thin:@//dbserver:1521/masterdb -‐-‐username hr -‐-‐table emp -‐-‐where “start_date > ’01-‐01-‐2012’”
• Sqoop import jdbc:oracle:thin:@//dbserver:1521/masterdb -‐-‐username myuser -‐-‐table shops -‐-‐split-‐by shop_id -‐-‐num-‐mappers 16
Must be indexed or partitioned to avoid 16 full table scans
© 2012 – Pythian
Less Flexible Export
• 100 row batch inserts • Commit every 100 batches
• Parallel export
• Update mode Example:
sqoop export -‐-‐connect jdbc:oracle:thin:@//dbserver:1521/masterdb -‐-‐table bar -‐-‐export-‐dir /results/bar_data
© 2012 – Pythian
Fuse-DFS • Mount HDFS on Oracle server:
• sudo yum install hadoop-0.20-fuse
• hadoop-fuse-dfs dfs://<name_node_hostname>:<namenode_port> <mount_point>
• Use external tables to load data into Oracle • File Formats may vary • All ETL best practices apply
© 2012 – Pythian
Oracle Loader for Hadoop • Load data from Hadoop into Oracle • Map-Reduce job inside Hadoop • Converts data types. • Partitions and sorts • Direct path loads • Reduces CPU utilization on database
© 2012 – Pythian
Oracle Direct Connector to HDFS • Create external tables of files in HDFS • PREPROCESSOR HDFS_BIN_PATH:hdfs_stream • All the features of External Tables • Tested (by Oracle) as 5 times faster (GB/s) than FUSE-DFS
© 2012 – Pythian
Big Data Appliance and Exadata
How not to Fail
© 2012 – Pythian
Data that belongs in RDBMS
© 2012 – Pythian
Prepare for Migration
© 2012 – Pythian
Use Hadoop Efficiently • Understand your bottlenecks:
• CPU, storage or network?
• Reduce use of temporary data:
• All data is over the network
• Written to disk in triplicate.
• Eliminate unbalanced workloads
• Offload work to RDBMS
• Fine-tune optimization with Map-Reduce
© 2012 – Pythian
Your Data is NOT as BIG
as you think
© 2012 – Pythian
Getting Started
• Pick a Business Problem • Acquire Data
• Use right tool for the job • Hadoop can start on the cheap
• Integrate the systems
• Analyze data • Get operational
© 2012 – Pythian
Thank you and Q&A
http://www.pythian.com/news/
http://www.facebook.com/pages/The-Pythian-Group/163902527671
@pythian
http://www.linkedin.com/company/pythian
1-877-PYTHIAN
To contact us…
To follow us…
@pythianjobs