MySQL and Hadoop
MySQL SF Meetup 2012
Chris Schneider
About Me
Chris Schneider, Data Architect @ Ning.com (a Glam Media Company)
Spent the last ~2 years working with Hadoop (CDH)
Spent the last 10 years building MySQL architecture for multiple companies
[email protected]
What we'll cover
Hadoop
CDH
Use cases for Hadoop
MapReduce
Sqoop
Hive
Impala
What is Hadoop?
An open-source framework for storing and processing data on a cluster of servers
Based on Google's whitepapers on the Google File System (GFS) and MapReduce
Scales linearly
Designed for batch processing
Optimized for streaming reads
The Hadoop Distribution: Cloudera
A leading distribution of Apache Hadoop

What Cloudera Does
Cloudera Manager
Enterprise Training: Hadoop Admin, Hadoop Development, HBase, Hive and Pig
Enterprise Support
Why Hadoop
Volume: use Hadoop when you cannot or should not use a traditional RDBMS
Velocity: can ingest terabytes of data per day
Variety: you can have structured or unstructured data
Use cases for Hadoop
Recommendation engines: Netflix recommends movies
Ad targeting, log processing, search optimization: eBay, Orbitz
Machine learning and classification: Yahoo Mail's spam detection; financial: identity theft and credit risk
Social graph: Facebook, LinkedIn and eHarmony connections
Election forecasting: predicting the outcome before the election, 50 out of 50 states correct thanks to Nate Silver!
Some Details about Hadoop
Two main pieces of Hadoop:
Hadoop Distributed File System (HDFS)
    Distributed and redundant data storage using many nodes
    Hardware will inevitably fail
MapReduce
    Read and process data with MapReduce
    Processing is sent to the data
    Many "map" tasks each work on a slice of the data
    Failed tasks are automatically restarted on another node or replica
MapReduce Word Count
The key and value together represent a row of data, where the key is the byte offset and the value is the line:

map (key, value)
    foreach (word in value)
        output (word, 1)

Map is used for searching.

Input line:  64, big data is totally cool and big…
Intermediate output (on local disk):  big,1  data,1  is,1  totally,1  cool,1  and,1  big,1
Reduce is used to aggregate
Hadoop sorts and groups the intermediate keys and calls a reduce for each unique key, e.g. GROUP BY, ORDER BY:

reduce (key, list)
    sum the list
    output (key, sum)

Input:   big,(1,1)  data,(1)  is,(1)  totally,(1)  cool,(1)  and,(1)
Output:  big,2  data,1  is,1  totally,1  cool,1  and,1
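The map and reduce pseudocode above can be sketched end-to-end in Python as a local simulation of a streaming-style word count (the shuffle/sort between the two phases is emulated with `sorted`; nothing here is tied to a real cluster):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map: emit a (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield (word, 1)

def reducer(pairs):
    """Reduce: sum the counts for each unique word.
    Assumes pairs arrive sorted by key, as Hadoop guarantees between phases."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

# Simulate the shuffle/sort phase locally on the slide's example line.
intermediate = sorted(mapper("big data is totally cool and big"))
print(dict(reducer(intermediate)))
```

On the example input this yields a count of 2 for "big" and 1 for every other word, matching the slide's output.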
Where does Hadoop fit in?
Think of Hadoop as an augmentation of your traditional RDBMS system:
You want to store years of data
You need to aggregate all of the data over many years' time
You want/need ALL your data stored and accessible, not forgotten or deleted
You need this to be free software running on commodity hardware
Where does Hadoop fit in?
[Architecture diagram: an http tier writes to several MySQL servers; Sqoop (or custom ETL) and Flume feed the data into a Hadoop (CDH4) cluster of DataNodes coordinated by a NameNode, NameNode2/SecondaryNameNode, and JobTracker; Hive and Pig query the cluster, with Tableau for business analytics on top.]
Data Flow
MySQL is used for OLTP data processing
An ETL process moves data from MySQL to Hadoop:
    Cron job + Sqoop, OR
    Cron job + custom ETL
Use MapReduce to transform data, run batch analysis, join data, etc.
Export transformed results to OLAP or back to OLTP, for example a dashboard of aggregated data or a report
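A cron-driven step like the one above is often just a thin wrapper that shells out to Sqoop. A minimal sketch in Python (the connect string, table name, and target directory are illustrative placeholders, not values from the slides):

```python
def build_sqoop_import(connect, table, target_dir, num_mappers=4):
    """Build the argument list for a scheduled Sqoop import of one table.

    All parameter values passed in are illustrative placeholders; adjust
    them for a real cluster.
    """
    return [
        "sqoop", "import",
        "--connect", connect,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),
        "--fields-terminated-by", "\t",
    ]

cmd = build_sqoop_import("jdbc:mysql://example.com/world", "City",
                         "/etl/world/City")
print(" ".join(cmd))
# A cron job would then execute the command, e.g. with
# subprocess.run(cmd, check=True) (requires `import subprocess`).
```

Keeping the command construction in one function makes it easy to loop over many tables in a nightly job.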
MySQL vs Hadoop

                                  MySQL                 Hadoop
Data capacity                     Depends, (TB)+        PB+
Data per query/MR                 Depends, (MB -> GB)   PB+
Read/write                        Random read/write     Sequential scans, append-only
Query language                    SQL                   MapReduce, scripted streaming, HiveQL, Pig Latin
Transactions                      Yes                   No
Indexes                           Yes                   No
Latency                           Sub-second            Minutes to hours
Data structure                    Relational            Both structured and unstructured
Enterprise and community support  Yes                   Yes
About Sqoop
Open source; stands for SQL-to-Hadoop
Parallel import and export between Hadoop and various RDBMSs
Default implementation is JDBC
Optimized for MySQL but not for performance
Integrated with connectors for Oracle, Netezza, Teradata (not open source)
Sqoop Data Into Hadoop
This command submits a Hadoop job that queries your MySQL server and reads all the rows from world.City
The resulting TSV file(s) will be stored in HDFS

$ sqoop import --connect jdbc:mysql://example.com/world \
    --table City \
    --fields-terminated-by '\t' \
    --lines-terminated-by '\n'
Sqoop Features
You can import specific tables, select columns (--columns), and filter rows with the --where flag
Controlled parallelism:
    Parallel mappers/connections (--num-mappers)
    Specify the column to split on (--split-by)
Incremental loads
Integration with Hive and HBase
Sqoop Export
The City table needs to exist
Default CSV format
Can use a staging table (--staging-table)

$ sqoop export --connect jdbc:mysql://example.com/world \
    --table City \
    --export-dir /hdfs_path/City_data
About Hive
Offers a way around the complexities of MapReduce/Java
Hive is an open-source project managed by the Apache Software Foundation
Facebook uses Hadoop and wanted non-Java employees to be able to access data
Language based on SQL; easy to learn and use
Data is available to many more people
Hive is a SQL SELECT statement to MapReduce translator
More About Hive
Hive is NOT a replacement for an RDBMS
Not all SQL works
Hive is only an interpreter that converts HiveQL to MapReduce
HiveQL queries can take many seconds or minutes to produce a result set
RDBMS vs Hive

              RDBMS                                          Hive
Language      SQL                                            Subset of SQL along with Hive extensions
Transactions  Yes                                            No
ACID          Yes                                            No
Latency       Sub-second (indexed data)                      Many seconds to minutes (non-indexed data)
Updates?      Yes: INSERT [IGNORE], UPDATE, DELETE, REPLACE  INSERT OVERWRITE
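The "Updates?" row is the key operational difference: Hive has no row-level UPDATE; instead you recompute a query result and rewrite the whole table or partition with INSERT OVERWRITE. A rough Python model of the contrast (the table contents below are made up for illustration):

```python
# RDBMS-style UPDATE: change one row in place.
mysql_table = {1: "old_name", 2: "other"}
mysql_table[1] = "new_name"           # UPDATE City SET name = ... WHERE id = 1

# Hive-style INSERT OVERWRITE: replace the table/partition's entire
# contents with a query's result set; individual rows cannot be updated.
hive_partition = ["old_name", "other"]
query_result = ["new_name", "other"]  # full, recomputed data set
hive_partition = query_result         # INSERT OVERWRITE TABLE ...
```

Both end states are the same; the difference is that Hive pays the cost of rewriting the full data set, which is why it suits batch jobs rather than OLTP.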
Sqoop and Hive
Alternatively, you can create table(s) within the Hive CLI and run an "fs -put" with an exported CSV file on the local file system

$ sqoop import --connect jdbc:mysql://example.com/world \
    --table City \
    --hive-import
Impala
It's new, it's fast
Allows real-time analytics on very large data sets
Runs on top of Hive
Based on Google's Dremel: http://research.google.com/pubs/pub36632.html
Cloudera VM for Impala: https://ccp.cloudera.com/display/SUPPORT/Downloads
Thanks Everyone
Questions?

Good references:
Cloudera.com
http://infolab.stanford.edu/~ragho/hive-icde2010.pdf
VM downloads: https://ccp.cloudera.com/display/SUPPORT/Cloudera%27s+Hadoop+Demo+VM+for+CDH4