nader-ganayem
First slides of Hadoop: Introduction to Big Data and Hadoop, covering: presenting and defining big data; introducing Hadoop and its history; how Hadoop works; HDFS.
Introducing:
The Modern Data Operating System
Hadoop is... a scalable, fault-tolerant, distributed system for data storage and processing (open source under the Apache license)
Core Hadoop has two main systems:
● Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
● MapReduce: distributed, fault-tolerant resource management and scheduling, coupled with a scalable data programming abstraction
Google paper → Hadoop counterpart:
GFS → HDFS
Map/Reduce → MapReduce
BigTable → HBase
Hadoop Origins
Hadoop Chronicles
(Diagram: Google's GFS, Map/Reduce and BigTable papers, and Doug Cutting.)
Etymology
● Hadoop was created in 2004 by Douglass "Doug" Cutting
● It implemented Google's File System (GFS) and MapReduce papers
● He aimed to index the internet, Google-style, for the startup search engine 'Nutch'
● He named it after his son's favourite toy, an elephant-shaped stuffed animal called Hadoop
What is Big Data?
"In information technology, big data is a loosely defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools."
Wikipedia
How big is big?
● 2008: Google processes 20 PB a day
● 2009: eBay has 6.5 PB of user data, growing 50 TB a day
● 2011: Yahoo! has 180-200 PB of data
● 2012: Facebook ingests 500 TB of data a day
Limitations of Existing Analytics Architecture

The classic stack, top to bottom:
- BI Reports + Online Apps
- RDBMS (aggregated data)
- ETL (Extract, Transform & Load)
- Storage Grid
- Data Collection
- Instrumentation (raw data sources)

Its limits:
- Moving data from storage to compute doesn't scale!
- Can't explore the original raw data (mostly append)
- Archiving = premature death
Why Hadoop? Challenge: read 1 TB of data

- 1 machine: 4 I/O channels, 100 MB/s each → ~45 minutes
- 10 machines: 4 I/O channels, 100 MB/s each → ~4.5 minutes
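The arithmetic behind those two numbers can be checked with a quick sketch (machine counts and channel speeds come from the slide; 1 TB is taken as 10^6 MB, so the result lands a little under the slide's rounded 45 minutes):

```python
def read_time_minutes(data_tb, machines, channels=4, mb_per_s=100):
    """Minutes to scan `data_tb` terabytes split evenly across `machines`,
    each machine reading on `channels` I/O channels of `mb_per_s` MB/s."""
    total_mb = data_tb * 1_000_000                # 1 TB ~= 10^6 MB
    throughput = machines * channels * mb_per_s   # aggregate MB/s
    return total_mb / throughput / 60

print(round(read_time_minutes(1, machines=1)))      # ~42 min (slide rounds to ~45)
print(round(read_time_minutes(1, machines=10), 1))  # ~4.2 min (slide: ~4.5)
```

Ten machines cut the time by exactly 10x, because each machine scans its own tenth of the data in parallel.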
Hadoop and Friends
The Key Benefit: Agility/Flexibility

Schema-On-Write (RDBMS):
- The schema must be created before any data can be loaded
- An explicit load operation transforms the data into the DB's internal structure
- New columns must be added explicitly before data for them can be loaded
- Reads are fast; standards / governance

Schema-On-Read (Hadoop):
- Data is simply copied to the file store; no transformation is needed
- A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns (late binding)
- New data can start flowing at any time and appears retroactively once the SerDe is updated to parse it
- Loads are fast; flexibility / agility
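Schema-on-read's "late binding" can be illustrated with a small sketch (the records, column names, and the serde_v1/serde_v2 functions are made up for illustration, standing in for a real SerDe):

```python
# Raw records are stored untouched; a pluggable parser (playing the role of a
# SerDe) extracts columns only at read time.
raw_store = [
    "2012-01-01,click,us",
    "2012-01-02,view,de",
    "2012-01-03,click,fr,mobile",  # a new column showed up; stored as-is
]

def serde_v1(line):
    date, event, country = line.split(",")[:3]
    return {"date": date, "event": event, "country": country}

def serde_v2(line):
    """Updated SerDe: also parses the new 'device' column (late binding)."""
    parts = line.split(",")
    row = serde_v1(line)
    row["device"] = parts[3] if len(parts) > 3 else None
    return row

# Once the SerDe is updated, the new column appears retroactively for old rows too.
print([serde_v2(line)["device"] for line in raw_store])  # [None, None, 'mobile']
```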
Hadoop Components: Master/Slave Architecture
- Masters: NameNode (HDFS) and JobTracker (MapReduce)
- Slaves: DataNodes (HDFS) and TaskTrackers (MapReduce)
The NameNode keeps the file metadata:
/kenshoo/data1.txt → blocks 1, 2, 3
/kenshoo/data2.txt → blocks 4, 5

Replication factor r=3 (the dfs.replication property in hdfs-site.xml).

(Diagram: blocks 1-5 spread across the DataNodes, three copies of each.)
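The replication factor shown here (r=3, which is also Hadoop's default) is configured in hdfs-site.xml; a minimal sketch of the relevant property:

```xml
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

It can also be changed per file from the shell, e.g. with hadoop fs -setrep.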
Underlying FS options

ext3:
- released in 2001; used by Yahoo!
- bootstrap + format are slow
- set: noatime; tune2fs (to turn off reserved blocks)

ext4:
- released in 2008; used by Google
- as fast as XFS
- set: delayed allocation off; noatime; tune2fs (to turn off reserved blocks)

XFS:
- released in 1993; fast
- drawback: deleting a large number of files is slow
Sample HDFS Shell Commands

bin/hadoop fs -ls
bin/hadoop fs -mkdir
bin/hadoop fs -copyFromLocal
bin/hadoop fs -copyToLocal
bin/hadoop fs -moveToLocal
bin/hadoop fs -rm
bin/hadoop fs -tail
bin/hadoop fs -chmod
bin/hadoop fs -setrep -w 4 -R /dir1/s-dir
Mounting using FUSE:
hadoop-fuse-dfs dfs://10.73.9.50 /hdfs
Network Topology

(Diagram: the NameNode, JobTracker and HBase Master sit above three racks of DataNodes.)

Yahoo! installation:
- 8 core switches
- 100 racks, 40 servers per rack
- 1 Gbit/s within a rack, 10 Gbit/s among racks
- 11 PB total
Rack Awareness

NameNode metadata:
file.txt = Blk A: DataNodes 2, 7, 8; Blk B: DataNodes 9, 12, 14

(Diagram: three racks of DataNodes; each block's replicas span two racks (Blk A on node 2 in Rack 1 and nodes 7, 8 in Rack 2; Blk B on node 9 in Rack 2 and nodes 12, 14 in Rack 3), so losing a whole rack never loses a block.)
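The placement policy the diagram illustrates (replicas of one block spread over two racks, so a whole-rack failure never loses a block) can be modelled roughly as follows; the rack and node numbers mirror the diagram, and the function is a simplified stand-in for HDFS's real placement policy:

```python
import random

def place_replicas(racks, r=3, seed=1):
    """Pick r DataNodes for one block: one replica on a first rack, the
    remaining r-1 replicas together on a different rack."""
    rng = random.Random(seed)
    first_rack, second_rack = rng.sample(sorted(racks), 2)
    nodes = [rng.choice(racks[first_rack])]
    nodes += rng.sample(racks[second_rack], r - 1)
    return nodes

racks = {"Rack 1": [2, 3, 4, 5], "Rack 2": [7, 8, 9, 10], "Rack 3": [12, 13, 14, 15]}
print(place_replicas(racks))  # three distinct DataNodes spanning exactly two racks
```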
HDFS Writes

NameNode metadata:
file.txt = Blk A: DataNodes 2, 7, 9

(Diagram: the client writes Blk A to a DataNode on Rack 1, which pipelines it over the core switch to two more DataNodes on Rack 2; the NameNode records the locations.)
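A toy model of the write path shown above (node numbers from the diagram; the dictionaries stand in for the NameNode's metadata and the DataNodes' disks):

```python
metadata = {}  # the "NameNode": filename -> {block: [datanode, ...]}
storage = {}   # the "DataNodes": datanode -> {block: bytes}

def write_block(filename, block, data, pipeline):
    """The client sends the block to the first DataNode; each node stores a
    copy and forwards the data to the next node in the pipeline."""
    for datanode in pipeline:
        storage.setdefault(datanode, {})[block] = data
    metadata.setdefault(filename, {})[block] = list(pipeline)

write_block("file.txt", "A", b"...", pipeline=[2, 7, 9])
print(metadata)  # {'file.txt': {'A': [2, 7, 9]}}
```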
Reading Files

NameNode metadata:
file1.txt = Blk A: DataNodes 2, 7, 8; Blk B: DataNodes 9, 12, 14

(Diagram: the client asks the NameNode "wanna read file1.txt"; the NameNode replies with the block locations (Blk A on 2, 7, 8; Blk B on 9, 12, 14), and the client then reads each block from one of those DataNodes across the core switch.)
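The read path in the same toy style (block locations copied from the slide's metadata; picking the first listed DataNode stands in for choosing the closest replica):

```python
# The "NameNode" answer for file1.txt, as shown on the slide.
metadata = {"file1.txt": {"A": [2, 7, 8], "B": [9, 12, 14]}}

def read_plan(filename):
    """One round-trip to the NameNode for block locations, then one DataNode
    is chosen per block (a real client prefers the nearest replica)."""
    return {block: datanodes[0] for block, datanodes in metadata[filename].items()}

print(read_plan("file1.txt"))  # {'A': 2, 'B': 9}
```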