Introduction to Big Data & Hadoop Architecture

Module 1
• What is Big Data?
• Hadoop Ecosystem Components
• Hadoop Architecture
• Hadoop Storage: HDFS
• Hadoop Processing: MapReduce Framework
• Hadoop Server Roles: NameNode, DataNode, Secondary NameNode
• Anatomy of File Read and Write
What is Big Data?
• Walmart handles more than one million customer transactions every hour.
• Facebook stores more than 40 billion photos from its user base.
• The New York Stock Exchange generates about one TB of new trade data per day.
• Last.fm hosts approximately 25 million users, taking up one TB of storage daily.
• Twitter generates 7 TB of data daily.
• IBM claims 90% of today’s stored data was generated in the last two years.
Three Characteristics of Big Data: The 3Vs
• Volume (data quantity)
  • Facebook ingests 500 TB of new data every day.
  • A Boeing 737 generates 240 TB of flight data during a single journey.
• Velocity (data speed)
  • High-frequency stock trading algorithms reflect market changes within microseconds.
  • Clickstreams capture user behavior at millions of events per second.
• Variety (data types)
  • Geospatial data, audio and video, unstructured text.
The Structure of Big Data
• Structured: CSV files, data stored in an RDBMS
• Semi-structured: XML, JSON, SGML
• Unstructured: video, audio, images
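The distinction above can be made concrete with a short sketch using Python's standard library. The sample records below are invented for illustration:

```python
import csv
import io
import json

# Structured: fixed columns with a rigid schema, as in a CSV export
# or an RDBMS table.
csv_text = "user_id,name,age\n1,Alice,34\n2,Bob,29\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing keys, nesting allowed, and the
# schema may vary from record to record.
json_text = '{"user_id": 1, "name": "Alice", "tags": ["vip", "beta"]}'
record = json.loads(json_text)

# Unstructured: raw bytes (video, audio, images) with no field layout;
# only byte-level operations or specialised processing apply.
image_bytes = bytes([0x89, 0x50, 0x4E, 0x47])  # first bytes of a PNG header

print(rows[0]["name"])   # field access via the fixed schema
print(record["tags"])    # field access via self-describing keys
print(len(image_bytes))  # no schema: only the byte length is known
```

The further right you move along this spectrum, the less a traditional RDBMS helps, which is where Hadoop-style processing becomes attractive.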
How Does Big Data Impact IT?
• By 2016, Big Data will account for 4.4 million IT jobs, 1.9 million of them in the US alone.
• India will require at least 100,000 (1 lakh) data scientists in the next couple of years, in addition to data analysts and data managers, to support the Big Data space.
• The opportunity for Indian service providers lies in offering services around Big Data implementation and analytics for global multinationals.
What is Hadoop?
• Apache™ Hadoop® is an open-source software project that enables the distributed processing of large data sets across clusters of commodity servers.
• It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.
• Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer.
Hadoop – Ecosystem Components
Hadoop Ecosystem
• Pig: a scripting language that simplifies the creation of MapReduce jobs and excels at exploring and transforming data.
• Hive: provides SQL-like access to your Big Data.
• HBase: a Hadoop database.
• Sqoop: for efficiently transferring bulk data between Hadoop and relational databases.
• Oozie: a workflow scheduler system to manage Apache Hadoop jobs.
• Flume: for efficiently collecting, aggregating, and moving large amounts of log data.
Hadoop Architecture
Hadoop – Core Components
• HDFS: a file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system. HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes.
• Map/Reduce: the data processing framework that understands and assigns work to the nodes in a cluster.
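The Map/Reduce model above can be sketched in a few lines of plain Python. This is an in-process word count, not the Hadoop API: the function names and sample input are illustrative, and the framework's real job is to run the same three phases in parallel across cluster nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle/sort: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the grouped values for one key."""
    return key, sum(values)

lines = ["Hadoop stores data", "Hadoop processes data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts["hadoop"])  # → 2
```

Because each map call sees only one line and each reduce call sees only one key's values, the framework can scatter both phases across thousands of machines without changing the user's code.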
Anatomy of a File Read
See Hadoop: The Definitive Guide, page 70.
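The read path described in the Definitive Guide can be simulated with a toy model: the client asks the NameNode for each block's locations, then streams each block from one of the DataNodes holding a replica. All names and data structures below are illustrative; real HDFS uses RPC between daemons, not dictionaries.

```python
# NameNode metadata: file -> ordered list of blocks, each with the
# DataNodes that hold a replica (replication factor 3 here).
namenode = {
    "/logs/app.log": [
        {"block": "blk_1", "replicas": ["dn1", "dn3", "dn4"]},
        {"block": "blk_2", "replicas": ["dn2", "dn3", "dn5"]},
    ]
}

# DataNode storage: which block bytes each node actually holds.
datanodes = {
    "dn1": {"blk_1": b"part-one,"},
    "dn3": {"blk_1": b"part-one,", "blk_2": b"part-two"},
    "dn4": {"blk_1": b"part-one,"},
    "dn2": {"blk_2": b"part-two"},
    "dn5": {"blk_2": b"part-two"},
}

def read_file(path):
    data = b""
    for block_info in namenode[path]:                 # 1. ask NameNode for block locations
        host = block_info["replicas"][0]              # 2. pick the "closest" replica
        data += datanodes[host][block_info["block"]]  # 3. stream the block from that DataNode
    return data

print(read_file("/logs/app.log"))  # → b'part-one,part-two'
```

Note that the NameNode only serves metadata; the block bytes themselves flow directly from DataNodes to the client, which is what lets reads scale with the number of nodes.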
Anatomy of a File Write
See Hadoop: The Definitive Guide, page 73.