Introduction to Big Data & Hadoop Architecture

Module 1
• What is Big Data?
• Hadoop Ecosystem Components
• Hadoop Architecture
• Hadoop Storage: HDFS
• Hadoop Processing: MapReduce Framework
• Hadoop Server Roles: NameNode, DataNode, Secondary NameNode
• Anatomy of File Read and Write
What is Big Data?
• Walmart handles more than one million customer transactions every hour.
• Facebook stores more than 40 billion photos from its user base.
• The New York Stock Exchange generates about one TB of new trade data per day.
• Last.fm hosts approximately 25 million users, taking up one TB of storage daily.
• Twitter generates 7 TB of data daily.
• IBM claims 90% of today’s stored data was generated in the last two years.
Three Characteristics of Big Data: The 3Vs
• Volume (data quantity)
  • Facebook ingests 500 TB of new data every day.
  • A Boeing 737 generates 240 TB of flight data during a single journey.
• Velocity (data speed)
  • High-frequency stock trading algorithms reflect market changes within microseconds.
  • Clickstreams capture user behavior at millions of events per second.
• Variety (data types)
  • Geospatial data, audio and video, unstructured text.
The Structure of Big Data
• Structured: CSV files, data stored in an RDBMS
• Semi-structured: XML, JSON, SGML
• Unstructured: video, audio, images
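The distinction above can be made concrete with a short sketch using Python's standard library. The sample records below are invented for illustration:

```python
import csv
import io
import json

# Structured: fixed columns with a rigid schema, as in a CSV export
# or an RDBMS table.
csv_text = "user_id,name,age\n1,Alice,34\n2,Bob,29\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing keys, nesting allowed, and the
# schema may vary from record to record.
json_text = '{"user_id": 1, "name": "Alice", "tags": ["vip", "beta"]}'
record = json.loads(json_text)

# Unstructured: raw bytes (video, audio, images) with no field layout;
# only byte-level operations or specialised processing apply.
image_bytes = bytes([0x89, 0x50, 0x4E, 0x47])  # first bytes of a PNG header

print(rows[0]["name"])   # field access via the fixed schema
print(record["tags"])    # field access via self-describing keys
print(len(image_bytes))  # no schema: only the byte length is known
```

The further right you move along this spectrum, the less a traditional RDBMS helps, which is where Hadoop-style processing becomes attractive.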
How Does Big Data Impact IT?
• By 2016, Big Data will account for 4.4 million IT jobs, 1.9 million of them in the US alone.
• India will require at least 100,000 (1 lakh) data scientists in the next couple of years, in addition to data analysts and data managers, to support the Big Data space.
• The opportunity for Indian service providers lies in offering services around Big Data implementation and analytics for global multinationals.
What is Hadoop?
• Apache™ Hadoop® is an open-source software project that enables the distributed processing of large data sets across clusters of commodity servers.
• It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.
• Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer.
Hadoop – Ecosystem Components
Hadoop Ecosystem
• Pig: a scripting language that simplifies the creation of MapReduce jobs and excels at exploring and transforming data.
• Hive: provides SQL-like access to your Big Data.
• HBase: a Hadoop database.
• Sqoop: for efficiently transferring bulk data between Hadoop and relational databases.
• Oozie: a workflow scheduler system to manage Apache Hadoop jobs.
• Flume: for efficiently collecting, aggregating, and moving large amounts of log data.
Hadoop Architecture
Hadoop – Core Components
• HDFS: a file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system. HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes.
• Map/Reduce: the data processing framework that understands and assigns work to the nodes in a cluster.
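The Map/Reduce model above can be sketched in a few lines of plain Python. This is an in-process word count, not the Hadoop API: the function names and sample input are illustrative, and the framework's real job is to run the same three phases in parallel across cluster nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle/sort: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the grouped values for one key."""
    return key, sum(values)

lines = ["Hadoop stores data", "Hadoop processes data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts["hadoop"])  # → 2
```

Because each map call sees only one line and each reduce call sees only one key's values, the framework can scatter both phases across thousands of machines without changing the user's code.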
Anatomy of a File Read
See Hadoop: The Definitive Guide, page 70.
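The read path described in the Definitive Guide can be simulated with a toy model: the client asks the NameNode for each block's locations, then streams each block from one of the DataNodes holding a replica. All names and data structures below are illustrative; real HDFS uses RPC between daemons, not dictionaries.

```python
# NameNode metadata: file -> ordered list of blocks, each with the
# DataNodes that hold a replica (replication factor 3 here).
namenode = {
    "/logs/app.log": [
        {"block": "blk_1", "replicas": ["dn1", "dn3", "dn4"]},
        {"block": "blk_2", "replicas": ["dn2", "dn3", "dn5"]},
    ]
}

# DataNode storage: which block bytes each node actually holds.
datanodes = {
    "dn1": {"blk_1": b"part-one,"},
    "dn3": {"blk_1": b"part-one,", "blk_2": b"part-two"},
    "dn4": {"blk_1": b"part-one,"},
    "dn2": {"blk_2": b"part-two"},
    "dn5": {"blk_2": b"part-two"},
}

def read_file(path):
    data = b""
    for block_info in namenode[path]:                 # 1. ask NameNode for block locations
        host = block_info["replicas"][0]              # 2. pick the "closest" replica
        data += datanodes[host][block_info["block"]]  # 3. stream the block from that DataNode
    return data

print(read_file("/logs/app.log"))  # → b'part-one,part-two'
```

Note that the NameNode only serves metadata; the block bytes themselves flow directly from DataNodes to the client, which is what lets reads scale with the number of nodes.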
Anatomy of a File Write
See Hadoop: The Definitive Guide, page 73.