Introduction to Big Data & Hadoop Architecture

Introduction to Big Data & Hadoop Architecture - Module 1



Page 1: Introduction to Big Data & Hadoop Architecture - Module 1

Introduction to Big Data & Hadoop Architecture

Page 2: Introduction to Big Data & Hadoop Architecture - Module 1

Module 1
• What is Big Data?
• Hadoop Ecosystem Components
• Hadoop Architecture
• Hadoop Storage: HDFS
• Hadoop Processing: MapReduce Framework
• Hadoop Server Roles: NameNode, DataNode, Secondary NameNode
• Anatomy of File Read and Write

Page 3: Introduction to Big Data & Hadoop Architecture - Module 1

What is Big Data?
• Walmart handles more than one million customer transactions every hour.
• Facebook handles 40 billion photos from its user base.
• The New York Stock Exchange generates about one TB of new trade data per day.
• Last.fm hosts approximately 25 million users, taking up one TB of storage daily.
• Twitter generates 7 TB of data daily.
• IBM claims 90% of today’s stored data was generated in the last two years.

Page 4: Introduction to Big Data & Hadoop Architecture - Module 1

Three Characteristics of Big Data: The 3 Vs
• Volume (data quantity)
  • Facebook ingests 500 TB of new data every day.
  • A Boeing 737 generates 240 TB of flight data during a single journey.
• Velocity (data speed)
  • High-frequency stock trading algorithms reflect market changes within microseconds.
  • Clickstreams capture user behavior at millions of events per second.
• Variety (data types)
  • Geospatial data, audio and video, unstructured text.

Page 5: Introduction to Big Data & Hadoop Architecture - Module 1

The Structure of Big Data
• Structured: CSV files, data stored in an RDBMS
• Semi-structured: XML, JSON, SGML
• Unstructured: video, audio, images

Page 6: Introduction to Big Data & Hadoop Architecture - Module 1

How Does Big Data Impact IT?
• By 2016 there will be 4.4 million IT jobs in Big Data, 1.9 million of them in the US alone.
• India will require a minimum of 1 lakh (100,000) data scientists in the next couple of years, in addition to data analysts and data managers, to support the Big Data space.
• The opportunity for Indian service providers lies in offering services around Big Data implementation and analytics for global multinationals.


Page 8: Introduction to Big Data & Hadoop Architecture - Module 1

What is Hadoop?
• Apache™ Hadoop® is an open-source software project that enables the distributed processing of large data sets across clusters of commodity servers.
• It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.
• Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer.

Page 9: Introduction to Big Data & Hadoop Architecture - Module 1

Hadoop – Ecosystem Components

Page 10: Introduction to Big Data & Hadoop Architecture - Module 1

Hadoop Ecosystem
• Pig: A scripting language that simplifies the creation of MapReduce jobs and excels at exploring and transforming data.
• Hive: Provides SQL-like access to your Big Data (see the sketch after this list).
• HBase: A Hadoop database.
• Sqoop: For efficiently transferring bulk data between Hadoop and relational databases.
• Oozie: A workflow scheduler system to manage Apache Hadoop jobs.
• Flume: For efficiently collecting, aggregating, and moving large amounts of log data.
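As a hedged illustration of what “SQL-like access” through Hive can look like in practice, here is a minimal Java sketch that queries HiveServer2 over JDBC. The host name (hive-server), table (clicks), and column (page) are placeholders, not part of the original slides.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Query Hive over JDBC (HiveServer2); host, table, and column names are placeholders.
public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection conn = DriverManager.getConnection(
        "jdbc:hive2://hive-server:10000/default", "hadoop", "");
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT page, COUNT(*) FROM clicks GROUP BY page");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    rs.close();
    stmt.close();
    conn.close();
  }
}

Behind the scenes, Hive compiles the query into MapReduce jobs that run on the cluster, so analysts can work in familiar SQL without writing Java.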

Page 11: Introduction to Big Data & Hadoop Architecture - Module 1

Hadoop Architecture

Page 12: Introduction to Big Data & Hadoop Architecture - Module 1

Hadoop – Core Components
• HDFS – A file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system. HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes.
• Map/Reduce – The data processing framework that understands and assigns work to the nodes in a cluster (a minimal example follows below).
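The classic illustration of the Map/Reduce model is word count: the map phase emits a (word, 1) pair for every word it sees, and the reduce phase sums those counts per word. The sketch below follows the standard org.apache.hadoop.mapreduce API; input and output paths come from the command line and are purely illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: for every input line, emit (word, 1) for each word.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the 1s emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner pre-aggregates map output locally
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The framework takes care of the rest: splitting the input, scheduling map tasks on the nodes that hold the data, shuffling intermediate pairs to the reducers, and re-running any task that fails.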

Page 13: Introduction to Big Data & Hadoop Architecture - Module 1

Anatomy of a File Read

See Hadoop: The Definitive Guide, page 70.
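To ground the read path in code, here is a minimal sketch of a client reading a file through the HDFS Java API (FileSystem); the file URI is a placeholder.

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Minimal HDFS read: print the contents of a file to standard output.
public class HdfsCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0]; // e.g. hdfs://namenode:8020/user/data/input.txt (placeholder)
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));                   // returns an FSDataInputStream
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}

Under the hood, open() asks the NameNode for the locations of the file’s blocks, and the returned stream then reads each block directly from the closest DataNode, falling back to another replica if one fails.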

Page 14: Introduction to Big Data & Hadoop Architecture - Module 1

Anatomy of a File Write

See Hadoop: The Definitive Guide, page 73.
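And the matching write path, again as a minimal sketch with placeholder paths: the client copies a local file into HDFS through the same FileSystem API.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Minimal HDFS write: copy a local file into the cluster.
public class HdfsCopyFromLocal {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0]; // local file to upload (placeholder)
    String dst = args[1];      // e.g. hdfs://namenode:8020/user/data/output.txt (placeholder)
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    OutputStream out = fs.create(new Path(dst));   // returns an FSDataOutputStream
    IOUtils.copyBytes(in, out, 4096, true);        // copy and close both streams
  }
}

Here create() asks the NameNode to record the new file, and the bytes are then written block by block through a pipeline of DataNodes, each node forwarding the data to the next replica before acknowledging.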