Nilay Mishra ([email protected]) Introduction OF Big Data And Hadoop

Page 1: Introduction_OF_Hadoop_and_BigData

Nilay Mishra

([email protected])

Introduction OF Big Data And Hadoop

Page 2: Introduction_OF_Hadoop_and_BigData

Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools.

“Big data” isn’t just a technology; it’s a business strategy for capitalizing on information resources.

Page 3: Introduction_OF_Hadoop_and_BigData

Social media and networks (all of us are generating data)

Scientific instruments (collecting all sorts of data)

Mobile devices (tracking all objects all the time)

Sensor technology and networks (measuring all kinds of data)

Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely and scalable manner.


Page 4: Introduction_OF_Hadoop_and_BigData


Page 5: Introduction_OF_Hadoop_and_BigData

Analyst: I need to evaluate the possible relationship between client salary and overdrafts.

IT: OK. We have to evaluate a lot of statistics, set the correct db indexes and db partitioning. It will take us 5 days.

Page 6: Introduction_OF_Hadoop_and_BigData

IT: Done. You can run your analytical query.

Analyst: Great. Thanks a lot. I’m going to check the results.

Page 7: Introduction_OF_Hadoop_and_BigData

Analyst: Great. I can see some nice correlations here. Now I need to look at it from a different perspective.

IT: Ohhh, welcome dear friend. Understood. So, it’s … another 5 days of our work.

Analyst: Noooo!!! It’s not possible to work here!

Page 8: Introduction_OF_Hadoop_and_BigData


Page 9: Introduction_OF_Hadoop_and_BigData
Page 10: Introduction_OF_Hadoop_and_BigData

Hadoop Distributed File System

Data is organized into files and directories.

Files are divided into blocks and distributed across cluster nodes.

Block placement is decided at runtime.

Replication

Blocks are replicated across nodes to tolerate failures.

Checksums are used to verify data integrity.
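
The HDFS ideas above (blocks, placement, replication, checksums) can be illustrated with a small in-memory toy. This is only a sketch under stated assumptions, not how HDFS is implemented: the function names, the round-robin placement, the 4-byte block size, and the use of `zlib.crc32` are all choices made for illustration.

```python
import zlib
from collections import defaultdict

BLOCK_SIZE = 4   # bytes; tiny on purpose (real HDFS blocks are many megabytes)
REPLICATION = 3  # copies kept of every block (assumed value for this toy)


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file's bytes into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, nodes, replication=REPLICATION):
    """Store each block on `replication` distinct nodes (simple round-robin).

    Returns {node: {block_id: (payload, checksum)}}.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    storage = defaultdict(dict)
    for block_id, payload in enumerate(blocks):
        checksum = zlib.crc32(payload)  # recorded once, when the block is written
        for r in range(replication):
            node = nodes[(block_id + r) % len(nodes)]
            storage[node][block_id] = (payload, checksum)
    return storage


def read_block(storage, block_id):
    """Return the first replica whose payload still matches its checksum."""
    for held in storage.values():
        if block_id in held:
            payload, checksum = held[block_id]
            if zlib.crc32(payload) == checksum:
                return payload
    raise IOError(f"no intact replica of block {block_id}")


def read_file(storage, num_blocks):
    """Reassemble the file from one intact replica of each block."""
    return b"".join(read_block(storage, i) for i in range(num_blocks))


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names
    blocks = split_into_blocks(b"hello hadoop")
    storage = place_blocks(blocks, nodes)
    # Corrupt one replica of block 0; the read falls back to another copy.
    _, checksum = storage["node-a"][0]
    storage["node-a"][0] = (b"XXXX", checksum)
    print(read_file(storage, len(blocks)))
```

In the demo, the corrupted replica fails its checksum comparison, so the read is served from another node that holds the same block.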

Page 11: Introduction_OF_Hadoop_and_BigData
Page 12: Introduction_OF_Hadoop_and_BigData
Page 13: Introduction_OF_Hadoop_and_BigData
Page 14: Introduction_OF_Hadoop_and_BigData

Functional Programming Meets Distributed Processing.

Automatic parallelization and distributed processing.
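
A minimal sketch of why the functional style parallelizes so easily: a pure function can be applied to every item independently, and the results are then folded together. Here a thread pool on one machine stands in for cluster nodes; the helper name `parallel_map_reduce` and the `square` example are hypothetical, and nothing in this sketch is Hadoop's API.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add


def square(x):
    """A pure function: the result depends only on the input, no shared state."""
    return x * x


def parallel_map_reduce(func, combine, items, initial, workers=4):
    """Apply `func` to every item on a worker pool, then fold the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = pool.map(func, items)  # results are yielded in input order
        return reduce(combine, mapped, initial)


if __name__ == "__main__":
    # Sum of squares of 1..4, computed as map (square) followed by reduce (add).
    print(parallel_map_reduce(square, add, range(1, 5), 0))
```

Because `square` touches no shared state, the caller never has to think about which worker handles which item; that is the property a distributed framework relies on.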

Page 15: Introduction_OF_Hadoop_and_BigData
Page 16: Introduction_OF_Hadoop_and_BigData
Page 17: Introduction_OF_Hadoop_and_BigData
Page 18: Introduction_OF_Hadoop_and_BigData

[Figure: MapReduce data flow]

Input blocks on HDFS

Map (parse-hash): produces (k, v), e.g. (k, 1)

Shuffle & sorting based on k

Reduce: consumes (k, [v]), e.g. (k, [1, 1, 1, 1, 1, 1, ...]); produces (k’, v’), e.g. (k, 100)

Users only provide the “Map” and “Reduce” functions.
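
The map, shuffle-and-sort, and reduce stages can be sketched as a single-process word count. This is an illustrative toy, not Hadoop's API: `map_words` and `reduce_counts` play the role of the two user-supplied functions, while `shuffle_and_sort` and `run_job` are hypothetical stand-ins for what the framework does on the user's behalf.

```python
from collections import defaultdict


def map_words(block):
    """Map: emit a (word, 1) pair for every word in one input block."""
    for word in block.lower().split():
        yield word, 1


def reduce_counts(key, values):
    """Reduce: collapse (k, [v]) into a single (k', v') pair."""
    return key, sum(values)


def shuffle_and_sort(pairs):
    """Framework step: group values by key and order the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def run_job(blocks, mapper, reducer):
    """Framework step: wire map -> shuffle/sort -> reduce together."""
    mapped = (pair for block in blocks for pair in mapper(block))
    return [reducer(key, values) for key, values in shuffle_and_sort(mapped)]


if __name__ == "__main__":
    blocks = ["big data big", "data hadoop big"]  # stand-ins for HDFS blocks
    print(run_job(blocks, map_words, reduce_counts))
```

Only the first two functions are specific to word counting; swapping them for a different mapper and reducer changes the job without touching the grouping logic.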

Page 19: Introduction_OF_Hadoop_and_BigData
Page 20: Introduction_OF_Hadoop_and_BigData
Page 21: Introduction_OF_Hadoop_and_BigData

Apache Avro: a data serialization system, used for communication between Hadoop nodes

Cassandra and HBase: non-relational databases that can be used with Hadoop; HBase runs on top of HDFS

Hive: a data warehouse layer with a SQL-like query language (HiveQL) that runs on Hadoop

Mahout: a machine learning library; that is, it assists with filtering data for analysis and exploration

Pig: a data-flow language (Pig Latin) and execution framework for parallel computation

ZooKeeper: a coordination service that keeps all the parts coordinated and working together

Page 22: Introduction_OF_Hadoop_and_BigData