Introduction_OF_Hadoop_and_BigData

Nilay Mishra

([email protected])

Introduction to Big Data and Hadoop

Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools.

“Big data” isn’t just a technology; it’s a business strategy for capitalizing on information resources.

Social media and networks (all of us are generating data)

Scientific instruments (collecting all sorts of data)

Mobile devices (tracking all objects all the time)

Sensor technology and networks (measuring all kinds of data)

Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.

Analyst: I need to evaluate the possible relationship between client salary and overdrafts.

IT: OK. We have to evaluate a lot of statistics and set the correct DB indexes and DB partitioning. It will take us 5 days.

IT: Done. You can run your analytical query.

Analyst: Great. Thanks a lot. I’m going to check the results.

Analyst: Great. I can see some nice correlations here. Now I need to look at it from a different perspective.

IT: Ohhh, welcome dear friend. Understood. So, it’s … another 5 days of our work.

Analyst: Noooo!!! It’s not possible to work here!

Hadoop Distributed File System

Data is organized into files and directories.

Files are divided into blocks, and the blocks are distributed across cluster nodes.

Block placement is done at runtime.
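
As a rough illustration of the points above, the following Python sketch splits a file's bytes into fixed-size blocks and assigns each block to a node at write time. The block size, the node names, and the round-robin placement are assumptions made for this example only; real HDFS uses much larger blocks (128 MB by default in recent versions) and its own placement policy.

```python
# Toy model of HDFS-style block splitting and placement (illustrative only).
from itertools import cycle

BLOCK_SIZE = 8  # bytes; tiny on purpose, a hypothetical value for the demo
NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster nodes


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut the file content into consecutive fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks: list[bytes], nodes: list[str]) -> dict[int, str]:
    """Assign each block index to a node (simple round-robin, an assumption)."""
    node_cycle = cycle(nodes)
    return {index: next(node_cycle) for index, _ in enumerate(blocks)}


blocks = split_into_blocks(b"big data needs many machines")
placement = place_blocks(blocks, NODES)
for index, node in placement.items():
    print(index, node, blocks[index])
```

The last block is shorter than the others, which mirrors how a file whose size is not a multiple of the block size ends with a partial block.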

Replication

Blocks are replicated across nodes so that data survives node and disk failures.

Checksums are used to verify data integrity.
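
The interplay of replication and checksums can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not HDFS's actual implementation: the node names and the `write_block` and `read_block` helpers are hypothetical, and `zlib.crc32` stands in for the checksum. The replication factor of 3 matches the HDFS default.

```python
# Toy model of block replication with checksum verification (illustrative only).
import zlib

REPLICATION = 3  # HDFS's default replication factor
NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical nodes


def write_block(block_id: int, data: bytes, storage: dict) -> int:
    """Store REPLICATION copies on distinct nodes and return the block checksum."""
    for offset in range(REPLICATION):
        node = NODES[(block_id + offset) % len(NODES)]
        storage.setdefault(node, {})[block_id] = data
    return zlib.crc32(data)


def read_block(block_id: int, checksum: int, storage: dict) -> bytes:
    """Return the first replica whose checksum matches the recorded one."""
    for node, blocks in storage.items():
        data = blocks.get(block_id)
        if data is not None and zlib.crc32(data) == checksum:
            return data
    raise IOError(f"no healthy replica of block {block_id}")


storage: dict = {}
checksum = write_block(0, b"hello hdfs", storage)
storage["node-a"][0] = b"hellX hdfs"  # simulate corruption on one node
recovered = read_block(0, checksum, storage)
print(recovered)
```

The corrupted copy on the first node fails the checksum comparison, so the read falls through to a healthy replica on another node.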

MapReduce: functional programming meets distributed processing.

Automatic parallelization and distributed processing.

[Figure: MapReduce data flow]

Input blocks on HDFS

Map (parse-hash): produces (k, v), e.g. (k, 1)

Shuffle and sort based on k

Reduce: consumes (k, [v]), e.g. (k, [1, 1, 1, 1, 1, 1, ...]), and produces (k’, v’), e.g. (k’, 100)

Users only provide the “Map” and “Reduce” functions
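
To make the division of labor concrete, here is a minimal in-memory word count written in the MapReduce style. It is a sketch, not Hadoop's API: `map_fn` and `reduce_fn` are the two user-supplied functions, while `run_job` is a hypothetical stand-in for the framework that does the shuffle and sort on a single machine.

```python
# Minimal in-memory word count in the MapReduce style (illustrative only).
from collections import defaultdict
from typing import Iterator


def map_fn(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit (word, 1) for every word in an input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: sum all counts collected for one key."""
    return key, sum(values)


def run_job(lines: list[str]) -> dict[str, int]:
    """Stand-in for the framework: map, shuffle and sort by key, then reduce."""
    groups: dict[str, list[int]] = defaultdict(list)
    for line in lines:                      # map phase
        for key, value in map_fn(line):
            groups[key].append(value)       # shuffle: group values by key
    # sort keys, then run the reduce function once per key
    return dict(reduce_fn(key, groups[key]) for key in sorted(groups))


counts = run_job(["big data big cluster", "data data"])
print(counts)
```

On a real cluster the map calls would run in parallel on the nodes holding the input blocks and the grouped values would travel over the network to the reducers, but the two user-written functions would keep the same shape.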

Apache Avro: a data serialization system, used for communication between Hadoop nodes

Cassandra and HBase: non-relational databases that can be used with Hadoop (HBase runs on top of HDFS)

Hive: a data warehouse layer with a SQL-like query language (HiveQL) that runs on Hadoop

Mahout: a machine learning library; that is, a tool to assist with filtering data for analysis and exploration

Pig: a data-flow language (Pig Latin) and execution framework for parallel computation

ZooKeeper: a coordination service that keeps all the parts coordinated and working together
