nilay-mishra
Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools.
“Big data” isn’t just a technology; it’s a business strategy for capitalizing on information resources.
Social media and networks (all of us are generating data)
Scientific instruments (collecting all sorts of data)
Mobile devices (tracking all objects all the time)
Sensor technology and networks (measuring all kinds of data)
Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.
Analyst: I need to evaluate the possible relationship between client salary and overdrafts.
IT: OK. We have to evaluate a lot of statistics and set the correct DB indexes and DB partitioning. It will take us 5 days.
IT: Done. You can run your analytical query.
Analyst: Great. Thanks a lot. I’m going to check the results.
Analyst: Great. I can see some nice correlations here. Now I need to look at it from a different perspective.
IT: Ohhh, welcome, dear friend. Understood. So, it’s … another 5 days of our work.
Analyst: Noooo!!! It’s not possible to work here!
Hadoop Distributed File System
Data is organized into files and directories.
Files are divided into blocks and distributed across cluster nodes.
Block placement is done at runtime.
Replication
Blocks are replicated across nodes to tolerate failures.
Checksums are used to check data integrity.
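The block, replication, and checksum ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how HDFS is implemented: the helper names (`split_into_blocks`, `place_blocks`, `verify_block`), the node names, the tiny block size, and the round-robin placement are all hypothetical, and CRC32 stands in for whatever checksum a real cluster uses.

```python
import zlib


def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes and record a checksum.

    Round-robin placement is an assumption made for this sketch only.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    placement = {}
    for idx, block in enumerate(blocks):
        replicas = [nodes[(idx + r) % len(nodes)] for r in range(replication)]
        placement[idx] = {"checksum": zlib.crc32(block), "nodes": replicas}
    return placement


def verify_block(block: bytes, expected_checksum: int) -> bool:
    """Recompute the checksum on read and compare it with the stored one."""
    return zlib.crc32(block) == expected_checksum


data = b"big data " * 10                      # a 90-byte "file"
blocks = split_into_blocks(data, block_size=32)
nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
placement = place_blocks(blocks, nodes, replication=3)

for idx, info in placement.items():
    print(idx, info["nodes"], verify_block(blocks[idx], info["checksum"]))
```

If one node is lost, every block in this sketch still has two other copies, and a block whose recomputed checksum does not match the stored value can be discarded and re-read from another replica.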
Functional Programming Meets Distributed Processing.
Automatic parallelization and distributed processing.
MapReduce data flow (figure):
Input blocks on HDFS
Map (parse-hash): produces (k, v), e.g. (k, 1)
Shuffle & sorting based on k
Reduce: consumes (k, [v]), e.g. (k, [1,1,1,1,1,1..]); produces (k’, v’), e.g. (k’, 100)
Users only provide the “Map” and “Reduce” functions
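To make the division of labor concrete, here is a minimal single-process sketch of the map, shuffle/sort, and reduce stages. It is not the Hadoop API: `run_mapreduce` is a hypothetical stand-in for the framework, and word counting is an assumed example chosen because it matches the (k, 1) pairs in the data flow. Only `map_fn` and `reduce_fn` correspond to what a user would write.

```python
from collections import defaultdict


def map_fn(line):
    """User-supplied map: emit (word, 1) for every word in an input record."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """User-supplied reduce: collapse all values for one key into a total."""
    return key, sum(values)


def run_mapreduce(blocks, map_fn, reduce_fn):
    """Hypothetical stand-in for the framework (runs sequentially here)."""
    # Map phase: each input block is processed independently.
    intermediate = []
    for block in blocks:
        intermediate.extend(map_fn(block))

    # Shuffle and sort: group all values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one call per key, visited in sorted key order.
    return dict(reduce_fn(key, groups[key]) for key in sorted(groups))


blocks = ["big data is big", "data is everywhere"]  # stand-ins for HDFS blocks
counts = run_mapreduce(blocks, map_fn, reduce_fn)
print(counts)
```

In a real cluster the map calls run in parallel on the nodes that hold the blocks and the grouped keys are partitioned across several reducers, but the two user functions keep the same shape.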
Apache Avro: a data serialization system, used for communication between Hadoop nodes
Cassandra and HBase: non-relational databases that can be used with Hadoop (HBase runs on top of HDFS)
Hive: a data warehouse layer with an SQL-like query language (HiveQL) that runs on Hadoop
Mahout: a machine learning library, used to assist with filtering data for analysis and exploration
Pig: a data-flow language (Pig Latin) and execution framework for parallel computation
ZooKeeper: a coordination service that keeps all the parts working together