Elementary, my dear… BigData
"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay."Sherlock Holmes – The Adventure of the Copper Beeches
Constantin Ciureanu, Software Architect ([email protected])
Project "Hadoop Data Transformation"
Small Data
• The fact that Excel cannot handle the amount of data doesn't by itself make it BigData
• "Small data" usually means:
– Very clear, structured data
– Datasets whose content is easy to understand and to join with other datasets
• Keeping data small means:
– Storing only a limited number of records / only the relevant columns
– Processing / sampling / aggregating, then storing the results and dropping the rest (the original raw data is lost, so further analysis is no longer possible and creating new metrics is almost impossible)
– Dropping old partitions periodically
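The aggregate-then-drop trade-off above can be sketched in a few lines of Python (the data and field names are hypothetical): once only the aggregate survives, a new metric over the raw events can no longer be derived.

```python
# Sketch: aggregate raw events, then drop them to keep storage small.
from collections import defaultdict

raw_events = [
    {"user": "alice", "amount": 30},
    {"user": "alice", "amount": 70},
    {"user": "bob",   "amount": 50},
]

# Aggregate: total amount per user.
totals = defaultdict(int)
for event in raw_events:
    totals[event["user"]] += event["amount"]

raw_events = None  # raw data dropped -- only the aggregate remains

print(dict(totals))  # {'alice': 100, 'bob': 50}
# A new metric such as "largest single purchase per user" would need the
# raw events again -- it cannot be recomputed from the totals alone.
```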
Intro – Interesting facts
• The volume of business data worldwide, across all companies, doubles every 1.2 years (it used to be 1.5 years)
• A regular person processes more data in a day than a 16th-century individual did in an entire lifetime
• In recent years, the cost of storage and processing power has dropped significantly
• Bad data or poor data quality costs US businesses $600 billion annually
• Big data will drive $232 billion in spending through 2016 (Gartner)
• By 2015, 4.4 million IT jobs globally will be created to support big data (Gartner)
• Facebook processes 10 TB of data every day; Twitter processes 7 TB
• Google has over 3 million servers, processing over 2 trillion searches per year as of 2012 (only 22 million in 2000)
De-mystifying BigData
• #1 There's no such thing as too big! Just data of unusual size, meaning that a brute-force approach is out of the question
• #2 Data is everywhere! You just need to collect, store and understand it!
• "Big data represents a new era in data exploration and utilization."
• Sources of big datasets:
– Transactions
– Logs
– Emails
– Social media (tweets, messages, posts, likes)
– Network elements
– (Sampled) events and measurements from sensors (e.g. the LHC)
– User interactions (e.g. shopping sessions / clicks / scrolls & mouse hovers)
– Geospatial data
– Audio / images / video
– External sources (there are companies selling data)
BigData – domains of interest
• Computer science: e-commerce, social networks, banking and finance markets, advertising, telecom, media, analytics, visualization, machine learning, predictions, ads, personalized recommendations (better customer understanding), data mining
• Statistics – analysis with better calculation precision
• Health, medicine & biology (e.g. "Decoding the human genome originally took 10 years; now it can be achieved in less than a week")
• Processing BigData requires distributed architectures and algorithms (most based on a "divide and conquer" approach, while others also rely heavily on sampling)
Reasons to use BigData
• Continuously decreasing cost of hardware and storage
• There are companies out there storing everything; some make the mistake of storing all their logs as "big data"
• Work faster and at scale, considering every piece of data rather than just a sample
• Sell better (gain an advantage in market analysis and expand into new markets)
• Few companies actually value BigData and are able to properly explore and use its entire potential
• The "divide and conquer" approach requires changing the way you think!
• Sharded data goes nicely along with the Bloom filter concept
• Working with BigData is very difficult because it is mostly unclear, unstructured, and implies using huge files
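The Bloom filter idea mentioned above can be sketched as follows. This is a minimal illustration with assumed sizing, not a production design: a bit array plus k hash functions, where membership tests can give false positives but never false negatives, which is handy for skipping shards that certainly do not contain a key.

```python
# Minimal Bloom filter sketch (assumed parameters: 1024 bits, 3 hashes).
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k independent positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely absent"; True means "probably present".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))  # True -- no false negatives
```

A shard's Bloom filter answering False lets a query skip that shard entirely, at the cost of occasionally reading a shard that turns out not to hold the key.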
Hadoop - Introduction
• Hadoop is just a small piece of the BigData puzzle
• As we all know, Google is far ahead of the rest of the BigData market
• 2004 – Google published a paper on a process called MapReduce (a framework providing a parallel processing model and an associated implementation for processing huge amounts of data)
• Google File System (GFS - recently Colossus) -> Apache HDFS
• Google MapReduce -> Apache MapReduce
• Google BigTable -> Apache HBase
Hadoop vs. Other Systems
• Computing Model
– Distributed databases: notion of transactions; the transaction is the unit of work; ACID properties, concurrency control; real-time
– Hadoop: notion of jobs; the job is the unit of work; no concurrency control; not real-time
• Data Model
– Distributed databases: structured data with a known schema; read/write mode
– Hadoop: any data will fit, in any format – (un)(semi)structured; read-only mode (append mode added in Hadoop v2)
• Cost Model
– Distributed databases: expensive servers
– Hadoop: cheap commodity machines
• Fault Tolerance
– Distributed databases: failures are rare; recovery mechanisms
– Hadoop: failures are common over thousands of machines; simple yet efficient fault tolerance (data replication, high availability, self-healing)
• Key Characteristics
– Distributed databases: efficiency, optimizations, fine-tuning
– Hadoop: scalability, flexibility, fault tolerance
Hadoop Subprojects & Related
• Pig – high-level language for data analysis
• HBase – table storage for semi-structured data
• Zookeeper – coordinating distributed applications
• Hive – SQL-like query language and metastore
• Mahout – machine learning
• Lucene, Solr, Blur – indexing and search engines
• …
MapReduce - #1
• With MapReduce, queries are split and distributed across parallel nodes and processed in parallel (the Map step). The results are then gathered and delivered (the Reduce step). The framework was incredibly successful, so others wanted to replicate the algorithm
• The same code can run on data of any size
• The code runs where the data is
• An implementation of the MapReduce framework was therefore adopted by an Apache open-source project named Hadoop
• HDFS = Hadoop Distributed File System, an Apache open-source distributed file system designed to run on commodity hardware
• But commodity hardware has issues: if one node fails, on average, once every 3 years, then a 1000-node cluster will see roughly one failure per day (again, on average)
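As a quick sanity check of the failure arithmetic above:

```python
# If each node fails on average once every 3 years, the expected number
# of failures per day across 1000 nodes is:
nodes = 1000
failures_per_node_per_day = 1 / (3 * 365)
expected_failures_per_day = nodes * failures_per_node_per_day
print(round(expected_failures_per_day, 2))  # 0.91 -- almost one a day
```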
MapReduce - #2
MapReduce - example
[Diagram: word-count data flow – input blocks on HDFS feed the Map phase, which produces (k, v) pairs such as (word, 1); a shuffle & sort groups the pairs by key k; the Reduce phase consumes (k, [v]) pairs such as (word, [1,1,1,1,1,1,…]) and produces (k', v') results such as (word, 100), written to output files Part0001–Part0003.]
Example - NoSQL DB – HBase
• HBase is a column-oriented database management system that runs on top of HDFS
• HBase can store massive amounts of data and allows random access to it
• It can store sparse datasets / denormalized data and has flexible schemas
• Tables are key-value storage (one single "PK")
• Get operations are generally very fast, and scans over a range of start/stop keys are fast as well; hence the need to think carefully about the right key "composition" so your application can use range scanning
• Drawback of HBase – no SQL support
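Why key composition matters for range scans can be illustrated with a pure-Python sketch. Sorted keys and `bisect` stand in for HBase's sorted storage (this is not a real HBase client), and the data and the `user#timestamp` key scheme are hypothetical:

```python
# HBase stores rows sorted by row key, so a scan over [start, stop) only
# touches contiguous rows. Composing the key as "<user>#<timestamp>" lets
# one scan fetch all of a user's events in time order.
import bisect

rows = {
    "alice#20240101": "login",
    "alice#20240105": "purchase",
    "bob#20240102":   "login",
}
sorted_keys = sorted(rows)

def scan(start, stop):
    # Half-open range scan over the sorted key space, like HBase's
    # startRow/stopRow semantics.
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return [(k, rows[k]) for k in sorted_keys[lo:hi]]

# All of alice's events, without touching bob's rows ("$" sorts just
# after "#" in ASCII, closing the range):
print(scan("alice#", "alice$"))
# [('alice#20240101', 'login'), ('alice#20240105', 'purchase')]
```

Had the key been `timestamp#user` instead, one user's events would be scattered across the whole key space and a single range scan could not collect them.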
BigData @Amazon
– Amazon stores and processes more shopping data than any other single company
– Offers software services to external companies via AWS (Amazon Web Services)
– Amazon DynamoDB (a high-performance NoSQL database)
– Amazon Elastic Compute Cloud (EC2), Elastic MapReduce (EMR)
– Amazon analyzes hundreds of millions of daily sales, inventory ASINs, and various SKUs to compute metrics in near real-time and determine prices that maximize profit and clear inventory (literally the lowest available price online)
– Amazon uses clickstream analysis, machine learning and data mining to detect robots / fraudulent behavior, optimize page content and structure, and maximize income through recommendations and price-setting mechanisms
– Amazon sends recommendation offers to customers (read: spam)
– Optimizes delivery routes for thousands of packages
Questions?
• And yet, this is only the beginning!
• Interesting links:– http://hadoop.apache.org/docs/r0.18.0/hdfs_design.pdf
– https://www.youtube.com/watch?v=qqfeUUjAIyQ
– https://www.youtube.com/watch?v=c4BwefH5Ve8
– https://www.youtube.com/watch?v=HFplUBeBhcM
• Expect more on Tech Herbert day!