Big Data Technologies

Chandra Chikkareddy

Introduction

• Big Data is data that is hard to capture, store, and analyze with commonly used software tools because of its very large size

• “World’s nervous system—a real-time feedback loop which didn’t exist before” - Yahoo CEO Marissa Mayer

January 25, 2013 www.societyconsulting.com

Why should you care?

• Mobile devices, smart energy meters, remote sensing, wireless sensors, software machine logs, cameras, RFID readers, and similar sources are creating massive amounts of data that businesses and governments now have the opportunity to analyze and act upon.

• Approximately 2.5 quintillion (2.5×10^18) bytes of data are created every day.

• The business and economic possibilities of Big Data, and its wider implications, are important issues that business leaders and policy makers will tackle in the years ahead.

Industry verticals using Big Data

Any industry vertical that accumulates a sufficient quantity of data can leverage Big Data technologies. Here are some of the verticals:

• Digital Media & E-Commerce: Real-time ad targeting, Web analytics & trends

• Energy and Utilities: Smart meter analytics, Asset management

• Financial Services: Risk and fraud management, Portfolio management, Customer analytics

• Government: Threat management, Law enforcement (real-time multimodal surveillance, cyber security detection), Macroeconomic analytics

• Healthcare and Life Sciences: New drug development, Medical record text analytics, Genomic analytics

• Retail: CRM, Targeted marketing analysis, Vendor delivery & supply chain optimization, Market basket analysis, Click-stream analysis

• Telecommunications: CRM, Call detail record analysis, Least cost routing, Fraud management

• Transportation: Logistics optimization, Traffic congestion

Big Data landscape/technologies

Source:
http://www.forbes.com/sites/oracle/2012/12/13/billions-of-reasons-to-get-ready-for-big-data/
http://www.rosebt.com/1/post/2012/6/big-data-vendor-landscape.html
http://www.dataart.com/software-outsourcing/big-data
http://www.capgemini.com/technology-blog/2012/09/big-data-vendors-technologies/

Big Data Process/Steps

At a basic level, data processing can be broken into three stages: data (the raw indicators), information (the meaningful interpretation of those signals), and insight (an actionable piece of knowledge).

Why Web Analytics quickly leads to Big Data Science

• Consider 10 million page views a day on a popular web site

• Capture the user id for every page view and store it as an integer

• 10 million x 4 bytes = 40 MB of storage/day

• 40 MB x 30 days = 1,200 MB, about 1.17 GB/month (at 1,024 MB per GB)

• Data grows quickly, and so do the challenges around storage, processing, and analytics.

(Figure: 10^7 elements, domain of 32-bit integers, 40 MB/day)
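The storage arithmetic above can be checked with a short Python sketch. The constant and function names are illustrative, not from the slides, and the GB figure follows the slide's apparent convention of dividing megabytes by 1,024.

```python
# Back-of-the-envelope storage estimate for the page-view example.
PAGE_VIEWS_PER_DAY = 10_000_000   # 10 million page views a day
BYTES_PER_USER_ID = 4             # one 32-bit integer per page view
DAYS_PER_MONTH = 30


def storage_mb(days: int) -> float:
    """Storage needed for user ids over `days` days, in MB (10^6 bytes)."""
    return PAGE_VIEWS_PER_DAY * BYTES_PER_USER_ID * days / 1_000_000


daily_mb = storage_mb(1)
monthly_mb = storage_mb(DAYS_PER_MONTH)
# Assumption: 1 GB = 1,024 MB, which reproduces the 1.17 GB on the slide.
monthly_gb = monthly_mb / 1024
yearly_gb = storage_mb(365) / 1024

print(f"{daily_mb:.0f} MB/day")        # 40 MB/day
print(f"{monthly_gb:.2f} GB/month")    # 1.17 GB/month
print(f"{yearly_gb:.1f} GB/year")
```

This counts only the user id column; storing a timestamp, URL, or other fields per page view multiplies the totals accordingly.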

Dealing with large datasets

New algorithmic techniques in traditional computing

• Probabilistic data structures

• Cardinality estimation, frequency estimation, range queries, membership queries, etc.

Distributed computing / divide and conquer

• Break the processing into equal parts, get individual results, and aggregate

• Distributed systems are complex to build and maintain

• Depended on academia & research labs for renting compute
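As a minimal sketch of how the two ideas combine, the Python below estimates the number of distinct user ids with linear counting (one simple probabilistic cardinality estimator, chosen here for illustration; the slides do not name a specific structure). Each partition builds a small bitmap independently, and the partial results are aggregated with a bitwise OR. The bitmap size, function names, and sample data are all assumptions.

```python
import hashlib
import math

NUM_BITS = 1 << 16  # bitmap size in bits (assumed; more bits give better accuracy)


def bit_index(user_id: int) -> int:
    """Map a user id to a bit position using a stable hash."""
    digest = hashlib.blake2b(user_id.to_bytes(8, "little"), digest_size=8).digest()
    return int.from_bytes(digest, "little") % NUM_BITS


def sketch(user_ids) -> int:
    """Build the bitmap for one partition (stored as a Python int)."""
    bitmap = 0
    for uid in user_ids:
        bitmap |= 1 << bit_index(uid)
    return bitmap


def merge(bitmaps) -> int:
    """Aggregate partial results: OR-ing bitmaps corresponds to set union."""
    merged = 0
    for bitmap in bitmaps:
        merged |= bitmap
    return merged


def estimate(bitmap: int) -> float:
    """Linear counting estimate: n is about -m * ln(fraction of zero bits)."""
    zero_bits = NUM_BITS - bin(bitmap).count("1")
    if zero_bits == 0:
        raise ValueError("bitmap is saturated; use more bits")
    return -NUM_BITS * math.log(zero_bits / NUM_BITS)


# Sample data: 50,000 page views from 10,000 distinct users,
# split into 4 equal parts as if handled by 4 separate workers.
page_views = [i % 10_000 for i in range(50_000)]
chunk = len(page_views) // 4
partitions = [page_views[k * chunk:(k + 1) * chunk] for k in range(4)]

partials = [sketch(part) for part in partitions]  # independent per partition
approx_distinct = estimate(merge(partials))
print(f"estimated distinct users: {approx_distinct:.0f}")
```

The merged bitmap is identical to the one a single pass over all page views would produce, so the partitions never need to exchange raw data. The bitmap stays at 8 KB however many page views are processed, at the cost of an approximate answer that degrades as the bitmap fills up.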

Traditional Distributed system challenges

Data exchange requires synchronization

Temporal dependencies are complicated

Difficult to deal with partial failures of the system

Data is mostly copied to the compute nodes at compute time

Developers spend more time designing for failure than they do actually working on the problem itself

Transferring data to compute nodes becomes a bottleneck

• Typical disk data transfer rate: 75 MB/sec. Time taken to transfer 100 GB of data to the processor: approx. 22 mins.

New approach is needed
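The transfer-time figure can be reproduced with a short Python sketch. The function name and the `readers` parameter are hypothetical. The parallel case assumes the data is spread evenly across disks that are read at the same time with perfect scaling, which real systems only approximate.

```python
def transfer_seconds(size_gb: float, rate_mb_per_s: float, readers: int = 1) -> float:
    """Time to read `size_gb` of data at `rate_mb_per_s` per disk.

    Assumes 1 GB = 1,000 MB and ideal linear scaling across `readers` disks.
    """
    size_mb = size_gb * 1000
    return size_mb / (rate_mb_per_s * readers)


# Figures from the slide: 100 GB at 75 MB/sec through a single disk.
single_disk_seconds = transfer_seconds(100, 75)
single_disk_minutes = single_disk_seconds / 60

# Hypothetical: the same data spread over 100 disks, each read locally.
parallel_seconds = transfer_seconds(100, 75, readers=100)

print(f"single disk: {single_disk_minutes:.1f} minutes")
print(f"100 disks in parallel: {parallel_seconds:.1f} seconds")
```

Under these assumptions, reading the data where it already resides turns a wait of about 22 minutes into a matter of seconds, which is the motivation for moving computation to the data rather than copying data to the compute nodes.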

Ideal system for distributed computing

Partial failure support

Data recoverability

Component recoverability

Consistency

Scalability
