Big Data vs Data Warehousing


Thomas Kejser

thomas@kejser.org

http://blog.kejser.org

@thomaskejser

Bigdata vs. Data Warehousing

Synergy or Conflict?

Who is this Guy?

• Formerly: Lead SQLCAT EMEA
• Now: CTO FusionIo EMEA
• 15 years of database experience
• Performance Tuner

[Chart: world population in billions of humans (axis 5 to 10), by year from 2000 to 2250]

Source: United Nations Projections

Human Consciousness Doesn’t Scale

Text Messages in a Table

CREATE TABLE AllTexts (
    Sender           BIGINT        -- 8 B
  , Receiver         BIGINT        -- 8 B
  , SenderLocation   BIGINT        -- 8 B
  , ReceiverLocation BIGINT        -- 8 B
  , Time             DATETIME      -- 8 B
  , SMS              VARCHAR(140)  -- 140 B
)
-- Total: 180 bytes per row

How much do we text?

• World Average
  • 6.1 Trillion Text Messages / year
  • About 80% cell phone coverage
  • 7 billion people
  • 3 messages/day/person
• But:
  • Teenagers: 50 messages/day

Source: Pew Internet Research 2010 & ITU

How much will we EVER text?

• 9B people acting like teenagers (in 2050)
• 50 texts/day
• That’s 450 billion texts/day
• 164 Trillion texts/year (about 27x today)
• 180 bytes each
• Assume x3 compression
• Approximation: 10 Petabytes/year in 2050
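The estimate above can be checked with a few lines of arithmetic. The sketch below reproduces it in Python; the variable names are illustrative, and decimal petabytes (10^15 bytes) are an assumption, since the slide does not say which convention it uses.

```python
# Back-of-envelope check of the 2050 texting estimate.
PEOPLE = 9_000_000_000   # 9B people in 2050
TEXTS_PER_DAY = 50       # everyone texts like a teenager
BYTES_PER_TEXT = 180     # row size of the AllTexts table
COMPRESSION = 3          # x3 compression, as assumed on the slide
PETABYTE = 10**15        # assumption: decimal petabytes

texts_per_day = PEOPLE * TEXTS_PER_DAY
texts_per_year = texts_per_day * 365
raw_bytes = texts_per_year * BYTES_PER_TEXT
stored_pb = raw_bytes / COMPRESSION / PETABYTE

print(f"{texts_per_day:,} texts/day")
print(f"{texts_per_year / 1e12:.0f} trillion texts/year")
print(f"{stored_pb:.1f} PB/year after compression")
```

The result lands just under 10 PB/year, which is where the slide's rounded figure comes from.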

[Chart: capacity in GB (log scale) by year]

Can it be done?

Moore’s Hard Drives

How Large is this/year?

Hard Disk (4TB) : 2.5”

About 1500 Wine Bottles

Wine Bottle (75cl): 4.0”

In the Data Center

• Calculating:
  • 2U Storage = 24 Disks (includes compute)
  • 4TB per Disk
  • 100TB in 2U (a bit less)
  • 10PB = 200U storage
• About six racks
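The sizing above can be sketched as a short calculation. This is a rough check rather than a capacity plan: it assumes decimal units (10 PB = 10,000 TB) and ignores replication, spares and networking gear, none of which the slide quantifies.

```python
import math

TARGET_TB = 10_000   # 10 PB per year, in decimal TB (assumption)
DISK_TB = 4          # 4TB per disk
DISKS_PER_2U = 24    # one 2U enclosure holds 24 disks

disks = math.ceil(TARGET_TB / DISK_TB)
enclosures = math.ceil(disks / DISKS_PER_2U)
rack_units = enclosures * 2

print(disks, "disks")
print(DISK_TB * DISKS_PER_2U, "TB per 2U enclosure")
print(rack_units, "rack units")
```

With unrounded figures the result is 210U, slightly above the slide's 200U, which rounds 96TB up to 100TB per enclosure.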

Warehouses Serve us Well..

• Good Management Interfaces
• Standard SQL
  • with a few extensions
• Appliances
  • Support system
  • Homogeneous HW
  • In chunks

… And it is Becoming a Commodity

vs.

PDW vs. Hive – Scan/seek

Query 1:
SELECT count(*) FROM lineitem

Query 2:
SELECT max(l_quantity) FROM lineitem WHERE l_orderkey > 1000 AND l_orderkey < 100000 GROUP BY l_linestatus

[Chart: run time in seconds (axis 0 to 1400) for Query 1 and Query 2, Hive vs. PDW]

PDW vs. Hive - Joins

PDW-U:
• orders partitioned on o_custkey
• lineitem partitioned on l_partkey

PDW-P:
• orders partitioned on o_orderkey
• lineitem partitioned on l_orderkey

SELECT max(l_orderkey) FROM orders JOIN lineitem ON l_orderkey = o_orderkey

[Chart: run time in seconds (axis 0 to 4000) for Hive, PDW-U and PDW-P]

• Thread startup times
• Co-location awareness
• Files vs. optimized DB memory structures
• Column stores and other DB tech
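To make the co-location point concrete, here is a minimal sketch of why partitioning both tables on the join key avoids moving data. It is not how PDW or Hive are implemented: the four-node cluster, the modulo "hash", the toy rows and the helper names `node_for` and `rows_to_shuffle` are all assumptions made for illustration, and the column names follow the TPC-H schema.

```python
NODES = 4  # hypothetical cluster size

def node_for(value, nodes=NODES):
    """Pick a node by simple modulo hashing of the partitioning column."""
    return value % nodes

def rows_to_shuffle(orders, lineitem, orders_key, lineitem_key):
    """Count lineitem rows stored on a different node than their matching order."""
    order_home = {o["o_orderkey"]: node_for(o[orders_key]) for o in orders}
    return sum(
        1 for li in lineitem
        if node_for(li[lineitem_key]) != order_home[li["l_orderkey"]]
    )

# Toy data: 100 orders with 3 line items each (values are made up).
orders = [{"o_orderkey": k, "o_custkey": (k * 7) % 13} for k in range(100)]
lineitem = [
    {"l_orderkey": k, "l_partkey": (k * 3 + i) % 17}
    for k in range(100) for i in range(3)
]

# Both tables partitioned on the join key: every match is already local.
co_located = rows_to_shuffle(orders, lineitem, "o_orderkey", "l_orderkey")
# Tables partitioned on unrelated columns: matches are spread across nodes.
mismatched = rows_to_shuffle(orders, lineitem, "o_custkey", "l_partkey")

print("Join-key partitioning, rows to move:", co_located)
print("Unrelated partitioning, rows to move:", mismatched)
```

When both tables are partitioned on the join key, each node can join its own slice without network traffic. With unrelated partitioning columns, a share of the rows has to be redistributed before the join can run.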

Generic is good…

… but when there is structure, make use of it!

What does Big Data need to Catch up?

• What is Bigdata

Very Unstructured Data

How many Pictures of Cats?

• Flickr Today:
  • 300MB/month
  • 2GB/year
  • 51M users (too small?)
• Estimate: 102 PB / year
• 10 x text messages

Source: Wikipedia
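The picture estimate follows from the per-user figure. The sketch below assumes decimal units and takes the 10 PB/year text message approximation from earlier in the talk as the comparison baseline; the variable names are illustrative.

```python
USERS = 51_000_000                 # 51M Flickr users
BYTES_PER_USER_YEAR = 2 * 10**9    # 2GB/year per user
PETABYTE = 10**15                  # assumption: decimal petabytes
TEXT_PB_PER_YEAR = 10              # text message approximation from earlier

photos_pb = USERS * BYTES_PER_USER_YEAR / PETABYTE

print(f"{photos_pb:.0f} PB/year of pictures")
print(f"{photos_pb / TEXT_PB_PER_YEAR:.1f}x the text message estimate")
```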

How big is this in wine bottles?

We have learned how to store it!

What is HDFS?

• Distributed File System
• Open Source
• No more SAN
• The Failure Unit is the Server

Fully unstructured data is boring

…Unless you get money for storing it

Acquiring Personal Information

Your Semi-structured Data, the Old Fashioned Way

The Social Angle

Who do you talk to and how often?

The Reasons

Why do you own a cell phone?

Your Semi-structured Data, For Free

- at The Pub, Saturday, 1:39am

Big Value

Extraction of meaning and insight from semi-structured data

Extracting Meaning from Humans

Method | Examples
Turn semi-structure to structure | Image recognition, network proximity and super nodes, social media
Needle in a haystack | Extract outliers, fraud
Herd behaviors | Clustering, pattern recognition, “Customers who bought this also bought”
Text classification and search | Text indexes, syntactic counting, PageRank
Text to structure | Semantic analysis, loose structure into structure

Find New Customers

“Michael, who is respected among his peers, often talks about his new, cool gadgets”

Michael

Thomas

Tommy

Cross Sell

“Families who own an Aston Martin will often buy a Mini Cooper too”
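A cross-sell rule like this one can come from something as simple as counting which items turn up together. The sketch below is one illustrative way to do that, not a method the talk prescribes: the `also_bought` helper and the household data are made up for the example.

```python
from collections import Counter
from itertools import combinations

def also_bought(baskets):
    """Count how often each pair of items is owned by the same household."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

# Hypothetical households and what they own.
households = [
    {"Aston Martin", "Mini Cooper"},
    {"Aston Martin", "Mini Cooper", "Bicycle"},
    {"Mini Cooper", "Bicycle"},
    {"Aston Martin", "Mini Cooper"},
]

pairs = also_bought(households)
best_pair, times_seen = pairs.most_common(1)[0]
print(best_pair, times_seen)
```

On real data the raw counts would need normalising by how common each item is; otherwise popular items dominate every recommendation.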

Free Information

Need: Lots of CPU Cores!

Need: Data Centers!

Provisioning has to be REALLY fast

Things to Learn for the Future

• Get good at
  • Statistics (again)
  • Distributed Algorithms
  • Tuning
• Understand Physical Constraints
• Acquire deep domain knowledge

Something is Changing

You:

Today | Tomorrow
CAPEX Hardware | OPEX Hardware

The Mother of All Stovepipes

Data you are afraid to lose

Big Data / Staging (No Model)

Delivery (Model)

Data You actually need

Synergy

Create Structure for me

Here is a table

Warehouse

Applying Social Media to Structure

Summary

Data Warehouse

• There is a model
• Seek Co-location
• Respond in seconds
• Calculate first, query after
• Expensive HW
• Optimise for target HW
• Homogeneous HW
• Pay vendor, expect optimised

Big Data

• Don’t bother modeling!
• Optional Co-Location
• Respond in minutes
• Calculate while querying
• Cheap HW
• Good enough on all HW
• Heterogeneous HW
• Free license, optimise yourself

Q&A
