© comScore, Inc. Proprietary. Using Hadoop to Process a Trillion+ Events. Michael Brown, CTO | February 28th, 2013

Page 1: Using Hadoop

© comScore, Inc. Proprietary.

Using Hadoop to Process a Trillion+ Events

Michael Brown, CTO | February 28th, 2013

Page 2: Using Hadoop

comScore is a leading internet technology company that provides Analytics for a Digital World™

NASDAQ: SCOR
Clients: 2,100+ Worldwide
Employees: 1,000+
Headquarters: Reston, Virginia, USA
Global Coverage: Measurement from 172 Countries; 44 Markets Reported
Local Presence: 32 Locations in 23 Countries
Big Data: Over 1.5 Trillion Digital Interactions Captured Monthly

V0113

Page 3: Using Hadoop

Vocabulary for Measuring Information

If a Grain of Sand Were One Byte of Information . . .

1 Megabyte = 1 million bytes: a tablespoon of sand
1 Gigabyte = 1 billion bytes: a patch of sand, 9" square, 1' deep
1 Terabyte = 1 trillion bytes: a sandbox, 24' square, 1' deep
1 Petabyte = 1,000 terabytes: a mile-long beach, 100' wide, 1' deep
1 Exabyte = 1,000 petabytes: the same beach, from Maine to North Carolina
1 Zettabyte = 1,000 exabytes: the same beach, along the entire US coast
1 Yottabyte = 1,000 zettabytes (24 zeroes): enough info to bury the entire US under 296 feet of sand

Page 4: Using Hadoop

Panel Heat Map

Page 5: Using Hadoop

Unified Digital Measurement™ (UDM) Establishes Platform for Panel + Census Data Integration

Unified Digital Measurement (UDM): Patent-Pending Methodology

Adopted by 90% of Top 100 U.S. Media Properties

[Diagram labels: PANEL; CENSUS; Global PERSON Measurement; Global DEVICE Measurement]

V0411

Page 6: Using Hadoop

Worldwide Tags per Month

[Chart: number of records per month, July 2009 through January 2013; y-axis from 0 to 1,600,000,000,000; series: Panel Records and Beacon Records]

Page 7: Using Hadoop

Beacon Heat Map

Page 8: Using Hadoop

Our Event Volume in Perspective

[Chart: Top 65 WW Properties, Cumulative Page Views; y-axis ticks from 0 to 1,600,000, units not given]

Source: comScore MediaMetrix Worldwide December 2012

Page 9: Using Hadoop

Worldwide UDM™ Penetration

December 2012 Penetration Data: Percentage of Machines Included in UDM Measurement

Europe: Austria 87%, Belgium 93%, Switzerland 89%, Germany 92%, Denmark 88%, Spain 95%, Finland 93%, France 92%, Ireland 90%, Italy 90%, Netherlands 93%, Norway 91%, Portugal 92%, Sweden 90%, United Kingdom 92%

Asia Pacific: Australia 90%, Hong Kong 95%, India 92%, Japan 82%, Malaysia 93%, New Zealand 91%, Singapore 92%

North America: Canada 94%, United States 91%

Latin America: Argentina 95%, Brazil 96%, Chile 94%, Colombia 95%, Mexico 93%, Puerto Rico 92%

Middle East & Africa: Israel 92%, South Africa 78%

Page 10: Using Hadoop

High Level Data Flow

Panel

Census

ETL

Delivery

Page 11: Using Hadoop

Our Cluster

Production Hadoop Cluster

• 120 nodes: mix of Dell 720xd, R710 and R510 servers
• Each R510 has 12 x 2TB drives, 64GB RAM, and 24 cores
• 3000+ total CPUs
• 6.0TB total memory
• 2PB total disk space
• Our distro is MapR M5 2.1.0

Page 12: Using Hadoop

The Project:

vCE – Validated Campaign Essentials

Page 13: Using Hadoop

comScore - vCE

Page 14: Using Hadoop

The Problem Statement

Calculate the number of events and unique cookies for each reportable campaign element

Key takeaways

• Data on input will be aggregated daily
• Need to process all data for 3 months
• Need to calculate values for every day in the 92-day period, spanning all reportable campaign elements
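The requirement above can be illustrated with a small sketch. It assumes that "period N" means days 1 through N taken together (a reading suggested by the steadily growing counts in the output slide, not stated outright), and the record layout `(campaign_key, day, cookie)` and the function name are hypothetical.

```python
from collections import defaultdict


def cumulative_uniques(records, num_days):
    """Count unique cookies per campaign key for periods 1..num_days.

    records: iterable of (campaign_key, day, cookie), with day in 1..num_days.
    Assumption: period N covers days 1 through N inclusive.
    """
    # Remember the first day on which each cookie appeared for each key.
    first_seen = defaultdict(dict)
    for key, day, cookie in records:
        seen = first_seen[key]
        if cookie not in seen or day < seen[cookie]:
            seen[cookie] = day

    result = {}
    for key, seen in first_seen.items():
        # new_per_day[d] = cookies whose first appearance was on day d.
        new_per_day = [0] * (num_days + 1)
        for day in seen.values():
            new_per_day[day] += 1
        # A running total turns "new cookies per day" into cumulative uniques.
        running, counts = 0, []
        for day in range(1, num_days + 1):
            running += new_per_day[day]
            counts.append(running)
        result[key] = counts
    return result


key = ("1234", "160873284", "840")  # client, campaign, population
records = [
    (key, 1, "c1"),
    (key, 1, "c2"),
    (key, 2, "c2"),
    (key, 2, "c3"),
    (key, 3, "c1"),
]
print(cumulative_uniques(records, 3)[key])  # [2, 3, 3]
```

Note that this version keeps every cookie in memory, which is exactly what stops working at the volumes described later in the deck.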

Page 15: Using Hadoop

Structure of the Required Output

Client  Campaign   Population  Location  Cookie Ct  Period
1234    160873284  840         1           863,185  1
1234    160873284  840         1         1,719,738  2
1234    160873284  840         1         2,631,624  3
1234    160873284  840         1         3,572,163  4
1234    160873284  840         1         4,445,508  5
1234    160873284  840         1         5,308,532  6
1234    160873284  840         1         6,032,073  7
1234    160873284  840         1         6,710,645  8
1234    160873284  840         1         7,421,258  9
1234    160873284  840         1         8,154,543  10

Page 16: Using Hadoop

Counting Uniques from a Time Ordered Log File

[Diagram; extracted key labels: A, B, C, D, B, A, A]

Major downsides:

• Need to keep all key elements in memory.
• Constrained to one machine for final aggregation.
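A minimal sketch of the time-ordered case follows; the function name and sample log are illustrative only. Because keys arrive in no particular order, the counter has to remember every distinct key it has seen.

```python
def count_uniques_time_ordered(keys):
    """Count distinct keys in a stream that is NOT sorted by key."""
    seen = set()  # grows with the number of distinct keys in the stream
    for key in keys:
        seen.add(key)
    return len(seen)


log = ["A", "B", "C", "D", "B", "A", "A"]
print(count_uniques_time_ordered(log))  # 4
```

The `seen` set is the memory cost named on the slide: with billions of distinct cookies it no longer fits comfortably on one machine.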

Page 17: Using Hadoop

First Version

Java Map-Reduce application which processes pre-aggregated data from 92 days

Map reads the data and emits each cookie as the key of the key-value pair

All 130B records go through the shuffle

Each Reducer gets all the data for a particular campaign, sorted by cookie

Reducer aggregates the data by grouping key (Client / Campaign / Population) and calculates unique cookies for periods 1-92

Volume grew rapidly, to the point that the daily processing took more than a day
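The flow described above can be mimicked in a few lines of Python. This is only a single-process simulation of one plausible reading of the slide (a composite key of grouping key plus cookie, sorted during the shuffle); the original was a Java Map-Reduce job whose code is not shown, and all names here are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(records):
    # Emit ((group, cookie), day) for every input record; nothing is reduced here,
    # so every record has to travel through the shuffle.
    for group, day, cookie in records:
        yield (group, cookie), day


def shuffle(pairs):
    # Stand-in for the Hadoop shuffle: bring equal keys together by sorting.
    return sorted(pairs, key=itemgetter(0))


def reduce_phase(sorted_pairs, num_days):
    # For each group, walk its cookies in sorted order and note each cookie's first day.
    for group, group_pairs in groupby(sorted_pairs, key=lambda kv: kv[0][0]):
        new_per_day = [0] * (num_days + 1)
        for _, cookie_pairs in groupby(group_pairs, key=lambda kv: kv[0][1]):
            new_per_day[min(day for _, day in cookie_pairs)] += 1
        running = 0
        for period in range(1, num_days + 1):
            running += new_per_day[period]
            yield group, period, running


records = [
    ("camp1", 1, "c1"),
    ("camp1", 2, "c1"),
    ("camp1", 2, "c2"),
    ("camp2", 1, "c1"),
]
rows = list(reduce_phase(shuffle(map_phase(records)), num_days=2))
print(rows)
```

The sort inside `shuffle` touches every input record, which is the cost the later slides set out to avoid.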

Page 18: Using Hadoop

M/R Data Flow

[Diagram: three Map tasks read inputs with mixed keys (C B, B A, A C); after the shuffle, three Reduce tasks receive A A, B B, and C C and output A, B, and C]

Page 19: Using Hadoop

Scaling Issue

As our volume has grown we have the following stats:

• Over 500 billion events per month
• Daily aggregate: 1.5 billion
• 130 billion aggregate records for 92 days
• 70K campaigns
• Over 50 countries
• We see 15 billion distinct cookies in a month
• We only need to output 25 million rows
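A quick back-of-the-envelope check of these figures, using only the numbers on the slide (the variable names are illustrative):

```python
daily_aggregate = 1_500_000_000   # aggregate records per day (slide figure)
days = 92
input_records = 130_000_000_000   # aggregate records over 92 days (slide figure)
output_rows = 25_000_000          # rows actually needed in the output

# 92 days at 1.5 billion per day is the same order as the quoted 130 billion.
print(daily_aggregate * days)        # 138000000000

# Input records shuffled per output row produced.
print(input_records // output_rows)  # 5200
```

Roughly 5,200 input records pass through the shuffle for every output row, which suggests most of the work could in principle be collapsed before the shuffle.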

Page 20: Using Hadoop

Basic Approach Retrospective

Processing speed is not scaling to our needs on a sample of the input data

Diagnosis

• Most aggregations could not take significant advantage of combiners.
• Large shuffles caused poor job performance. In some cases large aggregations ran slower on the Hadoop cluster because of the shuffle and skew in the data for keys.

Diagnosis

• A new approach is required to reduce the shuffle

Page 21: Using Hadoop

Counting Uniques from a Key Ordered Log File

[Diagram; extracted key labels: A, D, B, C, B, A, A]

Major downsides:

• Need to sort data in advance.
• The sort time increases as volume grows.
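For contrast with the time-ordered case, here is a minimal sketch of counting over a key-ordered stream; the function name and sample data are illustrative. Once the input is sorted by key, only the previous key needs to be remembered.

```python
def count_uniques_key_ordered(sorted_keys):
    """Count distinct keys in a stream that is already sorted by key."""
    count = 0
    previous = object()  # sentinel that compares unequal to every real key
    for key in sorted_keys:
        if key != previous:  # a change of key marks a new distinct value
            count += 1
            previous = key
    return count


# The sort is the up-front cost named on the slide.
log = sorted(["A", "B", "C", "D", "B", "A", "A"])
print(count_uniques_key_ordered(log))  # 4
```

Memory use is constant regardless of how many distinct keys there are; the price is the sort that has to happen first.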

Page 22: Using Hadoop

Counting Uniques from a Key Ordered Log File

Page 23: Using Hadoop

Counting Uniques from Sharded Key Ordered Log Files

Page 24: Using Hadoop

Solution to reduce the shuffle

The Problem:

• Most aggregations within comScore cannot take advantage of combiners, leading to large shuffles and job performance issues

The Idea:

• Partition and sort the data by cookie on a daily basis
• Create a custom InputFormat to merge daily partitions for monthly aggregations
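The merge step in the idea above can be sketched with Python's standard `heapq.merge`, which lazily combines inputs that are already sorted. This is only an illustration of the merge concept, not the custom InputFormat itself (which would be Java and is not shown in the deck); the `(cookie, day)` row layout and function names are assumptions.

```python
import heapq
from itertools import groupby


def merge_daily_partitions(daily_partitions):
    """Merge per-day lists of (cookie, day) rows, each already sorted by cookie.

    heapq.merge streams the inputs without re-sorting them, so the cost is a
    sequential read of each daily file rather than a full sort of the period.
    """
    return heapq.merge(*daily_partitions)


def first_seen_per_cookie(merged_rows):
    # After the merge, all rows for one cookie are adjacent.
    for cookie, rows in groupby(merged_rows, key=lambda row: row[0]):
        yield cookie, min(day for _, day in rows)


day1 = [("a", 1), ("c", 1)]
day2 = [("a", 2), ("b", 2)]
day3 = [("b", 3), ("c", 3)]
result = list(first_seen_per_cookie(merge_daily_partitions([day1, day2, day3])))
print(result)  # [('a', 1), ('b', 2), ('c', 1)]
```

The daily sort is paid once per day on one day's data; the 92-day job then only merges.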

Page 25: Using Hadoop

Custom Input Format with Map Side Aggregation

[Diagram: Map, Combiner, and Reduce stages with three tasks each; extracted labels: C B, B A, and A C on the input side, and A, B, C at the combiner and reducer stages]

Page 26: Using Hadoop

Risks for Partitioning

Data locality

• Custom InputFormat requires reading blocks of the partitioned data over the network
• This was solved using a feature of the MapR file system. We created volumes and set the chunk size to zero, which guarantees that the data written to a volume will stay on one node

Map failures might result in long run times

• Size of the map inputs is no longer set by block size
• This was solved by creating a large number (10K) of volumes to limit the size of data processed by each mapper
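One way to picture the 10K-volume layout is a stable hash partitioner, sketched below. The deck does not say how cookies were assigned to volumes, so the use of CRC32 and the function names are assumptions made for illustration only.

```python
import zlib

NUM_VOLUMES = 10_000  # the slide's "large number (10K) of volumes"


def volume_for(cookie, num_volumes=NUM_VOLUMES):
    # Assumption: a stable hash picks the volume. crc32 gives the same value on
    # every run and machine, unlike Python's built-in hash() for strings.
    return zlib.crc32(cookie.encode("utf-8")) % num_volumes


def partition_day(records, num_volumes=NUM_VOLUMES):
    """Split one day's (cookie, payload) rows into volumes, each sorted by cookie."""
    volumes = {}
    for cookie, payload in records:
        volumes.setdefault(volume_for(cookie, num_volumes), []).append((cookie, payload))
    for rows in volumes.values():
        rows.sort(key=lambda row: row[0])
    return volumes


records = [("c3", "x"), ("c1", "y"), ("c2", "z"), ("c1", "w")]
volumes = partition_day(records, num_volumes=4)

# If the 130 billion records were spread evenly, each volume would hold:
print(130_000_000_000 // NUM_VOLUMES)  # 13000000
```

Because the same cookie always maps to the same volume, each day's rows for that cookie end up in the same place, and more volumes mean less data to redo when a single map task fails.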

Page 27: Using Hadoop

Partitioning Summary

Benefits:

• A large portion of the aggregation can be completed in the map phase
• Applications can now take advantage of combiners
• Shuffle sizes are minimal

Results:

• Took a job from 35 hours to 3 hours with no hardware changes
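The following sketch suggests why the shuffle shrinks once each mapper holds every row for its own range of cookies: it can settle each cookie locally and emit only small partial counts, which are simple sums that a combiner or reducer can add. The row layout `(cookie, group, day)` and the function names are hypothetical, and this is a single-process illustration rather than the production job.

```python
from collections import Counter
from itertools import groupby


def map_side_aggregate(sorted_rows):
    """sorted_rows: (cookie, group, day) rows sorted by cookie, holding ALL rows
    for the cookies in this partition. Returns {(group, first_day): new cookies}."""
    partial = Counter()
    for _, rows in groupby(sorted_rows, key=lambda row: row[0]):
        first_day = {}
        for _, group, day in rows:
            if group not in first_day or day < first_day[group]:
                first_day[group] = day
        for group, day in first_day.items():
            partial[(group, day)] += 1
    return partial


def reduce_partials(partials, num_days):
    # Partial counts are plain sums, so combining them is just addition.
    total = Counter()
    for partial in partials:
        total.update(partial)
    output = {}
    for group in sorted({group for group, _ in total}):
        running, counts = 0, []
        for day in range(1, num_days + 1):
            running += total[(group, day)]
            counts.append(running)
        output[group] = counts
    return output


partition_1 = [("a", "camp1", 1), ("a", "camp1", 2), ("b", "camp1", 2)]
partition_2 = [("c", "camp1", 1), ("c", "camp2", 3)]
partials = [map_side_aggregate(partition_1), map_side_aggregate(partition_2)]
print(reduce_partials(partials, num_days=3))  # {'camp1': [2, 3, 3], 'camp2': [0, 0, 1]}
```

Each mapper emits at most one count per group and day instead of one record per cookie, so the data crossing the shuffle scales with the output size rather than the input size.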

Page 28: Using Hadoop

Useful Factoids

Visit www.comscoredatamine.com or follow @datagems for the latest gems.

Colorful, bite-sized graphical representations of the best discoveries we unearth.

Page 29: Using Hadoop

Thank You!

Michael Brown, CTO, comScore, Inc.

[email protected]
