TRACKING LIVE WIKIPEDIA CHANGES

[email protected]

Insight Data Engineering Week 4 - January 2015

DRAFT WIKIWATCH.ANDREWMO.COM

MOTIVATION

• Raw dumps of Wikipedia data are available for analysis on a monthly basis, but… what about changes between these intervals?

• Data Collection: Live edits for Wikimedia projects are broadcast on up to 882 IRC channels

• Goals: Collect, filter, format, transform, and produce information about live edit data

Realtime Telemetry wikiwatch.andrewmo.com

SOMETHING ELSE

Other statistics

DATA PIPELINE ENGINEERING

Capture + Fusion + Analysis

[Diagram components: IRC, Kafka, Storm, MySQL, API; HDFS, Hadoop, NoSQL]

CAPTURE

Up to 882 Simultaneous Channels

~660 events/min avg across all channels

INGEST

Kafka + Hadoop

[Diagram: IRC channels #de, #fr, #en; logBot; Kafka with three Topics; HDFS]
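The ingest diagram pairs IRC channels with Kafka topics. A minimal sketch of one way to derive a topic name from a channel name follows. The `irc.` prefix and the function name `topic_for_channel` are assumptions for illustration; the deck does not give the actual topic names. Kafka topic names may contain only ASCII letters, digits, `.`, `_`, and `-`, so the leading `#` of a channel name has to be dropped or replaced.

```python
import re

# Hypothetical prefix for illustration; the deck does not give topic names.
TOPIC_PREFIX = "irc."

# Kafka topic names may contain only ASCII letters, digits, '.', '_' and '-'.
_ILLEGAL = re.compile(r"[^A-Za-z0-9._-]")


def topic_for_channel(channel: str) -> str:
    """Map an IRC channel name such as '#en' to a legal Kafka topic name."""
    name = channel.lstrip("#").lower()
    if not name:
        raise ValueError("empty channel name")
    # Replace any remaining illegal character with an underscore.
    return TOPIC_PREFIX + _ILLEGAL.sub("_", name)


if __name__ == "__main__":
    for channel in ("#en", "#de", "#fr"):
        print(channel, "->", topic_for_channel(channel))
```

Keeping the mapping in one pure function means the capture bots and any downstream consumer can agree on topic names without sharing state.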

We tried Spark Streaming. Scala, it’s not you - it’s me.

– Andrew Mo

Next sprint, maybe?

STREAM PROCESSING

Multiple Topologies (10 sec, 10 min, 1 hr)

Multiple Metrics (events, size, new pages, topics, users)

Python + Storm (Pyleus)

MySQL
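The windows and metrics listed on this slide can be illustrated with a small tumbling-window aggregation. This is a plain-Python sketch, not the Pyleus topology itself: the event keys (`ts`, `user`, `title`, `delta`, `new`) are hypothetical, and treating "size" as the net byte change per window is an assumption.

```python
from collections import defaultdict

# Window lengths in seconds, taken from the slide: 10 sec, 10 min, 1 hr.
WINDOWS = {"10s": 10, "10m": 600, "1h": 3600}


def window_start(timestamp: float, length: int) -> int:
    """Return the start (epoch seconds) of the tumbling window holding timestamp."""
    return int(timestamp // length) * length


def summarize(events, length):
    """Aggregate events into tumbling windows of `length` seconds.

    Each event is a dict with hypothetical keys: 'ts' (epoch seconds),
    'user', 'title', 'delta' (size change in bytes) and 'new' (bool).
    """
    buckets = defaultdict(lambda: {"events": 0, "size": 0, "new_pages": 0,
                                   "users": set(), "topics": set()})
    for event in events:
        bucket = buckets[window_start(event["ts"], length)]
        bucket["events"] += 1
        bucket["size"] += event["delta"]  # assumption: net byte change
        bucket["new_pages"] += 1 if event["new"] else 0
        bucket["users"].add(event["user"])
        bucket["topics"].add(event["title"])
    return dict(buckets)


if __name__ == "__main__":
    sample = [
        {"ts": 5, "user": "a", "title": "X", "delta": 10, "new": False},
        {"ts": 12, "user": "a", "title": "Y", "delta": 100, "new": True},
    ]
    for label, length in WINDOWS.items():
        print(label, summarize(sample, length))
```

Running the same function once per entry in `WINDOWS` mirrors the idea of one topology per window length, with each result row suited to a time-series table.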

API ACCESS

Time Series Summary Metrics for Multiple Windows

New Pages

Detailed User Activity

Detailed Topic Activity

Top Topics, Top Users, Top Bots, etc.

Thanks

Apache Software Foundation

Wikimedia Foundation

Insight Data Science

LinkedIn (Kafka)

Twitter (Storm)

Yelp (Pyleus)

ABOUT MO

A Project Manager that Writes Code!

Worked at: RAND Corporation, Booz Allen Hamilton

Studied at: Pardee RAND Graduate School, UC San Diego - Electrical Engineering

Alphabet Soup: PMP, PMI-ACP, CISSP, ISSEP, CSEP, CSEP-ACQ

[email protected]

GitHub: https://github.com/moandcompany

LinkedIn: http://linkedin.com/in/andrewmo

BONUS CHARTS

BATCH PROCESSING

MapReduce + Hadoop Streaming

Hive

MR Job

… and more …

Visual

FIREHOSE

Multiplex all sensors to a firehose topic

[Diagram: IRC channels #de, #fr, #en; three logBots; Kafka "Omnichannel" topic]
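Once every channel feeds one shared topic, each message has to carry its source channel, because the topic name no longer identifies it. The sketch below shows one way to wrap a raw IRC line in such an envelope. The topic name `omnichannel` borrows the diagram label and is an assumption, as are the envelope fields; `InMemoryProducer` is a hypothetical stand-in so the example runs without a Kafka broker, not a real client class.

```python
import json

# Assumption: topic name borrowed from the "Omnichannel" label in the diagram.
FIREHOSE_TOPIC = "omnichannel"


def to_firehose(channel: str, line: str, received_at: float) -> bytes:
    """Wrap a raw IRC line in an envelope that records its source channel."""
    envelope = {"channel": channel, "received_at": received_at, "raw": line}
    return json.dumps(envelope, sort_keys=True).encode("utf-8")


class InMemoryProducer:
    """Hypothetical stand-in for a Kafka producer; stores messages in a list."""

    def __init__(self):
        self.sent = []

    def send(self, topic: str, value: bytes) -> None:
        self.sent.append((topic, value))


def publish(producer, channel: str, line: str, received_at: float) -> None:
    """Send one captured IRC line to the shared firehose topic."""
    producer.send(FIREHOSE_TOPIC, to_firehose(channel, line, received_at))


if __name__ == "__main__":
    producer = InMemoryProducer()
    publish(producer, "#en", "example edit line", 1421539200.0)
    publish(producer, "#de", "example edit line", 1421539201.0)
    for topic, value in producer.sent:
        print(topic, json.loads(value.decode("utf-8"))["channel"])
```

A consumer of the firehose can then filter or regroup by the `channel` field instead of subscribing to hundreds of per-channel topics.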

VELOCITY AND OUR NEXT SPRINT

Sprint 1 (MVP Development)

18 Jan - 31 Jan 2015

Address the need + Simplify

API-query elicitation and discovery

Novel feature focus - Realtime

Maximize common-code (Python)

Sprint 2 (MVP Validation)

Engage users + Complete Features

API enhancement

Batch Integration

NoSQL Optimization

Preempt Technical Debt - Refactoring

Velocity Chart

TECHNOLOGY TO EVALUATE

• Presto

• Samza

• Hive + Tez

• Kafka on YARN (KOYA)

• Kafka Security (Authentication)

• Spark + Spark Streaming (1.2+ Python)