Stream Processing in SmartNews Takumi Sakamoto 2016.03.12

Stream Processing in SmartNews #jawsdays


Page 1: Stream Processing in SmartNews #jawsdays

Stream Processing in SmartNews

Takumi Sakamoto 2016.03.12

Page 2: Stream Processing in SmartNews #jawsdays

Takumi Sakamoto @takus 😍 = ⚽ ✈ 📷

Page 3: Stream Processing in SmartNews #jawsdays

http://bit.ly/1MCOyBX

JAWSDAYS 2015

Page 4: Stream Processing in SmartNews #jawsdays

AWS Case Study

http://aws.amazon.com/solutions/case-studies/smartnews/

Page 5: Stream Processing in SmartNews #jawsdays

What is SmartNews?

• News Discovery App for Mobile

• Launched in 2012

• 15M+ Downloads Worldwide

https://www.smartnews.com/en/

Page 6: Stream Processing in SmartNews #jawsdays

How Do We Deliver News?

Internet → Algorithms → Trending News

Page 7: Stream Processing in SmartNews #jawsdays

Why Stream Processing?

Page 8: Stream Processing in SmartNews #jawsdays

Today’s News Is Wrapping Tomorrow’s Fish and Chips

Page 9: Stream Processing in SmartNews #jawsdays

↑ Yesterday's News

http://www.personalchefapproach.com/tomorrows-fish-n-chips-wrapper/

Page 10: Stream Processing in SmartNews #jawsdays

News Articles Lifetime

https://gdsdata.blog.gov.uk/2013/10/22/the-half-life-of-news/

Page 11: Stream Processing in SmartNews #jawsdays

Speed Matters for Us

Page 12: Stream Processing in SmartNews #jawsdays

System Overview

Page 13: Stream Processing in SmartNews #jawsdays

News Delivery Pipeline

[Diagram: news delivery pipeline]

Index System: Internet → Crawler → Analyzer → Indexer → CloudSearch

Feedback System: Mobile App → API Tracker → DynamoDB

Other components: API Search, API Gateway

Latency labels: 1 minute, 5 minutes

Page 14: Stream Processing in SmartNews #jawsdays

Index System

• Crawler

• collect news articles & social signals

• Analyzer

• extract title, content, thumbnail...

• classify topics (sports, politics, technology...)

• Indexer

• upload article metadata into CloudSearch

Page 15: Stream Processing in SmartNews #jawsdays

Feedback System

• API Tracker

• receive users' activity logs from the mobile app

• Spark Streaming

• generate various metrics for news ranking

• store metrics in DynamoDB

Page 16: Stream Processing in SmartNews #jawsdays

How to Glue the Services Together?

Page 17: Stream Processing in SmartNews #jawsdays

Ref: Amazon Kinesis: Real-time Streaming Big data Processing Applications

Page 18: Stream Processing in SmartNews #jawsdays

Why Kinesis Streams?

• Fully managed service

• Multiple consumer applications

• Reasonable pricing

Page 19: Stream Processing in SmartNews #jawsdays

Multiple Consumers

Kinesis Stream → Spark on EMR, AWS Lambda

Data Scientist: "I wanna consume streaming data with Spark"

Application Engineer: "I wanna add a streaming monitor with Lambda"

Empowers Engineers to Do Trial and Error
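
Why several consumer applications can share one stream can be sketched with a toy in-memory model. This is not the Kinesis API: `Stream`, `Consumer`, and `poll` are hypothetical names. The only property borrowed from Kinesis is that each consumer application tracks its own read position, so adding a new consumer does not disturb the existing ones.

```python
class Stream:
    """Toy append-only log standing in for one shard (hypothetical, not the Kinesis API)."""

    def __init__(self):
        self.records = []

    def put(self, record):
        self.records.append(record)


class Consumer:
    """Each consumer keeps its own checkpoint, independent of other consumers."""

    def __init__(self, name, stream):
        self.name = name
        self.stream = stream
        self.checkpoint = 0  # index of the next record to read

    def poll(self, limit=10):
        batch = self.stream.records[self.checkpoint:self.checkpoint + limit]
        self.checkpoint += len(batch)
        return batch


stream = Stream()
for i in range(5):
    stream.put({"seq": i, "action": "viewArticle"})

spark_job = Consumer("spark-on-emr", stream)
monitor = Consumer("lambda-monitor", stream)

first = spark_job.poll(limit=3)       # the Spark job reads three records
everything = monitor.poll(limit=10)   # the monitor still sees all five records
rest = spark_job.poll(limit=10)       # the Spark job resumes from its own checkpoint
```

Because the records stay in the log after being read, an engineer can attach an experimental consumer and remove it again without touching the producers.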

Page 20: Stream Processing in SmartNews #jawsdays

News Delivery Pipeline

[Diagram: the news delivery pipeline with three Kinesis Streams added]

Components: Internet, Crawler, Analyzer, Indexer, CloudSearch, API Search, API Gateway, Mobile App, API Tracker, DynamoDB, Kinesis Stream (×3)

Page 21: Stream Processing in SmartNews #jawsdays

Data & Its Numbers

• User activities

• ~100 GB per day (compressed)

• 60+ record types

• User demographics, configurations, etc.

• 15M+ records

• Article metadata

• 100K+ records per day

Page 22: Stream Processing in SmartNews #jawsdays

How Do We Produce to and Consume from Kinesis Streams?

Page 23: Stream Processing in SmartNews #jawsdays

Index System

[Diagram] Crawler [KPL ×3] → [KCL ×3] Analyzer [KPL ×3] → [KCL ×3] Indexer → CloudSearch

Collect, Analyze and Index Articles with Kinesis Libraries (KPL & KCL)

Page 24: Stream Processing in SmartNews #jawsdays

Kinesis Libraries

• Kinesis Producer Library (KPL)

• put records into a stream

• asynchronous architecture (buffer records)

• Kinesis Client Library (KCL)

• consume and process data from a stream

• handle complex tasks associated with distributed computing
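
The "buffer records" idea behind an asynchronous producer can be sketched as follows. This is a simplified illustration, not the KPL API: `BufferingProducer` and `send_batch` are hypothetical names, and the real library also flushes on a timer and aggregates records, which this sketch omits. The default batch size of 500 mirrors the per-request record limit of the Kinesis PutRecords API.

```python
class BufferingProducer:
    """Toy producer that buffers records and sends them in batches.

    Hypothetical sketch, not the KPL API. `send_batch` is any callable that
    accepts a list of records, for example a wrapper around a PutRecords call.
    """

    def __init__(self, send_batch, max_batch_size=500):
        self.send_batch = send_batch
        self.max_batch_size = max_batch_size
        self.buffer = []

    def add(self, record):
        # Returns immediately; records are sent only once the buffer fills up.
        self.buffer.append(record)
        if len(self.buffer) >= self.max_batch_size:
            self.flush()

    def flush(self):
        # Send whatever is buffered and start a fresh buffer.
        if self.buffer:
            self.send_batch(self.buffer)
            self.buffer = []


sent_batches = []  # stands in for the network call
producer = BufferingProducer(sent_batches.append, max_batch_size=3)
for i in range(7):
    producer.add({"articleId": i})
producer.flush()  # send the remainder, e.g. on shutdown

batch_sizes = [len(batch) for batch in sent_batches]
```

Seven records result in three calls to `send_batch` instead of seven, which is the main benefit of buffering on the producer side.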

Page 25: Stream Processing in SmartNews #jawsdays

KPL/KCL Monitoring

• KPL and KCL publish custom CloudWatch metrics

• Key Metrics for KPL

• UserRecordsReceived, UserRecordsPending

• AllErrors

• Key Metrics for KCL

• RecordsProcessed

• MillisBehindLatest

• RecordProcessor.processRecords.Time

https://docs.aws.amazon.com/kinesis/latest/dev/monitoring-with-kpl.html

https://docs.aws.amazon.com/kinesis/latest/dev/monitoring-with-kcl.html

Page 26: Stream Processing in SmartNews #jawsdays

Monitoring with Datadog

Page 27: Stream Processing in SmartNews #jawsdays

Feedback System

Generate Metrics by User Clusters for Ranking Articles

[Diagram]

Services: API Search, API Gateway, API Tracker, Push Notification, Amazon CloudSearch, Kinesis Stream, Amazon S3, Hive / Spark, DynamoDB

Data: Article Metadata, User Feedback, User Clusters, Metrics by Cluster

Offline ETL / Machine Learning: Amazon S3, Hive / Spark

Page 28: Stream Processing in SmartNews #jawsdays

Why Metrics by Cluster?

Consider Each User's Interests

Ensure Diversity for Avoiding Filter Bubble

https://en.wikipedia.org/wiki/Filter_bubble

[Diagram: User → GET /news/sports → API]

Article Inventory (Amazon CloudSearch) + Metrics by User Cluster (DynamoDB) = score

Article                    raw score   weight   score
San Francisco Giants       3.5         1        3.5
New York Yankees           6.2         0.6      3
FIFA World Cup             20.4        0.2      4.08
U.S. Open Championships    8.4         0.2      1.68

User: userId: 1000, gender: Male, age: 36, location: San Francisco, US, interests: Baseball
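
The combination of raw scores and cluster weights on this slide can be sketched as a per-article multiplication. This is a minimal sketch that assumes score = raw score × weight, which reproduces three of the four rows on the slide; for New York Yankees the product is 3.72, where the slide shows 3. Keying the weights by article title and the function name `rank` are simplifications for illustration, not the production data model.

```python
# Raw scores from the article inventory (values taken from the slide).
raw_scores = {
    "San Francisco Giants": 3.5,
    "New York Yankees": 6.2,
    "FIFA World Cup": 20.4,
    "U.S. Open Championships": 8.4,
}

# Weights for the cluster of the requesting user (values taken from the slide).
cluster_weights = {
    "San Francisco Giants": 1.0,
    "New York Yankees": 0.6,
    "FIFA World Cup": 0.2,
    "U.S. Open Championships": 0.2,
}


def rank(raw_scores, weights):
    """Return (title, score) pairs sorted by weighted score, highest first."""
    scored = {
        title: round(raw * weights.get(title, 0.0), 2)
        for title, raw in raw_scores.items()
    }
    return sorted(scored.items(), key=lambda pair: pair[1], reverse=True)


ranking = rank(raw_scores, cluster_weights)
scores = dict(ranking)
```

Even with a weight of only 0.2, the FIFA World Cup article still ranks first because its raw score is so high, which shows how the weighting personalizes results without hiding broadly popular stories.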

Page 29: Stream Processing in SmartNews #jawsdays

Input Data by Fluentd

• Forwarder (running on each instance)

• archive events to S3

• forward events to aggregators

• Aggregator (HA Configuration※)

• put events into Kinesis Stream

• alert and report (not mentioned here)

※ http://docs.fluentd.org/articles/high-availability

Page 30: Stream Processing in SmartNews #jawsdays

Example Configurations

Forwarder

    <source>
      @type tail
      tag smartnews.user_activity
      ...
    </source>

    <match smartnews.user_activity>
      @type copy
      <store>
        @type s3
        ...
      </store>
      <store>
        @type forward
        ...
      </store>
    </match>

Aggregator

    <source>
      @type forward
      ...
    </source>

    <match smartnews.user_activity>
      @type copy
      <store>
        @type kinesis
        ...
      </store>
      <store>
        ...
      </store>
    </match>

http://docs.fluentd.org/articles/kinesis-stream

Page 31: Stream Processing in SmartNews #jawsdays

Offline ETL Flow

Transform Text Files into Columnar Files

Various Machine Learning Tasks

Activities (API → Amazon S3):

    {"timestamp": 1453161447, "userId": 1234, "platform": "ios", "edition": "ja_JP", "action": "viewArticle", "data": {"articleId": 1234, "duration": 30.2}}

Users (RDS):

    userId, age, gender, location
    1234, 28, M, Tokyo, …
    1235, 32, F, Nagano, …
    1240, 18, F, Kyoto, …

Flow: Amazon S3 → Hive on EMR → Amazon S3 → Spark on EMR

Airflow: Manage Workflow
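
What "transforming text files into columnar files" means can be sketched in plain Python. The production flow uses Hive on EMR, so this is only an illustration of the row-to-column pivot. The first input line is the activity event from the slide; the second is a made-up event added so that the columns hold more than one value. The helper names `flatten` and `to_columns` are hypothetical.

```python
import json

raw_lines = [
    # Event from the slide.
    '{"timestamp": 1453161447, "userId": 1234, "platform": "ios", '
    '"edition": "ja_JP", "action": "viewArticle", '
    '"data": {"articleId": 1234, "duration": 30.2}}',
    # Hypothetical second event, for illustration only.
    '{"timestamp": 1453161450, "userId": 1235, "platform": "android", '
    '"edition": "ja_JP", "action": "viewArticle", '
    '"data": {"articleId": 5678, "duration": 12.0}}',
]


def flatten(event):
    """Lift the nested "data" object into top-level "data.<key>" fields."""
    flat = {key: value for key, value in event.items() if key != "data"}
    for key, value in event.get("data", {}).items():
        flat[f"data.{key}"] = value
    return flat


def to_columns(lines):
    """Parse JSON lines (row-oriented) and pivot them into one list per column."""
    rows = [flatten(json.loads(line)) for line in lines]
    names = sorted({name for row in rows for name in row})
    return {name: [row.get(name) for row in rows] for name in names}


columns = to_columns(raw_lines)
```

A query that only needs `userId` and `data.duration` can now read two lists instead of parsing every JSON document, which is the reason columnar formats speed up analytical queries.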

Page 32: Stream Processing in SmartNews #jawsdays

Airflow: Workflow Engine

Execute Task A -> Task B -> Task C, D

5 * * * * app hive -f query_1.hql

15 * * * * app hive -f query_2.hql

30 * * * * app hive -f query_3.hql
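
The difference between the cron entries above and a workflow engine is that the engine runs a task when its upstream tasks have finished, not at a guessed minute offset. The sketch below illustrates dependency-ordered execution with Python's standard `graphlib` module (Python 3.9+). It is not Airflow's API: the task names and the `run_in_order` helper are hypothetical, and it assumes that "Task C, D" means both C and D depend on B.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    "task_b": {"task_a"},
    "task_c": {"task_b"},
    "task_d": {"task_b"},
}


def run_in_order(dependencies, run):
    """Run every task once, only after all of its upstream tasks are done."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    executed = []
    while sorter.is_active():
        # Tasks returned here have no unfinished upstream tasks.
        for task in sorted(sorter.get_ready()):
            run(task)
            executed.append(task)
            sorter.done(task)
    return executed


order = run_in_order(dependencies, run=lambda task: None)
```

With cron, `query_2.hql` starts at minute 15 whether or not `query_1.hql` has finished; with explicit dependencies, a slow upstream task simply delays the downstream ones.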

Page 33: Stream Processing in SmartNews #jawsdays

Spark Streaming

[Diagram]

Kinesis Stream (Shard 1, Shard 2, Shard 3) → DStream 1, DStream 2, DStream 3 → RDDs

Split Streams into Minutely RDD

Join Minutely RDD on Pre-Computed RDD (user clusters: Female, Male, Teen)

Minutely Metrics by User Cluster → DynamoDB
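
The two steps on this slide, splitting events into one-minute buckets and joining them with pre-computed user clusters, can be sketched without Spark. This is a plain-Python illustration of the logic only: the event fields, the `user_clusters` mapping, and the choice to count views per article are assumptions, not the production job.

```python
from collections import Counter

# Pre-computed mapping from user to cluster (hypothetical values).
user_clusters = {1234: "Male", 1235: "Female", 1240: "Teen"}

# Hypothetical activity events; timestamps are Unix seconds.
events = [
    {"timestamp": 1453161447, "userId": 1234, "articleId": 10},
    {"timestamp": 1453161450, "userId": 1235, "articleId": 10},
    {"timestamp": 1453161455, "userId": 1234, "articleId": 10},
    {"timestamp": 1453161485, "userId": 1240, "articleId": 11},
    {"timestamp": 1453161490, "userId": 9999, "articleId": 11},  # unknown user
]


def minutely_metrics(events, user_clusters):
    """Count events per (minute, cluster, article)."""
    counts = Counter()
    for event in events:
        # Truncate the timestamp to the start of its minute.
        minute = event["timestamp"] - event["timestamp"] % 60
        # Join with the pre-computed clusters; skip users without a cluster.
        cluster = user_clusters.get(event["userId"])
        if cluster is None:
            continue
        counts[(minute, cluster, event["articleId"])] += 1
    return counts


metrics = minutely_metrics(events, user_clusters)
```

Each key of `metrics` corresponds to one row that would be written to DynamoDB as a minutely metric for a user cluster.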

Page 34: Stream Processing in SmartNews #jawsdays

Monitor Spark Streaming

Spark UI is Useful for Monitoring

Page 35: Stream Processing in SmartNews #jawsdays

Integrate with CloudWatch

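    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted, StreamingListenerBatchStarted}

    // Relays Spark Streaming batch events to CloudWatch as custom metrics.
    // putMetricToCloudWatch(name, value) is a helper that is not shown on the slide.
    class CloudWatchRelay(conf: SparkConf) extends StreamingListener {

      override def onBatchStarted(batchStarted: StreamingListenerBatchStarted): Unit = {
        putMetricToCloudWatch("BatchStarted", 1.0)
      }

      override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
        putMetricToCloudWatch("BatchCompleted", 1.0)
        putMetricToCloudWatch("BatchRecordsProcessed", batchCompleted.batchInfo.numRecords.toDouble)
        // The delays are Option values (milliseconds); emit them only when present.
        batchCompleted.batchInfo.processingDelay.foreach { delay =>
          putMetricToCloudWatch("ProcessingDelay", delay)
        }
        batchCompleted.batchInfo.schedulingDelay.foreach { delay =>
          putMetricToCloudWatch("SchedulingDelay", delay)
        }
        batchCompleted.batchInfo.totalDelay.foreach { delay =>
          putMetricToCloudWatch("TotalDelay", delay)
        }
      }
    }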

Set an Alert on SchedulingDelay
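
A growing scheduling delay means batches are queuing up because processing cannot keep pace with the batch interval. One possible alert rule is sketched below. The rule itself (several consecutive batches waiting longer than one batch interval), the function name `should_alert`, and the thresholds are assumptions for illustration, not the alert configured in production.

```python
def should_alert(scheduling_delays_ms, batch_interval_ms, consecutive=3):
    """Return True when the last `consecutive` batches each waited longer
    than one batch interval before they started processing."""
    recent = scheduling_delays_ms[-consecutive:]
    return len(recent) == consecutive and all(
        delay > batch_interval_ms for delay in recent
    )


batch_interval_ms = 60_000  # one-minute batches (assumed)

falling_behind = should_alert([100, 70_000, 80_000, 90_000], batch_interval_ms)
recovered = should_alert([70_000, 80_000, 100], batch_interval_ms)
too_few_samples = should_alert([70_000], batch_interval_ms)
```

Requiring several consecutive slow batches avoids paging on a single spike, at the cost of reacting a few minutes later.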

Page 36: Stream Processing in SmartNews #jawsdays

Summary

Page 37: Stream Processing in SmartNews #jawsdays

Summary

• Fast & stable stream processing is crucial for SmartNews

• lifetime of news is very short

• process events as fast as possible

• Kinesis Stream plays an important role

• one-click provision & scale

• empowers engineers to do trial & error

Page 38: Stream Processing in SmartNews #jawsdays

Discuss More?

Join Our Free Lunch in Tokyo Office!!

Page 39: Stream Processing in SmartNews #jawsdays

We’re hiring!!!

• ML/NLP engineer

• Site reliability engineer

• Web application engineer

• iOS/Android engineer

• Ad engineer

http://about.smartnews.com/en/careers/

Page 40: Stream Processing in SmartNews #jawsdays

See Also

• The Platform Behind SmartNews's Web Mining

• Integrating Stream Processing and Offline Processing

• Building a Sustainable Data Platform on AWS

• AWS meetup "Apache Spark on EMR"

Page 41: Stream Processing in SmartNews #jawsdays

PipelineDB

Page 42: Stream Processing in SmartNews #jawsdays

PipelineDB

• OSS & enterprise streaming SQL database

• PostgreSQL compatible

• connect to Chartio 😍

• join stream to normal PostgreSQL table

• Support probabilistic data structures

• e.g. HyperLogLog

https://www.pipelinedb.com/

http://developer.smartnews.com/blog/2015/09/09/20150907pipelinedb/
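
How a probabilistic structure such as HyperLogLog counts distinct values in constant space can be sketched as follows. This is a simplified textbook version written for illustration, not PipelineDB's implementation: the class name, the use of SHA-1 as the hash, and the choice of 1,024 registers are all assumptions.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog: estimates the number of distinct items
    using a fixed array of 2**p small registers."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p              # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        index = h >> (64 - self.p)                   # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Position of the leftmost 1-bit in `rest` (all zeros gives 64 - p + 1).
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)        # bias correction for m >= 128
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog(p=10)
for user_id in range(10_000):
    hll.add(user_id)
    hll.add(user_id)  # duplicates do not change the registers

estimate = hll.count()
```

The structure holds 1,024 small integers no matter how many users are added, and the expected relative error for this register count is roughly 3%.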

Page 43: Stream Processing in SmartNews #jawsdays

Realtime Monitoring

[Diagram]

API Gateway → Record → Stream → Continuous Views (×3) in PipelineDB

Raw records are discarded soon after being consumed by a Continuous View

Continuous Views are incrementally updated in real time

Access Continuous Views with a PostgreSQL client: Chartio; AWS Lambda → Slack

Page 44: Stream Processing in SmartNews #jawsdays

Continuous View

    -- Calculate unique users seen per media each day
    -- Using only a constant amount of space (HyperLogLog)
    CREATE CONTINUOUS VIEW uniques AS
      SELECT day(arrival_timestamp),
             substring(url from '.*://([^/]*)') AS hostname,
             COUNT(DISTINCT user_id::integer)
      FROM activity_stream
      GROUP BY day, hostname;

    -- How many impressions have we served in the last five minutes?
    CREATE CONTINUOUS VIEW imps WITH (max_age = '5 minutes') AS
      SELECT COUNT(*) FROM imps_stream;

    -- What are the 90th, 95th, 99th percentiles of request latency?
    -- percentile_cont takes fractions between 0 and 1
    CREATE CONTINUOUS VIEW latency AS
      SELECT percentile_cont(array[0.90, 0.95, 0.99])
             WITHIN GROUP (ORDER BY latency::integer)
      FROM latency_stream;
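
The semantics of a view with `max_age = '5 minutes'` can be illustrated with a small sliding-window counter. This sketch keeps every timestamp in memory and assumes events arrive in time order, so it shows only what the view returns, not how PipelineDB maintains it incrementally. The class name `SlidingWindowCounter` is hypothetical.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events whose timestamp falls within the last `max_age_seconds`."""

    def __init__(self, max_age_seconds):
        self.max_age = max_age_seconds
        self.timestamps = deque()

    def add(self, timestamp):
        # Assumes timestamps are added in non-decreasing order.
        self.timestamps.append(timestamp)

    def count(self, now):
        # Drop events that have aged out of the window, then count the rest.
        while self.timestamps and self.timestamps[0] <= now - self.max_age:
            self.timestamps.popleft()
        return len(self.timestamps)


imps = SlidingWindowCounter(max_age_seconds=300)  # five minutes
for ts in (0, 100, 250):
    imps.add(ts)

at_299 = imps.count(now=299)  # all three impressions are still in the window
at_320 = imps.count(now=320)  # the impression at t=0 has aged out
at_700 = imps.count(now=700)  # nothing was served in the last five minutes
```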

Page 45: Stream Processing in SmartNews #jawsdays

Dashboard in Chartio

1. Build a query (Drag & Drop / SQL)

2. Add steps (filter, sort, modify)

3. Select a visualization (table, graph)