
Building an ETL pipeline for Elasticsearch using Spark


Page 1: Building an ETL pipeline for Elasticsearch using Spark

Building an ETL pipeline for Elasticsearch using Spark


Itai Yaffe, Big Data Infrastructure Developer

December 2015

Page 2: Building an ETL pipeline for Elasticsearch using Spark

Agenda
• About eXelate
• About the team
• eXelate’s architecture overview
• The need
• The problem
• Why Elasticsearch and how do we use it?
• Loading the data
• Re-designing the loading process
• Additional improvements
• To summarize


Page 3: Building an ETL pipeline for Elasticsearch using Spark

About eXelate, a Nielsen company
• Founded in 2007
• Acquired by Nielsen in March 2015
• A leader in the Ad Tech industry
• Provides data and software services through:
  • eXchange (2 billion users)
  • maX DMP (data management platform)


Page 4: Building an ETL pipeline for Elasticsearch using Spark

Our numbers


• ~10 billion events per day
• ~150TB of data per day
• Hybrid cloud infrastructure:
  • 4 data centers
  • Amazon Web Services

Page 5: Building an ETL pipeline for Elasticsearch using Spark

About the team
• The BDI (Big Data Infrastructure) team is in charge of shipping, transforming and loading eXelate’s data into various data stores, making it ready to be queried efficiently
• For the last year and a half, we’ve been transitioning our legacy systems to modern, scale-out infrastructure (Spark, Kafka, etc.)


Page 6: Building an ETL pipeline for Elasticsearch using Spark

About me
• Dealing with Big Data challenges for the last 3.5 years, using:
  • Cassandra
  • Spark
  • Elasticsearch
  • And others…
• Joined eXelate in May 2014
• Previously: OpTier, Mamram
• LinkedIn: https://www.linkedin.com/in/itaiy
• Email: [email protected]


Page 7: Building an ETL pipeline for Elasticsearch using Spark

eXelate’s architecture overview


[Architecture diagram: incoming HTTP requests → Serving (frontend servers) → ETL → DB / DWH → DMP applications (SaaS)]

Page 8: Building an ETL pipeline for Elasticsearch using Spark

The need


Page 10: Building an ETL pipeline for Elasticsearch using Spark

The need
• From the data perspective:
  • ETL – collect raw data and load it into Elasticsearch periodically
    • Tens of millions of events per day
    • Data is already labeled
  • Query – allow ad hoc calculations based on the stored data (an illustrative request follows below)
    • Mainly counting unique users related to a specific campaign, in conjunction with geographic/demographic data, limited by a date range
    • The number of permutations is huge, so real-time queries are a must! (and can’t be pre-calculated)

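To make the query side concrete, a count request along these lines could answer “how many unique users in segment X within a date range” (illustrative only: the index, type and field names follow the sample document shown on a later slide, and the exact nesting in the query depends on the real mappings). Since each user is a single document, counting matching documents counts unique users:

curl -XGET 'http://localhost:9200/sample/user/_count' -d '{
  "query": {
    "nested": {
      "path": "events",
      "query": {
        "bool": {
          "must": [
            { "term": { "events.segments.segment": "female" } },
            { "range": { "events.event_time": { "gte": "2014-01-01", "lte": "2014-03-01" } } }
          ]
        }
      }
    }
  }
}'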

Page 11: Building an ETL pipeline for Elasticsearch using Spark

The problem
• We chose Elasticsearch as the data store (details to follow)
• But… the ETL process was far from optimal
  • It also affected query performance


Page 12: Building an ETL pipeline for Elasticsearch using Spark

Why Elasticsearch?
• Originally designed as a text search engine
• Today it has advanced real-time analytics capabilities
• Distributed, scalable and highly available


Page 13: Building an ETL pipeline for Elasticsearch using Spark

How do we use Elasticsearch?
• We rely heavily on its counting capabilities
• Splitting the data into separate indices based on a few criteria (e.g. TTL, tags vs. segments)
• Each user (i.e. a device) is stored as a document with many nested documents


Page 14: Building an ETL pipeline for Elasticsearch using Spark

How do we use Elasticsearch?


{"_index": "sample","_type": "user","_id": "0c31644ad41e32c819be29ba16e14300","_version": 4,"_score": 1,"_source": {

"events": [{

"event_time": "2014-01-18","segments": [

{"segment": "female"

},{

"segment": "Airplane tickets"}

]},{

"event_time": "2014-02-19","segments": [

{"segment": "female"

},{

"segment": "Hotel reservations"}

]}

]}}

Page 15: Building an ETL pipeline for Elasticsearch using Spark

Loading the data


Page 16: Building an ETL pipeline for Elasticsearch using Spark

Standalone Java loader application
• Runs every few minutes
• Parses the log files
• For each user we encountered:
  • Queries Elasticsearch to get the user’s document
  • Merges the new data into the document on the client-side
• Bulk-indexes documents into Elasticsearch (a rough sketch of this pattern follows below)

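For contrast, the per-user round trip looked roughly like this (a minimal sketch of the pattern using the Elasticsearch 1.x Java client from Scala, not the actual loader; the index/type names, parsedLogLines and mergeEvents are hypothetical):

import org.elasticsearch.client.transport.TransportClient
import org.elasticsearch.common.transport.InetSocketTransportAddress

// Hypothetical client-side merge of new log events into the stored document.
def mergeEvents(storedJson: String, newEventsJson: String): String = ???

// Hypothetical (userId, newEventsJson) pairs parsed from the log files.
val parsedLogLines: Seq[(String, String)] = Seq.empty

val client = new TransportClient()
  .addTransportAddress(new InetSocketTransportAddress("es-host", 9300))

val bulk = client.prepareBulk()
for ((userId, newEventsJson) <- parsedLogLines) {
  // One GET per user just to fetch the current document...
  val current = client.prepareGet("users", "user", userId).execute().actionGet()
  // ...merge on the client side...
  val merged = mergeEvents(current.getSourceAsString, newEventsJson)
  // ...then re-index the whole document (internally a delete + insert).
  bulk.add(client.prepareIndex("users", "user", userId).setSource(merged))
}
bulk.execute().actionGet()

This is exactly the traffic pattern the next slide calls out: a read plus a full re-index per user, per run.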

Page 17: Building an ETL pipeline for Elasticsearch using Spark

OK, so what’s the problem?
• Multiple updates per user per day
  • Updates in Elasticsearch are expensive (basically a delete + insert)
• Merges are done on the client-side
  • Involves redundant queries
  • Leads to degradation of query performance
• Not scalable or highly available


Page 18: Building an ETL pipeline for Elasticsearch using Spark

Re-designing the loading process
• Batch processing once a day during off-hours
  • Daily dedup leads to ~75% fewer update operations in Elasticsearch
• Using Spark as our processing framework
  • Distributed, scalable and highly available
  • Unified framework for batch, streaming, machine learning, etc.
• Using an update script
  • Merges are done on the server-side


Page 19: Building an ETL pipeline for Elasticsearch using Spark

Elasticsearch update script


import groovy.json.JsonSlurper;

// param1 holds the new event for this user (as JSON); ttl is passed alongside it.
added = false;

def slurper = new JsonSlurper();
def result = slurper.parseText(param1);
ctx._ttl = ttl;

// If the document already has an event with the same event_time,
// merge in only the segments it doesn't already contain.
ctx._source.events.each() { item ->
    if (item.event_time == result[0].event_time) {
        def segmentMap = [:];
        item.segments.each() { segmentMap.put(it.segment, it.segment) };
        result[0].segments.each {
            if (!segmentMap[it.segment]) {
                item.segments += it
            }
        };
        added = true;
    }
};

// Otherwise append the new event as-is.
if (!added) {
    ctx._source.events += result
}

Page 20: Building an ETL pipeline for Elasticsearch using Spark

Re-designing the loading process


[Flow diagram: AWS Data Pipeline orchestrates a daily job that reads from AWS S3, processes on AWS EMR, and sends notifications via AWS SNS]

Page 21: Building an ETL pipeline for Elasticsearch using Spark

Zoom-in
• Log files are compressed (.gz) CSVs
• Once a day:
  • Files are copied and uncompressed into the EMR cluster using S3DistCp
  • The Spark application (sketched below):
    • Groups events by user and builds JSON documents, which include an inline update script
    • Writes the JSON documents back to S3
  • The Scala application reads the documents from S3 and bulk-indexes them into Elasticsearch
  • Notifications are sent via SNS

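A minimal sketch of that grouping step (assuming a hypothetical userId,eventTime,segment CSV layout; buildJsonDocument stands in for the real document and update-script construction):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper that renders one user's deduped events as a JSON
// update document (including the inline update script).
def buildJsonDocument(userId: String, events: Iterable[(String, String)]): String = ???

val sc = new SparkContext(new SparkConf().setAppName("es-etl"))

// Files were uncompressed into the cluster by S3DistCp.
val lines = sc.textFile("hdfs:///input/")

val documentsRdd = lines
  .map(_.split(','))
  .map(fields => (fields(0), (fields(1), fields(2)))) // (userId, (eventTime, segment))
  .groupByKey()                                       // daily dedup: one record per user
  .map { case (userId, events) => buildJsonDocument(userId, events) }

// Written back to S3 for the separate indexing application
// (later replaced by indexing directly from Spark).
documentsRdd.saveAsTextFile("s3://bucket/documents/")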

Page 22: Building an ETL pipeline for Elasticsearch using Spark

We discovered it wasn’t enough…
• Redundant moving parts
• Excessive network traffic
• Still not scalable enough


Page 23: Building an ETL pipeline for Elasticsearch using Spark

Elasticsearch-Spark plug-in to the rescue…


[Flow diagram: as above, but the Elasticsearch-Spark plug-in running on AWS EMR now indexes directly into Elasticsearch; AWS S3, AWS Data Pipeline and AWS SNS keep their roles]

Page 24: Building an ETL pipeline for Elasticsearch using Spark

Deep-dive
• Bulk-indexing directly from Spark using the elasticsearch-hadoop plug-in for Spark:

// Save created RDD records to a file
documentsRdd.saveAsTextFile(outputPath)

Is now:

// Save created RDD records directly to Elasticsearch
documentsRdd.saveJsonToEs(configData.documentResource,
  scala.collection.Map(ConfigurationOptions.ES_MAPPING_ID -> configData.documentIdFieldName))

• Storing the update script on the server-side (Elasticsearch) – see the sketch below

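Putting the options together, the direct write might be configured roughly like this (a sketch, assuming elasticsearch-hadoop 2.x option names; the resource and id field are placeholders, and the exact script-related options vary between plug-in versions):

import org.elasticsearch.hadoop.cfg.ConfigurationOptions
import org.elasticsearch.spark._ // adds saveJsonToEs to RDD[String]

documentsRdd.saveJsonToEs(
  "users/user", // index/type resource (placeholder)
  Map(
    ConfigurationOptions.ES_MAPPING_ID -> "userId",       // document id field (placeholder)
    ConfigurationOptions.ES_WRITE_OPERATION -> "upsert",  // update-or-insert per document
    // Plug-in level retries (complementing spark.task.maxFailures at the Spark level):
    ConfigurationOptions.ES_BATCH_WRITE_RETRY_COUNT -> "5",
    ConfigurationOptions.ES_BATCH_WRITE_RETRY_WAIT -> "30s"
    // ...plus the es.update.script* options to attach the server-side merge script.
  ))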

Page 25: Building an ETL pipeline for Elasticsearch using Spark

Better…
• Single component for both processing and indexing
• Elastically scalable
• Out-of-the-box error handling and fault-tolerance:
  • Spark-level (e.g. spark.task.maxFailures)
  • Plug-in level (e.g. ConfigurationOptions.ES_BATCH_WRITE_RETRY_COUNT/WAIT)
• Less network traffic (the update script is stored in Elasticsearch)


Page 26: Building an ETL pipeline for Elasticsearch using Spark

… But
• The number of deleted documents continually grows
  • This also affects query performance
• Elasticsearch itself becomes the bottleneck:
  • org.elasticsearch.hadoop.EsHadoopException: Could not write all entries [5/1047872] (maybe ES was overloaded?). Bailing out...
  • [INFO ][index.engine ] [NODE_NAME] [INDEX_NAME][7] now throttling indexing: numMergesInFlight=6, maxNumMerges=5


Page 27: Building an ETL pipeline for Elasticsearch using Spark

Expunging deleted documents
• Theoretically not a “best practice”, but necessary when doing significant bulk-indexing
• Done through the optimize API:
  • curl -XPOST http://localhost:9200/_optimize?only_expunge_deletes
  • curl -XPOST http://localhost:9200/_optimize?max_num_segments=5
• A heavy operation (time, CPU, I/O)


Page 28: Building an ETL pipeline for Elasticsearch using Spark

Improving indexing performance
• Set index.refresh_interval to -1
• Set indices.store.throttle.type to none (both settings changes are scripted in the sketch below)
• Properly set the retry-related configuration properties (e.g. spark.task.maxFailures)

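One way to script those settings changes around the load (a sketch using plain HTTP against a 1.x-era cluster; the host and index name are placeholders):

import java.io.OutputStreamWriter
import java.net.{HttpURLConnection, URL}

// Minimal helper: PUT a JSON body to an Elasticsearch endpoint.
def put(endpoint: String, body: String): Int = {
  val conn = new URL(endpoint).openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("PUT")
  conn.setDoOutput(true)
  val out = new OutputStreamWriter(conn.getOutputStream)
  out.write(body)
  out.close()
  conn.getResponseCode
}

// Before bulk indexing: stop periodic refreshes on the target index...
put("http://localhost:9200/users/_settings",
  """{"index": {"refresh_interval": "-1"}}""")
// ...and disable store throttling cluster-wide.
put("http://localhost:9200/_cluster/settings",
  """{"transient": {"indices.store.throttle.type": "none"}}""")

// After the load completes, restore the defaults.
put("http://localhost:9200/users/_settings",
  """{"index": {"refresh_interval": "1s"}}""")
put("http://localhost:9200/_cluster/settings",
  """{"transient": {"indices.store.throttle.type": "merge"}}""")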

Page 29: Building an ETL pipeline for Elasticsearch using Spark

What’s next?
• Further improve indexing performance, e.g.:
  • Reduce excessive concurrency on Elasticsearch nodes by limiting Spark’s maximum concurrent tasks
  • Bulk-index objects rather than JSON documents to avoid excessive parsing
• Better monitoring (e.g. using Spark accumulators; see the sketch below)

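For the monitoring point, a named Spark accumulator is a cheap way to count conditions of interest during processing, e.g. malformed input lines (a sketch; the input path and record layout are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("es-etl-monitoring"))
val lines = sc.textFile("s3://bucket/logs/") // placeholder input

// Named accumulators also show up in the Spark UI.
val malformed = sc.accumulator(0L, "malformed lines")

val parsed = lines.flatMap { line =>
  val fields = line.split(',')
  if (fields.length == 3) Some(fields)
  else { malformed += 1L; None }
}

parsed.count() // force evaluation so the accumulator is populated
println(s"Malformed lines: ${malformed.value}")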

Page 30: Building an ETL pipeline for Elasticsearch using Spark

To summarize
• We use:
  • S3 to store (raw) labeled data
  • Spark on EMR to process the data
    • The elasticsearch-hadoop plug-in for bulk-indexing
  • Data Pipeline to manage the flow
  • Elasticsearch for real-time analytics


Page 31: Building an ETL pipeline for Elasticsearch using Spark

To summarize
• Updates are expensive – consider daily dedup
• Avoid excessive querying and network traffic – perform merges on the server-side:
  • Use an update script
  • Store it on your Elasticsearch cluster
• Make sure your loading process is scalable and fault-tolerant – use Spark:
  • Reduce the number of moving parts
  • Index the data directly using the elasticsearch-hadoop plug-in


Page 32: Building an ETL pipeline for Elasticsearch using Spark

To summarize
• Improve indexing performance – properly configure your cluster before indexing
• Avoid excessive disk usage – optimize your indices
  • This can also help query performance
• Making the processing phase elastically scalable (i.e. using Spark) doesn’t mean the whole ETL flow is elastically scalable
  • Elasticsearch becomes the new bottleneck…


Page 33: Building an ETL pipeline for Elasticsearch using Spark

Questions?

Also - we’re hiring!
http://exelate.com/about-us/careers/
• DevOps team leader
• Senior frontend developers
• Senior Java developers


Page 34: Building an ETL pipeline for Elasticsearch using Spark

Thank you


Itai Yaffe

Page 35: Building an ETL pipeline for Elasticsearch using Spark

Keep an eye on…
• S3 limitations:
  • The penalty involved in moving files
  • File partitioning and hash prefixes (see the sketch below)

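On the hash-prefix point: at the time, S3 partitioned keys by their leading characters, so sequential prefixes (like dates) could concentrate load on one partition. A common workaround was to prepend a short hash to each key (a sketch; the layout is illustrative):

import java.security.MessageDigest

// Prefix each key with the first 4 hex chars of its MD5, so keys are spread
// across S3's internal partitions instead of sharing a sequential prefix.
def hashedKey(fileName: String): String = {
  val md5 = MessageDigest.getInstance("MD5")
    .digest(fileName.getBytes("UTF-8"))
    .map("%02x".format(_))
    .mkString
  s"${md5.take(4)}/$fileName"
}

// Prints something like "<4 hex chars>/2015-12-01-part-0001.csv.gz"
println(hashedKey("2015-12-01-part-0001.csv.gz"))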