Realtime Reporting using Spark Streaming

  1. 1. Breaking ETL barrier with Real-time reporting using Kafka, Spark Streaming
  2. 2. About us: Concur (now part of SAP) provides travel and expense management services to businesses.
  3. 3. Data Insights: a team building solutions that give customers access to data, visualization, and reporting. Products: Expense, Travel, Invoice.
  4. 4. About me: Santosh Sahoo, Principal Architect III, Data Insights
  5. 5. Stack so far (diagram labels): OLAP, Report, ETL, OLTP, App
  6. 6. Numbers: 7K OLTP database sources; 14K OLAP reporting databases; 28K ETL jobs; 2B row changes; 300M rows (compacted); only ~20 failures a night
  7. 7. Traditional ETL challenges: scheduled (high latency); hard to scale; failover and recovery; monolithic; spaghetti (logic + SQL)
  8. 8. Moving forward: streaming, real time; scalable; highly available; reduced maintenance overhead; eventual consistency
  9. 9. Streaming Data Pipeline: Source, Flow Management, Processor, Storage, Querying
  10. 10. Data Source: event bus for business events; log scraping; transaction log scraping (Oracle GoldenGate, MySQL binlog, MongoDB oplog, Postgres Bottled Water, SQL Server fn_dblog); change data capture; application messaging/JMS; micro-batching (high watermark, change tracking)
  11. 11. Kafka - Flow Management: no-nonsense logging; 100K/s throughput vs 20K for RabbitMQ; log compaction; durable persistence; partition tolerance; replication; best-in-class integration with Spark
  12. 12. Columnar Storage: optimized for analytic query performance. Vertical partitioning; column projection; compression; loosely coupled schema. Examples: HBase, AWS Redshift, Parquet, ORC, Postgres (Citus), SAP HANA
  13. 13. Hadoop/HDFS: Pro - scale. Con - latency.
  14. 14. Spark Streaming. What? A data processing framework for building scalable, fault-tolerant streaming applications. Why? It lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.
  15. 15. Spark Streaming Architecture (diagram labels: Source, Receiver, Driver, Master, Workers, Executors, Tasks, data blocks D1-D4, WAL, Replication, Data Store). DStream: Discretized Stream of RDDs. RDD: Resilient Distributed Dataset.
  16. 16. Optimized Direct Kafka API
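  17. 17. How
      import kafka.serializer.StringDecoder
      import org.apache.spark.streaming.kafka.KafkaUtils

      // Brokers to bootstrap from, as comma-separated host:port pairs (no spaces).
      val kafkaParams = Map("metadata.broker.list" -> "localhost:9092,anotherhost:9092")
      val topics = Set("sometopic", "anothertopic")
      // streamingContext is an existing StreamingContext; keys and values are decoded as strings.
      val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        streamingContext, kafkaParams, topics)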
  18. 18. Architecture
  19. 19. High-level view: App -> OLTP -> Kafka -> Spark Streaming -> OLAP -> Reporting App
  20. 20. Complete Architecture (diagram labels): sources: Expense, Travel, TTX (API, API/SQL, service bus); import: FTP, HTTP, SMTP; Sqoop snapshot; OLTP; broker: Kafka (Protobuf, JSON); stream processor: Spark (also Samza, Storm, Flink); archive: Flume, Camus; HDFS; Tachyon; Pig/Hive/MR for normalization; extract, compensate, migrate; data {quality, correction, analytics}; Hive/Spark SQL; HANA (load balance, failover, replication, standby); reporting: Cognos, Tableau, ?
  21. 21. Can Spark Streaming survive Chaos Monkey?
  22. 22. Lambda Architecture: a data-processing pattern designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods.
  23. 23. Demo
  24. 24. QnA
  25. 25. We are hiring
  26. 26. Thank you!
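
Slide 10 lists micro-batching with a high watermark as one way to capture changes from a source database. A minimal sketch of that idea in plain Scala follows; the `Row` shape, the `modifiedAt` column, and the `nextBatch` helper are all hypothetical names chosen for illustration, not taken from the talk. Each poll pulls only the rows modified after the last watermark and then advances the watermark.

```scala
// Hypothetical row shape: modifiedAt is a monotonically increasing change marker
// (for example a timestamp or row version maintained by the source database).
final case class Row(id: Int, payload: String, modifiedAt: Long)

object WatermarkBatchSketch {
  // Returns the rows changed since `watermark`, oldest first,
  // together with the watermark to use for the next poll.
  def nextBatch(table: Seq[Row], watermark: Long): (Seq[Row], Long) = {
    val changed = table.filter(_.modifiedAt > watermark).sortBy(_.modifiedAt)
    val newWatermark = if (changed.isEmpty) watermark else changed.last.modifiedAt
    (changed, newWatermark)
  }

  def main(args: Array[String]): Unit = {
    val table = Seq(Row(1, "a", 10L), Row(2, "b", 25L), Row(3, "c", 40L))
    val (batch, next) = nextBatch(table, 20L)
    println(s"batch=$batch nextWatermark=$next")
  }
}
```

In a real pipeline the filter would be a query against the source table, and the watermark would need to be stored durably so that a restart resumes from the last committed position.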
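
Slide 11 names log compaction among Kafka's features, and slide 6 reports 2B row changes compacting down to 300M rows. The sketch below models the semantics only, in plain Scala with no Kafka dependency: for each key, keep the most recent record. The `Record` type and `compact` function are illustrative assumptions, and unlike a real broker this model drops delete markers (tombstones) immediately instead of retaining them for a configurable period.

```scala
// Simplified model of a keyed log record; a None value models a delete marker (tombstone).
final case class Record(key: String, value: Option[String], offset: Long)

object LogCompactionSketch {
  // Keep only the highest-offset record per key, then drop keys
  // whose latest record is a tombstone. Result is ordered by offset.
  def compact(log: Seq[Record]): Seq[Record] =
    log.groupBy(_.key)
      .values
      .map(_.maxBy(_.offset))
      .filter(_.value.isDefined)
      .toSeq
      .sortBy(_.offset)

  def main(args: Array[String]): Unit = {
    val log = Seq(
      Record("expense-1", Some("draft"), 0L),
      Record("expense-2", Some("draft"), 1L),
      Record("expense-1", Some("submitted"), 2L),
      Record("expense-2", None, 3L), // expense-2 deleted
      Record("expense-1", Some("approved"), 4L)
    )
    compact(log).foreach(println)
  }
}
```

A consumer that reads a compacted topic from the beginning therefore sees the latest state of every surviving key, which is what makes a compacted topic usable as a change feed for reporting tables.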
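
Slide 15 describes a DStream as a discretized stream of RDDs. The following plain-Scala sketch illustrates only the discretization step, assuming events carry a non-negative timestamp in milliseconds: events are grouped into fixed batch intervals, and each group stands in for the RDD that Spark Streaming would build for that interval. `discretize` is a hypothetical helper, not a Spark API.

```scala
object DiscretizeSketch {
  // Groups (timestampMillis, event) pairs into consecutive batch intervals.
  // Returns (batchStartMillis, events in that batch) ordered by batch start.
  // Assumes non-negative timestamps and batchMillis > 0.
  def discretize[A](events: Seq[(Long, A)], batchMillis: Long): Seq[(Long, Seq[A])] =
    events
      .groupBy { case (ts, _) => ts / batchMillis * batchMillis }
      .toSeq
      .sortBy(_._1)
      .map { case (start, evs) => start -> evs.sortBy(_._1).map(_._2) }

  def main(args: Array[String]): Unit = {
    val events = Seq(0L -> "a", 999L -> "b", 1000L -> "c", 2500L -> "d")
    discretize(events, 1000L).foreach(println)
  }
}
```

Each batch can then be processed with ordinary batch-style code, which is the property slide 14 points to when it says the same code can be reused for batch processing.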
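
Slide 22 introduces the Lambda architecture as a combination of batch and stream processing. A minimal sketch in plain Scala, under the assumption that the report is a running total per customer: the batch layer recomputes totals from the full history, the speed layer folds in events that arrived since the last batch run, and a query merges the two views. All names here (`batchView`, `updateSpeedView`, `query`) are hypothetical.

```scala
object LambdaSketch {
  type Event = (String, Long) // (customer, amount in cents)

  // Batch layer: recompute totals from the complete historical dataset.
  def batchView(history: Seq[Event]): Map[String, Long] =
    history.groupBy(_._1).map { case (k, evs) => k -> evs.map(_._2).sum }

  // Speed layer: incrementally fold one recent event into the real-time view.
  def updateSpeedView(view: Map[String, Long], event: Event): Map[String, Long] =
    view.updated(event._1, view.getOrElse(event._1, 0L) + event._2)

  // Serving layer: answer a query by merging the batch and real-time views.
  def query(batch: Map[String, Long], speed: Map[String, Long], key: String): Long =
    batch.getOrElse(key, 0L) + speed.getOrElse(key, 0L)

  def main(args: Array[String]): Unit = {
    val batch = batchView(Seq("acme" -> 1000L, "acme" -> 250L, "globex" -> 400L))
    val speed = Seq("acme" -> 50L, "initech" -> 75L)
      .foldLeft(Map.empty[String, Long])(updateSpeedView)
    println(query(batch, speed, "acme"))
  }
}
```

When the next batch run absorbs the recent events, the speed view is reset, so the merged answer stays correct while remaining current between runs.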