1. Breaking ETL barrier with Real-time reporting using Kafka, Spark Streaming
2. About us: Concur (now part of SAP) provides travel and expense management services to businesses.
3. Data Insights: A team building solutions that give customers access to data, visualization, and reporting. Expense, Travel, Invoice.
4. About me: Santosh Sahoo, Principal Architect III, Data Insights
5. Stack so far: OLAP, Report, ETL, OLTP, App
6. Numbers: 7K OLTP database sources; 14K OLAP reporting databases; 28K ETL jobs; 2B row changes; 300M rows (compacted); only ~20 failures a night
7. Traditional ETL challenges: scheduled (high latency); hard to scale; failover and recovery; monolithic; spaghetti (logic + SQL)
8. Moving forward: streaming, real time; scalable; highly available; reduced maintenance overhead; eventual consistency
9. Streaming Data Pipeline: Source -> Flow Management -> Processor -> Storage -> Querying
10. Data Source: event bus for business events; transaction log scraping (Oracle GoldenGate, MySQL binlog, MongoDB oplog, Postgres Bottled Water, SQL Server fn_dblog); change data capture; application messaging/JMS; micro-batching (high watermark, change tracking)
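The high-watermark micro-batching option can be sketched as follows. This is a minimal illustration, not the team's implementation: the `expense` table, its `version` column, and the `fetch_changes` helper are all hypothetical, and an in-memory SQLite database stands in for an OLTP source.

```python
import sqlite3

def fetch_changes(conn, watermark, batch_size=1000):
    """Return rows changed after `watermark`, plus the advanced watermark.

    Assumes (hypothetically) that every write bumps a monotonically
    increasing `version` column on the row.
    """
    rows = conn.execute(
        "SELECT id, amount, version FROM expense "
        "WHERE version > ? ORDER BY version LIMIT ?",
        (watermark, batch_size),
    ).fetchall()
    # The highest version seen becomes the starting point of the next poll.
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

# Stand-in OLTP source with two initial rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE expense (id INTEGER PRIMARY KEY, amount REAL, version INTEGER)"
)
conn.executemany(
    "INSERT INTO expense VALUES (?, ?, ?)", [(1, 10.0, 1), (2, 25.5, 2)]
)

# First poll picks up everything; the second picks up only the later update.
batch1, wm1 = fetch_changes(conn, 0)
conn.execute("UPDATE expense SET amount = 12.0, version = 3 WHERE id = 1")
batch2, wm2 = fetch_changes(conn, wm1)
```

Each poll reads only rows past the stored watermark, so repeated polls stay cheap; the trade-off against log scraping is that hard deletes and intermediate states between polls are not observed.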
11. Kafka - Flow Management: no-nonsense logging; 100K/s throughput vs. 20K/s for RabbitMQ; log compaction; durable persistence; partition tolerance; replication; best-in-class integration with Spark
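Log compaction is the feature that lets a stream of row changes shrink to one record per key. The sketch below models the idea in plain Python and does not use Kafka's API: the `compact` function and the sample log are hypothetical, and a `None` value plays the role of a delete marker (tombstone).

```python
def compact(log):
    """Keep only the most recent value for each key, in offset order.

    `log` is a list of (key, value) pairs; list position acts as the offset.
    A value of None marks the key as deleted and is dropped from the result.
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    survivors = sorted(
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    )
    return [(key, value) for _, key, value in survivors]

# Five change events for three row keys.
change_log = [
    ("row-1", "a"),
    ("row-2", "b"),
    ("row-1", "c"),   # supersedes the first row-1 event
    ("row-2", None),  # delete marker for row-2
    ("row-3", "d"),
]
compacted = compact(change_log)
```

Here five events compact to two records, `("row-1", "c")` and `("row-3", "d")`, which is the same effect, at a much smaller scale, as a large volume of row changes reducing to a smaller set of current rows.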
12. Columnar Storage: optimized for analytic query performance; vertical partitioning; column projection; compression; loosely coupled schema. Examples: HBase, AWS Redshift, Parquet, ORC, Postgres (Citus), SAP HANA
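Column projection and compression can be illustrated with a toy column store. This is a sketch in plain Python under assumed data, not how any of the listed engines work internally; the sample rows and the `to_columns` and `run_length_encode` helpers are hypothetical.

```python
from itertools import groupby

# Row-oriented records, as an OLTP table would hold them.
rows = [
    {"id": 1, "country": "US", "amount": 10.0},
    {"id": 2, "country": "US", "amount": 20.0},
    {"id": 3, "country": "US", "amount": 5.0},
    {"id": 4, "country": "DE", "amount": 7.5},
]

def to_columns(records):
    """Pivot row dicts into one list per column (vertical partitioning)."""
    return {name: [r[name] for r in records] for name in records[0]}

def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

columns = to_columns(rows)

# Column projection: an aggregate over `amount` touches only that column.
total_amount = sum(columns["amount"])

# Compression: repeated values in a column encode compactly.
encoded_country = run_length_encode(columns["country"])
```

Because values of one column sit together, an analytic query reads only the columns it needs, and low-cardinality columns such as `country` compress well.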
13. Hadoop/HDFS: Pro - scale. Con - latency.
14. Spark Streaming: What? A data processing framework for building scalable, fault-tolerant streaming applications. Why? It lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.
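The "reuse the same code for batch processing" point can be sketched without Spark itself. The example below is a plain-Python stand-in for the idea, not the Spark Streaming API: `count_by_type`, `run_batch`, `run_streaming`, and the sample records are all hypothetical names chosen for illustration.

```python
from collections import Counter

def count_by_type(records):
    """Shared business logic: count records per expense type."""
    return Counter(r["type"] for r in records)

def run_batch(records):
    """Batch mode: apply the shared logic once to the full data set."""
    return count_by_type(records)

def run_streaming(micro_batches):
    """Streaming mode: apply the same logic per micro-batch and fold the
    result into running state, yielding a snapshot after each batch."""
    state = Counter()
    for batch in micro_batches:
        state.update(count_by_type(batch))
        yield dict(state)

records = [{"type": "travel"}, {"type": "invoice"}, {"type": "travel"}]

batch_result = run_batch(records)
snapshots = list(run_streaming([records[:2], records[2:]]))
```

The last streaming snapshot matches the batch result, which is the property that makes it possible to share one transformation between a nightly job and a live pipeline.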