lev-brailovskiy
Legacy System
Architecture diagram. Components: DWH Cluster, SSIS Manager, External Data Providers, Event Collector, Caching, Reporting, DWH, Application Servers.
Legacy Scale
500 TB Storage
40K Events Processed per Second
3.5B Events Processed Daily
Daily Processing
20 GB Data Daily
First Stage
Architecture diagram. Components: Data warehouse, Data Collection, Data Distribution, DWH API, External Data Providers, Event Collector, Analytics, Reporting, Monitoring, Legacy DWH. Transfer protocols: sFTP, FTP.
First Stage Summary
Full Redundancy
Comparison Legacy vs. Batch
Linear Scale
Partial Test Coverage
Raw Level Data Access
CD
Second Stage
Architecture diagram. Components: Data warehouse, Data Collection, Data Distribution, DWH API, External Data Providers, Event Collector, Scheduling, Reporting, Monitoring, Real Time DWH, Analytics. Transfer endpoints: S3, Azure, sFTP, FTP.
Second Stage Summary
Near Real Time Processing
Comparison Batch vs. Real Time
Full Monitoring
Full Test Coverage
“Product” Event/Report Definition
DevOps Automation
Batch Event Processing
Hadoop Cluster
Hadoop Monitoring
Aggregated data exporter
Processed data aggregator
Error Processing
Data Archivator
Data Collection Cluster
Raw data processing (Map-Reduce)
Raw data files pushed to Hadoop (WEB HDFS)
Vertica
External/Internal DWH Clusters
Data flow direction
Monitoring data
Raw data processing
1. Cleaning/Transformation/Enrichment/Validation of data from main data sources with Map-Reduce
2. Month history
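The cleaning, validation, transformation and enrichment steps can be pictured as a single map function. The following is a minimal sketch in plain Python, not the deck's actual Map-Reduce job: the three-field line layout, the `CAMPAIGN_NAMES` lookup and all function names are assumptions made for illustration.

```python
from typing import Dict, Iterator, Optional, Tuple

# Hypothetical dimension lookup used for enrichment (not from the slides).
CAMPAIGN_NAMES: Dict[str, str] = {"c1": "Spring Promo", "c2": "Autumn Promo"}


def clean(line: str) -> Optional[list]:
    """Cleaning: strip whitespace and drop blank or malformed lines."""
    fields = [f.strip() for f in line.strip().split(",")]
    return fields if len(fields) == 3 and all(fields) else None


def map_event(line: str) -> Iterator[Tuple[str, dict]]:
    """Map step: emit (event_type, record) for every valid raw line."""
    fields = clean(line)
    if fields is None:
        return
    timestamp, event_type, campaign_id = fields
    # Validation: the timestamp must be an integer epoch value.
    if not timestamp.isdigit():
        return
    # Transformation: normalise the event type.
    event_type = event_type.lower()
    # Enrichment: attach the campaign name from the dimension lookup.
    record = {
        "ts": int(timestamp),
        "campaign_id": campaign_id,
        "campaign_name": CAMPAIGN_NAMES.get(campaign_id, "unknown"),
    }
    yield event_type, record


def run_map(lines):
    """Apply the mapper to every line, as a local stand-in for the cluster."""
    return [pair for line in lines for pair in map_event(line)]
```

Keeping each step as a small pure function makes the mapper easy to unit test outside the cluster.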
Aggregator Process
1. DSL for defining new kinds of aggregation
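The slides do not show the DSL itself. As a rough idea of what a declarative aggregation definition could look like, here is a small sketch in which an aggregation is a plain dictionary interpreted by a generic engine. The spec keys (`group_by`, `metrics`), the function table and the example fields are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Supported aggregate functions; the names are illustrative assumptions.
AGGREGATES: Dict[str, Callable[[List[float]], float]] = {
    "sum": sum,
    "count": len,
    "max": max,
}


def aggregate(spec: dict, rows: Iterable[dict]) -> Dict[Tuple, dict]:
    """Group rows by spec['group_by'] and apply each declared metric."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in spec["group_by"])
        groups[key].append(row)
    result = {}
    for key, members in groups.items():
        result[key] = {
            name: AGGREGATES[func]([m[field] for m in members])
            for name, (func, field) in spec["metrics"].items()
        }
    return result


# A new aggregation is defined as data, without touching the engine code.
views_per_campaign = {
    "group_by": ["campaign_id"],
    "metrics": {"views": ("count", "duration"), "watch_time": ("sum", "duration")},
}
```

Calling `aggregate(views_per_campaign, rows)` returns one entry per campaign. Adding another kind of aggregation then means writing a new spec rather than new processing code.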
Data exporter
1. Export aggregated data
2. Export processed data
Processed/Aggregated data
Logging Framework (Elasticsearch)
Logs will be exposed through Kibana to monitor data flow
Monitoring
Monitoring of data flow inside and outside of Event Processing Cluster
Hadoop monitoring data
Error Processing
1. Automatic error re-processing with time window
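The slide does not say how the time window is applied. One plausible reading, sketched below, is that a failed batch is retried automatically as long as its first failure lies inside the window and is set aside once it falls outside. `FailedBatch`, `reprocess_failures` and the one-hour default are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FailedBatch:
    batch_id: str
    failed_at: float  # epoch seconds when the batch first failed


def reprocess_failures(
    failures: List[FailedBatch],
    process: Callable[[str], bool],
    now: float,
    window_seconds: float = 3600.0,
) -> dict:
    """Retry batches that failed inside the window; expire older ones."""
    outcome = {"recovered": [], "pending": [], "expired": []}
    for batch in failures:
        if now - batch.failed_at > window_seconds:
            # Outside the window: stop retrying and leave it for manual review.
            outcome["expired"].append(batch.batch_id)
        elif process(batch.batch_id):
            outcome["recovered"].append(batch.batch_id)
        else:
            # Still failing but inside the window: retry on the next pass.
            outcome["pending"].append(batch.batch_id)
    return outcome
```

Passing `now` explicitly instead of reading the clock inside the function keeps the retry logic deterministic under test.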
S3
Data Collection
Data Collection Cluster
Servers
Servers
Servers
Video Tracking, Ad Tracking, User Tracking
3rd Party Ad Tracking
SQL Server
CSV data received every hour via FTP. Raw Events and Dimensions.
Text files received every five minutes from Public and Private Cloud. Raw Events.
Logging Framework (Elasticsearch)
Hadoop Processing Cluster
Data about received files/events reported with logging framework
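Reporting each received file as a structured log record is what makes it searchable later in Elasticsearch and Kibana. The sketch below uses only Python's standard `logging` and `json` modules to emit one JSON object per file; shipping those lines to Elasticsearch is left out, and the field names (`source`, `path`, `event_count`) and the `file_info` attribute are hypothetical.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge the per-file fields passed through the `extra` argument.
        payload.update(getattr(record, "file_info", {}))
        return json.dumps(payload, sort_keys=True)


def log_received_file(logger, source, path, event_count):
    """Report one received raw data file with its source and event count."""
    logger.info(
        "file received",
        extra={"file_info": {"source": source, "path": path, "event_count": event_count}},
    )
```

A handler configured with `JsonFormatter` then writes lines that a log shipper can index without further parsing.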
Raw data files pushed to Hadoop (WEB HDFS)
Dimension tables
Servers to acquire
Stage 1: .NET application will pull from FTP and from the SQL DWH server for loggers, and use SQL Replication for dimension data
Stage 2: Consider moving to a more appropriate technology, such as Akka
Data flow direction
Logs will be exposed through Kibana to monitor data flow
Monitoring data
Monitoring
Monitoring of data flow inside and outside of Data Collection Cluster
MongoDB
Data Distribution
Data Distribution Cluster
Hive
Vertica
MongoDB
Report Distributor
Logging Framework (Elasticsearch)
Reporting Platform
Data flow direction
Logs will be exposed through Kibana to monitor data flow
Monitoring data
Monitoring
Monitoring of data flow inside and outside of Data Distribution Cluster
Report S3 Storage
Reporting Platform
Vertica
Hive
SQL Server
1. Distributed
2. Encapsulate Repository
3. Versioning
4. Smart query execution
5. Testable
MongoDB
Reporting Platform
Report Designer
Report Provider
Report Distributor
Reporting API
Statistics Provider
S3 Report Storage
Data sources of the Reporting platform are in Private and Public Cloud
Application Servers
Monitoring
Monitoring Cluster
Cloudera Manager
Elasticsearch Cluster
Vertica Management
Kibana
Zabbix
Applications
Vertica
Hadoop
MongoDB
Migration Outcome
15% Cost Reduction
Linear Scale
90% Unit Test Coverage
x280 Processing Time
x50 Development ROI
Current Scale
86B Events Processed Daily
120 TB Data Daily
1M Events Processed per Second
Near Real Time Processing, Minimum Interval: 5 min
15+ Event Sources
4.5 PB Hadoop
70 TB Vertica
Scale Growth
x15 Events Processed Daily
x6000 Daily Processed Data
x25 Events Processed per Second
x280 Processing Time
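The growth multipliers can be cross-checked against the totals given on the Legacy Scale and Current Scale slides. The sketch below simply divides the current figure by the legacy one for each metric. It assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes), and the dictionary keys are names chosen here for illustration.

```python
# Figures as stated on the Legacy Scale and Current Scale slides.
LEGACY = {"events_per_day": 3.5e9, "bytes_per_day": 20e9, "events_per_sec": 40e3}
CURRENT = {"events_per_day": 86e9, "bytes_per_day": 120e12, "events_per_sec": 1e6}


def growth_factors(before: dict, after: dict) -> dict:
    """Return after/before for every metric present in both snapshots."""
    return {key: after[key] / before[key] for key in before if key in after}


factors = growth_factors(LEGACY, CURRENT)
for metric, factor in factors.items():
    print(f"{metric}: x{factor:,.1f}")
```

Computed this way, daily data volume grows by a factor of 6000 and events per second by a factor of 25, matching the slide. The daily event count comes out at roughly 24.6 rather than the x15 shown, so that multiplier may have been measured against a different baseline than the 3.5B figure.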
Third Stage
Architecture diagram. Components: Data warehouse, Data Collection, Data Distribution, DWH API, External Data Providers, Event Collector, Scheduling, Reporting, Monitoring, Real Time DWH, Analytics. Transfer endpoints: S3, Azure, sFTP, FTP.