Architecture Overview
[Diagram: 1) Web Server (API), 2) Routing & Queuing, 3) Metadata, 4) Dynamic Query Engine, 5) Processing & Analytics, A) Web Client, B) File Backend]
• The architecture consists of 5 basic components, an HTML5 client and a file backend
• Each instance of a component auto-registers in the metadata master
• Every component defined here:
  • Is horizontally scalable
  • Has load balancing
  • Has failover capabilities
• All external communication goes through the fully RESTful API, where each request is checked against a role-based security system
• In addition to the RESTful interface, the platform can also deliver and retrieve results and data through indirect channels (mail, SFTP)
1) Web Server
• The web server receives all requests, checks them against the security model and metadata, and then places the resulting actions on the queuing system
• The security model, the metadata setup (including the data descriptions it holds) and the entire API (calls and actions) are proprietary code
• Dependencies:
  • Nginx, for the scalable HTTP server
  • uWSGI, for running Python code behind Nginx
  • Flask, a web framework for handling sockets and sessions
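The request flow described on this slide can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the proprietary security model or API: the role names, permission strings, `handle_request` function and the in-memory `action_queue` are all assumptions made for the example.

```python
from collections import deque

# Hypothetical role-to-permission mapping; the real security model is proprietary.
ROLE_PERMISSIONS = {
    "analyst": {"report:read"},
    "admin": {"report:read", "user:manage"},
}

# In-memory stand-in for the queuing system described on the next slide.
action_queue = deque()


def handle_request(user_roles, action, payload):
    """Check a request against the role model, then queue the action."""
    allowed = set()
    for role in user_roles:
        allowed |= ROLE_PERMISSIONS.get(role, set())
    if action not in allowed:
        # Rejected before anything reaches the queue.
        return 403, {"error": "forbidden"}
    action_queue.append({"action": action, "payload": payload})
    return 202, {"status": "queued"}
```

In this sketch a rejected request never reaches the queue, which mirrors the idea that every call is checked before any work is scheduled.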
2) Routing & Queuing
• The queue server receives all action requests from the API, finds where it can execute them and load balances requests over these resources
• We built the queues and the auto-registration setup ourselves to provide the generic framework functionality and to ensure load balancing and failover
• Dependencies:
  • Celery, the Python task queue library
  • RabbitMQ, the message broker
  • Redis, for exchanging results between the processes
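A rough sketch of how self-registering workers, load balancing and failover can fit together is shown below. It is a standard-library illustration only and does not use Celery or RabbitMQ; the `WorkerRegistry` class, the worker names and the round-robin policy are assumptions made for the example, not the deck's actual implementation.

```python
import itertools


class WorkerRegistry:
    """Hypothetical registry: workers register themselves, requests rotate over them."""

    def __init__(self):
        self._workers = {}     # name -> callable that executes an action
        self._healthy = set()  # names of workers believed to be reachable
        self._cycle = None

    def register(self, name, handler):
        """Called by a worker instance when it starts up."""
        self._workers[name] = handler
        self._healthy.add(name)
        self._cycle = itertools.cycle(sorted(self._workers))

    def dispatch(self, action):
        """Round-robin over workers, skipping and marking those that fail."""
        for _ in range(len(self._workers)):
            name = next(self._cycle)
            if name not in self._healthy:
                continue
            try:
                return name, self._workers[name](action)
            except ConnectionError:
                # Fail over: mark the worker unhealthy and try the next one.
                self._healthy.discard(name)
        raise RuntimeError("no healthy worker available")
```

A worker that raises `ConnectionError` is taken out of rotation, so later requests go straight to the remaining instances.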
3) Metadata
• The metadata server contains all general data on users, databases and security, as well as the metadata on the data available to users (measures, dimensions, tables and how these all relate to each other)
• Dependencies:
  • MongoDB, for storing the metadata
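To make the kind of metadata described here more concrete, the sketch below models it as plain Python dictionaries, shaped like documents in a document store. The field names, the example tables and the idea that tables relate through shared dimensions are assumptions for illustration; the actual schema is not shown in the deck.

```python
# Hypothetical metadata documents; the real schema is proprietary.
TABLES = [
    {"name": "sales", "measures": ["revenue", "units"],
     "dimensions": ["product_id", "week"]},
    {"name": "products", "measures": [],
     "dimensions": ["product_id", "brand"]},
]
USERS = [
    {"name": "alice", "roles": ["analyst"], "tables": ["sales", "products"]},
]


def available_fields(user_name):
    """Return the measures and dimensions a user may query."""
    user = next(u for u in USERS if u["name"] == user_name)
    measures, dimensions = set(), set()
    for table in TABLES:
        if table["name"] in user["tables"]:
            measures.update(table["measures"])
            dimensions.update(table["dimensions"])
    return sorted(measures), sorted(dimensions)


def join_keys(left, right):
    """Assumed relation rule: two tables relate through shared dimensions."""
    by_name = {t["name"]: t for t in TABLES}
    shared = set(by_name[left]["dimensions"]) & set(by_name[right]["dimensions"])
    return sorted(shared)
```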
4) Dynamic Query Engine
• The dynamic query engine server holds a number of data files (which it automatically downloads and synchronizes from the backend) and can analyze and aggregate them
• It can also auto-join tables on commonalities, perform a wide range of calculations and run several distributed analytics operations at row level
• Dependencies:
  • Bcolz, for storing the data files in a compressed, columnar format
  • Pandas, for higher-level operations on the result data set (joins, sorts, etc.)
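The auto-join and aggregation behaviour can be illustrated with a small pure-Python sketch. It assumes that "commonalities" means columns with the same name in both tables, and it uses lists of dicts instead of Bcolz or Pandas structures; `auto_join` and `aggregate_sum` are hypothetical helper names.

```python
from collections import defaultdict


def auto_join(left_rows, right_rows):
    """Inner-join two lists of dicts on every column name they share."""
    if not left_rows or not right_rows:
        return []
    keys = sorted(set(left_rows[0]) & set(right_rows[0]))
    # Index the right-hand table by the values of the shared columns.
    index = defaultdict(list)
    for row in right_rows:
        index[tuple(row[k] for k in keys)].append(row)
    joined = []
    for row in left_rows:
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append({**match, **row})
    return joined


def aggregate_sum(rows, group_by, measure):
    """Sum one measure per value of one dimension."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_by]] += row[measure]
    return dict(totals)
```

For example, joining a sales table to a product table on their shared `product_id` column makes it possible to sum revenue per brand, even though the sales rows carry no brand column themselves.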
5) Processing & Analytics
• The processing & analytics server handles (asynchronous) calls for file loading, exporting and analytics
• This includes the creation and execution of machine learning and statistical models
• It also converts raw data files into the binary files and updates the relevant metadata
• Dependencies:
  • Scikit-learn, for machine learning
  • Statsmodels, for statistical models
  • Pandas, for data manipulation
  • Bcolz, for converting the data files into a compressed, columnar format
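The conversion of raw files into a compressed, columnar layout can be sketched as below. This is only an illustration of the idea: it uses the standard library's `zlib` and `struct` as stand-ins for Bcolz and its Blosc compression, handles numeric columns only, and the function names are made up for the example.

```python
import csv
import io
import struct
import zlib


def to_columnar(raw_csv, numeric_columns):
    """Turn row-oriented CSV text into one compressed byte blob per column."""
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    columns = {}
    for name in numeric_columns:
        values = [float(row[name]) for row in rows]
        # Pack the whole column as little-endian doubles, then compress it.
        packed = struct.pack(f"<{len(values)}d", *values)
        columns[name] = zlib.compress(packed)
    return columns, len(rows)


def read_column(blob, row_count):
    """Decompress a single column without touching the others."""
    return list(struct.unpack(f"<{row_count}d", zlib.decompress(blob)))
```

Storing each column separately means a query that needs only one measure decompresses only that column, which is the main reason columnar formats suit aggregation workloads.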
A) Web Client
• The web client is a full, web-based HTML5 client that gives access to all:
  • Reporting
  • Analytics
  • File import
  • User and security management
  • Server management
• The files are served by the web server as static content, and all calls go through the standard API
• Dependencies:
  • jQuery, for cross-browser JavaScript simplification and UI
  • Bootstrap, for layout
  • D3.js, a library for visualizations
B) File Backend
• The file backend contains all raw files and the processed (compressed, columnar) files
• DQE instances automatically retrieve their assigned files from the backend when a file has been updated.
• Dependencies:
  • AWS S3, for saving files
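The retrieval of updated files by DQE instances could work along the lines of the sketch below. It assumes each file carries some version token (for instance an object ETag or a timestamp) that can be compared with the local copy; the function names and the `download` callback are hypothetical, and no S3 calls are made.

```python
def files_to_sync(assigned, backend_versions, local_versions):
    """Return the assigned files whose backend version differs from the local copy."""
    stale = []
    for name in assigned:
        remote = backend_versions.get(name)
        if remote is not None and local_versions.get(name) != remote:
            stale.append(name)
    return stale


def sync(assigned, backend_versions, local_versions, download):
    """Download every stale file and record its new version locally."""
    for name in files_to_sync(assigned, backend_versions, local_versions):
        download(name)
        local_versions[name] = backend_versions[name]
```

Only files assigned to the instance are considered, so each DQE node keeps a subset of the backend rather than a full copy.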
Architecture Comparison
Columns: Area, Hadoop, Cassandra, Best In Class, visualfabriq, Difference

Data
• Hadoop: Non-structured & structured
• Cassandra: Structured, wide-column
• Best In Class: Teradata (structured, columnar)
• visualfabriq: Structured, columnar, compressed
• Difference: Optimized for numerical data (meaning: no text analytics etc.)

Architecture
• Hadoop: Rack-aware, daemon-based cluster
• Cassandra: Peer-to-peer cluster
• visualfabriq: Horizontally scaling, container-based microservices communicating through RabbitMQ queues
• Difference: Easier to monitor & scale

Setup
• Hadoop: Complex
• Cassandra: Complex
• visualfabriq: Up & running in one minute
• Difference: Much, much easier to set up and roll out

Cluster Maintenance
• Hadoop: Node creation and assignment usually through commercial cluster management software
• Cassandra: Peer-to-peer network; auto-configures
• visualfabriq: Self-registering nodes that can be assigned specific tasks and data in a web interface

ETL
• Hadoop: Flume, Sqoop
• Cassandra: Bulk Loader
• Best In Class: Informatica, Talend
• visualfabriq: Web based, drag & drop with wizards
• Difference: Web based, easy to use

Language (query)
• Hadoop: Map/Reduce; add-ons for SQL (Pig, Hive, Impala, etc.)
• Cassandra: CQL
• Best In Class: SQL
• visualfabriq: MOLAP-like; SQL interface to be built
• Difference: SQL is the standard, but because of the built-in reporting and analytics this is not something users will need

Compression
• Hadoop: No
• Cassandra: No
• Best In Class: MongoDB/WiredTiger
• visualfabriq: Blosc-based
• Difference: Saves on average 20x in disk space while speeding up reads

Performance
• Hadoop: Slow, batch based; Spark can add in-memory capability (speeds up 100x)
• Cassandra: High, in-memory options
• visualfabriq: High, disk-based with compression delivering in 2-3x range of in-memory
• Difference: Out-of-the-box near in-memory performance with file-based scaling; with advances in CPU speed, this might even surpass traditional in-memory performance

Interface
• Hadoop: RESTful API
• Cassandra: RESTful API
• Best In Class: RESTful API
• visualfabriq: RESTful API

Reporting
• Hadoop: Only in external tools (that connect to the SQL connector)
• Cassandra: Only in external tools (that connect to 3rd party connectors)
• Best In Class: Tableau (HTML5, interactive, beautiful)
• visualfabriq: Built-in HTML5, interactive, extensible (D3.js based)
• Difference: Only solution with out-of-the-box reporting with an easy-to-use, modern web-based interface

Analytics
• Hadoop: Distributed map/reduce analytics through Mahout
• Cassandra: Only as optional, paid-for module
• Best In Class: SAS, SPSS
• visualfabriq: Built-in HTML5, interactive environment that incorporates leading open source machine learning (scikit-learn), statistics (statsmodels) and proprietary (POS-analytics) functionality; NB: the analytics load is not fully distributed yet
• Difference: Only solution with out-of-the-box analytics with an easy-to-use, modern web-based interface

Security
• Hadoop: Kerberos-based security
• Cassandra: Data object security
• visualfabriq: General, role-based security
• Difference: One point to manage all security from data access to functionality (reporting, accessibility, etc.)

Open source
• Hadoop: Core is open source; several performance acceleration & management tools are paid
• Cassandra: Core is open source; analytics, backup and other options are paid
• visualfabriq: Core is open source; large cluster management tools and vertical-specific analytics options are paid

Language (implementation)
• Hadoop: Java
• Cassandra: Java
• visualfabriq: Python (and Cython & C)