Architecture Overview

Bquery Reporting & Analytics Architecture

Page 2: Bquery Reporting & Analytics Architecture

Architecture Overview

[Diagram: Web Client (A), Web Server / API (1), Routing & Queuing (2), Metadata (3), Dynamic Query Engine (4), Processing & Analytics (5), File Backend (B)]

• The architecture consists of 5 basic components, an HTML5 client and a file backend

• Each instance of a component auto-registers in the metadata master

• Every component defined here:
  • Is horizontally scalable
  • Has load balancing
  • Has failover capabilities

• All external communication goes through the fully RESTful API, where each request is checked against a role-based security system

• In addition to the RESTful interface, the system can also deliver and retrieve results and data through indirect channels (mail, SFTP)
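
The auto-registration and failover behaviour described above could look roughly like the sketch below. It is a minimal in-memory illustration, not the actual implementation: the `Registry` class, its method names and the heartbeat timeout are all assumptions made for this example.

```python
import time


class Registry:
    """Hypothetical stand-in for the instance registry of the metadata master."""

    def __init__(self, timeout=30.0):
        self.timeout = timeout   # seconds without a heartbeat before an instance counts as down
        self.instances = {}      # (component, instance_id) -> last heartbeat timestamp

    def register(self, component, instance_id, now=None):
        """Called by an instance at start-up and again on every heartbeat."""
        self.instances[(component, instance_id)] = time.time() if now is None else now

    def alive(self, component, now=None):
        """Return the ids of all instances of a component with a recent heartbeat."""
        now = time.time() if now is None else now
        return sorted(
            iid for (comp, iid), seen in self.instances.items()
            if comp == component and now - seen <= self.timeout
        )


registry = Registry(timeout=30.0)
registry.register("dqe", "dqe-1", now=0.0)
registry.register("dqe", "dqe-2", now=0.0)
registry.register("dqe", "dqe-2", now=40.0)   # only dqe-2 keeps sending heartbeats
live = registry.alive("dqe", now=50.0)         # dqe-1 has timed out, so work goes to dqe-2
```

Because instances announce themselves, adding capacity means starting another instance; nothing has to be configured centrally in this sketch.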

Page 3: Bquery Reporting & Analytics Architecture

1) Web Server

• The web server receives all requests, checks them against the security model and metadata, and then places the resulting actions on the queuing system

• The setup of the security model, the metadata (including the data descriptions it holds) and the entire API (calls and actions) is proprietary code

• Dependencies:
  • Nginx, for the scalable HTTP server
  • uWSGI, for running Python code behind Nginx
  • Flask, a web framework for handling sockets and sessions
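
The request flow described on this slide (check the request against the security model, then queue the action) can be sketched as follows. The actual security model and API are proprietary, so the role names, the permission table and the `handle_request` function are hypothetical, and a plain `deque` stands in for the real queuing system.

```python
from collections import deque

# Hypothetical role -> permitted actions mapping; the real security model is proprietary.
ROLE_PERMISSIONS = {
    "analyst": {"run_report"},
    "admin": {"run_report", "load_file", "manage_users"},
}

action_queue = deque()  # stand-in for the real queuing system


def handle_request(user_role, action, payload):
    """Check the request against the role model, then queue the action."""
    allowed = ROLE_PERMISSIONS.get(user_role, set())
    if action not in allowed:
        return 403, {"error": "forbidden"}
    action_queue.append({"action": action, "payload": payload})
    return 202, {"status": "queued"}


ok = handle_request("analyst", "run_report", {"report_id": 1})
denied = handle_request("analyst", "load_file", {"path": "sales.csv"})
```

Returning 202 rather than 200 reflects that the work is only queued at this point; that status choice is also an assumption of the sketch.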

Page 4: Bquery Reporting & Analytics Architecture

2) Routing & Queuing

• The queue server receives all action requests from the API, finds where it can execute them and load-balances the requests across those resources

• We built the queues and the auto-registration setup to provide the generic framework functionality and to ensure load balancing and failover capabilities

• Dependencies:
  • Celery, the Python task queue library
  • RabbitMQ, the message broker
  • Redis, for exchanging results between the processes
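
To illustrate the idea of load balancing with failover, here is a small self-contained sketch. It does not use Celery or RabbitMQ and is not how the product routes work; the `Worker` class and `dispatch` function are hypothetical and only show the principle of preferring the least-loaded node and skipping unreachable ones.

```python
class Worker:
    """Hypothetical worker node; `healthy` simulates whether the node is reachable."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.active = 0   # number of tasks assigned so far

    def run(self, task):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unreachable")
        self.active += 1
        return f"{task} handled by {self.name}"


def dispatch(task, workers):
    """Try workers from least to most loaded and skip nodes that fail."""
    for worker in sorted(workers, key=lambda w: w.active):
        try:
            return worker.run(task)
        except ConnectionError:
            continue  # fail over to the next candidate
    raise RuntimeError("no worker available")


workers = [Worker("w1", healthy=False), Worker("w2"), Worker("w3")]
results = [dispatch(f"task-{i}", workers) for i in range(4)]
```

In this run the unreachable `w1` is skipped every time and the four tasks alternate between `w2` and `w3`.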

Page 5: Bquery Reporting & Analytics Architecture

3) Metadata

• The metadata server contains all general data on users, databases and security, as well as the metadata on the data available to users (measures, dimensions, tables and how these relate to each other)

• Dependencies:
  • MongoDB, for containing the metadata
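
As an illustration of what such metadata might contain, the sketch below models tables, dimensions, measures and role access as a plain nested dictionary, similar in shape to a document a document store could hold. The schema, the table names and both helper functions are invented for this example; the real metadata layout is not described in the slides.

```python
# Hypothetical metadata document; the actual schema is not described in the source.
metadata = {
    "tables": {
        "sales": {"dimensions": ["product_id", "store_id", "date"], "measures": ["revenue", "units"]},
        "products": {"dimensions": ["product_id", "category"], "measures": []},
        "stores": {"dimensions": ["store_id", "region"], "measures": []},
    },
    "roles": {"analyst": ["sales", "products"]},
}


def visible_measures(role):
    """List the measures a role may query, as 'table.measure' strings."""
    return [
        f"{table}.{measure}"
        for table in metadata["roles"].get(role, [])
        for measure in metadata["tables"][table]["measures"]
    ]


def shared_dimensions(left, right):
    """Dimensions two tables have in common, i.e. how they relate to each other."""
    left_dims = metadata["tables"][left]["dimensions"]
    right_dims = set(metadata["tables"][right]["dimensions"])
    return [d for d in left_dims if d in right_dims]
```

For example, `shared_dimensions("sales", "stores")` yields `["store_id"]`, which is the kind of relationship information a query engine needs before it can join two tables.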

Page 6: Bquery Reporting & Analytics Architecture

4) Dynamic Query Engine

• The dynamic query engine server contains a number of data files (which it automatically downloads and synchronizes from the backend) and can analyze and aggregate them

• It can also auto-join tables on the columns they have in common, perform a wide range of calculations and run several distributed analytics operations at row level

• Dependencies:
  • Bcolz, for containing the data files in a compressed, columnar format
  • Pandas, for higher-level operations on the result data set (joins, sorts, etc.)
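
The two operations named above, joining tables on a shared column and aggregating a measure, can be sketched in plain Python over a columnar layout. This is only an illustration of the concept: it uses dictionaries of lists instead of Bcolz or Pandas, and the table contents and the `auto_join` and `group_sum` helpers are made up for the example.

```python
from collections import defaultdict

# Columnar layout: each table is a dict of column name -> list of values.
sales = {"product_id": [1, 2, 1, 3], "revenue": [10.0, 5.0, 7.5, 2.0]}
products = {"product_id": [1, 2, 3], "category": ["food", "drink", "food"]}


def auto_join(left, right):
    """Join two tables on the single column name they have in common."""
    common = [c for c in left if c in right]
    if len(common) != 1:
        raise ValueError("expected exactly one shared column")
    key = common[0]
    # Build a lookup from key value to row position in the right-hand table.
    lookup = {value: i for i, value in enumerate(right[key])}
    joined = {name: list(values) for name, values in left.items()}
    for name, values in right.items():
        if name != key:
            joined[name] = [values[lookup[k]] for k in left[key]]
    return joined


def group_sum(table, by, measure):
    """Aggregate one measure column by one dimension column."""
    totals = defaultdict(float)
    for group, value in zip(table[by], table[measure]):
        totals[group] += value
    return dict(totals)


totals = group_sum(auto_join(sales, products), by="category", measure="revenue")
# totals == {"food": 19.5, "drink": 5.0}
```

Keeping each column in its own sequence means an aggregation only touches the columns it needs, which is the main reason columnar formats suit this kind of workload.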

Page 7: Bquery Reporting & Analytics Architecture

5) Processing & Analytics

• The processing & analytics server handles (asynchronous) calls for file loading, exporting and analytics

• This includes the creation and execution of machine learning and statistical models

• It also converts raw data files into the binary files and updates the relevant metadata

• Dependencies:
  • Scikit-learn, for machine learning
  • Statsmodels, for statistical models
  • Pandas, for data manipulation
  • Bcolz, for converting the data files into a compressed, columnar format
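
The conversion step (raw file in, compressed columnar file out) can be illustrated with the standard library alone. Note the substitutions: `zlib` stands in for a Blosc-style codec and `array` for the Bcolz containers, so this shows the principle rather than the product's actual file format. The sample data and function names are invented for the example.

```python
import csv
import io
import zlib
from array import array

# Synthetic raw input; a real loader would read an uploaded file instead.
raw_csv = "store_id,units\n" + "".join(f"{i % 5},{i % 7}\n" for i in range(1000))


def to_columns(text):
    """Parse CSV text and pivot it from rows into integer columns."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return {name: array("q", (int(r[name]) for r in rows)) for name in rows[0]}


def compress_columns(columns):
    """Compress each column separately; zlib stands in for a Blosc-style codec."""
    return {name: zlib.compress(col.tobytes()) for name, col in columns.items()}


def decompress_column(blob):
    """Restore one compressed column to an integer array."""
    col = array("q")
    col.frombytes(zlib.decompress(blob))
    return col


columns = to_columns(raw_csv)
stored = compress_columns(columns)
restored = decompress_column(stored["units"])
```

Compressing per column lets a reader decompress only the columns a query touches, and values of one type and range tend to compress better together than mixed rows do.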

Page 8: Bquery Reporting & Analytics Architecture

A) Web Client

• The web client is a full, web-based HTML5 client that gives access to all:
  • Reporting
  • Analytics
  • File import
  • User and security management
  • Server management

• The files are served by the web server as static content, and all calls go through the standard API

• Dependencies:
  • jQuery, for cross-browser JavaScript simplification and UI
  • Bootstrap, for layout
  • D3.js, a library for visualizations

Page 9: Bquery Reporting & Analytics Architecture

B) File Backend

• The file backend contains all raw files and the processed (compressed, columnar) files

• DQE instances automatically retrieve their assigned files from the backend when a file has been updated.

• Dependencies:
  • AWS S3, for saving files
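
One way the "retrieve assigned files when they have been updated" behaviour could work is by comparing a version tag per file, as sketched below. This is an assumption for illustration only: the slides do not say how updates are detected, the `DQEInstance` class is hypothetical, and a plain dictionary stands in for the storage backend.

```python
# Hypothetical backend listing: file name -> version tag (for example an object ETag).
backend = {"sales.bcolz": "v2", "products.bcolz": "v1", "stores.bcolz": "v1"}


class DQEInstance:
    """Hypothetical query engine node that mirrors its assigned files locally."""

    def __init__(self, assigned):
        self.assigned = set(assigned)   # files this instance is responsible for
        self.local = {}                 # file name -> version currently on disk

    def sync(self, backend_listing):
        """Fetch assigned files whose backend version differs from the local copy."""
        fetched = []
        for name in sorted(self.assigned):
            remote_version = backend_listing.get(name)
            if remote_version is not None and self.local.get(name) != remote_version:
                self.local[name] = remote_version   # a real node would download the file here
                fetched.append(name)
        return fetched


node = DQEInstance(assigned=["sales.bcolz", "products.bcolz"])
first = node.sync(backend)       # initial download of both assigned files
backend["sales.bcolz"] = "v3"    # the file is updated in the backend
second = node.sync(backend)      # only the changed file is fetched again
```

The unassigned `stores.bcolz` is never fetched, which mirrors the idea that each instance only holds the data it has been assigned.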

Page 10: Bquery Reporting & Analytics Architecture

Architecture Comparison

| Area | Hadoop | Cassandra | Best In Class | visualfabriq | Difference |
|---|---|---|---|---|---|
| Data | Non-structured & structured | Structured, wide-column | Teradata (structured, columnar) | Structured, columnar, compressed | Optimized for numerical data (means: no text analytics etc.) |
| Architecture | Rack-aware, daemon-based cluster | Peer-to-peer cluster | | Horizontally scaling, container-based microservices communicating through RabbitMQ queues | Easier to monitor & scale |
| Setup | Complex | Complex | | Up & running in one minute | Much, much easier to set up and roll out |
| Cluster Maintenance | Node creation and assignment usually through commercial cluster mgmt software | Peer-to-peer network; auto-configures | | Self-registering nodes that can be assigned specific tasks and data in a web interface | |
| ETL | Flume, Sqoop | Bulk Loader | Informatica, Talend | Web based, drag & drop with wizards | Web based, easy to use |
| Language (query) | Map/Reduce; add-ons for SQL (Pig, Hive, Impala, etc.) | CQL | SQL | MOLAP-like; SQL interface to be built | SQL is the standard, but because of the built-in reporting and analytics this is not something users will need |
| Compression | No | No | MongoDB/WiredTiger | Blosc-based | Saves on average 20x in disk space while speeding up reads |
| Performance | Slow, batch based; Spark can add in-memory capability (speeds up 100x) | High, in-memory options | | High, disk-based with compression, delivering within a 2-3x range of in-memory | Out-of-the-box near in-memory performance with file-based scaling; with advances in CPU speed, this might even surpass traditional in-memory performance |
| Interface | RESTful API | RESTful API | RESTful API | RESTful API | |
| Reporting | Only in external tools (that connect to the SQL connector) | Only in external tools (that connect to 3rd party connectors) | Tableau (HTML5, interactive, beautiful) | Built-in HTML5, interactive, extensible (D3.js based) | Only solution with out-of-the-box reporting with an easy-to-use, modern web-based interface |
| Analytics | Distributed map/reduce analytics through Mahout | Only as optional, paid-for module | SAS, SPSS | Built-in HTML5, interactive environment that incorporates leading open-source machine learning (scikit-learn), statistics (statsmodels) and proprietary (POS analytics) functionality; NB: the analytics load is not fully distributed yet | Only solution with out-of-the-box analytics with an easy-to-use, modern web-based interface |
| Security | Kerberos-based security | Data object security | | General, role-based security | One point to manage all security from data access to functionality (reporting, accessibility, etc.) |
| Open source | Core is open source; several performance acceleration & mgmt tools are paid | Core is open source; analytics, backup and other options are paid | | Core is open source; large cluster mgmt tools and vertical-specific analytics options are paid | |
| Language (implementation) | Java | Java | | Python (and Cython & C) | |