Massively Scalable Computational Finance with SciDB


DESCRIPTION

Hedge funds, investment managers and prop shops need to keep pace with rapidly growing data volumes from many sources. SciDB, an advanced computational database programmable from R and Python, scales out to petabyte volumes and facilitates rapid integration of diverse data sources. Open source and running on commodity hardware, SciDB is extensible and scales cost-effectively. Attend this webinar to learn how quants and system developers harness SciDB’s massively scalable complex analytics to solve hard problems faster. SciDB’s native array storage is optimized for time-series data, delivering fast windowed aggregates and complex analytics without time-consuming data extraction. Webinar presenters will demonstrate real-world use cases, including the ability to quickly:

1. Generate aggregated order books across multiple exchanges
2. Create adjusted continuous futures contracts
3. Analyze complex financial networks to detect anomalous behavior



Bryan Lewis, Chief Data Scientist

Frank Smietana, Solutions Architect

© Paradigm4 Inc.

GoToWebinar

• Ask questions using the Q&A window

• This webinar is being recorded

• Replays will be available from paradigm4.com


Common issues

• Expensive data ETL

• Lack of horizontal scalability

• Hard to program

• Hard to extend

• Difficulty with data JOINS


What is SciDB?

Massively scalable distributed array database


What is SciDB?

Open source


What is SciDB?

Mike Stonebraker, CTO


Big Science and SciDB

• Lawrence Berkeley

• NASA Goddard: projects using satellite image data

• Institute for Geoinformatics: global land change analysis on remote sensing data (LANDSAT, MODIS, SENTINEL)


Commercial applications

• Pharma, Biotech, Healthcare

• Quantitative Finance

• Image & Sensor Analytics

• E-commerce


Arrays for finance

[Figure: array with Symbol and Time axes]


Fast multidimensional SELECTs


Table model

i   j   data
1   1    0.5
1   2    0.3
1   3    0.1
1   4   -0.5
2   1    0.9
2   2    0.0
2   3   -0.8
2   4   -0.8
3   1    1.1
3   2    1.0
3   3    1.2
3   4    1.5
4   1    0.9
4   2    1.0
4   3    1.2
4   4    1.5


Array model

        j=1    j=2    j=3    j=4
i=1     0.5    0.3    0.1   -0.5
i=2     0.9    0.0   -0.8   -0.8
i=3     1.1    1.0    1.2    1.5
i=4     0.9    1.0    1.2    1.5

Cell (1,1) is the top-left entry.
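
The relationship between the two models can be sketched in a few lines of plain Python. This is only an illustration of the idea, not SciDB code: the helper name `table_to_array` is made up here, and a nested list stands in for a real array store. It shows that the table needs a scan to find a cell, while the array is addressed directly by its coordinates.

```python
# Rows of the table model from the slide: (i, j, data).
table = [
    (1, 1, 0.5), (1, 2, 0.3), (1, 3, 0.1), (1, 4, -0.5),
    (2, 1, 0.9), (2, 2, 0.0), (2, 3, -0.8), (2, 4, -0.8),
    (3, 1, 1.1), (3, 2, 1.0), (3, 3, 1.2), (3, 4, 1.5),
    (4, 1, 0.9), (4, 2, 1.0), (4, 3, 1.2), (4, 4, 1.5),
]


def table_to_array(rows):
    """Place each (i, j, value) row at position [i-1][j-1] of a nested list."""
    n_i = max(i for i, _, _ in rows)
    n_j = max(j for _, j, _ in rows)
    array = [[None] * n_j for _ in range(n_i)]
    for i, j, value in rows:
        array[i - 1][j - 1] = value  # slide coordinates start at (1, 1)
    return array


array = table_to_array(table)

# Table model: finding one cell means scanning rows for matching keys.
from_table = next(v for i, j, v in table if (i, j) == (2, 3))

# Array model: the coordinates address the cell directly.
from_array = array[2 - 1][3 - 1]
```

Both lookups return the same value; the difference is that the array form keeps neighbouring cells next to each other, which is what makes range selections along i and j cheap.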


Our approach

• Less data movement

• Spatial data clustering

• Leverage popular languages

• Extensibility


Use Popular Languages

• Languages: C++, Julia, Java/JVM, JavaScript, Array SQL

• Interfaces: JDBC, Protocol buffers, C/C++ API, HTTP


Shared-nothing architecture

[Figure: cluster of SciDB instances, numbered 0, 1, 2]


Common issues

• Expensive data ETL

• Lack of horizontal scalability

• Hard to program

• Hard to extend

• Difficulty with data JOINS


SciDB

• Minimize ETL

• Massively scalable

• Program from many languages

• Open-source extensibility

• Fast parallel JOIN


Poll


Examples

• Order books

• Network analysis


Order book challenges

• Lots of exchanges

• Regulatory compliance

• Margins are shrinking

• Want more alpha


Create order book

• Load raw data into array

• Dimension along symbol and time coordinate axes

• Create order book entries with custom aggregation function ORDERBOOK

https://github.com/Paradigm4/orderbook-example
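
To make the aggregation step concrete, here is a minimal Python sketch of what an order-book aggregate might compute. It is not the ORDERBOOK function from the linked repository: the event layout `(symbol, time, side, price, size)`, the signed-size convention, and the function name `order_books` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical raw events: (symbol, time, side, price, size).
# A positive size adds liquidity at that price; a negative size removes it.
events = [
    ("ABC", 1, "bid", 10.00, 300),
    ("ABC", 1, "ask", 10.02, 200),
    ("ABC", 2, "bid", 10.01, 100),
    ("ABC", 2, "ask", 10.02, -50),
    ("ABC", 3, "bid", 10.00, -300),
    ("XYZ", 1, "bid", 55.10, 400),
]


def order_books(events, depth=10):
    """Return {(symbol, time): {"bid": [...], "ask": [...]}} holding up to
    `depth` (price, size) levels per side, best price first."""
    levels = defaultdict(lambda: {"bid": defaultdict(int), "ask": defaultdict(int)})
    books = {}
    # Process events in (symbol, time) order; the sort is stable, so events
    # sharing a symbol and time keep their input order.
    for symbol, time, side, price, size in sorted(events, key=lambda e: (e[0], e[1])):
        levels[symbol][side][price] += size
        snapshot = {}
        for s, highest_first in (("bid", True), ("ask", False)):
            live = [(p, q) for p, q in levels[symbol][s].items() if q > 0]
            # Best bid is the highest price; best ask is the lowest price.
            snapshot[s] = sorted(live, reverse=highest_first)[:depth]
        # Later events at the same (symbol, time) overwrite the snapshot.
        books[(symbol, time)] = snapshot
    return books


books = order_books(events)
```

Each `(symbol, time)` key corresponds to one cell along the symbol and time axes described above, and the value stored in that cell is the book snapshot after all events up to that time.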


Consolidate order books

• Load as arrays

• Merge into single array

• Impute missing values (inexact temporal join)

• Aggregate by time and symbol
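
The merge, impute and aggregate steps can be sketched as follows. This is an illustrative Python version under simplifying assumptions, not the SciDB queries used in the webinar: it handles one symbol, keeps only top-of-book quotes, and the exchange names and the function `consolidate` are hypothetical.

```python
# Hypothetical top-of-book quotes per exchange: {time: (best_bid, best_ask)}.
# Each exchange updates at its own times, so the time axes do not line up.
quotes = {
    "EXCH_A": {1: (10.00, 10.03), 4: (10.01, 10.03)},
    "EXCH_B": {2: (10.01, 10.04), 3: (9.99, 10.02)},
}


def consolidate(quotes):
    """Merge per-exchange quotes onto one time axis, carry the last known
    quote forward where an exchange has no update, then aggregate."""
    times = sorted({t for series in quotes.values() for t in series})
    last = {}           # most recent quote seen so far, per exchange
    consolidated = {}
    for t in times:
        for exchange, series in quotes.items():
            if t in series:
                last[exchange] = series[t]  # exact match at time t
            # Otherwise the earlier value stays in place: this is the
            # inexact temporal join (last observation carried forward).
        bids = [bid for bid, _ in last.values()]
        asks = [ask for _, ask in last.values()]
        # Consolidated best bid is the highest bid; best ask is the lowest ask.
        consolidated[t] = (max(bids), min(asks))
    return consolidated


book = consolidate(quotes)
```

At time 3, for example, EXCH_A has no update, so its quote from time 1 is carried forward and combined with the fresh EXCH_B quote.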


Example Order Books


Merge and impute


Consolidated Order Book


Benchmark Results

• 9 exchanges; 358,000,000 events; 8,000 symbols

• Order book depth: 10


Financial network analysis


A graph


Sparse matrix representation


Bitcoin transactions

A directed graph, represented as a nonsymmetric sparse matrix

[Figure: an edge from a "From address" node to a "To address" node, labeled with Date, Amount, Transaction ID]
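
As a rough illustration of this representation, the sketch below stores a handful of transactions as a sparse matrix in plain Python. The addresses, dates, amounts and transaction IDs are invented, and the dictionary-of-cells layout is an assumption for the example rather than SciDB's storage format.

```python
from collections import defaultdict

# Hypothetical transactions: (from_address, to_address, date, amount, tx_id).
transactions = [
    ("addr1", "addr2", "2013-01-05", 0.50, "tx001"),
    ("addr1", "addr3", "2013-01-06", 1.25, "tx002"),
    ("addr3", "addr1", "2013-01-07", 0.10, "tx003"),
    ("addr1", "addr2", "2013-01-09", 0.75, "tx004"),
]


def to_sparse_matrix(transactions):
    """Store only nonzero cells: {(row, col): total amount sent row -> col}."""
    index = {}                   # address -> integer coordinate
    cells = defaultdict(float)
    for sender, receiver, _date, amount, _tx_id in transactions:
        row = index.setdefault(sender, len(index))
        col = index.setdefault(receiver, len(index))
        cells[(row, col)] += amount
    return index, dict(cells)


index, matrix = to_sparse_matrix(transactions)
```

Because the graph is directed, cell (row, col) and cell (col, row) are independent: here `addr1` sends 1.25 to `addr3`, while `addr3` sends only 0.10 back, and most cells are simply absent.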


Bitcoin network schema

(using the Reid/Harrigan user ID method)

Identify important nodes

• Kleinberg HITS method

• Subgraph centrality

• Fiedler clustering

• Other methods...
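
For the first method in the list, a small pure-Python sketch of Kleinberg's HITS iteration over a sparse edge list is shown below. It is meant only to show the mechanics on a toy graph; the function name `hits`, the fixed iteration count and the node labels are assumptions, and it says nothing about how the computation is expressed inside SciDB.

```python
import math


def hits(edges, iterations=50):
    """Kleinberg's HITS by power iteration over a sparse edge list.

    edges: iterable of (source, target) pairs of a directed graph.
    Returns (hub, authority) score dicts, each scaled to unit Euclidean length.
    """
    edges = list(edges)
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the nodes pointing to it.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub score: sum of authority scores of the nodes it points to.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # Rescale both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical toy graph: a and b both send to c; c sends to d.
hub, auth = hits([("a", "c"), ("b", "c"), ("c", "d")])
top_authority = max(auth, key=auth.get)
```

Only the edges are ever touched, which is why the method suits a sparse matrix: each iteration costs time proportional to the number of nonzero cells rather than to the square of the node count.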

Bitcoin subgraph centrality

• Identify top 5 most central hub and authority nodes

• 16.3M nodes

• 6.3M x 6.3M sparse matrix

• 8-instance SciDB cluster on a single workstation (8 cores)

• 20 seconds


Correlation network

1. Compute bar data closing prices from TAQ trades
2. na.locf imputation
3. Correlation matrix across all instruments
4. Regularize
5. Precision matrix
6. Threshold
7. Plot clusters

All inside SciDB up to plot
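
A reduced version of this pipeline can be sketched in plain Python. The sketch covers only steps 2, 3 and 6 (carry-forward imputation, pairwise correlation, thresholding) and deliberately skips the regularization and precision-matrix steps; the price series, the 0.8 threshold and all function names are assumptions chosen for illustration.

```python
import math


def locf(series):
    """Last observation carried forward: replace None with the previous value.
    Leading gaps stay None because there is nothing to carry."""
    filled, last = [], None
    for x in series:
        if x is not None:
            last = x
        filled.append(last)
    return filled


def correlation(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_network(prices, threshold=0.8):
    """Return edges (sym_a, sym_b) whose absolute correlation exceeds threshold."""
    filled = {s: locf(p) for s, p in prices.items()}
    symbols = sorted(filled)
    edges = []
    for k, a in enumerate(symbols):
        for b in symbols[k + 1:]:
            if abs(correlation(filled[a], filled[b])) > threshold:
                edges.append((a, b))
    return edges


# Hypothetical closing prices per bar; None marks a bar with no trade.
prices = {
    "AAA": [10.0, 10.2, None, 10.6, 10.8],
    "BBB": [20.0, 20.4, 20.8, None, 21.6],
    "CCC": [5.0, 4.0, 5.0, 4.0, 5.0],
}
edges = correlation_network(prices)
```

The resulting edge list is the thresholded network that a plotting step would draw; in this toy data only AAA and BBB move together closely enough to be linked.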

Take away

• Bringing the analysis to the data

• In-database complex math

• Parallel time series analysis

• Programmable from C++, R, Python ...

• MPP on commodity clusters, clouds

• Extensible, open-source

www.paradigm4.com


Questions?

Tell us about your application • info@paradigm4.com

Try our Quick Start • scidb.org/forum

• Download a VM or EC2 AMI

www.paradigm4.com
