continuum-analytics
QuantCon
“Light Up Your Dark Data”
April 2016
2
What is dark data?
[Figure: data sources labeled SQL, CSV, REST, and JSON, repeated across the slide]
3
Example Datasets
Trade History
Signal History
Clearing Data
Log Files
Ref Data
Corp Actions
Market Data
Models
Firm Generated / Vendor Generated
4
Compounding Challenges
Accumulates Quickly
Disparate Storage
Different Vendors
Format Changes
Ad-hoc Usage
Urgent!
5
Workflow
Find Data
Ad-Hoc ETL
Store / Copy
Analysis
Report
6
Sample Environment
Oracle MySQL MSSQL KDB ZIP CSV
SQL
Python
DSL
R Matlab
C++ Java
Storage
ETL
Analysis
REST
7
Independent First Class Citizens
Expression
Compute
Data
8
Datashape
Structured data description language
http://datashape.pydata.org
9
Datashape Example
daily_bars: var * {
    date: string,
    symbol: string,
    open: float64,
    high: float64,
    low: float64,
    close: float64,
    volume: int64,
}
Language, compute, and storage independent
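As a rough illustration of how a schema like daily_bars can drive parsing no matter where the rows come from, the sketch below mirrors the record fields in a plain Python mapping and coerces one CSV row. It does not use the datashape library; `DAILY_BARS` and `coerce_row` are hypothetical names, and the mapping from datashape types to Python constructors is an assumption made for illustration.

```python
import csv
import io

# Field names and types copied from the daily_bars datashape above,
# mapped to Python constructors (an assumption, not datashape's own API).
DAILY_BARS = {
    "date": str,
    "symbol": str,
    "open": float,
    "high": float,
    "low": float,
    "close": float,
    "volume": int,
}


def coerce_row(raw, schema=DAILY_BARS):
    """Convert a dict of strings into typed values according to the schema."""
    missing = [name for name in schema if name not in raw]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {name: convert(raw[name]) for name, convert in schema.items()}


# One row built from the aaap sample values that appear in this deck.
sample = io.StringIO(
    "date,symbol,open,high,low,close,volume\n"
    "20151111,aaap,18.5,25.9,18,24.5,1584600\n"
)
rows = [coerce_row(raw) for raw in csv.DictReader(sample)]
print(rows[0])
```

The same mapping could be applied to rows read from a database cursor or a REST response, which is the sense in which the description is independent of storage.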
10
Blaze
Write expressions independent of storage system
Push computations to the data
Lazy evaluation
Pandas-like API
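The ideas above can be sketched without Blaze itself. In the minimal example below, an expression object only describes a filter; nothing runs until it is handed to a backend. One backend evaluates it over in-memory rows, the other translates it to SQL so the database does the work. `Filter`, `compute_python`, and `compute_sql` are hypothetical names for illustration and are not part of the Blaze API.

```python
import sqlite3


class Filter:
    """A lazy description of 'rows of <table> where <column> > <value>'."""

    def __init__(self, table, column, value):
        self.table = table
        self.column = column
        self.value = value


def compute_python(expr, tables):
    """Evaluate the expression against in-memory lists of dicts."""
    return [row for row in tables[expr.table] if row[expr.column] > expr.value]


def compute_sql(expr, conn):
    """Translate the expression to SQL so the database does the filtering."""
    # Identifiers come from the trusted expression object; the value is bound.
    query = f'SELECT * FROM "{expr.table}" WHERE "{expr.column}" > ?'
    cursor = conn.execute(query, (expr.value,))
    names = [d[0] for d in cursor.description]
    return [dict(zip(names, row)) for row in cursor]


# Closing prices taken from the sample rows in this deck.
bars = [
    {"symbol": "aaap", "close": 24.5},
    {"symbol": "aaap", "close": 25.26},
    {"symbol": "zyne", "close": 9.75},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_bars (symbol TEXT, close REAL)")
conn.executemany("INSERT INTO daily_bars VALUES (:symbol, :close)", bars)

expr = Filter("daily_bars", "close", 20.0)  # nothing is computed yet

from_memory = compute_python(expr, {"daily_bars": bars})
from_sql = compute_sql(expr, conn)
print(from_memory)
print(from_sql)
```

The expression is written once and stays unaware of where the data lives; only the compute function changes per storage system.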
12
Blaze Expressions
13
Flat File Repositories
Many directories and files
Dictated structure
Naming convention part of dataset
Requires one off ad-hoc scripts
14
Vendor - directory structure
/daily/us/nasdaq stocks/
/daily/us/nasdaq stocks/1/
/daily/us/nasdaq stocks/2/
osn.us.txt
ostk.us.txt
…
zyne.us.txt
/daily/us/nyse etfs/
/daily/us/nyse stocks/1/
/daily/us/nyse stocks/2/
Contains ~8400 individual files
15
Vendor – file contents
Date,Open,High,Low,Close,Volume,OpenInt
20151111,18.5,25.9,18,24.5,1584600,0
20151112,24.25,27.12,22.5,25,83000,0
20151113,25.47,26.2,24.55,25.26,67300,0
20151116,25.01,26.19,24.13,25.02,16900,0
20151117,24.46,25.51,24.38,24.62,25900,0
20151118,24.62,26.31,24.06,25,111100,0
20151119,24.85,26,24.71,25.9,113100,0
…
Symbol is not contained within the individual data files
/daily/us/nasdaq stocks/1/aaap.us.txt
16
Lux
source: "lux://global-equities/data/daily/us/nasdaq stocks"
extractor: "{}/{Symbol}.{Region}.txt"
Date,Open,High,Low,Close,Volume,OpenInt,Symbol,Region
20151111,18.5,25.9,18,24.5,1584600,0,aaap,us
20151112,24.25,27.12,22.5,25,83000,0,aaap,us
20151113,25.47,26.2,24.55,25.26,67300,0,aaap,us
…
20160322,11.56,11.98,10.8894,11.09,517604,0,zyne,us
20160323,11.3,11.72,9.5,9.75,489743,0,zyne,us
20160324,9.5,10.24,9.22,9.64,188512,0,zyne,us
One dataset with ~5.5 million rows
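To make the extractor idea concrete, here is a minimal sketch of how a pattern such as "{}/{Symbol}.{Region}.txt" could pull column values out of file paths and append them to every row. This is not Lux's implementation: `pattern_to_regex` and `combine` are hypothetical helpers, the matching rules ("{}" matches anything, a named field matches a run of characters without "/" or ".") are assumptions, and the files are held in memory so the example stays self-contained.

```python
import csv
import io
import re


def pattern_to_regex(pattern):
    """Turn '{}/{Symbol}.{Region}.txt' into a compiled regex.

    Assumption: '{}' matches anything and '{Name}' captures characters
    other than '/' and '.'. Lux's real rules may differ.
    """
    parts = re.split(r"(\{\w*\})", pattern)
    regex = ""
    for part in parts:
        if part == "{}":
            regex += ".*"
        elif part.startswith("{") and part.endswith("}"):
            regex += f"(?P<{part[1:-1]}>[^/.]+)"
        else:
            regex += re.escape(part)
    return re.compile(regex + r"\Z")


def combine(files, pattern):
    """Merge {relative_path: csv_text} into one table with extra columns."""
    matcher = pattern_to_regex(pattern)
    header, rows = None, []
    for path, text in sorted(files.items()):
        match = matcher.match(path)
        if match is None:
            continue  # skip files that do not follow the naming convention
        extra = match.groupdict()
        reader = csv.reader(io.StringIO(text))
        file_header = next(reader)
        if header is None:
            header = file_header + list(extra)
        rows.extend(line + list(extra.values()) for line in reader)
    return header, rows


# Paths are relative to the source directory; the subdirectory used for
# zyne is assumed for illustration.
files = {
    "1/aaap.us.txt": "Date,Open,High,Low,Close,Volume,OpenInt\n"
                     "20151111,18.5,25.9,18,24.5,1584600,0\n",
    "2/zyne.us.txt": "Date,Open,High,Low,Close,Volume,OpenInt\n"
                     "20160324,9.5,10.24,9.22,9.64,188512,0\n",
}
header, rows = combine(files, "{}/{Symbol}.{Region}.txt")
print(header)
for row in rows:
    print(row)
```

A real repository would read the files from disk (for example with `pathlib.Path.rglob`) instead of a dictionary, but the naming convention is turned into data in the same way.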
17
Lux Benefits
Combines individual files
No separate ETL or storage
Names become part of data
Optimized compute
18
Anaconda Mosaic
Interactive exploration
Intuitive interface
Advanced visualizations
Catalog of datasets and expressions
Provenance and Governance
19
Live Walkthrough
20
Project References
• Anaconda Mosaic - http://know.continuum.io/Anaconda-Mosaic
• Blaze Ecosystem - http://blaze.pydata.org
• Bokeh - http://bokeh.pydata.org