Building Credit Infrastructure with Anaconda
Introduction
Hussain Sultan / Data Scientist @ Capital One
Most data-driven analysis is simple
1. Consume
   - Relational data
   - SQL data sources
   - Historical performance
2. Analyze
   - Predictive models
   - Parameterized model scoring
   - Facilitates experimentation
   - Iterative process
3. Act
   - Decisions/rules
   - Recommendations
And the trick is to strike the right balance between technology and people
Data Science + Business Analysis
- Open source
- Extensible
- Easy to maintain
Our challenge is to provide efficient data access and model scoring, along with extensible tooling
Data Retrieval
- Medium-size data
- Fast retrieval
- Ad-hoc access patterns

Model Scoring
- Complex hierarchy of models
- Lots of model adjustment knobs

Ad-hoc Analysis
- SQL-like analysis
- Aggregations
- KPIs
We decided to build a Python package that business analysts could interact with to consume analytics
… and consumed by analysts in Jupyter notebooks
Business Analysts and SMEs interact via a custom Python package built by data scientists.

Infrastructure: Jupyter Notebook Server (AEN) on EC2 instances (AWS), backed by a Conda repository, CI/CD, and GitHub.
Navigating to an appropriate segmentation
- The API tree is a source of important metadata around segmentation
- Pre-defined segmentations
- A mechanism to register new segmentations
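A minimal sketch of what such a registration mechanism might look like — all names here are hypothetical, not the actual package API:

```python
import pandas as pd

# Hypothetical segmentation registry; every name below is illustrative.
_SEGMENTATIONS = {}

def register(name, loader):
    """Register a new segmentation under a name in the API tree."""
    _SEGMENTATIONS[name] = loader

def get(name):
    """Retrieve a known segmentation as a pandas DataFrame."""
    return _SEGMENTATIONS[name]()

# A pre-defined segmentation registered at import time
register("prime", lambda: pd.DataFrame({"account": [1, 2],
                                        "balance": [100.0, 250.0]}))

df = get("prime")  # analysts retrieve by name, never by query
```

The registry pattern keeps segmentation definitions in one discoverable place, so analysts browse names rather than write SQL.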
We use Blaze as the mechanism to retrieve known segmentations of the data

Blaze backends (Postgres, Redshift, S3) → pandas
Model scoring
Model scoring API - Configurable parameters
Dask is a mechanism to wrap the inter-dependencies of models and score them in an optimal order

Custom Dask graph → compute → pandas
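A custom Dask graph encodes model inter-dependencies as a plain dict mapping keys to tasks; the scheduler resolves the ordering and reuses shared intermediate results. A toy sketch — the model functions and key names are illustrative:

```python
from dask import get  # synchronous scheduler; threaded is a drop-in swap

# Toy model stages; real ones would be parameterized scoring functions.
def score_base(x):
    return x * 2

def adjust(base):
    return base + 1

def combine(a, b):
    return a + b

# A dask graph maps keys to values or (function, *dependency_keys) tasks.
dsk = {
    "inputs": 10,
    "base": (score_base, "inputs"),
    "adjusted": (adjust, "base"),
    "final": (combine, "base", "adjusted"),
}

# dask walks the dependency graph: "base" is computed once and fed to
# both downstream tasks.
result = get(dsk, "final")
```

Because the graph is just data, a complex hierarchy of models stays declarative, and adjustment knobs become parameters baked into the tasks.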
All of our data is returned as a pandas DataFrame
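Since everything comes back as a DataFrame, the ad-hoc aggregations and KPIs from earlier are plain pandas. Column names below are made up for illustration:

```python
import pandas as pd

# Illustrative KPI aggregation on a returned DataFrame;
# the columns are hypothetical, not the real schema.
df = pd.DataFrame({
    "segment": ["A", "A", "B", "B"],
    "balance": [100.0, 200.0, 50.0, 150.0],
    "defaulted": [0, 1, 0, 0],
})

kpis = df.groupby("segment").agg(
    avg_balance=("balance", "mean"),
    default_rate=("defaulted", "mean"),
)
```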
Bring your own tool or create a custom workflow for analysis
With Jupyter notebooks, analysts can do most medium-sized data analysis in a remote kernel
Jupyter Notebooks (AEN) execute code in a conda env, with packages pulled from the Conda repository
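Provisioning such an environment might look like the following — the channel URL and package list are hypothetical, not the actual internal setup:

```shell
# Illustrative: build an analyst environment from an internal conda
# channel (URL and packages are hypothetical).
conda create -n analytics \
    -c https://conda.internal.example/channel \
    python=3.6 pandas dask blaze jupyter
conda activate analytics
```

Pinning the environment to an internal channel is what lets CI/CD publish vetted package versions for every analyst kernel.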
Thank you