Data Analytics Infrastructure Le Nguyen The Dat @lenguyenthedat Data Science SG Nov 2015 Meetup

Slides: lenguyenthedat.com/extras/decks/dssg-data-infras.pdf




Background
ZALORA Group (2013 – 2014)

o  Biggest online fashion retailer in South East Asia
o  Data Infrastructure & Data Science


Background
Commercialize.TV (2015 – )

o  Multi-channel media network focusing on Chinese audiences
o  Data Infrastructure & Insights


Challenges

No central data source:
o  Data stored in multiple locations
o  Unclear ownership

Data definition and quality:
o  Little to no documentation
o  Different formulas and rules owned by different departments
o  Always dirty no matter what

Reporting – Descriptive analytics:
o  Immediate needs, automation
o  Important to do it right (and quick!)


Data Warehouse


Database Technologies

SQL – Relational Databases:
o  MySQL, PostgreSQL
o  MS SQL Server, Oracle SQL

NoSQL:
o  Redis
o  Cassandra
o  MongoDB
o  DynamoDB (AWS)
o  RethinkDB

Map Reduce ecosystem:
o  Hadoop: HDFS – Pig – Hive – HBase
o  Spark: RDD – Spork – Shark (Spark SQL) – HBase-Spark

Massively Parallel Processing (MPP):
o  Vertica (HP) – Greenplum (EMC) – Netezza (IBM)
o  ParAccel (Amazon Redshift)


Database Technologies

SQL – Relational Databases: ✔
o  MySQL, PostgreSQL
o  MS SQL Server, Oracle SQL

NoSQL: (?)
o  Redis ✖
o  Cassandra ✔
o  MongoDB ✖
o  DynamoDB (AWS) ✖
o  Neo4j ✖

Map Reduce ecosystem: ✔
o  Hadoop: HDFS – Pig – Hive – HBase
o  Spark: RDD – Spork – Shark (Spark SQL) – HBase-Spark

Massively Parallel Processing (MPP): ✔
o  Vertica (HP) – Greenplum (EMC) – Netezza (IBM)
o  ParAccel (Amazon Redshift)


Data Warehouse
Amazon Redshift – aws.amazon.com/redshift

o  Cloud-based, fully managed
o  SQL (based on PostgreSQL 8.0.2)
o  On-demand ($2000/year)
o  Scalable (petabyte-scale)
o  FAST! amplab.cs.berkeley.edu/benchmark/

All in ONE place!

o  Product information
o  Customer information
o  Tracking data
o  External data sources (Social Media, 3rd Party datasets)


Extract-Transform-Load (ETL)


ETL
Amazon Redshift’s COPY command – docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
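A load with the COPY command can be sketched roughly as below. This is a minimal illustration, not the setup used in the talk: the helper `build_copy_statement`, the table name, the S3 path, and the IAM role ARN are all hypothetical placeholders. The helper only assembles the SQL text; running it would need a real Redshift connection, and the inputs are assumed to be trusted since nothing is escaped.

```python
def build_copy_statement(table, s3_path, iam_role, ignore_header=True):
    """Return the text of a Redshift COPY statement that loads CSV files from S3.

    Hypothetical helper for illustration: it only builds a string and
    does not connect to any database.
    """
    options = ["FORMAT AS CSV"]
    if ignore_header:
        # Skip the first line of each file when it holds column names.
        options.append("IGNOREHEADER 1")
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        + "\n".join(options)
        + ";"
    )


# All values below are placeholders, not real resources.
statement = build_copy_statement(
    table="sales_orders",
    s3_path="s3://example-bucket/exports/sales_orders/",
    iam_role="arn:aws:iam::123456789012:role/ExampleRedshiftLoad",
)
print(statement)
```

COPY reads the files under the S3 prefix in parallel, which is why it is usually preferred over row-by-row INSERT statements for bulk loads.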


ETL

Custom made:
o  Simple bash script, Python, SQL
o  Use cases:
   •  Scrapers
   •  Excel / CSV imports

Data Pipeline Frameworks:
o  Large scale, more complicated
o  Examples:
   •  Spotify’s Luigi – github.com/spotify/luigi
   •  Yelp’s Mycroft – github.com/Yelp/mycroft

3rd Party Services:
o  aws.amazon.com/redshift/partners
   •  FlyData
   •  RJMetrics
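For the "custom made" route, a CSV import step might look something like the sketch below. It is an assumption-laden illustration rather than the scripts from the talk: the column names (`sku`, `price`, `brand`), the sample data, and the cleaning rules are all made up to show the idea of normalising and filtering dirty rows before loading them.

```python
import csv
import io

# Hypothetical required columns for this example.
REQUIRED = ("sku", "price")


def clean_rows(csv_text):
    """Parse CSV text, normalise headers and values, and drop unusable rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    cleaned = []
    for raw in reader:
        # Trim whitespace and lowercase headers; exports are often messy.
        row = {
            key.strip().lower(): (value or "").strip()
            for key, value in raw.items()
            if key is not None
        }
        if any(not row.get(field) for field in REQUIRED):
            continue  # skip rows with a missing required field
        try:
            row["price"] = float(row["price"])
        except ValueError:
            continue  # skip rows whose price is not numeric
        cleaned.append(row)
    return cleaned


# Made-up sample with typical problems: padded values, a missing SKU, a bad price.
sample = "SKU , Price ,Brand\nA1, 19.90 ,Acme\n,5.00,Acme\nB2,n/a,Other\nC3,7,  Other \n"
rows = clean_rows(sample)
print(rows)
```

Dropping bad rows silently is a design choice that suits a sketch; a real import would more likely log or quarantine them so data owners can fix the source.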


Applications


Self-service Dashboards
Intermediate users

o  SQL / Excel / WYSIWYG Tools
o  Re:dash – www.redash.io
   •  Open source: github.com/getredash/redash
   •  Try it out: demo.redash.io
   •  Self-managed deployment:
      o  Docker
      o  Pre-baked AMI (Amazon Web Services)
      o  Google Cloud Images
   •  Supports lots of database types (Redshift, MySQL, PostgreSQL, BigQuery, MongoDB…)
   •  Users need to know SQL
   •  Web-based, collaborative workflow
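To give a feel for the SQL a dashboard user would write, here is a small sketch of a typical aggregate query. An in-memory SQLite database stands in for the data warehouse so the example runs anywhere; the `orders` table, its columns, and the rows are invented for illustration only.

```python
import sqlite3

# In-memory SQLite is only a stand-in for the warehouse in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("SG", 40.0), ("SG", 60.0), ("MY", 30.0), ("ID", 20.0), ("ID", 5.0)],
)

# The kind of query that would back a "revenue by country" chart.
query = """
    SELECT country,
           COUNT(*)    AS order_count,
           SUM(amount) AS revenue
    FROM orders
    GROUP BY country
    ORDER BY revenue DESC, country
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
conn.close()
```

In a tool like re:dash, the same SELECT would be saved as a query and attached to a visualization, with the tool handling the connection and scheduling.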


Demo: re:dash data sources usage http://demo.redash.io/queries/756


Demo: NYC Taxis Tip Amounts http://demo.redash.io/queries/753


Self-service Dashboards
Intermediate users

o  SQL / Excel / WYSIWYG Tools
o  Tableau – www.tableau.com
   •  Licensed software (14-day trial)
   •  Tableau Public (free: public.tableau.com)
   •  Self-hosted or hosted by Tableau (fully managed)
   •  Supports a lot more database types
   •  Group and user management with customized access rights
   •  Drag-and-drop desktop software as well as web-based


Demo: Social media dashboard
Baidu → Import.io API → Data Warehouse → Tableau


Demo: Market Research – video platform performance
SimilarWeb API → Data Warehouse → Tableau


Others (Tableau Public):
•  SEA Games Result history – tiny.cc/seagames
•  Rakuten – Viki data challenge – tiny.cc/viki-viz


Advanced applications
Advanced users

o  Data Warehouse connection (JDBC – PostgreSQL)
o  Automated, highly customized reports
o  Data Science:
   •  Recommendation engine
   •  Predictive modeling
   •  Classification


Advanced applications
Internal reporting tool
Data Warehouse → SQL, Python (Django), JS → Product-Finder

[ Black dress | SKU or ID | Tiffany | Atmosphere ]

[Figure: Product-Finder panels for Sales Info, Tracking Info, and Product Info]


Advanced applications
Recommendation Engine
Data Warehouse → SQL, Python, Haskell → ZALORA Website

[Figure: four "Similar products" panels]
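The slides do not say how "similar products" were computed. One simple way to approach the idea, shown here purely as an assumption, is to score products by the overlap of their attribute sets (Jaccard similarity). The catalog, SKUs, and attributes below are invented for the example.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_products(target, catalog, top_n=2):
    """Rank the other products in the catalog by attribute overlap with the target."""
    scores = [
        (sku, jaccard(catalog[target], attrs))
        for sku, attrs in catalog.items()
        if sku != target
    ]
    # Highest score first; ties are broken alphabetically by SKU.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:top_n]


# Hypothetical catalog: SKU -> set of descriptive attributes.
catalog = {
    "dress-black": {"dress", "black", "evening"},
    "dress-red": {"dress", "red", "evening"},
    "dress-black-casual": {"dress", "black", "casual"},
    "sneaker-white": {"shoes", "white", "casual"},
}
recommendations = similar_products("dress-black", catalog)
print(recommendations)
```

A production engine would usually combine content signals like these with behavioural data from the warehouse (views, purchases), but the ranking step has the same shape.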


Conclusions


Team & Technology stack

o  Small team of 1-4 programmers
o  Amazon Web Services
   •  No upfront cost
   •  Low maintenance
   •  Scalability
   •  Integrations
o  Shell Scripts, Python, Haskell, D3.js
o  Unix, Open-source technologies


Takeaways

o  (Good) data infrastructure is important:
   •  Build it first (before you hire a data scientist!)
   •  Build it right: stable – fast – scalable.
o  There is no silver bullet:
   •  Understand what you need
   •  Always do more research
o  Data infrastructure is NOT that hard!
   •  Utilize existing, modern technologies
   •  Avoid old, proprietary technologies that were built for the 90s!


References

Engineering Blogs and Tutorials
o  https://blog.asana.com/2014/11/stable-accessible-data-infrastructure-startup/
o  https://medium.com/@samson_hu/building-analytics-at-500px-92e9a7005c83
o  https://engineering.pinterest.com/blog/powering-interactive-data-analysis-redshift
o  http://engineering.ifttt.com/data/2015/10/14/data-infrastructure/
o  https://blog.rjmetrics.com/2015/10/15/the-data-infrastructure-meta-analysis-how-top-engineering-organizations-built-their-big-data-stacks/
o  https://www.youtube.com/watch?v=reQtXquDpzo
o  https://www.periscopedata.com/amazon-redshift-guide

Benchmarks
o  https://amplab.cs.berkeley.edu/benchmark/
o  https://www.flydata.com/blog/with-amazon-redshift-ssd-querying-a-tb-of-data-took-less-than-10-seconds/
o  https://www.flydata.com/blog/hive-and-redshift-a-brief-comparison/


Thank you!