Critical Breakthroughs and Challenges in Big Data and Analytics

Critical Breakthroughs and Technical Challenges in Big Data Driven Innovation

Paolo Spreafico

Head of EMEA Data Solution Engineers, Google Cloud Platform

Google Cloud Platform 2

Google’s Mission: Organize the world’s information and make it universally accessible and useful.

#cloudconf2016

By 2020, there will be 8 billion connected smartphones, 2X more than today, and 32 billion connected “IoT” devices, 6X more than today.

Source: Boston Consulting Group, The Mobile Revolution: How Mobile Technologies Drive a Trillion-Dollar Impact; IDC, 2015

Building what’s next 6

10x increase in data (4ZB to 45ZB)

35B connected devices

40% of data “touched” by the cloud

Source: IDC

Data is key (among others)

Organisation, Data, Questions, Technology

“Companies in the top third of their industry in the use of data-driven decision making were, on average, 5% more productive and 6% more profitable than their competitors.”

Andrew McAfee and Erik Brynjolfsson, MIT

What does Cloud 3.0 look like?

Cloud 1.0: Colocation

Your kit, someone else’s building. Yours to manage.

Cloud 2.0: Today's Cloud: Virtualized Data Centers

Standard virtual kit, for rent. Still yours to manage.

Cloud 3.0: An actual, global elastic cloud

Invest your energy in great apps

Storage, Processing, Memory, Network

Single-node computing: “Some assembly required”

True, on-demand cloud

Automation

Google Cloud Platform Vision

Messaging Big Data Containers NoSQL

http://googleasiapacific.blogspot.se/2015/06/growing-our-data-center-in-singapore.html

For the past 17 years, Google has been building out the fastest, most powerful, highest quality cloud infrastructure on the planet.

Edge locations in virtually every country in the world

Our Network

77 Peering locations

10+ Years of Tackling Big Data Problems

Timeline: 2002, 2004, 2005, 2006, 2008, 2010, 2012, 2014, 2015

Google Papers: GFS, MapReduce, BigTable, Dremel, PubSub, FlumeJava, MillWheel

Open Source: Apache Beam, TensorFlow

Google Cloud Products: BigQuery, Pub/Sub, Dataflow, Bigtable

Google’s Data Services for everyone

Confidential + Proprietary

The Google Cloud data toolbox

Storage and Databases

Cloud Storage

Cloud SQL

Cloud Bigtable

Cloud Datastore

Big Data and Analytics

BigQuery

Cloud Pub/Sub

Cloud Dataflow

Cloud Dataproc

Cloud Datalab

Machine Learning

Cloud Machine Learning

Cloud Translate API

Cloud Vision API

Cloud Speech API

A serverless big data stack that scales automatically

Typical Big Data Processing

Complexities of Big Data Processing:

Programming

Resource provisioning

Performance tuning

Monitoring

Reliability

Deployment & configuration

Handling growing scale

Utilization improvements

Time to Understanding

Spend Time on ‘What’ not ‘How’

Time to Understanding

Big Data Processing with Google Cloud Platform

Programming

More time to dig into your data

Cloud 3.0 Big Data Lifecycle

Stages: Capture, Store, Process (batch and stream), Analyze, Use

Cloud Logs

Google App Engine

Google Analytics Premium

Cloud Pub/Sub

BigQuery Storage (tables)

Cloud Bigtable (NoSQL)

Cloud Storage (files)

Cloud Dataflow

BigQuery Analytics (SQL)

Cloud Monitoring

Real-time analytics

Cloud Dataflow

Cloud ML

Real-time dashboard

Real-time alerts

Data Scientists

Analysts

Smart apps

Catalog & Data Lifecycle Automation

Cloud Datalab

Cloud Dataproc

Data Studio

Emerging Big Data Challenges

Real-time data ingestion

Machine learning at scale

Batch or streaming?

Analytics at the speed of thought

Batch or Streaming? Why do you have to choose?

Breakthrough #1

“We don’t really use MapReduce anymore”

Urs Hölzle, SVP Technical Infrastructure, Google

A common configuration: capturing input

Cloud Pub/Sub: Reliable, many-to-many, asynchronous messaging

Cloud Storage: Powerful, simple and cost-effective object storage

Raw logs, files, assets, Google Analytics data, etc.

Events, metrics, etc.
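
The "many-to-many, asynchronous" property can be pictured with a small in-memory stand-in. This is a sketch in plain Python, not the Cloud Pub/Sub client library: the `Topic` class, the `drain` helper and the subscription names are hypothetical, and persistence, acknowledgements and delivery guarantees (the "reliable" part) are not modeled.

```python
import queue
from typing import Dict, List


class Topic:
    """In-memory stand-in for a topic; each subscription gets its own queue."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._subscriptions: Dict[str, "queue.Queue[str]"] = {}

    def subscribe(self, subscription: str) -> "queue.Queue[str]":
        # Every subscription receives its own copy of each later message.
        return self._subscriptions.setdefault(subscription, queue.Queue())

    def publish(self, message: str) -> None:
        # The publisher returns immediately; subscribers read at their own pace.
        for q in self._subscriptions.values():
            q.put(message)


def drain(q: "queue.Queue[str]") -> List[str]:
    """Read everything currently waiting in one subscription queue."""
    items = []
    while not q.empty():
        items.append(q.get())
    return items


topic = Topic("events")
dashboard = topic.subscribe("dashboard")
archive = topic.subscribe("archive")

# Two independent publishers write to the same topic.
topic.publish("web: page_view")
topic.publish("mobile: purchase")

dashboard_msgs = drain(dashboard)
archive_msgs = drain(archive)
```

Both subscribers see both messages, and neither publisher needs to know who is listening, which is what lets producers and consumers be added independently.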

A common configuration: process and transform

Events, metrics, etc.

Cloud Dataflow: Data processing engine for batch and stream processing

Stream

Batch

Cloud Dataproc: Managed Spark and Hadoop

Batch

Raw logs, files, assets, Google Analytics data, etc.
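
The idea of one engine for batch and stream can be illustrated without any cloud service. The sketch below is plain Python, not the Cloud Dataflow or Apache Beam API; the names `fixed_windows`, `count_per_window` and `event_stream`, and the 60-second window, are assumptions made for illustration. The same counting logic consumes a finite list (batch) and a generator (stream), provided events arrive in timestamp order.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, payload)


def fixed_windows(events: Iterable[Event], size: float) -> Iterator[Tuple[float, List[str]]]:
    """Group time-ordered events into fixed windows of `size` seconds.

    Yields (window_start, payloads) as soon as a window closes, so it works
    for a finite list (batch) and for a generator (stream) alike.
    Assumes events arrive in timestamp order; late data is not handled.
    """
    current_start = None
    bucket: List[str] = []
    for ts, payload in events:
        start = (ts // size) * size
        if current_start is None:
            current_start = start
        if start != current_start:
            yield current_start, bucket
            current_start, bucket = start, []
        bucket.append(payload)
    if current_start is not None:
        yield current_start, bucket  # flush the last open window


def count_per_window(events: Iterable[Event], size: float = 60.0) -> List[Tuple[float, int]]:
    """Count events per window; identical code path for batch and stream."""
    return [(start, len(items)) for start, items in fixed_windows(events, size)]


def event_stream() -> Iterator[Event]:
    """Hypothetical source standing in for a stream."""
    for ts in (1.0, 30.0, 61.0, 62.0, 125.0):
        yield ts, "click"


batch = list(event_stream())                    # bounded input
batch_counts = count_per_window(batch)
stream_counts = count_per_window(event_stream())  # consumed incrementally
```

Because the windowing only looks at event timestamps, the batch and streaming runs produce the same per-window counts. A real engine also has to deal with out-of-order and late data, which this sketch deliberately leaves out.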

A common configuration: analyze and store

Events, metrics, etc.

Stream

Batch

BigQuery: Extremely fast and cheap on-demand analytics engine

Bigtable: High performance NoSQL database for large workloads

Batch

Raw logs, files, assets, Google Analytics data, etc.

A common configuration: draw conclusions

Stream: Events, metrics, etc.

Batch: Raw logs, files, assets, Google Analytics data, etc.

Cloud Datalab

Visualization and BI: Spreadsheets, BI Tools

Applications and Reports

Co-workers

Real-time data ingestion (and at scale)

Breakthrough #2

Google confidential │ Do not distribute

Overview:

Data to process: data in the Consolidated Audit Trail (CAT), a data repository of all equities and options orders, quotes, and events.

Challenges:

How to process the CAT and organize 100 billion market events into an “order lifecycle” in a 4-hour window

Store 6 years (~30PB) of data

Cloud Bigtable to process and run queries and tolerate volume increases

Sustained: 6 billion market events written per hour, 1.7 gigabytes per second, 6 terabytes per hour

Bursts: 10 billion written per hour, 1.7 gigabytes per second, 10 terabytes per hour
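
The sustained figures on this slide can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted above and assumes decimal units (1 TB = 1000 GB); the variable names are illustrative, and the average event size is derived here, not a figure stated in the talk.

```python
SECONDS_PER_HOUR = 3600

# Figures quoted on the slide (sustained load).
events_per_hour = 6e9          # 6 billion market events written per hour
gigabytes_per_second = 1.7     # 1.7 GB per second
terabytes_per_hour_quoted = 6  # 6 TB per hour

# Assumption: decimal units, 1 TB = 1000 GB.
terabytes_per_hour = gigabytes_per_second * SECONDS_PER_HOUR / 1000
events_per_second = events_per_hour / SECONDS_PER_HOUR
bytes_per_event = terabytes_per_hour * 1e12 / events_per_hour

# The stated challenge: 100 billion events organized within a 4-hour window.
required_events_per_second = 100e9 / (4 * SECONDS_PER_HOUR)

print(f"{terabytes_per_hour:.2f} TB/hour from 1.7 GB/s")
print(f"{events_per_second:,.0f} events/s sustained")
print(f"{bytes_per_event:,.0f} bytes per event on average")
print(f"{required_events_per_second:,.0f} events/s needed for the 4-hour window")
```

At 1.7 GB per second the hourly volume comes to roughly 6.1 TB, which agrees with the quoted 6 TB per hour and implies an average of about 1 KB per market event.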

https://www.youtube.com/watch?v=fqOpaCS117Q

Analytics at the speed of thought

(and at scale)

Breakthrough #3

Google BigQuery makes complex data analysis simple

Scales automatically

No setup or administration

Stream up to 100,000 rows per second

Easily integrates with third-party software

Google BigQuery Performance Example

Running an inefficient regular expression over 100 billion rows in less than 60 seconds

Source: https://cloud.google.com/blog/big-data/2016/01/anatomy-of-a-bigquery-query

Before: 1000-core Hadoop Cluster = 2.5 hours

After: Making ad hoc Queries with BigQuery < 5 min

● 500+ Games

● Hundreds of Analysts

● Terabytes of Data Daily
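
To put the quoted BigQuery figures in perspective, the short calculation below turns them into lower bounds. It uses only numbers from this slide and the performance example before it; since the slides say "less than 60 seconds" and "< 5 min", the results are minimums, and the two figures come from different workloads, so this is not a benchmark comparison.

```python
# Performance example: regular expression over 100 billion rows.
rows_scanned = 100e9           # 100 billion rows
query_seconds = 60             # "less than 60 seconds" (upper bound)
min_rows_per_second = rows_scanned / query_seconds

# Before/after example: Hadoop cluster versus ad hoc BigQuery queries.
hadoop_minutes = 2.5 * 60      # 1000-core Hadoop cluster: 2.5 hours
bigquery_minutes = 5           # ad hoc BigQuery query: "< 5 min" (upper bound)
min_speedup = hadoop_minutes / bigquery_minutes

print(f"at least {min_rows_per_second / 1e9:.2f} billion rows per second")
print(f"at least {min_speedup:.0f}x faster than the Hadoop job")
```

Under these bounds the scan rate is at least about 1.67 billion rows per second, and the ad hoc queries are at least 30 times faster than the 2.5-hour Hadoop job.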

Google BigQuery: The Power of Google Dremel for everyone

Storage Compute

Fast Ingest Query

Terabit Network

“Right at the start of the partnership we were able to reduce time to insight from 96 hours to 30 minutes by using BigQuery, allowing us to react in real time to customer needs and provide better service.”

Gary Sanders, Head of the bank's digital analytics function

https://www.finextra.com/newsarticle/28566/lloyds-partners-google-on-data-analytics

Machine learning for everyone

Breakthrough #4

"Machine learning is a core, transformative way by which we're rethinking everything we're doing … we're thoughtfully applying it across all our products, be it search, ads, YouTube or Play."

Applications that can see, hear and understand

TensorFlow

Deep Learning technology currently powering over 100 Google services

Generalizable to vision, sound, text, video and other data

Runs on CPUs or GPUs, desktop, server, or mobile computing platforms.

Distributed via Apache 2.0 OSS license

Use your own data to train models

What Cloud Machine Learning Can Do

● Fully managed service

● Train using a custom TensorFlow graph

● Batch and online predictions, at scale

● Integrated Datalab experience

● Regression and classification tasks

Fully trained, easy to use Machine Learning models

Cloud Translate API

Cloud Vision API

Cloud Speech API

Cloud Vision API

Label Detection

Landmark Detection

OCR

Logo Detection

Face Detection

Explicit Content Detection

{"landmarkAnnotations": [
  {"description": "Arc de Triomphe",
   "locations": [{"latLng": {"latitude": 48.873667, "longitude": 2.295134}}],
   "score": 0.94231218}
]}
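
A response of this shape is straightforward to consume. The following sketch parses the landmark annotation shown above with Python's standard `json` module; the helper `best_landmark` and the `min_score` threshold of 0.5 are assumptions for illustration, not part of the Vision API, and the call that produces the response is not shown.

```python
import json

# The landmark response from the slide, with braces balanced.
response_text = """
{"landmarkAnnotations": [
  {"description": "Arc de Triomphe",
   "locations": [{"latLng": {"latitude": 48.873667, "longitude": 2.295134}}],
   "score": 0.94231218}
]}
"""


def best_landmark(response, min_score=0.5):
    """Return (description, latitude, longitude, score) of the top landmark.

    `min_score` is a hypothetical confidence cut-off; returns None when no
    annotation reaches it.
    """
    annotations = response.get("landmarkAnnotations", [])
    candidates = [a for a in annotations if a.get("score", 0.0) >= min_score]
    if not candidates:
        return None
    top = max(candidates, key=lambda a: a["score"])
    lat_lng = top["locations"][0]["latLng"]
    return top["description"], lat_lng["latitude"], lat_lng["longitude"], top["score"]


result = best_landmark(json.loads(response_text))
print(result)
```

Filtering on the score before using the coordinates keeps low-confidence detections out of downstream logic; the right threshold depends on the application.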

Cloud Speech API

Recognizes over 80 languages and variants

Can return text in real-time

Highly accurate, even in noisy environments

Access from any device

Powered by Google’s machine learning

Machine Learning Use Cases

Structured Data

Classification/Regression

● Customer Churn Analysis

● Product Diagnostics

● Forecasting

Recommendation

● Content Personalization

● Product X-Sells/Up-sells

Anomaly Detection

● Fraud Detection

● Asset Sensor Diagnostics

● Log Metric Anomalies

Unstructured Data

Image Analytics

● Identify damaged shipments

● Explicit Content Classification

Text Analytics

● Call Center log analysis

● Language Identification

● Topic Classification

● Sentiment Analysis

cloud.google.com

Google’s Approach to Cloud Security & Compliance

● Tens of thousands of custom-built, homogeneous systems

● Dozens of datacenters for redundancy

● Data encryption in transit and at rest

● Secure software development process

● External security verifications

● 500+ security engineers

● 160+ academic research papers on security

● Vulnerability Reward Program

We store our own data in this environment

Certifications

Product    SSAE-16 SOC 1  SSAE-16 SOC 2  SSAE-16 SOC 3  ISO 27001  HIPAA (BAA)  PCI DSS v3.0  FISMA             FedRAMP
GAE        Complete       Complete       Complete       Complete   H2 15        Complete      FISMA (Moderate)  H2 15
GCS        Complete       Complete       Complete       Complete   Complete     Complete      n/a               H2 15
GCE        Complete       Complete       Complete       Complete   Complete     Complete      n/a               H2 15
Datastore  Complete       Complete       Complete       Complete   H2 15        Complete      n/a               H2 15
BigQuery   Complete       Complete       Complete       Complete   Complete     Complete      n/a               H2 15
Cloud SQL  Complete       Complete       Complete       Complete   Complete     Complete      n/a               H2 15
Genomics   Complete       Complete       Complete       Complete   Complete     n/a           n/a               H2 15
Apps       Complete       Complete       Complete       Complete   Complete     n/a           GAFG only         H2 15

https://cloud.google.com/solutions/machine-learning-with-financial-time-series-data

Demo: Predicting the NYSE daily outcome

Get more info: Google Cloud for Financial Services https://cloud.google.com/solutions/finserv/
