51

with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

  • Upload
    others

  • View
    16

  • Download
    0

Embed Size (px)

Citation preview

Page 1: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >
Page 2: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Crunching Big Data with BigQuery

Ryan Boyd, Developer Advocatehttp://profiles.google.com/ryan.boyd@ryguyrg

XLDBTuesday, September 11th

2012

Page 3: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

How BIG is big?

Page 4: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

1 million rows?

Page 5: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

10 million rows?

Page 6: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

100 million rows?

Page 7: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

1 million

1 million1 million 1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million

1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million

1 million

1 million1 million1 million

1 million1 million1 million1 million

1 million

1 million

1 million1 million1 million

1 million1 million1 million1 million

1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million

500 million rows!

1 million

Page 8: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Big Data at Google

60 hours100 million gigabytes425 million users

Page 9: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >
Page 10: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >
Page 11: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Google's internal technology: Dremel

Page 12: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Big Data at Google - Finding top installed market apps

SELECT top(appId, 20) AS app, count(*) AS countFROM installlog.2012;ORDER BY count DESC

Result in ~20 seconds!

Page 13: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Big Data at Google - Finding slow servers

SELECT count(*) AS count, source_machine AS machineFROM product.product_log.liveWHERE elapsed_time > 4000GROUP BY source_machineORDER BY count DESC

Result in ~20 seconds!

Page 14: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

BigQuery gives you this power

Store data with reliability, redundancy and consistency

Go from data to meaning

Quickly!

At scale ...

Page 15: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

How are developers using it?

Game and social media analytics

Advertising campaign optimization

Sensor data analysis

Infrastructure monitoring

Page 16: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

● Show the power● Running your queries● Preparing your data● Visualizing your data● Underlying architecture design● Latest and greatest features

Agenda

Page 17: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Let's dive in!

Page 18: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

It's an API

Page 19: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Google Cloud Storage

Upload your Data

Page 20: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

BigQuery

Your data is loaded

Google Cloud Storage

Page 21: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Load your data into BigQuery"jobReference":{ "projectId":"605902584318"}, "configuration":{ "load":{ "destinationTable":{ "projectId":"605902584318", "datasetId":"my_dataset", "tableId":"widget_sales"}, "sourceUris":[ "gs://widget-sales-data/2012080100.csv"], "schema":{ "fields":[{ "name":"widget", "type":"string"}, ...

POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs

Page 22: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Query Away!

"jobReference":{ "projectId":"605902584318"}, "query":"SELECT TOP(widget, 50), COUNT(*) AS sale_count FROM widget_sales", "maxResults":100, "apiVersion":"v2"}

POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs

Page 23: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

● Java● Python● .NET● PHP● JavaScript● Apps Script● ... more ...

Libraries

Page 24: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Libraries - Example JavaScript query

var request = gapi.client.bigquery.jobs.query({ 'projectId': project_id, 'timeoutMs': '30000', 'query': 'SELECT state, AVG(mother_age) AS theav FROM [publicdata:samples.natality] WHERE year=2000 AND ever_born=1 GROUP BY state ORDER BY theav DESC;'});

request.execute(function(response) { console.log(response); $.each(response.result.rows, function(i, item) { ...

Page 25: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

BigQuery UI

bigquery.cloud.google.com

Page 26: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Preparing your Data

Page 27: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Schema definition

birth_record

parent_id_motherparent_id_fatherpluralityis_maleraceweight

parents

idraceagecigarette_usestate

Page 28: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Schema definition

birth_record

mother_racemother_agemother_cigarette_usemother_statefather_racefather_agefather_cigarette_usefather_statepluralityis_maleraceweight

Page 29: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Schema definition - sharding

birth_record_2011

mother_racemother_agemother_cigarette_usemother_statefather_racefather_agefather_cigarette_usefather_statepluralityis_maleraceweight

birth_record_2012

mother_racemother_agemother_cigarette_usemother_statefather_racefather_agefather_cigarette_usefather_statepluralityis_maleraceweight

birth_record_2013

birth_record_2014

birth_record_2015

birth_record_2016

Page 30: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Data format

1969,1969,1,20,,AL,TRUE,1,7.813,AL,1,20,true1971,1971,5,7,,NY,FALSE,1,7.213,MA,5,7,true2001,2001,12,5,,CA,TRUE,2,6.427,CA,12,5,true

CSV

Page 31: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Tools to prepare your data

● App Engine MapReduce● Commercial ETL tools

● Pervasive● Informatica● Talend

● UNIX command-line

Page 32: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Visualizing your Data

Page 33: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Commercial visualization tools

Page 34: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Google Spreadsheets

Page 35: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Custom code and the Google Visualization API

Page 36: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

BigQuery architecture

Page 37: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

“ If you do a table scan over a 1TB table, you're going to have a bad time. ”

Anonymous16th century Italian Philosopher-Monk

Page 38: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

● Reading 1 TB/ second from disk:● 10k+ disks

● Processing 1 TB / sec:● 5k processors

Goal: Perform a 1 TB table scan in 1 second

Parallelize Parallelize Parallelize!

Page 39: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Data access: Column Store

Record Oriented Storage Column Oriented Storage

Page 40: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Distributed Storage (e.g. GFS)

BigQuery Architecture

Mixer 0

Mixer 1Shard 0-8

Mixer 1Shard 17-24

Mixer 1Shard 9-16

Shard 0 Shard 10 Shard 12 Shard 24Shard 20

Page 41: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Running your Queries

Page 42: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

SELECT COUNT(foo), MAX(foo), STDDEV(foo) FROM ...

BigQuery SQL Example: Simple aggregates

Page 43: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

SELECT ... FROM ....WHERE REGEXP_MATCH(url, "\.com$") AND user CONTAINS 'test'

BigQuery SQL Example: Complex Processing

Page 44: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

SELECT COUNT(*) FROM (SELECT foo ..... )GROUP BY foo

BigQuery SQL Example: Nested SELECT

Page 45: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

BigQuery SQL Example: Small JOIN

SELECT huge_table.foo FROM huge_tableJOIN small_table ON small_table.foo = huge_table.foo

Page 46: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Distributed Storage (e.g. GFS)

BigQuery Architecture: Small Join

Mixer 0

Mixer 1Shard 0-8

Mixer 1Shard 17-24

Shard 0 Shard 24Shard 20

Page 47: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Other new features!

Page 48: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

Batch queries!

● Don't need interactive queries for some jobs?● priority: "BATCH"

Page 49: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

● API● Column-based datastore● Full table scans FAST● Aggregates● Commercial tool support● Use cases

That's it

Page 50: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >

SELECT questions FROM audience

SELECT 'Thank You!' FROM ryan

http://developers.google.com/bigquery

@ryguyrg http://profiles.google.com/ryan.boyd

Page 51: with BigQuery Crunching Big Data...Big Data at Google - Finding slow servers SELECT count(*) AS count, source_machine AS machine FROM product.product_log.live WHERE elapsed_time >