Python, PySpark and Riak TS Stephen Etheridge Lead Solution Architect, EMEA

Pydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge


Page 1: Pydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge

Python, PySpark and Riak TS

Stephen Etheridge, Lead Solution Architect, EMEA

Page 2: Pydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge

CONFIDENTIAL

Agenda

• Introduction to Riak TS
• The Riak Python client
• The Riak Spark connector and PySpark

Basho Technologies | 2

Page 3: Pydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge

Distributed Systems Software for Big Data, IoT and Hybrid Cloud applications

2011 Creators of Riak Distributed Systems
• Riak KV: Resilient NoSQL database
• Riak S2: Large Object Storage

2015 New Products
• Basho Data Platform: Integrated NoSQL databases, caching, in-memory analytics, and search
• Riak TS: Only Enterprise NoSQL database optimized for Time Series data

100+ employees

Global Offices • Seattle (HQ), Washington DC, London, Tokyo

Over 1/3 of the Fortune 50

BASHO SNAPSHOT

Page 4: Pydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge

MEETING THE NEEDS OF THE ENTERPRISE

PRIORITIZED NEEDS

High Availability - Critical Data

High Scale - Heavy Reads & Writes

Geo Locality - Multiple Data Centers

Operational Simplicity – Resources Don’t Scale as Clusters

Data Accuracy – Write Conflict Options

TIME SERIES USE CASES

• IoT/Devices
• Financial/Economic
• Scientific Observations

RIAK KV USE CASES

• User Data
• Session Data
• Profile Data
• Real-time Data
• Log Data

Page 5: Pydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge

20 TERABYTES OF DATA PER DAY

BILLIONS OF MOBILE DEVICES

10 BILLION data transactions a day – 150,000 a second – Apple

Forecasting 2.8 BILLION locations around the world

Generates 4GB OF DATA every second

We’re focusing on helping people make better decisions with the weather.

Page 6: Pydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge

WHAT IS NEEDED FOR TIME SERIES?

Efficient way to store & retrieve time series data

Query language that supports range queries

High data volume

Enterprise scale solution

High availability

Page 7: Pydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge

What is Riak TS?

Riak TS is Riak KV (a complete Riak KV build is included in Riak TS) with the following additional features optimized to handle time series use cases:

• Tables – Riak TS introduces tables built on top of the underlying K/V structure.

• SQL – Riak TS supports a subset of standard SQL to create and query time series data.

• Data Locality – Keys are co-located by quanta to enable querying data across time-bounded series.

Page 8: Pydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge

Riak TS Quanta

The quantum function in Riak TS takes three parameters:

• The name of a field in the table definition of type timestamp;

• A numeric quantity;

• One of the units of time from the list below:

  • Days – 'd'
  • Hours – 'h'
  • Minutes – 'm'
  • Seconds – 's'

Important: A query covering more than a certain number of quanta (5 by default) will generate too many sub-queries and the query system will refuse to run it. Assuming a quantum of 15 minutes, the maximum query time range is 75 minutes.
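The quanta limit can be checked on the client side before a query is sent. The sketch below is illustrative only: the function names are hypothetical, and it assumes quanta are aligned to multiples of the quantum size counted from the UNIX epoch, which the slides do not state.

```python
# Milliseconds per unit, keyed by the unit letters listed on the slide.
QUANTUM_UNITS_MS = {"d": 86400000, "h": 3600000, "m": 60000, "s": 1000}


def quanta_spanned(start_ms, end_ms, quantity=15, unit="m"):
    """Count the quanta touched by the inclusive range [start_ms, end_ms].

    Assumption: quantum boundaries fall on multiples of the quantum size
    since the UNIX epoch.
    """
    if end_ms < start_ms:
        raise ValueError("end_ms must not be earlier than start_ms")
    quantum_ms = quantity * QUANTUM_UNITS_MS[unit]
    first = start_ms // quantum_ms
    last = end_ms // quantum_ms
    return last - first + 1


def query_is_allowed(start_ms, end_ms, quantity=15, unit="m", max_quanta=5):
    """Return True if the range stays within the quanta limit (5 by default)."""
    return quanta_spanned(start_ms, end_ms, quantity, unit) <= max_quanta
```

Under that alignment assumption, a 75 minute range that does not start on a quantum boundary touches six quanta rather than five, so the 75 minute figure is an upper bound.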

Page 9: Pydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge

Supported Aggregate Functions

Riak TS supports aggregate functions including:

• COUNT() - Returns the number of entries that match the specified criteria.
• SUM() - Returns the sum of entries that match the specified criteria.
• MEAN() & AVG() - Return the average of entries that match the specified criteria.
• MIN() - Returns the smallest value of entries that match the specified criteria.
• MAX() - Returns the largest value of entries that match the specified criteria.
• STDDEV() - Returns the population standard deviation of all entries that match the specified criteria.
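Population versus sample standard deviation matters when comparing STDDEV() results with figures computed elsewhere. A small sketch using Python's standard library (the temperature values are made up for illustration):

```python
import statistics

# Hypothetical temperature readings, for illustration only.
temperatures = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Population standard deviation divides by N; per the slide, this is the
# variant STDDEV() uses.
population_sd = statistics.pstdev(temperatures)

# Sample standard deviation divides by N - 1, so it is slightly larger.
sample_sd = statistics.stdev(temperatures)

print(population_sd, sample_sd)
```

Tools that default to the sample form will not match STDDEV() exactly on the same data.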

Page 10: Pydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge

Supported Data Types

Riak TS tables support the following data types:

• Varchar - Any string content is valid, including Unicode. Can only be compared using strict equality, and will not be typecast (e.g., to an integer) for comparison purposes. Use single quotes to delimit varchar strings.

• Double - This type does not fully comply with the IEEE 754 specification: NaN (not a number) and INF (infinity) cannot be used.

• Sint64 - Signed 64-bit integer

• Boolean - true or false (any case)

• Timestamps - Timestamps are integer values expressing UNIX epoch time in UTC in milliseconds. Zero is not a valid timestamp.
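Converting between Python datetimes and millisecond epoch values is a common source of off-by-1000 and timezone mistakes. A minimal sketch follows; the helper names are hypothetical and it uses only the standard library.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_riak_ts_millis(dt):
    """Convert a timezone-aware datetime to UTC epoch milliseconds."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # Integer division of timedeltas avoids floating-point rounding.
    millis = (dt - EPOCH) // timedelta(milliseconds=1)
    if millis == 0:
        raise ValueError("zero is not a valid Riak TS timestamp")
    return millis


def from_riak_ts_millis(millis):
    """Convert UTC epoch milliseconds back to a timezone-aware datetime."""
    return EPOCH + timedelta(milliseconds=millis)
```

For example, 2016-01-19 17:30:10 UTC corresponds to 1453224610000.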

Page 11: Pydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge

Developing on Riak TS

Riak TS currently supports the Protocol Buffers API and five client libraries including Java, Ruby, Python, Erlang, and Node.js.

APIs
• Protocol Buffers

Basho Clients
• Java
• Ruby
• Python
• Erlang
• Node.js
• .NET C#

Community Clients
• Not yet!

Page 12: Pydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge

Supported Operations

Riak TS clients currently support the following operations:

• Delete - Deletes a single row by its key values.
• Fetch/Get - Fetches a single row by its key values.
• Query - Allows you to query a Riak TS table with the given query string.
• Store/Put - Stores data in the Riak TS table.
• (Stream) ListKeys - Lists the primary keys of all the rows in a Riak TS table.

Page 13: Pydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge

The Riak Python Client

• Compatible with Python 2.7 and above
• Can be installed easily with pip
• Pre-requisites
  – python-dev
  – libffi-dev
  – libssl-dev
• Riak TS results object can be turned into a Pandas dataframe easily, otherwise it is a list of lists!

• Demo with Aarhus data
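Because a result is a list of lists, the column names have to be attached separately. The sketch below does this with the standard library only; the column names and rows are made-up sample data, not output from the demo.

```python
def rows_to_records(column_names, rows):
    """Pair each row (a list of values) with the column names."""
    records = []
    for row in rows:
        if len(row) != len(column_names):
            raise ValueError("row length does not match the number of columns")
        records.append(dict(zip(column_names, row)))
    return records


# Hypothetical sample data in the list-of-lists shape described above.
columns = ["myfamily", "myseries", "time", "weather", "temperature"]
rows = [
    ["family1", "series1", 1453224610000, "cloudy", 12.5],
    ["family1", "series1", 1453224670000, "rain", 11.9],
]

records = rows_to_records(columns, rows)
print(records[0]["temperature"])
```

With pandas installed, the same inputs can be passed to `pandas.DataFrame(rows, columns=columns)`.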

Page 14: Pydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge

Riak Spark Connector

• Enables you to connect Spark applications to Riak TS with the Spark RDD and Spark DataFrames APIs
• Write applications in
  – Scala (if you have to),
  – Python (yay!),
  – and Java (never!).
• Makes it easy to partition Riak data so multiple Spark workers can process the data in parallel,
• Has support for failover if a Riak node goes down while your Spark job is running.
• Comes as one JAR file that needs to be pathed in!

– Riak TS 1.2+
– Apache Spark 1.6+
– Scala 2.10
– Java 8
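A rough sketch of what reading a Riak TS table into a Spark DataFrame from PySpark might look like. Everything specific here is an assumption to verify against the connector documentation for your version: the data source name `org.apache.spark.sql.riak`, the option key `spark.riak.connection.host`, and the hypothetical helper names. The `sql_context` argument is expected to be an existing Spark 1.6 `SQLContext` with the connector JAR on its classpath.

```python
# Assumed data source name for the Riak Spark connector (verify before use).
RIAK_SOURCE = "org.apache.spark.sql.riak"


def riak_reader_options(riak_host):
    """Reader options; the key name is an assumption, not confirmed here."""
    return {"spark.riak.connection.host": riak_host}


def load_riak_ts_table(sql_context, table_name, riak_host, row_filter):
    """Load rows of a Riak TS table matching row_filter as a DataFrame.

    row_filter is a SQL-style predicate string that should include a
    bounded time range, for example:
    "time >= 1453224610000 AND time <= 1453225490000 AND ..."
    """
    reader = sql_context.read.format(RIAK_SOURCE)
    for key, value in riak_reader_options(riak_host).items():
        reader = reader.option(key, value)
    return reader.load(table_name).filter(row_filter)
```

The function only builds on the standard DataFrame reader chain (`read.format(...).option(...).load(...).filter(...)`), so it can be dropped into an existing PySpark job once the names above are confirmed.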

Page 15: Pydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge
Page 16: Pydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge

Riak TS Tables

Riak TS tables are a new Riak KV Bucket Type (and there is a one-to-one mapping of tables to bucket types). Tables are created using the riak-admin command line or via one of the supported clients:

CREATE TABLE GeoCheckin (
  myfamily varchar not null,
  myseries varchar not null,
  time timestamp not null,
  weather varchar not null,
  temperature double,
  PRIMARY KEY (
    (myfamily, myseries, quantum(time, 15, 'm')),
    myfamily, myseries, time
  )
)

> riak-admin bucket-type create GeoCheckin '{"props": {"table_def": "…"}}'
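Embedding a CREATE TABLE statement inside JSON inside a shell argument is easy to get wrong by hand, since the statement itself contains single quotes. A sketch that builds the command string with the standard library; the helper names are hypothetical and nothing is executed.

```python
import json
import shlex

# The table definition from the slide, as a single-line string.
TABLE_DDL = (
    "CREATE TABLE GeoCheckin ("
    " myfamily varchar not null,"
    " myseries varchar not null,"
    " time timestamp not null,"
    " weather varchar not null,"
    " temperature double,"
    " PRIMARY KEY ((myfamily, myseries, quantum(time, 15, 'm')),"
    " myfamily, myseries, time))"
)


def bucket_type_props(table_ddl):
    """Wrap the DDL in the {"props": {"table_def": ...}} JSON structure."""
    return json.dumps({"props": {"table_def": table_ddl}})


def riak_admin_command(table_name, table_ddl):
    """Build the shell command line, quoting the JSON argument safely."""
    props = bucket_type_props(table_ddl)
    return "riak-admin bucket-type create {} {}".format(
        shlex.quote(table_name), shlex.quote(props)
    )


print(riak_admin_command("GeoCheckin", TABLE_DDL))
```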

Page 17: Pydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge

Partition and Local Keys

Riak TS has two types of keys that help determine how to distribute data across a cluster and within local partitions of data:

• Partition keys – The partition key determines where data is placed within a cluster (by vnode)

  • Family – class or type of data (e.g. user, device type)

  • Series – identifies the specific instances of the class/type, such as username or device ID

  • Quanta – the time interval to group data by

• Local keys – Local keys determine where and how data is written within the vnode (currently identical to the partition key)
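A simplified illustration of how family, series, and quantum combine, modeled on the GeoCheckin primary key with a 15 minute quantum. This is not Riak's actual placement logic: the function names are hypothetical, and it assumes quanta are aligned to the UNIX epoch.

```python
QUANTUM_MS = 15 * 60 * 1000  # 15 minutes, as in the GeoCheckin table


def quantum_start(timestamp_ms, quantum_ms=QUANTUM_MS):
    """Round a timestamp down to the start of its quantum (assumed epoch-aligned)."""
    return timestamp_ms - (timestamp_ms % quantum_ms)


def partition_key(family, series, timestamp_ms, quantum_ms=QUANTUM_MS):
    """Rows sharing this tuple would be co-located in this simplified model."""
    return (family, series, quantum_start(timestamp_ms, quantum_ms))


def local_key(family, series, timestamp_ms):
    """Identifies a single row, mirroring (myfamily, myseries, time)."""
    return (family, series, timestamp_ms)


a = partition_key("family1", "series1", 1453224610000)
b = partition_key("family1", "series1", 1453225490000)
c = partition_key("family1", "series1", 1453225500000)
print(a == b, a == c)
```

In this model the first two timestamps fall in the same 15 minute quantum and share a partition key, while the third starts a new quantum.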

Page 18: Pydata london meetup - RiakTS, PySpark and Python by Stephen Etheridge

Querying Riak TS

select * from WeatherStationData where time > 1453224610000 and time < 1453225490000 and device = 'Weather Station 0001' and deviceId = 'abc-xxx-001-001'

select MIN(temperature), AVG(temperature), MAX(temperature) from WeatherStationData where time > 1453224610000 and time < 1453225490000 and device = 'Weather Station 0001' and deviceId = 'abc-xxx-001-001'

select (temperature * 2), (pressure - 1) from WeatherStationData where time > 1453224610000 and time < 1453225490000 and device = 'Weather Station 0001' and deviceId = 'abc-xxx-001-001'

Riak TS currently supports a subset of the SQL language that includes basic aggregate functions and arithmetic operations.
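The queries above share one shape: a select list, a bounded time range, and equality conditions on the remaining key fields. A sketch of a helper that assembles such a query string in Python; the function name is hypothetical, and rejecting embedded single quotes is a deliberately conservative assumption rather than a documented escaping rule.

```python
def build_range_query(table, select_exprs, start_ms, end_ms, equals):
    """Build a query string with an exclusive time range and equality filters.

    equals is a list of (column, string_value) pairs, kept in order.
    """
    if end_ms <= start_ms:
        raise ValueError("end_ms must be later than start_ms")
    conditions = [
        "time > {}".format(int(start_ms)),
        "time < {}".format(int(end_ms)),
    ]
    for column, value in equals:
        if "'" in value:
            # Refuse rather than guess at an escaping rule.
            raise ValueError("single quotes are not supported in values")
        conditions.append("{} = '{}'".format(column, value))
    return "select {} from {} where {}".format(
        ", ".join(select_exprs), table, " and ".join(conditions)
    )


query = build_range_query(
    "WeatherStationData",
    ["MIN(temperature)", "AVG(temperature)", "MAX(temperature)"],
    1453224610000,
    1453225490000,
    [("device", "Weather Station 0001"), ("deviceId", "abc-xxx-001-001")],
)
print(query)
```

The resulting string can then be handed to whichever client query call is in use.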