Analysing huge datasets with the Crate datastore, using either the bare Crate Python client or SQLAlchemy.
Big Data Analysis with Crate and Python
Matthias Wahl, developer @ crate.io
Email: [email protected]
Crate
shared-nothing, massively scalable datastore
standing on the shoulders of giants
get it at: https://crate.io/download
# bash -c "$(curl -L try.crate.io)"
automatic sharding and replication
(semi-)structured models
single table only
SQL query language
all common SQL types (and more)
powerful aggregations ('GROUP BY')
linear scalability: data and query execution are distributed
basic arithmetic (next release, 0.39)
Aggregation Execution
SELECT station_name, max(temp), avg(temp), min(temp), count(distinct date)
FROM weather_de
WHERE temp != -999
GROUP BY station_name
ORDER BY station_name ASC;
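To make the shape of the result concrete, here is a minimal sketch that runs the same query against a tiny in-memory table. It uses Python's built-in sqlite3 module purely as a local stand-in for Crate, the rows are invented for illustration, and reading -999 as a "no reading" sentinel is an assumption based on the WHERE clause.

```python
import sqlite3

# Invented toy rows: (station_name, date, temp). Not real weather data.
ROWS = [
    ("Konstanz", "2011-04-21", 12.0),
    ("Konstanz", "2011-04-22", 16.0),
    ("Konstanz", "2011-04-23", -999.0),  # assumed sentinel for "no reading"
    ("Hamburg", "2011-04-21", 9.0),
    ("Hamburg", "2011-04-21", 11.0),     # second reading on the same date
]

# Same query as on the slide.
QUERY = """
    SELECT station_name, max(temp), avg(temp), min(temp), count(DISTINCT date)
    FROM weather_de
    WHERE temp != -999
    GROUP BY station_name
    ORDER BY station_name ASC
"""


def run_query(rows):
    """Load the rows into an in-memory SQLite table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE weather_de (station_name TEXT, date TEXT, temp REAL)"
    )
    conn.executemany("INSERT INTO weather_de VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


result = run_query(ROWS)
for row in result:
    print(row)
```

Each output row holds the station name, the maximum, average and minimum temperature, and the number of distinct dates with a valid reading.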
[Diagram sequence, nodes labelled H, M, M, M, R, R, R: request, collect, hash based distribution, group results, final reduce, response]
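The stages named in the diagram (collect, hash based distribution, group results, final reduce) can be sketched in plain Python. This is an illustrative single-process model, not Crate's actual implementation: the function names, the number of reducers, the use of CRC32 as the hash, and the partial-state layout are all assumptions. It covers max, avg and min only, because count(distinct date) cannot be merged from plain per-node counts.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 3  # hypothetical number of reducer nodes


def collect(shard_rows):
    """Collect: pre-aggregate the local (station, temp) rows of one node."""
    partial = {}
    for station, temp in shard_rows:
        if temp == -999:  # same filter as the WHERE clause
            continue
        state = partial.setdefault(
            station, {"max": temp, "min": temp, "sum": 0.0, "count": 0}
        )
        state["max"] = max(state["max"], temp)
        state["min"] = min(state["min"], temp)
        state["sum"] += temp
        state["count"] += 1
    return partial


def reducer_for(key):
    """Hash based distribution: a stable hash maps a group key to a reducer."""
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS


def distribute(partials):
    """Route every partial state to the reducer that owns its group key."""
    buckets = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for partial in partials:
        for station, state in partial.items():
            buckets[reducer_for(station)][station].append(state)
    return buckets


def group(bucket):
    """Group results: merge all partial states for each key on one reducer."""
    return {
        station: {
            "max": max(s["max"] for s in states),
            "min": min(s["min"] for s in states),
            "sum": sum(s["sum"] for s in states),
            "count": sum(s["count"] for s in states),
        }
        for station, states in bucket.items()
    }


def final_reduce(grouped):
    """Final reduce: concatenate reducer outputs, finish avg, and sort."""
    rows = [
        (station, s["max"], s["sum"] / s["count"], s["min"])
        for merged in grouped
        for station, s in merged.items()
    ]
    return sorted(rows)


def aggregate(shards):
    partials = [collect(rows) for rows in shards]
    buckets = distribute(partials)
    return final_reduce([group(bucket) for bucket in buckets])


# Invented toy data: three nodes, each holding part of the table.
shards = [
    [("Konstanz", 12.0), ("Hamburg", 9.0)],
    [("Konstanz", 16.0), ("Konstanz", -999)],
    [("Hamburg", 11.0)],
]
print(aggregate(shards))
```

The average is carried as a (sum, count) pair until the last step, since averages of partial averages would be wrong when nodes hold different numbers of rows. Because every key hashes to exactly one reducer, the final step only has to concatenate and sort.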
Using the python client
>>> from crate.client.http import Client
>>> client = Client(["127.0.0.1:4200"])
>>> response = client.sql("select * from weather_de limit 1")
>>> print(response)
{
    u'duration': 659,
    u'rowcount': 1,
    u'rows': [
        [1303365600000, 82.0, None, None, None, 0, u'954', 54.1667, 7.45,
         u'UFS Deutsche Bucht', 60.0, 10.9, 100, 5.2]
    ],
    u'cols': [u'date', ...]
}
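The response shown above keeps column names ('cols') and row values ('rows') in separate lists. A small helper can pair them up into one dict per row, which is often easier to work with. This is a sketch: rows_as_dicts is a hypothetical helper, not part of the Crate client, and the sample response below is invented with made-up column names so the snippet runs without a server.

```python
def rows_as_dicts(response):
    """Pair each row with the column names from a response dict."""
    cols = response["cols"]
    return [dict(zip(cols, row)) for row in response["rows"]]


# Invented response in the same shape as the one printed on the slide.
sample = {
    "duration": 3,
    "rowcount": 2,
    "cols": ["station_name", "temp"],
    "rows": [["Konstanz", 12.0], ["Hamburg", None]],
}

records = rows_as_dicts(sample)
for record in records:
    print(record)
```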
Using SQLAlchemy
>>> import sqlalchemy as sa
>>> from sqlalchemy.ext.declarative import declarative_base
>>> from sqlalchemy.orm import sessionmaker
>>> engine = sa.create_engine("crate://localhost:4200")
>>> Base = declarative_base()
>>> from sqlalchemy import Column, DateTime, Float, Integer, String
>>> class Weather(Base):
...     __tablename__ = 'weather_de'
...
...     station_id = Column('station_id', String, primary_key=True)
...     station_name = Column('station_name', String)
...     station_lat = Column('station_lat', Float)
...     station_long = Column('station_lon', Float)
...     station_height = Column('station_height', Integer)
...     date = Column('date', DateTime, primary_key=True)
...     temp = Column('temp', Float)
...     humility = Column(Float)
...     sunshine_hours = Column(Float)
...     wind_speed = Column(Float)
...     wind_direction = Column(Integer)
...     rainfall_fallen = Column(Integer)
...     rainfall_height = Column(Float)
...     rainfall_form = Column(Integer)
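>>> from sqlalchemy import func
>>> # Assumed setup: the slides use DBSession without defining it.
>>> DBSession = sessionmaker(bind=engine)()
>>> res = (
...     DBSession.query(
...         Weather.station_name,
...         func.avg(Weather.temp)
...     )
...     .group_by(Weather.station_name)
...     .order_by(Weather.station_name)
...     .limit(10)
...     .all()
... )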
SELECT station_name, avg(temp) FROM weather_de GROUP BY station_name ORDER BY station_name LIMIT 10;
# Average sunshine hours
from sqlalchemy.sql import func
DBSession.query(func.avg(Weather.sunshine_hours)).scalar()

# Average sunshine hours in Konstanz
DBSession.query(func.avg(Weather.sunshine_hours)).filter(Weather.station_name == 'Konstanz').scalar()
Feature Requests
I’m no data scientist
Please tell us what you would like to see in Crate.
CRATE
Thank you
web: https://crate.io/
github: https://github.com/crate
twitter: @cratedata
IRC: #crate
stackoverflow tag: cratedata