Cloud and Big Data trends

Sebastien Goasguen, January 29th

@sebgoa

Cloud and Big Data


A view on Big Data

http://www.economist.com/node/15557443?story_id=15557443

SKA


How did we get there?

A natural evolution

New Distributed systems for:

Large scale datasets
• From scientific instruments
• From Web application logs

Complex datasets
• Not necessarily large

Object stores
• S3 clones

BigData and map-reduce

• While BigData is often associated with HDFS, HDFS is only the storage layer; Map-Reduce is the programming model used to parallelize data processing.

• BigData ≠ Map-Reduce ≠ HDFS
• Map-Reduce is a way to express embarrassingly parallel work easily.
• You can do Map-Reduce without HDFS.
  • e.g. Basho map-reduce on riakCS
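The point that Map-Reduce does not depend on HDFS can be illustrated with a minimal sketch in plain Python. Everything here is an assumption made for illustration: the `CHUNKS` input, the function names, and the use of threads as a stand-in for cluster nodes. It is not how Hadoop or Riak implement the model.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical input: each chunk could come from any store
# (local disk, an object store, a key-value store), not only HDFS.
CHUNKS = [
    "big data is not map reduce",
    "map reduce is not hdfs",
    "hdfs is storage",
]


def map_phase(chunk):
    """Emit one (word, 1) pair per word; each chunk is processed independently."""
    return [(word, 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group the emitted values by key so each key can be reduced on its own."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Fold each key's list of values into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


def word_count(chunks, workers=3):
    # The map calls share no state, which is what makes the work
    # embarrassingly parallel; threads stand in for cluster nodes here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(CHUNKS))
```

Only the map step runs in parallel in this sketch; a real framework would also distribute the shuffle and reduce steps and handle node failures.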


A really quick view on Clouds

Open Source IaaS

Today

BigData at peak

History

2003 – Google File System
2005 – Hadoop
2006 – Hadoop enters ASF incubator (Feb)
2006 – S3 launched
2007 – Paper on Amazon Dynamo
2009 – EMR launched
2013 – CloudStack as an ASF TLP (March)
2013 – Spark/Mesos enters ASF incubator


The Apache Software Foundation

Apache Software Foundation

35 projects in incubation:
• 12 Hadoop related
• ~30% Big Data related
• Spark

117 top level projects:
• ~16 cloud or bigdata +10%
• Deltacloud, Libcloud, Whirr, jclouds
• Hadoop, couchdb, cassandra, mesos
• Bigtop, accumulo, lucene, UIMA
• CloudStack

Hadoop Ecosystem

+ Up-coming next generation BD systems


Big Data and Cloud (Stack)s

Clouds and BigData

• Object store + compute IaaS to build EC2+S3 clone

• BigData solutions as storage backends for image catalogue and large scale instance storage.

• BigData solutions as workloads to CloudStack based clouds.

EC2, S3 clone
• An open source IaaS with an EC2 wrapper, e.g. OpenNebula
• Deploy an S3 compatible object store separately, e.g. riakCS
• Two independent distributed systems deployed

Cloud = EC2 + S3

Big Data as IaaS backend

“Big Data” solutions can be used as secondary storage.

Example
• Open source IaaS + EC2 wrapper, e.g. CloudStack
• Deploy S3 compatible object store, e.g. riakCS or Ceph or glusterFS
• Use S3 as image store
• Your EC2 service is a customer of your S3 service
• Logstash + elasticsearch for logs/monitoring
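The relationship in the example above, where the compute service is a client of the object store, can be sketched as a toy in-memory model. The class names `ObjectStore` and `ComputeService`, their methods, and the `images` bucket are all hypothetical. They do not correspond to the CloudStack, riakCS, or S3 APIs; they only show the direction of the dependency.

```python
class ObjectStore:
    """Hypothetical stand-in for an S3 compatible store: buckets of keyed blobs."""

    def __init__(self):
        self._buckets = {}

    def put(self, bucket, key, data):
        self._buckets.setdefault(bucket, {})[key] = data

    def get(self, bucket, key):
        return self._buckets[bucket][key]


class ComputeService:
    """Hypothetical stand-in for the EC2-like side of the cloud."""

    def __init__(self, image_store, bucket="images"):
        # The compute service holds a reference to the object store
        # and uses it like any other customer would.
        self.image_store = image_store
        self.bucket = bucket
        self.instances = []

    def register_image(self, name, data):
        """Upload a machine image to the object store."""
        self.image_store.put(self.bucket, name, data)

    def launch(self, image_name):
        """Fetch the image from the object store and record a new instance."""
        image = self.image_store.get(self.bucket, image_name)
        instance = {
            "id": len(self.instances) + 1,
            "image": image_name,
            "size": len(image),
        }
        self.instances.append(instance)
        return instance


if __name__ == "__main__":
    store = ObjectStore()
    cloud = ComputeService(store)
    cloud.register_image("ubuntu", b"fake-image-bytes")
    print(cloud.launch("ubuntu"))
```

Because the two classes only meet at `put` and `get`, the object store can be deployed, scaled, and replaced independently of the compute layer.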

Even use Bare Metal


Big Data as a Workload to the Cloud

Mesos, Spark are EC2 native

• ec2_deploy.py
• ec2_deploy.sh
• …

Tools

“PaaS”

Dev Pipeline

Conclusions

• Big Data is “catching up”
• Tackle the big three head on:
  • BigData, Cloud and DevOps
• Add a big data backend to your cloud from the start
• Provide Big Data services on your cloud

Still behind!

Final Thoughts

Who manages my data transfers?

Event

ApacheCON + CloudStack Collaboration Conference

Denver April 7-11th.

Cloud and Big Data

Get Involved with Apache CloudStack

Web: http://cloudstack.apache.org/

Mailing Lists: cloudstack.apache.org/mailing-lists.html

IRC: irc.freenode.net: 6667 #cloudstack #cloudstack-dev

Twitter: @cloudstack

LinkedIn: www.linkedin.com/groups/CloudStack-Users-Group-3144859

If it didn’t happen on the mailing list, it didn’t happen.
