BigData in Cloud computing Viet-Trung Tran @Vietstack Sunday 1 February 15

Viet stack 2nd meetup - BigData in Cloud Computing

Bio

Viet-Trung Tran

[email protected]

https://www.facebook.com/groups/BigDataStartUp/

SoICT, Trendiction S.A Luxembourg, Microsoft Research Cambridge, INRIA France, BKAV

Google Trends (chart annotations: Google MapReduce paper, 2014)

BigData in science

The Data Science: The 4th Paradigm for Scientific Discovery

Thousand years ago: Description of natural phenomena

Last few hundred years: Newton’s laws, Maxwell’s equations…

$$\left(\frac{\dot{a}}{a}\right)^2 = \frac{4\pi G\rho}{3} - K\frac{c^2}{a^2}$$

Last few decades: Simulation of complex phenomena

Today and the Future

Credits: Dennis Gannon

What’s BigData

Data has always been big. What differs now from the past is the sheer scale and accessibility of data, a direct result of the speed at which data can now be processed. Big Data is therefore an all-encompassing term for any collection of data sets so large that they were once difficult to process.

Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times.

Data mining -> BigData mining?

Simplified BigData stack

Data analytics & visualization

Data processing frameworks (Streaming, MapReduce, BSP model)

Data management systems BlobSeer

BigData management

NoSQL

The last 25 years of commercial DBMS development can be summed up in a single phrase: "one size fits all". This phrase refers to the fact that the traditional DBMS architecture (originally designed and optimized for business data processing) has been used to support many data-centric applications with widely varying characteristics and requirements. In this paper, we argue that this concept is no longer applicable to the database market, and that the commercial world will fracture into a collection of independent database engines, some of which may be unified by a common front-end

Why NoSQL

“The whole point of seeking alternatives [to RDBMS systems] is that you need to solve a problem that relational databases are a bad fit for.” Eric Evans - Rackspace

ACID does not scale

Web applications have different needs

Scalability

Elasticity

Flexible schema / semi-structured data

Geographically distributed

Web applications do not always need

Transactions

Strong consistency

Complex queries
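
To make "flexible schema / semi-structured data" concrete, here is a minimal sketch of a schemaless document store in plain Python. The `TinyDocStore` class, its methods, and the sample records are hypothetical and invented for illustration; they do not reproduce the API of any real NoSQL system, and the sketch ignores distribution, persistence, and consistency entirely.

```python
from typing import Any, Dict, List, Optional


class TinyDocStore:
    """Toy in-memory document store (hypothetical, illustration only)."""

    def __init__(self) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}

    def put(self, key: str, doc: Dict[str, Any]) -> None:
        # No schema check: any set of fields is accepted as-is.
        self._docs[key] = dict(doc)

    def get(self, key: str) -> Optional[Dict[str, Any]]:
        # Plain key lookup; returns None when the key is unknown.
        return self._docs.get(key)

    def find(self, field: str, value: Any) -> List[str]:
        # Return sorted keys of documents whose field equals value.
        # Documents that lack the field are simply skipped.
        return sorted(
            key
            for key, doc in self._docs.items()
            if field in doc and doc[field] == value
        )


store = TinyDocStore()
store.put("u1", {"name": "An", "city": "Hanoi"})
store.put("u2", {"name": "Binh", "city": "Hanoi", "tags": ["hadoop"]})
store.put("u3", {"name": "Chi"})  # no "city" field at all
hanoi_users = store.find("city", "Hanoi")
```

Each document carries its own set of fields, so adding `tags` to one record needs no migration of the others. The price is that queries must tolerate missing fields, which is one reason complex queries are harder in this model.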

Big Data processing engines

MapReduce
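
As a rough illustration of the MapReduce programming model, the sketch below runs a word count in a single Python process. It only mimics the three logical phases (map, shuffle, reduce); the function names are assumptions made for this example, and a real engine such as Hadoop would distribute each phase across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    # Map: emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # Shuffle: group all emitted values by their key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    # Reduce: collapse the values of one key into a single count.
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data in cloud", "Big Data on cloud"])
```

Because `map_phase` looks at one line at a time and `reduce_phase` at one key at a time, both can in principle run in parallel on different nodes; only the shuffle step needs to move data between them.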

Stream processing

Large scale graph processing

2012

2014

Vanilla Hadoop ecosystem

Hortonworks Data Platform

Hadoop ecosystem: Microsoft HDInsight

BigData & Cloud

A match made in heaven?

Cloud features

Data in the Clouds

IDC estimates that by 2020 about 40% of data globally will be touched by cloud computing.

Cloud adoption is accelerating: the amount of data stored in Amazon Web Services (AWS) S3 cloud storage jumped from 262 billion objects in 2010 to over 1 trillion objects by the end of the second quarter of 2012.

While enterprises often keep their most sensitive data in-house, huge volumes of data such as social media data may be located externally.

Data that is too big to process is also too big to transfer anywhere, so it is the analytical program that needs to be moved, not the data.

"You don't want to be shipping terabytes and petabytes around." "Keep the data where it is, and then you move the analytics … to that data."

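
The idea of moving the analytics to the data can be sketched as follows. Each node runs a small function next to its own records and returns only a tiny partial result, which the caller then combines. The `DataNode` class, `run_local` method, and sample values are hypothetical names chosen for this illustration, and in-memory lists stand in for what would be large remote data sets.

```python
from typing import Callable, List, Sequence, Tuple


class DataNode:
    """Hypothetical storage node that can run a function next to its data."""

    def __init__(self, name: str, records: Sequence[float]) -> None:
        self.name = name
        self._records = list(records)

    def run_local(
        self, func: Callable[[List[float]], Tuple[float, int]]
    ) -> Tuple[float, int]:
        # The function travels to the node; only its small result travels back.
        return func(self._records)


def partial_stats(records: List[float]) -> Tuple[float, int]:
    # Per-node summary: (sum, count) is enough to rebuild a global mean.
    return sum(records), len(records)


def global_mean(nodes: Sequence[DataNode]) -> float:
    # Combine per-node summaries; raw records never leave their node.
    # Assumes at least one record exists across all nodes.
    partials = [node.run_local(partial_stats) for node in nodes]
    total = sum(part_sum for part_sum, _ in partials)
    count = sum(part_count for _, part_count in partials)
    return total / count


nodes = [DataNode("node-a", [1.0, 2.0, 3.0]), DataNode("node-b", [4.0, 5.0])]
mean = global_mean(nodes)
```

However large each node's data set grows, the amount shipped back per node stays at two numbers, which is the property the slide argues for.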

Cloud enables BigData

Some of the first adopters of big data in cloud computing are users that deployed Hadoop clusters in highly scalable and elastic clouds: IBM, Azure, AWS

Cloud computing democratizes big data: any enterprise can now work with unstructured data at a huge scale.

Analytics-as-a-service (AaaS) models for cloud-based big data analytics

Drivers for big data on cloud adoption

Cost reduction

Managing cloud-based big data is cost-effective, scalable, and fast to build.

Rapid provisioning/time to market

Faster provisioning is important for big data applications because the value of data reduces quickly as time goes by. 

Flexibility/scalability

Big data analysis, especially in the life sciences industry, requires huge compute power for a brief amount of time. For this type of analysis, servers need to be provisioned in minutes.

BigData is not always Cloud-appropriate

Low latency realtime data

Virtualization overhead

Multi-tenancy overhead

Scalability

Lack of cloud computing features to support RDBMS

Availability

“Rain cloud” incorporates clouds

Data integrity/privacy

Data can only be accessed by authorized users

Currently, encryption is utilized by most researchers to ensure data privacy in the cloud

NoSQL vs SQL in the Cloud

Data security/performance trade-offs

Distributed nodes

Distributed data

Internode communication

RPC over TCP/IP?

Encrypted IO?

Security/performance trade-offs

Cloud Architecture for Big Data

Resource scheduling and SLA for Big Data on Cloud

Storage and computation management in Cloud for Big Data

Large-scale data intensive workflow in support of Big Data processing on Cloud

Multiple source data processing and integration on Cloud

Virtualisation and visualisation of Big Data on Cloud

Fault tolerance and reliability for Big Data processing on Cloud

MapReduce with Cloud for Big Data processing

Distributed file storage system with Cloud for Big Data

Inter-cloud technology for Big Data

Security, privacy and trust in Big Data processing on Cloud

Green, energy-efficient models and sustainability issues in Cloud for Big Data processing

Cloud infrastructure for social networking with Big Data

User friendly Cloud access for Big Data processing

Innovative Cloud data centre networking for Big Data

Wireless and mobility support in Cloud data centre for Big Data

BigData use cases

Security Analytics

Thank you for your attention

Classification of BigData

Relationship between Cloud and BigData

Open research issues

Data staging

Distributed storage systems: NoSQL, NewSQL

Data analysis

Data security

Unfortunately, it’s not all good news.

DB administrators don’t have an easy ride. The NoSQL databases that have appeared in the last few years, with their key-value pairs, document stores, and missing schemas,
