
Cloud Computing and Big Data

Xiaofeng Meng, Renmin University of China

Forum of Future Data

FFD, 2012, Wuyishan

Outline

1 Introduction to Big Data

2 Cloud Computing and Big Data

3 Challenging Problems

4 Our Work

5 Conclusion


Big Data is so hot!

Google Trends of Big Data

Big Data Across the Federal Government (USA, March 2012)

What is Big Data?

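DB (Database) vs. BD (Big Data)

"Small data": Very Large Database (VLDB), MB scale, structured data

Treats data as an object and solves its storage and management problems

Data Engineering

Big Data: Extremely Large Database (XLDB), >PB, unstructured data

Treats data as a resource for solving problems in many domains

Data Thinking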

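What Can Big Data Do?

Wall Street sells stocks based on public sentiment

Hedge funds analyze companies' product sales from customer reviews on shopping sites

Banks infer the employment rate from the number of openings on job sites

Investment institutions collect and analyze the statements of listed companies, looking for early signs of bankruptcy

The US Centers for Disease Control and Prevention analyzes the worldwide spread of influenza and other epidemics from web searches

US President Obama's campaign team analyzes voters' preferences among presidential candidates in real time from voters' microblogs

Prediction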

Big Data Application

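Application            Users   Precision        Reliability     Data volume   Response
Scientific computing   Few     Very high        Low to medium   Tera          Slow
Stock trading          Many    High             Very high       Giga          Fast
Web data               Many    Medium to high   Medium          Peta          Fast
Microblog data         Many    Medium to high   Medium          100 Peta      Fast
...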


Cloud Computing and Big Data

Cloud Computing is like a highway that can carry many kinds of traffic

Big Data can be seen as one vehicle on that highway

Cloud Computing is the infrastructure, while Big Data is what it serves

Big Data Analysis Pipeline

Acquisition → Extraction & Cleaning → Integration → Analysis → Interpretation

Cloud computing can greatly support each of these steps

From:


Data, Data and Data!

Data is all around you!

Data types are varied

Most data is held by companies

It is difficult for researchers to obtain the data

No One Size Fits All

Web data, scientific data, financial data, moving object data, ...

Big Data Analytics Tools in Use

Oracle Exadata                             21%
Microsoft SQL PDW                          18%
IBM DB2 Smart Analytics System             12%
Hadoop/MapReduce                           11%
IBM Netezza                                10%
HP Vertica                                  9%
Teradata EDW                                9%
EMC Greenplum                               8%
Sybase IQ                                   4%
Infobright                                  2%
Kognitio WX2                                1%
ParAccel Analytic Database                  1%
Other                                       3%
We aren't using big data analytics tools   35%
Don't know                                 11%

"Fishing in the ocean" vs. "fishing in a pond"

"Data is widely available; what is scarce is the ability to extract wisdom from it."

Parallelism

Parallelism across nodes in a cluster

Parallelism within a single node

Cloud Computing

New hardware: SSD, PCM, ...

Timeliness

Many situations need the result of analysis immediately

Real-time processing can be a challenge with big data, especially in dynamic data environments like financial trading and social media.

Develop partial results in advance and then do incremental computation

New index structures are required

From:

Privacy

Managing privacy is both a technical and a sociological problem

New data sources bring new problems: LBS, microblogs, ...

Share private data while limiting disclosure and ensuring sufficient data utility in the shared data

Differential privacy is a very important step, but it reduces information content too far to be useful in most practical cases

From:
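The trade-off between privacy and utility noted on the Privacy slide can be illustrated with the standard Laplace mechanism for a count query. This is a minimal sketch, not something from the talk: the function names, the true count of 1000, and the epsilon values are illustrative assumptions. A smaller epsilon gives stronger privacy but a larger expected error.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Noisy count: one person changes a count by at most `sensitivity`."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(true_count, epsilon, trials=2000, seed=0):
    """Average absolute error of the noisy count over repeated releases."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += abs(private_count(true_count, epsilon, rng) - true_count)
    return total / trials


# Hypothetical query: the true answer is 1000 users.
strong_privacy_error = mean_abs_error(1000, epsilon=0.1)   # noisy, less useful
weak_privacy_error = mean_abs_error(1000, epsilon=10.0)    # accurate, less private
```

The expected absolute error equals the noise scale, sensitivity / epsilon, so for small counts or small epsilon the released value can be dominated by noise.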


Overview of our work: Web Data Management

2001: Surface Web Data Extraction

2006: Deep Web Integration

2009: C-DBLP

2010: EasyScholar

2011-Present: ScholarSpace

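Using domain-oriented Web data integration techniques, we built several online systems, which validated the effectiveness of the techniques

ScholarSpace (学术空间): more than 3.5 million visits

工作通 data integration system: more than 3 million integrated records

Public opinion monitoring platform: more than 4.5 million integrated records

Book price comparison site: dynamic integration, real-time data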

ScholarSpace

Publications: 500,000; authors: 400,000

Cumulative visits: 4 million; daily visits: 6,000

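ScholarSpace: Web data management framework

Data extraction

Data integration

Entities: authors, papers, journals, conferences, research institutions, ...

Relationships: authorship, publication, co-authorship, affiliation, advisor, reference, ...

Relationship evolution: relationship discovery, deletion, update

Browsing, query, analysis: task-based, in rich and varied forms

[Figure: entity relationship graph with edge labels Advisor, Co-Author, Author-Of, Published-In, Member, Classmate, Reference]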

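Significance of the results

Established an approach to managing data in structured form, laying a foundation for solving big data integration problems in specific domains

This in turn offers a new line of approach to big data management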

Overview of our work: Cloud Data Management

2008/04: Introduction & Index for Cloud

2010/01: System Survey & Benchmark_v1.0

2010/06: TaijiDB_v1.0

2011/06: Query Process & Benchmark_v2.0

Present: Extensive Research & TaijiDB_v2.0

Join query, distribution strategy, progress estimate

Multidimensional index, query optimization, online aggregation

Current Work

Multidimensional Index in the Cloud

Multi-Fields Query Processing in the Cloud

Online Aggregation in the Cloud

Our Prototype System: TaijiDB

Benchmarking Cloud-based Data Management Systems

Practical Industrial Applications Motivated

Multi-dimensional Index in the Cloud - Motivation

Massive: millions of sensors or GPS-enabled devices; 10^6 * 2 * 60 * 24 * 1KB ≈ 3TB/day

High update frequency: data is collected frequently, with hundreds of thousands of insertions per second

Multi-dimensional: inherent spatio-temporal attributes, plus other attributes such as speed and direction
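The daily volume figure can be checked with a few lines of arithmetic. This is a worked sketch under stated assumptions: the factor 2 is read as two records per device per minute, and decimal units are used (1 TB = 10^9 KB). Neither reading is stated on the slide.

```python
# Back-of-the-envelope check of the daily data volume.
devices = 10**6            # one million sensors or GPS-enabled devices
records_per_minute = 2     # assumption: the factor 2 means two records per minute
minutes_per_day = 60 * 24
record_size_kb = 1

total_kb = devices * records_per_minute * minutes_per_day * record_size_kb
total_tb = total_kb / 10**9          # decimal units: 1 TB = 10^9 KB
inserts_per_second = devices * records_per_minute / 60

print(f"{total_tb:.2f} TB/day, {inserts_per_second:.0f} inserts/s")
```

Under these assumptions the total is 2.88 TB/day, which the slide rounds to 3TB/day.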

Toyota: G-Book

1 million+ members

GM: OnStar

5 million+ members

Internet of vehicles

Collaboration with NEC

Limits of Current Approaches

Traditional DBMS

• Struggles with scalability
• Cannot support high insert throughput

Key-value Stores

Pros

• High scalability, availability and fault tolerance
• Efficient random reads and writes
• Supports high insertion throughput

Cons

• Only supports fast row-key-based queries
• Cannot support multi-dimensional queries efficiently

Requirements

Design a new index model that supports efficient multi-dimensional queries, based on the characteristics of IoT applications

The index model must support high insert throughput at the same time

Implement the new index on HBase

Multi-level Index Framework

Divide the data into current data and historical data, and index them at different granularities

For current data, index time intervals and subspaces at a coarse level; for historical data, index each record, in batches

Z-order Based Dynamic Space Partitioning

Advantages

• Ensures that data is distributed evenly
• Data that is close in the original time and space dimensions can be stored in the same regions
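The locality property of Z-order comes from interleaving the bits of the coordinates into one key. The sketch below shows that encoding only. It is not the partitioning scheme from the talk: the function names, the 16-bit grid resolution, and the mapping of longitude/latitude to grid cells are illustrative assumptions.

```python
def interleave_bits(x, y, bits=16):
    """Z-order (Morton) value: bits of x go to even positions, bits of y to odd."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def cell_key(lon, lat, bits=16):
    """Quantize a longitude/latitude pair to a grid cell and return its Z-value."""
    max_cell = (1 << bits) - 1
    gx = int((lon + 180.0) / 360.0 * max_cell)
    gy = int((lat + 90.0) / 180.0 * max_cell)
    return interleave_bits(gx, gy, bits)


# The four cells of a 2x2 block get consecutive Z-values.
block = [interleave_bits(x, y) for y in (0, 1) for x in (0, 1)]
```

Because cells in the same block share a key prefix, a store that sorts rows by key, as HBase does, tends to place nearby points in the same region. The property is not perfect: cells on either side of a block boundary can be far apart in key order.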

Multi-Fields Query Processing in the Cloud - Motivation

Input

Output

R (msisdn, url, ts, size, otherData)

    SELECT TOP 100 url, SUM(size) s, COUNT(msisdn) c
    FROM R
    WHERE msisdn = 861346672558
      AND ts > 20120205 AND ts < 20120429
    GROUP BY url
    ORDER BY c

    SELECT TOP 100 msisdn, SUM(size) s, COUNT(url) c
    FROM R
    WHERE url = 'www.baidu.com'
      AND ts > 20120205 AND ts < 20120429
    GROUP BY msisdn
    ORDER BY c

Collaboration with Nokia Siemens Networks (诺西)

Multiple Layer Grid Tree For Telecom

    typedef struct MLGT {
        int N;
        int M;
        boolean bm[M][N];
        long insets[M][N];
        long trange;
        long mspace;
        Map<RegionID, MLGT> SR;
    } MLGT;

[Figure: MLGT (Multiple Layer Grid Tree). Axes: msisdn, from (0,0) to (mspace,0), and ts, from (0,0) to (0,trange). Each grid cell is a Region; a split Region points to a sub-MLGT.]

Solution: MLGT + Optimized MapReduce Algorithm

Organize all the Regions of a given table into a multiple layer grid tree (MLGT)
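The idea of a grid whose split cells point to finer sub-grids can be sketched as follows. This is a hypothetical illustration, not the authors' MLGT. The class name `GridTree`, the `regions` layout, the `m0`/`t0` offsets, and the `locate` method are assumptions; only the names `n`, `m`, `mspace`, `trange`, and a map from a split Region to a sub-tree (`sr`) mirror the struct shown on the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GridTree:
    n: int                     # columns along the msisdn axis
    m: int                     # rows along the ts axis
    mspace: int                # extent of the msisdn axis covered by this grid
    trange: int                # extent of the ts axis covered by this grid
    regions: List[List[str]]   # regions[row][col] is the Region ID of a cell
    m0: int = 0                # lower bound of msisdn for this grid (assumption)
    t0: int = 0                # lower bound of ts for this grid (assumption)
    sr: Dict[str, "GridTree"] = field(default_factory=dict)  # split Region -> sub-tree

    def locate(self, msisdn: int, ts: int) -> str:
        """Return the ID of the leaf Region that covers (msisdn, ts)."""
        col = min((msisdn - self.m0) * self.n // self.mspace, self.n - 1)
        row = min((ts - self.t0) * self.m // self.trange, self.m - 1)
        region_id = self.regions[row][col]
        child = self.sr.get(region_id)
        if child is not None:
            return child.locate(msisdn, ts)   # descend into the split Region
        return region_id


# A 2x2 root grid whose upper-right cell "r11" has been split into a 2x2 sub-grid.
root = GridTree(n=2, m=2, mspace=100, trange=100,
                regions=[["r00", "r01"], ["r10", "r11"]])
root.sr["r11"] = GridTree(n=2, m=2, mspace=50, trange=50,
                          regions=[["a", "b"], ["c", "d"]], m0=50, t0=50)
```

With such a structure, a query that fixes msisdn and restricts ts to a range only needs to visit the Regions whose cells overlap that range, which is what would allow a MapReduce job to skip the others.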

Multiple Layer Grid Tree For Telecom

[Figure: query processing architecture. Components: Query decomposition, Map/Reduce (map tasks), HBase (regions). Data flows, numbered 1 to 5: Query, Tablets meta info, Job settings, setting parameters, Query results.]

Online Aggregation in the Cloud - Motivation

Wikipedia Page Traffic Statistics

    SELECT language, SUM(pageviews)
    FROM table
    WHERE language IN ('en', 'ja', 'de', 'es', 'fr', 'it', 'pl')
    GROUP BY language

Data: 20TB, processed on a 60-node Amazon EC2 cluster

Batch processing: 95h, $1400

Online Aggregation: 1h, results with 95% confidence

Saves cost!
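The principle behind online aggregation can be sketched as a running estimate of a SUM with a confidence interval that tightens as more rows are scanned. This is a minimal sketch, not COLA's implementation. It assumes rows arrive in random order and uses a normal approximation with no finite-population correction; the function name and the synthetic page-view data are illustrative.

```python
import math
import random


def online_sum(values, total_rows, z=1.96):
    """Yield (rows_seen, estimated_sum, half_width) after each scanned row.

    z=1.96 gives an approximate 95% confidence interval.
    """
    n = 0
    mean = 0.0
    m2 = 0.0   # running sum of squared deviations (Welford's method)
    for v in values:
        n += 1
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
        if n < 2:
            continue
        std = math.sqrt(m2 / (n - 1))
        estimate = total_rows * mean
        half_width = z * total_rows * std / math.sqrt(n)
        yield n, estimate, half_width


# Synthetic page-view counts, scanned in random order.
rng = random.Random(0)
pageviews = [rng.randint(0, 100) for _ in range(10_000)]
rng.shuffle(pageviews)

snapshots = {n: (est, hw)
             for n, est, hw in online_sum(pageviews, len(pageviews))
             if n in (100, 10_000)}
```

The interval half-width shrinks roughly with the square root of the number of rows seen, so a user who accepts an approximate answer can stop the query early instead of paying for the full scan.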

COLA - Architecture

Online Aggregation Executor: state management, estimation, progress prediction

Query Engine: backward compatibility, transparent

User Interface: 2 interfaces, 2 processing modes

Data Manager: data sampling, metadata management

COLA - Implementation

COLA

Result Estimator: result estimation and confidence interval computation

Combiner + Reducer

State Manager: a State Manager for each Reducer; stateful incremental computation

Data Sampler: split-based queue; one queue per table, of equal length

OLA Translator: Map Translator, Combine Translator, Reduce Translator, No Translator

Progress Predictor: MapReduce DAG graph, task-based PERT network, critical path
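Progress prediction over a task DAG rests on the critical path: the longest chain of dependent tasks bounds the total running time. The sketch below computes it for a small graph. It is an illustration only, not COLA's predictor; the task names and durations are hypothetical.

```python
from graphlib import TopologicalSorter


def critical_path(durations, deps):
    """Return (total_time, path) for the longest path through a task DAG.

    durations: task -> estimated running time
    deps:      task -> list of tasks that must finish first
    """
    finish = {}   # earliest finish time of each task
    prev = {}     # predecessor on the longest path into each task
    for task in TopologicalSorter(deps).static_order():   # predecessors first
        best = max(deps.get(task, ()), key=lambda p: finish[p], default=None)
        start = finish[best] if best is not None else 0
        finish[task] = start + durations[task]
        prev[task] = best
    last = max(finish, key=finish.get)
    total = finish[last]
    path = []
    while last is not None:
        path.append(last)
        last = prev[last]
    return total, path[::-1]


# Hypothetical job: two map tasks feed a combine step, then a reduce step.
durations = {"map1": 30, "map2": 50, "combine": 5, "reduce": 20}
deps = {"combine": ["map1", "map2"], "reduce": ["combine"]}
total_time, path = critical_path(durations, deps)
```

Tasks off the critical path (here `map1`) have slack, so a progress estimate only needs to track the remaining work along the path.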

Benchmarking the Cloud DBMS - Motivation

How is the performance?

How to choose the most appropriate system?

How to evaluate the systems?

[Figure: existing CloudDB systems. Labels: Data Analysis, Web Data Management, Applications, Architecture, Storage, Key Value, Data Model.]

Benchmark Design

Standardization, broad representation, efficiency

Benchmark

Scenario: real application scenario from industry

Metrics: a series of metrics to evaluate performance

Operation: representative operations in the business application

TestCase: business process in the application

Benchmark Scenario

[Figure: benchmark scenario, from Input to Output]

Benchmark Operations

PUT: PUT(KEY, VALUE)

GET: VALUE = GET(KEY)

SCAN: RESULT = SCAN(STARTKEY, ENDKEY)

LOAD: LOAD(PATH)
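The four benchmark operations can be made concrete with a small in-memory store that keeps keys sorted. This is a sketch for illustration, not the benchmark's driver or any system under test. The class name `SortedKV`, the half-open scan range (start inclusive, end exclusive), and the tab-separated file format for LOAD are assumptions.

```python
import bisect


class SortedKV:
    """Minimal sorted key-value store exposing PUT, GET, SCAN and LOAD."""

    def __init__(self):
        self._keys = []    # keys kept in sorted order for range scans
        self._data = {}

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def scan(self, start_key, end_key):
        """Return (key, value) pairs with start_key <= key < end_key."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, end_key)
        return [(k, self._data[k]) for k in self._keys[lo:hi]]

    def load(self, path):
        """Bulk-load 'key<TAB>value' lines from a text file."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                key, _, value = line.rstrip("\n").partition("\t")
                if key:
                    self.put(key, value)


kv = SortedKV()
kv.put("b", "2")
kv.put("a", "1")
kv.put("c", "3")
```

A benchmark driver would time each of these calls against the real system; the in-memory version only pins down what each operation is expected to return.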

Evaluation Results

[Figure: response time vs. number of nodes, with and without partition. Panels: Data Import, File Load, Scalability.]

TaijiDB - Motivation

Real-world applications, Big Data

Cloud Computing, cloud-based DBMS

No one-size-fits-all solution in the cloud

TaijiDB: A TitAnIc and Just-In-time DataBase

System Architecture

2012/7/10

[Figure: TaijiDB system architecture. Labels, grouped as they appear:

Applications: E-Commerce, Internet of Things, Telecom

Interfaces: Basic SQL Interface/Application Interfaces, Unified API, Front-end Interface

Query Processing: Query Optimization, Random Sampling Algorithm, MLGT Algorithm, Progress Estimating, Online Aggregation

Index Management: Multi-Level Index

Storage Management (Storage Manager): SSD & Buffer Management

Operation & Management Service: Security, Lock, Monitoring, LoadBalance, Metadata, Testing

Unified Execution Engine

HBase: HMaster, HRegionServer, HDFS (Tables & Files & Logs)

Cassandra: Thrift Interface, Keyspace, SuperColumn]


Summary

Cloud Computing helps organizations store, manage, share and analyze their Big Data in an affordable and easy-to-use way

The concept of Big Data is broad and vague. We must focus on one or a few domains.

Data thinking: nothing can be done without data

Different situations need different types of processing: batch or stream

Both hardware and software need to be updated

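Xiangshan Science Conferences (香山科学会议)

Network Data Science and Engineering

Li Guojie (李国杰), Hua Yunsheng (华云生), Yao Qizhi (姚期智), Cheng Siwei (成思危)

Main topics

Applications of network big data in society, the economy, and IT

Common theoretical foundations of network data science

Building a healthy ecosystem for network big data

China Science Daily (中国科学报), Li Guojie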

XLDB Asia 2012

Invited Talks

Reference cases from scientific communities: Astroinformatics, Geoinformatics, Earth…

Reference cases from industry: Facebook, eBay, EMC, Taobao…

Research on Big Data Management: Laura (IBM), Xiaodong Zhang (Ohio), Martin (MonetDB)…

Panel Discussion

Handling Extremely Large Scientific Data

NoSQL: the Cure for Big Data?

Evolution or Revolution: Database Research for Big Data

Lightning talks

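In the future, the amount of data produced every 18 months will equal the total amount of data produced in all history before it

-- Jim Gray, 1998 Turing Award lecture

Thank you!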
