Cloud Computing and Big Data
Xiaofeng Meng, Renmin University of China
Forum of Future Data
FFD, 2012, Wuyishan (武夷山)
Outline
1. Introduction to Big Data
2. Cloud Computing and Big Data
3. Challenging Problems
4. Our Work
5. Conclusion
Big Data is so hot!
Google Trends of Big Data
Big Data Across the Federal Government (USA, March 2012)
What is Big Data?
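DB (Database) vs. BD (Big Data)
DB: "small data", Very Large Database (VLDB); MB scale, structured data
Treats data as an object: solves the problems of storing and managing it
Data Engineering
BD: Big Data, Extremely Large Database (XLDB); > PB scale, unstructured data
Treats data as a resource: solves problems across many domains
Data Thinking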
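What Can Big Data Do?
Wall Street sells off stocks based on public sentiment
Hedge funds analyze companies' product sales from customer reviews on shopping sites
Banks infer the employment rate from the number of postings on job sites
Investment firms collect and analyze listed companies' statements, looking for early signs of bankruptcy
The US Centers for Disease Control and Prevention analyzes the worldwide spread of influenza and other epidemics from web users' searches
US President Obama's campaign team analyzes, in real time, voters' preferences among presidential candidates from voters' microblog posts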
Prediction
Big Data Application
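Application           Users  Precision       Reliability    Data volume  Response
Scientific computing  Few    Very high       Low to medium  Tera         Slow
Stock trading         Many   High            Very high      Giga         Fast
Web data              Many   Medium to high  Medium         Peta         Fast
Microblog data        Many   Medium to high  Medium         100 Peta     Fast
...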
Cloud Computing and Big Data
Cloud Computing is like a highway that can carry many kinds of vehicles
Big Data can be seen as one vehicle on that highway
Cloud Computing is the infrastructure, while Big Data is what it serves
Big Data Analysis Pipeline
Acquisition -> Extraction & Cleaning -> Integration -> Analysis -> Interpretation
Working with cloud computing can greatly advance each of these stages
From:
Challenging Problems
Data, Data and Data!
Data is all around you!
Data types are varied
Most data is held by companies
Researchers find it difficult to get the data
No Size Fits All
Web data
Science data
Financial data
Moving object data
...
Big Data Analytics Tools in Use (bar chart, share of respondents)
Oracle Exadata: 21%
Microsoft SQL PDW: 18%
IBM DB2 Smart Analytics System: 12%
Hadoop/MapReduce: 11%
IBM Netezza: 10%
HP Vertica: 9%
Teradata EDW: 9%
EMC Greenplum: 8%
Sybase IQ: 4%
Infobright: 2%
Kognitio WX2: 1%
ParAccel Analytic Database: 1%
Other: 3%
We aren't using big data analytics tools: 35%
Don't know: 11%
"Fishing in the ocean" vs. "fishing in a pond"
"Data is widely available; what is scarce is the ability to extract wisdom from it."
Parallelism
Parallelism across nodes in a cluster
Parallelism within a single node
Cloud Computing
New hardware: SSD, PCM, ...
Timeliness
Many situations need the result of analysis immediately
Real-time processing can be a challenge with big data, especially in dynamic data environments like financial trading and social media.
Develop partial results in advance and then do incremental computation
New index structures are required
From:
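A minimal sketch of "develop partial results in advance and then do incremental computation", assuming the analysis is a per-key average. The class name `IncrementalAverage` and the sample keys are hypothetical; the point is only that each new record updates a small stored state instead of triggering a rescan of old data.

```python
from collections import defaultdict


class IncrementalAverage:
    """Keeps per-key partial results (count and sum) so queries never rescan old data."""

    def __init__(self):
        self._count = defaultdict(int)
        self._total = defaultdict(float)

    def update(self, key, value):
        # O(1) work per incoming record: fold it into the stored partial result.
        self._count[key] += 1
        self._total[key] += value

    def average(self, key):
        # Answer immediately from the partial result; None if the key was never seen.
        count = self._count.get(key, 0)
        if count == 0:
            return None
        return self._total[key] / count


avg = IncrementalAverage()
avg.update("AAA", 10.0)
avg.update("AAA", 20.0)
print(avg.average("AAA"))  # 15.0
```

The same pattern applies to any aggregate that can be merged from partial states (counts, sums, minima, maxima); aggregates such as exact medians need a different structure.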
Privacy
Managing privacy is both a technical and a sociological problem
New data sources bring new problems: LBS, microblogs, ...
Share private data while limiting disclosure and ensuring sufficient data utility in the shared data
Differential privacy is a very important step, but it reduces information content too much to be useful in most practical cases
From:
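To make the utility concern concrete, here is a small sketch of the standard Laplace mechanism for a counting query (sensitivity 1). It is an illustration only, not something taken from the slides: the function names, the true count of 100, and the epsilon values are all assumptions. Noise is drawn as the difference of two exponential variates, which follows a Laplace distribution.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    # Laplace mechanism: add noise with scale sensitivity / epsilon.
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(epsilon, trials=2000, seed=0):
    # Average distance between the noisy answer and the true count of 100.
    rng = random.Random(seed)
    errors = (abs(private_count(100, epsilon, rng) - 100) for _ in range(trials))
    return sum(errors) / trials


for eps in (1.0, 0.1, 0.01):
    print(eps, mean_abs_error(eps))
```

The expected absolute error equals the noise scale, 1/epsilon for a count, so strong privacy (small epsilon) on a count of about 100 can leave very little usable signal.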
Our Work
Overview of our work: Web Data Management
2001: Surface Web Data Extraction
2006: Deep Web Integration
2009: C-DBLP
2010: EasyScholar
2011-Present: ScholarSpace
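Domain-oriented Web data integration: several online systems were built successfully, validating the effectiveness of the data integration techniques
ScholarSpace (学术空间); 工作通 job data integration system
Public opinion monitoring platform; book price comparison site
(more than 3.5 million visits) (more than 3 million integrated records)
(more than 4.5 million integrated records) (dynamic integration, real-time data)
ScholarSpace
Publications: 500,000; authors: 400,000
Total visits: 4 million; daily visits: 6,000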
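ScholarSpace
Entities: authors, papers, journals, conferences, research institutions, ...
Relations: author, paper publication, co-author, affiliation, advisor, reference, ...
Data extraction
Data integration
Figure: relation graph with edges labeled Advisor, Co-Author, Author-Of, Published-In, Member, Classmate, Reference
Relation evolution: relation discovery, deletion, update
Browse (task-based), query (multiple forms), analysis (rich and varied)
Web data management framework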
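Significance of the results
Established an approach to managing data in structured form, laying a foundation for solving big data integration problems in specific domains
This in turn offers a new line of approach to big data management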
Overview of our work: Cloud Data Management
2008/04: Introduction & Index for Cloud
2010/01: System Survey & Benchmark_v1.0
2010/06: TaijiDB_v1.0
2011/06: Query Process & Benchmark_v2.0
Present: Extensive Research & TaijiDB_v2.0
Topics: join query, distribution strategy, progress estimate
Topics: multidimensional index, query optimization, online aggregation
Current Work
Multidimensional Index in the Cloud
Multi-Fields Query Processing in the Cloud
Online Aggregation in the Cloud
Our Prototype System: TaijiDB
Benchmarking Cloud-based Data Management Systems
Motivated by practical industrial applications
Multi-dimensional Index in the cloud - motivation
Massive: millions of sensors or GPS-enabled devices; 10^6 * 2 * 60 * 24 * 1 KB ≈ 3 TB/day
High update frequency: follows the data collection frequency; hundreds of thousands of insertions per second
Multi-dimensional: inherent attributes (spatio-temporal); other attributes (speed, direction, ...)
Toyota: G-Book, 1 million+ members
GM: OnStar, 5 million+ members
Internet of vehicles
Collaboration with NEC
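The daily-volume estimate on this slide can be checked with a few lines. This is a sketch under two assumptions that the slide does not spell out: the factor 2 is read as two records per device per minute, and 1 KB is taken as 1000 bytes. The function name is hypothetical.

```python
def daily_volume_bytes(devices, records_per_minute, record_bytes):
    # records per day = devices * records per minute * 60 minutes * 24 hours
    return devices * records_per_minute * 60 * 24 * record_bytes


volume = daily_volume_bytes(devices=10**6, records_per_minute=2, record_bytes=1000)
print(volume / 10**12)  # 2.88 (terabytes per day, i.e. roughly 3 TB/day)
```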
Limits of Current Approaches
Traditional DBMS
• Struggles with scalability
• Cannot support high insert throughput
Key-value Stores
Pros
• High scalability, availability and fault tolerance
• Efficient random reads and writes
• Support for high insertion throughput
Cons
• Only support fast rowkey-based queries
• Cannot support multi-dimensional queries efficiently
Requirements
Design a new index model that supports efficient multi-dimensional queries, based on the characteristics of IoT applications
The index model must support high insert throughput at the same time
Implement the new index on top of HBase
Multi-level Index Framework
Divide the data into current data and historical data, and index them at different granularities
For current data, index the time intervals and subspaces at a coarse level; for historical data, index each record in batch
Z-order Based Dynamical Space Partitioning
Advantages
Makes sure the data is distributed evenly
Data that is close in the original time and space dimensions can be stored in the same regions
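A minimal sketch of the Z-order (Morton) encoding that this kind of partitioning relies on, assuming a 2-D grid of non-negative integer cell coordinates. The function name and the 16-bit width are illustrative assumptions; the slides do not give the actual key layout used on HBase.

```python
def interleave_bits(x, y, bits=16):
    """Return the Z-order (Morton) code of cell (x, y) by interleaving their bits."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # bit i of x goes to even position 2i
        code |= ((y >> i) & 1) << (2 * i + 1)    # bit i of y goes to odd position 2i+1
    return code


# The four cells of a 2x2 block receive consecutive codes, so they sort next to each other.
print([interleave_bits(x, y) for y in (0, 1) for x in (0, 1)])  # [0, 1, 2, 3]
```

Because keys in a sorted key-value store are range-partitioned, cells with nearby codes tend to fall into the same region, which is the locality property the slide describes.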
Multi-Fields Query Processing in the Cloud - Motivation
Input
Output
R (msisdn, url, ts, size, otherData)

Select Top 100 url, sum(size) s, count(msisdn) c
From R
Where msisdn = 861346672558 And ts > 20120205 And ts < 20120429
Group by url
Order by c

Select Top 100 msisdn, sum(size) s, count(url) c
From R
Where url = 'www.baidu.com' And ts > 20120205 And ts < 20120429
Group by msisdn
Order by c
Collaboration with Nokia Siemens Networks (诺西)
Multiple Layer Grid Tree For Telecom
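typedef struct MLGT {
    int n;                    // grid dimension
    int m;                    // grid dimension
    boolean bm[m][n];         // one flag per grid cell
    long insets[m][n];        // one long value per grid cell
    long trange;              // extent of the ts axis
    long mspace;              // extent of the msisdn axis
    Map<RegionID, MLGT> SR;   // split regions, each mapped to its sub-MLGT
} MLGT;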
Figure: MLGT (Multiple Layer Grid Tree). Axes: msisdn, from (0,0) to (mspace,0), and ts, from (0,0) to (0,trange). Each grid cell is a Region (Cell); a Split Region points to a sub-MLGT.
Solution: MLGT + Optimized MapReduce Algorithm
Organize all the Regions of a given table into a multiple layer grid tree (MLGT)
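The idea of a multiple layer grid tree can be sketched as a grid whose cells may themselves be split into finer grids. The following is a hedged illustration, not the authors' implementation: the class `GridTree`, its field names, and the assumptions that `m` counts cells along msisdn, `n` counts cells along ts, that extents divide evenly, and that queried points lie inside the root range are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GridTree:
    m: int                 # cells along the msisdn axis (assumed meaning)
    n: int                 # cells along the ts axis (assumed meaning)
    mspace: int            # width of the msisdn range covered by this tree
    trange: int            # width of the ts range covered by this tree
    m_origin: int = 0      # lower bound of the msisdn range
    t_origin: int = 0      # lower bound of the ts range
    sub: dict = field(default_factory=dict)  # (i, j) -> GridTree for split cells

    def cell_of(self, msisdn, ts):
        # Map a point to the index of the grid cell that contains it.
        i = (msisdn - self.m_origin) * self.m // self.mspace
        j = (ts - self.t_origin) * self.n // self.trange
        return i, j

    def split(self, i, j, m, n):
        # Replace cell (i, j) by a finer m x n grid covering the same rectangle.
        cell_w = self.mspace // self.m
        cell_h = self.trange // self.n
        child = GridTree(m, n, cell_w, cell_h,
                         self.m_origin + i * cell_w, self.t_origin + j * cell_h)
        self.sub[(i, j)] = child
        return child

    def locate(self, msisdn, ts):
        # Walk down the layers, collecting the cell index chosen at each level.
        path, node = [], self
        while node is not None:
            cell = node.cell_of(msisdn, ts)
            path.append(cell)
            node = node.sub.get(cell)
        return path


root = GridTree(m=4, n=4, mspace=400, trange=400)
root.split(1, 2, m=2, n=2)
print(root.locate(160, 210))  # [(1, 2), (1, 0)]
```

A range query would use the same descent to find the cells (and hence the regions) that overlap the query rectangle, so that only those regions need to be scanned.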
Multiple Layer Grid Tree For Telecom
Figure: query processing architecture. Components: Query decomposition, Map/Reduce (Map tasks), HBase (regions). Data flow in five numbered steps, labeled: Query, Tablets meta info, setting parameters, Job settings, Query results.
Online Aggregation in the Cloud - Motivation
Wikipedia Page Traffic Statistics
SELECT language, SUM(pageviews)
FROM table
WHERE language IN ('en','ja','de','es','fr','it','pl')
GROUP BY language
Data size: 20TB
Amazon EC2, 60-node cluster
Batch processing: 95h, $1400
Online aggregation: 1h, results with 95% confidence
Saves cost!
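The trade-off above can be illustrated with a small online aggregation sketch: estimate a SUM from a growing random sample and report a 95% confidence interval after every batch. This is a textbook estimator under stated assumptions (simple random sampling without replacement, normal approximation, finite population correction), not necessarily the estimator COLA uses; the function name and the synthetic data are hypothetical.

```python
import math
import random


def online_sum(values, batch_size=1000, z=1.96, seed=0):
    """Yield (rows_seen, estimated_sum, half_width) after each batch of sampled rows."""
    data = list(values)
    random.Random(seed).shuffle(data)          # random processing order = random sample
    total_n = len(data)
    n, mean, m2 = 0, 0.0, 0.0                  # Welford running mean and squared deviations
    for start in range(0, total_n, batch_size):
        for x in data[start:start + batch_size]:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
        variance = m2 / (n - 1) if n > 1 else 0.0
        # Finite population correction: the interval shrinks to zero once all rows are seen.
        fpc = (total_n - n) / (total_n - 1) if total_n > 1 else 0.0
        half_width = z * total_n * math.sqrt(variance / n * fpc)
        yield n, total_n * mean, half_width


pageviews = [(i * 37) % 101 for i in range(5000)]   # synthetic stand-in data
for seen, estimate, half_width in online_sum(pageviews, batch_size=1000):
    print(seen, round(estimate), round(half_width))
```

A user can stop as soon as the interval is narrow enough, which is where the time and cost savings come from; the running state (`n`, `mean`, `m2`) is all that must be kept between batches.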
COLA - Architecture
Online Aggregation Executor: State Management, Estimation, Progress Prediction
Query Engine: Backward Compatibility, Transparent
User Interface: 2 interfaces, 2 processing modes
Data Manager: Data Sampling, Metadata Management
COLA - Implementation
COLA
Result Estimator
State Manager
Data Sampler
OLA Translator
Progress Predictor
Map Translator, Combine Translator, Reduce Translator, No Translator
Result Estimation & Confidence Interval Computation
Combiner + Reducer
Split-based Queue: a queue for a table, equal length
A State Manager for a Reducer; Stateful Incremental Computation
MapReduce DAG Graph; Task-based PERT Network; Critical Path
Benchmarking the CloudDBMS - Motivation
How is the performance?
How to choose the most appropriate system?
How to evaluate the systems?
Existing CloudDB
Data Analysis
Web Data Management
Applications
Architecture
Storage
Key Value
Data Model
Benchmark Design
Standardization
Broad representation
Efficiency
Benchmark
Scenario: real application scenario from industry
Metrics: a series of metrics to evaluate performance
Operation: representative operations in the business application
Test case: business process in the application
Benchmark Scenario
Input
Output
Benchmark Operations
PUT: PUT(KEY, VALUE)
GET: VALUE = GET(KEY)
SCAN: RESULT = SCAN(STARTKEY, ENDKEY)
LOAD: LOAD(PATH)
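The four operations can be sketched against a tiny in-memory store to show what a benchmark driver would call. Everything here is an assumption for illustration: the class `TinyKVStore`, the half-open scan range [STARTKEY, ENDKEY), and a LOAD file format of one tab-separated key/value pair per line. A real run would issue the same calls through each system's own client.

```python
import bisect


class TinyKVStore:
    """In-memory stand-in for a key-value store with ordered keys."""

    def __init__(self):
        self._keys = []   # sorted list of keys, used by scan
        self._data = {}   # key -> value

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)   # keep keys sorted on first insert
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)           # None when the key is absent

    def scan(self, start_key, end_key):
        # Return (key, value) pairs with start_key <= key < end_key, in key order.
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, end_key)
        return [(k, self._data[k]) for k in self._keys[lo:hi]]

    def load(self, path):
        # Bulk-load "key<TAB>value" lines from a file; returns the number of rows loaded.
        count = 0
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    continue
                key, value = line.split("\t", 1)
                self.put(key, value)
                count += 1
        return count


store = TinyKVStore()
store.put("b", "2")
store.put("a", "1")
store.put("c", "3")
print(store.get("a"))         # 1
print(store.scan("a", "c"))   # [('a', '1'), ('b', '2')]
```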
Evaluation Results
Figures: Data Import, File Load, Scalability. Y axis: response time; X axis: nodes; series: Partition vs. Without partition.
TaijiDB - Motivation
Real World Applications
Big Data
Cloud Computing
Cloud Based DBMS
No One-To-All Solutions in the Cloud
TaijiDB: A TitAnIc and Just-In-time DataBase
System Architecture
2012/7/10
Figure: TaijiDB system architecture. Labels in the diagram:
Applications: E-Commerce, Internet of Things, Telecom
Interfaces: Basic SQL Interface/Application Interfaces, Unified API, Front-end Interface
Operation & Management Service: Security, Lock, Monitoring, Load Balance, Metadata, Testing
Query Processing: Query Optimization, Random Sampling Algorithm, MLGT Algorithm, Progress Estimating, Online Aggregation, Unified Execution Engine
Index Management: Multi-Level Index
Storage Management: SSD & Buffer Management, Storage Manager
HBase: HMaster, HRegionServer, HDFS (Tables & Files & Logs)
Cassandra: Thrift Interface, Keyspace, SuperColumn
Conclusion
Summary
Cloud Computing helps organizations store, manage, share and analyze their Big Data in an affordable and easy-to-use way
The concept of Big Data is broad and vague. We must focus on one or a few domains.
Data thinking: nothing can be done without data
Different situations need different types of processing: batch or stream
Hardware and software both need to be updated
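Xiangshan Science Conference (香山科学会议)
Network Data Science and Engineering
Li Guojie (李国杰), Hua Yunsheng (华云生), Yao Qizhi (姚期智), Cheng Siwei (成思危)
Main topics
Applications of network big data in society, the economy, and IT
Common theoretical foundations of network data science
Building a healthy ecosystem for network big data
China Science Daily (中国科学报), Li Guojie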
XLDB Asia 2012
Invited Talks
Reference cases from scientific communities: Astroinformatics, Geoinformatics, Earth…
Reference cases from industry: Facebook, eBay, EMC, Taobao…
Research on Big Data Management: Laura (IBM), Xiaodong Zhang (Ohio), Martin (MonetDB)…
Panel Discussion
Handling Extremely Large Scientific Data
NoSQL: the Cure for Big Data?
Evolution or Revolution: Database Research for Big Data
Lightning talks
"The amount of data produced in each coming 18 months will equal the sum of all data produced before."
-- Jim Gray, 1998 Turing Award lecture
Thank you!