61
Big Data and Social Computing Xiaofeng Meng Renmin University of China Copyright 2012. Do not duplicate without permission. 1

Big Data and Social Computing Xiaofeng Meng Renmin ...idke.ruc.edu.cn/invited_talk/Cloud Computing and Big Data4mgnt ss...2008, Nature, Big Data ... 要新的方法解决BD ... ensuring

Embed Size (px)

Citation preview

Big Data and Social Computing

Xiaofeng MengRenmin University of China

Copyright 2012. Do not duplicate without permission. 1

Outline

Introduction to Big Data1

Cloud Computing and Big Data 2

3

4

Conclusion

Challenging Problems

Social Computing

5

Copyright 2012. Do not duplicate without permission. 2

Outline

Introduction to Big Data1

Cloud Computing and Big Data 2

3

4

Conclusion

Challenging Problems

Social Computing

5

Copyright 2012. Do not duplicate without permission. 3

Big Data is so hot!

Google Trends of Big Data

Big Data is Big deal?

Copyright 2012. Do not duplicate without permission. 4

Big Data is so hot!

Copyright 2012. Do not duplicate without permission. 5

Big Data is so hot! 2008, Nature, Big Data专刊

2011年2月,Science,Dealing with Big Data 2011年6月麦肯锡(McKinsey)

Big data: The next frontier for innovation, competition, and productivity

2012年1月达沃斯世界经济论坛 Big Data, Big Impact: New Possibilities for International

Development

2012年3月份美国奥巴马政府 Big Data Research and Development Initiative

2012年5月联合国Global Pulse的倡议项目 Big Data for Development:Challenges & Opportunities

《纽约时报》: The Age of Big DataCopyright 2012. Do not duplicate without permission. 6

What is Big Data?

3V or 4V

Copyright 2012. Do not duplicate without permission. 7

安阳殷墟遗址(公元前1300,距今3300年)

甲骨文大坑,1万7千余片

甲骨文大坑 is Big data

Copyright 2012. Do not duplicate without permission. 8

DB(Database) vs. BD(Big Data)“Small data”,

Very Large Database(VLDB) MB, 结构数据

运营式系统=>封闭数据源

以数据为对象解决其存储和管理问题

Big Data,Extremely Large Database(XLDB) >PB,非结构数据,

感知式系统=>开放数据源

以数据为资源解决诸领域问题

数据工程

数据思维

Data Engineering

Data Thinking

Copyright 2012. Do not duplicate without permission. 9

DB(Database) vs. BD(Big Data)

方法问题

DB BD

DB 认为BD就是DB的问题,也可以用DB的方法解决。比如SAP Sybase的CTOIrfan Khan就持有这种观点

认为BD本质上还是DB的问题,但是需要一些BD的方法来解决,典型的代表如IBM、Oracle、Teradata等公司的大数据解决方案。

BD 认为BD是一个新问题,但是可以用DB的一些方法来解决,典型的代表如SciDB系统

认为BD是一个新问题,需要新的方法解决BD的问题,典 型 代 表 如 Google 、Facebook等公司的解决方案

Copyright 2012. Do not duplicate without permission. 10

DB(Database) vs. BD(Big Data)

方法问题

DB BD

DB 认为BD就是DB的问题,也可以用DB的方法解决。比如SAP Sybase的CTOIrfan Khan就持有这种观点

认为BD本质上还是DB的问题,但是需要一些BD的方法来解决,典型的代表如IBM、Oracle、Teradata等公司的大数据解决方案。

BD 认为BD是一个新问题,但是可以用DB的一些方法来解决,典型的代表如SciDB系统

认为BD是一个新问题,需要新的方法解决BD的问题,典 型 代 表 如 Google 、Facebook等公司的解决方案

Copyright 2012. Do not duplicate without permission. 11

我们认为这是一种最正确的观点

这两种观点部分正确

这两种观点部分正确

这种观点不正确

数据产生的演化

运营式系统 数据源是被动产生,

数据规范,有秩序,强调数据的一致性

互联网系统 数据源是主动产生,有网络主体人加工产生

数据结构复杂,无秩序,不强调数据的一致性或只强调弱一致性

感知式系统

数据源是自动产生,未来无处不在的传感器和微处理器

大数据的真正驱动力!

Copyright 2012. Do not duplicate without permission. 12

“大海捕鱼”vs.“池塘捕鱼”

Copyright 2012. Do not duplicate without permission. 13

No Size Fits All

One Size Fits All

What Can Big Data do ?Prediction

Copyright 2012. Do not duplicate without permission. 14

What Can Big Data do ?华尔街根据民众情绪抛售股票

对冲基金依据购物网站的顾客评论,分析企业产品销售情况

银行根据求职网站的岗位数量,推断就业率

投资机构收集并分析上市企业声明,从中寻找破产的蛛丝马迹

美国疾病控制和预防中心依据网民搜索,分析全球范围内流感等病疫的传播情况

美国总统奥巴马的竞选团队依据选民的微博,实时分析选民对总统竞选人的喜好

Copyright 2012. Do not duplicate without permission. 15

What Can Big Data do ?Fraud DetectionHealthcare TransportationTelecommunicationsLife sciences Financial transactions ……….

Copyright 2012. Do not duplicate without permission. 16

Big Data Application

应用 用户数 精确度 可靠度 数据量 反应

科学计算 少 极高 低 -- 中等 Tera 慢

股市交易 大量 高 极高 Gega 快

Web数据 大量 中等 -- 高 中等 Peta 快

微博数据 大量 中等 -- 高 中等 100Peta 快

。。。

Copyright 2012. Do not duplicate without permission. 17

Big Data Application

Big ScienceBig ManufacturingBig BusinessBig Car– Google Self-driving car

Big Analytics

Big Integration Big Privacy

Copyright 2012. Do not duplicate without permission. 18

Big Efforts新加坡政府:德勤数据分析研究所(DAI),

企业:GE公司将投资15亿在旧金山湾区建立全球的软件和分析中心,中心拟雇佣至少400名数据科学家(Data Scientist)

1961: Charles Bachman at GE develops the first database management system, IDS难道GE又要拔得头筹?

Copyright 2012. Do not duplicate without permission. 19

Outline

Introduction to Big Data1

Cloud Computing and Big Data 2

3

4

Conclusion

Challenging Problems

Social Computing

5

Copyright 2012. Do not duplicate without permission. 20

Cloud Computing and Big Data

Cloud Computing is just like the highway which can support a variety of transportation

Big Data can be seen as one vehicle on the highway

Cloud Computing is infrastructure while Big Data is its service object

Copyright 2012. Do not duplicate without permission. 21

Big Data Analysis Pipeline

Analytics

Integration

Extraction& Cleaning

Acquisition

Interpretation

Collaboration of cloud computing can greatly promote these process

From:

Copyright 2012. Do not duplicate without permission. 22

AcquisitionMultiple data resource and huge amountMuch of this data is of no interestData Reduction is important

Copyright 2012. Do not duplicate without permission. 23

Extraction & CleaningVarious data type: Structured &UnstructuredExtraction is often highly application dependentMissing information and error information should be

cleaned.Duplicate information should be identified and

merged

Copyright 2012. Do not duplicate without permission. 24

IntegrationRight metadata is neededthere has to be some translation of data as it flows

from one model(platform) to the other. E.g. Transfer data from Hadoop to DB2

Copyright 2012. Do not duplicate without permission. 25

AnalyticsFundamentally different from traditional statistical

analysis on small samplesReal-time analysisLack of coordination between database systems

Copyright 2012. Do not duplicate without permission. 26

Analytics Batch Process: 目前最具代表性是MapReduce模型

主要适用于对时间要求不太敏感的应用场景,具有很高的IO。典型应用有网站日志分

析、文本挖掘、生物信息学等

Stream Process: Storm(Twitter), S4(Yahoo!)主要应用场景有网页点击数的实时统计、传感器网络、金融中的高频交易等

Copyright 2012. Do not duplicate without permission. 27

Analytics Tools in Use

21%

18%

12%

11%

10%

9%

9%

8%

4%

2%

1%

1%

3%

35%

11%

0% 5% 10% 15% 20% 25% 30% 35% 40%

Oracle Exadata

Microsoft SQL PDW

IBM DB2 Smart Analytics System

Hadoop/Mapreduce

IBM Netzza

HP Vertica

Teradata EDW

EMC Greenplum

Sybase IQ

Infobright

Kognitb WX2

ParAccel Analytic Database

Other

We aren't using big data analytics tools

Don't know

Big Data Analytics Tools in Use

Copyright 2012. Do not duplicate without permission. 28

From: InformationWeek 2012 Big Data Survey of 231 business technology professionals, Dec. 2011

InterpretationBig data is of limited value if users cannot

understand the analysisThe provenance of the result dataData visualization

Copyright 2012. Do not duplicate without permission. 29

Big Data Visualization: The Internet Map

Outline

Introduction to Big Data1

Cloud Computing and Big Data 2

3

4

Conclusion

Challenging Problems

Social Computing

5

Copyright 2012. Do not duplicate without permission. 30

Data, Data and Data!

Copyright 2012. Do not duplicate without permission. 31

Difficult to get the dataData is all around you!Data type is variousMost data is occupied by companyResearchers are difficult to get the data

Copyright 2012. Do not duplicate without permission. 32

No Size Fits All

Web dataScience dataFinancial DataMoving Object Data………

Copyright 2012. Do not duplicate without permission. 33

ScaleWe must store everything because we don’t know

which part of the data is valuable.Find a Needle in Haystack

Copyright 2012. Do not duplicate without permission. 34

“Data is widely available; what is scarce is the ability to extract wisdom from it.”

Hal Varian, Google's chief economist

Timeliness

Many situations need the result of analysis immediately

Real-time processing can be a challenge with big data, especially in dynamic data environments like financial trading and social media.

Develop partial results in advance and then do incremental computation

New index structures are required

From:

Copyright 2012. Do not duplicate without permission. 35

Privacy

Manage privacy is both technical and sociological problem

New data source bring new problems:LBS、Microblog….

Share private data while limiting disclosure and ensuring sufficient data utility in the shared data

Differential privacy is a very important step, but it reduces information content too far in order to be useful in most practical cases

From:

Copyright 2012. Do not duplicate without permission. 36

Usability

Copyright 2012. Do not duplicate without permission. 37

TaijiDB - MotivationRDBMS Cloud DBMS

Easy to use Need relatively long time to learn

Structured Query Language No such Language

High Cost Low Cost

Copyright 2012. Do not duplicate without permission. 38

TaijiDB - Motivation

We want to build a low cost system which is easy to use

Copyright 2012. Do not duplicate without permission. 39

TaijiDB - Motivation

Naïve users should feel comfortable with the system

Copyright 2012. Do not duplicate without permission. 40

TaijiDB - Motivation

Real World Applications Big Data

Cloud ComputingCloud Based DBMS

No One-To-All Solutions In the Cloud

TaijiDB: A TitAnIc and Just-In-time DataBase

Copyright 2012. Do not duplicate without permission. 41

系统架构

2012/10/22

HDFSTables & Files & Logs

HMaster

Basic SQL Interface/Application Interfaces

HRegionServer HRegionServer

SSD & Buffer Management

Unified API

Operation & Management Service

Storage Management

Index Management

Query Optimization

Random Sampling Algorithm

E -Commerce

Internet of Things Telecom

Security

Lock

Monitoring

LoadBalance

Metadata

Testing

Multi-Level Index

MLGTAlgorithm

Progress Estimating

Online Aggregation

Cassandra

Keyspace

SuperColumn

Thrift Interface

SuperColumn

SuperColumn

HBase

Front-end Interface

Query Processing

Unified Execution Engine

Storage Manager

Copyright 2012. Do not duplicate without permission. 42

Outline

Introduction to Big Data1

Cloud Computing and Big Data 2

3

4

Conclusion

Challenging Problems

Social Computing

5

Copyright 2012. Do not duplicate without permission. 43

Four Science Paradigms 图灵奖获得者、著名数据库专家Jim Gray博士观察并总结人类自古以

来,在科学研究上,先后历经了实验、理论和计算三种范式。当数据量不断增长和累积到今天,传统的三种范式在科学研究,特别是一些新的研究领域已经无法很好的发挥作用,需要有一种全新的第四种范式来指导新形势下的科学研究

Copyright 2012. Do not duplicate without permission. 44

科学范式 出现时间 主要方法

实验 数千年前 通过观察来描述自然现象

理论 近百年 建立模型、分析

计算 近几十年 对复杂现象利用计算机进行仿真模拟

数据探索(data

exploration)

目前 仪器或仿真器产生数据,计算机软件将这些数据进行处理,而后存储于不同地方,最后要将这些数据高效地汇集、整理、统计、分析、共享和归档,并加以再利用

Big Data and Social Computing 2007年,雅虎首席科学家沃茨博士在《自然》发表《21世纪

的科学》:得益于计算机技术和海量数据库的发展,个人在真实世界的活动得到了前所未有的记录,这种记录的粒度很高,频度在不断增加,为社会科学的定量分析提供了极为丰富的数据。由于能测得更准,计算的更精确,社会科学将脱下“准科学”的外衣,在21世纪全面迈进科学的殿堂。例如,新闻的跟帖,网站的下载,社交平台的互动记录等都为社会行为的研究

提供了大量数据。

Copyright 2012. Do not duplicate without permission. 45

From: 涂子沛《大数据》

我们生活在两个社会中

46

物理社会(现实)国家, 城市企业, 学校

网络社会(虚拟)

FaceBook, MySpaceBlogger, QQ,

社会的数字化与数字的社会化

社会的数字化:数据足迹(data print) 在数字化时代,各色人等有意无意留下的数据足迹越来越丰富

数据足迹是有社会意义(social meaning)的,蕴含着社会结构

数字的社会化:

数据足迹及其结构本身就是社会结构和过程的一个环节,不断塑造着新的社会秩序和关系

For: 社会是计算的对象

物理社会

突发事件管理

舆情分析与监控

社会经济分析

人口系统

交通控制

信息资源管理

48

虚拟社会

广义---整个Web狭义---Web2.0系统(人参与到社会)

交互:HCI、HHI + 共享

促进人与人的交流

与知识信息共享

社会计算为社会

一切社会解释、监控、预测与规划都离不开对数据足迹的收集、整理和分析

社会计算:

基于特定社会需要,在特定社会理论指导下,收集、整理和分析数据足迹,以便进行社会解释、监控、预测与规划的过程和活动

研究意义

社会计算影响人文社会科学的诸多方面

社会学的基本理论方法改变

社会信息资源的管理方式改变

社会(企业\商业)管理模式的改变

社会(舆情)信息的收集与处理改变

所有这些使得社会研究手段改变

50

研究目标

51

建立一个支持人文社会科学研究工作的通用实验

平台

数据收集 数据分析 结论验证

体现人大特色

体现学科渗透

Copyright 2012. Do not duplicate without permission. 52

Copyright 2012. Do not duplicate without permission. 53

Copyright 2012. Do not duplicate without permission. 54

Outline

Introduction to Big Data1

Cloud Computing and Big Data 2

3

4

Conclusion

Challenging Problems

Social Computing

5

Copyright 2012. Do not duplicate without permission. 55

SummaryCloud Computing helps organizations store, manage,

share and analyze their Big Data in an affordable and easy-to-use way

The concept of Big data is wide and empty. We must focus on one or some domains.

Different situations need different type of process: Batch or Stream

Social Computing is a big data problem!

Copyright 2012. Do not duplicate without permission. 56

香山科学会议

网络数据科学与工程

李国杰,华云生,姚期智,成思危

主要议题 社会、经济与IT领域中网络大数据应用

网络数据科学的共性理论基础

网络大数据的良性生态环境构建

中国科学报-李国杰

Copyright 2012. Do not duplicate without permission. 57

XLDB Asia 2012

Invited Talks Reference cases from scientific communities

Astroinformatics, Geoinformatics, Earth…

Reference cases from industry Facebook, eBay, EMC, Taobao…

Research on Big Data Management Laura(IBM), Xiaodong Zhang (Ohio), Martin(MonetDB)…

Panel Discussion Handling Extremely Large Scientific Data NoSQL: the Cure for Big Data? Evolution or Revolution: Database Research for Big Data

Lightning talks

Copyright 2012. Do not duplicate without permission. 58

Useful Resources

Copyright 2012. Do not duplicate without permission. 59

About our Lab

Innovative Data Management Research

Http://idke.ruc.edu.cnGoogle wamdm

Copyright 2012. Do not duplicate without permission. 60

未来每18 个月产生的数据量等于有史以来的数据量之和

--Jim Gray1998图灵奖获奖演说

SELECT questions FROM youSELECT thanks FROM me

Copyright 2012. Do not duplicate without permission. 61