13 NOVEMBER 2014 AARHUS UNIVERSITY AU BIG DATA POSSIBILITIES AND CHALLENGES LARS ARGE PROFESSOR AND CENTER DIRECTOR


13 NOVEMBER 2014

AARHUS

UNIVERSITY AU

BIG DATA

POSSIBILITIES AND CHALLENGES

LARS ARGE

PROFESSOR AND CENTER DIRECTOR

WHY BIG DATA?

“In God we trust - all others must bring data”

W. Edwards Deming (US engineer and statistician, 1900-1993)

WHAT IS BIG DATA?

Wikipedia (en.wikipedia.org/wiki/Big_data)

All-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications

Big Data characteristics

Volume (often very large)

Velocity (often arrives very fast)

Variety (often varied/complex format/type/meaning)

Veracity (often uncertain or imprecise)

BIG DATA AVAILABILITY

Pervasive use of computers and sensors

Ability to acquire/store/process data

→ Big Data collected everywhere

→ Society increasingly “data driven”

Today, as much data is created every two days as was created in total up to 2003!

BIG DATA EXAMPLE: THE INTERNET

What happens in an internet minute?

BIG DATA EXAMPLE: TERRAIN DATA

Previously 30-100 meter data

E.g. Shuttle Radar Topography Mission (SRTM): near-global 90-meter dataset

Now accurate meter or sub-meter data (e.g. LiDAR)

Europe: Denmark, Sweden, Netherlands, …

USA: NC, OH, PA, DE, IA, LA, …

Denmark

Denmark at 30-meter: ~46 million data points (GB)

Current 2-meter model: ~12 billion data points (TB)

Upcoming ½-meter model: ~ 168 billion data points
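
The jump from millions to billions of points follows directly from the grid resolution: halving the cell size quadruples the number of cells. A minimal sketch of that arithmetic, assuming a land area of roughly 43,000 km² for Denmark (this figure and the helper name `grid_cells` are not from the slides). It reproduces the order of magnitude of the slide's numbers, not the exact values.

```python
def grid_cells(area_km2, resolution_m):
    """Approximate number of grid cells needed to cover an area at a given cell size."""
    area_m2 = area_km2 * 1_000_000
    return area_m2 / (resolution_m ** 2)


# Assumption: approximate land area of Denmark; the slides do not state it.
DENMARK_AREA_KM2 = 43_000

counts = {res: grid_cells(DENMARK_AREA_KM2, res) for res in (30, 2, 0.5)}

for res, n in counts.items():
    print(f"{res:>4} m grid: about {n:,.0f} cells")
```

Going from 30-meter to 2-meter cells multiplies the point count by (30/2)² = 225, which is why the same country moves from a gigabyte-scale to a terabyte-scale dataset.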

BIG DATA IMPORTANCE

Nature/Science: Paradigm shift; Science will be about mining data

The Economist: Managing the data deluge is difficult; doing so will transform business and public life

Value is not in data creation but in data analysis!

BIG DATA ANALYSIS IMPORTANCE

New York Times, 11/2/2012: The age of Big Data

“What is Big Data? A meme and a marketing term, for sure, but also shorthand for advancing trends in technology that open the door to a new approach to understanding the world and making decisions. …”

BIG DATA ANALYSIS IMPORTANCE

Dan Ariely: ”Big Data is like teenage sex:

Everyone talks about it

Nobody really knows how to do it

Everyone thinks everyone is doing it

So everyone claims they are doing it…”

And like sex, the ones getting the most are smart enough not to talk about it

BIG DATA INCREASING IMPORTANCE

Increasing government awareness of the importance of Big Data analysis

Big Data as a driver for growth

Governments are increasingly supporting use of data through free data programs

POPULAR BIG DATA ANALYSIS EXAMPLES

Google: Power of statistical methods on Big Data from the web

Google Flu Trends

- Statistically, certain search terms are good indicators of flu activity

Google Translate

- Not: Linguistic analysis to extract the meaning from syntax and vocabulary

- Instead: Statistically most probable translation based on similar translations on the web

POPULAR BIG DATA ANALYSIS EXAMPLES

Netflix: The power of recommendation systems

Analysis of subscriber preferences created hit series “House of Cards”

- Old (1990) British TV series still popular

- Films featuring Kevin Spacey had always done well

- Movies directed by David Fincher (“The Social Network”) had a healthy share

BIG DATA ANALYSIS CHALLENGES

What questions should be asked

What questions can be answered

How can questions be answered

How is Big Data processed efficiently

How can different data be combined

How is uncertainty handled

What about legal issues

What about privacy issues

Researcher-industry-society collaboration important!

AFTERNOON CASES

Many interesting projects/collaborations, including on

Releasing and exploiting government, social media and newspaper data

– and how they are accessed

Utilizing health care data to help mothers, newborns, school kids and hip patients alike

– including in Africa

Improving indoor service logistics, recycling systems and personal products offerings

- as well as national and global markets

Collecting data to model, analyze and improve air quality, traffic behavior, food perception

- as well as animal farming

Many good “Big Data – Big Impact” examples involving researchers, industry and government

MADALGO CASES

MADALGO cases involve efficient processing of big terrain data

Cleaning ocean floor scanning data

Flood risk screening

→ both strong research/publications and new/improved products

Important for success

MADALGO algorithms research

Domain and market knowledge of industry

Startup SCALGO as development “glue”

CENTER FOR MASSIVE DATA ALGORITHMICS

Established in 2007, funded by the Danish National Research Foundation

5-year renewal in 2012 (10-year budget > $25 million)

- International evaluation: “MADALGO is the world-leading center in the area of massive dataset algorithmics”

High level objectives

Advance algorithmic knowledge in massive data algorithms area

Train researchers in world-leading international environment

Be catalyst for multidisciplinary collaboration

Building on:

Algorithms research focus areas:

- I/O-efficient, cache-oblivious and streaming

- Algorithm engineering

Strong international team/environment

Multidisciplinary and industry collaboration

I/O-EFFICIENT ALGORITHMS

Problems involving Big Data on disk

Disk access is 10^6 times slower than main memory access

Large access time amortized by transferring large blocks of data

→ Important to store/access data to take advantage of blocks

I/O-efficient algorithms:

Move as few disk blocks as possible to solve problem

“The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)

I/O-EFFICIENT ALGORITHMS MATTER

Example: Visiting data in order

Array size N = 10 elements

Disk block size B = 2 elements

Main memory size M = 4 elements

→ Algorithm 1: N=10 disk accesses

→ Algorithm 2: N/B=5 disk accesses

Difference between N and N/B huge

N = 256 x 10^6, B = 8000, 1 ms disk access time

N accesses take 256 x 10^3 sec = 4266 min = 71 hours

N/B accesses take 256/8 sec = 32 seconds

[Figure: two on-disk layouts of the 10 elements: 1 5 2 6 7 3 4 10 8 9 and 1 2 10 9 8 5 4 7 6 3]
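
The N versus N/B gap can be reproduced with a small simulation. This is a sketch only: it assumes an LRU replacement policy for the blocks held in main memory and uses a made-up scattered visiting order, neither of which is specified on the slide. The function name `count_block_loads` is likewise hypothetical.

```python
from collections import OrderedDict


def count_block_loads(positions, block_size, memory_size):
    """Count disk block loads when visiting array positions in the given order.

    Assumption: main memory holds memory_size // block_size blocks and
    evicts the least recently used block when full.
    """
    capacity = memory_size // block_size
    cache = OrderedDict()  # block id -> True, ordered from least to most recently used
    loads = 0
    for pos in positions:
        block = pos // block_size
        if block in cache:
            cache.move_to_end(block)       # hit: mark block as most recently used
        else:
            loads += 1                     # miss: block must be read from disk
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used block
    return loads


N, B, M = 10, 2, 4                            # values from the slide
sequential = list(range(N))                   # data visited in storage order
scattered = [0, 2, 4, 6, 8, 1, 3, 5, 7, 9]    # hypothetical order that defeats blocking

sequential_loads = count_block_loads(sequential, B, M)
scattered_loads = count_block_loads(scattered, B, M)


def total_seconds(accesses, ms_per_access=1.0):
    """Total time for a number of disk accesses at a fixed cost per access."""
    return accesses * ms_per_access / 1000.0


big_n, big_b = 256 * 10**6, 8000
slow = total_seconds(big_n)            # one access per element
fast = total_seconds(big_n // big_b)   # one access per block
```

With these inputs the sequential order loads each of the 5 blocks once, while the scattered order loads a block on every one of the 10 visits. The same ratio at N = 256 x 10^6 and B = 8000 is the difference between 256,000 seconds and 32 seconds.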

ALGORITHM ENGINEERING & COLLABORATION

Much of the center's collaboration is driven by algorithm engineering

Design/implementation of practical algorithms & experimentation

- Often provides valuable input to theoretical research work

- Sometimes leads to practical breakthroughs

MADALGO, COWI and SCALGO flood risk collaboration

Started in 2006 as part of Strategic Research Council project

Builds on MADALGO I/O-efficient algorithms research

→ Unique big terrain data solutions and establishment of SCALGO

Collaboration continues, including in Innovation Fund project

→ Unique flood risk products

FLOOD RISK ANALYSIS IMPORTANCE

Important to screen extreme rain or sea-level rise flood risk

50% of Danes worry about their homes being flooded (Userneeds)

90% of Danes say high flood risk affects the decision to buy a house

Cost of 2011 Copenhagen flood over 6 billion kroner (Swiss Re)

Potential to do so using detailed national elevation model

Elevation for roughly every 2x2 meters, or soon ½x½ meter

hundreds or even thousands of points in a family home lot!

DETAILED (BIG) TERRAIN DATA ESSENTIAL

[Figure: Mandø under a 2-meter sea-level rise, 90-meter terrain model vs. 2-meter terrain model]

DETAILED (BIG) TERRAIN DATA ESSENTIAL

[Figure: drainage network (flow accumulation), 90-meter terrain model vs. 2-meter terrain model]

SURFACE FLOW MODELING

Flow accumulation on grid terrain model:

Initially one unit of water in each grid cell

Water (initial and received) distributed from each cell to lowest lower neighbor cell

Flow accumulation of cell is total flow through it

Note

Flow accumulation of cell = size of “upstream area”

Drainage network = cells with high flow accumulation

Flow stops/disappears in depressions -> model often “filled”
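
The flow accumulation rule above can be sketched in a few lines for a grid that fits in memory. This is an illustration of the definition only, not the I/O-efficient algorithm discussed later. It assumes an 8-cell neighborhood, which the slide does not specify, and the function name `flow_accumulation` is hypothetical.

```python
def flow_accumulation(heights):
    """Flow accumulation on a small in-memory grid.

    Each cell starts with one unit of water. Cells are visited from highest
    to lowest, and each cell passes all its water (initial and received) to
    its lowest strictly lower neighbor.
    Assumption: neighbors are the up to 8 surrounding cells.
    """
    rows, cols = len(heights), len(heights[0])
    flow = [[1] * cols for _ in range(rows)]

    # Visit cells in decreasing height order so a cell has received all
    # upstream water before it passes its own total on.
    cells = sorted(
        ((heights[r][c], r, c) for r in range(rows) for c in range(cols)),
        reverse=True,
    )

    for h, r, c in cells:
        best = None
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and heights[nr][nc] < h:
                    if best is None or heights[nr][nc] < heights[best[0]][best[1]]:
                        best = (nr, nc)
        if best is not None:
            flow[best[0]][best[1]] += flow[r][c]
        # If no lower neighbor exists, the cell is a depression and flow stops here.
    return flow


heights = [
    [9, 8, 7],
    [6, 5, 4],
    [3, 2, 1],
]
result = flow_accumulation(heights)
```

On this tilted 3x3 example every cell eventually drains to the lowest corner, so that cell's flow accumulation equals the size of its upstream area, which is the whole grid of 9 cells.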

FLOW ACCUMULATION PERFORMANCE

Natural algorithm accesses disk for each grid cell

“Push” flow down the terrain by visiting cells in height order

Problem since cells of same height scattered over terrain

Performance of commercial systems often not satisfactory

Cannot handle Denmark at 2-meter resolution

We developed I/O-optimal algorithms

Now handle Denmark 2-meter model in a day on limited 4GB desktop!

FLOW ACCUMULATION SUCCESS STORY

Shuttle Radar Topography Mission (SRTM)

Near global dataset

3-arc seconds (90-meter at equator) raster

~60 billion cells stored in roughly 14,000 files

Large USGS HydroSHEDS project produced “hydrologically conditioned” 90-meter data

But upscaled to 500-meter to compute flow accumulation

Using I/O-efficient algorithms: One day on standard 4GB workstation!

FLASH FLOOD MAPPING

Models how surface water gathers in depressions as it rains

Water from watershed of depression gathers in the depression

Depressions fill, leading to (dramatic) increase in neighbor depression watershed size

Flash Flood Mapping: Amount of rain before any given raster cell is below water

[Figure: two depressions, each labeled with its watershed area and volume]
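
The filling behaviour described on this slide can be sketched as a simple event simulation. It is a simplified model under stated assumptions, not the method behind the Flash Flood Mapping product: rain falls uniformly, all rain on a watershed reaches its depression, and a full depression passes its whole watershed on to the depression it spills into. The data layout (`volume`, `watershed_area`, `spills_to`) and the function name are hypothetical.

```python
def fill_rain_depths(depressions):
    """Rain depth (in meters) at which each depression becomes full.

    depressions maps a name to a dict with:
      volume          - storage volume in cubic meters
      watershed_area  - area draining into the depression, in square meters
      spills_to       - name of the depression receiving overflow, or None
    Assumption: the spill links contain no cycles.
    """
    remaining = {n: d["volume"] for n, d in depressions.items()}
    area = {n: d["watershed_area"] for n, d in depressions.items()}
    filled_at = {}
    rain = 0.0

    while remaining:
        # The next depression to fill needs the least additional rain.
        name = min(remaining, key=lambda n: remaining[n] / area[n])
        step = remaining[name] / area[name]
        rain += step
        for n in remaining:
            remaining[n] -= step * area[n]
        del remaining[name]
        filled_at[name] = rain

        # Route the full depression's watershed to the first unfilled
        # depression downstream, if any.
        target = depressions[name]["spills_to"]
        while target is not None and target not in remaining:
            target = depressions[target]["spills_to"]
        if target is not None:
            area[target] += area[name]

    return filled_at


example = {
    "A": {"volume": 10.0, "watershed_area": 1000.0, "spills_to": "B"},
    "B": {"volume": 100.0, "watershed_area": 1000.0, "spills_to": None},
}
filled = fill_rain_depths(example)
```

In this example A fills after 10 mm of rain. From then on B collects water from both watersheds, so it fills after 55 mm rather than the 100 mm it would need on its own, which illustrates the increase in watershed size when a neighboring depression fills.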

FLASH FLOOD MAPPING EXAMPLE

[Figure sequence: flooding after 10 mm, 50 mm, 100 mm and 150 mm of rain]

FLASH FLOOD MAPPING SUCCESS STORY

Based on collaborative research, COWI markets a SCALGO-produced Flash Flood Mapping product in Denmark under the name “Skybrudskort®” (cloudburst map)

Produced for entire country

Sold to over half of local governments

Jones Edmunds compared Flash Flood Mapping to the result of an advanced dynamic model (ICPR) for Marion County, Florida

Results very close

Significantly more detailed

Cost under 5%

AFTERNOON: ONLINE DEMONSTRATION

CONCLUSIONS

Hope to have convinced you that

Big Data has huge potential

- in research, industry and society

Exploiting Big Data is challenging

- research-industry-society collaboration is one way to success

Thanks!

www.madalgo.au.dk
