13 NOVEMBER 2014
AARHUS
UNIVERSITY AU
BIG DATA
POSSIBILITIES AND CHALLENGES
LARS ARGE
PROFESSOR AND CENTER DIRECTOR
WHY BIG DATA?
“In God we trust
- all others must bring data”
W. Edwards Deming (US engineer and statistician, 1900-1993)
WHAT IS BIG DATA?
Wikipedia (en.wikipedia.org/wiki/Big_data)
All-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications
Big Data characteristics
Volume (often very large)
Velocity (often arrives very fast)
Variety (often varied/complex format/type/meaning)
Veracity (often uncertain or imprecise)
BIG DATA AVAILABILITY
Pervasive use of computers and sensors
Ability to acquire/store/process data
→ Big Data collected everywhere
→ Society increasingly “data driven”
Today as much data is created every two days as was created in total up to 2003!
BIG DATA EXAMPLE: THE INTERNET
What happens in an
internet minute?
BIG DATA EXAMPLE: TERRAIN DATA
Previously 30-100 meter data
E.g. Shuttle Radar Topography Mission (SRTM):
near global 90-meter dataset
Now accurate meter or sub-meter data (e.g. LiDAR)
Europe: Denmark, Sweden, Netherlands, …
USA: NC, OH, PA, DE, IA, LA, …
Denmark
Denmark at 30-meter: ~46 million data points (GB)
Current 2-meter model: ~12 billion data points (TB)
Upcoming ½-meter model: ~ 168 billion data points
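The point counts above follow from dividing the area covered by the area of one grid cell. A minimal sketch of that arithmetic, assuming a land area of roughly 43,000 km² for Denmark (a figure not given on the slide) and a hypothetical helper named `grid_cells`:

```python
def grid_cells(area_km2: float, resolution_m: float) -> int:
    """Number of square cells of side `resolution_m` needed to cover `area_km2`."""
    area_m2 = area_km2 * 1_000_000
    return round(area_m2 / resolution_m ** 2)

# Assumption: approximate land area of Denmark; the slide does not state it.
DENMARK_AREA_KM2 = 43_000

for resolution_m in (30, 2, 0.5):
    cells = grid_cells(DENMARK_AREA_KM2, resolution_m)
    print(f"{resolution_m:>4} m grid: ~{cells:,} cells")
```

The results land in the same range as the slide's figures (tens of millions, about ten billion, and well over a hundred billion); the exact numbers depend on the area and extent actually covered by each model.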
BIG DATA IMPORTANCE
Nature/Science: Paradigm shift; Science will be about mining data
The Economist: Managing the data deluge is difficult;
doing so will transform business and public life
Value is not in data creation but in data analysis!
BIG DATA ANALYSIS IMPORTANCE
New York Times, 11/2/2012: The age of Big Data
“What is Big Data? A meme and a marketing term, for sure, but also shorthand for advancing trends in technology that open the door to a new approach to understanding the world and making decisions. …”
BIG DATA ANALYSIS IMPORTANCE
Dan Ariely: “Big Data is like teenage sex:
Everyone talks about it
Nobody really knows how to do it
Everyone thinks everyone is doing it
So everyone claims they are doing it…”
And like sex, the ones getting
the most are smart enough not
to talk about it
BIG DATA INCREASING IMPORTANCE
Increasing government awareness of importance of
Big Data analysis
Big Data as a driver for growth
Governments are increasingly supporting
use of data through free data programs
POPULAR BIG DATA ANALYSIS EXAMPLES
Google: Power of statistical methods on Big Data from the web
Google Flu Trends
- Statistically, certain search terms are
good indicators of flu activity
Google Translate
- Not: Linguistic analysis to extract the meaning from syntax and vocabulary
- Instead: Statistically most probable translation based on similar translations on the web
POPULAR BIG DATA ANALYSIS EXAMPLES
Netflix: The power of recommendation systems
Analysis of subscriber preferences
created hit series “House of Cards”
- Old (1990) British TV series still popular
- Films featuring Kevin Spacey had always done well
- Movies directed by David Fincher (“The Social Network”)
had a healthy share
BIG DATA ANALYSIS CHALLENGES
What questions should be asked
What questions can be answered
How can questions be answered
How is Big Data processed efficiently
How can different data be combined
How is uncertainty handled
What about legal issues
What about privacy issues
Researcher-industry-society collaboration important!
AFTERNOON CASES
Many interesting projects/collaborations, including on
Releasing and exploiting government, social media and newspaper data
– and how they are accessed
Utilizing health care data to help mothers, newborns, school kids and hip patients alike
– including in Africa
Improving indoor service logistics, recycling systems and personal product offerings
- as well as national and global markets
Collecting data to model, analyze and improve air quality, traffic behavior, food perception
- as well as animal farming
Many good “Big Data – Big Impact” examples involving researchers, industry and government
MADALGO CASES
MADALGO cases involve efficient processing of big terrain data
Cleaning ocean floor scanning data
Flood risk screening
→ both strong research/publications
and new/improved products
Important for success
MADALGO algorithms research
Domain and market knowledge of industry
Startup SCALGO as development “glue”
CENTER FOR MASSIVE DATA ALGORITHMICS
Established in 2007, funded by the Danish National Research Foundation
5-year renewal in 2012 (10-year budget > $25 million)
- International evaluation: “MADALGO is the world-leading
center in the area of massive dataset algorithmics”
High level objectives
Advance algorithmic knowledge in massive data algorithms area
Train researchers in world-leading international environment
Be catalyst for multidisciplinary collaboration
CENTER FOR MASSIVE DATA ALGORITHMICS
Building on:
Algorithms research focus areas:
- I/O-efficient, cache-oblivious and streaming
- Algorithm engineering
Strong international team/environment
Multidisciplinary and industry collaboration
I/O-EFFICIENT ALGORITHMS
Problems involving Big Data on disk
Disk access is 10^6 times slower than main memory access
Large access time amortized by transferring large blocks of data
→ Important to store/access data to take advantage of blocks
I/O-efficient algorithms:
Move as few disk blocks as possible to solve problem
“The difference in speed between modern CPU and disk technologies is
analogous to the difference in speed in sharpening a pencil using a
sharpener on one’s desk or by taking an airplane to the other side of the
world and using a sharpener on someone else’s desk.” (D. Comer)
I/O-EFFICIENT ALGORITHMS MATTER
Example: Visiting data in order
Array size N = 10 elements
Disk block size B = 2 elements
Main memory size M = 4 elements
→ Algorithm 1: N=10 disk accesses
→ Algorithm 2: N/B=5 disk accesses
Difference between N and N/B huge
N = 256 x 10^6, B = 8000, 1 ms disk access time
N accesses take 256 x 10^3 sec = 4266 min = 71 hours
N/B accesses take 256/8 sec = 32 seconds
[Figure: two example orderings of the 10 elements: 1 5 2 6 7 3 4 10 8 9 and 1 2 10 9 8 5 4 7 6 3]
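The gap between N and N/B disk accesses can be reproduced with a small simulation. The sketch below is illustrative only: it models the two algorithms as two on-disk layouts of the same 10 elements, assumes main memory keeps M/B blocks with least-recently-used eviction, and uses a hypothetical `scattered` layout chosen for the example rather than the orderings shown on the slide.

```python
from collections import OrderedDict

def count_block_loads(layout, block_size, memory_size):
    """Count disk block loads when visiting the values 1..N in increasing order.

    layout[i] is the value stored at disk position i. Memory holds
    memory_size // block_size blocks; LRU eviction is an assumption.
    """
    position = {value: i for i, value in enumerate(layout)}
    capacity = memory_size // block_size
    cache = OrderedDict()  # block id -> True, ordered by recency of use
    loads = 0
    for value in range(1, len(layout) + 1):
        block = position[value] // block_size
        if block in cache:
            cache.move_to_end(block)       # hit: mark as most recently used
        else:
            loads += 1                     # miss: one disk access
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used block
    return loads

N, B, M = 10, 2, 4
sequential = list(range(1, N + 1))            # consecutive values share a block
scattered = [1, 6, 2, 7, 3, 8, 4, 9, 5, 10]   # hypothetical worst-case layout

print(count_block_loads(sequential, B, M))  # 5 loads, i.e. N/B
print(count_block_loads(scattered, B, M))   # 10 loads, i.e. N

# The slide's larger example, using integer milliseconds to keep the arithmetic exact.
N_big, B_big, access_ms = 256 * 10**6, 8000, 1
print(N_big * access_ms / 1000 / 3600)      # hours for N accesses (about 71)
print(N_big // B_big * access_ms / 1000)    # seconds for N/B accesses
```

Storing consecutive elements in the same block is what turns N accesses into N/B; the algorithmic work is otherwise identical.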
ALGORITHM ENGINEERING & COLLABORATION
Much of the center's collaboration is driven by algorithm engineering
Design/implementation of practical algorithms & experimentation
- Often provide valuable input to theoretical research work
- Sometimes leads to practical breakthroughs
MADALGO, COWI and SCALGO flood risk collaboration
Started in 2006 as part of Strategic Research Council project
Builds on MADALGO I/O-efficient algorithms research
→ Unique big terrain data solutions and establishment of SCALGO
Collaboration continues, including in an Innovation Fund Denmark project
→ Unique flood risk products
FLOOD RISK ANALYSIS IMPORTANCE
Important to screen extreme rain or sea-level rise flood risk
50% of Danes worry about their homes being flooded (Userneeds)
90% of Danes say high flood risk affects the decision to buy a house
Cost of 2011 Copenhagen flood over 6 billion kroner (Swiss Re)
Potential to do so using detailed national elevation model
Elevation for roughly every 2x2 meter, or soon ½x½ meter
→ hundreds or even thousands of points in a family home lot!
DETAILED (BIG) TERRAIN DATA ESSENTIAL
Mandø: 2 meter sea-level rise
[Figure: 90 meter terrain model vs. 2 meter terrain model]
DETAILED (BIG) TERRAIN DATA ESSENTIAL
Drainage network (flow accumulation)
[Figure: 90 meter terrain model vs. 2 meter terrain model]
SURFACE FLOW MODELING
Flow accumulation on grid terrain model:
Initially one unit of water in each grid cell
Water (initial and received) distributed from each cell to lowest lower neighbor cell
Flow accumulation of cell is total flow through it
Note
Flow accumulation of cell = size of “upstream area”
Drainage network = cells with high flow accumulation
Flow stops/disappears in depressions → model often “filled”
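The flow accumulation rule above can be sketched for a small grid that fits in memory. This is an illustrative in-memory version, not the I/O-efficient algorithm; it assumes an 8-cell neighborhood and a hypothetical function name `flow_accumulation`, and it does not fill depressions, so flow simply stops in any cell with no lower neighbor.

```python
def flow_accumulation(heights):
    """Flow accumulation where each cell sends all its water to its lowest lower neighbor."""
    rows, cols = len(heights), len(heights[0])
    flow = [[1] * cols for _ in range(rows)]  # initially one unit of water per cell

    # Visit cells from highest to lowest, so every cell has already received
    # all of its upstream water before it passes water on.
    order = sorted(
        ((heights[r][c], r, c) for r in range(rows) for c in range(cols)),
        reverse=True,
    )
    for h, r, c in order:
        best = None  # (height, row, col) of the lowest strictly lower neighbor
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols:
                    nh = heights[nr][nc]
                    if nh < h and (best is None or nh < best[0]):
                        best = (nh, nr, nc)
        if best is not None:
            flow[best[1]][best[2]] += flow[r][c]
    return flow

# A small bowl: every cell drains into the center, whose upstream area is all 9 cells.
bowl = [
    [5, 4, 5],
    [4, 1, 4],
    [5, 4, 5],
]
print(flow_accumulation(bowl)[1][1])  # 9
```

The height-ordered visit is exactly the access pattern that becomes expensive once the grid no longer fits in main memory.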
FLOW ACCUMULATION PERFORMANCE
Natural algorithm accesses disk for each grid cell
“Push” flow down the terrain by visiting cells in height order
Problem since cells of same height scattered over terrain
Performance of commercial systems often not satisfactory
Cannot handle Denmark at 2-meter resolution
We developed I/O-optimal algorithms
Now handle Denmark 2-meter model in a day on limited 4GB desktop!
FLOW ACCUMULATION SUCCESS STORY
Shuttle Radar Topography Mission (SRTM)
Near global dataset
3 arc-second (90-meter at equator) raster
~60 billion cells stored in roughly 14,000 files
Large USGS HydroSHEDS project produced
“hydrologically conditioned” 90-meter data
But upscaled to 500-meter to compute flow accumulation
Using I/O-efficient algorithms: One day on standard 4GB workstation!
FLASH FLOOD MAPPING
Models how surface water gathers in depressions as it rains
Water from watershed of depression gathers in the depression
Depressions fill, leading to (dramatic) increase in neighbor depression watershed size
Flash Flood Mapping: Amount of rain before any given raster cell is below water
[Figure: depressions annotated with watershed area and volume]
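For a single depression, the amount of rain before a cell is under water follows from the two quantities in the figure: the volume that must fill below the cell's height, and the watershed area that collects the rain. The sketch below is a simplification under stated assumptions: all rain on the watershed runs into the depression, there is no infiltration or drainage, and no neighboring depression spills into it. The function name and parameters are hypothetical.

```python
def rain_to_flood_mm(depression_heights_m, cell_height_m, cell_area_m2, watershed_area_m2):
    """Rain depth (mm) needed before the water level in a depression reaches cell_height_m."""
    # Volume of water stored below the target level, summed over the depression's cells.
    volume_m3 = sum(max(0.0, cell_height_m - h) for h in depression_heights_m) * cell_area_m2
    # That volume is collected from the whole watershed, so divide by its area.
    return 1000.0 * volume_m3 / watershed_area_m2

# Three 2x2 meter cells at heights 0, 1 and 2 m, fed by a 1200 m² watershed.
print(rain_to_flood_mm([0.0, 1.0, 2.0], 2.0, 4.0, 1200.0))  # 10.0
```

A large watershed feeding a small depression floods after very little rain, which is why the jump in watershed size when a neighboring depression fills matters so much.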
FLASH FLOOD MAPPING EXAMPLE
After 10mm rain
After 50mm rain
After 100mm rain
After 150mm rain
After 150mm rain
FLASH FLOOD MAPPING SUCCESS STORY
Based on collaborative research, COWI markets the SCALGO-produced
Flash Flood Mapping product in Denmark
under the name “Skybrudskort®”
Produced for entire country
Sold to over half of local governments
Jones Edmunds compared Flash Flood Mapping to result of
advanced dynamic model (ICPR) for Marion County, Florida
Results very close
Significantly more detailed
Cost under 5%
AFTERNOON: ONLINE DEMONSTRATION
CONCLUSIONS
Hope to have convinced you that
Big Data has huge potential
- in research, industry and society
Exploiting Big Data challenging
- research-industry-society collaboration
one way to success
Thanks!
www.madalgo.au.dk