26
Big data : challenges and opportunities for mathematicians Citation for published version (APA): Di Bucchianico, A. (2015). Big data : challenges and opportunities for mathematicians. 51ste Nederlands Mathematisch Congres (NMC 2015), 14-15 April 2015, Leiden, Nederland, Leiden, Netherlands. Document status and date: Published: 01/01/2015 Document Version: Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers) Please check the document version of this publication: • A submitted manuscript is the version of the article upon submission and before peer-review. There can be important differences between the submitted version and the official published version of record. People interested in the research are advised to contact the author for the final version of the publication, or visit the DOI to the publisher's website. • The final author version and the galley proof are versions of the publication after peer review. • The final published version features the final layout of the paper including the volume, issue and page numbers. Link to publication General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. • Users may download and print one copy of any publication from the public portal for the purpose of private study or research. • You may not further distribute the material or use it for any profit-making activity or commercial gain • You may freely distribute the URL identifying the publication in the public portal. If the publication is distributed under the terms of Article 25fa of the Dutch Copyright Act, indicated by the “Taverne” license above, please follow below link for the End User Agreement: www.tue.nl/taverne Take down policy If you believe that this document breaches copyright please contact us at: [email protected] providing details and we will investigate your claim. Download date: 03. Oct. 2020

Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

Big data : challenges and opportunities for mathematicians

Citation for published version (APA):Di Bucchianico, A. (2015). Big data : challenges and opportunities for mathematicians. 51ste NederlandsMathematisch Congres (NMC 2015), 14-15 April 2015, Leiden, Nederland, Leiden, Netherlands.

Document status and date:Published: 01/01/2015

Document Version:Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers)

Please check the document version of this publication:

• A submitted manuscript is the version of the article upon submission and before peer-review. There can beimportant differences between the submitted version and the official published version of record. Peopleinterested in the research are advised to contact the author for the final version of the publication, or visit theDOI to the publisher's website.• The final author version and the galley proof are versions of the publication after peer review.• The final published version features the final layout of the paper including the volume, issue and pagenumbers.Link to publication

General rightsCopyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright ownersand it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

• Users may download and print one copy of any publication from the public portal for the purpose of private study or research. • You may not further distribute the material or use it for any profit-making activity or commercial gain • You may freely distribute the URL identifying the publication in the public portal.

If the publication is distributed under the terms of Article 25fa of the Dutch Copyright Act, indicated by the “Taverne” license above, pleasefollow below link for the End User Agreement:www.tue.nl/taverne

Take down policyIf you believe that this document breaches copyright please contact us at:[email protected] details and we will investigate your claim.

Download date: 03. Oct. 2020

Page 2: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

Dutch Mathematical Congress April 15, 2015

Data Science Center Eindhoven

Big Data: Challenges and Opportunities for Mathematicians Alessandro Di Bucchianico

Page 3: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

Contents

1. Big Data terminology 2. Various mathematical topics 3. Conclusions

Presenter
Presentation Notes
Word cloud from http://www.bath.ac.uk/math-sci/postgraduate/samba/images/samba-word-cloud.png
Page 4: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

What is big data?

Term coined by John Mashey, chief scientist at Silicon Graphics in the 1990’s …I was using one label for a range of issues, and I wanted the simplest, shortest phrase to convey that the boundaries of computing keep advancing… But : chemometrics has a long history of analyzing “large” data sets

Presenter
Presentation Notes
http://www.forbes.com/sites/gilpress/2013/05/09/a-very-short-history-of-big-data/ http://bits.blogs.nytimes.com/2013/02/01/the-origins-of-big-data-an-etymological-detective-story/?_r=0 http://www.midsizeinsider.com/en-us/article/brief-history-big-data-everyone-should-read#.VSorLvmH4sc
Page 5: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

How big is big data?

Horizon 2020 EU-ICT16-2015 call: "extremely large" means "so large that today no amount of money could buy a system capable of handling it".

Page 6: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

Four V’s of Big Data

Presenter
Presentation Notes
Picture taken from ey.com More V’s have been introduced like V for Value (mostly in business)
Page 7: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

High-dimensional data: “𝒏 ≪ 𝒑”

Presenter
Presentation Notes
book Van de Geer / Bühlmann http://www.stat.rice.edu/~gallen/examples_high_dim_data.pdf: DNA : 20,000 genes of 200 people fMRI: similar voxels (time series) of few people Netflix: ratings of 18,000 movies
Page 8: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

Big Data buzz words

• scalable algorithm • data science / data scientist • streaming data • data warehousing and ETL • in-memory database • predictive analytics / predictive modelling • high performance computing (exascale computing)

Presenter
Presentation Notes
http://www.statistics.com/landing-page/data-analytics/ OLAP ETL = Extract, Transform, Load
Page 9: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

Machine learning and data mining

Machine learning focuses on prediction based on known properties “learned” from training data Data mining focuses on the discovery of (previously) unknown properties of the data

Leo Breiman: Two cultures

Presenter
Presentation Notes
from http://en.wikipedia.org/wiki/Machine_learning
Page 10: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

MapReduce

Presenter
Presentation Notes
Picture from presentation at http://www.slideshare.net/lynnlangit/hadoop-mapreduce-fundamentals-21427224
Page 11: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

Open source tools

Presenter
Presentation Notes
Apache Hadoop: open source software framework for distributed storing (HDFS) and processing big data (MapReduce) Picture is said to be favourite toy of son of one of the developers Hive = A SQL-like query and data warehouse engine Pig = high level tool for producing Hadoop MapReduce programs Apache Mahout = person riding elephant, here collection of scalable machine learning algorithms, often built on Apache Hadoop Apache Flink
Page 12: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

Course @ Columbia University

Presenter
Presentation Notes
Course web site http://columbiadatascience.com/doing-data-science/
Page 13: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

Topic: Network structure

• dependencies between nodes in networks measured by degrees of direct neighbours

• assortativity coefficient of Newman is nothing but Pearson’s correlation statistic

• inconsistent estimator when variances are infinite • Spearman rank correlation behaves better but

calculation is computationally intensive • requires heavy asymptotics

Van der Hofstad, R. and Litvak, N. (2014) Degree-Degree Dependencies in Random Graphs with Heavy-Tailed Degrees. Internet Mathematics, 10 (3-4). pp. 287-334

Page 14: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

Topic : Network monitoring

0

0.002

0.004

0.006

0.008

0.01

0.012

1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004

Measu

re V

alu

e

Time

Al-Qaeda Network Measures

Betweeness

Density

Closeness

Challenges: • monitor high-number of variables • models to capture structural changes • scalable algorithms for likelihood ratio calculations

Closeness CUSUM Statistic for Al-Qaeda (1994-2004)

0

1

2

3

4

5

6

7

8

1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004

C+

Change Point

Signal

Page 15: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

Topic: Topological Data Analysis

Common Big Data problem is to choose relevant “features” from high-dimensional data Combination of machine learning with topological tools (simplices, cohomology) yields new algorithms for clustering

Presenter
Presentation Notes
http://en.wikipedia.org/wiki/Topological_data_analysis https://www.ias.edu/articles/topological-data-analysis http://www.ayasdi.com/blog/bigdata/why-topological-data-analysis-works/ https://youtu.be/XfWibrh6stw Gunnar Carsson talk by Robert Ghrist http://www.wired.com/2013/10/topology-data-sets/all/ https://www.ias.edu/about/publications/ias-letter/articles/2013-summer/lesnick-topological-data-analysis http://www.ima.umn.edu/2013-2014/ http://www.slideshare.net/paul_kyeong/topological-data-analysis-with-examples
Page 16: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

Topic: Uncertainty Quantification

Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the initial

conditions of parameters 2. parametrisation of design building blocks For 1) : polynomial chaos (Wiener chaos) , inverse statistical models, Bayesian analysis (calibration) For 2): Model Order Reduction

Presenter
Presentation Notes
http://en.wikipedia.org/wiki/Uncertainty_quantification
Page 17: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

Topic: Streaming algorithms

High volume high velocity data makes exact counting of frequencies or number of different items impossible but approximate answers suffice in big data. New ideas using probabilistic means (random hash functions): • Count-Min Sketch • MinHash

Presenter
Presentation Notes
talk Nikhil Bansal http://en.wikipedia.org/wiki/Count-min_sketch https://sites.google.com/site/countminsketch/
Page 18: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

Topic: Compressed Sensing

Presenter
Presentation Notes
example from Studygroup Mathematics with Industry http://dsp.rice.edu/cs paper Candes surprising use of random matrices undersampling / sparsity / coherence beating Shannon SWI 2015 problem
Page 19: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

Topic: Recommendation systems

Approach (Moitra): Use connection to existence theorems for polynomials solutions to algebraic equations and develop scalable algorithms for nonnegative matrix factorizations

Presenter
Presentation Notes
https://www.ias.edu/about/publications/ias-letter/articles/2012-summer/big-data-moitra
Page 20: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

Statistical challenges

Big Data shows phenomena not present in “small data”: • heterogeneity • spurious correlations • noise accumulation • incidental endogeneity

This requires critically revising statistical models and developing new tools

Presenter
Presentation Notes
http://bulletin.imstat.org/2014/10/ims-presidential-address-let-us-own-data-science/ Fan Han Liu - http://arxiv.org/abs/1308.1479 Fürnkranz states that data mining constructs hypotheses while statistics tests existing hypotheses – cf talk Lunteren on modern advanced exploratory data analysis https://normaldeviate.wordpress.com/2012/06/12/statistics-versus-machine-learning-5-2/ http://simplystatistics.org/2014/05/22/10-things-statistics-taught-us-about-big-data-analysis/
Page 21: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

Big Data Research Funding

Big Data Challenges Digital Sciences High Performance Computing

Presenter
Presentation Notes
NWO Big Data Challenge Big Data Value PPP Digital Science + HPC -> Anni Hellmann document
Page 22: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

Mathematical contributions in general

• Modelling • Performance of algorithms • Statistical thinking

We need to work hard to make mathematical contributions more explicit.

Page 23: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

Actions

• get involved in local data science centres • get into touch with NWO • get into touch with EU (e.g., Nov 6 2014 Workshop

“Mathematics for Digital Sciences”) • 2014 IMS presidential address Bin Yu: “work on real problems, relevant theory will follow”

Page 24: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

Journals

Page 25: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

Learn from the past

Page 26: Big data : challenges and opportunities for …Topic: Uncertainty Quantification Virtual prototyping using mathematical models. UQ does not deal with 1. unknown uncertainty in the

Conclusions

• interesting mathematics behind big data • statistics • combinatorics • numerical mathematics • topology • ...

• efforts required • learn big data concepts • get into touch with computer scientists

• actions required to ensure funding • data science centres • funding agencies

Presenter
Presentation Notes
http://www2.warwick.ac.uk/fac/sci/wdsi/events/yobd/bigmodels/