Data Science: A New Frontier for Design Theory and Methods Akin Kazakci MINES ParisTech [email protected] THANKS TO:

Data science as a new frontier for design


Question: Do you disagree with the following statement?

« Design is an essential driver of innovation and economic growth. »

What is the role of design in the contemporary challenges society is facing, as seen by the decision-makers?

Motto of H2020: economic growth, job creation, societal well-being, Europe’s competitiveness…

80 billion a year


yet…

Claim 1

•  Design research is falling behind in facing contemporary challenges (enough with the chairs)
•  Claim 1a: Too much in-breeding and repetition
•  Claim 1b: A huge amount of work is based on ideas from the 1980s

Data deluge – a tremendous challenge

Mattmann, C. A., « A vision for data science », Nature, 2013.

Roughly 30,000 modern laptops’ disk capacity

Roughly 1,500,000,000 times more per year

Some orders of magnitude…

Image courtesy of Vladimir Gligorov, CERN

Even harder for some…

Image courtesy of Vladimir Gligorov, CERN

A side remark… on what’s important

« Is this a cat? » > « What is the role of the Higgs boson in the structure of the universe? »

Huge boost for data science / research

National Big Data R&D Initiative of the White House in 2012

•  NSF, NIH and DARPA; the Research Data Alliance (RDA)

•  NYU, University of Washington, UC Berkeley (with five-year, $37.8M funding from the Moore and Sloan foundations)

•  In Europe: University of Amsterdam, Edinburgh University, Imperial College (with Zhejiang University)

•  In France, Université Paris-Saclay has created a Centre for Data Science

Davenport and Patil, Harvard Business Review, 2012

Data deluge: tremendous opportunity

« Mastering the creation of value from big data … will be a cornerstone in future economic development and societal well-being:
-  30% of the global market for European suppliers;
-  100,000 new jobs in Europe by 2020;
-  10% lower energy consumption, better health-care outcomes and more productive industrial machinery »

Source: EU Commission, Digital Agenda for Europe, Fact Sheet Data cPPP

Data science: new phenomenon or déjà-vu?

« techniques for processing large amounts of information »

« statistical and mathematical methods »

« techniques like mathematical programming »

« methodologies like operations research »

« no single established name yet, let us call it IT »

« higher-order thinking through computer programs »

The death of OR? – before it delivers its promise

« OR is dead, even though it has yet to be buried. »

« Little chance of resurrection, because of little understanding of its demise. »

Salvation of OR – by design

prediction paradigm should be replaced… by a paradigm directed at designing a desirable future and inventing ways of bringing it about.

(suggest that) OR replace its problem-solving orientation by one that focuses on planning and design

Claim 2

•  Claim 2: To avoid facing the same difficulties as OR, data science should go beyond the predictive (analytics) paradigm and embrace a design paradigm

•  Claim 2a: Data science cannot expect to solve the challenges imposed by the data solely through technical breakthroughs: a renewal of data science methodology is also needed

•  Hypothesis: More than 50 years of research in design have allowed the design research community to gather invaluable insights about the nature of creative activities

•  Corollary (Claim 2b): Design theory and methods can provide, at least to some extent, the much-needed insights

Analysing a data challenge

Learning to discover: HiggsML Challenge

Winners

Record number of participants

Great improvements

Data science challenges: how effective are they for innovation?

•  1800+ teams, developing methods for detecting the Higgs on CERN data

•  Important improvements (discovery significance rose from 3.2 to 3.8)

•  Big buzz, huge visibility

•  Bringing the ML and physics communities closer

•  Study of available data
-  Forums
-  Documentation
-  Participants’ blog entries
-  GitHub code

136 topics, 1400+ posts

•  Qualitative interpretation combined with C-K modelling of participants’ strategies

Analysis of design strategies

C space, root concept: Achieve 5σ

K space, discovery condition: A discovery is claimed when we find a ‘region’ of the space where there is a significant excess of ‘signal’ events (rejecting the background-only hypothesis with a p-value less than 2.9 × 10⁻⁷, corresponding to 5 sigma).
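The correspondence between 5 sigma and the quoted p-value can be checked with the standard library alone. This is a minimal sketch that assumes the usual one-sided Gaussian tail convention; the function name `one_sided_p_value` is illustrative and does not come from the HiggsML material.

```python
import math


def one_sided_p_value(n_sigma: float) -> float:
    """Upper-tail probability of a standard normal beyond n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))


p_5_sigma = one_sided_p_value(5.0)
print(f"p-value at 5 sigma: {p_5_sigma:.2e}")  # about 2.9e-07
```

The same function gives the p-value behind any other significance level, for example the 3.2 and 3.8 sigma figures quoted for the challenge.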

K space, problem formulation: Traditional classification setting: « the task of the participants is to train a classifier g based on the training data D with the goal of maximizing the AMS (7) on a held-out (test) data set » (HiggsML documentation). With 2 tweaks:
-  Training set events are « weighted »
-  Maximize the « Approximate Median Significance » (AMS)
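The AMS objective named above can be sketched in a few lines. The sketch assumes the form given in the challenge documentation, AMS = sqrt(2((s + b + b_reg) ln(1 + s / (b + b_reg)) - s)), where s and b are the weighted sums of selected signal and background events and b_reg = 10 is a regularisation constant. The constant and the helper name `ams` should be checked against the official documentation before being relied on.

```python
import math


def ams(s: float, b: float, b_reg: float = 10.0) -> float:
    """Approximate Median Significance (assumed form, b_reg assumed to be 10).

    s: weighted sum of true signal events selected as signal
    b: weighted sum of background events selected as signal
    """
    s = max(s, 0.0)
    inner = (s + b + b_reg) * math.log1p(s / (b + b_reg)) - s
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(2.0 * max(inner, 0.0))


print(round(ams(10.0, 100.0), 3))
```

Because the score depends on the totals s and b of the whole selected region, it is not a sum of per-event losses, which is what makes it awkward to plug into standard training procedures.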

C space, workflow: Select a classification method (SVM, decision trees, NN, …) → Pre-processing → Choose hyper-params → Train → Optimize for X

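K space, performance metrics: During the overall learning process, performance metrics are used to supervise the quality and convergence of a learned model. A traditional metric is accuracy, (TP + TN) / (TP + TN + FP + FN), where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives. Note that for the HiggsML AMS, TP (s) and FP (b) are of particular importance.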
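To make the difference between the two kinds of metric concrete, the sketch below computes plain accuracy next to the weighted quantities s (true positives) and b (false positives) that feed the AMS. The function names and the toy data are illustrative assumptions; labels use 1 for signal and 0 for background.

```python
from typing import Sequence, Tuple


def accuracy(labels: Sequence[int], predictions: Sequence[int]) -> float:
    """Fraction of events whose predicted class matches the true class."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def weighted_s_b(
    labels: Sequence[int],
    predictions: Sequence[int],
    weights: Sequence[float],
) -> Tuple[float, float]:
    """Weighted true positives (s) and weighted false positives (b)."""
    s = sum(w for y, p, w in zip(labels, predictions, weights) if p == 1 and y == 1)
    b = sum(w for y, p, w in zip(labels, predictions, weights) if p == 1 and y == 0)
    return s, b


labels = [1, 1, 0, 0]
predictions = [1, 0, 1, 0]
weights = [0.5, 0.2, 2.0, 3.0]
print(accuracy(labels, predictions))               # 0.5
print(weighted_s_b(labels, predictions, weights))  # (0.5, 2.0)
```

Accuracy treats every event equally, whereas s and b ignore everything predicted as background and weight each selected event.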

K space, ensemble methods: boosting, bagging, others

(Extended) dominant design

Traditional workflow = dominant design

(Diagram columns: C space, K space)

A deviation from dominant design

C space: Achieve 5σ
-  Dominant design: select a classification method (SVM, decision trees, NN, …), pre-processing, choose hyper-params, train, optimize for accuracy
-  Deviation: integrate AMS directly in training, during gradient boosting (John) or during node split in random forest (John)

K space:
-  Discovery condition: A discovery is claimed when we …
-  Problem formulation: Traditional classification setting…
-  Cross-validation: Techniques for evaluating how a …
-  Ensemble methods
-  « Gradient boosting methods fit a classifier to the 'per data point loss' and since AMS is not a sum of per data point (event) losses, it's not obvious how to use AMS as a loss in gradient boosting » (Andre Holzner)
-  AMS: 3.3. « The node split works by looking for the split that maximises the AMS of one side of the split when predicting it as pure signal » (John)
-  An alternative may be to « use AUC in gradient boosting till you get to the max cv result and then tried to move forward with an AMS loss function from that point ». « In principle, the AMS approximate function is derivable (http://tinyurl.com/ov5pedq) at a node level (s and b being the totals of other nodes, considered constant, and x, w being the probability prediction and weight for the node to be split) and one could rewrite the part of code where the objective function is evaluated, replacing the sums with a different calculation » (Giulio Casa)
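The node-split idea attributed to John can be illustrated on a single feature. The sketch below is not the participant's code: it is a hypothetical reconstruction that scans candidate thresholds and keeps the one whose left or right side, if predicted as pure signal, yields the highest AMS. The AMS form (with b_reg = 10) and all names are assumptions.

```python
import math
from typing import Optional, Sequence, Tuple


def ams(s: float, b: float, b_reg: float = 10.0) -> float:
    """Approximate Median Significance (assumed form, b_reg assumed to be 10)."""
    s = max(s, 0.0)
    inner = (s + b + b_reg) * math.log1p(s / (b + b_reg)) - s
    return math.sqrt(2.0 * max(inner, 0.0))


def best_ams_split(
    values: Sequence[float],
    labels: Sequence[int],
    weights: Sequence[float],
) -> Tuple[Optional[float], Optional[str], float]:
    """Return (threshold, side, score) for the best AMS split on one feature.

    Each side of every candidate threshold is scored as if it were
    predicted entirely as signal (label 1 = signal, 0 = background).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    total_s = sum(w for y, w in zip(labels, weights) if y == 1)
    total_b = sum(w for y, w in zip(labels, weights) if y == 0)
    left_s = left_b = 0.0
    best: Tuple[Optional[float], Optional[str], float] = (None, None, -1.0)
    for pos in range(len(order) - 1):
        i, j = order[pos], order[pos + 1]
        if labels[i] == 1:
            left_s += weights[i]
        else:
            left_b += weights[i]
        if values[i] == values[j]:
            continue  # no threshold can separate identical feature values
        threshold = 0.5 * (values[i] + values[j])
        sides = (
            ("left", left_s, left_b),
            ("right", total_s - left_s, total_b - left_b),
        )
        for side, s, b in sides:
            score = ams(s, b)
            if score > best[2]:
                best = (threshold, side, score)
    return best


print(best_ams_split([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1], [1.0, 1.0, 1.0, 1.0]))
```

A random forest built this way would replace the usual impurity criterion with this score at each node, which is the deviation from the dominant design that the slide points to.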


Introduction of a new K pocket

C space: Achieve 5σ
-  Dominant design: select a classification method (SVM, decision trees, NN, …), pre-processing, choose hyper-params, train, optimize for accuracy
-  Integrate AMS directly in training, during gradient boosting (John) or during node split in random forest (John)
-  Weighted Classification Cascades
-  ? ? ? ? ?

K space:
-  Discovery condition: A discovery is claimed when we …
-  Problem formulation: Traditional classification setting…
-  Cross-validation: Techniques for evaluating how a …
-  Ensemble methods
-  « Gradient boosting methods fit a classifier to the 'per data point loss' and since AMS is not a sum of per data point (event) losses, it's not obvious how to use AMS as a loss in gradient boosting » (Andre Holzner)
-  New K pocket: Two participants observe that AMS can be refactored and its terms rewritten in their convex conjugate form, which allows the use of the Fenchel-Young inequality from the convex optimization literature. Optimization of AMS becomes possible through a procedure they name Weighted Classification Cascades (rank: 451st). Ref: http://arxiv.org/pdf/1409.2655v2.pdf, Mackey & Bryan

Winning strategy…

C space: Achieve 5σ
-  Optimization of AMS: dominant design (select a classification method: SVM, decision trees, NN, …; pre-processing; choose hyper-params; train; optimize for accuracy); integrate AMS directly in training, during gradient boosting (John) or during node split in random forest (John); Weighted Classification Cascades; ? ? ? ? ?
-  Design for statistical efficiency: monitoring progress with CV + ensembles + selecting a cutoff threshold that optimises (or stabilises) AMS

Top-ranked teams marked on the diagram: 1st, 2nd, 3rd

K space:
-  Discovery condition: A discovery is claimed when we …
-  Problem formulation: Traditional classification setting…
-  Cross-validation: Techniques for evaluating how a …
-  Ensemble methods
-  The biggest challenge is the instability of AMS. Competition results clearly show that only participants who dealt effectively with this issue achieved higher ranks.
-  Ensembles + CV monitoring + cutoff threshold seem to be a winning strategy
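The cutoff-selection part of this strategy can be sketched as follows. This is an illustration, not any winner's code: it assumes each cross-validation fold provides held-out classifier scores, labels (1 = signal, 0 = background) and weights, and it picks the candidate cutoff with the best mean AMS across folds, so that a cutoff that only does well on one fold is not preferred. The AMS form (with b_reg = 10) and all names are assumptions.

```python
import math
from typing import Sequence, Tuple

Fold = Tuple[Sequence[float], Sequence[int], Sequence[float]]


def ams(s: float, b: float, b_reg: float = 10.0) -> float:
    """Approximate Median Significance (assumed form, b_reg assumed to be 10)."""
    s = max(s, 0.0)
    inner = (s + b + b_reg) * math.log1p(s / (b + b_reg)) - s
    return math.sqrt(2.0 * max(inner, 0.0))


def mean_ams_at_cutoff(folds: Sequence[Fold], cutoff: float) -> float:
    """Average AMS over folds when events scoring >= cutoff are called signal."""
    total = 0.0
    for scores, labels, weights in folds:
        s = sum(w for sc, y, w in zip(scores, labels, weights) if sc >= cutoff and y == 1)
        b = sum(w for sc, y, w in zip(scores, labels, weights) if sc >= cutoff and y == 0)
        total += ams(s, b)
    return total / len(folds)


def choose_cutoff(folds: Sequence[Fold], candidates: Sequence[float]) -> float:
    """Pick the candidate cutoff with the highest mean AMS across folds."""
    return max(candidates, key=lambda c: mean_ams_at_cutoff(folds, c))


folds = [
    ([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0], [1.0, 1.0, 5.0, 5.0]),
    ([0.95, 0.7, 0.4, 0.2], [1, 1, 0, 0], [1.0, 1.0, 5.0, 5.0]),
]
print(choose_cutoff(folds, [0.0, 0.5, 0.85]))  # 0.5
```

In a real pipeline the scores would come from an ensemble, and the event weights of each fold would need rescaling so that s and b stay comparable to the full data set.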


Fixating others…

C-K diagram as on the « Winning strategy… » slide (Achieve 5σ; optimization of AMS; design for statistical efficiency: monitoring progress with CV + ensembles + selecting a cutoff threshold that optimises or stabilises AMS), with two additions:

-  The public guide to AMS 3.6 « moves » many participants onto the given path
-  Fixation vs. creative authority (Agogué et al., 2014)

Analysing is one thing…

What about generating alternatives using design theory?

Generating new design strategies

A. Kazakci, « Data science as a new frontier for design », ICED’15 (submitted)

•  18 months of problem formulation (3 physicists, 3 data scientists)

•  No innovation in DS: only differences in individual performance in adapting a dominant design

•  No innovation in physics (even criticism: wrong problem?)

•  In their current form and organisation, data challenges are « problem-solving » approaches

« Extracting value from data » requires a rigorous design process

You reap what you sow: data challenges will not yield innovations unless the problem formulation is original and the exploration is ingeniously organised.

DKCP - Machine learning for HEP

•  A DKCP process has been launched for exploring innovation opportunities at the crossroads of HEP and ML

Bootcamps: how to ensure a controlled yet creative exploration?

•  Thank you!
•  Akin Kazakci
•  [email protected]
•  Data science as a new frontier for design

DCC’14, Machine Learning and Innovative Design Workshop