
Page 1

Multiobjective Prediction with Expert Advice

Alexey Chernov

Computer Learning Research Centre and Department of Computer Science
Royal Holloway, University of London

GTP Workshop, June 2010

Alexey Chernov (RHUL) Multiobjective Prediction with Expert Advice GTP Workshop, June 2010 1 / 18

Page 2

Example: Prediction of Sport Match Outcome

V. Vovk, F. Zhdanov. Predictions with Expert Advice for Brier Game. ICML’08

Bookmakers data:
- 4 bookmakers, odds for ∼10000 tennis matches (2 outcomes)
- 8 bookmakers, odds for ∼9000 football matches (3 outcomes)

Odds a_i can be transformed to probabilities Prob[i]:

Prob[i] = (1/a_i) / ∑_j (1/a_j)
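The normalisation above can be sketched in a few lines of Python (the odds values below are made-up examples, not the bookmakers' data from the talk):

```python
def odds_to_probs(odds):
    """Convert bookmaker odds to an implied probability distribution by
    normalising the inverse odds (this also removes the bookmaker's margin)."""
    inv = [1.0 / a for a in odds]
    s = sum(inv)
    return [x / s for x in inv]

# Hypothetical two-outcome tennis odds:
probs = odds_to_probs([1.5, 2.8])
```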

The loss is measured by the square (Brier) loss function.

Learner’s strategy is the Aggregating Algorithm.
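As a rough illustration of the Learner's strategy (a sketch, not the talk's implementation), here is a minimal Aggregating Algorithm for the log loss, where the AA reduces to the Bayesian mixture of the experts' predictions; the function name and data layout are assumptions:

```python
import math

def aa_log_loss(expert_preds, outcomes):
    """expert_preds[t][k] = expert k's probability of outcome 1 at step t.
    Returns the Learner's forecasts (Bayes mixture, i.e. the AA with eta = 1)."""
    K = len(expert_preds[0])
    log_w = [0.0] * K                       # log-weights, start uniform
    learner = []
    for preds, omega in zip(expert_preds, outcomes):
        z = max(log_w)                      # stabilise the exponentiation
        w = [math.exp(lw - z) for lw in log_w]
        s = sum(w)
        p = sum(wk * pk for wk, pk in zip(w, preds)) / s   # mixture forecast
        learner.append(p)
        for k, pk in enumerate(preds):      # weight update: w_k *= e^{-loss_k}
            pk = min(max(pk, 1e-12), 1 - 1e-12)
            log_w[k] -= -math.log(pk if omega == 1 else 1 - pk)
    return learner
```

With one reliable and one unreliable expert, the mixture quickly concentrates on the better one.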


Page 3

Tennis Prediction, Square Loss

[Plot over ∼12000 matches, y-axis −4 to 16; curves: Theoretical bounds, Learner, Bookmakers]

Graph of the negative regret Loss_{E_k}(T) − Loss(T), 4 Experts.
Learner is the AA for the square loss.


Page 4

Tennis Prediction, Log Loss

[Plot over ∼12000 matches, y-axis −5 to 25; curves: Theoretical bounds, Learner, Bookmakers]

Graph of the negative regret Loss_{E_k}(T) − Loss(T), 4 Experts.
Learner is the AA for the log loss (Bayes mixture).


Page 5

Tennis Predictions: “Wrong” Losses

Graphs of the negative regret Loss_{E_k}(T) − Loss(T)

[Two plots over ∼12000 matches; curves: Theoretical bounds, Learner, Bookmakers.
Left: log loss, Learner is the AA for the square loss.
Right: square loss, Learner is the AA for the log loss.]

Learner optimizes for a “wrong” loss function


Page 6

Aggregating Algorithm with Wrong Losses

Fact. For the game with 2 outcomes, one can construct a sequence of predictions of 2 Experts and a sequence of outcomes with the following property: if Learner's predictions are generated by the Aggregating Algorithm for the log loss, then for almost all T

Loss(T) ≥ Loss_{E_1}(T) + T/10,

where Loss(T) and Loss_{E_1}(T) are the square losses of Learner and Expert 1.

A similar statement holds for the Aggregating Algorithm for the square loss evaluated by the log loss.


Page 7

New Settings

Experts: γ^(1)_t, …, γ^(K)_t
Learner: γ_t
Reality: ω_t

Many loss functions:

Loss^(m)_{E_k}(T) = ∑_{t=1}^T λ^(m)(γ^(k)_t, ω_t),    Loss^(m)(T) = ∑_{t=1}^T λ^(m)(γ_t, ω_t)

Expert Evaluator's advice:

Loss^(k)_{E_k}(T) = ∑_{t=1}^T λ^(k)(γ^(k)_t, ω_t),    Loss^(k)(T) = ∑_{t=1}^T λ^(k)(γ_t, ω_t)
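The bookkeeping behind these definitions can be sketched as follows (a hypothetical helper, not from the talk, with the square and log losses standing in for the λ^(m)):

```python
import math

def sq(p, omega):                 # square (Brier) loss, binary outcomes
    return (p - omega) ** 2

def logloss(p, omega):            # log loss, binary outcomes
    return -math.log(p if omega == 1 else 1 - p)

def cumulative_losses(loss_fns, expert_preds, outcomes):
    """table[m][k] = Loss^(m)_{E_k}(T) after T = len(outcomes) rounds."""
    K = len(expert_preds[0])
    table = [[0.0] * K for _ in loss_fns]
    for preds, omega in zip(expert_preds, outcomes):
        for m, lam in enumerate(loss_fns):
            for k in range(K):
                table[m][k] += lam(preds[k], omega)
    return table
```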



Page 9

Bound for New Settings

Theorem. If λ^(k) are η^(k)-mixable proper loss functions, k = 1, …, K, Learner has a strategy (e.g. the Defensive Forecasting algorithm) that guarantees, for all T and for all k, that

Loss^(k)(T) ≤ Loss^(k)_{E_k}(T) + (1/η^(k)) ln K.

Corollary. If λ^(m) are η^(m)-mixable proper loss functions, m = 1, …, M, Learner has a strategy that guarantees, for all T, for all k, and for all m, that

Loss^(m)(T) ≤ Loss^(m)_{E_k}(T) + (1/η^(m)) (ln K + ln M).
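Plugging concrete constants into the Corollary's bound gives a feel for its size (a sanity-check sketch; the mixability constants η = 1 for the log loss and η = 2 for the square loss on binary outcomes are the standard values, assumed here):

```python
import math

def regret_bound(eta, K, M=1):
    """The Corollary's additive regret term (ln K + ln M) / eta."""
    return (math.log(K) + math.log(M)) / eta

# 4 bookmakers, two loss functions tracked simultaneously:
b_log = regret_bound(1.0, K=4, M=2)   # log-loss regret bound, ln 8 ~ 2.08
b_sq  = regret_bound(2.0, K=4, M=2)   # square-loss regret bound, half of that
```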


Page 10

Tennis Predictions, Two Losses

Graphs of the negative regret Loss^(m)_{E_k}(T) − Loss^(m)(T)

[Two plots over ∼12000 matches; curves: Theoretical bounds, Learner, Bookmakers.
Left: square loss. Right: log loss.]

Learner optimizes for both loss functions, using the DF algorithm.


Page 11

Defensive Forecasting Algorithm

∃π ∀ω:  ∑_{k=1}^K p^(k)_{t−1} e^{η(λ(π,ω) − λ(π^(k)_t,ω))} ≤ 1,

where p^(k)_{t−1} = p^(k)_0 e^{η(Loss(t−1) − Loss_{E_k}(t−1))}.

To get this (from Levin's Lemma) we need that λ(π, ω) is continuous and, for all π, π′,

E_π e^{η(λ(π,·) − λ(π′,·))} = ∑_{ω∈Ω} π(ω) e^{η(λ(π,ω) − λ(π′,ω))} ≤ 1
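One step of this idea can be sketched numerically: search a grid of forecasts for one that keeps the weighted exponential sum at most 1 for every outcome. The grid search stands in for the continuity argument of Levin's Lemma; the function name, parameters, and grid size are illustrative assumptions:

```python
import math

def sq(p, omega):
    return (p - omega) ** 2

def df_step(weights, expert_preds, loss=sq, eta=2.0, grid=1001):
    """Find a forecast p minimising the worst-case (over outcomes 0 and 1)
    weighted exponential sum; for a mixable proper loss the minimum is <= 1."""
    best_p, best_val = None, float("inf")
    for i in range(grid):
        p = i / (grid - 1)
        val = max(
            sum(w * math.exp(eta * (loss(p, omega) - loss(pk, omega)))
                for w, pk in zip(weights, expert_preds))
            for omega in (0, 1)
        )
        if val < best_val:
            best_p, best_val = p, val
    return best_p, best_val
```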


Page 12

Defensive Forecasting Algorithm

∃π ∀ω:  ∑_{k=1}^K p^(k)_{t−1} e^{η^(k)(λ^(k)(π,ω) − λ^(k)(π^(k)_t,ω))} ≤ 1,

where p^(k)_{t−1} = p^(k)_0 e^{η^(k)(Loss^(k)(t−1) − Loss^(k)_{E_k}(t−1))}.

To get this (from Levin's Lemma) we need that λ^(k)(π, ω) is continuous and, for all π, π′,

E_π e^{η^(k)(λ^(k)(π,·) − λ^(k)(π′,·))} = ∑_{ω∈Ω} π(ω) e^{η^(k)(λ^(k)(π,ω) − λ^(k)(π′,ω))} ≤ 1


Page 13

The DFA and the AA

λ is continuous and ∀π, π′: E_π e^{η(λ(π,·) − λ(π′,·))} ≤ 1  ⇒  λ is η-mixable

∀π, π′: E_π e^{η(λ(π,·) − λ(π′,·))} ≤ 1  ⇒  λ is proper

λ is η-mixable and ?  ⇒  ∀π, π′: E_π e^{η(λ(π,·) − λ(π′,·))} ≤ 1


Page 14

The DFA and the AA

λ is continuous and ∀π, π′: E_π e^{η(λ(π,·) − λ(π′,·))} ≤ 1  ⇒  λ is η-mixable

∀π, π′: E_π e^{η(λ(π,·) − λ(π′,·))} ≤ 1  ⇒  λ is proper

λ is η-mixable and proper  ⇒  ∀π, π′: E_π e^{η(λ(π,·) − λ(π′,·))} ≤ 1
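The last implication can be spot-checked numerically for the square loss, which is 2-mixable and proper on binary outcomes (the grid resolution below is an arbitrary choice):

```python
import math

def check_square_loss(eta=2.0, grid=101):
    """Worst case over a grid of forecasts p = pi(1) and q = pi'(1) of
    E_pi e^{eta (lambda(pi,.) - lambda(pi',.))} for the square loss.
    The condition holds iff the returned value is at most 1."""
    worst = 0.0
    for i in range(grid):
        p = i / (grid - 1)          # pi(1)
        for j in range(grid):
            q = j / (grid - 1)      # pi'(1)
            e = ((1 - p) * math.exp(eta * (p ** 2 - q ** 2)) +
                 p * math.exp(eta * ((1 - p) ** 2 - (1 - q) ** 2)))
            worst = max(worst, e)
    return worst
```

The supremum 1 is attained at q = p, so the check should come out at (numerically) exactly 1.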


Page 15

Proper Loss Functions

λ is proper if for any π, π′ ∈ P(Ω)

E_π λ(π, ·) ≤ E_π λ(π′, ·)

If ω ∼ π, then E_π λ(π′, ·) is the expected loss for prediction π′.

The expected loss is minimal for the true distribution ⇒ the forecaster is encouraged to give the true probabilities.

The square loss and the log loss are proper.
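Properness of these two losses is easy to check numerically: over a grid of reported probabilities q, the expected loss under a true probability p is smallest at q = p (p = 0.3 and the grid are illustrative choices):

```python
import math

sq = lambda q, w: (q - w) ** 2                                # square loss
log_loss = lambda q, w: -math.log(q if w == 1 else 1 - q)     # log loss

def expected_loss(lam, p, q):
    # expected loss of report q when outcome 1 has true probability p
    return (1 - p) * lam(q, 0) + p * lam(q, 1)

def best_report(lam, p, grid=101):
    qs = [i / (grid - 1) for i in range(1, grid - 1)]  # avoid log(0)
    return min(qs, key=lambda q: expected_loss(lam, p, q))
```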


Page 16

Example: Hellinger Loss

λ_Hellinger(γ, ω) = (1/2) ∑_{j=1}^r (√γ(j) − √I_{ω=j})²

The Hellinger loss is √2-mixable.

The Hellinger loss is not proper
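The failure of properness can be seen numerically: with true probability p = 0.3 of outcome 1, the expected Hellinger loss is minimised by reporting a value different from 0.3 (in fact p²/(p² + (1−p)²); the grid size is an illustrative choice):

```python
import math

def hellinger(g, omega):
    """Hellinger loss over outcomes {0, 1}; g = predicted prob. of outcome 1."""
    return 0.5 * ((math.sqrt(1 - g) - (1 - omega)) ** 2 +
                  (math.sqrt(g) - omega) ** 2)

def best_hellinger_report(p, grid=10001):
    # report minimising the expected Hellinger loss when P(omega = 1) = p
    gs = [i / (grid - 1) for i in range(grid)]
    return min(gs, key=lambda g: (1 - p) * hellinger(g, 0) + p * hellinger(g, 1))
```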


Page 17

Proper Mixable Loss Functions

Each mixable loss function λ(γ, ω) has a proper analogue λ_proper(π, ω) such that

1. ∀π ∃γ ∀ω: λ_proper(π, ω) = λ(γ, ω)
2. ∀π ∀γ: E_π λ_proper(π, ·) ≤ E_π λ(γ, ·)

For the Hellinger loss, the proper analogue is the spherical loss

λ_spherical(π, ω) = 1 − π(ω) / √(∑_{j=1}^r (π(j))²)

λ_spherical(π, ω) = λ_Hellinger(γ, ω)  for  γ(ω) = (π(ω))² / ∑_{j=1}^r (π(j))²
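The substitution can be verified numerically for an arbitrary probability vector (the three-outcome vector below is a made-up example):

```python
import math

def spherical(pi, omega):
    # spherical loss of distribution pi on outcome omega
    return 1 - pi[omega] / math.sqrt(sum(x * x for x in pi))

def hellinger(gamma, omega):
    # Hellinger loss of prediction gamma on outcome omega
    return 0.5 * sum((math.sqrt(g) - (1 if j == omega else 0)) ** 2
                     for j, g in enumerate(gamma))

pi = [0.2, 0.5, 0.3]
s = sum(x * x for x in pi)
gamma = [x * x / s for x in pi]     # the substitution from the slide
```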



Page 19

Example: Mixable and Non-Mixable Losses

Experts 1, …, K predict π^(k) ∈ P({0, 1}).
Experts 1, …, N predict γ^(n) ∈ {0, 1}.
Learner predicts (π, π̃) ∈ P({0, 1}) × P({0, 1}) such that
if π(0) > 1/2 then π̃(0) = 1 and if π(1) > 1/2 then π̃(1) = 1.

There exists a strategy for Learner that guarantees for any k

∑_{t=1}^T λ_square(π_t, ω_t) ≤ ∑_{t=1}^T λ_square(π^(k)_t, ω_t) + ln(K + N)

and for any n

∑_{t=1}^T λ_abs(π̃_t, ω_t) ≤ ∑_{t=1}^T λ_simple(γ^(n)_t, ω_t) + O(√(T ln(K + N) + T ln ln T))


Page 20

Tennis Predictions, Square and Absolute Losses

Graphs of the negative regret Loss^(m)_{E_k}(T) − Loss^(m)(T)

[Two plots over the first 400 matches; curves: DF, experts (theoretical bounds on the left plot only).
Left: square loss. Right: absolute loss.]

Learner optimizes for both loss functions, using the DF algorithm with“mixability” and Hoeffding supermartingales.


Page 21

The Mixed Supermartingale

(1/(K + N)) ∑_{k=1}^K e^{2 ∑_{t=1}^{T−1} ((p_t − ω_t)² − (p^(k)_t − ω_t)²)} × e^{2((p − ω)² − (p^(k)_T − ω)²)}

+ (1/(K + N)) ∑_{n=1}^N ∫_0^{1/e} [dη / (η (ln(1/η))²)] e^{∑_{t=1}^{T−1} (η(|p̃_t − ω_t| − [γ^(n)_t ≠ ω_t]) − η²/2)} × e^{η(|p̃ − ω| − [γ^(n)_T ≠ ω]) − η²/2}

where p_t = π_t(1), p^(k)_t = π^(k)_t(1), p̃_t = π̃_t(1).

[x ≠ y] = 1 if x ≠ y and [x ≠ y] = 0 if x = y.


Page 22

References

V. Vovk, F. Zhdanov. Predictions with Expert Advice for Brier Game. ICML 2008. http://arxiv.org/abs/0710.0485

A. Chernov, Y. Kalnishkan, F. Zhdanov, V. Vovk. Supermartingales in Prediction with Expert Advice. ALT 2008 and TCS. http://arxiv.org/abs/1003.2218

A. Chernov, V. Vovk. Prediction with expert evaluators' advice. ALT 2009. http://arxiv.org/abs/0902.4127

A. Chernov, V. Vovk. Prediction with Advice of Unknown Number of Experts. UAI 2010. http://arxiv.org/abs/1006.0475

http://onlineprediction.net/
