

• This includes 3 articles: the first a nontechnical discussion of the book by science writer S. Pinker, the second a technical discussion of the flaw in Pinker's book forthcoming in Physica A: Statistical Mechanics and Applications, the third a technical discussion of what I call the Pinker Problem, a corruption of the law of large numbers.

The Long Peace is a Statistical Illusion
Nassim Nicholas Taleb

When I finished writing The Black Swan, in 2006, I was confronted with ideas of great moderation, by people who did not realize that the process was getting fatter and fatter tails (from operational and financial leverage, complexity, interdependence, etc.), meaning fewer but deeper departures from the mean. The fact that nuclear bombs explode less often than regular shells does not make them safer. Needless to say that with the arrival of the events of 2008, I did not have to explain myself too much. Nevertheless people in economics are still using the methods that led to the "great moderation" narrative, and Bernanke, the protagonist of the theory, had his mandate renewed. I had argued that we were undergoing a switch from the top graph (continuous low grade volatility) to the next one, with the process moving by jumps, with less and less variation outside of jumps.

[Two figures: simulated sample paths of Position (80-140) vs. Time (0-250); the first with continuous low grade volatility, the second moving by jumps.]

My idea of current threats outside finance:

1. Loss of the Island Effect: My point now is the loss of the island effect, from interconnectedness. The number one danger is a biological agent that can travel on British Air and reach the entire planet. And it can be built in a high school lab, or, even, a garage.

2. Nuclear Potential: As I explain in Antifragile, risk management is about fragility, not naive interpretation of past data. If Fannie Mae is sitting on a barrel of dynamite I would not use past statistical data for my current analysis. Risks are in the fragility. (Sensitivity to counterfactuals is more important than past history, something Greenspan missed in his famous congressional testimony.)

The Pinker Argument

Now to my horror I saw an identical theory of great moderation produced by Steven Pinker with the same naive statistically derived discussions (>700 pages of them!).

1. I agree that diabetes is a bigger risk than murder; we are victims of sensationalism. But our suckerdom for overblown narratives of violence does not imply that the risks of large scale violent shocks have declined. (The same as in economics: people's mapping of risks is out of sync and they underestimate large deviations.) We are just bad at evaluating risks.

• 2. Pinker conflates nonscalable Mediocristan (death from encounters with simple weapons) with scalable Extremistan (death from heavy shells and nuclear weapons). The two have markedly distinct statistical properties, yet he uses the statistics of one to make inferences about the other. The book does not register the core difference between the scalable and the nonscalable (although he tries to define power laws). He claims that crime has dropped, which does not mean anything concerning casualties from violent conflict.

3. Another way to see the conflation: Pinker works with a time series process without dealing with the notion of temporal homogeneity. Ancestral man had no nuclear weapons, so it is downright foolish to assume the statistics of conflicts in the 14th century can apply to the 21st. A mean person with a stick is categorically different from a mean person with a nuclear weapon, so the emphasis should be on the weapon and not exclusively on the psychological makeup of the person.

4. The statistical discussions are disturbingly amateurish, which would not be a problem except that the point of his book is statistical. Pinker misdefines fat tails by talking about probability, not the contribution of rare events to the higher moments; he somehow accepts power laws with low exponents, but does not connect the dots: if they hold, statistics can allow no claim about the mean of the process. Further, he assumes that data reveal their properties without inferential errors. He talks about the process switching from 80/20 to 80/02, when the first has a tail exponent of 1.16 and the other 1.06, meaning they are statistically indistinguishable (errors in the computation of tail exponents are at least 0.6, so this discussion is noise, and as shown in [1], [2], the exponent is lower than 1). It is an error to talk 80/20 and derive the statistics of cumulative contributions from samples rather than fit exponents; an 80/20 style statement is interpolative from the existing sample, hence biased to clip the tail, while exponents extrapolate.
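To see how little daylight there is between the two regimes he distinguishes, here is a quick back-of-the-envelope check (a minimal sketch in Python; it assumes the standard Pareto top-share relation, share = q^(1-1/alpha), used in the technical paper below, and the variable names are mine):

```python
from math import log

def implied_alpha(share, q):
    """Tail exponent alpha implied by 'the top fraction q holds `share` of the total',
    using the Pareto relation share = q**(1 - 1/alpha)."""
    return 1.0 / (1.0 - log(share) / log(q))

for share, q in [(0.80, 0.20), (0.80, 0.02)]:
    print(f"{int(share*100)}/{int(q*100):02d} -> alpha = {implied_alpha(share, q):.3f}")
# 80/20 -> alpha = 1.161
# 80/02 -> alpha = 1.061   (a gap well inside a ~0.6 estimation error)
```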

5. He completely misses the survivorship biases (which I called the Casanova effect) that make an observation by an observer whose survival depends on the observation invalid probabilistically, or, at the least, favorably biased. Had a nuclear event taken place, Signor Pinker would not have been able to write the book.

6. He calls John Gray's critique anecdotal, yet it is more powerful statistically (an argument of via negativa) than his >700 pages of pseudo-stats.

7. Psychologically, he complains that the lurid leads people to make inferences about the state of the system, yet he uses lurid arguments to make his point.

8. You can look at the data he presents and actually see a rise in war effects when comparing pre-1914 to post-1914.

9. Recursing a Bit (Point added Nov 8): Had a book proclaiming The Long Peace been published in 1913, it would have carried similar arguments to those in Pinker's book.

Is The World Getting Less Safe?

When writing The Black Swan, I was worried about the state of the system from the new type of fragility, which I ended up defining later. I had noted that because of connectedness and complexity, earthquakes carried larger and larger proportional economic costs, an analysis by Zajdenweber which saw many iterations and independent rediscoveries (the latest in a 2012 Economist article). The current deficit is largely the result of the (up to) $3 trillion spent after September 11 on wars in Afghanistan and Iraq. But connectedness has an even larger threat. With biological connectedness, the Island Effect is gone both economically and physically: tails are so fat ...

    {Technical Appendix}

Pinker's Rebuttal of This Note

Pinker has written a rebuttal (ad hominem blather; if he had a point he would have written something a third of this length, not three times the words). He still does not understand the difference between probability and expectation (a drop in observed volatility/fluctuation is not a drop in risk) or the incompatibility of his claims with his acceptance of fat tails (he does not understand asymmetries, judging from his posts on FB and private correspondence). Yet it was Pinker who said "what is the actual risk for any individual? It is approaching zero."

Second Thoughts on The Pinker Story: What Can We Learn From It

It turned out the entire exchange with S. Pinker was a dialogue de sourds. In my correspondence and exchange with him, I was under the impression that he simply misunderstood the difference between inference from symmetric, thin-tailed random variables and inference from asymmetric, fat-tailed ones (the 4th Quadrant problem). I thought that I was making him aware of the effects of the complications of the distribution. But it turned out things were worse, a lot worse than that.

Pinker doesn't have a clear idea of the difference between science and journalism, or the one between rigorous empiricism and anecdotal statements. Science is not about making claims about a sample, but using a sample to make general claims and discuss properties that apply outside the sample.

Take M* the observed arithmetic mean from the realizations (a sample path) for some process, and M the "true" mean. When someone says "Crime rate in NYC dropped between 2000 and 2010", the claim is about M* the observed mean, not M the true mean, hence the claim can be deemed merely journalistic, not scientific, and journalists are there to report "facts" not theories. No scientific and causal statement should be made from M* on "why violence has dropped" unless one establishes a link to M the true mean. M* cannot be deemed "evidence" by itself. Working with M* cannot be called "empiricism".

What I just wrote is at the foundation of statistics (and, it looks like, science). Bayesians disagree on how M* converges to M, etc., never on this point. From his statements, Pinker seems to be aware that M* may have dropped (which is a straight equality) while hedging on whether we can make claims on M, which might not really have been dropping.

Now Pinker is excusable. The practice is widespread in social science, where academics use mechanistic techniques of statistics without understanding the properties of the statistical claims. And in some areas not involving time series, the difference between M* and M is negligible. So I rapidly jot down a few rules before showing proofs and derivations (limiting M to the arithmetic mean). Where E is the expectation operator under the real-world probability measure P:



1. Tails Sampling Property: E[|M* - M|] increases with fat-tailedness (the mean deviation of M* seen from the realizations in different samples of the same process). In other words, fat tails tend to mask the distributional properties.

2. Counterfactual Property: Another way to view the previous point: the distance between different values of M* one gets from repeated sampling of the process (say, counterfactual history) increases with fat tails.

3. Survivorship Bias Property: E[M* - M] increases under the presence of an absorbing barrier for the process (the Casanova effect).

4. Left Tail Sample Insufficiency: E[M* - M] increases with negative skewness of the true underlying variable.

5. Asymmetry in Inference: Under both negative skewness and fat tails, negative deviations from the mean are more informational than positive deviations.

6. Power of Extreme Deviations (N=1 is OK): Under fat tails, large deviations from the mean are vastly more informational than small ones. They are not anecdotal. (The last two properties correspond to the black swan problem.)
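A minimal Monte Carlo sketch of properties 1 and 2 (my own illustration, assuming a standard Pareto with alpha = 1.1 and x_min = 1 as the fat-tailed example; the benchmark and the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, runs, alpha = 10_000, 1_000, 1.1
m_true = alpha / (alpha - 1)                    # true mean M of a Pareto with x_min = 1

pareto_err, gauss_err = [], []
for _ in range(runs):
    pareto_err.append(abs((1 + rng.pareto(alpha, n)).mean() - m_true))   # |M* - M|, fat tails
    gauss_err.append(abs(rng.normal(m_true, 1.0, n).mean() - m_true))    # |M* - M|, thin tails

print("E|M*-M|  Pareto(alpha=1.1):", np.mean(pareto_err))
print("E|M*-M|  Gaussian         :", np.mean(gauss_err))
# The fat-tailed error is orders of magnitude larger at the same sample size.
```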

Fig. 1: First 100 years (sample path): a Monte Carlo generated realization of a process of the 80/20 or 80/02 style as described by Pinker in his book, that is, tail exponent α = 1.1, dangerously close to 1.

Fig. 2: The Turkey Surprise: now 200 years; the second 100 years dwarfs the first. These are realizations of the exact same process, seen with a longer window.


  • EXTREME RISK INITIATIVE NYU SCHOOL OF ENGINEERING WORKING PAPER SERIES

On the Super-Additivity and Estimation Biases of Quantile Contributions

Nassim Nicholas Taleb, School of Engineering, New York University
Raphael Douady, Riskdata & C.N.R.S. Paris, Labex ReFi, Centre d'Economie de la Sorbonne

Abstract: Sample measures of top centile contributions to the total (concentration) are downward biased, unstable estimators, extremely sensitive to sample size and concave in accounting for large deviations. This makes them particularly unfit in domains with power law tails, especially for low values of the exponent. These estimators can vary over time and increase with the population size, as shown in this article, thus providing the illusion of structural changes in concentration. They are also inconsistent under aggregation and mixing distributions, as the weighted average of concentration measures for A and B will tend to be lower than that from A ∪ B. In addition, it can be shown that under such fat tails, increases in the total sum need to be accompanied by increased sample size of the concentration measurement. We examine the estimation superadditivity and bias under homogeneous and mixed distributions.

Final version, Nov 11 2014. Accepted, Physica A: Statistical Mechanics and Applications.

This work was achieved through the Laboratory of Excellence on Financial Regulation (Labex ReFi) supported by PRES heSam under the reference ANR-10-LABX-0095.

    I. INTRODUCTION

Vilfredo Pareto noticed that 80% of the land in Italy belonged to 20% of the population, and vice-versa, thus both giving birth to the power law class of distributions and to the popular saying 80/20. The self-similarity at the core of the property of power laws [1] and [2] allows us to recurse and reapply the 80/20 to the remaining 20%, and so forth until one obtains the result that the top percent of the population will own about 53% of the total wealth.
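A small sketch of that arithmetic (the recursion points plus the power-law interpolation to the top percentile; the helper code and constants below are mine):

```python
from math import log

# Recursing 80/20: the top 0.2**k of the population holds 0.8**k of the total.
for k in range(1, 4):
    print(f"top {0.2**k:.1%} holds {0.8**k:.1%}")

# Interpolating with the implied Pareto tail exponent (share = q**(1 - 1/alpha)):
alpha = 1 / (1 - log(0.8) / log(0.2))                      # ~1.16 from the 80/20 rule
print(f"top 1% holds about {0.01 ** (1 - 1/alpha):.0%}")   # ~53%
```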

It looks like such a measure of concentration can be seriously biased, depending on how it is measured, so it is very likely that the true ratio of concentration of what Pareto observed, that is, the share of the top percentile, was closer to 70%, hence changes year-on-year would drift higher to converge to such a level from a larger sample. In fact, as we will show in this discussion, for, say, wealth, more complete samples resulting from technological progress, and also larger population and economic growth, will make such a measure converge by increasing over time, for no other reason than expansion in sample space or aggregate value.

The core of the problem is that, for the class of one-tailed fat-tailed random variables, that is, bounded on the left and unbounded on the right, where the random variable $X \in [x_{\min}, \infty)$, the in-sample quantile contribution is a biased estimator of the true value of the actual quantile contribution.

Let us define the quantile contribution

$$\kappa_q = q\,\frac{E[X \mid X > h(q)]}{E[X]}$$

where $h(q) = \inf\{h \in [x_{\min}, +\infty) : P(X > h) \le q\}$ is the exceedance threshold for the probability $q$.

For a given sample $(X_k)_{1 \le k \le n}$, its "natural" estimator $\hat\kappa_q \triangleq \frac{q\text{th percentile}}{\text{total}}$, used in most academic studies, can be expressed as

$$\hat\kappa_q \triangleq \frac{\sum_{i=1}^{n} \mathbb{1}_{X_i > \hat h(q)}\, X_i}{\sum_{i=1}^{n} X_i}$$

where $\hat h(q)$ is the estimated exceedance threshold for the probability $q$:

$$\hat h(q) = \inf\Big\{h : \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}_{x > h} \le q\Big\}$$

We shall see that the observed variable $\hat\kappa_q$ is a downward biased estimator of the true ratio $\kappa_q$, the one that would hold out of sample, and such bias is in proportion to the fatness of tails and, for very fat tailed distributions, remains significant, even for very large samples.

II. ESTIMATION FOR UNMIXED PARETO-TAILED DISTRIBUTIONS

Let X be a random variable belonging to the class of distributions with a "power law" right tail, that is:

$$P(X > x) \sim L(x)\, x^{-\alpha} \qquad (1)$$

where $L : [x_{\min}, +\infty) \to (0, +\infty)$ is a slowly varying function, defined as $\lim_{x \to +\infty} \frac{L(kx)}{L(x)} = 1$ for any $k > 0$.

There is little difference for small exceedance quantiles between the general form and the strict Pareto case, so we work with the latter; in other words: $\varphi(x) = \alpha\, x_{\min}^{\alpha}\, x^{-\alpha-1}$ for $x \ge x_{\min}$, and $P(X > x) = \left(\frac{x_{\min}}{x}\right)^{\alpha}$. Under these assumptions, the cutpoint of exceedance is $h(q) = x_{\min}\, q^{-1/\alpha}$ and we have:

$$\kappa_q = \frac{\int_{h(q)}^{\infty} x\,\varphi(x)\,dx}{\int_{x_{\min}}^{\infty} x\,\varphi(x)\,dx} = \left(\frac{h(q)}{x_{\min}}\right)^{1-\alpha} = q^{1 - \frac{1}{\alpha}} \qquad (2)$$

If the distribution of X is α-Pareto only beyond a cut-point $x_{\text{cut}}$, which we assume to be below $h(q)$, so that we have $P(X > x) = \lambda\, x^{-\alpha}$ for some $\lambda > 0$, then we still have $h(q) = \lambda^{1/\alpha}\, q^{-1/\alpha}$ and

$$\kappa_q = \frac{\alpha}{\alpha - 1}\,\frac{\lambda^{1/\alpha}}{E[X]}\; q^{1 - \frac{1}{\alpha}}$$

The estimation of $\kappa_q$ hence requires that of the exponent $\alpha$ as well as that of the scaling parameter $\lambda$, or at least its ratio to the expectation of $X$.

Table I shows the bias of $\hat\kappa_q$ as an estimator of $\kappa_q$ in the case of an α-Pareto distribution for α = 1.1, a value chosen to be compatible with practical economic measures, such as the wealth distribution in the world or in a particular country, including developed ones.¹ In such a case, the estimator is extremely sensitive to "small" samples, "small" meaning in practice 10⁸. We ran up to a trillion simulations across varieties of sample sizes. While $\kappa_{0.01} \approx 0.657933$, even a sample size of 100 million remains severely biased as seen in the table.

Naturally the bias is rapidly (and nonlinearly) reduced for α further away from 1, and becomes weak in the neighborhood of 2 for a constant α, though not under a mixture distribution for α, as we shall see later. It is also weaker outside the top 1% centile, hence this discussion focuses on the famed "one percent" and on low values of the exponent α.

TABLE I: Biases of the estimator of κ = 0.657933 from 10¹² Monte Carlo realizations

    κ̂(n)        Mean       Median     STD across MC runs
    κ̂(10³)      0.405235   0.367698   0.160244
    κ̂(10⁴)      0.485916   0.458449   0.117917
    κ̂(10⁵)      0.539028   0.516415   0.0931362
    κ̂(10⁶)      0.581384   0.555997   0.0853593
    κ̂(10⁷)      0.591506   0.575262   0.0601528
    κ̂(10⁸)      0.606513   0.593667   0.0461397
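A scaled-down re-implementation of the experiment behind Table I can be sketched as follows (far fewer Monte Carlo runs than the 10¹² realizations above, so the numbers are noisier, but the downward bias and its slow decay are already visible):

```python
import numpy as np

alpha, q, runs = 1.1, 0.01, 500
rng = np.random.default_rng(1)
kappa_true = q ** (1 - 1 / alpha)               # = 0.657933...

for n in (10**3, 10**4, 10**5):
    est = []
    for _ in range(runs):
        x = 1 + rng.pareto(alpha, n)            # strict Pareto sample, x_min = 1
        h = np.quantile(x, 1 - q)               # estimated exceedance threshold
        est.append(x[x > h].sum() / x.sum())    # hat kappa_q
    print(f"n=10^{int(np.log10(n))}: mean kappa_hat = {np.mean(est):.3f}  (true {kappa_true:.3f})")
```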

In view of these results and of a number of tests we have performed around them, we can conjecture that the bias $\kappa_q - E[\hat\kappa_q(n)]$ is "of the order of" $c(\alpha, q)\, n^{-b(q)(\alpha - 1)}$, where the constants $b(q)$ and $c(\alpha, q)$ need to be evaluated. Simulations suggest that $b(q) = 1$, whatever the value of α and q, but the rather slow convergence of the estimator and of its standard deviation to 0 makes precise estimation difficult.

2) General Case: In the general case, let us fix the threshold h and define:

$$\kappa_h = P(X > h)\,\frac{E[X \mid X > h]}{E[X]} = \frac{E[X\,\mathbb{1}_{X > h}]}{E[X]}$$

¹ This value, which is lower than the estimated exponents one can find in the literature (around 2), is, following [7], a lower estimate which cannot be excluded from the observations.

so that we have $\kappa_q = \kappa_{h(q)}$. We also define the n-sample estimator:

$$\hat\kappa_h \triangleq \frac{\sum_{i=1}^{n} \mathbb{1}_{X_i > h}\, X_i}{\sum_{i=1}^{n} X_i}$$

where $X_i$ are n independent copies of X. The intuition behind the estimation bias of $\kappa_q$ by $\hat\kappa_q$ lies in a difference of concavity of the concentration measure with respect to an innovation (a new sample value), whether it falls below or above the threshold. Let $A_h(n) = \sum_{i=1}^{n} \mathbb{1}_{X_i > h}\, X_i$ and $S(n) = \sum_{i=1}^{n} X_i$, so that $\hat\kappa_h(n) = \frac{A_h(n)}{S(n)}$, and assume a frozen threshold h. If a new sample value $X_{n+1} < h$ then the new value is $\hat\kappa_h(n+1) = \frac{A_h(n)}{S(n) + X_{n+1}}$. The value is convex in $X_{n+1}$ so that uncertainty on $X_{n+1}$ increases its expectation. At variance, if the new sample value $X_{n+1} > h$, the new value $\hat\kappa_h(n+1) \approx \frac{A_h(n) + X_{n+1} - h}{S(n) + X_{n+1} - h} = 1 - \frac{S(n) - A_h(n)}{S(n) + X_{n+1} - h}$, which is now concave in $X_{n+1}$, so that uncertainty on $X_{n+1}$ reduces its value. The competition between these two opposite effects is in favor of the latter, because of a higher concavity with respect to the variable, and also of a higher variability (whatever its measurement) of the variable conditionally on being above the threshold than on being below. The fatter the right tail of the distribution, the stronger the effect. Overall, we find that $E[\hat\kappa_h(n)] \le \frac{E[A_h(n)]}{E[S(n)]} = \kappa_h$ (note that unfreezing the threshold $\hat h(q)$ also tends to reduce the concentration measure estimate, adding to the effect when introducing one extra sample, because of a slight increase in the expected value of the estimator $\hat h(q)$, although this effect is rather negligible). We have in fact the following:

Proposition 1. Let $X = (X_i)_{i=1}^{n}$ be a random sample of size $n > \frac{1}{q}$, $Y = X_{n+1}$ an extra single random observation, and define:

$$\hat\kappa_h(X \sqcup Y) = \frac{\sum_{i=1}^{n} \mathbb{1}_{X_i > h}\, X_i + \mathbb{1}_{Y > h}\, Y}{\sum_{i=1}^{n} X_i + Y}$$

We remark that, whenever $Y > h$, one has:

$$\frac{\partial^2 \hat\kappa_h(X \sqcup Y)}{\partial Y^2} \le 0.$$

This inequality is still valid with $\hat\kappa_q$, as the value $\hat h(q, X \sqcup Y)$ doesn't depend on the particular value of $Y > \hat h(q, X)$.

We face a different situation from the common small sample effect resulting from the high impact of rare observations in the tails that are less likely to show up in small samples, a bias which goes away by repetition of sample runs. The concavity of the estimator constitutes an upper bound for the measurement in finite n, clipping large deviations, which leads to problems of aggregation as we will state below in Theorem 1. In practice, even in very large samples, the contribution of very large rare events to $\kappa_q$ slows down the convergence of the sample estimator to the true value. For a better, unbiased estimate, one would need to use a different path: first estimating the distribution parameters $(\alpha, \lambda)$ and only then estimating the theoretical tail contribution $\kappa_q(\alpha, \lambda)$. Falk [7] observes that, even with a proper estimator of α and λ, the convergence is extremely slow, namely of the order of $n^{-\delta}/\ln n$, where the exponent δ depends on α and on the tolerance of the actual distribution vs. a theoretical Pareto, measured by the Hellinger distance. In particular, $\delta \to 0$ as $\alpha \to 1$, making the convergence really slow for low values of α.

Fig. 1: Effect of an additional observation Y on $\hat\kappa$.

Fig. 2: Effect of an additional observation Y on $\hat\kappa$ near the threshold; we can see convexity on both sides of h except for an area of no effect to the left of h, of order 1/n.

III. AN INEQUALITY ABOUT AGGREGATING INEQUALITY

For the estimation of the mean of a fat-tailed r.v. $(X)_i^j$, in m sub-samples of size $n_i$ each for a total of $n = \sum_{i=1}^{m} n_i$, the allocation of the total number of observations n between i and j does not matter so long as the total n is unchanged. Here the allocation of n samples between m sub-samples does matter because of the concavity of κ.² Next we prove that global concentration as measured by $\hat\kappa_q$ on a broad set of data will appear higher than local concentration, so aggregating European data, for instance, would give a $\hat\kappa_q$ higher than the average measure of concentration across countries, an "inequality about inequality". In other words, we claim that the estimation bias when using $\hat\kappa_q(n)$ is even increased when dividing the sample into sub-samples and taking the weighted average of the measured values $\hat\kappa_q(n_i)$.

² The same concavity and general bias applies when the distribution is lognormal, and is exacerbated by high variance.

Theorem 1. Partition the n data into m sub-samples $N = N_1 \cup \ldots \cup N_m$ of respective sizes $n_1, \ldots, n_m$, with $\sum_{i=1}^{m} n_i = n$, and let $S_1, \ldots, S_m$ be the sum of variables over each sub-sample, and $S = \sum_{i=1}^{m} S_i$ be that over the whole sample. Then we have:

$$E[\hat\kappa_q(N)] \ge \sum_{i=1}^{m} E\!\left[\frac{S_i}{S}\right] E[\hat\kappa_q(N_i)]$$

If we further assume that the distribution of the variables $X_j$ is the same in all the sub-samples, then we have:

$$E[\hat\kappa_q(N)] \ge \sum_{i=1}^{m} \frac{n_i}{n}\, E[\hat\kappa_q(N_i)]$$

In other words, averaging concentration measures of sub-samples, weighted by the total sum of each sub-sample, produces a downward biased estimate of the concentration measure of the full sample.

Proof. An elementary induction reduces the question to the case of two sub-samples. Let $q \in (0, 1)$ and $(X_1, \ldots, X_m)$ and $(X'_1, \ldots, X'_n)$ be two samples of positive i.i.d. random variables, the $X_i$'s having distribution $p(dx)$ and the $X'_j$'s having distribution $p'(dx')$. For simplicity, we assume that both $qm$ and $qn$ are integers. We set $S = \sum_{i=1}^{m} X_i$ and $S' = \sum_{i=1}^{n} X'_i$. We define $A = \sum_{i=1}^{qm} X_{[i]}$ where $X_{[i]}$ is the i-th largest value of $(X_1, \ldots, X_m)$, and $A' = \sum_{i=1}^{qn} X'_{[i]}$ where $X'_{[i]}$ is the i-th largest value of $(X'_1, \ldots, X'_n)$. We also set $S'' = S + S'$ and $A'' = \sum_{i=1}^{q(m+n)} X''_{[i]}$ where $X''_{[i]}$ is the i-th largest value of the joint sample $(X_1, \ldots, X_m, X'_1, \ldots, X'_n)$.

The q-concentration measures for the samples $X = (X_1, \ldots, X_m)$, $X' = (X'_1, \ldots, X'_n)$ and $X'' = (X_1, \ldots, X_m, X'_1, \ldots, X'_n)$ are:

$$\kappa = \frac{A}{S}, \qquad \kappa' = \frac{A'}{S'}, \qquad \kappa'' = \frac{A''}{S''}$$

We must prove that the following inequality holds for expected concentration measures:

$$E[\kappa''] \ge E\!\left[\frac{S}{S''}\right] E[\kappa] + E\!\left[\frac{S'}{S''}\right] E[\kappa']$$

We observe that:

$$A = \max_{J \subset \{1, \ldots, m\},\; |J| = qm}\; \sum_{i \in J} X_i$$

and, similarly, $A' = \max_{J' \subset \{1, \ldots, n\},\, |J'| = qn} \sum_{i \in J'} X'_i$ and $A'' = \max_{J'' \subset \{1, \ldots, m+n\},\, |J''| = q(m+n)} \sum_{i \in J''} X_i$, where we have denoted $X_{m+i} = X'_i$ for $i = 1 \ldots n$. If $J \subset \{1, \ldots, m\}$, $|J| = qm$ and $J' \subset \{m+1, \ldots, m+n\}$, $|J'| = qn$, then $J'' = J \cup J'$ has cardinal $q(m+n)$, hence $A + A' = \sum_{i \in J''} X_i \le A''$, whatever the particular sample. Therefore $\kappa'' \ge \frac{S}{S''}\kappa + \frac{S'}{S''}\kappa'$ and we have:

$$E[\kappa''] \ge E\!\left[\frac{S}{S''}\,\kappa\right] + E\!\left[\frac{S'}{S''}\,\kappa'\right]$$

Let us now show that:

$$E\!\left[\frac{S}{S''}\,\kappa\right] = E\!\left[\frac{A}{S''}\right] \ge E\!\left[\frac{S}{S''}\right] E\!\left[\frac{A}{S}\right]$$

If this is the case, then we identically get for $\kappa'$:

$$E\!\left[\frac{S'}{S''}\,\kappa'\right] = E\!\left[\frac{A'}{S''}\right] \ge E\!\left[\frac{S'}{S''}\right] E\!\left[\frac{A'}{S'}\right]$$

hence we will have:

$$E[\kappa''] \ge E\!\left[\frac{S}{S''}\right] E[\kappa] + E\!\left[\frac{S'}{S''}\right] E[\kappa']$$

Let $T = X_{[mq]}$ be the cut-off point (where $[mq]$ is the integer part of $mq$), so that $A = \sum_{i=1}^{m} X_i\, \mathbb{1}_{X_i \ge T}$ and let $B = S - A = \sum_{i=1}^{m} X_i\, \mathbb{1}_{X_i < T}$ ...

Recall that the h-concentration measure is $\kappa_h = P(X > h)\,\frac{E[X \mid X > h]}{E[X]}$, whereas the n-sample h-concentration measure is $\hat\kappa_h(n) = \frac{A(n)}{S(n)}$, where $A(n)$ and $S(n)$ are defined as above for an n-sample $X = (X_1, \ldots, X_n)$ of i.i.d. variables with the same distribution as X.

Theorem 2. For any $n \in \mathbb{N}$, we have:

$$E[\hat\kappa_h(n)] < \kappa_h$$

and

$$\lim_{n \to +\infty} \hat\kappa_h(n) = \kappa_h \quad \text{a.s. and in probability}$$

Proof. The above corollary shows that the sequence $n\, E[\hat\kappa_h(n)]$ is super-additive, hence $E[\hat\kappa_h(n)]$ is an increasing sequence. Moreover, thanks to the law of large numbers, $\frac{1}{n} S(n)$ converges almost surely and in probability to $E[X]$ and $\frac{1}{n} A(n)$ converges almost surely and in probability to $E[X\,\mathbb{1}_{X > h}] = P(X > h)\, E[X \mid X > h]$, hence their ratio also converges almost surely to $\kappa_h$. On the other hand, this ratio is bounded by 1. The Lebesgue dominated convergence theorem concludes the argument about the convergence in probability.

IV. MIXED DISTRIBUTIONS FOR THE TAIL EXPONENT

Consider now a random variable X, the distribution of which p(dx) is a mixture of parametric distributions with different values of the parameter α: $p(dx) = \sum_{i=1}^{m} \omega_i\, p_{\alpha_i}(dx)$. A typical n-sample of X can be made of $n_i = \omega_i n$ samples of $X_{\alpha_i}$ with distribution $p_{\alpha_i}$. The above theorem shows that, in this case, we have:

$$E[\hat\kappa_q(n, X)] \ge \sum_{i=1}^{m} E\!\left[\frac{S(\omega_i n, X_{\alpha_i})}{S(n, X)}\right] E[\hat\kappa_q(\omega_i n, X_{\alpha_i})]$$

When $n \to +\infty$, each ratio $\frac{S(\omega_i n, X_{\alpha_i})}{S(n, X)}$ converges almost surely to $\omega_i$ respectively, therefore we have the following convexity inequality:

$$\kappa_q(X) \ge \sum_{i=1}^{m} \omega_i\, \kappa_q(X_{\alpha_i})$$

The case of the Pareto distribution is particularly interesting. Here, the parameter α represents the tail exponent of the distribution. If we normalize expectations to 1, the cdf of $X_\alpha$ is $F_\alpha(x) = 1 - \left(\frac{x}{x_{\min}}\right)^{-\alpha}$ and we have:

$$\kappa_q(X_\alpha) = q^{1 - \frac{1}{\alpha}}$$

and

$$\frac{d^2}{d\alpha^2}\,\kappa_q(X_\alpha) = q^{1 - \frac{1}{\alpha}}\;\frac{\log q}{\alpha^3}\left(\frac{\log q}{\alpha} - 2\right) > 0$$

Hence $\kappa_q(X_\alpha)$ is a convex function of α and we can write:

$$\kappa_q(X) \ge \sum_{i=1}^{m} \omega_i\, \kappa_q(X_{\alpha_i}) \ge \kappa_q(X_{\bar\alpha})$$

where $\bar\alpha = \sum_{i=1}^{m} \omega_i\, \alpha_i$.

Suppose now that X is a positive random variable with unknown distribution, except that its tail decays like a power law with unknown exponent. An unbiased estimation of the exponent, with necessarily some amount of uncertainty (i.e., a distribution of possible true values around some average), would lead to a downward biased estimate of $\kappa_q$.
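A two-line numerical check of this convexity effect (the exponents and weights below are arbitrary):

```python
# Averaging kappa_q over uncertain tail exponents exceeds kappa_q at the average exponent.
q = 0.01
kappa = lambda a: q ** (1 - 1 / a)
alphas, weights = [1.1, 1.3], [0.5, 0.5]

mixed = sum(w * kappa(a) for w, a in zip(weights, alphas))
at_mean = kappa(sum(w * a for w, a in zip(weights, alphas)))
print(round(mixed, 3), ">", round(at_mean, 3))   # 0.502 > 0.464
```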

Because the concentration measure only depends on the tail of the distribution, this inequality also applies in the case of a mixture of distributions with a power decay, as in Equation 1:

$$P(X > x) \approx \sum_{j=1}^{N} \omega_j\, L_j(x)\, x^{-\alpha_j} \qquad (3)$$

The slightest uncertainty about the exponent increases the concentration index. One can get an actual estimate of this bias by considering an average $\bar\alpha > 1$ and two surrounding values $\alpha^+ = \bar\alpha + \delta$ and $\alpha^- = \bar\alpha - \delta$. The convexity inequality writes as follows:

$$\kappa_q(\bar\alpha) = q^{1 - \frac{1}{\bar\alpha}} < \frac{1}{2}\left(q^{1 - \frac{1}{\bar\alpha - \delta}} + q^{1 - \frac{1}{\bar\alpha + \delta}}\right)$$

(... that is, finite variance but infinite kurtosis.)

It is strange that, given the dominant role of fat tails, nobody thought of calculating some practical equivalence table. How can people compare averages concerning street crime (very thin tailed) to casualties from war (very fat tailed) without some sample adjustment?¹

Perhaps the problem lies at the core of the law of large numbers: the average is not as "visible" as other statistical dimensions; there is no sound statistical procedure to derive the properties of powerlaw tailed data by estimating the mean. Typically estimation is done by fitting the tail exponent (via, say, the Hill estimator or some other method), or by dealing with extrema, yet it remains that many articles make comparisons about the mean since it is what descriptive statistics and, alas, decisions, are based on.

I. THE PROBLEM OF MATCHING ERRORS

By the weak law of large numbers, consider a sum of random variables $X_1, X_2, \ldots, X_n$, independent and identically distributed with finite mean m, that is $E[X_i] < \infty$; then $\frac{1}{n}\sum_{1 \le i \le n} X_i$ converges to m in probability, as $n \to \infty$. And the idea is that we live with finite n.

We get most of the intuitions from closed-form and semi-closed form expressions working with:
• stable distributions (which allow for a broad span of fat tails by varying the α exponent, along with the asymmetry via the β coefficient)
• stable distributions with mixed α exponent
• other symmetric distributions with fat tails (such as mixed Gaussians, Gamma-Variance Gaussians, or simple stochastic volatility)

More complicated situations entailing more numerical tinkering are also covered: Pareto classes, lognormal, etc.

Instability of Mean Deviation

Indexing with p the properties of the variable $X^p$ and with g those of $X^g$, the Gaussian:

$$\left\{ n_p : E\!\left(\,\left|\sum_{i=1}^{n_p} \frac{X_i^p - m_p}{n_p}\right|\,\right) = E\!\left(\,\left|\sum_{i=1}^{n_g} \frac{X_i^g - m_g}{n_g}\right|\,\right) \right\} \qquad (1)$$

¹ The Pinker Problem: a class of naive empiricism. It has been named so in reference to sloppy use of statistical techniques in social science and policy making, based on a theory promoted by the science writer S. Pinker [1] about the drop of violence, that is based on such statistical fallacies, since wars, unlike domestic violence, are fat tailed. But this is a very general problem with the (irresponsible) mechanistic use of statistical methods in social science and biology.


Fig. 2: The ratio of cumulants $C_2/C_1(\alpha)$ for a symmetric powerlaw, as a function of the tail exponent α.

And since we know that convergence for the Gaussian happens at speed $n^{\frac{1}{2}}$, we can compare to the convergence of other classes. We are expressing in Equation 1 the expected error (that is, a risk function) in L1 as mean absolute deviation from the observed average, to accommodate the absence of variance, but assuming of course the existence of the first moment, without which there is no point discussing averages.

Typically, in statistical inference, one uses standard deviations of the observations to establish the sufficiency of n. But in fat tailed data standard deviations do not exist, or, worse, when they exist, as in a powerlaw with tail exponent > 3, they are extremely unstable, particularly in cases where kurtosis is infinite.

Using mean deviations of the samples (when these exist) doesn't accommodate the fact that fat tailed data hide properties. The "volatility of volatility", or the dispersion around the mean deviation, increases nonlinearly as the tails get fatter. For instance, a stable distribution with tail exponent at 3/2 matched to exactly the same mean deviation as the Gaussian will deliver measurements of mean deviation 1.4 times as unstable as the Gaussian.

Using mean absolute deviation for "volatility", and its mean deviation, the "volatility of volatility", expressed in the L1 norm, as the C1 and C2 cumulants:

$$C_1 = E(|X - m|)$$

$$C_2 = E\big(\,\big|\,X - E(|X - m|)\,\big|\,\big)$$

We can see that matching mean deviations does not go very far in matching cumulants (see Appendix 1). Further, a sum of Gaussian variables will have its extreme values distributed as a Gumbel, while a sum of fat tailed variables will follow a Fréchet distribution, regardless of the number of summands. The difference is not trivial, as shown in the figures: in 10⁶ realizations for an average with 100 summands, we can expect to observe maxima more than 4,000 times the average, while for a Gaussian we can hardly encounter more than about 5 times.
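A rough simulation of that contrast (a sketch only: the text does not specify the fat-tailed distribution behind the figures, so a Pareto with alpha = 1.1 is assumed here and the exact multiples will differ):

```python
import numpy as np

rng = np.random.default_rng(4)
realizations, summands = 100_000, 100          # scaled down from the text's 10^6 realizations

pareto_means = (1 + rng.pareto(1.1, (realizations, summands))).mean(axis=1)
gauss_means = rng.normal(0.0, 1.0, (realizations, summands)).mean(axis=1)

# Fat tailed: the largest observed average, in multiples of the typical average.
print("fat tailed: max / average of the means  =", pareto_means.max() / pareto_means.mean())
# Gaussian: the largest observed average, in multiples of its dispersion.
print("Gaussian  : max / std of the means      =", gauss_means.max() / gauss_means.std())
```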

II. GENERALIZING MEAN DEVIATION AS PARTIAL EXPECTATION

It is unfortunate that, even if one matches mean deviations, the dispersion of the distributions of the mean deviations (and their skewness) would be such that a "tail" would remain markedly different in spite of a number of summands that allows the matching of the first order cumulant. So we can match the special part of the distribution, the expectation > K or < K, where K can be any arbitrary level.

Let $\Psi(t)$ be the characteristic function of the random variable. Let θ be the Heaviside theta function. Since $\mathrm{sgn}(x) = 2\theta(x) - 1$,

$$\Psi_K(t) = \int_{-\infty}^{\infty} e^{itx}\left(2\theta(x - K) - 1\right)dx = \frac{2i\, e^{iKt}}{t}$$

And the special expectation becomes, by convoluting the Fourier transforms:

$$E(X \mid X > K) = -i\,\frac{\partial}{\partial t}\int_{-\infty}^{\infty} \Psi(t - u)\,\Psi_K(u)\,du \,\Big|_{t=0} \qquad (2)$$

Mean deviation becomes a special case of equation 2, $E(|X|) = E(X\,\mathbb{1}_{X > 0}) - E(X\,\mathbb{1}_{X < 0})$.


III. CLASS OF STABLE DISTRIBUTIONS

Assume alpha-stable the class S of probability distributions that is closed under convolution: $S(\alpha, \beta, \mu, \sigma)$ represents the stable distribution with tail index $\alpha \in (0, 2]$, symmetry parameter $\beta \in [-1, 1]$, location parameter $\mu \in \mathbb{R}$, and scale parameter $\sigma \in \mathbb{R}^+$. The Generalized Central Limit Theorem gives sequences $a_n$ and $b_n$ such that the distribution of the shifted and rescaled sum $Z_n = \left(\sum_{i=1}^{n} X_i - a_n\right)/b_n$ of n i.i.d. random variates $X_i$, the distribution function of which $F_X(x)$ has asymptotes $1 - c\,x^{-\alpha}$ as $x \to +\infty$ and $d\,(-x)^{-\alpha}$ as $x \to -\infty$, weakly converges to a stable distribution with tail index $\alpha \wedge 2$.

We evaluate the special expectation $E(X\,\mathbb{1}_{X > K})$ under the stable distribution above. From Equation 2:

$$E(X\,\mathbb{1}_{X > K}) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \alpha\,\sigma^{\alpha}\,|u|^{\alpha - 2}\left(1 + i\beta\tan\tfrac{\pi\alpha}{2}\,\mathrm{sgn}(u)\right)\exp\!\left(-\,|\sigma u|^{\alpha}\left(1 - i\beta\tan\tfrac{\pi\alpha}{2}\,\mathrm{sgn}(u)\right) + iKu\right)du$$

with explicit solution for K = 0:

$$E(\alpha, \beta, \sigma, 0) = \frac{\sigma}{\pi}\,\Gamma\!\left(1 - \frac{1}{\alpha}\right)\left(\left(1 + i\beta\tan\frac{\pi\alpha}{2}\right)^{1/\alpha} + \left(1 - i\beta\tan\frac{\pi\alpha}{2}\right)^{1/\alpha}\right) \qquad (3)$$

    and semi-explicit generalized form:

    E(,,,K) =

    1

    1 + i tan

    2

    1/+

    1 i tan

    2

    1/

    2

    +

    1X

    k=1

    ikKkk+1

    2 tan2

    2

    + 1

    1k

    (1)k

    1 + i tan

    2

    k1

    +

    1 i tan

    2

    k1

    2k1k!

    (4)

Our formulation in Equation 4 generalizes and simplifies the commonly used one from Wolfe [5], from which Hardin [6] got the explicit form, promoted in Samorodnitsky and Taqqu [4] and Zolotarev [3]:

$$E(|X|) = \frac{2\,\sigma}{\pi}\,\Gamma\!\left(1 - \frac{1}{\alpha}\right)\left(\beta^2\tan^2\frac{\pi\alpha}{2} + 1\right)^{\frac{1}{2\alpha}}\cos\!\left(\frac{1}{\alpha}\arctan\!\left(\beta\tan\frac{\pi\alpha}{2}\right)\right)$$

Which allows us to prove the following statements:


1) Relative convergence: The general case with $\beta \ne 0$: for so and so, assuming so and so, (precisions) etc.,

$$n_\beta = \left(\frac{2\sqrt{\pi}}{\Gamma\!\left(1 - \frac{1}{\alpha}\right)\sqrt{n_g}\left(\left(1 - i\beta\tan\frac{\pi\alpha}{2}\right)^{1/\alpha} + \left(1 + i\beta\tan\frac{\pi\alpha}{2}\right)^{1/\alpha}\right)}\right)^{\frac{\alpha}{1 - \alpha}} \qquad (5)$$

with alternative expression:

$$n_\beta = \left(\frac{\Gamma\!\left(1 - \frac{1}{\alpha}\right)}{\sqrt{\pi}}\left(\beta^2\tan^2\frac{\pi\alpha}{2} + 1\right)^{\frac{1}{2\alpha}}\cos\!\left(\frac{1}{\alpha}\arctan\!\left(\beta\tan\frac{\pi\alpha}{2}\right)\right)\sqrt{n_g}\right)^{\frac{\alpha}{\alpha - 1}} \qquad (6)$$

Which in the symmetric case $\beta = 0$ reduces to:

$$n_\alpha = \left(\frac{\Gamma\!\left(1 - \frac{1}{\alpha}\right)}{\sqrt{\pi}}\,\sqrt{n_g}\right)^{\frac{\alpha}{\alpha - 1}} \qquad (7)$$

2) Speed of convergence: $\forall k \in \mathbb{N}^+$ and $\alpha \in (1, 2]$:

$$E\!\left(\,\left|\sum_{i=1}^{kn} \frac{X_i - m}{kn}\right|\,\right) \Big/ \;E\!\left(\,\left|\sum_{i=1}^{n} \frac{X_i - m}{n}\right|\,\right) = k^{\frac{1}{\alpha} - 1} \qquad (8)$$

Table I shows the equivalence of summands between processes.

TABLE I: Corresponding $n_\alpha$, or how many observations one needs for the equivalent α-stable distribution. The Gaussian case is α = 2. For the case with equivalent tails to the 80/20 one needs 10¹¹ more data than the Gaussian.

    α       n_α            n_α (β = ±1/2)   n_α (β = ±1)
    1       Fughedaboudit  -                -
    9/8     6.09 × 10¹²    2.8 × 10¹³       1.86 × 10¹⁴
    5/4     574,634        895,952          1.88 × 10⁶
    11/8    5,027          6,002            8,632
    3/2     567            613              737
    13/8    165            171              186
    7/4     75             77               79
    15/8    44             44               44
    2       30             30               30
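Equation 8 (and hence the relative entries of Table I) can be checked by simulation (a sketch assuming scipy's levy_stable generator; α, n, and k below are arbitrary):

```python
# For a symmetric alpha-stable law the L1 error of the sample mean shrinks as n**(1/alpha - 1),
# so multiplying n by k should multiply it by k**(1/alpha - 1).
from scipy.stats import levy_stable
import numpy as np

alpha, n, k, runs = 1.5, 100, 2, 4000
rng = np.random.default_rng(3)

def mean_abs_error(n_obs):
    x = levy_stable.rvs(alpha, 0.0, size=(runs, n_obs), random_state=rng)
    return np.mean(np.abs(x.mean(axis=1)))      # the true mean is 0

ratio = mean_abs_error(k * n) / mean_abs_error(n)
print("simulated ratio:", round(ratio, 3), "  theory k**(1/alpha - 1):", round(k ** (1/alpha - 1), 3))
```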

Remark 1. The ratio of mean deviations for distributions in S is homogeneous of degree $k^{\frac{1}{\alpha} - 1}$. This is not the case for other, "nonstable" classes.

Proof. (Sketch) From the characteristic function of the stable distribution. Other distributions need to converge to the basin S.

B. Stochastic Alpha or Mixed Samples

Define the mixed population $X_{\bar\alpha}$ and $\varsigma(X_{\bar\alpha})$ as the mean deviation of ...

Proposition 1. For so and so

$$\varsigma(X_{\bar\alpha}) \ge \sum_{i=1}^{m} \omega_i\, \varsigma(X_{\alpha_i})$$

where $\bar\alpha = \sum_{i=1}^{m} \omega_i\, \alpha_i$ and $\sum_{i=1}^{m} \omega_i = 1$.

Proof. A sketch for now: $\forall \alpha \in (1, 2)$, where γ is the Euler-Mascheroni constant ≈ 0.5772, $\psi^{(1)}$ the first derivative of the Polygamma function $\psi(x) = \Gamma'(x)/\Gamma(x)$, and $H_n$ the nth harmonic number:

    @2

    @2=

    2

    4

    1

    n

    1

    1 (1)

    1

    +

    H 1

    + log(n) +

    2H 1

    + log(n) +


Fig. 3: Asymmetries and mean deviation: E(|X|) as a function of β, for α = 5/4, 3/2, 7/4.

Fig. 4: Mixing distributions: the effect is pronounced at lower values of α, as tail uncertainty creates more fat-tailedness.

which is positive for values in the specified range, keeping α < 2, as beyond it the sum would no longer converge to the Stable basin. It is also negative with respect to α, as can be seen in Figure 4. The implication is that one's sample underestimates the required "n". (Commentary.)

IV. SYMMETRIC NONSTABLE DISTRIBUTIONS IN THE SUBEXPONENTIAL CLASS

A. Symmetric Mixed Gaussians, Stochastic Mean

While mixing Gaussians the kurtosis rises, which makes it convenient to simulate fat-tailedness. But mixing means has the opposite effect, as if it were more "stabilizing". We can observe a similar effect of "thin-tailedness" as far as the n required to match the standard benchmark. The situation is the result of multimodality, noting that stable distributions are unimodal (Ibragimov and Chernin) [7] and infinitely divisible (Wolfe) [8]. For $X_i$ Gaussian with mean μ,

$$E(|X_i|) = \mu\,\operatorname{erf}\!\left(\frac{\mu}{\sqrt{2}\,\sigma}\right) + \sigma\sqrt{\frac{2}{\pi}}\;e^{-\frac{\mu^2}{2\sigma^2}},$$

keeping the mean at ±μ with probability 1/2 each. With the perfectly symmetric case μ = 0 and sampling with equal probability:

Fig. 5: Different speed: the fatter tailed processes are not just more uncertain; they also converge more slowly.

    1

    2

    (E++E) =

    0

    @e 2

    22

    p2

    +

    1

    2

    erf

    p2

    1

    A erf

    0

    @e 2

    22

    p

    +

    erf

    p2

    p2

    1

    A+

    p2

    exp

    0

    B@

    q2

    e

    2

    22+ erf

    p2

    2

    22

    1

    CA

B. Half cubic Student T (Lévy Stable Basin)

Relative convergence:

Theorem 1. For all so and so, (details), etc.:

$$c_1 \le \frac{E\!\left(\,\left|\sum_{i=1}^{kn} \frac{X_i - m}{kn}\right|\,\right)}{E\!\left(\,\left|\sum_{i=1}^{n} \frac{X_i - m}{n}\right|\,\right)} \le c_2 \qquad (9)$$

where:

$$c_1 = k^{\frac{1}{\alpha} - 1}, \qquad c_2 = \frac{2^{7/2}\,\pi^{1/2}}{\Gamma\!\left(\frac{1}{4}\right)^2}$$

Note that because of the instability of the distribution outside the basin, the sums end up converging to $S_{\min(\alpha, 2)}$, so at k = 2, n = 1, equation 9 becomes an equality, and as $k \to \infty$ we satisfy the equalities in ?? and 8.

Proof. (Sketch) The characteristic function for α = 3/2:

$$\Psi(t) = \frac{3^{3/8}\,|t|^{3/4}\, K_{\frac{3}{4}}\!\left(\sqrt{\tfrac{3}{2}}\,|t|\right)}{\sqrt[8]{2}\;\Gamma\!\left(\frac{3}{4}\right)}$$

    Leading to convoluted density p2 for a sum n = 2:

    p2(x) =

    54

    2F1

    54 , 2;

    74 ;

    2x2

    3

    p3

    34

    2

    74


Fig. 6: Running $\left|\frac{1}{n}\sum_{i}^{n} x_i\right|$ vs. n for the Student T with exponent α = 3. This applies to the general class of symmetric power law distributions.

C. Cubic Student T (Gaussian Basin)

Student T with 3 degrees of freedom (the higher exponent resembles the Gaussian). We can get a semi-explicit density for the Cubic Student T.

$$p(x) = \frac{6\sqrt{3}}{\pi\left(x^{2} + 3\right)^{2}}$$

we have:

$$\varphi(t) = E[e^{itX}] = \left(1 + \sqrt{3}\,|t|\right) e^{-\sqrt{3}\,|t|}$$

hence the n-summed characteristic function is:

$$\varphi(t) = \left(1 + \sqrt{3}\,|t|\right)^{n} e^{-n\sqrt{3}\,|t|}$$

and the pdf of Y is given by:

$$p(x) = \frac{1}{\pi}\int_{0}^{+\infty}\left(1 + \sqrt{3}\,t\right)^{n} e^{-n\sqrt{3}\,t}\cos(tx)\,dt$$

using

$$\int_{0}^{\infty} t^{k} e^{-t}\cos(st)\,dt = \frac{T_{1+k}\!\left(1/\sqrt{1 + s^{2}}\right)\, k!}{\left(1 + s^{2}\right)^{(k+1)/2}}$$

where $T_{a}(x)$ is the T-Chebyshev polynomial, the pdf $p(x)$ can be written:

$$p(x) = \frac{\left(n^{2} + \frac{x^{2}}{3}\right)^{-n-1}}{\sqrt{3}\,\pi} \sum_{k=0}^{n} \frac{n!\,\left(n^{2} + \frac{x^{2}}{3}\right)^{\frac{1-k}{2} + n}\; T_{k+1}\!\left(\frac{1}{\sqrt{\frac{x^{2}}{3 n^{2}} + 1}}\right)}{(n - k)!}$$

which allows explicit solutions for specific values of n, but not for the general form $\{E_n\}_{1 \le n}$ ...
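The series above can be cross-checked against a direct numerical inversion of the characteristic function (a sketch; n and the evaluation points are arbitrary):

```python
import numpy as np
from math import factorial, sqrt, pi
from numpy.polynomial.chebyshev import Chebyshev
from scipy.integrate import quad

def p_numeric(x, n):
    # direct inversion of (1 + sqrt(3)|t|)^n exp(-n sqrt(3)|t|)
    integrand = lambda t: (1 + sqrt(3) * t) ** n * np.exp(-n * sqrt(3) * t) * np.cos(t * x)
    return quad(integrand, 0, np.inf)[0] / pi

def cheb_T(k, z):
    c = np.zeros(k + 1); c[k] = 1.0
    return Chebyshev(c)(z)                      # T_k evaluated at z

def p_series(x, n):
    base = n ** 2 + x ** 2 / 3
    z = 1 / sqrt(x ** 2 / (3 * n ** 2) + 1)
    s = sum(factorial(n) / factorial(n - k) * base ** ((1 - k) / 2 + n) * cheb_T(k + 1, z)
            for k in range(n + 1))
    return base ** (-n - 1) * s / (sqrt(3) * pi)

for x in (0.0, 1.0, 5.0):
    print(x, round(p_numeric(x, 4), 6), round(p_series(x, 4), 6))   # the two columns agree
```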


Fig. 7: Sums of binary bets (p = 1/2 vs. p = 1/8) converge rapidly to the Gaussian basin but remain clearly subgaussian for small samples.

Fig. 8: Betting against the long shot (1/100): for asymmetric binary bets, at small values of p, convergence is slower.

V. ASYMMETRIC NONSTABLE DISTRIBUTIONS IN THE SUBEXPONENTIAL CLASS

A. One-tailed Pareto Distributions

B. The Lognormal and Borderline Subexponential Class

VI. ASYMMETRIC DISTRIBUTIONS IN THE SUPEREXPONENTIAL CLASS

A. Mixing Gaussian Distributions and Poisson Case

B. Skew Normal Distribution

This is the most untractable case mathematically, apparently though the most present when we discuss fat tails [9].

C. Super-thin tailed distributions: Subgaussians

Consider a sum of Bernoulli variables X. The average $\frac{1}{n}\sum_{i \le n} x_i$ follows a (scaled) Binomial distribution. Assuming $np \in \mathbb{N}^+$ to simplify:

$$E\left(|\Delta_n|\right) = 2\sum_{x \ge np}\left(x - np\right)\binom{n}{x}\, p^{x}\,(1 - p)^{n - x}$$


    E (|n

    |) = 2(1 p)n(p)+n2pnp+1(np+ 2)(p 1)

    n

    np+ 1

    1 p(np+ 2)

    n

    np+ 2

    2

    where:1 =2 F1

    1, n(p 1) + 1;np+ 2; p

    p 1

    and2 =2 F1

    2, n(p 1) + 2;np+ 3; p

    p 1
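The summation identity above (before the closed form) can be verified directly (a sketch with arbitrary n and p, keeping np an integer):

```python
import numpy as np
from scipy.stats import binom

n, p = 40, 0.25                      # np = 10, an integer
x = np.arange(0, n + 1)
pmf = binom.pmf(x, n, p)

identity = 2 * np.sum((x[x > n * p] - n * p) * pmf[x > n * p])   # 2 * E[(S - np)^+]
direct   = np.sum(np.abs(x - n * p) * pmf)                       # E|S - np| computed directly

print(identity, direct)              # the two values coincide
```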

VII. ACKNOWLEDGEMENT

Colman Humphrey, ...

APPENDIX: METHODOLOGY, PROOFS, ETC.

A. Cumulants

We have in the Gaussian case, indexed by g:

$$C_2^g = \left(\operatorname{erf}\!\left(\frac{1}{\sqrt{\pi}}\right) + e^{-1/\pi}\right) C_1^g$$

which is ≈ 1.30 $C_1^g$.

For a powerlaw distribution, cumulants are more unwieldy:

$$C_1^{\alpha = 3/2} = \frac{2\sqrt{6}\;\Gamma\!\left(\frac{5}{4}\right)}{\sqrt{\pi}\;\Gamma\!\left(\frac{3}{4}\right)}$$

Move to appendix

    C =3/22 =1

    2

    p63/231 (

    21 +

    23)

    5/4

    3845/432

    5/21 + 24

    9/42

    9/21 29/4

    p2

    4

    q21 +

    23

    9/21 H1

    + 1536

    52

    4

    q21 +

    23H2 +

    3 4q21 +

    23

    3

    p2

    31 + 33 (H2 + 2) 2

    4p23/4H1

    where 1 = 34

    , 2 =

    54

    , 3 =

    14

    , H1 = 2F1

    34 ,

    54 ;

    74 ;

    2123

    , and H2 = 2F1

    12 ,

    54 ;

    32 ;

    2321

    .

B. Derivations using explicit E(|X|)

See Wolfe [5], from which Hardin got the explicit form [6].

C. Derivations using the Hilbert Transform and β = 0

Section obsolete since I found forms for asymmetric stable distributions. Some commentary on Hilbert transforms for symmetric stable distributions, given that for $Z = |X|$, $dF_Z(z) = dF_X(x)\left(1 + \mathrm{sgn}(x)\right)$, that type of thing.

Hilbert Transform for a function f (see Hlusek [10], Pinelis [11]):

$$H(f) = \frac{1}{\pi}\;\text{p.v.}\int_{-\infty}^{\infty} \frac{f(x)}{t - x}\,dx$$

Here p.v. means principal value in the Cauchy sense, in other words

$$\text{p.v.}\int_{-\infty}^{\infty} = \lim_{a \to \infty}\,\lim_{b \to 0}\left(\int_{-a}^{-b} + \int_{b}^{a}\right)$$

$$E(|X|) = \frac{\partial}{\partial t}\,H(\Psi)(0) = \frac{1}{\pi}\,\frac{\partial}{\partial t}\;\text{p.v.}\int_{-\infty}^{\infty} \frac{\Psi(z)}{t - z}\,dz\,\Big|_{t=0}$$

$$E(|X|) = \frac{1}{\pi}\;\text{p.v.}\int_{-\infty}^{\infty} \frac{\Psi(z)}{z^{2}}\,dz$$

In our case:

$$E(|X|) = \frac{1}{\pi}\;\text{p.v.}\int_{-\infty}^{\infty} \frac{e^{-|t|^{\alpha}}}{t^{2}}\,dt = \frac{2}{\pi}\,\Gamma\!\left(1 - \frac{1}{\alpha}\right)$$


REFERENCES

[1] S. Pinker, The Better Angels of Our Nature: Why Violence Has Declined. Penguin, 2011.
[2] V. V. Uchaikin and V. M. Zolotarev, Chance and Stability: Stable Distributions and Their Applications. Walter de Gruyter, 1999.
[3] V. M. Zolotarev, One-Dimensional Stable Distributions. American Mathematical Soc., 1986, vol. 65.
[4] G. Samorodnitsky and M. S. Taqqu, Stable Non-Gaussian Random Processes: Stochastic Models with Infinite Variance. CRC Press, 1994, vol. 1.
[5] S. J. Wolfe, "On the local behavior of characteristic functions," The Annals of Probability, pp. 862-866, 1973.
[6] C. D. Hardin Jr, "Skewed stable variables and processes," DTIC Document, Tech. Rep., 1984.
[7] I. Ibragimov and K. Chernin, "On the unimodality of geometric stable laws," Theory of Probability & Its Applications, vol. 4, no. 4, pp. 417-419, 1959.
[8] S. J. Wolfe, "On the unimodality of infinitely divisible distribution functions," Probability Theory and Related Fields, vol. 45, no. 4, pp. 329-335, 1978.
[9] I. Zaliapin, Y. Y. Kagan, and F. P. Schoenberg, "Approximating the distribution of Pareto sums," Pure and Applied Geophysics, vol. 162, no. 6-7, pp. 1187-1228, 2005.
[10] M. Hlusek, "On distribution of absolute values," 2011.
[11] I. Pinelis, "On the characteristic function of the positive part of a random variable," arXiv preprint arXiv:1309.5928, 2013.