Heavy-tailed distributions

evergreen #foundations #distributions

Up: 10_Foundations-MOC

Bar: explain why clickstream popularity is Zipf/Pareto, recognize a power law on a log-log plot, and know when the mean/variance simply don't exist.

What "heavy-tailed" means

A distribution whose tail decays slower than exponential — extreme values are rare but not negligibly rare. Consequence: a handful of observations dominate the totals, and classic intuitions built on the Normal (where everything clusters near the mean) break.

Common heavy tails in this domain:

The dangerous part: moments may not exist

For Pareto with exponent $\alpha$:

$$\mathbb{E}[X]=\frac{\alpha\,x_m}{\alpha-1}\ (\alpha>1);\qquad \text{Var}[X]<\infty\ \text{only if } \alpha>2.$$

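So "the average session has X pageviews" can be a meaningless statistic: for $\alpha\le1$ there is no finite mean to estimate, and for $1<\alpha\le2$ the mean exists but the infinite variance makes the sample mean converge slowly and jump whenever a new extreme value arrives. Use medians and quantiles, and report the tail explicitly.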
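To see the difference between the mean and the median on simulated data, here is a minimal sketch. It assumes NumPy is available; the helper names `pareto_sample` and `pareto_median`, the seed, the sample size, and the three exponents are illustrative choices, not part of the note.

```python
import numpy as np

def pareto_sample(alpha, x_m, size, rng):
    """Draw Pareto(alpha, x_m) samples by inverse CDF: X = x_m * U**(-1/alpha)."""
    u = 1.0 - rng.random(size)  # uniform on (0, 1], so u is never zero
    return x_m * u ** (-1.0 / alpha)

def pareto_median(alpha, x_m):
    """Closed-form median of a Pareto: x_m * 2**(1/alpha)."""
    return x_m * 2.0 ** (1.0 / alpha)

rng = np.random.default_rng(0)  # fixed seed only for repeatability
for alpha in (0.9, 1.5, 3.0):   # no mean / mean but no variance / both finite
    x = pareto_sample(alpha, 1.0, 200_000, rng)
    print(
        f"alpha={alpha}: sample mean={np.mean(x):.3f}, "
        f"sample median={np.median(x):.3f}, "
        f"true median={pareto_median(alpha, 1.0):.3f}"
    )
```

Rerunning with different seeds should leave the sample median close to its closed-form value in every case, while the sample mean for the smallest exponent tends to change substantially from run to run.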

Diagnostics (how to spot it)

Worked (Zipf/80-20 intuition): with frequency $\propto 1/\text{rank}$, the top page gets twice the 2nd, three times the 3rd… A few pages soak up most traffic; the long tail is enormous but individually tiny. For Pareto $\alpha=1.16$, the top 20% of items hold ~80% of the mass (that's where "80/20" comes from).
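The 80/20 figure can be checked directly. The sketch below uses the Lorenz-curve result for a Pareto distribution, under which the top fraction $p$ holds a share $p^{1-1/\alpha}$ of the total (valid for $\alpha>1$), plus a brute-force Zipf share over a finite catalogue. The function names and the catalogue size of 10,000 pages are hypothetical, chosen only for illustration.

```python
import math

def pareto_top_share(p, alpha):
    """Share of total mass held by the top fraction p of a Pareto(alpha).

    Uses the Pareto Lorenz curve: share = p ** (1 - 1/alpha), alpha > 1.
    """
    if alpha <= 1:
        raise ValueError("total mass is infinite for alpha <= 1")
    return p ** (1.0 - 1.0 / alpha)

def zipf_top_share(k, n):
    """Share of traffic going to the top k of n pages when frequency is 1/rank."""
    weights = [1.0 / r for r in range(1, n + 1)]
    return sum(weights[:k]) / sum(weights)

# Exponent that gives exactly 80/20: alpha = log(5) / log(4), about 1.16
alpha_8020 = math.log(5) / math.log(4)

print(f"alpha for exact 80/20: {alpha_8020:.3f}")
print(f"top 20% share at alpha=1.16: {pareto_top_share(0.2, 1.16):.3f}")
print(f"top 1% of 10,000 Zipf pages: {zipf_top_share(100, 10_000):.3f}")
```

Note how sensitive the share is to the exponent: the further $\alpha$ rises above 1, the less concentrated the mass, so "80/20" is a property of one particular exponent rather than of power laws in general.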

Why this matters here

By-hand exercise (meets the bar)

  1. Pareto with $x_m=1$. For which $\alpha$ is the mean finite but the variance infinite? (Answer: $1<\alpha\le2$.)
  2. Sketch how an exponential vs a power-law tail look on a log-log CCDF, and explain why only one is straight.
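After sketching exercise 2 by hand, a numerical check along these lines may help confirm the picture. It evaluates the exact CCDFs on a log-spaced grid and computes the local slope of log CCDF against log x. The parameters (`alpha`, `x_m`, `lam`) and the grid are illustrative assumptions, and NumPy is assumed to be installed.

```python
import numpy as np

def loglog_slopes(ccdf, xs):
    """Finite-difference slope of log CCDF against log x between grid points."""
    log_x = np.log(xs)
    log_c = np.log(ccdf(xs))
    return np.diff(log_c) / np.diff(log_x)

alpha, x_m, lam = 1.5, 1.0, 1.0  # illustrative parameters, not from the note
xs = np.logspace(0, 2, 9)        # x from 1 to 100, evenly spaced in log x

def pareto_ccdf(x):
    """P(X > x) = (x_m / x) ** alpha for x >= x_m."""
    return (x_m / x) ** alpha

def expon_ccdf(x):
    """P(X > x) = exp(-lam * x) for x >= 0."""
    return np.exp(-lam * x)

pareto_slopes = loglog_slopes(pareto_ccdf, xs)
expon_slopes = loglog_slopes(expon_ccdf, xs)

print("Pareto slopes:     ", np.round(pareto_slopes, 3))
print("Exponential slopes:", np.round(expon_slopes, 3))
```

The Pareto slopes are constant at $-\alpha$ because $\log P(X>x)=\alpha\log x_m-\alpha\log x$ is linear in $\log x$. The exponential slopes keep getting steeper because $\log P(X>x)=-\lambda x=-\lambda e^{\log x}$ bends downward on log-log axes.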