Heavy-tailed distributions
Bar: explain why clickstream popularity is Zipf/Pareto, recognize a power law on a log-log plot, and know when the mean/variance simply don't exist.
What "heavy-tailed" means
A distribution whose tail decays slower than exponential — extreme values are rare but not negligibly rare. Consequence: a handful of observations dominate the totals, and classic intuitions built on the Normal (where everything clusters near the mean) break.
Common heavy tails in this domain:
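- Log-normal: $\ln X$ is Normal. Dwell times, session durations, time-between-visits.
- Pareto / power law: $P(X > x) \propto x^{-\alpha}$ for $x \ge x_{\min}$. The "80/20" rule.
- Zipf: a discrete power law over ranks, with frequency $\propto \text{rank}^{-s}$ and $s \approx 1$ in the classic case. Page popularity, search terms, event-type counts.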
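To get a feel for these three families, a minimal sampling sketch using only Python's standard library may help. The function names, the default parameters (`sigma=1.5`, `alpha=1.5`, `n_items=1000`), and the `top_share` summary are illustrative assumptions, not values taken from any real clickstream.

```python
import random
import statistics


def sample_lognormal(n, mu=0.0, sigma=1.5, rng=random):
    """n draws whose logarithm is Normal(mu, sigma)."""
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


def sample_pareto(n, alpha=1.5, x_min=1.0, rng=random):
    """n draws with P(X > x) = (x_min / x) ** alpha for x >= x_min."""
    # random.paretovariate(alpha) has scale 1, so rescale by x_min.
    return [x_min * rng.paretovariate(alpha) for _ in range(n)]


def sample_zipf_ranks(n, n_items=1000, s=1.0, rng=random):
    """n ranks in 1..n_items, drawn with probability proportional to rank ** -s."""
    ranks = range(1, n_items + 1)
    weights = [r ** -s for r in ranks]
    return rng.choices(ranks, weights=weights, k=n)


def top_share(values, fraction=0.01):
    """Share of the total contributed by the largest `fraction` of values."""
    ordered = sorted(values, reverse=True)
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so reruns are comparable
    for name, xs in [
        ("log-normal", sample_lognormal(50_000, rng=rng)),
        ("pareto", sample_pareto(50_000, rng=rng)),
    ]:
        print(
            f"{name}: mean={statistics.mean(xs):.2f} "
            f"median={statistics.median(xs):.2f} "
            f"top-1% share={top_share(xs):.1%}"
        )
```

In both continuous cases the mean should sit well above the median, and a small fraction of draws should account for a large share of the total, which is the "handful of observations dominate" effect described above.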
The dangerous part: moments may not exist
For Pareto with tail exponent $\alpha$ (that is, $P(X > x) \propto x^{-\alpha}$):
- $\alpha \le 2$: infinite variance. The sample mean is unstable and CLT-based confidence intervals lie.
- $\alpha \le 1$: infinite mean. The sample average keeps growing as you collect more data.
So "the average session has X pageviews" can be a meaningless statistic. Use medians and quantiles, and report the tail explicitly.
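The cutoffs at $\alpha = 1$ and $\alpha = 2$ follow from a direct integral. A short derivation, assuming the continuous Pareto density $f(x) = \alpha\, x_{\min}^{\alpha}\, x^{-\alpha-1}$ for $x \ge x_{\min}$ (the density that matches the tail $P(X > x) = (x_{\min}/x)^{\alpha}$):

```latex
\[
\mathbb{E}[X^k]
= \int_{x_{\min}}^{\infty} x^k \,\alpha\, x_{\min}^{\alpha}\, x^{-\alpha-1}\, dx
= \alpha\, x_{\min}^{\alpha} \int_{x_{\min}}^{\infty} x^{k-\alpha-1}\, dx
= \begin{cases}
\dfrac{\alpha\, x_{\min}^{k}}{\alpha - k}, & k < \alpha, \\[1ex]
\infty, & k \ge \alpha.
\end{cases}
\]
```

Setting $k = 1$ shows the mean is finite only for $\alpha > 1$; setting $k = 2$ shows the second moment, and therefore the variance, is finite only for $\alpha > 2$.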
Diagnostics (how to spot it)
- Log-log CCDF plot: a power law is a straight line on $\log P(X > x)$ vs $\log x$; the slope is $-\alpha$. A log-normal curves; an exponential drops off a cliff.
- Mean-vs-sample-size plot: if the running mean never settles, suspect infinite/near-infinite moments.
- Zipf check: plot frequency vs rank on log-log; slope near $-1$ ⇒ Zipfian.
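The three diagnostics above can be sketched numerically without a plotting library. The helper names below (`ccdf_points`, `loglog_slope`, `running_mean`, `zipf_slope`) are hypothetical, and the least-squares fit is only a rough stand-in for eyeballing the plot, not a rigorous tail-exponent estimator.

```python
import math
import random


def ccdf_points(samples):
    """Return (x, P(X >= x)) pairs for the sorted sample; every probability is > 0."""
    xs = sorted(samples)
    n = len(xs)
    return [(x, (n - i) / n) for i, x in enumerate(xs)]


def loglog_slope(points):
    """Ordinary least-squares slope of log(y) against log(x) for positive (x, y) pairs."""
    lx = [math.log(x) for x, _ in points]
    ly = [math.log(y) for _, y in points]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return sxy / sxx


def running_mean(samples):
    """Mean of the first 1, 2, ..., n samples; plot this against n."""
    out, total = [], 0.0
    for i, x in enumerate(samples, start=1):
        total += x
        out.append(total / i)
    return out


def zipf_slope(counts):
    """Log-log slope of frequency vs rank, given raw per-item counts."""
    ordered = sorted((c for c in counts if c > 0), reverse=True)
    return loglog_slope([(rank, c) for rank, c in enumerate(ordered, start=1)])


if __name__ == "__main__":
    rng = random.Random(42)
    pareto = [rng.paretovariate(1.5) for _ in range(20_000)]  # assumed alpha = 1.5
    print("CCDF log-log slope:", round(loglog_slope(ccdf_points(pareto)), 2))
    means = running_mean(pareto)
    print("running mean at n=100, 1000, 20000:", means[99], means[999], means[-1])
```

For a Pareto sample the fitted CCDF slope should land near $-\alpha$. On real data, inspect the plot itself as well: a log-normal can look straight over a limited range, so a single fitted slope is not proof of a power law.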
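Worked (Zipf/80-20 intuition): with frequency $\propto 1/\text{rank}$ over $N$ items, the top $k$ items carry a share $H_k / H_N$ of all events, where $H_n = \sum_{i=1}^{n} 1/i$ is the $n$-th harmonic number. Because $H_n$ grows only logarithmically, a small head of items carries most of the traffic.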
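To put numbers on the Zipf/80-20 intuition, here is a small sketch that assumes frequency proportional to $\text{rank}^{-s}$ over a hypothetical catalog of 1,000 pages. The catalog size and the function names are illustrative choices.

```python
def harmonic(n, s=1.0):
    """Generalized harmonic number: sum of i ** -s for i = 1..n."""
    return sum(i ** -s for i in range(1, n + 1))


def zipf_top_share(k, n_items, s=1.0):
    """Share of total frequency held by the k most popular of n_items."""
    return harmonic(k, s) / harmonic(n_items, s)


if __name__ == "__main__":
    n_items = 1000  # hypothetical catalog size
    for pct in (1, 10, 20):
        k = n_items * pct // 100
        print(f"top {pct}% of pages: {zipf_top_share(k, n_items):.1%} of views")
```

Under these assumptions the top 1%, 10%, and 20% of pages account for about 39%, 69%, and 79% of views, so the "80/20" label is a fair description of a pure $1/\text{rank}$ law at this catalog size.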
Why this matters here
- Sampling & evaluation: uniform sampling under-represents the tail; rare-but-important events (high-value pages, rare conversions) get lost. Class imbalance (1–3% conversions) is the same phenomenon (Evaluation theory, Difficulty knobs).
- Synthetic realism: to fake a believable clickstream you must inject Zipf page popularity and heavy-tailed dwell times, or your background is too tidy to be a fair test (Generating background traffic).
- Don't trust the mean in dashboards or features; heavy tails quietly wreck naïvely-Normal statistics.
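As a rough illustration of the synthetic-realism point, the sketch below draws pages from a Zipf popularity law and dwell times from a log-normal. Everything here is an assumption for illustration: the function name `synthetic_clickstream`, the `/page/N` URL scheme, the event dictionary layout, and the default parameters (with `dwell_mu=3.0` the median dwell is about 20 seconds).

```python
import random


def synthetic_clickstream(n_events, n_pages=500, zipf_s=1.0,
                          dwell_mu=3.0, dwell_sigma=1.2, seed=0):
    """Generate events with Zipf page popularity and log-normal dwell times."""
    rng = random.Random(seed)  # seeded so a test set can be regenerated exactly
    pages = [f"/page/{i}" for i in range(1, n_pages + 1)]
    # Popularity weight of the page at rank r is r ** -zipf_s.
    weights = [rank ** -zipf_s for rank in range(1, n_pages + 1)]
    chosen = rng.choices(pages, weights=weights, k=n_events)
    return [
        {"page": page, "dwell_seconds": rng.lognormvariate(dwell_mu, dwell_sigma)}
        for page in chosen
    ]


if __name__ == "__main__":
    events = synthetic_clickstream(10_000)
    dwell = sorted(e["dwell_seconds"] for e in events)
    print("median dwell:", round(dwell[len(dwell) // 2], 1))
    print("99th percentile dwell:", round(dwell[int(len(dwell) * 0.99)], 1))
```

Reporting the median and a high quantile, rather than the mean, follows the advice above about not trusting averages on heavy-tailed features.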
By-hand exercise (meets the bar)
- Pareto with $P(X > x) \propto x^{-\alpha}$. For which $\alpha$ is the mean finite but the variance infinite? (Answer: $1 < \alpha \le 2$.)
- Sketch how an exponential vs a power-law tail look on a log-log CCDF, and explain why only one is straight.
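For the sketching exercise, one way to check your answer is to write both tails in log-log coordinates. The lines below assume a Pareto tail with exponent $\alpha$ and scale $x_{\min}$, an exponential tail with rate $\lambda$ (a symbol introduced here for illustration), and the substitution $u = \log x$:

```latex
\[
\begin{aligned}
\text{Power law:}\quad & \log P(X > x) = \alpha \log x_{\min} - \alpha \log x = c - \alpha u,
  \qquad c = \alpha \log x_{\min}, \\
\text{Exponential:}\quad & \log P(X > x) = -\lambda x = -\lambda e^{u}.
\end{aligned}
\]
```

The power-law expression is linear in $u$ with slope $-\alpha$, so it plots as a straight line. The exponential expression decreases like $e^{u}$, so on the same axes it bends downward ever more steeply, which is the "cliff".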
Links
- Feeds: Point processes (heavy-tailed gaps), Generating background traffic, What is a clickstream
- Tension with: Evaluation theory (imbalance), Healthy paranoia (mean-based metrics that lie)