Concepts

This page builds up the differential-privacy vocabulary used throughout the tutorial. If you already know what ε, sensitivity, and "local vs central" mean, skim and move on.

The problem DP solves

You have a dataset of individual records. You want to release something useful about it — a count, an average, a histogram, the most common category — without revealing any individual's row.

Naïve approaches fail in subtle ways. Stripping identifiers still leaks through quasi-identifiers (Netflix Prize, AOL search logs). Releasing exact aggregates leaks through differencing attacks: publish the average salary of N employees, then of N − 1 employees, and simple arithmetic (N times the first average minus N − 1 times the second) recovers the missing person's salary.
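The differencing attack can be reproduced in a few lines. This is a minimal sketch with made-up salaries; the variable names are illustrative and nothing here comes from the anon extension.

```python
# Hypothetical salaries; the last entry belongs to the person being targeted.
salaries = [52_000, 61_000, 58_500, 75_000, 67_250]


def average(values):
    return sum(values) / len(values)


avg_all = average(salaries)           # exact average published for N employees
avg_without = average(salaries[:-1])  # exact average published for N - 1 employees

n = len(salaries)
# Total of all N salaries minus total of the other N - 1 leaves one salary.
recovered = n * avg_all - (n - 1) * avg_without
print(round(recovered, 2))  # prints 67250.0
```

Neither average looks sensitive on its own; the leak only appears when the two exact releases are combined.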

Differential privacy fixes this by adding calibrated random noise to the released value. Done correctly, an attacker comparing the released output for two datasets that differ in one record cannot reliably tell which dataset they were looking at.

The ε guarantee, formally

A randomized mechanism M satisfies ε-differential privacy if, for every pair of datasets D, D' that differ in a single record and every set of possible outputs S:

$$\Pr[M(D) \in S] \;\leq\; e^{\varepsilon} \cdot \Pr[M(D') \in S]$$

In words: changing one record can change the probability of any output by at most a factor of e^ε.

  • ε = 0: outputs are independent of the data, so privacy is perfect and utility is zero.
  • ε small (≤ 1): strong privacy, more noise.
  • ε large (≥ 5): weak privacy, less noise.
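To see the inequality hold for a concrete mechanism, the sketch below uses binary randomized response on a single record: report the true bit with probability e^ε / (1 + e^ε), otherwise flip it. This mechanism is chosen only for illustration and is not claimed to match any function in the anon extension. The two neighboring datasets are "the record is 0" and "the record is 1".

```python
import math


def rr_output_probs(true_bit: int, epsilon: float) -> dict:
    """Output distribution of binary randomized response for one record."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return {true_bit: p_truth, 1 - true_bit: 1.0 - p_truth}


def worst_case_ratio(epsilon: float) -> float:
    """Largest ratio Pr[M(D) = o] / Pr[M(D') = o] over both outputs and both orders."""
    p0 = rr_output_probs(0, epsilon)
    p1 = rr_output_probs(1, epsilon)
    return max(max(p0[o] / p1[o], p1[o] / p0[o]) for o in (0, 1))


for eps in (0.0, 0.1, 1.0, 5.0):
    # The worst-case ratio should match the bound e^eps.
    print(f"eps={eps}: ratio={worst_case_ratio(eps):.4f}, bound={math.exp(eps):.4f}")
```

For this mechanism the bound is tight: the worst-case ratio equals e^ε, and at ε = 0 both inputs produce the same output distribution.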

There is no "right" ε. Common production choices land between 0.1 and 5, depending on the application and how often the data is queried.

ε is a budget, not a knob. ε accounts for all releases on the same dataset over its lifetime. Ten ε=0.5 statistics on the same data spend ε=5 in total under basic composition. Plan the budget before you start querying, not after.
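One way to enforce this is a small accountant that refuses a query once the total would be exceeded. The class below is a hypothetical helper using basic (additive) composition only; it is not part of the anon extension.

```python
class PrivacyBudget:
    """Tracks epsilon spent on one dataset under basic composition."""

    def __init__(self, total_epsilon: float):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self) -> float:
        return max(0.0, self.total - self.spent)

    def spend(self, epsilon: float) -> None:
        """Charge one release; raise instead of silently overspending."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


budget = PrivacyBudget(total_epsilon=5.0)
for _ in range(10):
    budget.spend(0.5)  # ten statistics at epsilon = 0.5 each

print(budget.spent, budget.remaining)
try:
    budget.spend(0.5)  # an eleventh query would exceed the plan
except RuntimeError as err:
    print("refused:", err)
```

Calling spend before running each query makes the lifetime total explicit instead of something reconstructed after the fact.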

Sensitivity

To calibrate noise, the mechanism needs to know how much one record can change the output. That quantity is the sensitivity, written Δf:

$$\Delta f \;=\; \max_{D \sim D'} \, \big| f(D) - f(D') \big|$$

The maximum is over all pairs of neighboring datasets — datasets that differ in one record.

| Query | Sensitivity |
|---|---|
| count of rows | 1 |
| sum over a column with values in [lo, hi] | hi − lo |
| mean over n values in [lo, hi] | (hi − lo)/n |
| max | hi − lo (worst case) |

Sensitivity is the whole game. Two mechanisms with the same ε but different sensitivities give wildly different accuracy. Averaging n values shrinks the sensitivity by a factor of n, which is why anon.dp_laplace_avg is far tighter than per-row LDP for the same ε.
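The effect of sensitivity on noise is easiest to see with the Laplace mechanism, which adds noise of scale Δf / ε. The sketch below is a plain-Python illustration, not the implementation behind anon.dp_laplace_avg; the helper names are hypothetical, the data is synthetic, and the floating-point sampler is not hardened for production use.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) noise, drawn as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_release(true_value: float, sensitivity: float,
                    epsilon: float, rng: random.Random) -> float:
    """Release true_value plus Laplace noise of scale sensitivity / epsilon."""
    return true_value + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
lo, hi, epsilon = 0.0, 100.0, 1.0
# Synthetic column; values are already inside [lo, hi], so no clamping is needed.
values = [rng.uniform(lo, hi) for _ in range(10_000)]
n = len(values)

# Sensitivities as listed in the table above (n is treated as public).
sensitivities = {"count": 1.0, "sum": hi - lo, "mean": (hi - lo) / n}
true_answers = {"count": float(n), "sum": sum(values), "mean": sum(values) / n}

for name, delta_f in sensitivities.items():
    released = laplace_release(true_answers[name], delta_f, epsilon, rng)
    print(f"{name}: noise scale {delta_f / epsilon:g}, "
          f"true {true_answers[name]:.3f}, released {released:.3f}")
```

With the same ε, the mean gets noise of scale 0.01 while the sum gets noise of scale 100: the only thing that changed is the sensitivity.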

Local vs central

The single most important distinction in this tutorial.

| | Local DP (LDP) | Central DP |
|---|---|---|
| Who sees raw values? | nobody | a trusted curator |
| Where is noise added? | per record, before leaving the user | once, on the released aggregate |
| Sensitivity used | full per-record range | reduced by aggregation (e.g. /n for a mean) |
| Accuracy at same ε | worse | better, often by a factor of √n or n |
| When to use | untrusted aggregator, browser/device telemetry | trusted database, internal analytics |

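Both give meaningful privacy, but they protect different things. LDP protects each value even from the party collecting it; central DP protects individuals only from whoever sees the released output.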
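The accuracy gap can be checked with a small simulation. The sketch below estimates a mean both ways with Laplace noise at the same ε; the function names are hypothetical, the data is synthetic, and the local variant is one simple per-record scheme rather than the extension's LDP mechanisms.

```python
import math
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) noise, drawn as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def local_mean(values, lo, hi, epsilon, rng):
    """Each user perturbs their own value; the aggregator averages the reports."""
    scale = (hi - lo) / epsilon  # full per-record range
    reports = [v + laplace_noise(scale, rng) for v in values]
    return sum(reports) / len(reports)


def central_mean(values, lo, hi, epsilon, rng):
    """A trusted curator averages the raw values, then adds noise once."""
    n = len(values)
    return sum(values) / n + laplace_noise((hi - lo) / (n * epsilon), rng)


def rmse(estimator, values, lo, hi, epsilon, trials, rng):
    """Root-mean-square error of an estimator against the true mean."""
    truth = sum(values) / len(values)
    squared = [(estimator(values, lo, hi, epsilon, rng) - truth) ** 2
               for _ in range(trials)]
    return math.sqrt(sum(squared) / trials)


rng = random.Random(42)
values = [rng.random() for _ in range(2_000)]  # synthetic data in [0, 1]

local_rmse = rmse(local_mean, values, 0.0, 1.0, 1.0, 50, rng)
central_rmse = rmse(central_mean, values, 0.0, 1.0, 1.0, 50, rng)
print(f"local RMSE   ~ {local_rmse:.5f}")
print(f"central RMSE ~ {central_rmse:.5f}")
```

For this setup the expected gap is about a factor of √n: averaging n independent noisy reports shrinks the local noise by √n, while the central noise scale shrinks by n directly.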

Two free lunches

  • Post-processing. Any function of the output of an ε-DP mechanism, deterministic or randomized, is still ε-DP as long as it does not look at the raw data again. You can reshape, plot, or feed the noisy output into a downstream model without spending more budget.

  • Parallel composition. If queries hit disjoint slices of the data, their ε's don't add — the worst-case ε is the maximum, not the sum. Useful for per-group statistics.
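Both properties show up in a noisy histogram. In the sketch below each record falls into exactly one bin, so every bin count is released with the full ε (parallel composition), and the cleanup step only touches the noisy output (post-processing). It assumes neighboring datasets differ by adding or removing one record and that the category list is fixed in advance rather than derived from the data; the function names and records are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) noise, drawn as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_histogram(records, categories, epsilon, rng):
    """Per-category counts, each with Laplace noise of scale 1 / epsilon."""
    counts = {c: 0 for c in categories}
    for r in records:
        if r in counts:
            counts[r] += 1
    # Bins are disjoint, so the total cost is epsilon, not len(categories) * epsilon.
    return {c: counts[c] + laplace_noise(1.0 / epsilon, rng) for c in categories}


def clean(noisy):
    """Post-processing: round and clip at zero, using only the noisy values."""
    return {c: max(0, round(v)) for c, v in noisy.items()}


rng = random.Random(7)
categories = ["red", "green", "blue"]  # public, fixed list
records = ["red"] * 120 + ["green"] * 45 + ["blue"] * 3

released = clean(noisy_histogram(records, categories, epsilon=1.0, rng=rng))
print(released)
```

Rounding and clipping make the output easier to read without weakening the guarantee, because they never consult the true counts.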

Where to next

  • Mechanisms overview: GRRM, Laplace, Gaussian, the one-hot variants, and the central averaging helper.
  • Quick start: run your first private query against the anon extension.
  • Or jump straight to the interactive demos and play with ε on real data.