Concepts

This page builds up the differential-privacy vocabulary used throughout the tutorial. If you already know what ε, sensitivity, and "local vs central" mean, skim and move on.

The problem DP solves

You have a dataset of individual records. You want to release something useful about it — a count, an average, a histogram, the most common category — without revealing any individual's row.

Naïve approaches fail in subtle ways. Stripping identifiers still leaks through quasi-identifiers (Netflix Prize, AOL search logs). Releasing exact aggregates leaks through differencing attacks: publish the average salary of $N$ employees, then of the same group minus one person, and that person's salary is exactly $N \cdot \text{avg}_N - (N - 1) \cdot \text{avg}_{N-1}$.
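
The differencing attack takes only a few lines of arithmetic. The sketch below uses made-up salaries (hypothetical data, not from any real release) and assumes the attacker sees two exact averages that differ by one person.

```python
# Hypothetical salaries; the last entry belongs to the person being targeted.
salaries = [52_000, 61_000, 58_000, 75_000, 130_000]

def exact_mean(values):
    """An exact (non-private) average, as a naive release would publish it."""
    return sum(values) / len(values)

mean_all = exact_mean(salaries)            # published for N employees
mean_without = exact_mean(salaries[:-1])   # published for N - 1 employees

# Undo both averages to get the two totals, then subtract them.
n = len(salaries)
recovered = n * mean_all - (n - 1) * mean_without

print(recovered)  # matches the target's salary
```

Neither average looks sensitive on its own; the leak only appears when the two releases are combined.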

Differential privacy fixes this by adding calibrated random noise to the released value. Done correctly, an attacker who sees the released output cannot reliably tell which of two datasets differing in one record produced it.

The ε guarantee, formally

A randomized mechanism $M$ satisfies ε-differential privacy if for every pair of datasets $D, D'$ that differ in a single record, and every set of possible outputs $S$:

$$\Pr[M(D) \in S] \;\leq\; e^{\varepsilon} \cdot \Pr[M(D') \in S]$$

In words: changing one record can change the probability of any set of outputs by at most a factor of $e^{\varepsilon}$, in either direction, since the roles of $D$ and $D'$ can be swapped.

$\varepsilon = 0$: outputs are independent of the data, so privacy is perfect and utility is nil.
$\varepsilon$ small (≤ 1): strong privacy, more noise.
$\varepsilon$ large (≥ 5): weak privacy, less noise.
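
To see the bound hold numerically, the sketch below checks it for one concrete mechanism: a count released with Laplace noise of scale $1/\varepsilon$. The Laplace mechanism and the specific counts (100 and 101) are assumptions chosen for illustration. The check compares output densities, which is enough because a bound on the density ratio carries over to every set of outputs.

```python
import math

def laplace_pdf(x, mu, scale):
    """Density of a Laplace distribution centered at mu."""
    return math.exp(-abs(x - mu) / scale) / (2 * scale)

epsilon = 0.5
sensitivity = 1.0                  # a count moves by at most 1 between neighbors
scale = sensitivity / epsilon      # standard Laplace calibration

count_d, count_d_prime = 100, 101  # true counts on two neighboring datasets

# Scan outputs from 90.0 to 110.0 and record the largest density ratio.
outputs = [90 + i / 10 for i in range(201)]
worst_ratio = max(
    laplace_pdf(x, count_d, scale) / laplace_pdf(x, count_d_prime, scale)
    for x in outputs
)

# The worst ratio should not exceed e^epsilon (up to floating-point rounding).
print(worst_ratio, math.exp(epsilon))
```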

There is no "right" ε. Common production choices land between 0.1 and 5, depending on the application and how often the data is queried.

ε is a budget, not a knob. ε accounts for all releases on the same dataset over its lifetime. Ten ε=0.5 statistics on the same data spend ε=5 in total under basic composition. Plan the budget before you start querying, not after.
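
One way to enforce this in practice is a small accountant that refuses a query once the budget is gone. The class below is a hypothetical helper written for this page, not part of any library, and it assumes basic composition (spent ε values simply add).

```python
class PrivacyBudget:
    """Tracks cumulative epsilon under basic (sequential) composition."""

    def __init__(self, total_epsilon):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self):
        return self.total - self.spent

    def spend(self, epsilon):
        """Reserve epsilon for one release, or refuse if it would overspend."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:  # tolerance for float rounding
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


budget = PrivacyBudget(total_epsilon=5.0)
for _ in range(10):
    budget.spend(0.5)   # ten statistics at epsilon = 0.5 each

print(budget.spent, budget.remaining)
```

An eleventh call to `budget.spend(0.5)` raises `RuntimeError`, which is the point: the decision about what to release has to be made before the budget runs out.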

Sensitivity

To calibrate noise, the mechanism needs to know how much one record can change the output. That quantity is the sensitivity, written $\Delta f$:

$$\Delta f \;=\; \max_{D \,\sim\, D'} \, \big| f(D) - f(D') \big|$$

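The maximum is over all pairs of neighboring datasets, that is, datasets that differ in one record. The exact values depend on whether "differ" means replacing a record or adding or removing one; the sum and mean bounds below correspond to replacing one record's value, with $n$ fixed.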

query	sensitivity
count of rows	$1$
sum over a column with values in $[\text{lo}, \text{hi}]$	$\text{hi} - \text{lo}$
mean over $n$ values in $[\text{lo}, \text{hi}]$	$(\text{hi} - \text{lo})/n$
max	$\text{hi} - \text{lo}$ (worst case)
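
The table translates directly into noise scales. The sketch below assumes the Laplace mechanism with scale $\Delta f / \varepsilon$ and uses hypothetical helper names (`noisy_count`, `noisy_sum`, `noisy_mean`); it is not the anon extension's implementation. Values are clamped to $[\text{lo}, \text{hi}]$ first, since the sensitivity bound only holds if every record really lies in that range.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def noisy_count(rows, epsilon, rng):
    sensitivity = 1.0                      # one record changes a count by 1
    return len(rows) + laplace_noise(sensitivity / epsilon, rng)

def noisy_sum(values, lo, hi, epsilon, rng):
    sensitivity = hi - lo                  # one replaced value moves the sum this far
    total = sum(clamp(v, lo, hi) for v in values)
    return total + laplace_noise(sensitivity / epsilon, rng)

def noisy_mean(values, lo, hi, epsilon, rng):
    n = len(values)                        # assumed public and fixed
    sensitivity = (hi - lo) / n            # averaging divides the range by n
    mean = sum(clamp(v, lo, hi) for v in values) / n
    return mean + laplace_noise(sensitivity / epsilon, rng)

# Illustration only: Python's `random` is not a cryptographically secure source.
rng = random.Random(0)
ages = [23, 35, 41, 29, 62, 57, 38, 44]    # hypothetical data

print(noisy_count(ages, epsilon=1.0, rng=rng))
print(noisy_sum(ages, lo=18, hi=90, epsilon=1.0, rng=rng))
print(noisy_mean(ages, lo=18, hi=90, epsilon=1.0, rng=rng))
```

At the same ε, the sum gets noise of scale 72 while the mean gets noise of scale 9, purely because of the difference in sensitivity.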

Sensitivity is the whole game. Two mechanisms with the same ε but different sensitivities give wildly different accuracy. Averaging $n$ values shrinks sensitivity by a factor of $n$, which is why anon.dp_laplace_avg is far tighter than per-row LDP for the same ε.

Local vs central

The single most important distinction in this tutorial.

	local DP (LDP)	central DP
Who sees raw values?	nobody	a trusted curator
Where is noise added?	per record, before leaving the user	once, on the released aggregate
Sensitivity used	full per-record range	reduced by aggregation (e.g. $/n$ for a mean)
Accuracy at same ε	worse	better, often by a factor of $\sqrt{n}$ or $n$
When to use	untrusted aggregator, browser/device telemetry	trusted database, internal analytics

Both give meaningful privacy. They protect different things.
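
The accuracy gap is easy to measure in simulation. The sketch below estimates a mean both ways with Laplace noise, on synthetic values in $[0, 1]$. The function names and the choice of Laplace noise for the local model are assumptions made for illustration, not a description of any particular library.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def local_mean(values, lo, hi, epsilon, rng):
    """Local model: each record is noised before the aggregator sees it."""
    scale = (hi - lo) / epsilon                  # full per-record range
    reports = [v + laplace_noise(scale, rng) for v in values]
    return sum(reports) / len(reports)

def central_mean(values, lo, hi, epsilon, rng):
    """Central model: a trusted curator noises the exact mean once."""
    scale = (hi - lo) / (len(values) * epsilon)  # sensitivity reduced by n
    return sum(values) / len(values) + laplace_noise(scale, rng)

def rmse(estimator, values, epsilon, rng, trials=200):
    """Root-mean-square error of an estimator over repeated runs."""
    truth = sum(values) / len(values)
    errors = [(estimator(values, 0.0, 1.0, epsilon, rng) - truth) ** 2
              for _ in range(trials)]
    return math.sqrt(sum(errors) / trials)

rng = random.Random(42)
values = [rng.random() for _ in range(1000)]     # synthetic data in [0, 1]

local_rmse = rmse(local_mean, values, epsilon=1.0, rng=rng)
central_rmse = rmse(central_mean, values, epsilon=1.0, rng=rng)
print(local_rmse, central_rmse, local_rmse / central_rmse)
```

With $n = 1000$ the ratio of the two errors should come out in the neighborhood of $\sqrt{n} \approx 32$: averaging $n$ independent noisy reports cuts the local noise by $\sqrt{n}$, while the central scale is already divided by $n$.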

Two free lunches

Post-processing. Any function of the output of an ε-DP mechanism, deterministic or randomized, is still ε-DP as long as it does not touch the raw data again. You can reshape, plot, or feed the noisy output into a downstream model without spending more budget.
Parallel composition. If queries hit disjoint slices of the data, their ε's don't add — the worst-case ε is the maximum, not the sum. Useful for per-group statistics.
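
Both properties show up in a per-group count. The sketch below is a hypothetical example with invented plan names and counts. It assumes the list of groups is fixed and public, and that neighboring datasets differ by adding or removing one record, so each record affects exactly one group's count.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_histogram(records, groups, epsilon, rng):
    """One Laplace-noised count per group; `groups` is a fixed, public list."""
    counts = Counter(records)
    return {g: counts[g] + laplace_noise(1.0 / epsilon, rng) for g in groups}

def to_display(noisy):
    """Post-processing: clamp at zero and round, using only the noisy values."""
    return {g: max(0, round(v)) for g, v in noisy.items()}

rng = random.Random(7)
groups = ["free", "pro", "enterprise"]
records = ["free"] * 120 + ["pro"] * 45 + ["enterprise"] * 5   # hypothetical data

epsilon = 1.0
noisy = noisy_histogram(records, groups, epsilon, rng)
released = to_display(noisy)

# Parallel composition: the groups are disjoint, so the release costs
# max(epsilon, epsilon, epsilon) = epsilon, not 3 * epsilon.
total_epsilon = epsilon
print(released, total_epsilon)
```

The rounding and clamping in `to_display` cost nothing extra because they never read `records`. If the same record could appear in more than one group, the slices would no longer be disjoint and the ε values would add again.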

Where to next

Mechanisms overview: GRRM, Laplace, Gaussian, the one-hot variants, and the central averaging helper.
Quick start: run your first private query against the anon extension.
Or jump straight to the interactive demos and play with ε on real data.