Concepts
This page builds up the differential-privacy vocabulary used throughout the tutorial. If you already know what ε, sensitivity, and "local vs central" mean, skim and move on.
The problem DP solves
You have a dataset of individual records. You want to release something useful about it — a count, an average, a histogram, the most common category — without revealing any individual's row.
Naïve approaches fail in subtle ways. Stripping identifiers still leaks through quasi-identifiers (Netflix Prize, AOL search logs). Releasing exact aggregates leaks through differencing attacks: publish the average salary of $n$ employees, then of the same group minus one person ($n - 1$ employees), and the two averages together reveal that person's salary.
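The differencing attack takes only a few lines of arithmetic. The sketch below uses made-up salaries and assumes the attacker knows both published averages and both head counts; nothing in it comes from a real dataset.

```python
# Hypothetical salaries; the last entry belongs to the person who leaves.
salaries = [52_000, 61_000, 58_000, 75_000, 94_000]

def average(values):
    return sum(values) / len(values)

n = len(salaries)
avg_all = average(salaries)           # published while the target is included
avg_without = average(salaries[:-1])  # published after the target is removed

# n * avg_all is the full total; (n - 1) * avg_without is the total
# without the target. Their difference is the target's exact salary.
recovered = n * avg_all - (n - 1) * avg_without
print(round(recovered))
```

Neither release looks sensitive on its own. The leak comes from combining them, which is why the guarantee has to hold for every pair of datasets that differ in one record.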
Differential privacy fixes this by adding calibrated random noise to the released value. Done correctly, an attacker comparing the released output for two datasets that differ in one record cannot reliably tell which dataset they were looking at.
The ε guarantee, formally
A randomized mechanism $M$ satisfies ε-differential privacy if for every pair of datasets $D, D'$ that differ in a single record, and every set of possible outputs $S$:
$$\Pr[M(D) \in S] \le e^{\varepsilon} \cdot \Pr[M(D') \in S]$$
In words: changing one record can change the probability of any output by at most a factor of $e^{\varepsilon}$.
- $\varepsilon = 0$: outputs are independent of the data, so privacy is perfect and utility is nil.
- $\varepsilon$ small (≤ 1): strong privacy, more noise.
- $\varepsilon$ large (≥ 5): weak privacy, less noise.
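To make the definition concrete, the sketch below checks the guarantee for binary randomized response, a well-known ε-DP mechanism that reports the true bit with probability $e^{\varepsilon} / (e^{\varepsilon} + 1)$ and the flipped bit otherwise. It is used here only as an illustration and is not necessarily one of the mechanisms this tutorial ships; the function names are hypothetical.

```python
import math

def rr_probabilities(true_bit: int, epsilon: float) -> dict:
    """Output distribution of binary randomized response for one record."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return {true_bit: p_truth, 1 - true_bit: 1.0 - p_truth}

def worst_case_ratio(epsilon: float) -> float:
    """Largest probability ratio over both outputs and both input orders."""
    dist0 = rr_probabilities(0, epsilon)
    dist1 = rr_probabilities(1, epsilon)
    return max(
        max(dist0[out] / dist1[out], dist1[out] / dist0[out])
        for out in (0, 1)
    )

for eps in (0.1, 1.0, 5.0):
    ratio = worst_case_ratio(eps)
    # The ratio never exceeds e^eps (up to floating-point rounding).
    assert ratio <= math.exp(eps) * (1 + 1e-9)
    print(f"eps={eps}: worst-case ratio {ratio:.4f}, bound {math.exp(eps):.4f}")
```

For this mechanism the bound is tight: the worst-case ratio equals $e^{\varepsilon}$, so a larger ε directly means the two inputs are easier to tell apart.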
There is no "right" ε. Common production choices land between 0.1 and 5, depending on the application and how often the data is queried.
ε is a budget, not a knob. ε accounts for all releases on the same dataset over its lifetime. Ten ε=0.5 statistics on the same data spend ε=5 in total under basic composition. Plan the budget before you start querying, not after.
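One way to enforce this discipline is a small accountant that refuses a query once the budget is gone. The class below is a hypothetical sketch of basic (sequential) composition only; it is not part of the `anon` extension, and tighter composition theorems are out of scope here.

```python
class PrivacyBudget:
    """Tracks cumulative epsilon under basic (sequential) composition."""

    def __init__(self, total_epsilon: float):
        self.total = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon: float) -> None:
        """Record one release, or refuse it if it would overspend."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

    @property
    def remaining(self) -> float:
        return max(self.total - self.spent, 0.0)

budget = PrivacyBudget(total_epsilon=5.0)
for _ in range(10):
    budget.spend(0.5)  # ten statistics at epsilon = 0.5 each

print(budget.spent, budget.remaining)
```

After the ten releases the budget is fully spent, so an eleventh call to `spend` raises instead of silently weakening the guarantee.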
Sensitivity
To calibrate noise, the mechanism needs to know how much one record can change the output. That quantity is the sensitivity, written $\Delta f$ for a query $f$:
$$\Delta f = \max_{D, D'} \lvert f(D) - f(D') \rvert$$
The maximum is over all pairs of neighboring datasets — datasets that differ in one record.
| query | sensitivity |
|---|---|
| count of rows | $1$ |
| sum over a column with values in $[a, b]$ | $b - a$ |
| mean over $n$ values in $[a, b]$ | $(b - a)/n$ |
| max | $b - a$ (worst case) |
Sensitivity is the whole game. Two mechanisms with the same ε but different sensitivities give wildly different accuracy. Averaging $n$ values shrinks sensitivity by a factor of $n$, which is why `anon.dp_laplace_avg` is far tighter than per-row LDP for the same ε.
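The link between sensitivity and noise is easiest to see in the Laplace mechanism, which adds noise with scale $\Delta f / \varepsilon$. The sketch below is an illustration in plain Python, not the implementation behind `anon.dp_laplace_avg`. It assumes values clamped to a known range `[lo, hi]`, neighboring datasets that differ by replacing one record, and hypothetical helper names. Floating-point Laplace sampling like this is not hardened against known precision attacks, so treat it as a teaching aid.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value with noise of scale sensitivity / epsilon."""
    return true_value + laplace_noise(sensitivity / epsilon, rng)

def sensitivity_sum(lo: float, hi: float) -> float:
    # Replacing one record moves the sum by at most the width of the range.
    return hi - lo

def sensitivity_mean(lo: float, hi: float, n: int) -> float:
    # The same change is divided by n once the sum becomes a mean.
    return (hi - lo) / n

rng = random.Random(42)
lo, hi, n, epsilon = 0.0, 100.0, 1000, 1.0
true_sum, true_mean = 48_250.0, 48.25  # hypothetical exact answers

noisy_sum = laplace_mechanism(true_sum, sensitivity_sum(lo, hi), epsilon, rng)
noisy_mean = laplace_mechanism(true_mean, sensitivity_mean(lo, hi, n), epsilon, rng)
print(noisy_sum, noisy_mean)
```

With these numbers the sum gets noise of scale 100 while the mean gets noise of scale 0.1, even though both spend the same ε.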
Local vs central
The single most important distinction in this tutorial.
| | local DP (LDP) | central DP |
|---|---|---|
| Who sees raw values? | nobody | a trusted curator |
| Where is noise added? | per record, before leaving the user | once, on the released aggregate |
| Sensitivity used | full per-record range | reduced by aggregation (e.g. $(b - a)/n$ for a mean) |
| Accuracy at same ε | worse | better, often by a factor of about $\sqrt{n}$ for a mean over $n$ records |
| When to use | untrusted aggregator, browser/device telemetry | trusted database, internal analytics |
Both give meaningful privacy. They protect different things.
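A small simulation makes the accuracy gap tangible. The sketch below estimates a mean two ways with Laplace noise: once centrally (one draw, calibrated to the mean's sensitivity) and once locally (one draw per record, calibrated to the full range, then averaged). The data is synthetic, the function names are hypothetical, and the tutorial's own LDP mechanisms may behave differently.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def central_mean(values, lo, hi, epsilon, rng):
    # The curator sees raw values and adds noise once to the mean.
    sensitivity = (hi - lo) / len(values)
    return sum(values) / len(values) + laplace_noise(sensitivity / epsilon, rng)

def local_mean(values, lo, hi, epsilon, rng):
    # Every record is noised on its own before anyone aggregates it.
    scale = (hi - lo) / epsilon
    noisy = [v + laplace_noise(scale, rng) for v in values]
    return sum(noisy) / len(noisy)

def rmse(estimator, values, lo, hi, epsilon, trials, rng):
    truth = sum(values) / len(values)
    errors = [(estimator(values, lo, hi, epsilon, rng) - truth) ** 2
              for _ in range(trials)]
    return math.sqrt(sum(errors) / trials)

rng = random.Random(0)
values = [rng.random() for _ in range(1000)]  # synthetic data in [0, 1]

central_rmse = rmse(central_mean, values, 0.0, 1.0, 1.0, 200, rng)
local_rmse = rmse(local_mean, values, 0.0, 1.0, 1.0, 200, rng)
print(f"central RMSE: {central_rmse:.5f}")
print(f"local RMSE:   {local_rmse:.5f}")
```

At the same ε the local estimate is much noisier, because independent per-record noise only averages down at a rate of $1/\sqrt{n}$ while the central noise scale shrinks as $1/n$.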
Two free lunches
- Post-processing. Any deterministic function of the output of an ε-DP mechanism is still ε-DP. You can reshape, plot, or feed the noisy output into a downstream model without spending more budget.
- Parallel composition. If queries hit disjoint slices of the data, their ε's don't add: the worst-case ε is the maximum, not the sum. Useful for per-group statistics.
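Both properties show up in a private histogram. The sketch below assumes a fixed, public list of categories and neighboring datasets that differ by adding or removing one record, so each record lands in exactly one bin and each bin is a count with sensitivity 1. All names and data are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def private_histogram(records, domain, epsilon, rng):
    """Noisy count per category in the public domain.

    The bins are disjoint slices of the data, so by parallel composition
    the whole histogram costs epsilon, not epsilon * len(domain).
    """
    counts = {category: 0 for category in domain}
    for record in records:
        if record in counts:  # records outside the public domain are ignored
            counts[record] += 1
    return {c: k + laplace_noise(1.0 / epsilon, rng) for c, k in counts.items()}

def clean(noisy):
    """Post-processing: round and clamp at zero. Costs no extra budget."""
    return {c: max(0, round(v)) for c, v in noisy.items()}

rng = random.Random(7)
domain = ["red", "green", "blue"]
records = ["red"] * 40 + ["green"] * 25 + ["blue"] * 3

noisy = private_histogram(records, domain, epsilon=1.0, rng=rng)
released = clean(noisy)
print(released)
```

The rounding and clamping step never touches the raw records, only the noisy counts, which is what makes it free under the post-processing property.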
Where to next
- Mechanisms overview: GRRM, Laplace, Gaussian, the one-hot variants, and the central averaging helper.
- Quick start: run your first private query against the `anon` extension.
- Or jump straight to the interactive demos and play with ε on real data.