One-hot variants
anon.ldp_laplace_onehot and anon.ldp_gaussian_onehot release a
categorical value as a noisy histogram bin vector rather than a single
perturbed category. Each row outputs a float8[d] array — a noisy
one-hot encoding of the true value — and summing those arrays across
rows gives an unbiased histogram of the underlying distribution. Unlike
GRRM, no debiasing post-processing is needed:
the column sum is already the right quantity.
These variants extend the Adding Noise category of the masking-functions catalog with a vector-valued formulation.
When to use it
The one-hot variants are the right choice when the goal is a histogram or distribution over a small public domain of size $d$. Each row contributes its noisy vector independently and the aggregator simply sums, so they suit streaming and federated pipelines.
Per-row scalar GRRM followed by frequency estimation gives the same kind of result with a different bias-variance trade-off, and is preferable when $d$ is large or storing float8[d] per row is too heavy.
Calling the function
The output type is float8[d], not the original column type, so one-hot
variants do not fit a per-column SECURITY LABEL. Use them in a
masking view
or as ad-hoc queries.
-- Each row produces a noisy one-hot vector of length d:
SELECT user_id,
anon.ldp_laplace_onehot(rating, 1.0, 5) AS noisy_vec
FROM responses
LIMIT 5;
To get the noisy histogram, sum across rows position by position:
SELECT idx, SUM(value) AS noisy_count
FROM responses,
unnest(anon.ldp_laplace_onehot(rating, 1.0, 5))
WITH ORDINALITY AS u(value, idx)
GROUP BY idx
ORDER BY idx;
The Gaussian variant takes a δ in addition to ε:
SELECT idx, SUM(value) AS noisy_count
FROM responses,
unnest(anon.ldp_gaussian_onehot(rating, 1.0, 5, 1e-5))
WITH ORDINALITY AS u(value, idx)
GROUP BY idx
ORDER BY idx;
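The masking-view route mentioned at the start of this section could look like the sketch below. The view name responses_onehot is hypothetical; the table and columns are the ones used in the queries above. It uses a materialized view, on the assumption that the noise should be drawn once and then reused, so repeated reads return the same vectors.

```sql
-- Hypothetical view name; responses, user_id and rating follow the examples above.
-- The noise is drawn once, when the view is created.
CREATE MATERIALIZED VIEW responses_onehot AS
SELECT user_id,
       anon.ldp_laplace_onehot(rating, 1.0, 5) AS noisy_vec
FROM responses;

-- Histogram over the stored vectors: re-running this returns the same counts.
SELECT u.idx, SUM(u.value) AS noisy_count
FROM responses_onehot,
     unnest(noisy_vec) WITH ORDINALITY AS u(value, idx)
GROUP BY u.idx
ORDER BY u.idx;
```

Each REFRESH MATERIALIZED VIEW draws new noise, so a refresh should be counted as another release against the privacy budget.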
Sensitivity and noise calibration
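| variant | sensitivity | per-bin noise |
|---|---|---|
| ldp_laplace_onehot(value, ε, d) | $\Delta_1 = 2$ | $\mathrm{Lap}(2/\varepsilon)$ |
| ldp_gaussian_onehot(value, ε, d, δ) | $\Delta_2 = \sqrt{2}$ | $\mathcal{N}(0, \sigma^2)$, with $\sigma$ calibrated from $\Delta_2$, ε and δ |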
The sensitivities are constants, independent of $d$. Two coordinates change between the one-hot encodings of two different values (one bit goes from 0 to 1, another from 1 to 0), giving $\Delta_1 = 2$ and $\Delta_2 = \sqrt{2}$.
Choosing parameters
- d. Public domain size; same constraint as GRRM.
- epsilon. Typical 0.5 to 2 for histogram release. Smaller ε produces wider per-bin error bars, which matter most for small $n$ (few rows) and small bins.
- delta (Gaussian). Same advice as Gaussian: keep it cryptographically small. The example on this page uses 1e-5.
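To get a feel for how ε and the row count interact, the query below tabulates the per-bin standard error of the summed histogram for the Laplace variant. It is a back-of-the-envelope sketch that assumes per-bin Laplace noise of scale 2/ε (the standard calibration for an L1 sensitivity of 2); the row counts and ε values are arbitrary illustrations, and it runs on plain PostgreSQL without the extension.

```sql
-- Assumption: each row adds Laplace noise of scale b = 2/eps to every bin,
-- i.e. variance 2*b^2 = 8/eps^2 per row, so n rows give std error sqrt(8*n)/eps.
SELECT n,
       eps,
       round(sqrt(8.0 * n) / eps, 1)           AS std_error,
       round(100 * sqrt(8.0 * n) / eps / n, 2) AS pct_of_n
FROM (VALUES (1000), (100000)) AS sizes(n)
CROSS JOIN (VALUES (0.5), (1.0), (2.0)) AS budgets(eps)
ORDER BY n, eps;
```

The pct_of_n column expresses the error relative to the total number of rows; a bin holding only a small share of the rows sees a proportionally larger relative error.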
Security & limitations
- Averaging attack under dynamic masking. Each call produces fresh noise, so a role that can repeat a query can average the noise away. The same caveat applies to anon.noise(), documented in Adding Noise. Apply through static masking or anonymous dumps. Under dynamic masking, budget ε across every query a role can issue.
- d must be public. Inferring the domain from the data leaks information about which categories are present.
- No debiasing on the sum. The histogram from summed one-hot output is already unbiased; do not re-apply frequency estimation on top of it.
- Negative noisy counts. Per-bin counts can come out negative for rare bins. Clamping to $[0, \infty)$ is post-processing (still ε-DP / $(\varepsilon, \delta)$-DP) but introduces a small upward bias on small bins.
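The clamping mentioned in the last item can be done in the aggregation query itself. This is a minimal sketch built on the histogram query from earlier on this page; the clamped_count column name is illustrative, and GREATEST is the standard PostgreSQL function.

```sql
-- Keep the raw sum next to the clamped one so the upward bias stays visible.
SELECT u.idx,
       SUM(u.value)              AS noisy_count,
       GREATEST(SUM(u.value), 0) AS clamped_count  -- post-processing only
FROM responses,
     unnest(anon.ldp_laplace_onehot(rating, 1.0, 5))
       WITH ORDINALITY AS u(value, idx)
GROUP BY u.idx
ORDER BY u.idx;
```

Clamping after summing, as here, touches only the final counts. Clamping each row's vector before summing would bias every bin upward and is best avoided.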
The math
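For a value $v$ in a public domain of $d$ categories, the true one-hot encoding is $e_v \in \{0,1\}^d$. Each variant releases $e_v + Z$, where $Z \in \mathbb{R}^d$ is i.i.d. noise. For Laplace, $Z_i \sim \mathrm{Lap}(2/\varepsilon)$ gives ε-DP because $\lVert e_v - e_{v'} \rVert_1 = 2$ for any $v \neq v'$. For Gaussian, $Z_i \sim \mathcal{N}(0, \sigma^2)$ with the σ above gives $(\varepsilon, \delta)$-DP because $\lVert e_v - e_{v'} \rVert_2 = \sqrt{2}$. Summing over $n$ rows produces an unbiased estimate of the histogram with per-bin std error $2\sqrt{2n}/\varepsilon$ for Laplace or $\sigma\sqrt{n}$ for Gaussian.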
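The unbiasedness and standard-error claims can be checked in a few lines. The derivation below covers the Laplace case and assumes noise of scale $b = 2/\varepsilon$; the symbols $x_j$ (true value of row $j$), $Y_j$ (released vector of row $j$) and $c_i$ (true count of bin $i$) are notation introduced here for illustration.

```latex
\begin{align*}
Y_j &= e_{x_j} + Z_j, \qquad Z_{j,i} \overset{\text{i.i.d.}}{\sim} \mathrm{Lap}(b), \quad b = 2/\varepsilon \\
\mathbb{E}\Big[\sum_{j=1}^{n} Y_{j,i}\Big]
  &= \sum_{j=1}^{n} \mathbb{1}[x_j = i] + \sum_{j=1}^{n} \mathbb{E}[Z_{j,i}]
   = c_i + 0 = c_i \\
\operatorname{Var}\Big[\sum_{j=1}^{n} Y_{j,i}\Big]
  &= \sum_{j=1}^{n} \operatorname{Var}[Z_{j,i}]
   = n \cdot 2b^2 = \frac{8n}{\varepsilon^2}
  \quad\Longrightarrow\quad
  \text{std error} = \frac{2\sqrt{2n}}{\varepsilon}
\end{align*}
```

The Gaussian case follows the same steps with $\operatorname{Var}[Z_{j,i}] = \sigma^2$ in place of $2b^2$, which gives $\sigma\sqrt{n}$.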
Try it live
- /onehot-histogram: one-hot vs scalar GRRM histogram estimation, head-to-head, for both Laplace and Gaussian.