Post-processing
Once an LDP mechanism has produced its output, the noise is built in. Post-processing functions transform that noisy output into useful estimators (debiased counts, frequencies, means, confidence intervals) without spending additional privacy budget: any function of an ε-DP release that does not access the raw data again is still ε-DP.
These functions operate on the analyst side, not the masking side. They
take the output of GRRM, the only mechanism in
this catalog that benefits from explicit debiasing, and return unbiased
point estimates, intervals, and aggregates. They are called ad-hoc from
analytic queries, not wired into SECURITY LABEL rules.
Frequency estimation
anon.ldp_frequency_estimate debiases the count of one category in
GRRM-perturbed output:
SELECT anon.ldp_frequency_estimate(
    observed_count => COUNT(*) FILTER (WHERE noisy_rating = 3),
    n => COUNT(*),
    epsilon => 1.0,
    d => 5
) AS unbiased_count
FROM responses_anonymized;
The estimator subtracts the expected lie-count from the observed count and rescales. It is unbiased for large n; its variance grows as ε shrinks.
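The subtract-and-rescale step can be sketched outside the database. The Python below is an illustration only, not the extension's implementation: it assumes GRRM follows standard generalized randomized response, where the true category is reported with probability p = e^ε / (e^ε + d − 1) and each other category with probability q = 1 / (e^ε + d − 1). The function names are hypothetical.

```python
import math

def grr_probabilities(epsilon, d):
    """Return (p, q) under the assumed standard GRR model.

    p: probability that a respondent reports their true category.
    q: probability that a respondent reports one specific other category.
    """
    denom = math.exp(epsilon) + d - 1
    return math.exp(epsilon) / denom, 1.0 / denom

def debias_count(observed_count, n, epsilon, d):
    """Debias one category count: E[observed] = n*q + true*(p - q)."""
    p, q = grr_probabilities(epsilon, d)
    baseline = n * q  # count every category receives from randomized answers
    return (observed_count - baseline) / (p - q)

# Example: epsilon = ln 3 and d = 3 give p = 0.6, q = 0.2.
# A category with 500 true answers out of 1000 is expected to be observed
# 1000 * 0.2 + 500 * 0.4 = 400 times.
estimate = debias_count(400, 1000, math.log(3), 3)
```

Under these assumptions the example recovers the true count of 500 from the expected observed count of 400.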
For the whole distribution at once, use
anon.ldp_correct_distribution:
WITH agg AS (
    SELECT
        COUNT(*) FILTER (WHERE noisy_rating = 1) AS c1,
        COUNT(*) FILTER (WHERE noisy_rating = 2) AS c2,
        COUNT(*) FILTER (WHERE noisy_rating = 3) AS c3,
        COUNT(*) FILTER (WHERE noisy_rating = 4) AS c4,
        COUNT(*) FILTER (WHERE noisy_rating = 5) AS c5
    FROM responses_anonymized
)
SELECT anon.ldp_correct_distribution(
    counts => ARRAY[c1, c2, c3, c4, c5],
    epsilon => 1.0,
    d => 5
) AS unbiased_counts
FROM agg;
This inverts the GRRM transition matrix on the full vector of counts and returns a vector of debiased per-bin counts.
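As a rough illustration of what inverting the transition matrix means, the sketch below assumes the standard GRR model (diagonal entries p = e^ε / (e^ε + d − 1), off-diagonal entries q = 1 / (e^ε + d − 1)). Under that assumption the matrix is (p − q)·I + q·J, with J the all-ones matrix, and its inverse applied to a count vector has a closed form, so no explicit matrix inversion is needed. The helper names are hypothetical, and d is taken from the length of the vector.

```python
import math

def _pq(epsilon, d):
    e = math.exp(epsilon)
    return e / (e + d - 1), 1.0 / (e + d - 1)

def expected_observed(true_counts, epsilon):
    """Forward direction: apply the assumed GRR transition matrix."""
    d, n = len(true_counts), sum(true_counts)
    p, q = _pq(epsilon, d)
    return [n * q + (p - q) * t for t in true_counts]

def correct_distribution(counts, epsilon):
    """Inverse direction: closed form of P^-1 applied to the counts."""
    d, n = len(counts), sum(counts)
    p, q = _pq(epsilon, d)
    return [(c - n * q) / (p - q) for c in counts]

# Round trip on a small example (epsilon = ln 3, d = 3, so p = 0.6, q = 0.2).
truth = [500, 300, 200]
observed = expected_observed(truth, math.log(3))       # about [400, 320, 280]
recovered = correct_distribution(observed, math.log(3))
```

The debiased vector sums to the same total as the observed vector, which is a quick sanity check on any implementation of this step.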
Frequency ratios
anon.ldp_frequency_ratio computes the debiased ratio between two
category counts directly, without going through individual frequency
estimates first:
SELECT anon.ldp_frequency_ratio(
    numerator => COUNT(*) FILTER (WHERE noisy_rating = 5),
    denominator => COUNT(*) FILTER (WHERE noisy_rating = 1),
    n => COUNT(*),
    epsilon => 1.0,
    d => 5
) AS ratio_5_to_1
FROM responses_anonymized;
This is useful for relative-frequency queries ("how much more common is X than Y") without releasing the underlying debiased counts.
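One way to see why a direct ratio is possible: under the standard GRR model (an assumption about GRRM, with q = 1 / (e^ε + d − 1)), both debiased counts share the same rescaling factor, so it cancels and only the baseline n·q has to be subtracted. The sketch below is illustrative; the function name is hypothetical, and returning None for a non-positive denominator is a choice made here, not documented behavior of the extension.

```python
import math

def debiased_ratio(numerator, denominator, n, epsilon, d):
    """Ratio of two debiased counts; the 1/(p - q) factor cancels out."""
    q = 1.0 / (math.exp(epsilon) + d - 1)
    baseline = n * q
    top = numerator - baseline
    bottom = denominator - baseline
    if bottom <= 0:
        # The debiased denominator is zero or negative: no meaningful ratio.
        return None
    return top / bottom

# epsilon = ln 3, d = 3: true counts 500 and 200 out of 1000 are expected
# to be observed as 400 and 280.
ratio = debiased_ratio(400, 280, 1000, math.log(3), 3)
undefined = debiased_ratio(400, 150, 1000, math.log(3), 3)
```

The first call gives the true ratio 500 / 200; the second shows the guard, since an observed denominator of 150 sits below the baseline of 200.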
Confidence intervals
anon.ldp_ci_lower and anon.ldp_ci_upper give the bounds of a
confidence interval around a single category's true count:
SELECT anon.ldp_ci_lower(observed_count, n, epsilon, d, alpha => 0.05) AS lo,
       anon.ldp_ci_upper(observed_count, n, epsilon, d, alpha => 0.05) AS hi
FROM ...;
alpha is the desired non-coverage rate (0.05 for 95% intervals).
The intervals widen as ε shrinks and as n shrinks.
anon.ldp_frequency_variance returns the analytic variance of the
single-category estimate, which is the quantity the CI bounds are
derived from:
SELECT anon.ldp_frequency_variance(observed_count, n, 1.0, 5)
       AS estimator_variance
FROM ...;
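For intuition about how the variance and the interval relate, here is a minimal sketch under two assumptions that the text above does not confirm: GRRM follows standard generalized randomized response, and the interval is a normal-approximation interval built from a plug-in variance. The observed count is then a sum of two binomials (truthful reports with probability p, stray reports with probability q), and dividing its variance by (p − q)² gives the variance of the debiased count. All function names are hypothetical.

```python
import math
from statistics import NormalDist

def _pq(epsilon, d):
    e = math.exp(epsilon)
    return e / (e + d - 1), 1.0 / (e + d - 1)

def estimator_variance(observed_count, n, epsilon, d):
    """Plug-in variance of the debiased single-category count."""
    p, q = _pq(epsilon, d)
    # Plug in the debiased count for the unknown true count, kept in [0, n].
    c_hat = min(max((observed_count - n * q) / (p - q), 0.0), n)
    var_observed = c_hat * p * (1 - p) + (n - c_hat) * q * (1 - q)
    return var_observed / (p - q) ** 2

def confidence_interval(observed_count, n, epsilon, d, alpha=0.05):
    """Normal-approximation interval around the debiased count."""
    p, q = _pq(epsilon, d)
    estimate = (observed_count - n * q) / (p - q)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(estimator_variance(observed_count, n, epsilon, d))
    return estimate - half_width, estimate + half_width

lo, hi = confidence_interval(400, 1000, math.log(3), 3)
```

With ε = ln 3, d = 3, and n = 1000, an observed count of 400 gives a debiased estimate of 500 and a plug-in variance of 1250, so the 95% interval is centered on 500 with a half-width of roughly two standard deviations.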
Mean estimation
For a categorical column whose category labels are numeric (a 1..5
rating treated as a number, a Likert scale), anon.ldp_mean_from_frequencies
returns a debiased mean by debiasing the per-category count vector and
weighting by the category values:
WITH agg AS (
    SELECT ARRAY[
        COUNT(*) FILTER (WHERE noisy_rating = 1),
        COUNT(*) FILTER (WHERE noisy_rating = 2),
        COUNT(*) FILTER (WHERE noisy_rating = 3),
        COUNT(*) FILTER (WHERE noisy_rating = 4),
        COUNT(*) FILTER (WHERE noisy_rating = 5)
    ] AS counts
    FROM responses_anonymized
)
SELECT anon.ldp_mean_from_frequencies(
    counts => counts,
    values => ARRAY[1, 2, 3, 4, 5]::float8[],
    epsilon => 1.0,
    d => 5
) AS unbiased_mean
FROM agg;
This is more accurate than averaging GRRM output directly because the debiasing happens on the count vector before the mean is computed.
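The difference between the two approaches can be made concrete with a small sketch. It assumes the standard GRR model for GRRM and uses hypothetical function names; it is meant to show the order of operations (debias the counts, then weight), not to reproduce the extension's code.

```python
import math

def naive_mean(counts, values):
    """Average the perturbed output directly, with no debiasing."""
    return sum(v * c for v, c in zip(values, counts)) / sum(counts)

def mean_from_frequencies(counts, values, epsilon):
    """Debias the count vector first, then take the weighted mean."""
    d, n = len(counts), sum(counts)
    e = math.exp(epsilon)
    p, q = e / (e + d - 1), 1.0 / (e + d - 1)
    debiased = [(c - n * q) / (p - q) for c in counts]
    return sum(v * c for v, c in zip(values, debiased)) / n

# epsilon = ln 3, d = 3: true counts [500, 300, 200] on values [1, 2, 3]
# have mean 1.7 and are expected to be observed as [400, 320, 280].
observed = [400, 320, 280]
values = [1.0, 2.0, 3.0]
biased = naive_mean(observed, values)
corrected = mean_from_frequencies(observed, values, math.log(3))
```

In this example the naive average is 1.88, pulled toward the middle of the scale by the uniformly spread stray answers, while the debiased mean returns the true value of 1.7.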
Pitfalls
- Estimator variance grows fast at low ε. Below ε≈0.5 the variance of the single-category estimator can dominate the signal for moderate n. Confidence intervals widen accordingly.
- Negative debiased counts. Debiasing can produce negative point estimates for rare categories. They are statistically valid; the unbiased estimator is unconstrained. Clamping them to zero introduces a small upward bias on rare bins.
- Bias from a wrong d. Every estimator here relies on the public d matching the value used at masking time. Using the wrong d silently produces wrong estimates.
- Don't mix with one-hot output. The one-hot variants produce already-unbiased per-bin counts when summed; running these post-processing functions on top of them is incorrect.
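Two of these pitfalls are easy to reproduce numerically. The sketch below again assumes the standard GRR model for GRRM, with a hypothetical debias_count helper, and shows a negative estimate for a rare category as well as the effect of debiasing with a parameter that does not match the one used at masking time (illustrated here with the category count d).

```python
import math

def debias_count(observed_count, n, epsilon, d):
    e = math.exp(epsilon)
    p, q = e / (e + d - 1), 1.0 / (e + d - 1)
    return (observed_count - n * q) / (p - q)

# Negative estimate: with epsilon = 1.0 and d = 5 the baseline n*q is about
# 149 for n = 1000, so an observed count of 100 debiases to a negative value.
rare = debias_count(100, 1000, 1.0, 5)
clamped = max(rare, 0.0)  # clamping trades unbiasedness for a valid count

# Mismatched parameter: data masked with d = 3 but debiased with d = 5.
right = debias_count(400, 1000, math.log(3), 3)
wrong = debias_count(400, 1000, math.log(3), 5)
```

Here the correct setting recovers 500, while the mismatched one returns 900 with no error or warning, which is why the masking parameters need to be recorded alongside the data.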
Try it live
- /correction: single-value estimate, full-distribution recovery, and mean-from-frequencies on real GRRM output.