Post-processing

Once an LDP mechanism has produced its output, the noise is built in. Post-processing functions transform that noisy output into useful estimators (debiased counts, frequencies, means, confidence intervals) without spending additional privacy budget: any function of an ε-DP release that does not consult the raw data again is still ε-DP.

These functions operate on the analyst side, not the masking side. They take the output of GRRM, the only mechanism in this catalog that benefits from explicit debiasing, and return unbiased point estimates, intervals, and aggregates. They are called ad-hoc from analytic queries, not wired into SECURITY LABEL rules.

Frequency estimation

anon.ldp_frequency_estimate debiases the count of one category in GRRM-perturbed output:

SELECT anon.ldp_frequency_estimate(
         observed_count => COUNT(*) FILTER (WHERE noisy_rating = 3),
         n              => COUNT(*),
         epsilon        => 1.0,
         d              => 5
       ) AS unbiased_count
FROM   responses_anonymized;

The estimator subtracts the expected lie-count from the observed count and rescales. It is unbiased for large n; its variance grows as ε shrinks.
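To make "subtract the expected lie-count and rescale" concrete, here is a minimal Python sketch of the textbook generalized randomized response estimator. It assumes GRRM keeps the true value with probability p = e^ε / (e^ε + d − 1) and reports each specific other value with probability q = 1 / (e^ε + d − 1). The names grr_probs and grr_debias_count are hypothetical, and the actual implementation behind anon.ldp_frequency_estimate may differ.

```python
import math


def grr_probs(epsilon, d):
    """Return (p, q) for textbook GRR (an assumption about GRRM, not its confirmed definition).

    p = probability of reporting the true category,
    q = probability of reporting one specific other category.
    """
    e = math.exp(epsilon)
    return e / (e + d - 1), 1.0 / (e + d - 1)


def grr_debias_count(observed_count, n, epsilon, d):
    """Hypothetical stand-in for a single-category debiasing step."""
    p, q = grr_probs(epsilon, d)
    # n * q is the count this bin would receive on average if nobody truly
    # held the category; dividing by (p - q) undoes the shrinkage that the
    # mechanism applies to the true count.
    return (observed_count - n * q) / (p - q)


# Round trip on expected values: 300 of 1000 respondents truly hold the category.
p, q = grr_probs(1.0, 5)
expected_observed = 300 * p + (1000 - 300) * q
print(grr_debias_count(expected_observed, 1000, 1.0, 5))
```

On real data the observed count fluctuates around its expectation, so the debiased value fluctuates around the true count rather than matching it exactly.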

For the whole distribution at once, use anon.ldp_correct_distribution:

WITH agg AS (
  SELECT
    COUNT(*) FILTER (WHERE noisy_rating = 1) AS c1,
    COUNT(*) FILTER (WHERE noisy_rating = 2) AS c2,
    COUNT(*) FILTER (WHERE noisy_rating = 3) AS c3,
    COUNT(*) FILTER (WHERE noisy_rating = 4) AS c4,
    COUNT(*) FILTER (WHERE noisy_rating = 5) AS c5
  FROM responses_anonymized
)
SELECT anon.ldp_correct_distribution(
         counts  => ARRAY[c1, c2, c3, c4, c5],
         epsilon => 1.0,
         d       => 5
       ) AS unbiased_counts
FROM agg;

This inverts the GRRM transition matrix on the full vector of counts and returns a vector of debiased per-bin counts.
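Under the same textbook-GRR assumption (keep probability p, per-other-value probability q), the transition matrix has the form (p − q)·I + q·J, where J is the all-ones matrix, so applying its inverse to the count vector collapses to a per-bin formula. The Python sketch below illustrates this; grr_correct_distribution is a hypothetical name and is not the extension's code.

```python
import math


def grr_correct_distribution(counts, epsilon, d):
    """Debias a full vector of GRR-perturbed counts (hypothetical helper)."""
    if len(counts) != d:
        raise ValueError("counts must have exactly d entries")
    e = math.exp(epsilon)
    p, q = e / (e + d - 1), 1.0 / (e + d - 1)
    n = sum(counts)
    # Because the transition matrix is (p - q) * I + q * J, applying its
    # inverse to the counts is the same as debiasing every bin with the
    # shared baseline n * q.
    return [(c - n * q) / (p - q) for c in counts]


# Expected noisy counts for a known true distribution, then the round trip.
true_counts = [100, 200, 300, 250, 150]
e = math.exp(1.0)
p, q = e / (e + 4), 1.0 / (e + 4)
noisy = [(p - q) * t + q * sum(true_counts) for t in true_counts]
recovered = grr_correct_distribution(noisy, 1.0, 5)
print([round(x, 6) for x in recovered])
```

A useful sanity check in this model: the debiased counts always sum to the same total n as the observed counts, because 1 − d·q equals p − q.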

Frequency ratios

anon.ldp_frequency_ratio computes the debiased ratio between two category counts directly, without going through individual frequency estimates first:

SELECT anon.ldp_frequency_ratio(
         numerator   => COUNT(*) FILTER (WHERE noisy_rating = 5),
         denominator => COUNT(*) FILTER (WHERE noisy_rating = 1),
         n           => COUNT(*),
         epsilon     => 1.0,
         d           => 5
       ) AS ratio_5_to_1
FROM   responses_anonymized;

Useful for relative-frequency queries ("how much more common is X than Y") without releasing the underlying debiased counts.
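One plausible reading of "directly", sketched below in Python under the textbook-GRR assumption, is that the rescaling factor 1/(p − q) cancels between numerator and denominator, so only the shared baseline n·q has to be subtracted. The function name grr_debiased_ratio and the choice to return None for a non-positive corrected denominator are illustrative assumptions, not documented behavior of anon.ldp_frequency_ratio.

```python
import math


def grr_debiased_ratio(numerator, denominator, n, epsilon, d):
    """Ratio of two debiased GRR counts (hypothetical helper)."""
    q = 1.0 / (math.exp(epsilon) + d - 1)
    baseline = n * q  # expected count of a bin that nobody truly holds
    corrected_den = denominator - baseline
    if corrected_den <= 0:
        # The denominator category is indistinguishable from noise, so this
        # sketch declines to report a ratio (an illustrative design choice).
        return None
    # The 1 / (p - q) rescaling is identical for both bins and cancels.
    return (numerator - baseline) / corrected_den


# Expected noisy counts when 300 respondents hold category 5 and 100 hold category 1.
e = math.exp(1.0)
p, q = e / (e + 4), 1.0 / (e + 4)
noisy_5 = 300 * p + 700 * q
noisy_1 = 100 * p + 900 * q
print(grr_debiased_ratio(noisy_5, noisy_1, 1000, 1.0, 5))
```

Note that a ratio of two unbiased estimates is not itself unbiased; the sketch only recovers the true ratio exactly when fed expected counts.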

Confidence intervals

anon.ldp_ci_lower and anon.ldp_ci_upper give the bounds of a confidence interval around a single category's true count:

SELECT anon.ldp_ci_lower(observed_count, n, epsilon, d, alpha => 0.05) AS lo,
       anon.ldp_ci_upper(observed_count, n, epsilon, d, alpha => 0.05) AS hi
FROM   ...;

alpha is the desired non-coverage rate (0.05 for 95% intervals). The intervals widen as ε shrinks; measured relative to n, they also widen as n shrinks.

anon.ldp_frequency_variance returns the analytic variance of the single-category estimate, which is the quantity the CI bounds are derived from:

SELECT anon.ldp_frequency_variance(observed_count, n, 1.0, 5)
       AS estimator_variance
FROM   ...;
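For intuition about how a variance and an interval of this kind can be derived, the Python sketch below uses the textbook GRR model with a plug-in variance and a normal approximation. Both choices are assumptions made for illustration; the function names are hypothetical, and the extension may compute its variance and bounds differently.

```python
import math
from statistics import NormalDist


def grr_count_variance(observed_count, n, epsilon, d):
    """Plug-in variance of the debiased count (hypothetical helper)."""
    e = math.exp(epsilon)
    p, q = e / (e + d - 1), 1.0 / (e + d - 1)
    # Point estimate of the true count, clamped to [0, n] for the variance only.
    est = min(max((observed_count - n * q) / (p - q), 0.0), float(n))
    # The observed count is a sum of truthful reports (probability p each)
    # from holders and misreports (probability q each) from everyone else.
    var_observed = est * p * (1 - p) + (n - est) * q * (1 - q)
    return var_observed / (p - q) ** 2


def grr_confidence_interval(observed_count, n, epsilon, d, alpha=0.05):
    """Normal-approximation interval around the debiased count (hypothetical helper)."""
    e = math.exp(epsilon)
    p, q = e / (e + d - 1), 1.0 / (e + d - 1)
    est = (observed_count - n * q) / (p - q)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(grr_count_variance(observed_count, n, epsilon, d))
    return est - half_width, est + half_width


def expected_observed(true_count, n, epsilon, d):
    """Expected noisy count for a given true count, used only to build the demo input."""
    e = math.exp(epsilon)
    p, q = e / (e + d - 1), 1.0 / (e + d - 1)
    return true_count * p + (n - true_count) * q


for eps in (0.5, 1.0, 2.0):
    lo, hi = grr_confidence_interval(expected_observed(300, 1000, eps, 5), 1000, eps, 5)
    print(f"epsilon={eps}: [{lo:.1f}, {hi:.1f}]")
```

The loop shows the same true count (300 of 1000) at three budgets; the interval is centred on 300 each time and narrows as ε grows.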

Mean estimation

For a categorical column whose category labels are numeric (a 1..5 rating treated as a number, a Likert scale), anon.ldp_mean_from_frequencies returns a debiased mean by debiasing the per-category count vector and weighting by the category values:

WITH agg AS (
  SELECT ARRAY[
    COUNT(*) FILTER (WHERE noisy_rating = 1),
    COUNT(*) FILTER (WHERE noisy_rating = 2),
    COUNT(*) FILTER (WHERE noisy_rating = 3),
    COUNT(*) FILTER (WHERE noisy_rating = 4),
    COUNT(*) FILTER (WHERE noisy_rating = 5)
  ] AS counts
  FROM responses_anonymized
)
SELECT anon.ldp_mean_from_frequencies(
         counts  => counts,
         values  => ARRAY[1, 2, 3, 4, 5]::float8[],
         epsilon => 1.0,
         d       => 5
       ) AS unbiased_mean
FROM agg;

This is more accurate than averaging GRRM output directly because the debiasing happens on the count vector before the mean is computed.
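The effect is easy to see on expected counts. In the Python sketch below, which again assumes textbook GRR and uses the hypothetical helper name grr_mean_from_counts, every respondent truly answers 5, yet the naive mean of the noisy output is pulled toward the middle of the scale while the debiased mean recovers 5.

```python
import math


def grr_mean_from_counts(counts, values, epsilon, d):
    """Debias the count vector, then take the value-weighted mean (hypothetical helper)."""
    if len(counts) != d or len(values) != d:
        raise ValueError("counts and values must both have d entries")
    e = math.exp(epsilon)
    p, q = e / (e + d - 1), 1.0 / (e + d - 1)
    n = sum(counts)
    debiased = [(c - n * q) / (p - q) for c in counts]
    # In this model the debiased counts sum to n, so dividing by n
    # normalises them into weights.
    return sum(v * c for v, c in zip(values, debiased)) / n


# Expected noisy counts when all 1000 respondents truly answer 5.
e = math.exp(1.0)
p, q = e / (e + 4), 1.0 / (e + 4)
noisy = [1000 * q, 1000 * q, 1000 * q, 1000 * q, 1000 * p]
values = [1.0, 2.0, 3.0, 4.0, 5.0]

naive_mean = sum(v * c for v, c in zip(values, noisy)) / sum(noisy)
debiased_mean = grr_mean_from_counts(noisy, values, 1.0, 5)
print(f"naive mean of noisy output: {naive_mean:.3f}")
print(f"debiased mean:              {debiased_mean:.3f}")
```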

Pitfalls

  • Estimator variance grows fast at low ε. Below ε≈0.5 the variance of the single-category estimator can dominate the signal for moderate n. Confidence intervals widen accordingly.
  • Negative debiased counts. Debiasing can produce negative point estimates for rare categories. They are statistically valid: the unbiased estimator is not constrained to be non-negative. Clamping them to zero introduces a small upward bias on rare bins.
  • Bias from a wrong d. Every estimator here relies on the public d matching the value used at masking time. Using the wrong d silently produces wrong estimates.
  • Don't mix with one-hot output. The one-hot variants produce already-unbiased per-bin counts when summed; running these post-processing functions on top of them is incorrect.
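
The negative-count and clamping pitfalls can be reproduced with a small simulation. The Python sketch below is illustrative only: it simulates textbook GRR over categories 0..d-1 (an assumption about how GRRM behaves), with hypothetical function names and arbitrarily chosen parameters, for a category that nobody truly holds.

```python
import math
import random


def grr_perturb(value, d, epsilon, rng):
    """Textbook GRR: keep the value with probability p, else pick another category uniformly."""
    e = math.exp(epsilon)
    p = e / (e + d - 1)
    if rng.random() < p:
        return value
    other = rng.randrange(d - 1)  # uniform over the d - 1 other categories
    return other if other < value else other + 1


def simulate_rare_bin(n=200, d=5, epsilon=0.5, trials=400, seed=0):
    """Debiased counts for category 0 when every respondent truly holds category 1."""
    rng = random.Random(seed)
    e = math.exp(epsilon)
    p, q = e / (e + d - 1), 1.0 / (e + d - 1)
    estimates = []
    for _ in range(trials):
        observed = sum(grr_perturb(1, d, epsilon, rng) == 0 for _ in range(n))
        estimates.append((observed - n * q) / (p - q))
    return estimates


estimates = simulate_rare_bin()
raw_mean = sum(estimates) / len(estimates)
clamped_mean = sum(max(x, 0.0) for x in estimates) / len(estimates)
negatives = sum(x < 0 for x in estimates)
print(f"negative estimates: {negatives} of {len(estimates)}")
print(f"mean of raw estimates:     {raw_mean:.2f}")
print(f"mean of clamped estimates: {clamped_mean:.2f}")
```

The true count is zero, so roughly half of the individual estimates land below zero. Averaged over trials, the raw estimates stay close to zero, while the clamped ones average noticeably above it, which is the upward bias described in the list.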

Try it live

  • /correction: single-value estimate, full-distribution recovery, and mean-from-frequencies on real GRRM output.