Step 5 - Validation

Each synthetic dataset hosted on the SDS allows for validation against the confidential data.

What is Validation?

Validation means running the analysis originally performed on the synthetic data against the confidential data, as illustrated below.
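
To make the idea concrete, here is a minimal Python sketch. The file paths, variable names, and regression specification are hypothetical placeholders: the point is simply that one unchanged script is run first by the researcher on the synthetic file and later, by the data custodian, on the confidential file.

```python
# Minimal sketch of the validation idea. The file paths, variable names,
# and regression specification below are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

def run_analysis(data_path: str) -> None:
    """Run the study's analysis on whichever dataset the path points to."""
    df = pd.read_csv(data_path)
    model = smf.ols("log_earnings ~ age + education", data=df).fit()
    # Write out only the results that will be requested for release.
    with open("results.txt", "w") as f:
        f.write(model.summary().as_text())

# The researcher runs this on the synthetic data...
run_analysis("synthetic/extract.csv")
# ...and the same unchanged script is later run on the confidential data:
# run_analysis("confidential/extract.csv")
```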

How does Validation work?

Each data provider may have their own requirements, but generically, validation works as follows:

  • Researchers follow the data provider's programming requirements. Generically, this means preparing programs as if for replication. The SSB provides SSB Validation Request Guidelines, and other data providers will have similar requirements.
  • Programs must run successfully and completely on the synthetic data; only then can validation be requested. A program that fails on the synthetic data will also fail on the confidential data, and the validation request will be denied (a pre-submission check along these lines is sketched after this list).
  • Because results are run against the confidential data, they will be reviewed by the data custodians for disclosure concerns. These rules vary by dataset. In general, researchers should familiarize themselves with standard Census Bureau disclosure rules for external (FSRDC) projects (a copy of the RDC Researcher Handbook can be useful) and should prepare the appropriate memo documenting the requested output (see the RDC Disclosure Request Memo for an example, but contact the data custodian about any special requirements).
  • Approved output will be released to users, but may also be shared with additional data custodians (in the case of the Census Bureau, results may be shared with the Social Security Administration and the IRS, because they jointly share custody of the confidential data).
  • The validation process can be accomplished in as little as one week for simple results that are generated by clean code and have no disclosure issues. However, if the code does not run properly, the sample sizes are too small, or the researcher does not accurately fill out the disclosure memo, the process can take much longer.
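
As an illustration of the second and third points above, a pre-submission check might look like the hedged Python sketch below: it confirms that the analysis program completes on the synthetic data and flags small cells in the output intended for release. The program name, the output-table layout (a count column named n), and the threshold of 10 are all hypothetical; actual disclosure thresholds and memo requirements come from the data custodian.

```python
# Illustrative pre-submission check, run on the synthetic data only.
# The threshold below is a placeholder, NOT an official disclosure rule.
import subprocess
import pandas as pd

MIN_CELL_COUNT = 10  # hypothetical; ask the data custodian for real rules

def pre_submission_check(program: list[str], table_path: str) -> None:
    # A program that fails here would also fail on the confidential data,
    # and the validation request would be denied.
    result = subprocess.run(program, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"Failed on synthetic data:\n{result.stderr}")
    # Flag small cells in the tabular output intended for release, so they
    # can be documented (or dropped) before the disclosure memo is written.
    table = pd.read_csv(table_path)  # assumes a cell-count column "n"
    small = table[table["n"] < MIN_CELL_COUNT]
    if not small.empty:
        print("Cells below the minimum count; address these in the memo:")
        print(small)

pre_submission_check(["python", "analysis.py"], "results/summary_table.csv")
```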

Output that can be validated

In general,

  • several dozen regression coefficients are generally not problematic, though coefficients on dichotomous (dummy) variables may have additional documentation requirements; if running a large number of models, summaries of regression coefficients are encouraged.
  • tabular output with no more detail than the usual "summary" table in an academic article is generally permissible.
  • more detailed tabular output is generally NOT feasible.

For instance, if you plan to compute detailed moments of the data, the goal is usually to feed them into some larger model or simulation. The suggested procedure is to run the larger model or simulation on the synthetic data (and then on the confidential data) and to request release of the estimated model parameters instead of the underlying moment table, as in the sketch below.
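
Here is a hedged Python sketch of that procedure, using a deliberately tiny two-parameter lognormal earnings model fit by minimum distance. The model, the moments, the variable name, and the file path are all hypothetical stand-ins for whatever larger model the project actually uses; the point is that the detailed moments never leave the program.

```python
# Hedged sketch: detailed moments stay inside the program; only the fitted
# model parameters are requested for release. Everything here is illustrative.
import numpy as np
import pandas as pd
from scipy.optimize import minimize

def fit_model(df: pd.DataFrame) -> np.ndarray:
    # The detailed moments are computed internally and never written out.
    data_moments = np.array([df["earnings"].mean(), df["earnings"].var()])

    def model_moments(theta):
        mu, sigma = theta
        # Mean and variance implied by a lognormal(mu, sigma) earnings model.
        mean = np.exp(mu + sigma**2 / 2)
        var = (np.exp(sigma**2) - 1) * np.exp(2 * mu + sigma**2)
        return np.array([mean, var])

    def distance(theta):
        # Percent deviations keep the two moments on a comparable scale.
        return np.sum(((model_moments(theta) - data_moments) / data_moments) ** 2)

    return minimize(distance, x0=np.array([10.0, 0.5])).x

df = pd.read_csv("synthetic/extract.csv")  # hypothetical path
mu_hat, sigma_hat = fit_model(df)
print("Release request: mu =", mu_hat, "sigma =", sigma_hat)  # parameters only
```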

In all cases, data custodians can provide additional information.