World Statistical Congress 2015 (ISI2015) session on "Synthetic establishment microdata around the world"

A session on "Synthetic establishment microdata around the world" was organized at the International Statistical Institute (ISI)'s 60th World Statistics Congress – ISI2015 in Rio de Janeiro, Brazil by Lars Vilhuber (Cornell):

Session abstract:

Around the world, national statistical agencies face substantial challenges in attempting to release establishment-level business microdata to researchers. Doing so often represents too large a risk to establishments' confidentiality. The U.S. Census Bureau created a synthetic longitudinal business database, and released it for limited distribution. Subsequent efforts have assessed how well the approach translates to other countries' data and legal environments. The session will report on progress in the United States and in Germany, and on lessons learned about the utility of such an approach.

Justification: The synthetic data approach is of potential interest to many statistical agencies, and the session will provide valuable information about the utility, cost, and risk of such an approach. Discussants are from statistical agencies, and can speak to the policy and implementation issues associated with these approaches.

Chair:

Vishesh Karwa (CMU and Harvard)

Presentations:

  • John M. Abowd (Cornell) and Kevin L. McKinney (U.S. Census Bureau), "Noise Infusion as a Confidentiality Protection Measure for Graph-based Statistics" (available as CES WP-14-30)
  • Lars Vilhuber (Cornell) and Javier Miranda (U.S. Census Bureau), "Using partially synthetic data to replace suppression in the Business Dynamics Statistics"
  • Jörg Drechsler (IAB Germany) and Lars Vilhuber (Cornell), "Synthetic Longitudinal Business Databases for International Comparisons"
  • Satkartar Kinney (NISS), Jerry Reiter (Duke), and Javier Miranda (U.S. Census Bureau), "Improving the Synthetic Longitudinal Business Database: Synthesizing Firms"
  • Ian Schmutte (University of Georgia), "Differentially Private Publication of Data on Wages and Job Mobility"

Discussant

  • Stefan Bender (Deutsche Bundesbank, Germany)

Publications

Some of the articles above were published in the Statistical Journal of the IAOS (SJIAOS), Volume 32, Issue 1 (2016), as well as in the Census Bureau's Center for Economic Studies Working Paper series and the NSF-Census Research Network's paper archive.

Articles

  • L. Vilhuber, J. M. Abowd, and J. P. Reiter, "Synthetic establishment microdata around the world," Statistical Journal of the International Association for Official Statistics, vol. 32, pp. 65-68, 2016.

    In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business microdata is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature.

    @article {VilhuberAbowdReiter-SJIAOS-2016,
    title = {Synthetic establishment microdata around the world},
    journal = {Statistical Journal of the International Association for Official Statistics},
    volume = {32},
    year = {2016},
    pages = {65-68},
    chapter = {65},
    abstract = {In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business microdata is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic \emph{establishment} microdata. This overview situates those papers, published in this issue, within the broader literature.},
    keywords = {Business data, confidentiality, differential privacy, international comparison, Multiple imputation, synthetic},
    doi = {10.3233/SJI-160964},
    url = {http://content.iospress.com/download/statistical-journal-of-the-iaos/sji964},
    author = {Vilhuber, Lars and Abowd, John M. and Reiter, Jerome P.}
    }
  • I. M. Schmutte, "Differentially private publication of data on wages and job mobility," Statistical Journal of the International Association for Official Statistics, vol. 32, pp. 81-92, 2016.

    Brazil, like many countries, is reluctant to publish business-level data, because of legitimate concerns about the establishments' confidentiality. A trusted data curator can increase the utility of data, while managing the risk to establishments, either by releasing synthetic data, or by infusing noise into published statistics. This paper evaluates the application of a differentially private mechanism to publish statistics on wages and job mobility computed from Brazilian employer-employee matched data. The publication mechanism can result in both the publication of specific statistics as well as the generation of synthetic data. I find that the tradeoff between the privacy guaranteed to individuals in the data, and the accuracy of published statistics, is potentially much better than the worst-case theoretical accuracy guarantee. However, the synthetic data fare quite poorly in analyses that are outside the set of queries to which it was trained. Note that this article only explores and characterizes the feasibility of these publication strategies, and will not directly result in the publication of any data.

    @article {Schmutte-SJIAOS-2016,
    title = {Differentially private publication of data on wages and job mobility},
    journal = {Statistical Journal of the International Association for Official Statistics},
    volume = {32},
    year = {2016},
    month = feb,
    pages = {81-92},
    chapter = {81},
    abstract = {Brazil, like many countries, is reluctant to publish business-level data, because of legitimate concerns about the establishments{\textquoteright} confidentiality. A trusted data curator can increase the utility of data, while managing the risk to establishments, either by releasing synthetic data, or by infusing noise into published statistics. This paper evaluates the application of a differentially private mechanism to publish statistics on wages and job mobility computed from Brazilian employer-employee matched data. The publication mechanism can result in both the publication of specific statistics as well as the generation of synthetic data. I find that the tradeoff between the privacy guaranteed to individuals in the data, and the accuracy of published statistics, is potentially much better than the worst-case theoretical accuracy guarantee. However, the synthetic data fare quite poorly in analyses that are outside the set of queries to which it was trained. Note that this article only explores and characterizes the feasibility of these publication strategies, and will not directly result in the publication of any data.},
    keywords = {Demand for public statistics, differential privacy, job mobility, matched employer-employee data, optimal confidentiality protection, optimal data accuracy, technology for statistical agencies},
    doi = {10.3233/SJI-160962},
    url = {http://content.iospress.com/articles/statistical-journal-of-the-iaos/sji962},
    author = {Schmutte, Ian M.}
    }
  • J. Miranda and L. Vilhuber, "Using partially synthetic microdata to protect sensitive cells in business statistics," Statistical Journal of the International Association for Official Statistics, vol. 32, pp. 69-80, 2016.

    We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).

    @article {MirandaVilhuber-SJIAOS-2016,
    title = {Using partially synthetic microdata to protect sensitive cells in business statistics},
    journal = {Statistical Journal of the International Association for Official Statistics},
    volume = {32},
    year = {2016},
    pages = {69-80},
    chapter = {69},
    abstract = {We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau{\textquoteright}s Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).},
    keywords = {confidentiality protection, gross job flows, local labor markets, Statistical Disclosure Limitation, Synthetic data, time-series},
    doi = {10.3233/SJI-160963},
    url = {http://content.iospress.com/download/statistical-journal-of-the-iaos/sji963},
    author = {Miranda, Javier and Vilhuber, Lars}
    }
  • J. M. Abowd and K. L. McKinney, "Noise infusion as a confidentiality protection measure for graph-based statistics," Statistical Journal of the International Association for Official Statistics, vol. 32, pp. 127-135, 2016.

    We use the bipartite graph representation of longitudinally linked employer-employee data, and the associated projections onto the employer and employee nodes, respectively, to characterize the set of potential statistical summaries that the trusted custodian might produce. We consider noise infusion as the primary confidentiality protection method. We show that a relatively straightforward extension of the dynamic noise-infusion method used in the U.S. Census Bureau's Quarterly Workforce Indicators can be adapted to provide the same confidentiality guarantees for the graph-based statistics: all inputs have been modified by a minimum percentage deviation (i.e., no actual respondent data are used) and, as the number of entities contributing to a particular statistic increases, the accuracy of that statistic approaches the unprotected value. Our method also ensures that the protected statistics will be identical in all releases based on the same inputs.

    @article {AbowdMcKinney-SJIAOS-2016,
    title = {Noise infusion as a confidentiality protection measure for graph-based statistics},
    journal = {Statistical Journal of the International Association for Official Statistics},
    volume = {32},
    year = {2016},
    pages = {127-135},
    chapter = {127},
    abstract = {We use the bipartite graph representation of longitudinally linked employer-employee data, and the associated projections onto the employer and employee nodes, respectively, to characterize the set of potential statistical summaries that the trusted custodian might produce. We consider noise infusion as the primary confidentiality protection method. We show that a relatively straightforward extension of the dynamic noise-infusion method used in the U.S. Census Bureau{\textquoteright}s Quarterly Workforce Indicators can be adapted to provide the same confidentiality guarantees for the graph-based statistics: all inputs have been modified by a minimum percentage deviation (i.e., no actual respondent data are used) and, as the number of entities contributing to a particular statistic increases, the accuracy of that statistic approaches the unprotected value. Our method also ensures that the protected statistics will be identical in all releases based on the same inputs.},
    doi = {10.3233/SJI-160958},
    url = {http://content.iospress.com/articles/statistical-journal-of-the-iaos/sji958},
    author = {Abowd, John M. and McKinney, Kevin L.}
    }

Working papers

  • L. Vilhuber, J. M. Abowd, and J. P. Reiter, "Synthetic Establishment Microdata Around the World," NSF Census Research Network - NCRN-Cornell, Document 1813:42340, 2016.

    In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business micro data is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature.

    @TechReport{vilhuber-abowd-reiter-2016-ecommons,
    Title = {{Synthetic Establishment Microdata Around the World}},
    Author = {Lars Vilhuber and John M. Abowd and Jerome P. Reiter},
    institution = {NSF Census Research Network - NCRN-Cornell },
    Year = {2016},
    type={Document},
    number = {1813:42340},
    Abstract = {In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business micro data is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature.},
    DOI = {},
    Keywords = {confidentiality; comparative studies; US Longitudinal Business Database; synthetic data},
    Owner = {vilhuber},
    Timestamp = {2014.03.24},
    URL = {http://hdl.handle.net/1813/42340}
    }
  • J. Miranda and L. Vilhuber, "Using Partially Synthetic Microdata to Protect Sensitive Cells in Business Statistics," Center for Economic Studies, U.S. Census Bureau, Working Papers 16-10, 2016.

    We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).

    @TechReport{RePEc:cen:wpaper:16-10,
    author={Javier Miranda and Lars Vilhuber},
    title={{Using Partially Synthetic Microdata to Protect Sensitive Cells in Business Statistics}},
    year=2016,
    month=Feb,
    institution={Center for Economic Studies, U.S. Census Bureau},
    type={Working Papers},
    url={https://ideas.repec.org/p/cen/wpaper/16-10.html},
    number={16-10},
    abstract={We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).},
    keywords={synthetic data; statistical disclosure limitation; time-series; local labor markets; gross job flows},
    doi={},
    }
  • J. Miranda and L. Vilhuber, "Using Partially Synthetic Microdata to Protect Sensitive Cells in Business Statistics," NSF Census Research Network - NCRN-Cornell, Document 1813:42339, 2016.

    We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).

    @techreport{miranda-vilhuber-2016-ecommons,
    Title = {{Using Partially Synthetic Microdata to Protect Sensitive Cells in Business Statistics}},
    Author = {Miranda, Javier and Vilhuber, Lars},
    institution = {NSF Census Research Network - NCRN-Cornell },
    Year = {2016},
    type={Document},
    number = {1813:42339},
    Abstract = {We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).},
    DOI = {},
    Keywords = {confidentiality; comparative studies; US Longitudinal Business Database; synthetic data},
    Owner = {vilhuber},
    Timestamp = {2014.03.24},
    URL = {http://hdl.handle.net/1813/42339}
    }
  • J. M. Abowd and K. L. McKinney, "Noise infusion as a confidentiality protection measure for graph-based statistics," NSF Census Research Network - NCRN-Cornell, Document 1813:42338, 2016.

    We use the bipartite graph representation of longitudinally linked employer-employee data, and the associated projections onto the employer and employee nodes, respectively, to characterize the set of potential statistical summaries that the trusted custodian might produce. We consider noise infusion as the primary confidentiality protection method. We show that a relatively straightforward extension of the dynamic noise-infusion method used in the U.S. Census Bureau's Quarterly Workforce Indicators can be adapted to provide the same confidentiality guarantees for the graph-based statistics: all inputs have been modified by a minimum percentage deviation (i.e., no actual respondent data are used) and, as the number of entities contributing to a particular statistic increases, the accuracy of that statistic approaches the unprotected value. Our method also ensures that the protected statistics will be identical in all releases based on the same inputs.

    @techreport{AbowdMcKinney-2016-ecommons,
    title = {Noise infusion as a confidentiality protection measure for graph-based statistics},
    year = {2016},
    abstract = {We use the bipartite graph representation of longitudinally linked employer-employee data, and the associated projections onto the employer and employee nodes, respectively, to characterize the set of potential statistical summaries that the trusted custodian might produce. We consider noise infusion as the primary confidentiality protection method. We show that a relatively straightforward extension of the dynamic noise-infusion method used in the U.S. Census Bureau{\textquoteright}s Quarterly Workforce Indicators can be adapted to provide the same confidentiality guarantees for the graph-based statistics: all inputs have been modified by a minimum percentage deviation (i.e., no actual respondent data are used) and, as the number of entities contributing to a particular statistic increases, the accuracy of that statistic approaches the unprotected value. Our method also ensures that the protected statistics will be identical in all releases based on the same inputs.},
    author = {Abowd, John M. and McKinney, Kevin L.},
    institution = {NSF Census Research Network - NCRN-Cornell },
    type={Document},
    number = {1813:42338},
    Owner = {vilhuber},
    Timestamp = {2014.03.24},
    URL = {http://hdl.handle.net/1813/42338}
    }
  • J. M. Abowd and K. L. McKinney, "Noise Infusion As A Confidentiality Protection Measure For Graph-Based Statistics," Center for Economic Studies, U.S. Census Bureau, Working Papers 14-30, 2014.

    We use the bipartite graph representation of longitudinally linked employer-employee data, and the associated projections onto the employer and employee nodes, respectively, to characterize the set of potential statistical summaries that the trusted custodian might produce. We consider noise infusion as the primary confidentiality protection method. We show that a relatively straightforward extension of the dynamic noise-infusion method used in the U.S. Census Bureau’s Quarterly Workforce Indicators can be adapted to provide the same confidentiality guarantees for the graph-based statistics: all inputs have been modified by a minimum percentage deviation (i.e., no actual respondent data are used) and, as the number of entities contributing to a particular statistic increases, the accuracy of that statistic approaches the unprotected value. Our method also ensures that the protected statistics will be identical in all releases based on the same inputs.

    @TechReport{RePEc:cen:wpaper:14-30,
    author={John M. Abowd and Kevin L. McKinney},
    title={{Noise Infusion As A Confidentiality Protection Measure For Graph-Based Statistics}},
    year=2014,
    month=Sep,
    institution={Center for Economic Studies, U.S. Census Bureau},
    type={Working Papers},
    url={https://ideas.repec.org/p/cen/wpaper/14-30.html},
    number={14-30},
    abstract={We use the bipartite graph representation of longitudinally linked employer-employee data, and the associated projections onto the employer and employee nodes, respectively, to characterize the set of potential statistical summaries that the trusted custodian might produce. We consider noise infusion as the primary confidentiality protection method. We show that a relatively straightforward extension of the dynamic noise-infusion method used in the U.S. Census Bureau’s Quarterly Workforce Indicators can be adapted to provide the same confidentiality guarantees for the graph-based statistics: all inputs have been modified by a minimum percentage deviation (i.e., no actual respondent data are used) and, as the number of entities contributing to a particular statistic increases, the accuracy of that statistic approaches the unprotected value. Our method also ensures that the protected statistics will be identical in all releases based on the same inputs.},
    keywords={},
    doi={},
    }

Support

The research presented in this session was supported by NSF grants 0941226, 1131848, 1012593, and 1042181.