World Statistics Congress 2013 - Session on Synthetic LBD

We organized a session at the World Statistics Congress 2013 in Hong Kong on the Synthetic LBD. Session STS062 can be found on the WSC2013 website and in the proceedings, and is replicated below:

Synthetic establishment microdata – Enhancing access to confidential data in novel ways

Location : STS062 (28 Aug 2013, 15:30 - 17:45, Room S423)
Organiser : Lars Vilhuber
15:30 Chair : John M. Abowd
15:40 Paper 1 : Looking back on three years of Synthetic LBD Beta
Lars Vilhuber, Javier Miranda Abstract / Paper / Presentation
16:05 Paper 2 : SynLBD: Providing firm characteristics on synthetic establishment data
Satkartar K. Kinney, Jerome P. Reiter Abstract / Paper / Presentation (J. Abowd presented for Kinney)
16:30 Paper 3 : Replicating the synthetic LBD with German establishment data
Jörg Drechsler, Lars Vilhuber Abstract / Paper / Presentation
16:55 Paper 4 : Expanding the Role of Synthetic Data at the U.S. Census Bureau
Thomas A. Louis, Ron S. Jarmin, Javier Miranda Abstract / Paper / Presentation
17:20 Discussant(s) : Stefan Bender (Discussion)
17:35 : Open discussion

Other publications

The articles above were published in the Census Bureau's Center for Economic Studies Working Paper series and in an open-access issue of the Statistical Journal of the IAOS (SJIAOS). The June 2014 issue of the SJIAOS is available for free at https://madmimi.com/s/ae1dd4.

Articles
  • J. Miranda and L. Vilhuber, "Looking Back On Three Years Of Using The Synthetic LBD Beta," Statistical Journal of the IAOS: Journal of the International Association for Official Statistics, vol. 30, 2014.

    Distributions of business data are typically much more skewed than those for household or individual data and public knowledge of the underlying units is greater. As a result, national statistical offices (NSOs) rarely release establishment or firm-level business microdata due to the risk to respondent confidentiality. One potential approach for overcoming these risks is to release synthetic data where the establishment data are simulated from statistical models designed to mimic the distributions of the real underlying microdata. The US Census Bureau's Center for Economic Studies in collaboration with Duke University, the National Institute of Statistical Sciences, and Cornell University made available a synthetic public use file for the Longitudinal Business Database (LBD) comprising more than 20 million records for all business establishments with paid employees dating back to 1976. The resulting product, dubbed the SynLBD, was released in 2010 and is the first-ever comprehensive business microdata set publicly released in the United States including data on establishments' employment and payroll, birth and death years, and industrial classification. This paper documents the scope of projects that have requested and used the SynLBD.

    @Article{SJIAOS-2014a,
    Title = {{Looking Back On Three Years Of Using The {S}ynthetic {LBD} Beta}},
    Author = {Javier Miranda and Lars Vilhuber},
    Journal = {Statistical Journal of the IAOS: Journal of the International Association for Official Statistics},
    Year = {2014},
    Volume = {30},
    Abstract = {Distributions of business data are typically much more skewed than those for household or individual data and public knowledge of the underlying units is greater. As a result, national statistical offices (NSOs) rarely release establishment or firm-level business microdata due to the risk to respondent confidentiality. One potential approach for overcoming these risks is to release synthetic data where the establishment data are simulated from statistical models designed to mimic the distributions of the real underlying microdata. The US Census Bureau's Center for Economic Studies in collaboration with Duke University, the National Institute of Statistical Sciences, and Cornell University made available a synthetic public use file for the Longitudinal Business Database (LBD) comprising more than 20 million records for all business establishments with paid employees dating back to 1976. The resulting product, dubbed the SynLBD, was released in 2010 and is the first-ever comprehensive business microdata set publicly released in the United States including data on establishments' employment and payroll, birth and death years, and industrial classification. This paper documents the scope of projects that have requested and used the SynLBD.},
    Doi = {10.3233/SJI-140811},
    Keywords = {confidentiality; comparative studies; US Longitudinal Business Database; synthetic data},
    Owner = {vilhuber},
    Timestamp = {2014.03.24},
    Url = {http://content.iospress.com/articles/statistical-journal-of-the-iaos/sji00811}
    }
  • S. K. Kinney, J. P. Reiter, and J. Miranda, "Improving The Synthetic Longitudinal Business Database," Statistical Journal of the IAOS: Journal of the International Association for Official Statistics, vol. 30, iss. 2, 2014.

    In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments’ confidentiality. Agencies potentially can manage these risks by releasing synthetic microdata, i.e., individual establishment records simulated from statistical models designed to mimic the joint distribution of the underlying observed data. Previously, we used this approach to generate a public-use version, now available for public use, of the U.S. Census Bureau’s Longitudinal Business Database (LBD), a longitudinal census of establishments dating back to 1976. While the synthetic LBD has proven to be a useful product, we now seek to improve and expand it by using new synthesis models and adding features. This article describes our efforts to create the second generation of the SynLBD, including synthesis procedures that we believe could be replicated in other contexts.

    @Article{SJIAOS-2014d,
    author={Satkartar K. Kinney and Jerome P. Reiter and Javier Miranda},
    title={{Improving The Synthetic Longitudinal Business Database}},
    Journal = {Statistical Journal of the IAOS: Journal of the International Association for Official Statistics},
    Year = {2014},
    Volume = {30},
    Number = {2},
    Doi = {10.3233/SJI-140808},
    Owner = {vilhuber},
    Timestamp = {2014.03.24},
    abstract={In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments’ confidentiality. Agencies potentially can manage these risks by releasing synthetic microdata, i.e., individual establishment records simulated from statistical models designed to mimic the joint distribution of the underlying observed data. Previously, we used this approach to generate a public-use version, now available for public use, of the U.S. Census Bureau’s Longitudinal Business Database (LBD), a longitudinal census of establishments dating back to 1976. While the synthetic LBD has proven to be a useful product, we now seek to improve and expand it by using new synthesis models and adding features. This article describes our efforts to create the second generation of the SynLBD, including synthesis procedures that we believe could be replicated in other contexts.},
    url = {http://content.iospress.com/articles/statistical-journal-of-the-iaos/sji00808},
    keywords={},
    }
  • R. S. Jarmin, T. A. Louis, and J. Miranda, "Expanding The Role Of Synthetic Data At The U.S. Census Bureau," Statistical Journal of the IAOS: Journal of the International Association for Official Statistics, vol. 30, iss. 2, 2014.

    National statistical offices (NSOs) create official statistics from data collected from survey respondents, government administrative records and other sources. The raw source data is usually considered to be confidential. In the case of the U.S. Census Bureau, confidentiality of survey and administrative records microdata is mandated by statute, and this mandate to protect confidentiality is often at odds with the needs of users to extract as much information from the data as possible. Traditional disclosure protection techniques result in official data products that do not fully utilize the information content of the underlying microdata. Typically, these products take the form of simple aggregate tabulations. In a few cases anonymized public-use micro samples are made available, but these face a growing risk of re-identification by the increasing amounts of information about individuals and firms available in the public domain. One approach for overcoming these risks is to release products based on synthetic data where values are simulated from statistical models designed to mimic the (joint) distributions of the underlying microdata. We discuss recent Census Bureau work to develop and deploy such products. We discuss the benefits and challenges involved with extending the scope of synthetic data products in official statistics.

    @Article{SJIAOS-2014c,
    author={Ron S. Jarmin and Thomas A. Louis and Javier Miranda},
    title={{Expanding The Role Of Synthetic Data At The U.S. Census Bureau}},
    Journal = {Statistical Journal of the IAOS: Journal of the International Association for Official Statistics},
    Year = {2014},
    Volume = {30},
    Number = {2},
    Doi = {10.3233/SJI-140813},
    Owner = {vilhuber},
    Timestamp = {2014.03.24},
    abstract={National statistical offices (NSOs) create official statistics from data collected from survey respondents, government administrative records and other sources. The raw source data is usually considered to be confidential. In the case of the U.S. Census Bureau, confidentiality of survey and administrative records microdata is mandated by statute, and this mandate to protect confidentiality is often at odds with the needs of users to extract as much information from the data as possible. Traditional disclosure protection techniques result in official data products that do not fully utilize the information content of the underlying microdata. Typically, these products take the form of simple aggregate tabulations. In a few cases anonymized public-use micro samples are made available, but these face a growing risk of re-identification by the increasing amounts of information about individuals and firms available in the public domain. One approach for overcoming these risks is to release products based on synthetic data where values are simulated from statistical models designed to mimic the (joint) distributions of the underlying microdata. We discuss recent Census Bureau work to develop and deploy such products. We discuss the benefits and challenges involved with extending the scope of synthetic data products in official statistics.},
    keywords={confidentiality; synthetic micro data; official statistics},
    Url = {http://content.iospress.com/articles/statistical-journal-of-the-iaos/sji00813}
    }
  • J. Drechsler and L. Vilhuber, "A First Step Towards A German SynLBD: Constructing A German Longitudinal Business Database," Statistical Journal of the IAOS: Journal of the International Association for Official Statistics, vol. 30, iss. 2, 2014.

    One major criticism against the use of synthetic data has been that the efforts necessary to generate useful synthetic data are so intense that many statistical agencies cannot afford them. We argue many lessons in this evolving field have been learned in the early years of synthetic data generation, and can be used in the development of new synthetic data products, considerably reducing the required investments. The final goal of the project described in this paper will be to evaluate whether synthetic data algorithms developed in the U.S. to generate a synthetic version of the Longitudinal Business Database (LBD) can easily be transferred to generate a similar data product for other countries. We construct a German data product with information comparable to the LBD - the German Longitudinal Business Database (GLBD) - that is generated from different administrative sources at the Institute for Employment Research, Germany. In a future step, the algorithms developed for the synthesis of the LBD will be applied to the GLBD. Extensive evaluations will illustrate whether the algorithms provide useful synthetic data without further adjustment. The ultimate goal of the project is to provide access to multiple synthetic datasets similar to the SynLBD at Cornell to enable comparative studies between countries. The Synthetic GLBD is a first step towards that goal.

    @Article{SJIAOS-2014b,
    Title = {{A First Step Towards A {German} {SynLBD}: {C}onstructing A {G}erman {L}ongitudinal {B}usiness {D}atabase}},
    Author = {J{\"o}rg Drechsler and Lars Vilhuber},
    Journal = {Statistical Journal of the IAOS: Journal of the International Association for Official Statistics},
    Year = {2014},
    Volume = {30},
    Number = {2},
    Abstract = {One major criticism against the use of synthetic data has been that the efforts necessary to generate useful synthetic data are so intense that many statistical agencies cannot afford them. We argue many lessons in this evolving field have been learned in the early years of synthetic data generation, and can be used in the development of new synthetic data products, considerably reducing the required investments. The final goal of the project described in this paper will be to evaluate whether synthetic data algorithms developed in the U.S. to generate a synthetic version of the Longitudinal Business Database (LBD) can easily be transferred to generate a similar data product for other countries. We construct a German data product with information comparable to the LBD - the German Longitudinal Business Database (GLBD) - that is generated from different administrative sources at the Institute for Employment Research, Germany. In a future step, the algorithms developed for the synthesis of the LBD will be applied to the GLBD. Extensive evaluations will illustrate whether the algorithms provide useful synthetic data without further adjustment. The ultimate goal of the project is to provide access to multiple synthetic datasets similar to the SynLBD at Cornell to enable comparative studies between countries. The Synthetic GLBD is a first step towards that goal.},
    Doi = {10.3233/SJI-140812},
    Keywords = {confidentiality; comparative studies; US Longitudinal Business Database; synthetic data},
    Owner = {vilhuber},
    Timestamp = {2014.03.24},
    Url = {http://content.iospress.com/articles/statistical-journal-of-the-iaos/sji00812}
    }
  • J. M. Abowd, "Synthetic establishment data: Origins and introduction to current research," Statistical Journal of the IAOS: Journal of the International Association for Official Statistics, vol. 30, iss. 2, 2014.
    @Article{SJIAOS-2014e,
    author={John M. Abowd},
    title={Synthetic establishment data: Origins and introduction to current research},
    Journal = {Statistical Journal of the IAOS: Journal of the International Association for Official Statistics},
    Year = {2014},
    Volume = {30},
    Number = {2},
    Doi = {10.3233/SJI-140810},
    Owner = {vilhuber},
    Timestamp = {2014.03.24},
    url = {http://content.iospress.com/articles/statistical-journal-of-the-iaos/sji00810}
    }
Technical Reports
  • J. Miranda and L. Vilhuber, "Looking Back On Three Years Of Using The Synthetic LBD Beta," Center for Economic Studies, U.S. Census Bureau, Working Papers 14-11, 2014.

    Distributions of business data are typically much more skewed than those for household or individual data and public knowledge of the underlying units is greater. As a result, national statistical offices (NSOs) rarely release establishment or firm-level business microdata due to the risk to respondent confidentiality. One potential approach for overcoming these risks is to release synthetic data where the establishment data are simulated from statistical models designed to mimic the distributions of the real underlying microdata. The US Census Bureau's Center for Economic Studies in collaboration with Duke University, the National Institute of Statistical Sciences, and Cornell University made available a synthetic public use file for the Longitudinal Business Database (LBD) comprising more than 20 million records for all business establishments with paid employees dating back to 1976. The resulting product, dubbed the SynLBD, was released in 2010 and is the first-ever comprehensive business microdata set publicly released in the United States including data on establishments' employment and payroll, birth and death years, and industrial classification. This paper documents the scope of projects that have requested and used the SynLBD.

    @TechReport{RePEc:cen:wpaper:14-11,
    Title = {{Looking Back On Three Years Of Using The {S}ynthetic {LBD} Beta}},
    Author = {Javier Miranda and Lars Vilhuber},
    Institution = {Center for Economic Studies, U.S. Census Bureau},
    Year = {2014},
    Month = Feb,
    Number = {14-11},
    Type = {Working Papers},
    Abstract = {Distributions of business data are typically much more skewed than those for household or individual data and public knowledge of the underlying units is greater. As a result, national statistical offices (NSOs) rarely release establishment or firm-level business microdata due to the risk to respondent confidentiality. One potential approach for overcoming these risks is to release synthetic data where the establishment data are simulated from statistical models designed to mimic the distributions of the real underlying microdata. The US Census Bureau's Center for Economic Studies in collaboration with Duke University, the National Institute of Statistical Sciences, and Cornell University made available a synthetic public use file for the Longitudinal Business Database (LBD) comprising more than 20 million records for all business establishments with paid employees dating back to 1976. The resulting product, dubbed the SynLBD, was released in 2010 and is the first-ever comprehensive business microdata set publicly released in the United States including data on establishments' employment and payroll, birth and death years, and industrial classification. This paper documents the scope of projects that have requested and used the SynLBD.},
    Keywords = {confidentiality; comparative studies; US Longitudinal Business Database; synthetic data},
    Owner = {vilhuber},
    Timestamp = {2014.03.24},
    Url = {http://ideas.repec.org/p/cen/wpaper/14-11.html}
    }
  • S. K. Kinney, J. P. Reiter, and J. Miranda, "Improving The Synthetic Longitudinal Business Database," Center for Economic Studies, U.S. Census Bureau, Working Papers 14-12, 2014.

    In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments’ confidentiality. Agencies potentially can manage these risks by releasing synthetic microdata, i.e., individual establishment records simulated from statistical models designed to mimic the joint distribution of the underlying observed data. Previously, we used this approach to generate a public-use version, now available for public use, of the U.S. Census Bureau’s Longitudinal Business Database (LBD), a longitudinal census of establishments dating back to 1976. While the synthetic LBD has proven to be a useful product, we now seek to improve and expand it by using new synthesis models and adding features. This article describes our efforts to create the second generation of the SynLBD, including synthesis procedures that we believe could be replicated in other contexts.

    @TechReport{RePEc:cen:wpaper:14-12,
    author={Satkartar K. Kinney and Jerome P. Reiter and Javier Miranda},
    title={{Improving The Synthetic Longitudinal Business Database}},
    year=2014,
    month=Feb,
    institution={Center for Economic Studies, U.S. Census Bureau},
    type={Working Papers},
    url={http://ideas.repec.org/p/cen/wpaper/14-12.html},
    number={14-12},
    abstract={In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments’ confidentiality. Agencies potentially can manage these risks by releasing synthetic microdata, i.e., individual establishment records simulated from statistical models designed to mimic the joint distribution of the underlying observed data. Previously, we used this approach to generate a public-use version, now available for public use, of the U.S. Census Bureau’s Longitudinal Business Database (LBD), a longitudinal census of establishments dating back to 1976. While the synthetic LBD has proven to be a useful product, we now seek to improve and expand it by using new synthesis models and adding features. This article describes our efforts to create the second generation of the SynLBD, including synthesis procedures that we believe could be replicated in other contexts.},
    keywords={},
    }
  • R. S. Jarmin, T. A. Louis, and J. Miranda, "Expanding The Role Of Synthetic Data At The U.S. Census Bureau," Center for Economic Studies, U.S. Census Bureau, Working Papers 14-10, 2014.

    National statistical offices (NSOs) create official statistics from data collected from survey respondents, government administrative records and other sources. The raw source data is usually considered to be confidential. In the case of the U.S. Census Bureau, confidentiality of survey and administrative records microdata is mandated by statute, and this mandate to protect confidentiality is often at odds with the needs of users to extract as much information from the data as possible. Traditional disclosure protection techniques result in official data products that do not fully utilize the information content of the underlying microdata. Typically, these products take the form of simple aggregate tabulations. In a few cases anonymized public-use micro samples are made available, but these face a growing risk of re-identification by the increasing amounts of information about individuals and firms available in the public domain. One approach for overcoming these risks is to release products based on synthetic data where values are simulated from statistical models designed to mimic the (joint) distributions of the underlying microdata. We discuss recent Census Bureau work to develop and deploy such products. We discuss the benefits and challenges involved with extending the scope of synthetic data products in official statistics.

    @TechReport{RePEc:cen:wpaper:14-10,
    author={Ron S. Jarmin and Thomas A. Louis and Javier Miranda},
    title={{Expanding The Role Of Synthetic Data At The U.S. Census Bureau}},
    year=2014,
    month=Feb,
    institution={Center for Economic Studies, U.S. Census Bureau},
    type={Working Papers},
    url={http://ideas.repec.org/p/cen/wpaper/14-10.html},
    number={14-10},
    abstract={National statistical offices (NSOs) create official statistics from data collected from survey respondents, government administrative records and other sources. The raw source data is usually considered to be confidential. In the case of the U.S. Census Bureau, confidentiality of survey and administrative records microdata is mandated by statute, and this mandate to protect confidentiality is often at odds with the needs of users to extract as much information from the data as possible. Traditional disclosure protection techniques result in official data products that do not fully utilize the information content of the underlying microdata. Typically, these products take the form of simple aggregate tabulations. In a few cases anonymized public-use micro samples are made available, but these face a growing risk of re-identification by the increasing amounts of information about individuals and firms available in the public domain. One approach for overcoming these risks is to release products based on synthetic data where values are simulated from statistical models designed to mimic the (joint) distributions of the underlying microdata. We discuss recent Census Bureau work to develop and deploy such products. We discuss the benefits and challenges involved with extending the scope of synthetic data products in official statistics.},
    keywords={confidentiality; synthetic micro data; official statistics},
    }
  • J. Drechsler and L. Vilhuber, "A First Step Towards A German SynLBD: Constructing A German Longitudinal Business Database," Center for Economic Studies, U.S. Census Bureau, Working Papers 14-13, 2014.

    One major criticism against the use of synthetic data has been that the efforts necessary to generate useful synthetic data are so intense that many statistical agencies cannot afford them. We argue many lessons in this evolving field have been learned in the early years of synthetic data generation, and can be used in the development of new synthetic data products, considerably reducing the required investments. The final goal of the project described in this paper will be to evaluate whether synthetic data algorithms developed in the U.S. to generate a synthetic version of the Longitudinal Business Database (LBD) can easily be transferred to generate a similar data product for other countries. We construct a German data product with information comparable to the LBD - the German Longitudinal Business Database (GLBD) - that is generated from different administrative sources at the Institute for Employment Research, Germany. In a future step, the algorithms developed for the synthesis of the LBD will be applied to the GLBD. Extensive evaluations will illustrate whether the algorithms provide useful synthetic data without further adjustment. The ultimate goal of the project is to provide access to multiple synthetic datasets similar to the SynLBD at Cornell to enable comparative studies between countries. The Synthetic GLBD is a first step towards that goal.

    @TechReport{RePEc:cen:wpaper:14-13,
    Title = {{A First Step Towards A {German} {SynLBD}: {C}onstructing A {G}erman {L}ongitudinal {B}usiness {D}atabase}},
    Author = {J{\"o}rg Drechsler and Lars Vilhuber},
    Institution = {Center for Economic Studies, U.S. Census Bureau},
    Year = {2014},
    Month = Feb,
    Number = {14-13},
    Type = {Working Papers},
    Abstract = {One major criticism against the use of synthetic data has been that the efforts necessary to generate useful synthetic data are so intense that many statistical agencies cannot afford them. We argue many lessons in this evolving field have been learned in the early years of synthetic data generation, and can be used in the development of new synthetic data products, considerably reducing the required investments. The final goal of the project described in this paper will be to evaluate whether synthetic data algorithms developed in the U.S. to generate a synthetic version of the Longitudinal Business Database (LBD) can easily be transferred to generate a similar data product for other countries. We construct a German data product with information comparable to the LBD - the German Longitudinal Business Database (GLBD) - that is generated from different administrative sources at the Institute for Employment Research, Germany. In a future step, the algorithms developed for the synthesis of the LBD will be applied to the GLBD. Extensive evaluations will illustrate whether the algorithms provide useful synthetic data without further adjustment. The ultimate goal of the project is to provide access to multiple synthetic datasets similar to the SynLBD at Cornell to enable comparative studies between countries. The Synthetic GLBD is a first step towards that goal.},
    Keywords = {confidentiality; comparative studies; German Longitudinal Business Database; synthetic data},
    Owner = {vilhuber},
    Timestamp = {2014.03.24},
    Url = {http://ideas.repec.org/p/cen/wpaper/14-13.html}
    }

Funding acknowledgement

The organization of this session and the participation of Vilhuber and Abowd were partially funded through NSF grant SES-1042181.