SDS Bibliography

Papers and publications

Papers that used the resources provided by the Synthetic Data Server or were funded by NSF Grant SES-1042181:

Publications

2016
  • J. M. Abowd and K. L. McKinney, "Noise infusion as a confidentiality protection measure for graph-based statistics," Statistical Journal of the International Association for Official Statistics, vol. 32, pp. 127-135, 2016.

    We use the bipartite graph representation of longitudinally linked employer-employee data, and the associated projections onto the employer and employee nodes, respectively, to characterize the set of potential statistical summaries that the trusted custodian might produce. We consider noise infusion as the primary confidentiality protection method. We show that a relatively straightforward extension of the dynamic noise-infusion method used in the U.S. Census Bureau’s Quarterly Workforce Indicators can be adapted to provide the same confidentiality guarantees for the graph-based statistics: all inputs have been modified by a minimum percentage deviation (i.e., no actual respondent data are used) and, as the number of entities contributing to a particular statistic increases, the accuracy of that statistic approaches the unprotected value. Our method also ensures that the protected statistics will be identical in all releases based on the same inputs.

    @article {AbowdMcKinney-SJIAOS-2016,
    title = {Noise infusion as a confidentiality protection measure for graph-based statistics},
    journal = {Statistical Journal of the International Association for Official Statistics},
    volume = {32},
    year = {2016},
    pages = {127-135},
    chapter = {127},
    abstract = {We use the bipartite graph representation of longitudinally linked employer-employee data, and the associated projections onto the employer and employee nodes, respectively, to characterize the set of potential statistical summaries that the trusted custodian might produce. We consider noise infusion as the primary confidentiality protection method. We show that a relatively straightforward extension of the dynamic noise-infusion method used in the U.S. Census Bureau{\textquoteright}s Quarterly Workforce Indicators can be adapted to provide the same confidentiality guarantees for the graph-based statistics: all inputs have been modified by a minimum percentage deviation (i.e., no actual respondent data are used) and, as the number of entities contributing to a particular statistic increases, the accuracy of that statistic approaches the unprotected value. Our method also ensures that the protected statistics will be identical in all releases based on the same inputs.},
    doi = {10.3233/SJI-160958},
    url = {http://content.iospress.com/articles/statistical-journal-of-the-iaos/sji958},
    author = {Abowd, John M. and McKinney, Kevin L.}
    }
  • J. Miranda and L. Vilhuber, "Using partially synthetic microdata to protect sensitive cells in business statistics," Statistical Journal of the IAOS, vol. 32, iss. 1, pp. 69-80, 2016.

    We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).

    @Article{MirandaVilhuber-SJIAOS2016,
    author = {Javier Miranda and Lars Vilhuber},
    title = {Using partially synthetic microdata to protect sensitive cells in business statistics},
    journal = {Statistical Journal of the IAOS},
    year = {2016},
    volume = {32},
    number = {1},
    pages = {69--80},
    month = {Feb},
    abstract = {We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).},
    doi = {10.3233/SJI-160963},
    file = {:MirandaVilhuber-SJIAOS2016.pdf:PDF},
    issn = {1874-7655},
    owner = {vilhuber},
    publisher = {IOS Press},
    timestamp = {2016.09.30},
    url = {http://doi.org/10.3233/SJI-160963},
    }
  • J. Miranda and L. Vilhuber, "Using partially synthetic microdata to protect sensitive cells in business statistics," Statistical Journal of the International Association for Official Statistics, vol. 32, pp. 69-80, 2016.

    We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau’s Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).

    @article {MirandaVilhuber-SJIAOS-2016,
    title = {Using partially synthetic microdata to protect sensitive cells in business statistics},
    journal = {Statistical Journal of the International Association for Official Statistics},
    volume = {32},
    year = {2016},
    month = {Feb},
    pages = {69-80},
    chapter = {69},
    abstract = {We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau{\textquoteright}s Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).},
    keywords = {confidentiality protection, gross job flows, local labor markets, Statistical Disclosure Limitation, Synthetic data, time-series},
    doi = {10.3233/SJI-160963},
    url = {http://content.iospress.com/download/statistical-journal-of-the-iaos/sji963},
    author = {Miranda, Javier and Vilhuber, Lars}
    }
  • I. M. Schmutte, "Differentially private publication of data on wages and job mobility," Statistical Journal of the International Association for Official Statistics, vol. 32, pp. 81-92, 2016.

    Brazil, like many countries, is reluctant to publish business-level data, because of legitimate concerns about the establishments’ confidentiality. A trusted data curator can increase the utility of data, while managing the risk to establishments, either by releasing synthetic data, or by infusing noise into published statistics. This paper evaluates the application of a differentially private mechanism to publish statistics on wages and job mobility computed from Brazilian employer-employee matched data. The publication mechanism can result in both the publication of specific statistics as well as the generation of synthetic data. I find that the tradeoff between the privacy guaranteed to individuals in the data, and the accuracy of published statistics, is potentially much better than the worst-case theoretical accuracy guarantee. However, the synthetic data fare quite poorly in analyses that are outside the set of queries to which it was trained. Note that this article only explores and characterizes the feasibility of these publication strategies, and will not directly result in the publication of any data.

    @article {Schmutte-SJIAOS-2016,
    title = {Differentially private publication of data on wages and job mobility},
    journal = {Statistical Journal of the International Association for Official Statistics},
    volume = {32},
    year = {2016},
    month = {Feb},
    pages = {81-92},
    chapter = {81},
    abstract = {Brazil, like many countries, is reluctant to publish business-level data, because of legitimate concerns about the establishments{\textquoteright} confidentiality. A trusted data curator can increase the utility of data, while managing the risk to establishments, either by releasing synthetic data, or by infusing noise into published statistics. This paper evaluates the application of a differentially private mechanism to publish statistics on wages and job mobility computed from Brazilian employer-employee matched data. The publication mechanism can result in both the publication of specific statistics as well as the generation of synthetic data. I find that the tradeoff between the privacy guaranteed to individuals in the data, and the accuracy of published statistics, is potentially much better than the worst-case theoretical accuracy guarantee. However, the synthetic data fare quite poorly in analyses that are outside the set of queries to which it was trained. Note that this article only explores and characterizes the feasibility of these publication strategies, and will not directly result in the publication of any data.},
    keywords = {Demand for public statistics, differential privacy, job mobility, matched employer-employee data, optimal confidentiality protection, optimal data accuracy, technology for statistical agencies},
    doi = {10.3233/SJI-160962},
    url = {http://content.iospress.com/articles/statistical-journal-of-the-iaos/sji962},
    author = {Schmutte, Ian M.}
    }
  • L. Vilhuber, J. M. Abowd, and J. P. Reiter, "Synthetic establishment microdata around the world," Statistical Journal of the IAOS, vol. 32, iss. 1, pp. 65-68, 2016.

    In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business microdata is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature.

    @Article{VilhuberAbowdReiter-SJIAOS2016,
    author = {Lars Vilhuber and John M. Abowd and Jerome P. Reiter},
    title = {Synthetic establishment microdata around the world},
    journal = {Statistical Journal of the IAOS},
    year = {2016},
    volume = {32},
    number = {1},
    pages = {65--68},
    month = {Feb},
    abstract = {In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business microdata is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic \emph{establishment} microdata. This overview situates those papers, published in this issue, within the broader literature.},
    doi = {10.3233/SJI-160964},
    file = {:VilhuberAbowdReiter-SJIAOS2016.pdf:PDF},
    issn = {1874-7655},
    owner = {vilhuber},
    publisher = {IOS Press},
    timestamp = {2016.09.30},
    url = {http://doi.org/10.3233/SJI-160964},
    }
  • L. Vilhuber, J. M. Abowd, and J. P. Reiter, "Synthetic establishment microdata around the world," Statistical Journal of the International Association for Official Statistics, vol. 32, pp. 65-68, 2016.

    In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business microdata is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature.

    @article {VilhuberAbowdReiter-SJIAOS-2016,
    title = {Synthetic establishment microdata around the world},
    journal = {Statistical Journal of the International Association for Official Statistics},
    volume = {32},
    year = {2016},
    pages = {65-68},
    chapter = {65},
    abstract = {In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business microdata is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic \emph{establishment} microdata. This overview situates those papers, published in this issue, within the broader literature.},
    keywords = {Business data, confidentiality, differential privacy, international comparison, Multiple imputation, synthetic},
    doi = {10.3233/SJI-160964},
    url = {http://content.iospress.com/download/statistical-journal-of-the-iaos/sji964},
    author = {Vilhuber, Lars and Abowd, John M. and Reiter, Jerome P.}
    }
2015
  • M. Bertrand, E. Kamenica, and J. Pan, "Gender Identity and Relative Income within Households," The Quarterly Journal of Economics, vol. 130, iss. 2, 2015.

    We examine causes and consequences of relative income within households. We show that the distribution of the share of income earned by the wife exhibits a sharp drop to the right of 1/2, where the wife’s income exceeds the husband’s income. We argue that this pattern is best explained by gender identity norms, which induce an aversion to a situation where the wife earns more than her husband. We present evidence that this aversion also impacts marriage formation, the wife’s labor force participation, the wife’s income conditional on working, marriage satisfaction, likelihood of divorce, and the division of home production. Within marriage markets, when a randomly chosen woman becomes more likely to earn more than a randomly chosen man, marriage rates decline. In couples where the wife’s potential income is likely to exceed the husband’s, the wife is less likely to be in the labor force and earns less than her potential if she does work. In couples where the wife earns more than the husband, the wife spends more time on household chores; moreover, those couples are less satisfied with their marriage and are more likely to divorce. These patterns hold both cross-sectionally and within couples over time. JEL Codes: D10, J12, J16.

    @article{Bertrand29012015,
    author = {Bertrand, Marianne and Kamenica, Emir and Pan, Jessica},
    title = {Gender Identity and Relative Income within Households},
    year = {2015},
    volume = 130,
    number = 2,
    doi = {10.1093/qje/qjv001},
    abstract ={We examine causes and consequences of relative income within households. We show that the distribution of the share of income earned by the wife exhibits a sharp drop to the right of 1/2, where the wife’s income exceeds the husband’s income. We argue that this pattern is best explained by gender identity norms, which induce an aversion to a situation where the wife earns more than her husband. We present evidence that this aversion also impacts marriage formation, the wife’s labor force participation, the wife’s income conditional on working, marriage satisfaction, likelihood of divorce, and the division of home production. Within marriage markets, when a randomly chosen woman becomes more likely to earn more than a randomly chosen man, marriage rates decline. In couples where the wife’s potential income is likely to exceed the husband’s, the wife is less likely to be in the labor force and earns less than her potential if she does work. In couples where the wife earns more than the husband, the wife spends more time on household chores; moreover, those couples are less satisfied with their marriage and are more likely to divorce. These patterns hold both cross-sectionally and within couples over time. JEL Codes: D10, J12, J16.},
    URL = {http://qje.oxfordjournals.org/content/early/2015/04/11/qje.qjv001.abstract},
    eprint = {http://qje.oxfordjournals.org/content/early/2015/04/11/qje.qjv001.full.pdf+html},
    journal = {The Quarterly Journal of Economics}
    }
2014
  • J. M. Abowd, "Synthetic establishment data: Origins and introduction to current research," Statistical Journal of the IAOS: Journal of the International Association for Official Statistics, vol. 30, iss. 2, 2014.
    @Article{SJIAOS-2014e,
    author={John M. Abowd},
    title={Synthetic establishment data: Origins and introduction to current research},
    Journal = {Statistical Journal of the IAOS: Journal of the International Association for Official Statistics},
    Year = {2014},
    Volume = {30},
    Number = {2},
    Doi = {10.3233/SJI-140810},
    Owner = {vilhuber},
    Timestamp = {2014.03.24},
    url = {http://content.iospress.com/articles/statistical-journal-of-the-iaos/sji00810}
    }
  • J. Drechsler and L. Vilhuber, "Synthetic Longitudinal Business Databases for International Comparisons," in Privacy in Statistical Databases, J. Domingo-Ferrer, Ed., Springer International Publishing, 2014, vol. 8744, pp. 243-252.

    International comparison studies on economic activity are often hampered by the fact that access to business microdata is very limited on an international level. A recently launched project tries to overcome these limitations by improving access to Business Censuses from multiple countries based on synthetic data. Starting from the synthetic version of the longitudinally edited version of the U.S. Business Register (the Longitudinal Business Database, LBD), the idea is to create similar data products in other countries by applying the synthesis methodology developed for the LBD to generate synthetic replicates that could be distributed without confidentiality concerns. In this paper we present some first results of this project based on German business data collected at the Institute for Employment Research.

    @InCollection{psd2014b,
    author = {Drechsler, J\"org and Vilhuber, Lars},
    title = {Synthetic Longitudinal Business Databases for International Comparisons},
    booktitle = {Privacy in Statistical Databases},
    publisher = {Springer International Publishing},
    year = {2014},
    editor = {Domingo-Ferrer, Josep},
    volume = {8744},
    series = {Lecture Notes in Computer Science},
    pages = {243-252},
    abstract = {International comparison studies on economic activity are often hampered by the fact that access to business microdata is very limited on an international level. A recently launched project tries to overcome these limitations by improving access to Business Censuses from multiple countries based on synthetic data. Starting from the synthetic version of the longitudinally edited version of the U.S. Business Register (the Longitudinal Business Database, LBD), the idea is to create similar data products in other countries by applying the synthesis methodology developed for the LBD to generate synthetic replicates that could be distributed without confidentiality concerns. In this paper we present some first results of this project based on German business data collected at the Institute for Employment Research.},
    doi = {10.1007/978-3-319-11257-2_19},
    isbn = {978-3-319-11256-5},
    keywords = {business data; confidentiality; international comparison; multiple imputation; synthetic},
    language = {English},
    owner = {vilhuber},
    timestamp = {2016.10.17},
    url = {http://dx.doi.org/10.1007/978-3-319-11257-2_19},
    }
  • J. Drechsler and L. Vilhuber, "A First Step Towards A German SynLBD: Constructing A German Longitudinal Business Database," Statistical Journal of the IAOS: Journal of the International Association for Official Statistics, vol. 30, iss. 2, 2014.

    One major criticism against the use of synthetic data has been that the efforts necessary to generate useful synthetic data are so intense that many statistical agencies cannot afford them. We argue many lessons in this evolving field have been learned in the early years of synthetic data generation, and can be used in the development of new synthetic data products, considerably reducing the required investments. The final goal of the project described in this paper will be to evaluate whether synthetic data algorithms developed in the U.S. to generate a synthetic version of the Longitudinal Business Database (LBD) can easily be transferred to generate a similar data product for other countries. We construct a German data product with information comparable to the LBD - the German Longitudinal Business Database (GLBD) - that is generated from different administrative sources at the Institute for Employment Research, Germany. In a future step, the algorithms developed for the synthesis of the LBD will be applied to the GLBD. Extensive evaluations will illustrate whether the algorithms provide useful synthetic data without further adjustment. The ultimate goal of the project is to provide access to multiple synthetic datasets similar to the SynLBD at Cornell to enable comparative studies between countries. The Synthetic GLBD is a first step towards that goal.

    @Article{SJIAOS-2014b,
    Title = {{A First Step Towards A {German} {SynLBD}: {C}onstructing A {G}erman {L}ongitudinal {B}usiness {D}atabase}},
    Author = {J{\"o}rg Drechsler and Lars Vilhuber},
    Journal = {Statistical Journal of the IAOS: Journal of the International Association for Official Statistics},
    Year = {2014},
    Volume = {30},
    Number = {2},
    Abstract = {One major criticism against the use of synthetic data has been that the efforts necessary to generate useful synthetic data are so intense that many statistical agencies cannot afford them. We argue many lessons in this evolving field have been learned in the early years of synthetic data generation, and can be used in the development of new synthetic data products, considerably reducing the required investments. The final goal of the project described in this paper will be to evaluate whether synthetic data algorithms developed in the U.S. to generate a synthetic version of the Longitudinal Business Database (LBD) can easily be transferred to generate a similar data product for other countries. We construct a German data product with information comparable to the LBD - the German Longitudinal Business Database (GLBD) - that is generated from different administrative sources at the Institute for Employment Research, Germany. In a future step, the algorithms developed for the synthesis of the LBD will be applied to the GLBD. Extensive evaluations will illustrate whether the algorithms provide useful synthetic data without further adjustment. The ultimate goal of the project is to provide access to multiple synthetic datasets similar to the SynLBD at Cornell to enable comparative studies between countries. The Synthetic GLBD is a first step towards that goal.},
    Doi = {10.3233/SJI-140812},
    Keywords = {confidentiality; comparative studies; US Longitudinal Business Database; synthetic data},
    Owner = {vilhuber},
    Timestamp = {2014.03.24},
    Url = {http://content.iospress.com/articles/statistical-journal-of-the-iaos/sji00812}
    }
  • R. S. Jarmin, T. A. Louis, and J. Miranda, "Expanding The Role Of Synthetic Data At The U.S. Census Bureau," Statistical Journal of the IAOS: Journal of the International Association for Official Statistics, vol. 30, iss. 2, 2014.

    National Statistical offices (NSOs) create official statistics from data collected from survey respondents, government administrative records and other sources. The raw source data is usually considered to be confidential. In the case of the U.S. Census Bureau, confidentiality of survey and administrative records microdata is mandated by statute, and this mandate to protect confidentiality is often at odds with the needs of users to extract as much information from the data as possible. Traditional disclosure protection techniques result in official data products that do not fully utilize the information content of the underlying microdata. Typically, these products take the form of simple aggregate tabulations. In a few cases anonymized public-use micro samples are made available, but these face a growing risk of re-identification by the increasing amounts of information about individuals and firms available in the public domain. One approach for overcoming these risks is to release products based on synthetic data where values are simulated from statistical models designed to mimic the (joint) distributions of the underlying microdata. We discuss recent Census Bureau work to develop and deploy such products. We discuss the benefits and challenges involved with extending the scope of synthetic data products in official statistics.

    @Article{SJIAOS-2014c,
    author={Ron S. Jarmin and Thomas A. Louis and Javier Miranda},
    title={{Expanding The Role Of Synthetic Data At The U.S. Census Bureau}},
    Journal = {Statistical Journal of the IAOS: Journal of the International Association for Official Statistics},
    Year = {2014},
    Volume = {30},
    Number = {2},
    Doi = {10.3233/SJI-140813},
    Owner = {vilhuber},
    Timestamp = {2014.03.24},
    abstract={National Statistical offices (NSOs) create official statistics from data collected from survey respondents, government administrative records and other sources. The raw source data is usually considered to be confidential. In the case of the U.S. Census Bureau, confidentiality of survey and administrative records microdata is mandated by statute, and this mandate to protect confidentiality is often at odds with the needs of users to extract as much information from the data as possible. Traditional disclosure protection techniques result in official data products that do not fully utilize the information content of the underlying microdata. Typically, these products take the form of simple aggregate tabulations. In a few cases anonymized public-use micro samples are made available, but these face a growing risk of re-identification by the increasing amounts of information about individuals and firms available in the public domain. One approach for overcoming these risks is to release products based on synthetic data where values are simulated from statistical models designed to mimic the (joint) distributions of the underlying microdata. We discuss recent Census Bureau work to develop and deploy such products. We discuss the benefits and challenges involved with extending the scope of synthetic data products in official statistics.},
    keywords={confidentiality; synthetic micro data; official statistics},
    Url = {http://content.iospress.com/articles/statistical-journal-of-the-iaos/sji00813}
    }
  • S. K. Kinney, J. P. Reiter, and J. Miranda, "Improving The Synthetic Longitudinal Business Database," Statistical Journal of the IAOS: Journal of the International Association for Official Statistics, vol. 30, iss. 2, 2014.

    In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments’ confidentiality. Agencies potentially can manage these risks by releasing synthetic microdata, i.e., individual establishment records simulated from statistical models designed to mimic the joint distribution of the underlying observed data. Previously, we used this approach to generate a public-use version, now available for public use, of the U.S. Census Bureau’s Longitudinal Business Database (LBD), a longitudinal census of establishments dating back to 1976. While the synthetic LBD has proven to be a useful product, we now seek to improve and expand it by using new synthesis models and adding features. This article describes our efforts to create the second generation of the SynLBD, including synthesis procedures that we believe could be replicated in other contexts.

    @Article{SJIAOS-2014d,
    author={Satkartar K. Kinney and Jerome P. Reiter and Javier Miranda},
    title={{Improving The Synthetic Longitudinal Business Database}},
    Journal = {Statistical Journal of the IAOS: Journal of the International Association for Official Statistics},
    Year = {2014},
    Volume = {30},
    Number = {2},
    Doi = {10.3233/SJI-140808},
    Owner = {vilhuber},
    Timestamp = {2014.03.24},
    abstract={In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments’ confidentiality. Agencies potentially can manage these risks by releasing synthetic microdata, i.e., individual establishment records simulated from statistical models designed to mimic the joint distribution of the underlying observed data. Previously, we used this approach to generate a public-use version, now available for public use, of the U.S. Census Bureau’s Longitudinal Business Database (LBD), a longitudinal census of establishments dating back to 1976. While the synthetic LBD has proven to be a useful product, we now seek to improve and expand it by using new synthesis models and adding features. This article describes our efforts to create the second generation of the SynLBD, including synthesis procedures that we believe could be replicated in other contexts.},
    url = {http://content.iospress.com/articles/statistical-journal-of-the-iaos/sji00808},
    keywords={},
    }
  • J. Miranda and L. Vilhuber, "Using Partially Synthetic Data to Replace Suppression in the Business Dynamics Statistics: Early Results," in Privacy in Statistical Databases, J. Domingo-Ferrer, Ed., Springer International Publishing, 2014, vol. 8744, pp. 232-242.

    The Business Dynamics Statistics is a product of the U.S. Census Bureau that provides measures of business openings and closings, and job creation and destruction, by a variety of cross-classifications (firm and establishment age and size, industrial sector, and geography). Sensitive data are currently protected through suppression. However, as additional tabulations are being developed, at ever more detailed geographic levels, the number of suppressions increases dramatically. This paper explores the option of providing public-use data that are analytically valid and without suppressions, by leveraging synthetic data to replace observations in sensitive cells.

    @InCollection{psd2014a,
    Title = {Using Partially Synthetic Data to Replace Suppression in the Business Dynamics Statistics: Early Results},
    Author = {Miranda, Javier and Vilhuber, Lars},
    Booktitle = {Privacy in Statistical Databases},
    Publisher = {Springer International Publishing},
    Year = {2014},
    Editor = {Domingo-Ferrer, Josep},
    Pages = {232-242},
    Series = {Lecture Notes in Computer Science},
    Volume = {8744},
    Abstract = {The Business Dynamics Statistics is a product of the U.S. Census Bureau that provides measures of business openings and closings, and job creation and destruction, by a variety of cross-classifications (firm and establishment age and size, industrial sector, and geography). Sensitive data are currently protected through suppression. However, as additional tabulations are being developed, at ever more detailed geographic levels, the number of suppressions increases dramatically. This paper explores the option of providing public-use data that are analytically valid and without suppressions, by leveraging synthetic data to replace observations in sensitive cells.},
    DOI = {10.1007/978-3-319-11257-2_18},
    ISBN = {978-3-319-11256-5},
    Keywords = {synthetic data; statistical disclosure limitation; time-series; local labor markets; gross job flows; confidentiality protection},
    Language = {English},
    URL = {http://dx.doi.org/10.1007/978-3-319-11257-2_18}
    }
  • J. Miranda and L. Vilhuber, "Looking Back On Three Years Of Using The Synthetic LBD Beta," Statistical Journal of the IAOS: Journal of the International Association for Official Statistics, vol. 30, 2014.
    [Abstract] [DOI] [URL] [Bibtex]

    Distributions of business data are typically much more skewed than those for household or individual data and public knowledge of the underlying units is greater. As a result, national statistical offices (NSOs) rarely release establishment or firm-level business microdata due to the risk to respondent confidentiality. One potential approach for overcoming these risks is to release synthetic data where the establishment data are simulated from statistical models designed to mimic the distributions of the real underlying microdata. The US Census Bureau's Center for Economic Studies in collaboration with Duke University, the National Institute of Statistical Sciences, and Cornell University made available a synthetic public use file for the Longitudinal Business Database (LBD) comprising more than 20 million records for all business establishments with paid employees dating back to 1976. The resulting product, dubbed the SynLBD, was released in 2010 and is the first-ever comprehensive business microdata set publicly released in the United States including data on establishments' employment and payroll, birth and death years, and industrial classification. This paper documents the scope of projects that have requested and used the SynLBD.

    @Article{SJIAOS-2014a,
    Title = {{Looking Back On Three Years Of Using The {S}ynthetic {LBD} Beta}},
    Author = {Miranda, Javier and Vilhuber, Lars},
    Journal = {Statistical Journal of the IAOS: Journal of the International Association for Official Statistics},
    Year = {2014},
    Volume = {30},
    Abstract = {Distributions of business data are typically much more skewed than those for household or individual data and public knowledge of the underlying units is greater. As a result, national statistical offices (NSOs) rarely release establishment or firm-level business microdata due to the risk to respondent confidentiality. One potential approach for overcoming these risks is to release synthetic data where the establishment data are simulated from statistical models designed to mimic the distributions of the real underlying microdata. The US Census Bureau's Center for Economic Studies in collaboration with Duke University, the National Institute of Statistical Sciences, and Cornell University made available a synthetic public use file for the Longitudinal Business Database (LBD) comprising more than 20 million records for all business establishments with paid employees dating back to 1976. The resulting product, dubbed the SynLBD, was released in 2010 and is the first-ever comprehensive business microdata set publicly released in the United States including data on establishments' employment and payroll, birth and death years, and industrial classification. This paper documents the scope of projects that have requested and used the SynLBD.},
    Doi = {10.3233/SJI-140811},
    Keywords = {confidentiality; comparative studies; US Longitudinal Business Database; synthetic data},
    Owner = {vilhuber},
    Timestamp = {2014.03.24},
    Url = {http://content.iospress.com/articles/statistical-journal-of-the-iaos/sji00811}
    }
2013
  • J. M. Abowd and M. H. Stinson, "Estimating Measurement Error in Annual Job Earnings: A Comparison of Survey and Administrative Data," Review of Economics and Statistics, p. --, 2013.
    [DOI] [URL] [Bibtex]
    @ARTICLE{Abowd2013,
    author = {Abowd, John M. and Stinson, Martha H.},
    title = {Estimating Measurement Error in Annual Job Earnings: A Comparison
    of Survey and Administrative Data},
    journal = {Review of Economics and Statistics},
    year = {2013},
    pages = {--},
    month = jan,
    __markedentry = {[vilhuber:6]},
    doi = {10.1162/REST_a_00352},
    issn = {0034-6535},
    owner = {vilhuber},
    publisher = {MIT Press},
    timestamp = {2013.10.07},
    url = {http://dx.doi.org/10.1162/REST_a_00352}
    }
  • G. Saioc, "Essays on Public Policy and Real Estate Dynamics," PhD Thesis, 2013.
    [URL] [Bibtex]
    @phdthesis{Saioc_phdthesis,
    author = {George Saioc},
    title = {Essays on Public Policy and Real Estate Dynamics},
    school = {University of California, Irvine},
    year = 2013,
    url = {http://search.proquest.com/docview/1415453049}
    }
2012
  • A. Henriques, "Essays in Applied Microeconomics," PhD Thesis, 2012.
    [Abstract] [URL] [Bibtex]

    This dissertation consists of three essays in applied microeconomics. The first chapter looks at whether the Social Security claiming behavior of husbands respond to the presence of Social Security spouse and survivor benefits paid to wives based on his earnings record. I separately estimate the claiming response to incentives for each of the three types of Social Security benefits: retired worker, spousal, and survivor. This approach departs from the previous literature, which estimates behavioral responses to household incentives. I begin by documenting that failure to maximize household Social Security wealth results in a financial burden borne primarily by the wife. I next estimate husbands' behavioral response to Social Security benefit incentives, with my focus exclusively upon incentives due to the actuarial adjustment from delayed claiming. Variation in incentives comes from rule changes to the Social Security benefit calculation, in addition to the age difference between spouses and the relative strength of the wife's labor force history. I find while husbands are responsive to their own benefit incentives, they are barely responsive to household, spousal, and survivor benefit incentives. A variety of robustness checks looking at segments of the population predicted to be more responsive to incentives provide very similar results to main specification. The second chapter examines the incidence of health insurance coverage for displaced workers during the periods preceding and subsequent to job displacement. Most individuals lose health insurance coverage upon job separation. There is concern that individuals are unable to recover insurance coverage following separation. I find within 18 months following job loss the level of health insurance coverage returns to pre-displacement level. Furthermore, I find that obtaining insurance coverage upon reemployment does not impact wages. 
    The third chapter first examines how much of the fall in poverty among elderly women can be attributed to changes in the distributions of age, marital status, and education of elderly women using the Current Population Survey. Increased educational attainment has put tremendous downward pressure on the poverty rate driven primarily by the shift of high school dropouts to those with a high school diploma. I also find poverty would be slightly lower in the absence of changes to the age distribution and no direct impact on poverty levels due to the changes in distribution of marital status. I also investigate the role of both labor force participation and marital status over the life-cycle on old age outcomes using survey data matched to administrative earnings records from the Census Bureau. I find even after controlling for Social Security and marital status over prime-age years, lifetime earnings and labor force experience still has a significant impact on poverty incidence of elderly women. Projecting poverty for cohorts who have not reached old age, I find increased wages and LFP over the life-cycle places large downward pressure on predicted poverty. Marital status over the life-cycle exerts its own negative impact on poverty.

    @PHDTHESIS{Henriques2012,
    author = {Alice Henriques},
    title = {Essays in Applied Microeconomics},
    school = {Columbia University},
    year = {2012},
    note = {Advisor: Till von Wachter},
    abstract = {This dissertation consists of three essays in applied microeconomics.
    The first chapter looks at whether the Social Security claiming behavior
    of husbands respond to the presence of Social Security spouse and
    survivor benefits paid to wives based on his earnings record. I separately
    estimate the claiming response to incentives for each of the three
    types of Social Security benefits: retired worker, spousal, and survivor.
    This approach departs from the previous literature, which estimates
    behavioral responses to household incentives. I begin by documenting
    that failure to maximize household Social Security wealth results
    in a financial burden borne primarily by the wife. I next estimate
    husbands' behavioral response to Social Security benefit incentives,
    with my focus exclusively upon incentives due to the actuarial adjustment
    from delayed claiming. Variation in incentives comes from rule changes
    to the Social Security benefit calculation, in addition to the age
    difference between spouses and the relative strength of the wife's
    labor force history. I find while husbands are responsive to their
    own benefit incentives, they are barely responsive to household,
    spousal, and survivor benefit incentives. A variety of robustness
    checks looking at segments of the population predicted to be more
    responsive to incentives provide very similar results to main specification.
    The second chapter examines the incidence of health insurance coverage
    for displaced workers during the periods preceding and subsequent
    to job displacement. Most individuals lose health insurance coverage
    upon job separation. There is concern that individuals are unable
    to recover insurance coverage following separation. I find within
    18 months following job loss the level of health insurance coverage
    returns to pre-displacement level. Furthermore, I find that obtaining
    insurance coverage upon reemployment does not impact wages. The third
    chapter first examines how much of the fall in poverty among elderly
    women can be attributed to changes in the distributions of age, marital
    status, and education of elderly women using the Current Population
    Survey. Increased educational attainment has put tremendous downward
    pressure on the poverty rate driven primarily by the shift of high
    school dropouts to those with a high school diploma. I also find
    poverty would be slightly lower in the absence of changes to the
    age distribution and no direct impact on poverty levels due to the
    changes in distribution of marital status. I also investigate the
    role of both labor force participation and marital status over the
    life-cycle on old age outcomes using survey data matched to administrative
    earnings records from the Census Bureau. I find even after controlling
    for Social Security and marital status over prime-age years, lifetime
    earnings and labor force experience still has a significant impact
    on poverty incidence of elderly women. Projecting poverty for cohorts
    who have not reached old age, I find increased wages and LFP over
    the life-cycle places large downward pressure on predicted poverty.
    Marital status over the life-cycle exerts its own negative impact
    on poverty.},
    owner = {vilhuber},
    timestamp = {2012.09.04},
    url = {http://hdl.handle.net/10022/AC:P:11813}
    }
2011
  • S. K. Kinney, J. P. Reiter, A. P. Reznek, J. Miranda, R. S. Jarmin, and J. M. Abowd, "Towards Unrestricted Public Use Business Microdata: The Synthetic Longitudinal Business Database," International Statistical Review, vol. 79, iss. 3, pp. 362-384, 2011.
    [Abstract] [DOI] [URL] [Bibtex]

    In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments' confidentiality. One approach with the potential for overcoming these risks is to release synthetic data; that is, the released establishment data are simulated from statistical models designed to mimic the distributions of the underlying real microdata. In this article, we describe an application of this strategy to create a public use file for the Longitudinal Business Database, an annual economic census of establishments in the United States comprising more than 20 million records dating back to 1976. The U.S. Bureau of the Census and the Internal Revenue Service recently approved the release of these synthetic microdata for public use, making the synthetic Longitudinal Business Database the first-ever business microdata set publicly released in the United States. We describe how we created the synthetic data, evaluated analytical validity, and assessed disclosure risk.

    @ARTICLE{KinneyEtAl2011,
    author = {Kinney, Satkartar K. and Reiter, Jerome P. and Reznek, Arnold P.
    and Miranda, Javier and Jarmin, Ron S. and Abowd, John M.},
    title = {Towards Unrestricted Public Use Business Microdata: The Synthetic
    Longitudinal Business Database},
    journal = {International Statistical Review},
    year = {2011},
    volume = {79},
    pages = {362--384},
    number = {3},
    doi = {10.1111/j.1751-5823.2011.00153.x},
    issn = {1751-5823},
    keywords = {Economic census, data confidentiality, synthetic data, disclosure
    limitation},
    owner = {vilhuber},
    publisher = {Blackwell Publishing Ltd},
    timestamp = {2012.09.04},
    abstract = {In most countries, national statistical agencies do not release establishment-level
    business microdata, because doing so represents too large a risk
    to establishments' confidentiality. One approach with the potential
    for overcoming these risks is to release synthetic data; that is,
    the released establishment data are simulated from statistical models
    designed to mimic the distributions of the underlying real microdata.
    In this article, we describe an application of this strategy to create
    a public use file for the Longitudinal Business Database, an annual
    economic census of establishments in the United States comprising
    more than 20 million records dating back to 1976. The U.S. Bureau
    of the Census and the Internal Revenue Service recently approved
    the release of these synthetic microdata for public use, making the
    synthetic Longitudinal Business Database the first-ever business
    microdata set publicly released in the United States. We describe
    how we created the synthetic data, evaluated analytical validity,
    and assessed disclosure risk.},
    url = {http://dx.doi.org/10.1111/j.1751-5823.2011.00153.x}
    }
2009
  • C. N. Kohnen and J. P. Reiter, "Multiple imputation for combining confidential data owned by two agencies," Journal of the Royal Statistical Society, vol. 172, iss. 2, pp. 511-528, 2009.
    [Abstract] [DOI] [URL] [Bibtex]

    Statistical agencies that own different databases on overlapping subjects can benefit greatly from combining their data. These benefits are passed on to secondary data analysts when the combined data are disseminated to the public. Sometimes combining data across agencies or sharing these data with the public is not possible: one or both of these actions may break promises of confidentiality that have been given to data subjects. We describe an approach that is based on two stages of multiple imputation that facilitates data sharing and dissemination under restrictions of confidentiality. We present new inferential methods that properly account for the uncertainty that is caused by the two stages of imputation. We illustrate the approach by using artificial and genuine data.

    @ARTICLE{KohnenReiter_2009,
    author = {Christine N. Kohnen and Jerome P. Reiter},
    title = {Multiple imputation for combining confidential data owned by two agencies},
    abstract = {Statistical agencies that own different databases on overlapping subjects can benefit greatly from combining their data. These benefits are passed on to secondary data analysts when the combined data are disseminated to the public. Sometimes combining data across agencies or sharing these data with the public is not possible: one or both of these actions may break promises of confidentiality that have been given to data subjects. We describe an approach that is based on two stages of multiple imputation that facilitates data sharing and dissemination under restrictions of confidentiality. We present new inferential methods that properly account for the uncertainty that is caused by the two stages of imputation. We illustrate the approach by using artificial and genuine data.},
    journal = {Journal of the Royal Statistical Society},
    year = {2009},
    volume = {172},
    pages = {511-528},
    number = {2},
    doi = {10.1111/j.1467-985X.2008.00574.x},
    url = {http://onlinelibrary.wiley.com/doi/10.1111/j.1467-985X.2008.00574.x/pdf}
    }

Others

2016
  • J. Miranda and L. Vilhuber, "Using Partially Synthetic Microdata to Protect Sensitive Cells in Business Statistics," Center for Economic Studies, U.S. Census Bureau, Working Papers 16-10, 2016.
    [Abstract] [URL] [Bibtex]

    We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).

    @TechReport{RePEc:cen:wpaper:16-10,
    author={Javier Miranda and Lars Vilhuber},
    title={{Using Partially Synthetic Microdata to Protect Sensitive Cells in Business Statistics}},
    year=2016,
    month=Feb,
    institution={Center for Economic Studies, U.S. Census Bureau},
    type={Working Papers},
    url={https://ideas.repec.org/p/cen/wpaper/16-10.html},
    number={16-10},
    abstract={We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).},
    keywords={synthetic data; statistical disclosure limitation; time-series; local labor markets; gross job flows},
    doi={},
    }
2015
  • M. Greenstone, A. Mas, and H. Nguyen, "Do Credit Market Shocks affect the Real Economy? Quasi-Experimental Evidence from the Great Recession and Normal Economic Times," Princeton University 2015.
    [Abstract] [URL] [Bibtex]

    This paper uses comprehensive data on bank lending and establishment-level outcomes from 1997-2011 to test whether changes in small business bank lending affect the real economy. The shift-share style research design predicts county-level lending shocks using variation in pre-existing bank market shares and estimated bank supply-shifts. Counties with negative predicted supply shocks experienced declines in small business loan originations throughout the entire period, indicating that it is costly for these businesses to find new lenders. Using confidential microdata from the Longitudinal Business Database, we find the predicted lending shocks led to statistically significant, but economically small, declines in both small firm and overall employment during the Great Recession, but did not affect employment during the 1997-2007 period. Overall, this paper’s evidence fails to support the hypothesis that the small business lending channel is an important determinant of economic activity.

    @TechReport{GreenstoneMasNguyen,
    author = {Michael Greenstone and Alexandre Mas and Hoai-Luu Nguyen},
    title = {Do Credit Market Shocks affect the Real Economy? Quasi-Experimental Evidence from the Great Recession and Normal Economic Times},
    institution = {Princeton University},
    year = {2015},
    month = sep,
    abstract = {This paper uses comprehensive data on bank lending and establishment-level outcomes from 1997-2011 to test whether changes in small business bank lending affect the real economy. The shift-share style research design predicts county-level lending shocks using variation in pre-existing bank market shares and estimated bank supply-shifts. Counties with negative predicted supply shocks experienced declines in small business loan originations throughout the entire period, indicating that it is costly for these businesses to find new lenders. Using confidential microdata from the Longitudinal Business Database, we find the predicted lending shocks led to statistically significant, but economically small, declines in both small firm and overall employment during the Great Recession, but did not affect employment during the 1997-2007 period. Overall, this paper’s evidence fails to support the hypothesis that the small business lending channel is an important determinant of economic activity.},
    file = {:http\://www.princeton.edu/~amas/papers/gmn_20150925.pdf:URL},
    owner = {vilhuber},
    timestamp = {2016.10.17},
    url = {http://www.princeton.edu/~amas/papers/gmn_20150925.pdf},
    }
2014
  • P. Armour, "How much work would a 50% disability insurance benefit offset encourage? An analysis using SSI and SSDI incentives," NBER Disability Research Center, Presentation at the 2nd Disability Research Consortium, 2014.
    [PDF] [URL] [Bibtex]
    @TECHREPORT{Armour-2014-drc,
    author = {Philip Armour},
    title = {How much work would a 50\% disability insurance benefit offset encourage? An analysis using SSI and SSDI incentives},
    institution = {NBER Disability Research Center},
    year = {2014},
    type = {Presentation at the 2nd Disability Research Consortium},
    timestamp = {2015.02.12},
    url = {http://www.nber.org/aging/drc/10312014drcmeeting/5.1Summary.pdf}
    }
  • J. Drechsler and L. Vilhuber, "A First Step Towards A German SynLBD: Constructing A German Longitudinal Business Database," Center for Economic Studies, U.S. Census Bureau, Working Papers 14-13, 2014.
    [Abstract] [URL] [Bibtex]

    One major criticism against the use of synthetic data has been that the efforts necessary to generate useful synthetic data are so intense that many statistical agencies cannot afford them. We argue many lessons in this evolving field have been learned in the early years of synthetic data generation, and can be used in the development of new synthetic data products, considerably reducing the required investments. The final goal of the project described in this paper will be to evaluate whether synthetic data algorithms developed in the U.S. to generate a synthetic version of the Longitudinal Business Database (LBD) can easily be transferred to generate a similar data product for other countries. We construct a German data product with information comparable to the LBD - the German Longitudinal Business Database (GLBD) - that is generated from different administrative sources at the Institute for Employment Research, Germany. In a future step, the algorithms developed for the synthesis of the LBD will be applied to the GLBD. Extensive evaluations will illustrate whether the algorithms provide useful synthetic data without further adjustment. The ultimate goal of the project is to provide access to multiple synthetic datasets similar to the SynLBD at Cornell to enable comparative studies between countries. The Synthetic GLBD is a first step towards that goal.

    @TechReport{RePEc:cen:wpaper:14-13,
    Title = {{A First Step Towards A {German} {SynLBD}: {C}onstructing A {G}erman {L}ongitudinal {B}usiness {D}atabase}},
    Author = {J{\"o}rg Drechsler and Lars Vilhuber},
    Institution = {Center for Economic Studies, U.S. Census Bureau},
    Year = {2014},
    Month = Feb,
    Number = {14-13},
    Type = {Working Papers},
    Abstract = {One major criticism against the use of synthetic data has been that the efforts necessary to generate useful synthetic data are so intense that many statistical agencies cannot afford them. We argue many lessons in this evolving field have been learned in the early years of synthetic data generation, and can be used in the development of new synthetic data products, considerably reducing the required investments. The final goal of the project described in this paper will be to evaluate whether synthetic data algorithms developed in the U.S. to generate a synthetic version of the Longitudinal Business Database (LBD) can easily be transferred to generate a similar data product for other countries. We construct a German data product with information comparable to the LBD - the German Longitudinal Business Database (GLBD) - that is generated from different administrative sources at the Institute for Employment Research, Germany. In a future step, the algorithms developed for the synthesis of the LBD will be applied to the GLBD. Extensive evaluations will illustrate whether the algorithms provide useful synthetic data without further adjustment. The ultimate goal of the project is to provide access to multiple synthetic datasets similar to the SynLBD at Cornell to enable comparative studies between countries. The Synthetic GLBD is a first step towards that goal.},
    Keywords = {confidentiality; comparative studies; German Longitudinal Business Database; synthetic data},
    Owner = {vilhuber},
    Timestamp = {2014.03.24},
    Url = {http://ideas.repec.org/p/cen/wpaper/14-13.html}
    }
  • R. S. Jarmin, T. A. Louis, and J. Miranda, "Expanding The Role Of Synthetic Data At The U.S. Census Bureau," Center for Economic Studies, U.S. Census Bureau, Working Papers 14-10, 2014.
    [Abstract] [URL] [Bibtex]

    National Statistical offices (NSOs) create official statistics from data collected from survey respondents, government administrative records and other sources. The raw source data is usually considered to be confidential. In the case of the U.S. Census Bureau, confidentiality of survey and administrative records microdata is mandated by statute, and this mandate to protect confidentiality is often at odds with the needs of users to extract as much information from the data as possible. Traditional disclosure protection techniques result in official data products that do not fully utilize the information content of the underlying microdata. Typically, these products take the form of simple aggregate tabulations. In a few cases anonymized public-use micro samples are made available, but these face a growing risk of re-identification by the increasing amounts of information about individuals and firms available in the public domain. One approach for overcoming these risks is to release products based on synthetic data where values are simulated from statistical models designed to mimic the (joint) distributions of the underlying microdata. We discuss recent Census Bureau work to develop and deploy such products. We discuss the benefits and challenges involved with extending the scope of synthetic data products in official statistics.

    @TechReport{RePEc:cen:wpaper:14-10,
    author={Ron S. Jarmin and Thomas A. Louis and Javier Miranda},
    title={{Expanding The Role Of Synthetic Data At The U.S. Census Bureau}},
    year=2014,
    month=Feb,
    institution={Center for Economic Studies, U.S. Census Bureau},
    type={Working Papers},
    url={http://ideas.repec.org/p/cen/wpaper/14-10.html},
    number={14-10},
    abstract={National Statistical offices (NSOs) create official statistics from data collected from survey respondents, government administrative records and other sources. The raw source data is usually considered to be confidential. In the case of the U.S. Census Bureau, confidentiality of survey and administrative records microdata is mandated by statute, and this mandate to protect confidentiality is often at odds with the needs of users to extract as much information from the data as possible. Traditional disclosure protection techniques result in official data products that do not fully utilize the information content of the underlying microdata. Typically, these products take the form of simple aggregate tabulations. In a few cases anonymized public-use micro samples are made available, but these face a growing risk of re-identification by the increasing amounts of information about individuals and firms available in the public domain. One approach for overcoming these risks is to release products based on synthetic data where values are simulated from statistical models designed to mimic the (joint) distributions of the underlying microdata. We discuss recent Census Bureau work to develop and deploy such products. We discuss the benefits and challenges involved with extending the scope of synthetic data products in official statistics.},
    keywords={confidentiality; synthetic micro data; official statistics},
    }
  • S. K. Kinney, J. P. Reiter, and J. Miranda, "Improving The Synthetic Longitudinal Business Database," Center for Economic Studies, U.S. Census Bureau, Working Papers 14-12, 2014.
    [Abstract] [URL] [Bibtex]

    In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments’ confidentiality. Agencies potentially can manage these risks by releasing synthetic microdata, i.e., individual establishment records simulated from statistical models designed to mimic the joint distribution of the underlying observed data. Previously, we used this approach to generate a public-use version—now available for public use—of the U.S. Census Bureau’s Longitudinal Business Database (LBD), a longitudinal census of establishments dating back to 1976. While the synthetic LBD has proven to be a useful product, we now seek to improve and expand it by using new synthesis models and adding features. This article describes our efforts to create the second generation of the SynLBD, including synthesis procedures that we believe could be replicated in other contexts.

    @TechReport{RePEc:cen:wpaper:14-12,
    author={Satkartar K. Kinney and Jerome P. Reiter and Javier Miranda},
    title={{Improving The Synthetic Longitudinal Business Database}},
    year=2014,
    month=Feb,
    institution={Center for Economic Studies, U.S. Census Bureau},
    type={Working Papers},
    url={http://ideas.repec.org/p/cen/wpaper/14-12.html},
    number={14-12},
    abstract={In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments’ confidentiality. Agencies potentially can manage these risks by releasing synthetic microdata, i.e., individual establishment records simulated from statistical models designed to mimic the joint distribution of the underlying observed data. Previously, we used this approach to generate a public-use version—now available for public use—of the U.S. Census Bureau’s Longitudinal Business Database (LBD), a longitudinal census of establishments dating back to 1976. While the synthetic LBD has proven to be a useful product, we now seek to improve and expand it by using new synthesis models and adding features. This article describes our efforts to create the second generation of the SynLBD, including synthesis procedures that we believe could be replicated in other contexts.},
    keywords={},
    }
  • J. Miranda and L. Vilhuber, "Looking Back On Three Years Of Using The Synthetic LBD Beta," Center for Economic Studies, U.S. Census Bureau, Working Papers 14-11, 2014.
    [Abstract] [URL] [Bibtex]

    Distributions of business data are typically much more skewed than those for household or individual data and public knowledge of the underlying units is greater. As a result, national statistical offices (NSOs) rarely release establishment or firm-level business microdata due to the risk to respondent confidentiality. One potential approach for overcoming these risks is to release synthetic data where the establishment data are simulated from statistical models designed to mimic the distributions of the real underlying microdata. The US Census Bureau's Center for Economic Studies in collaboration with Duke University, the National Institute of Statistical Sciences, and Cornell University made available a synthetic public use file for the Longitudinal Business Database (LBD) comprising more than 20 million records for all business establishments with paid employees dating back to 1976. The resulting product, dubbed the SynLBD, was released in 2010 and is the first-ever comprehensive business microdata set publicly released in the United States including data on establishments' employment and payroll, birth and death years, and industrial classification. This paper documents the scope of projects that have requested and used the SynLBD.

    @TechReport{RePEc:cen:wpaper:14-11,
    Title = {{Looking Back On Three Years Of Using The {S}ynthetic {LBD} Beta}},
    Author = {Miranda, Javier and Lars Vilhuber},
    Institution = {Center for Economic Studies, U.S. Census Bureau},
    Year = {2014},
    Month = Feb,
    Number = {14-11},
    Type = {Working Papers},
    Abstract = {Distributions of business data are typically much more skewed than those for household or individual data and public knowledge of the underlying units is greater. As a result, national statistical offices (NSOs) rarely release establishment or firm-level business microdata due to the risk to respondent confidentiality. One potential approach for overcoming these risks is to release synthetic data where the establishment data are simulated from statistical models designed to mimic the distributions of the real underlying microdata. The US Census Bureau's Center for Economic Studies in collaboration with Duke University, the National Institute of Statistical Sciences, and Cornell University made available a synthetic public use file for the Longitudinal Business Database (LBD) comprising more than 20 million records for all business establishments with paid employees dating back to 1976. The resulting product, dubbed the SynLBD, was released in 2010 and is the first-ever comprehensive business microdata set publicly released in the United States including data on establishments' employment and payroll, birth and death years, and industrial classification. This paper documents the scope of projects that have requested and used the SynLBD.},
    Keywords = {confidentiality; comparative studies; US Longitudinal Business Database; synthetic data},
    Owner = {vilhuber},
    Timestamp = {2014.03.24},
    Url = {http://ideas.repec.org/p/cen/wpaper/14-11.html}
    }
  • M. S. Rutledge, A. Y. Wu, and F. M. Vitagliano, "Do tax incentives increase 401(K) retirement saving? Evidence from the adoption of catch-up contributions," Center for Retirement Research at Boston College, Working Paper CRR WP 2014-17, 2014.
    [Abstract] [URL] [Bibtex]

    The U.S. government subsidizes retirement saving through 401(k) plans with $61.4 billion in tax expenditures annually, but the question of whether these tax incentives are effective in increasing saving remains unanswered. Using longitudinal U.S. Social Security Administration data on tax-deferred earnings linked to the Survey of Income and Program Participation, the project examines whether the “catch-up provision,” which was enacted in 2001 and allows workers over age 50 to contribute more to their 401(k) plans, has been effective in increasing earnings deferrals. Compared with similar workers under age 50, the study finds that contributions increased by $540 more among age-50-plus individuals who had approached the 401(k) tax-deferral limits prior to turning 50, suggesting that the older individuals respond to the expanded tax incentives. For this group, the elasticity of retirement savings to the tax incentive is quite high: a one-dollar increase in the tax-deferred limit leads to an immediate 49-cent increase in 401(k) contributions.

    @TECHREPORT{Rutledgeetal_2014,
    author = {Mathew S. Rutledge and April Yanyuan Wu and Francis M. Vitagliano},
    title = {Do tax incentives increase 401(K) retirement saving? Evidence from the adoption of catch-up contributions},
    abstract = {The U.S. government subsidizes retirement saving through 401(k) plans with \$61.4 billion in tax expenditures annually, but the question of whether these tax incentives are effective in increasing saving remains unanswered. Using longitudinal U.S. Social Security Administration data on tax-deferred earnings linked to the Survey of Income and Program Participation, the project examines whether the “catch-up provision,” which was enacted in 2001 and allows workers over age 50 to contribute more to their 401(k) plans, has been effective in increasing earnings deferrals. Compared with similar workers under age 50, the study finds that contributions increased by \$540 more among age-50-plus individuals who had approached the 401(k) tax-deferral limits prior to turning 50, suggesting that the older individuals respond to the expanded tax incentives. For this group, the elasticity of retirement savings to the tax incentive is quite high: a one-dollar increase in the tax-deferred limit leads to an immediate 49-cent increase in 401(k) contributions.},
    institution = {Center for Retirement Research at Boston College},
    year = {2014},
    month = {November},
    type = {Working Paper},
    number = {CRR WP 2014-17},
    timestamp = {2015.02.12},
    url = {http://dx.doi.org/10.2139/ssrn.2530026}
    }
2013
  • J. M. Abowd and L. Vilhuber, "Improved Research Access to Census Bureau Linked Administrative Data via Public-use Products," in Eighteenth Annual Meetings of the Society of Labor Economists, 2013.
    [URL] [Bibtex]
    @InProceedings{Improved-Access-Abowd-SOLE-20130503,
    author = {John M. Abowd and Lars Vilhuber},
    title = "Improved Research Access to {Census Bureau} Linked Administrative Data via Public-use Products",
    url = "http://www.sole-jole.org/Abowd-Data.pdf",
    booktitle="Eighteenth Annual Meetings of the Society of Labor Economists",
    year = "2013",
    }
  • G. Benedetto, M. Stinson, and J. M. Abowd, "The creation and use of the SIPP Synthetic Beta," US Census Bureau 2013.
    [URL] [Bibtex]
    @TECHREPORT{Benedettoetal_2013,
    author = {Gary Benedetto and Martha Stinson and John M. Abowd},
    title = {The creation and use of the SIPP Synthetic Beta},
    institution = {US Census Bureau},
    year = {2013},
    timestamp = {2015.02.11},
    url = {http://www.census.gov/content/dam/Census/programs-surveys/sipp/methodology/SSBdescribe_nontechnical.pdf}
    }
  • M. Bertrand, E. Kamenica, and J. Pan, "Gender Identity and Relative Income within Households," Chicago Booth, Research Paper 13-08, 2013.
    [Abstract] [DOI] [URL] [Bibtex]

    We examine causes and consequences of relative income within households. We establish that gender identity - in particular, an aversion to the wife earning more than the husband - impacts marriage formation, the wife's labor force participation, the wife's income conditional on working, satisfaction with the marriage, divorce, and the division of home production. The distribution of the share of household income earned by the wife exhibits a sharp cliff at 0.5, which suggests that a couple is less willing to match if her income exceeds his. Within marriage markets, when a randomly chosen woman becomes more likely to earn more than a randomly chosen man, the marriage rates decline. Within couples, if the wife's potential income (based on her demographics) is likely to exceed the husband's, the wife is less likely to be in the labor force and earns less than her potential if she does work. Couples where the wife earns more than the husband are less satisfied with their marriage and are more likely to divorce. Finally, based on time use surveys, the gender gap in non-market work is larger if the wife earns more than the husband.

    @TECHREPORT{BertrandKamenicaPan2013,
    author = {Bertrand, Marianne and Kamenica, Emir and Pan, Jessica},
    title = {Gender Identity and Relative Income within Households},
    institution = {Chicago Booth},
    year = {2013},
    type = {Research Paper},
    number = {13-08},
    abstract = {We examine causes and consequences of relative income within households.
    We establish that gender identity - in particular, an aversion to
    the wife earning more than the husband - impacts marriage formation,
    the wife's labor force participation, the wife's income conditional
    on working, satisfaction with the marriage, divorce, and the division
    of home production. The distribution of the share of household income
    earned by the wife exhibits a sharp cliff at 0.5, which suggests
    that a couple is less willing to match if her income exceeds his.
    Within marriage markets, when a randomly chosen woman becomes more
    likely to earn more than a randomly chosen man, the marriage rates
    decline. Within couples, if the wife's potential income (based on
    her demographics) is likely to exceed the husband's, the wife is
    less likely to be in the labor force and earns less than her potential
    if she does work. Couples where the wife earns more than the husband
    are less satisfied with their marriage and are more likely to divorce.
    Finally, based on time use surveys, the gender gap in non-market
    work is larger if the wife earns more than the husband.},
    doi = {10.2139/ssrn.2216750},
    owner = {vilhuber},
    timestamp = {2013.10.07},
    url = {http://dx.doi.org/10.2139/ssrn.2216750}
    }
  • M. Bertrand, E. Kamenica, and J. Pan, "Gender Identity and Relative Income within Households," NBER, Working Paper 19023, 2013.
    [Abstract] [DOI] [URL] [Bibtex]

    We examine causes and consequences of relative income within households. We establish that gender identity - in particular, an aversion to the wife earning more than the husband - impacts marriage formation, the wife's labor force participation, the wife's income conditional on working, marriage satisfaction, likelihood of divorce, and the division of home production. The distribution of the share of household income earned by the wife exhibits a sharp cliff at 0.5, which suggests that a couple is less willing to match if her income exceeds his. Within marriage markets, when a randomly chosen woman becomes more likely to earn more than a randomly chosen man, marriage rates decline. Within couples, if the wife's potential income (based on her demographics) is likely to exceed the husband's, the wife is less likely to be in the labor force and earns less than her potential if she does work. Couples where the wife earns more than the husband are less satisfied with their marriage and are more likely to divorce. Finally, based on time use surveys, the gender gap in non-market work is larger if the wife earns more than the husband.

    @TECHREPORT{nber19023,
    author = {Bertrand, Marianne and Kamenica, Emir and Pan, Jessica},
    title = {Gender Identity and Relative Income within Households},
    institution = {NBER},
    year = {2013},
    month = {May},
    type = {Working Paper},
    number = {19023},
    abstract = {We examine causes and consequences of relative income within households. We establish that gender identity - in particular, an aversion to the wife earning more than the husband - impacts marriage formation, the wife's labor force participation, the wife's income conditional on working, marriage satisfaction, likelihood of divorce, and the division of home production. The distribution of the share of household income earned by the wife exhibits a sharp cliff at 0.5, which suggests that a couple is less willing to match if her income exceeds his. Within marriage markets, when a randomly chosen woman becomes more likely to earn more than a randomly chosen man, marriage rates decline. Within couples, if the wife's potential income (based on her demographics) is likely to exceed the husband's, the wife is less likely to be in the labor force and earns less than her potential if she does work. Couples where the wife earns more than the husband are less satisfied with their marriage and are more likely to divorce. Finally, based on time use surveys, the gender gap in non-market work is larger if the wife earns more than the husband.},
    doi = {10.3386/w19023},
    owner = {vilhuber},
    timestamp = {2013.10.07},
    url = {http://www.nber.org/papers/w19023},
    }
  • J. Drechsler and L. Vilhuber, "Replicating the Synthetic LBD with German Establishment Data," in Proceedings 59th ISI World Statistics Congress, 25-30 August 2013, Hong Kong (Session STS062), 2013, pp. 2291-2296.
    [URL] [Bibtex]
    @inproceedings{ISI2013-3,
    author = {J{\"o}rg Drechsler and Lars Vilhuber},
    title = {Replicating the {S}ynthetic {LBD} with {G}erman Establishment Data},
    booktitle = {Proceedings 59th ISI World Statistics Congress, 25-30 August 2013, Hong Kong
    (Session STS062) },
    year = {2013},
    pages = {2291-2296},
    isbn = {978-90-73592-34-6},
    url = {http://2013.isiproceedings.org},
    urldate = {2014-03-24},
    }
2012
  • R. Chenevert, "Changing Levels of Spousal Education and Labor Force Supply," US Census Bureau 2012.
    [Abstract] [URL] [Bibtex]

    The purpose of this paper is to describe the labor force behavior of married couples where the woman is more educated than her husband, as well as study how this relates to breadwinner status. We see that the fraction of couples where the wife is more educated than the husband is increasing over the time studied, and that labor force participation rates of women more educated than their husbands increase as well. Next, I study female breadwinner status by education level of the spouses by replicating Winkler, McBride and Andrews, and extending their work using the SIPP Gold Standard Completed Data and the SIPP Synthetic Beta (SSB). This helps us create a more complete picture of the interaction of household and labor market dynamics.

    @TECHREPORT{Chenevert_2012,
    author = {Rebecca Chenevert},
    title = {Changing Levels of Spousal Education and Labor Force Supply},
    abstract = {The purpose of this paper is to describe the labor force behavior of married couples where the woman is more educated than her husband, as well as study how this relates to breadwinner status. We see that the fraction of couples where the wife is more educated than the husband is increasing over the time studied, and that labor force participation rates of women more educated than their husbands increase as well. Next, I study female breadwinner status by education level of the spouses by replicating Winkler, McBride and Andrews, and extending their work using the SIPP Gold Standard Completed Data and the SIPP Synthetic Beta (SSB). This helps us create a more complete picture of the interaction of household and labor market dynamics.},
    institution = {US Census Bureau},
    year = {2012},
    timestamp = {2015.02.11},
    url = {http://beta.census.gov/people/laborforce/publications/Chenevert_MEA2012.pdf}
    }
  • A. Henriques, "How Does Social Security Claiming Respond to Incentives? Considering Husbands' and Wives' Benefits Separately," The Federal Reserve Board, Finance and Economics Discussion Series Working Paper 2012-19, 2012.
    [Abstract] [DOI] [URL] [Bibtex]

    A majority of women receive most of their Social Security benefits based upon their husbands' earnings history, but previous research has shown that husbands' benefit claiming is inconsistent with maximizing lifetime benefits for the couple. However, that research assumes husbands choose their claim age based on all Social Security incentives facing the household. I show that husbands' claiming behavior responds to the actuarial incentives built into their own retired worker benefit formula, but not to the incentives built into the spouse and survivor formulas that determine their wives' benefits. This failure to incorporate his spouses' incentives reduces wives' lifetime benefits. Variation in incentives comes from rule changes to the Social Security benefit calculation in addition to the age difference between spouses and the relative strength of the wife's labor force history. A variety of robustness checks looking at segments of the population predicted to be more responsive to incentives provide similar results to the main specification.

    @TECHREPORT{Henriques_2012,
    author = {Alice Henriques},
    title = {How Does Social Security Claiming Respond to Incentives? Considering Husbands' and Wives' Benefits Separately},
    abstract = {A majority of women receive most of their Social Security benefits based upon their husbands' earnings history, but previous research has shown that husbands' benefit claiming is inconsistent with maximizing lifetime benefits for the couple. However, that research assumes husbands choose their claim age based on all Social Security incentives facing the household. I show that husbands' claiming behavior responds to the actuarial incentives built into their own retired worker benefit formula, but not to the incentives built into the spouse and survivor formulas that determine their wives' benefits. This failure to incorporate his spouses' incentives reduces wives' lifetime benefits. Variation in incentives comes from rule changes to the Social Security benefit calculation in addition to the age difference between spouses and the relative strength of the wife's labor force history. A variety of robustness checks looking at segments of the population predicted to be more responsive to incentives provide similar results to the main specification.},
    institution = {The Federal Reserve Board},
    year = {2012},
    month = {March},
    type = {Finance and Economics Discussion Series Working Paper},
    number = {2012-19},
    timestamp = {2015.02.12},
    doi = {10.2139/ssrn.2054772},
    url = {http://www.federalreserve.gov/pubs/feds/2012/201219/201219pap.pdf}
    }
2011
  • J. M. Abowd and M. H. Stinson, "Estimating Measurement Error in SIPP Annual Job Earnings: A Comparison of Census Survey and SSA Administrative Data," US Census Bureau, SEHSD Working Paper 2011-19, 2011.
    [URL] [Bibtex]
    @TECHREPORT{AbowdStinson2011,
    author = {John M. Abowd and Martha H. Stinson},
    title = {Estimating Measurement Error in SIPP Annual Job Earnings: A Comparison
    of Census Survey and SSA Administrative Data},
    institution = {US Census Bureau},
    type={SEHSD Working Paper},
    number={2011-19},
    month = may,
    year = {2011},
    owner = {kr328},
    timestamp = {2012.07.02},
    url = {http://www.census.gov/sipp/workpapr/abowd-stinson-ME-ss-20110707-sehsdworking.pdf}
    }
  • S. K. Kinney, J. P. Reiter, A. P. Reznek, J. Miranda, R. S. Jarmin, and J. M. Abowd, "Towards Unrestricted Public Use Business Microdata: The Synthetic Longitudinal Business Database," Center for Economic Studies, U.S. Census Bureau, Working Papers 11-04, 2011.
    [Abstract] [URL] [Bibtex]

    In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments' confidentiality. One approach with the potential for overcoming these risks is to release synthetic data; that is, the released establishment data are simulated from statistical models designed to mimic the distributions of the underlying real microdata. In this article, we describe an application of this strategy to create a public use file for the Longitudinal Business Database, an annual economic census of establishments in the United States comprising more than 20 million records dating back to 1976. The U.S. Bureau of the Census and the Internal Revenue Service recently approved the release of these synthetic microdata for public use, making the synthetic Longitudinal Business Database the first-ever business microdata set publicly released in the United States. We describe how we created the synthetic data, evaluated analytical validity, and assessed disclosure risk.

    @TECHREPORT{RePEc:cen:wpaper:11-04,
    author = {Satkartar K. Kinney and Jerome P. Reiter and Arnold P. Reznek and
    Javier Miranda and Ron S. Jarmin and John M. Abowd},
    title = {Towards Unrestricted Public Use Business Microdata: The Synthetic
    Longitudinal Business Database},
    institution = {Center for Economic Studies, U.S. Census Bureau},
    year = {2011},
    type = {Working Papers},
    number = {11-04},
    month = Feb,
    abstract = {In most countries, national statistical agencies do not release establishment-level
    business microdata, because doing so represents too large a risk
    to establishments' confidentiality. One approach with the potential
    for overcoming these risks is to release synthetic data; that is,
    the released establishment data are simulated from statistical models
    designed to mimic the distributions of the underlying real microdata.
    In this article, we describe an application of this strategy to create
    a public use file for the Longitudinal Business Database, an annual
    economic census of establishments in the United States comprising
    more than 20 million records dating back to 1976. The U.S. Bureau
    of the Census and the Internal Revenue Service recently approved
    the release of these synthetic microdata for public use, making the
    synthetic Longitudinal Business Database the first-ever business
    microdata set publicly released in the United States. We describe
    how we created the synthetic data, evaluated analytical validity,
    and assessed disclosure risk.},
    owner = {vilhuber},
    timestamp = {2013.10.14},
    url = {http://ideas.repec.org/p/cen/wpaper/11-04.html}
    }
2010
  • G. Benedetto, G. Gathright, and M. Stinson, "The Earnings Impact of Graduating from College during a Recession," US Census Bureau 2010.
    [PDF] [Bibtex]
    @TECHREPORT{benedettogathrightstinson-11301,
    author = {Gary Benedetto and Graton Gathright and Martha Stinson},
    title = {The Earnings Impact of Graduating from College during a Recession},
    institution = {US Census Bureau},
    year = {2010},
    owner = {kr328},
    timestamp = {2012.07.02},
    note = {Obtained from http://www.sole-jole.org/11301.pdf on 2012-07-02.},
    }
2009
  • K. E. Smith, D. A. Wissoker, and additional authors, "SSA/SIPP/IRS Synthetic Beta File: Analytic Evaluation," Urban Institute and NORC Evaluation Team, Working Paper, 2009.
    [Abstract] [URL] [Bibtex]

    The paper provides an independent evaluation of the SIPP Synthetic Beta File. This file, created by the Bureau of the Census, is intended to provide a public use database with similar statistical properties as the confidential Social Security Administration's earnings and benefit data linked to the SIPP. There is much to praise in the Census work. Many univariate distributions were "spot on." Unweighted regression analyses had some problems and results for them were mixed. In policy simulation modeling there were many instances of differences between the Synthetic and actual data that would have led researchers to wrong conclusions.

    @TECHREPORT{Smithetal_2009,
    author = {Karen E. Smith and Douglas A. Wissoker and additional authors},
    title = {SSA/SIPP/IRS Synthetic Beta File: Analytic Evaluation},
    abstract = {The paper provides an independent evaluation of the SIPP Synthetic Beta File. This file, created by the Bureau of the Census, is intended to provide a public use database with similar statistical properties as the confidential Social Security Administration's earnings and benefit data linked to the SIPP. There is much to praise in the Census work. Many univariate distributions were "spot on." Unweighted regression analyses had some problems and results for them were mixed. In policy simulation modeling there were many instances of differences between the Synthetic and actual data that would have led researchers to wrong conclusions.},
    institution = {Urban Institute and NORC Evaluation Team},
    year = {2009},
    type = {Working Paper},
    timestamp = {2015.02.11},
    url = {http://www.urban.org/uploadedpdf/412005_syntheticbetafile.pdf}
    }
2007
  • J. M. Abowd, G. Benedetto, and M. H. Stinson, "The covariance of earnings and hours revisited," AEA Annual Meetings, Working Paper, 2007.
    [Abstract] [URL] [Bibtex]

    In this paper we examine the earnings covariance matrix generated from a ten-year time series and estimate a variance components model that parameterizes the process generating earnings. We use our estimated variance components to test key hypotheses concerning life-cycle human capital investment and labor supply separately for men and women. Human capital investment models predict that individuals with higher initial earnings have lower growth rates of earnings and that earnings follow a random growth model with individual specific rates of growth due to experience. Life-cycle labor supply models predict that variation in individual productivity affects earnings more than hours supplied. In order to test these hypotheses, we look for permanent individual variance components in the growth rate of earnings and significant auto-correlation in earnings over time. We also test for the presence of a common component of variation between hours and earnings and explore how this component contributes to earnings relative to hours. We look for evidence to support or contradict the predictions of the models using a new data source — a set of SIPP panels linked to administrative tax data on labor market earnings. Our data contain Survey of Income and Program Participation (SIPP) respondents from the five panels conducted by the Census Bureau in the 1990s with linked W-2 wage records filed by employers with the IRS. The sum of these wage records for a given year provides an uncapped annual earnings measure. We use survey information on the number of weeks worked full-time and part-time in a year to estimate annual hours worked. Because of the length of the time period covered (1990-1999), the size of the sample (approximately 230,000 individuals), and the high quality of the earnings measure, these data offer a unique opportunity to re-visit several classic labor economics questions and provide fresh evidence for on-going debate.

    @TECHREPORT{Abowdetal_2007,
    author = {John M. Abowd and Gary Benedetto and Martha H. Stinson},
    title = {The covariance of earnings and hours revisited},
    abstract = {In this paper we examine the earnings covariance matrix generated from a ten-year time series and estimate a variance components model that parameterizes the process generating earnings. We use our estimated variance components to test key hypotheses concerning life-cycle human capital investment and labor supply separately for men and women. Human capital investment models predict that individuals with higher initial earnings have lower growth rates of earnings and that earnings follow a random growth model with individual specific rates of growth due to experience. Life-cycle labor supply models predict that variation in individual productivity affects earnings more than hours supplied. In order to test these hypotheses, we look for permanent individual variance components in the growth rate of earnings and significant auto-correlation in earnings over time. We also test for the presence of a common component of variation between hours and earnings and explore how this component contributes to earnings relative to hours. We look for evidence to support or contradict the predictions of the models using a new data source — a set of SIPP panels linked to administrative tax data on labor market earnings. Our data contain Survey of Income and Program Participation (SIPP) respondents from the five panels conducted by the Census Bureau in the 1990s with linked W-2 wage records filed by employers with the IRS. The sum of these wage records for a given year provides an uncapped annual earnings measure. We use survey information on the number of weeks worked full-time and part-time in a year to estimate annual hours worked. Because of the length of the time period covered (1990-1999), the size of the sample (approximately 230,000 individuals), and the high quality of the earnings measure, these data offer a unique opportunity to re-visit several classic labor economics questions and provide fresh evidence for on-going debate.},
    institution = {AEA Annual Meetings},
    year = {2007},
    type = {Working Paper},
    timestamp = {2015.02.12},
    url = {https://www.aeaweb.org/annual\_mtg\_papers/2008/2008\_254.pdf}
    }

Complete Bibtex files

Articles and working papers and separately WSC2013 documents, WSC2015 documents.

Technical documentation on datasets

SIPP Synthetic Beta

  • J. M. Abowd, G. Benedetto, and M. Stinson, "Using the SIPP Synthetic Beta for Analysis," U.S. Census Bureau, Training provided to participants at a meeting at the U.S. Census Bureau on October 26, 2007, 2007.
    [PDF] [URL] [Bibtex]
    @TECHREPORT{sipp_synthetic_beta_training_final_20071026,
    author = {John M. Abowd and Gary Benedetto and Martha Stinson},
    title = {Using the {SIPP} {Synthetic} {Beta} for Analysis},
    institution = {U.S. Census Bureau},
    year = {2007},
    type = {Training provided to participants at a meeting at the U.S. Census
    Bureau on October 26, 2007},
    owner = {vilhuber},
    timestamp = {2013.10.08},
    url = {http://www2.vrdc.cornell.edu/news/?p=306},
    url = {http://hdl.handle.net/1813/43930}
    }
  • J. M. Abowd, M. Stinson, and G. Benedetto, "Final Report to the Social Security Administration on the SIPP/SSA/IRS Public Use File Project," U.S. Census Bureau 2006.
    [Abstract] [PDF] [URL] [Bibtex]

    The creation of public use data that combine variables from the Census Bureau's Survey of Income and Program Participation (SIPP), the Internal Revenue Service's (IRS) individual lifetime earnings data, and the Social Security Administration's (SSA) individual benefit data began as part of ongoing collaborative research at the Census Bureau and SSA. The current project had its genesis with the formation of a joint committee containing representatives from the Census Bureau, SSA, IRS, and the Congressional Budget Office (CBO) that designed a prospective public use file. Aimed at a user community that was primarily interested in national retirement and disability programs, the selection of variables for the proposed SIPP/SSA/IRS-PUF focused on the critical demographic data to be supplied from the SIPP, earnings histories from the IRS data maintained at SSA, and benefit data from SSA’s master beneficiary records. After attempting to determine the feasibility of adding a limited number of variables from the SIPP directly to the linked earnings and benefit data, it was decided that the set of variables that could be added without compromising the confidentiality protection of the existing SIPP public use files was so limited that alternative methods had to be used to create a useful new public use file. The committee agreed to allow the Census Bureau to experiment with the confidentiality protection system known generically as "synthetic data." The actual technique adopted is called partially synthetic data with multiple imputation of missing items. As the term is used in this report, "partially synthetic data" means the release of person-level records containing some variables from the actual responses and other variables where the actual responses have been replaced by values sampled from the posterior predictive distribution for that record, conditional on all of the confidential data. 
This final report accompanies the delivery of version 4.0 to SSA as part of the fiscal year 2006 Jointly Financed Cooperative Agreement between the Census Bureau and SSA.

    @TECHREPORT{ssafinal,
    author = {John M. Abowd and Martha Stinson and Gary Benedetto},
    title = {Final Report to the {Social Security Administration} on the {SIPP/SSA/IRS}
    {Public} {Use} {File} {Project}},
    institution = {U.S. Census Bureau},
    year = {2006},
    owner = {vilhuber},
    timestamp = {2013.10.07},
    abstract = {The creation of public use data that combine variables from the Census Bureau's Survey of Income and Program Participation (SIPP), the Internal Revenue Service's (IRS) individual lifetime earnings data, and the Social Security Administration's (SSA) individual benefit data began as part of ongoing collaborative research at the Census Bureau and SSA. The current project had its genesis with the formation of a joint committee containing representatives from the Census Bureau, SSA, IRS, and the Congressional Budget Office (CBO) that designed a prospective public use file. Aimed at a user community that was primarily interested in national retirement and disability programs, the selection of variables for the proposed SIPP/SSA/IRS-PUF focused on the critical demographic data to be supplied from the SIPP, earnings histories from the IRS data maintained at SSA, and benefit data from SSA’s master beneficiary records. After attempting to determine the feasibility of adding a limited number of variables from the SIPP directly to the linked earnings and benefit data, it was decided that the set of variables that could be added without compromising the confidentiality protection of the existing SIPP public use files was so limited that alternative methods had to be used to create a useful new public use file. The committee agreed to allow the Census Bureau to experiment with the confidentiality protection system known generically as "synthetic data." The actual technique adopted is called partially synthetic data with multiple imputation of missing items. As the term is used in this report, "partially synthetic data" means the release of person-level records containing some variables from the actual responses and other variables where the actual responses have been replaced by values sampled from the posterior predictive distribution for that record, conditional on all of the confidential data. 
This final report accompanies the delivery of version 4.0 to SSA as part of the fiscal year 2006 Jointly Financed Cooperative Agreement between the Census Bureau and SSA.},
    oldurl = {http://www2.vrdc.cornell.edu/news/?p=308},
    url = {http://hdl.handle.net/1813/43929}
    }
  • L. B. Reeder, M. Stinson, K. E. Trageser, and L. Vilhuber, "Codebook for the SIPP Synthetic Beta v5.1 [Codebook file]," Cornell Institute for Social and Economic Research and Labor Dynamics Institute [distributor]. Cornell University, Ithaca, NY, USA, DDI-C document, 2014.
    @TECHREPORT{CED2AR-SSBv51,
    author = {Lori B. Reeder and Martha Stinson and Kelly E. Trageser and Lars Vilhuber},
    title = {Codebook for the {SIPP} {S}ynthetic {B}eta v5.1 [Codebook file]},
    institution = {{Cornell Institute for Social and Economic Research} and {Labor Dynamics Institute} [distributor]. Cornell University},
    type = {{DDI-C} document},
    address = {Ithaca, NY, USA},
    year = {2014},
    url = {http://www2.ncrn.cornell.edu/ced2ar-web/codebooks/ssb/v/v51}
    }
  • L. B. Reeder, M. Stinson, K. E. Trageser, and L. Vilhuber, "Codebook for the SIPP Synthetic Beta v6.0 [Codebook file]," Cornell Institute for Social and Economic Research and Labor Dynamics Institute [distributor]. Cornell University, Ithaca, NY, USA, DDI-C document, 2015.
    @TECHREPORT{CED2AR-SSBv6,
    author = {Lori B. Reeder and Martha Stinson and Kelly E. Trageser and Lars Vilhuber},
    title = {Codebook for the {SIPP} {S}ynthetic {B}eta v6.0 [Codebook file]},
    institution = {{Cornell Institute for Social and Economic Research} and {Labor Dynamics Institute} [distributor]. Cornell University},
    type = {{DDI-C} document},
    address = {Ithaca, NY, USA},
    year = {2015},
    url = {http://www2.ncrn.cornell.edu/ced2ar-web/codebooks/ssb/v/v6}
    }
  • L. B. Reeder, M. Stinson, K. E. Trageser, and L. Vilhuber, "Codebook for the SIPP Synthetic Beta v6.0.2 [Codebook file]," Cornell Institute for Social and Economic Research and Labor Dynamics Institute [distributor]. Cornell University, Ithaca, NY, USA, DDI-C document, 2015.
    @TECHREPORT{CED2AR-SSBv602,
    author = {Lori B. Reeder and Martha Stinson and Kelly E. Trageser and Lars Vilhuber},
    title = {Codebook for the {SIPP} {S}ynthetic {B}eta v6.0.2 [Codebook file]},
    institution = {{Cornell Institute for Social and Economic Research} and {Labor Dynamics Institute} [distributor]. Cornell University},
    type = {{DDI-C} document},
    address = {Ithaca, NY, USA},
    year = {2015},
    url = {http://www2.ncrn.cornell.edu/ced2ar-web/codebooks/ssb/v/v602}
    }
  • U.S. Census Bureau, "DRB Memo on Disclosure Testing the SIPP Synthetic Beta," U.S. Census Bureau 2006.

    As the result of a four year joint project between the Census Bureau, the Internal Revenue Service, and the Social Security Administration, the LEHD Program has created an enhanced SIPP file that links a subset of SIPP variables to administrative earnings and benefits data. We have reviewed this file for disclosure risk and here present our results to the Census Disclosure Review Board. We believe that the procedures we used to create the synthetic data conform to the Census Bureau’s disclosure avoidance requirements and request that the DRB grant permission for the file release.

    @TECHREPORT{drbmemnov2006,
    author = {{U.S. Census Bureau}},
    title = {{DRB} Memo on Disclosure Testing the {SIPP} {Synthetic} {Beta}},
    institution = {U.S. Census Bureau},
    year = {2006},
    month = {September 20},
    owner = {vilhuber},
    timestamp = {2013.10.07},
    abstract = {As the result of a four year joint project between the Census Bureau, the Internal
    Revenue Service, and the Social Security Administration, the LEHD Program
    has created an enhanced SIPP file that links a subset of SIPP variables to
    administrative earnings and benefits data. We have reviewed this file for disclosure
    risk and here present our results to the Census Disclosure Review Board. We
    believe that the procedures we used to create the synthetic data conform to the
    Census Bureau’s disclosure avoidance requirements and request that the DRB
    grant permission for the file release.},
    oldurl = {http://www2.vrdc.cornell.edu/news/?p=307},
    url = {http://hdl.handle.net/1813/43928}
    }
  • U.S. Census Bureau, "Codebook for the SIPP Synthetic Beta Version 4.1," U.S. Census Bureau 2007.

    This codebook documents version 4.1 of the SIPP Synthetic Beta (SSB). The SSB is a set of files containing individual-level data synthesized from linked survey and administrative data. The SSB is produced by the US Census Bureau as part of a joint project with the Social Security Administration (SSA), and the Internal Revenue Service (IRS). The goal of the project is to make some of the benefits of linked survey and administrative data available to researchers outside of restricted‐access Census Bureau facilities in a manner that protects the confidentiality of the underlying data.

    @TECHREPORT{technicaldescriptionsippsyntheticbetaoct42007,
    author = {{U.S. Census Bureau}},
    title = {Codebook for the {SIPP} {Synthetic} {Beta} Version 4.1},
    institution = {U.S. Census Bureau},
    year = {2007},
    month = {October},
    abstract = {This codebook documents version 4.1 of the SIPP Synthetic Beta (SSB).
    The SSB is a set of files containing individual-level data synthesized
    from linked survey and administrative data. The SSB is produced by
    the US Census Bureau as part of a joint project with the Social Security
    Administration (SSA), and the Internal Revenue Service (IRS). The
    goal of the project is to make some of the benefits of linked survey
    and administrative data available to researchers outside of restricted‐access
    Census Bureau facilities in a manner that protects the confidentiality
    of the underlying data.},
    owner = {vilhuber},
    timestamp = {2013.10.07},
    oldurl = {http://www.census.gov/sipp/technicaldescriptionsippsyntheticbetaoct42007.pdf},
    url = {http://hdl.handle.net/1813/43927}
    }
  • U.S. Census Bureau, "DRB Memo September 20, 2010," U.S. Census Bureau 2010.
    @TECHREPORT{drbmemo2010,
    author = {{U.S. Census Bureau}},
    title = {{DRB} {M}emo {S}eptember 20, 2010},
    institution = {U.S. Census Bureau},
    year = {2010},
    month = {September 20},
    owner = {vilhuber},
    timestamp = {2013.10.07},
    oldurl = {http://www2.vrdc.cornell.edu/news/wp-content/uploads/2011/01/DRBMemoSeptember202010.pdf},
    url = {http://hdl.handle.net/1813/43926}
    }
  • U.S. Census Bureau, "Codebook for SIPP Synthetic Beta version 5.0," U.S. Census Bureau 2010.

    This codebook documents version 5.0 of the SIPP Synthetic Beta (SSB). The SSB is a set of files containing individual-level data synthesized from linked survey and administrative data. The SSB is produced by the US Census Bureau as part of a joint project with the Social Security Administration (SSA), and the Internal Revenue Service (IRS). The goal of the project is to make some of the benefits of linked survey and administrative data available to researchers outside of restricted‐access Census Bureau facilities in a manner that protects the confidentiality of the underlying data.

    @TECHREPORT{ssb_codebook,
    author = {{U.S. Census Bureau}},
    title = {Codebook for {SIPP} {Synthetic} {Beta} version 5.0},
    institution = {U.S. Census Bureau},
    year = {2010},
    abstract = {This codebook documents version 5.0 of the SIPP Synthetic Beta (SSB).
    The SSB is a set of files containing individual-level data synthesized
    from linked survey and administrative data. The SSB is produced by
    the US Census Bureau as part of a joint project with the Social Security
    Administration (SSA), and the Internal Revenue Service (IRS). The
    goal of the project is to make some of the benefits of linked survey
    and administrative data available to researchers outside of restricted‐access
    Census Bureau facilities in a manner that protects the confidentiality
    of the underlying data.},
    comment = {Original location: http://www.census.gov/sipp/SSB_Codebook.pdf},
    owner = {vilhuber},
    timestamp = {2013.10.07},
    oldurl = {http://www2.vrdc.cornell.edu/news/wp-content/uploads/2011/01/SSB_Codebook.pdf},
    url = {http://hdl.handle.net/1813/43925}
    }
  • U.S. Census Bureau, "Codebook for SIPP Synthetic Beta version 5.1," U.S. Census Bureau 2013.

    This codebook documents version 5.0 of the SIPP Synthetic Beta (SSB). The SSB is a set of files containing individual-level data synthesized from linked survey and administrative data. The SSB is produced by the US Census Bureau as part of a joint project with the Social Security Administration (SSA), and the Internal Revenue Service (IRS). The goal of the project is to make some of the benefits of linked survey and administrative data available to researchers outside of restricted‐access Census Bureau facilities in a manner that protects the confidentiality of the underlying data.

    @TECHREPORT{ssb_v5_1_codebook,
    author = {{U.S. Census Bureau}},
    title = {Codebook for {SIPP} {Synthetic} {Beta} version 5.1},
    institution = {U.S. Census Bureau},
    year = {2013},
    abstract = {This codebook documents version 5.0 of the SIPP Synthetic Beta (SSB).
    The SSB is a set of files containing individual-level data synthesized
    from linked survey and administrative data. The SSB is produced by
    the US Census Bureau as part of a joint project with the Social Security
    Administration (SSA), and the Internal Revenue Service (IRS). The
    goal of the project is to make some of the benefits of linked survey
    and administrative data available to researchers outside of restricted‐access
    Census Bureau facilities in a manner that protects the confidentiality
    of the underlying data.},
    owner = {vilhuber},
    timestamp = {2013.10.07},
    url = {http://hdl.handle.net/1813/42335}
    }
  • U.S. Census Bureau, "Disclosure Review Board Memo: Second Request for Release of SIPP Synthetic Beta Version 6.0," U.S. Census Bureau 2015.
    @TECHREPORT{drbmemo2015,
    author = {{U.S. Census Bureau}},
    title = {Disclosure Review Board Memo: {S}econd Request for Release of {SIPP} {S}ynthetic
    {B}eta Version 6.0},
    institution = {U.S. Census Bureau},
    year = {2015},
    month = {January 15},
    owner = {vilhuber},
    timestamp = {2015.03.13},
    comment = {Original location http://www.census.gov/content/dam/Census/programs-surveys/sipp/methodology/DRBMemoTablesVersion2SSBv6_0.pdf},
    url = {http://hdl.handle.net/1813/42334}
    }
  • G. Benedetto, M. Stinson, and J. M. Abowd, "The Creation and Use of the SIPP Synthetic Beta," U.S. Census Bureau 2013.
    @TECHREPORT{CreationSSB,
    author = {Gary Benedetto and Martha Stinson and John M. Abowd},
    title = {The Creation and Use of the {SIPP} {Synthetic} {Beta}},
    institution = {U.S. Census Bureau},
    year = {2013},
    month = apr,
    oldurl={http://www.census.gov/content/dam/Census/programs-surveys/sipp/methodology/SSBdescribe_nontechnical.pdf},
    url = {http://hdl.handle.net/1813/43924}
    }

Synthetic LBD

  • S. K. Kinney, J. P. Reiter, A. P. Reznek, J. Miranda, R. S. Jarmin, and J. M. Abowd, "Towards Unrestricted Public Use Business Microdata: The Synthetic Longitudinal Business Database," Center for Economic Studies, U.S. Census Bureau, Working Papers 11-04, 2011.

    In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments' confidentiality. One approach with the potential for overcoming these risks is to release synthetic data; that is, the released establishment data are simulated from statistical models designed to mimic the distributions of the underlying real microdata. In this article, we describe an application of this strategy to create a public use file for the Longitudinal Business Database, an annual economic census of establishments in the United States comprising more than 20 million records dating back to 1976. The U.S. Bureau of the Census and the Internal Revenue Service recently approved the release of these synthetic microdata for public use, making the synthetic Longitudinal Business Database the first-ever business microdata set publicly released in the United States. We describe how we created the synthetic data, evaluated analytical validity, and assessed disclosure risk.

    @TECHREPORT{CES-WP-11-04,
    author = {Satkartar K. Kinney and Jerome P. Reiter and Arnold P. Reznek and
    Javier Miranda and Ron S. Jarmin and John M. Abowd},
    title = {Towards Unrestricted Public Use Business Microdata: The {Synthetic}
    {Longitudinal} {Business} {Database}},
    institution = {Center for Economic Studies, U.S. Census Bureau},
    year = {2011},
    type = {Working Papers},
    number = {11-04},
    month = Feb,
    abstract = {In most countries, national statistical agencies do not release establishment-level
    business microdata, because doing so represents too large a risk
    to establishments' confidentiality. One approach with the potential
    for overcoming these risks is to release synthetic data; that is,
    the released establishment data are simulated from statistical models
    designed to mimic the distributions of the underlying real microdata.
    In this article, we describe an application of this strategy to create
    a public use file for the Longitudinal Business Database, an annual
    economic census of establishments in the United States comprising
    more than 20 million records dating back to 1976. The U.S. Bureau
    of the Census and the Internal Revenue Service recently approved
    the release of these synthetic microdata for public use, making the
    synthetic Longitudinal Business Database the first-ever business
    microdata set publicly released in the United States. We describe
    how we created the synthetic data, evaluated analytical validity,
    and assessed disclosure risk.},
    owner = {vilhuber},
    timestamp = {2013.10.14},
    url = {http://ideas.repec.org/p/cen/wpaper/11-04.html}
    }
  • S. K. Kinney, J. P. Reiter, A. P. Reznek, J. Miranda, R. S. Jarmin, and J. M. Abowd, "Towards Unrestricted Public Use Business Microdata: The Synthetic Longitudinal Business Database," International Statistical Review, vol. 79, iss. 3, pp. 362-384, 2011.

    In most countries, national statistical agencies do not release establishment-level business microdata, because doing so represents too large a risk to establishments' confidentiality. One approach with the potential for overcoming these risks is to release synthetic data; that is, the released establishment data are simulated from statistical models designed to mimic the distributions of the underlying real microdata. In this article, we describe an application of this strategy to create a public use file for the Longitudinal Business Database, an annual economic census of establishments in the United States comprising more than 20 million records dating back to 1976. The U.S. Bureau of the Census and the Internal Revenue Service recently approved the release of these synthetic microdata for public use, making the synthetic Longitudinal Business Database the first-ever business microdata set publicly released in the United States. We describe how we created the synthetic data, evaluated analytical validity, and assessed disclosure risk.

    @ARTICLE{KinneyEtAl2011,
    author = {Kinney, Satkartar K. and Reiter, Jerome P. and Reznek, Arnold P.
    and Miranda, Javier and Jarmin, Ron S. and Abowd, John M.},
    title = {Towards Unrestricted Public Use Business Microdata: The {Synthetic}
    {Longitudinal} {Business} {Database}},
    journal = {International Statistical Review},
    year = {2011},
    volume = {79},
    pages = {362--384},
    number = {3},
    doi = {10.1111/j.1751-5823.2011.00153.x},
    issn = {1751-5823},
    keywords = {Economic census, data confidentiality, synthetic data, disclosure
    limitation},
    owner = {vilhuber},
    publisher = {Blackwell Publishing Ltd},
    timestamp = {2012.09.04},
    abstract = {In most countries, national statistical agencies do not release establishment-level
    business microdata, because doing so represents too large a risk
    to establishments' confidentiality. One approach with the potential
    for overcoming these risks is to release synthetic data; that is,
    the released establishment data are simulated from statistical models
    designed to mimic the distributions of the underlying real microdata.
    In this article, we describe an application of this strategy to create
    a public use file for the Longitudinal Business Database, an annual
    economic census of establishments in the United States comprising
    more than 20 million records dating back to 1976. The U.S. Bureau
    of the Census and the Internal Revenue Service recently approved
    the release of these synthetic microdata for public use, making the
    synthetic Longitudinal Business Database the first-ever business
    microdata set publicly released in the United States. We describe
    how we created the synthetic data, evaluated analytical validity,
    and assessed disclosure risk.},
    url = {http://dx.doi.org/10.1111/j.1751-5823.2011.00153.x}
    }
  • J. Miranda, "LBD Codebook," U.S. Census Bureau, mimeo, 2011.
    @TECHREPORT{LBD_Codebook,
    author = {Javier Miranda},
    title = {{LBD} Codebook},
    institution = {U.S. Census Bureau},
    year = {2011},
    type = {mimeo},
    owner = {vilhuber},
    timestamp = {2013.10.14},
    }
  • J. Miranda, "SynLBD Codebook," U.S. Census Bureau, mimeo, 2011.
    @TECHREPORT{SynLBD_Codebook,
    author = {Javier Miranda},
    title = {{SynLBD} Codebook},
    institution = {U.S. Census Bureau},
    year = {2011},
    type = {mimeo},
    owner = {vilhuber},
    url = {http://www.census.gov/ces/pdf/SynLBD_Codebook.pdf},
    timestamp = {2013.10.14}
    }
  • J. Miranda and R. Jarmin, "The Longitudinal Business Database," U.S. Census Bureau, Center for Economic Studies, Discussion Paper CES-WP-02-17, 2002.

    The LBD is a research dataset constructed at the Census Bureau's Center for Economic Studies. The LBD is an establishment based file created by linking the annual snapshot files from Census Bureau's Business Register over time. It contains high quality longitudinal establishment linkages. Firm level linkages are currently under development at CES. The LBD contains several basic data items such as firm ownership, location, industry, payroll and employment.

    @TECHREPORT{MirandaJarmin2002,
    author = {Javier Miranda and Ron Jarmin},
    title = {The {Longitudinal} {Business} {Database}},
    institution = {U.S. Census Bureau, Center for Economic Studies},
    year = {2002},
    type = {Discussion Paper},
    number = {CES-WP-02-17},
    abstract = {The LBD is a research dataset constructed at the Census Bureau's Center
    for Economic Studies. The LBD is an establishment based file created
    by linking the annual snapshot files from Census Bureau's Business
    Register over time. It contains high quality longitudinal establishment
    linkages. Firm level linkages are currently under development at
    CES. The LBD contains several basic data items such as firm ownership,
    location, industry, payroll and employment.},
    owner = {vilhuber},
    timestamp = {2009.09.25},
    url = {http://ideas.repec.org/p/cen/wpaper/02-17.html}
    }
  • S. K. Kinney, J. P. Reiter, A. P. Reznek, J. Miranda, R. S. Jarmin, and J. M. Abowd, "Appendix to 'Towards Unrestricted Public Use Business Microdata: The Synthetic Longitudinal Business Database'," Center for Economic Studies, U.S. Census Bureau, online document, 2011.
    @TECHREPORT{Kinney_et_al_2011_Appendix,
    author = {Kinney, Satkartar K. and Reiter, Jerome P. and Reznek, Arnold P.
    and Miranda, Javier and Jarmin, Ron S. and Abowd, John M.},
    title = {Appendix to '{T}owards Unrestricted Public Use Business Microdata: The {Synthetic}
    {Longitudinal} {Business} {Database}'},
    institution = {Center for Economic Studies, U.S. Census Bureau},
    year = {2011},
    type = {online document},
    keywords = {Economic census, data confidentiality, synthetic data, disclosure
    limitation},
    owner = {vilhuber},
    url = {https://www.census.gov/ces/pdf/SynLBD_Kinney_et_al_2011_Appendix.pdf}
    }
  • L. Vilhuber, "Codebook for the Synthetic LBD Version 2.0 [Codebook file]," Comprehensive Extensible Data Documentation and Access Repository (CED2AR), Cornell Institute for Social and Economic Research and Labor Dynamics Institute [distributor]. Cornell University, Ithaca, NY, USA, DDI-C document, 2013.
    @TECHREPORT{CED2AR-SynLBDv2,
    author = {Lars Vilhuber},
    title = {Codebook for the {S}ynthetic {LBD} Version 2.0 [Codebook file]},
    institution = {{Comprehensive Extensible Data Documentation and Access Repository (CED2AR)}, Cornell Institute for Social and Economic Research and Labor Dynamics Institute [distributor]. Cornell University},
    type = {DDI-C document},
    address = {Ithaca, NY, USA},
    year = {2013},
    url = {http://www2.ncrn.cornell.edu/ced2ar-web/codebooks/synlbd/v/v2}
    }

Complete Bibtex files

SSB-documentation and SynLBD-documentation

Citations of the datasets themselves

SIPP Synthetic Beta

  • U.S. Census Bureau, "SIPP Synthetic Beta Version 6.0.2," U.S. Census Bureau [producer] and Cornell University, Synthetic Data Server [distributor], Washington, DC and Ithaca, NY, USA, [Computer file], 2015.
    @TECHREPORT{SSB602,
    author = {{U.S. Census Bureau}},
    title = {{SIPP} {S}ynthetic {B}eta Version 6.0.2},
    institution = {{U.S. Census Bureau} [producer] and Cornell University, Synthetic Data Server
    [distributor]},
    year = {2015},
    type = {[Computer file]},
    address = {Washington, DC and Ithaca, NY, USA},
    howpublished = {Computer file},
    organization = {Cornell University, Synthetic Data Server [distributor]},
    owner = {vilhuber},
    timestamp = {2015.01.10},
    url = {http://www2.vrdc.cornell.edu/news/data/sipp-synthetic-beta-file/}
    }
  • U.S. Census Bureau, "SIPP Synthetic Beta Version 6.0," U.S. Census Bureau [producer] and Cornell University, Synthetic Data Server [distributor], Washington, DC and Ithaca, NY, USA, [Computer file], 2015.
    @TECHREPORT{SSB6,
    author = {{U.S. Census Bureau}},
    title = {{SIPP} {S}ynthetic {B}eta Version 6.0},
    institution = {{U.S. Census Bureau} [producer] and Cornell University, Synthetic Data Server
    [distributor]},
    year = {2015},
    type = {[Computer file]},
    address = {Washington, DC and Ithaca, NY, USA},
    howpublished = {Computer file},
    organization = {Cornell University, Synthetic Data Server [distributor]},
    owner = {vilhuber},
    timestamp = {2015.01.10},
    url = {http://www2.vrdc.cornell.edu/news/data/sipp-synthetic-beta-file/}
    }
  • U.S. Census Bureau, "SIPP Synthetic Beta Version 5.1," U.S. Census Bureau [producer] and Cornell University, Synthetic Data Server [distributor], Washington, DC and Ithaca, NY, USA, [Computer file], 2013.
    @TECHREPORT{SSB5.1,
    author = {{U.S. Census Bureau}},
    title = {{SIPP} {S}ynthetic {B}eta Version 5.1},
    institution = {{U.S. Census Bureau} [producer] and Cornell University, Synthetic Data Server
    [distributor]},
    year = {2013},
    type = {[Computer file]},
    address = {Washington, DC and Ithaca, NY, USA},
    howpublished = {Computer file},
    organization = {Cornell University, Synthetic Data Server [distributor]},
    owner = {vilhuber},
    timestamp = {2013.06.10},
    url = {http://www2.vrdc.cornell.edu/news/data/sipp-synthetic-beta-file/}
    }
  • U.S. Census Bureau, "SIPP Synthetic Beta Version 5.0," U.S. Census Bureau [producer] and Cornell University, Synthetic Data Server [distributor], Washington, DC and Ithaca, NY, USA, [Computer file], 2011.
    @TECHREPORT{SSB5.0,
    author = {{U.S. Census Bureau}},
    title = {{SIPP} {S}ynthetic {B}eta Version 5.0},
    institution = {{U.S. Census Bureau} [producer] and Cornell University, Synthetic Data Server
    [distributor]},
    year = {2011},
    type = {[Computer file]},
    address = {Washington, DC and Ithaca, NY, USA},
    howpublished = {Computer file},
    organization = {Cornell University, Synthetic Data Server [distributor]},
    owner = {vilhuber},
    timestamp = {2013.06.10},
    url = {http://www2.vrdc.cornell.edu/news/data/sipp-synthetic-beta-file/}
    }

Synthetic LBD

  • U.S. Census Bureau, "Synthetic LBD Beta Version 2.0," U.S. Census Bureau and Cornell University, Synthetic Data Server [distributor], Washington, DC and Ithaca, NY, USA, [Computer file], 2011.

    The Synthetic LBD Beta Data Product (SynLBD) is an experimental data product produced by the U.S. Census Bureau in collaboration with Duke University, Cornell University, the National Institute of Statistical Sciences (NISS), the Internal Revenue Service (IRS) and the National Science Foundation (NSF). The purpose of the SynLBD is to provide users with access to a longitudinal business data product that can be used outside of a secure Census Bureau facility. The Census Bureau created version 2 of the SynLBD by synthesizing information on establishments' employment and payroll, establishments' birth and death years, and industrial classification. The Census Disclosure Review Board and their counterparts at IRS have reviewed the content of the file, and allowed the release of these data for public use.

    @TECHREPORT{SynLBD20,
    author = {{U.S. Census Bureau}},
    title = {Synthetic {LBD} {Beta} Version 2.0},
    institution = {{U.S. Census Bureau} and Cornell University, Synthetic Data Server
    [distributor]},
    year = {2011},
    type = {[Computer file]},
    address = {Washington, DC and Ithaca, NY, USA},
    abstract = {The Synthetic LBD Beta Data Product (SynLBD) is an experimental data
    product produced by the U.S. Census Bureau in collaboration with
    Duke University, Cornell University, the National Institute of Statistical
    Sciences (NISS), the Internal Revenue Service (IRS) and the National
    Science Foundation (NSF). The purpose of the SynLBD is to provide
    users with access to a longitudinal business data product that can
    be used outside of a secure Census Bureau facility. The Census Bureau
    created version 2 of the SynLBD by synthesizing information on establishments'
    employment and payroll, establishments' birth and death years, and
    industrial classification. The Census Disclosure Review Board and
    their counterparts at IRS have reviewed the content of the file,
    and allowed the release of these data for public use.},
    howpublished = {Computer file},
    organization = {Cornell University, Synthetic Data Server [distributor]},
    owner = {vilhuber},
    timestamp = {2013.06.10},
    url = {http://www2.vrdc.cornell.edu/news/data/lbd-synthetic-data/}
    }

Complete Bibtex files

SSB data citations and SynLBD data citations