The Census Bureau is engaged in a number of innovative disclosure avoidance research activities, often with the assistance of leading academic experts. One such activity is the development of a synthetic public use version of the Longitudinal Business Database (LBD). As part of this project, the Census Bureau would like to begin reaching out to data users, both to familiarize them with the benefits of such a product and to educate them on how best to use it. The IRS and the Census Disclosure Review Board have approved the release of a preliminary “beta” synthetic version of the Longitudinal Business Database (known as the “LBD Synthetic Beta”) for use at the Cornell Virtual Research Data Center (VRDC).
Description of LBD
The LBD currently covers all private non-farm business establishments with paid employees for the years 1975 through 2005. It is constructed by linking annual snapshots of the Census Bureau’s Business Register. The Center for Economic Studies (CES) at the Census Bureau has added considerable value to the LBD by improving longitudinal linkages, retiming multi-unit establishment births, and dealing with missing data. Originally developed as a research dataset, the LBD is now the most widely used dataset in the Census Bureau’s Research Data Centers (RDCs). It is used for analysis of business dynamics (e.g., births, deaths, job creation, and job destruction). It is also the basis for the Business Demography Series, a new set of publicly available tabulations that CES is now developing.
Purpose of the LBD Synthetic Beta v1
The Census Bureau is interested in constructing a public release synthetic LBD for several reasons. First, there are growing concerns about confidentiality protection, and synthetic data is one proposed solution to this problem. This project is a test of synthetic data in a business data setting. Second, a synthetic LBD will satisfy the needs of at least a subset of data users. For example, a synthetic LBD could be useful for performing international comparisons of business dynamics. By meeting those needs directly, it could save resources at the Census Bureau and reduce disclosure risks by reducing the number of special tabulations the Census Bureau carries out.
The LBD Synthetic Beta v1 is an early version of the synthetic LBD. It covers only one three-digit SIC industry (573, Radio, TV, Consumer Electronics, and Music Stores), whereas the full synthetic LBD will cover all industries. The Census Bureau has created five implicates (i.e., separate synthetic versions of the data), each covering 26 years (1976-2001). The file contains only the following variables: SIC (always 573 for this file), an imputation number, a synthetic LBD number (a random sequence number), reference year, March 12 employment and annual payroll (in thousands) for the reference year, and the establishment’s first and last years of existence on the file. Documentation is available here. For reference, this document also reports the structure of the internal-use LBD, zero-observation versions of which are available on the VirtualRDC.
Although analytical validity was not the main concern in constructing the LBD Synthetic Beta, this beta version will still be useful. It will familiarize users with the structure and use of the LBD file, and it will allow them to write programs that could be run on the full LBD. It will also build confidence that the Census Bureau can release such files. The U.S. Census Bureau Disclosure Review Board and the IRS Disclosure Officer have certified that the LBD Synthetic Beta does not disclose any Census or IRS confidential data.
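As an illustration of the kind of program a user might write against a file with this structure, the following is a minimal sketch in Python that computes gross job creation and destruction separately for each implicate and then averages the results across implicates. The record layout, the establishment identifiers, and all employment values are hypothetical and are not taken from the LBD Synthetic Beta; the helper name job_flows is likewise an assumption made for this example. Treating an establishment that is absent in a year as having zero employment is a simplifying assumption for births and deaths.

```python
from collections import defaultdict

# Hypothetical records: (implicate, establishment id, year, March 12 employment).
# These values are invented for illustration; they are NOT from the actual file.
records = [
    (1, "A", 1976, 10), (1, "A", 1977, 14),
    (1, "B", 1976, 8),  (1, "B", 1977, 5),
    (1, "C", 1977, 3),                      # birth in 1977
    (2, "A", 1976, 12), (2, "A", 1977, 12),
    (2, "B", 1976, 6),                      # death after 1976
]

def job_flows(records, year):
    """Return {implicate: (job_creation, job_destruction)} from year-1 to year."""
    # Employment history per (implicate, establishment).
    emp = defaultdict(dict)
    for implicate, estab, yr, employment in records:
        emp[(implicate, estab)][yr] = employment

    flows = defaultdict(lambda: [0, 0])
    for (implicate, _estab), by_year in emp.items():
        if year not in by_year and (year - 1) not in by_year:
            continue  # establishment not active in either year
        # Assumption: an absent establishment-year counts as zero employment.
        change = by_year.get(year, 0) - by_year.get(year - 1, 0)
        if change > 0:
            flows[implicate][0] += change       # gross job creation
        else:
            flows[implicate][1] += -change      # gross job destruction
    return {implicate: tuple(v) for implicate, v in flows.items()}

per_implicate = job_flows(records, 1977)

# Combine across implicates by averaging the per-implicate point estimates.
n = len(per_implicate)
mean_creation = sum(c for c, _ in per_implicate.values()) / n
mean_destruction = sum(d for _, d in per_implicate.values()) / n
print(per_implicate, mean_creation, mean_destruction)
```

Running the analysis once per implicate and then combining the results, rather than pooling all implicates into one dataset, keeps each synthetic version of the data separate; the simple average shown here covers only the point estimate and says nothing about its variance.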
Accessing the LBD Synthetic Beta file v1
The synthetic LBD data can only be accessed when logged onto one of the SSG compute nodes at the VirtualRDC, where it is stored under
To obtain an account on the SSG, please consult this page. As a reminder, the complete data for the internal-use LBD can only be accessed through the Census Research Data Center network; consult http://www.ces.census.gov for more details. The ongoing work to improve the LBD Synthetic Beta file takes place on the NSF-funded RDC supercomputer.