This page describes how to use the SDS. Once your results are computed on the SDS, you will want to validate them against the confidential data; please see the page on validation for details.
Important: No internet access when logged on
Please note that the Synthetic Data Server is a restricted-access server. You can log on to the SDS from anywhere you have internet access, but while you are logged on to the server:
- you cannot transfer programs or data to or from the server; contact the data providers to perform these transfers for you. The key reason is that the data you are accessing is not intended for distribution: your access to the data is limited to the time you are logged on to the server.
- you cannot download programs, code, modules, packages, or auxiliary data automatically from within your programs. This is true for R, SAS, Stata, and Python. See below for how to address this for packages in R and Stata. For any other data, you will need to identify the precise source and nature of the upload, and request that it be uploaded for you. This restriction mirrors similar restrictions on the validation server and enforces strong replicability: in order to validate your analysis, it must be completely replicable.
Filesystem layout
The main filesystem is $HOME (7TB). The directory structure replicates the typical Census RDC node, on which both synthetic data and completed gold standard data reside:
- /temporary/ for scratch space (for both SAS and Stata)
- /rdcprojects/co/co00517 (for SSB)
- /rdcprojects/tr/tr00612 (for SynLBD)
Data
The most current data can change over time; the most reliable indicator is to inspect the data directories of each project:
- SSB data resides in /rdcprojects/co/co00517/SSB/data/
- SynLBD resides in /rdcprojects/tr/tr00612/data/synlbd/
Data documentation on these datasets can be found at https://www2.ncrn.cornell.edu/ced2ar-web/.
Additional public-use data is accessible under /data
- Zero-obs datasets from the Census RDC are available at /data/virtualrdc/ in locations otherwise corresponding to their locations on the Census RDC; e.g., /economic/cbo/microdata is where the CBO files would be on the Census RDC, and /data/virtualrdc/economic/cbo/microdata is where they are found on the SDS.
- Cleaned and ready-to-use (generally as SAS files) public-use data for a variety of data sources can be found under /data/clean/(NAME), with accompanying documentation under /data/doc/(NAME). If you notice anything out of date, please let us know.
- Note that if you use these data in your SDS-based analysis and request validation by the data owner, you need to explicitly identify these data, as they may NOT be available at the data owner's compute server.
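For example, a cleaned public-use dataset could be read in SAS along these lines; the dataset name "cps" is hypothetical, so substitute the actual (NAME) and check /data/doc/(NAME) for its documentation:

```sas
* Point a libref at a cleaned public-use data directory ("cps" is a
* hypothetical (NAME)) and list the datasets it contains;
libname pubuse "/data/clean/cps";
proc contents data=pubuse._all_ nods; run;
```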
User-created programs
Users should create programs OUTSIDE of their home directories (see backup policy below). Create a directory for your project under
- /rdcprojects/co/co00517/SSB/programs/users/(LOGIN ID) (for SSB)
- /rdcprojects/tr/tr00612/programs/users/(LOGIN ID) (for SynLBD)
This ensures ease of validation on the Census internal computers. See below for suggested programming practices.
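A minimal sketch of this setup, assuming a hypothetical login id "jdoe1" (ROOT would be /rdcprojects on the SDS itself; it is a variable here only so the sketch can be rehearsed on any machine):

```shell
# Create per-project program directories for a (hypothetical) login id "jdoe1".
ROOT="${ROOT:-/tmp/sds-demo/rdcprojects}"
mkdir -p "$ROOT/co/co00517/SSB/programs/users/jdoe1"
mkdir -p "$ROOT/tr/tr00612/programs/users/jdoe1"
```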
Statistical and other software
SDSx uses job scheduling software. If you need to learn about qsub, please see various tutorials about using qsub, including our own qsub page. However, we also have a few convenience commands (our 'q-commands' and 'i-commands'), which automatically submit jobs to the queue using the appropriate qsub commands (for general queues, and interactive queues, respectively). Available queues can be found on the SDSx queue page.
We also have a short tutorial you may want to consult.
| Software | Versions | Commandline (on compute node) | Module (help) | Qsub-aware command (head-node) | Availability |
|----------|----------|-------------------------------|---------------|--------------------------------|--------------|
| SAS | 9.4 | sas | sas, sas/(VERSION) | qsas, isas | compute-1-1 (use job queue 'sas') |
| Stata (SE, MP) | 14 | stata(-mp,-se), xstata(-mp,-se) | stata, stata/se, stata/mp | qstata(-mp,-se), iStata | Compute nodes |
| R | 3.0.1, 3.0.2-ACML, 3.2.0-ACML | R, Rscript | R, R/ACML | qR, iR | Compute nodes |
| RStudio | 0.98 (using R 3.0.1) | rstudio | n.a. | iRstudio | Compute nodes |
| Matlab | R2013b, R2014b | matlab | matlab, matlab/(VERSION) | qmatlab, imatlab | Compute nodes |
| Octave | 3.4.3 | octave | | qoctave, ioctave | On demand |
| GRASS | 6.4.2 | grass | | | On demand |
| SPSS | 21.0.0 | spss | | | |
| ASReml (also related R package) | 3.00 [01 Jan 2009] | asreml | | iasreml | On demand |
Interactive usage of software
You will find many of these software packages in the Gnome menu, under "Statistics". You can also launch them using the 'i-command' version (e.g., 'iStata' to launch Stata). However, all instances run from the menu or via an 'i-command' run in the interactive queue, and are subject to limits on CPUs, memory, and runtime:
- Wallclock limit: 2 hours
- Job/user limit: 1
- Memory limit per job: 4GB
Longer-running jobs need to be submitted from the command line. The interactive versions should be considered appropriate for debugging, but not for full computational jobs.
Batch submission of software
For longer-running jobs, users should use the 'q-commands'. Default runtimes, memory limits, and numbers of CPUs are noted on the SDSx queue page. Most 'q-commands' take "chunks" as arguments, where one chunk is 2 CPUs and 8GB of memory. For differing requirements (for instance, longer-running jobs), custom qsub scripts can be used; see the qsub page for more details. To monitor jobs in the queue, as well as your own jobs during processing, use qstat. A graphical utility wrapped around qstat is available.
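A minimal sketch of such a custom batch script, using standard PBS-style directives; the resource syntax, walltime, and the Stata invocation are assumptions to check against the qsub page:

```shell
# Write a simple batch script requesting one "chunk" (2 CPUs, 8GB) for 24 hours.
cat > myjob.sh <<'EOF'
#!/bin/bash
#PBS -N my-analysis
#PBS -l nodes=1:ppn=2,mem=8gb
#PBS -l walltime=24:00:00
cd "$PBS_O_WORKDIR"
stata-mp -b do myanalysis.do
EOF
# Submit from the head node with:  qsub myjob.sh
# Monitor with:                    qstat
```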
Suggested programming practices
When validating results against the confidential data, the data custodians will use the same programs, but certain aspects of the environment will differ:
- Exact filenames may differ ("synlbd" vs. "lbd")
- Exact paths may change, although relative file structures are expected to be constant (by design)
- Available add-on packages may be limited or not installed by default.
Think of the validation as a replication exercise, where your analysis is replicated by a different person in a somewhat different, constrained environment.
Paths and filenames
The most robust way to ensure ease of replication is to NEVER hard-code paths. Suggested practice is to use macro/global variables to encode such paths:
Define the path prefix once (as a macro variable in SAS, a global macro in Stata, or a variable in R), and build all other paths from it.
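As a hedged sketch of this practice (R shown; SAS would use a %let macro variable, Stata a global macro; the file name below is hypothetical):

```r
## Define the project prefix once; derive every other path from it.
## Changing this single line then adapts the program to the validation
## environment, where the prefix may differ.
prefix  <- "/rdcprojects/co/co00517/SSB"
datadir <- file.path(prefix, "data")
infile  <- file.path(datadir, "ssb_example.sas7bdat")  # hypothetical file name
```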
Packages
R: identify the packages your code uses (e.g., with sessionInfo()). We regularly add certain R packages, and mirror CRAN, but if you need anything in particular, please contact the Help Desk. Since you cannot access the internet from within the SDSx, we will need to transfer the R packages for you (or update the CRAN mirror). You should include the following code at the TOP of your code (or in a setup R script that is run before all other R code):
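A hedged sketch of such a setup chunk (the library location is an assumption; the actual code to use on the SDSx may differ):

```r
## Prepend a user-writable library directory so that packages transferred
## for you are found before any system-wide installation.
.libPaths(c(file.path(Sys.getenv("HOME"), "R", "library"), .libPaths()))
## Then load packages as usual, e.g.:
## library(survey)   # example package; must already have been transferred
```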
Stata: user-written ado-files can be made available by placing them on the adopath. For more information, see http://www.stata.com/help.cgi?adopath
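A hedged sketch of the equivalent setup in Stata (the directory and the login id "jdoe1" are hypothetical):

```stata
* Put transferred ado-files on the adopath so user-written commands are found.
adopath + "/rdcprojects/co/co00517/SSB/programs/users/jdoe1/ado"
```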
System
The SDSx cluster is configured as follows. Note: there may be limits on accessing these resources. All access on SDSx is channeled through queues; see the queue configuration page for more details.

| Names | Processor | Number of processors | Cores per processor | Total cores, all nodes | Clockspeed | Memory per node | Resource set |
|-------|-----------|----------------------|---------------------|------------------------|------------|-----------------|--------------|
| | | | | | | 64GB | Login only |
| | AMD 6380 | 2 | 16 | 96 | 2.5GHz | 256GB | All |
| | | | | 128 | | 832 GB | |
Backup
Due to the restricted-access nature of the server, we provide backup of critical files. However, we do not back up all files on the system, so to ensure that your critical programs get backed up, please note the following backup policy:
- Your home directory (/home/(userid)) and your desktop are NOT backed up.
- /rdcprojects/co/co00517 and /rdcprojects/tr/tr00612 are generally backed up, but user-created data files (in the user/ directories) may be excluded in the future.
- The program directories /rdcprojects/{co,tr}/{co00517,tr00612}/.../programs/users are ALWAYS backed up.
Keeping informed
You will be notified by the Cornell Center for Advanced Computing (CAC) of any downtimes of the SDS cluster. You can unsubscribe from CAC's mailing list by closing your account on the SDS. For updates on data, you might receive email from an announcement-only mailing list (virtualrdc-sds-l@cornell.edu). If you wish to be notified at a different email address, send an email to listmanager@list.cornell.edu with "subscribe virtualrdc-sds-l" in the body of the message. To unsubscribe, send an email to listmanager@list.cornell.edu with "unsubscribe virtualrdc-sds-l" in the body of the message.
Getting help
If you need further assistance, please consult our Help page on how best to direct your inquiry.