


Step 4 - Using the SDS

Filesystem layout

The main filesystem is $HOME (7TB). The directory structure replicates that of a typical Census RDC node, on which both the synthetic data and the completed gold standard data reside:

  • /temporary/ for scratch space (for both SAS and Stata)
  • /rdcprojects/co/co00517 (for SSB)
  • /rdcprojects/tr/tr00612 (for SynLBD)

Data

The most current version of the data can change over time; the most reliable way to identify it is to inspect the data directories of each project:

  • SSB data resides in /rdcprojects/co/co00517/SSB/data/
  • SynLBD resides in /rdcprojects/tr/tr00612/data/synlbd/
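
One way to inspect those directories from a shell is sketched below. It assumes that releases sit in subdirectories named by dotted version number (as the 2.0.2 in the SynLBD examples further down this page suggests), which may not hold for every project. latest_version is a hypothetical helper name, not an SDS command.

```shell
#!/bin/sh
# Hypothetical helper: print the highest dotted version number found among
# the subdirectories of a data directory (assumes names such as 2.0.2).
latest_version() {
    # $1: data directory, e.g. /rdcprojects/tr/tr00612/data/synlbd
    for d in "$1"/*/; do
        if [ -d "$d" ]; then basename "$d"; fi
    done | grep -E '^[0-9]+(\.[0-9]+)*$' \
         | sort -t . -k1,1n -k2,2n -k3,3n \
         | tail -n 1
}
```

On the SDS this could be called as latest_version /rdcprojects/tr/tr00612/data/synlbd. Subdirectories that do not look like version numbers are ignored, and nothing is printed if none match.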

Additional public-use data is accessible under /data

  • Zero-obs datasets from the Census RDC are available under /data/virtualrdc/, in locations that otherwise mirror their locations on the Census RDC. For example, the CBO files would be in /economic/cbo/microdata on the Census RDC, and are found in /data/virtualrdc/economic/cbo/microdata on the SDS.
  • Cleaned and ready-to-use (generally, as SAS files) public use data for a variety of data sources can be found under /data/clean/(NAME), with accompanying documentation under /data/doc/(NAME). If you notice anything out of date, please let us know.
  • If you use these data in your SDS-based analysis and request validation by the data owner, you need to explicitly identify these data, as they may NOT be available on the data owner's compute server.
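
The correspondence between Census RDC locations and their zero-obs copies on the SDS is a plain prefix rule, which the following sketch spells out. The function names sds_path and rdc_path are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Map a Census RDC path to the location of its zero-obs copy on the SDS.
sds_path() {
    # $1: absolute path as it appears on the Census RDC
    printf '%s\n' "/data/virtualrdc$1"
}

# Map an SDS path back to the Census RDC path by stripping the prefix.
# Paths without the /data/virtualrdc prefix are returned unchanged.
rdc_path() {
    printf '%s\n' "${1#/data/virtualrdc}"
}
```

For the CBO example above, sds_path /economic/cbo/microdata prints /data/virtualrdc/economic/cbo/microdata, and rdc_path reverses the mapping.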

User-created programs

Users should create programs OUTSIDE of their home directories (see backup policy below). Create a directory for your project under

  • /rdcprojects/co/co00517/SSB/programs/users/(LOGIN ID)
  • /rdcprojects/tr/tr00612/programs/users/(LOGIN ID)

This ensures ease of validation on the Census internal computers. The most robust way to ensure ease of replication is to NEVER hard-code paths. Suggested practice is to use macro/global variables to encode such paths:

SAS

%let base=/rdcprojects/tr/tr00612;
%let version=2.0.2;
%let myid=specXXX;
%let prefix=synlbd;
libname inputs "&base./data/synlbd/&version." access=readonly;
libname mydata "&base./programs/users/&myid./data";
data mydata.analysis_file;
set inputs.&prefix.1992c;
run;

Stata

global base /rdcprojects/tr/tr00612
global version 2.0.2
global myid specXXX
global prefix synlbd
global inputs $base/data/synlbd/$version
global mydata $base/programs/users/$myid/data
use ${inputs}/${prefix}1992c
...
save ${mydata}/analysis_file

R

base = "/rdcprojects/tr/tr00612"
version = "2.0.2"
myid = "specXXX"
prefix = "synlbd"
library(foreign)
inputs = paste(base,"/data/synlbd/",version,sep="")
mydata = paste(base,"/programs/users/",myid,"/data",sep="")
analysis_file ...
save(analysis_file,file=paste(mydata,"/analysis_file.RData",sep=""))
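
The same convention carries over to shell scripts, for instance one that prepares the project directory. The sketch below reuses the variable names from the examples above. specXXX is the placeholder login ID from those examples, the user_dir helper is a hypothetical name, and reading overrides from the environment is an assumption about how you might want to relocate the code, not an SDS requirement.

```shell
#!/bin/sh
# Define every location once, at the top. Each value can be overridden from
# the environment when the code moves to another system.
base=${base:-/rdcprojects/tr/tr00612}
version=${version:-2.0.2}
myid=${myid:-specXXX}            # placeholder login ID

inputs="$base/data/synlbd/$version"

# Hypothetical helper: path of a subdirectory in the user's program area.
user_dir() {
    printf '%s\n' "$base/programs/users/$myid/$1"
}

mydata=$(user_dir data)
echo "inputs: $inputs"
echo "mydata: $mydata"
# To create the directory on the SDS, run: mkdir -p "$mydata"
```

Only the first three assignments need to change when the code is validated elsewhere; everything else is derived from them.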

Statistical and other software

SDSx uses job scheduling software. If you need to learn about qsub, please see various tutorials about using qsub, including our own qsub page. However, we also have a few convenience commands (our 'q-commands' and 'i-commands'), which automatically submit jobs to the queue using the appropriate qsub commands (for general queues, and interactive queues, respectively). Available queues can be found on the SDSx queue page.

We also have a short tutorial you may want to consult.

| Software | Versions | Commandline (on compute node) | Module (help) | Qsub-aware command (head-node) | Availability |
|---|---|---|---|---|---|
| SAS | 9.3, 9.4 | sas | sas, sas/(VERSION) | qsas, isas | Compute nodes |
| Stata (SE, MP) | 12.0 | stata(-mp,-se), xstata(-mp,-se) | stata, stata/se, stata/mp | qstata(-mp,-se), iStata | Compute nodes |
| R | 3.0.1, 3.0.2-ACML (note) | R, Rscript | R, R/ACML | qR, iR | Compute nodes |
| RStudio | 0.98 using R 3.0.1 | rstudio | n.a. | iRstudio | Compute nodes |
| ASReml (also related R package) | 3.00 [01 Jan 2009] | asreml | | iasreml | On demand |
| Matlab | R2013b (8.2.0.701) | matlab | matlab | qmatlab, imatlab | Compute nodes |
| Octave | 3.4.3 | octave | | qoctave, ioctave | On demand |
| GRASS | 6.4.2 | grass | | | On demand |
| SPSS | 21.0.0 | spss | | | |

Interactive usage of software

You will find many of these software packages available from the Gnome menu, under "Statistics". You can also launch them using the 'i-command' version (for example, 'iStata' to launch Stata). However, all instances started from the menu run in the interactive queue, and are subject to limitations in terms of CPUs, memory, and runtime. Long-running jobs need to be submitted from the command line. The interactive versions should be considered appropriate for debugging, but not for full computational jobs.

Batch submission of software

For longer-running jobs, users should use the 'q-commands'. Default runtimes, memory limits, and numbers of CPUs are noted on the SDSx queue page. Most 'q-commands' take "chunks" as arguments, where one chunk is 2 CPUs and 8GB of memory. For differing requirements (for instance, longer-running jobs), custom qsub scripts can be used; see the qsub page for more details.
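
The chunk arithmetic can be worked out ahead of time. The sketch below only does the calculation, using the 2 CPUs and 8GB per chunk stated above; how a given 'q-command' expects the chunk count to be passed is not shown here and should be checked on the queue page. Both function names are hypothetical.

```shell
#!/bin/sh
CPUS_PER_CHUNK=2
MEM_GB_PER_CHUNK=8

# Resources implied by a number of chunks.
chunk_resources() {
    # $1: number of chunks
    echo "$(( $1 * $CPUS_PER_CHUNK )) CPUs, $(( $1 * $MEM_GB_PER_CHUNK ))GB"
}

# Smallest number of chunks that covers both a CPU and a memory requirement.
chunks_needed() (
    # $1: CPUs needed, $2: memory needed in GB (whole numbers)
    by_cpu=$(( ($1 + $CPUS_PER_CHUNK - 1) / $CPUS_PER_CHUNK ))      # round up
    by_mem=$(( ($2 + $MEM_GB_PER_CHUNK - 1) / $MEM_GB_PER_CHUNK ))  # round up
    if [ "$by_cpu" -gt "$by_mem" ]; then echo "$by_cpu"; else echo "$by_mem"; fi
)
```

For example, a job needing 4 CPUs and 20GB is memory-bound: chunks_needed 4 20 yields 3 chunks, which chunk_resources 3 reports as 6 CPUs and 24GB.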

Packages

R: We regularly add certain R packages, but if you need anything in particular, please contact the Help Desk. Since you cannot access the internet from within the SDSx, we will need to transfer the R packages for you.

Stata: We occasionally mirror the RePEc repository of Stata packages to /data/mirror/fmwww.bc.edu/repec/. Users can install packages by running commands such as the following (for any package, use the first character of the package name in the first line):

net from /data/mirror/fmwww.bc.edu/repec/e
net install estout, replace
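
The first-character rule can be captured in a small shell helper that prints the two Stata commands for any package name, ready to paste into a do-file. This is a sketch only: stata_install_cmds is a hypothetical name, and it takes the first character of the name exactly as typed.

```shell
#!/bin/sh
MIRROR=/data/mirror/fmwww.bc.edu/repec

# Print the two Stata commands that install a package from the local mirror.
stata_install_cmds() (
    # $1: package name; its first character selects the mirror subdirectory
    first=$(printf '%s' "$1" | cut -c1)
    printf 'net from %s/%s\n' "$MIRROR" "$first"
    printf 'net install %s, replace\n' "$1"
)
```

Calling stata_install_cmds estout reproduces the two lines shown above.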

In addition, private ado files will occasionally be made available. On SDSx, if you intend to use them, add the following line to your Stata program:

adopath + "/cac/contrib/ado/"

For more information, see http://www.stata.com/help.cgi?adopath

System

The SDSx cluster is configured as follows:

| Names | Processor | Number of processors | Cores per processor | Total cores, all nodes | Clockspeed | Memory per node | Resource set |
|---|---|---|---|---|---|---|---|
| login/head node | AMD 6380 | 2 | 16 | 32 | 2.5 GHz | 64GB | Login only |
| compute-1-{1-3} | AMD 6380 | 2 | 16 | 96 | 2.5 GHz | 256GB | All |
| Total June 2014 | | | | 128 | | 832 GB | |
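
As a quick check of the totals in the table above, the arithmetic can be reproduced in a few lines. The sketch assumes, as the node name compute-1-{1-3} suggests, that there are three identical compute nodes plus one head node.

```shell
#!/bin/sh
# Per-node core count: 2 processors x 16 cores each.
cores_per_node=$(( 2 * 16 ))
compute_nodes=3                  # compute-1-1 through compute-1-3

# Head node plus all compute nodes.
total_cores=$(( cores_per_node + compute_nodes * cores_per_node ))
# 64GB on the head node plus 256GB on each compute node.
total_mem_gb=$(( 64 + compute_nodes * 256 ))

echo "total cores: $total_cores"
echo "total memory: ${total_mem_gb} GB"
```

This gives 128 cores and 832 GB, matching the totals row. The 96 cores and 768 GB on the compute nodes are the part available to jobs, since the head node is login only.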

Note: there may be limits on access to these resources. All access on SDSx is channeled through queues; see the queue configuration page for more details.

Backup

Due to the restricted-access nature of the server, we provide backup of critical files. However, we do not back up all files on the system, so in order to ensure that your critical programs get backed up, please note the following backup policy:

  • Files in your home directory (/home/(userid)) (and your desktop) are NOT backed up.
  • Files under /rdcprojects/co/co00517 and /rdcprojects/tr/tr00612 are generally backed up, but user-created data files (in the user/ directories) may be excluded in the future.
  • User-created programs under /rdcprojects/{co,tr}/{co00517,tr00612}/.../programs/users are ALWAYS backed up.
  • Files in the scratch space are never backed up, and are regularly removed to efficiently manage space.
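
The policy above can be expressed as a path lookup, which may help when deciding where to save a file. The sketch below is illustrative only: backup_status is a hypothetical helper, it takes /temporary/ to be the scratch space (as listed under the filesystem layout), and it does not distinguish user-created data files from programs, so the caveat about data files possibly being excluded in the future still applies.

```shell
#!/bin/sh
# Hypothetical helper: report the stated backup policy for a given path.
backup_status() {
    case "$1" in
        /home/*)
            echo "not backed up" ;;
        /temporary/*)
            echo "never backed up, regularly removed" ;;
        /rdcprojects/co/co00517/*/programs/users/*|\
        /rdcprojects/tr/tr00612/programs/users/*|\
        /rdcprojects/tr/tr00612/*/programs/users/*)
            echo "always backed up" ;;
        /rdcprojects/co/co00517/*|/rdcprojects/tr/tr00612/*)
            echo "generally backed up" ;;
        *)
            echo "not covered by the stated policy" ;;
    esac
}
```

For instance, backup_status /home/specXXX/analysis.do reports "not backed up", which is the reason programs should live under the project's programs/users directory instead.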

Keeping informed

By default, we will subscribe you to an announcement-only mailing list (virtualrdc-sds-l@cornell.edu) to notify you of any important information about the server.

  • If you wish to be notified at a different email address, send an email to listmanager@list.cornell.edu with the body of the message stating "subscribe virtualrdc-sds-l".

Getting help

If you need further assistance, please consult our Help page on how best to direct your inquiry.