ECCO and SDSx are clusters shared by a large number of users, with finite resources. To keep the clusters usable for everyone, users are expected to follow some rules. Some are soft but monitored; others are hard constraints that will affect your own work if not taken into account.
File sizes
Some simple rules:
- $HOME (your home directory, your desktop) is small, and should be used only for reduced derivative files and final result files
- Quotas might affect your ability to store files in $HOME
- $HOME is shared, and available on the head node and all compute nodes
- $SCRATCH (/scratch or /temporary - they both point to the same filesystem) is designed for large files.
- Not all $SCRATCH filesystems are created equal, though - the easiest way to check the available size is to open an interactive qsub job and type 'df -h /temporary/', which reports the size in terabytes (see the example after this list)
- $SCRATCH (or /temporary) is ... well, temporary. We clean it out if a file hasn't been used in more than two weeks. There is no backup: "cleaning out" means deleting, forever.
- Most shared data (public-use datasets on ECCO, synthetic datasets on SDSx) are in a common location, for your use.
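For example, to check how much scratch space a node has, you can open an interactive session and inspect the filesystem. A minimal sketch (the resource request is illustrative; adjust it to your queue's limits):

# request a small interactive job (resource values are illustrative)
qsub -I -l ncpus=1,mem=1000m
# once the interactive shell starts on a compute node:
df -h /temporary/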
The above rules imply the following structure for any of your programs:
- Read the shared data from the shared location. Do NOT copy them to your desktop.
- Define a way to reference the $SCRATCH location for your temporary files. This can be used to share storage between two Stata, SAS, or R jobs, or simply within your program. Store files here that can be reproduced by rerunning your programs (which in principle means ALL files you create), even if that takes a long time, but that are not needed in the long run.
- Actively clean up at the end of a job: delete unneeded data files
- Write to $HOME only analysis results, or greatly reduced files that are of long-term use.
A sample Stata program that cleans up and does not write big files:
global version "v5.1.1" global _version "v5_1_1" global user "spec666" global INPUT "/rdcprojects/co/co00517/SSB/data/$version" global USERDATA "/rdcprojects/co/co00517/SSB/data/user/$user" global OUTPUTS "/rdcprojects/co/c00517/SSB/programs/user/$user"log using $OUTPUTS/mylog.log, text replace tempfile scratch global implicate 1_1 /* only load the necessary variables */ use personid spouse_personid state panel using $INPUT/ssb_${_version}_synthetic${implicate} ... do stuff ... tab state panel /* save to temporary file */ sort state save `scratch' /* process another file */ use $USERDATA/state_urate sort state merge 1:m state using `scratch' /* run analysis */ reg something urate exit, clear
(Stata automatically deletes the `scratch' tempfile when the do-file ends.)
We remind users on ECCO that there is NO BACKUP of any files that you create. You must either commit to your own offsite Subversion or Git repository, or make other accommodations to copy files offsite.
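A minimal sketch of backing up your program files to an offsite Git repository (the repository URL is a placeholder for your own; commit programs, not data):

cd /rdcprojects/co/c00517/SSB/programs/user/$USER
git init
git add *.do *.qsub
git commit -m "backup of programs"
# the URL below is a placeholder - substitute your own offsite repository
git remote add origin git@github.com:youruser/yourproject.git
git push -u origin master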
Programming with limits
Most simple jobs can be run with our 'qcommands', but as soon as your job sequence becomes more complex, you will want to create your own qsub scripts. With custom qsub scripts, you can
- control which nodes your programs get allocated to. You may want to do this when you have intermediate files on $SCRATCH. You do NOT want to do this in general, since it may delay when your job can be run.
- ask for a longer walltime. Most queues have a default wallclock limit (the time your job is allowed to run), as well as a longer maximum wallclock limit up to which users may request time. If your job runs for a very long time, you will want to increase the requested time (PBS option: "-l walltime=HH:MM:SS"); see the sketch below.
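A minimal sketch of a qsub script header that requests a longer wallclock limit (the job name, resources, and the 12-hour value are illustrative):

#!/bin/bash
#PBS -N long_job
#PBS -l ncpus=1,mem=8000m
#PBS -l walltime=12:00:00
#PBS -j oe
# ... rest of the script as usual ...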
You will also want to cut jobs into smaller programs, so that they can run within the wallclock limits, and in order to make them robust to restarts.
- Unless you have a regression routine that by itself pushes the limits, you can at a minimum separate the data preparation steps from the analysis steps (writing the output of the data preparation step to a location on $SCRATCH, from where the analysis step can read it)
- You can have one data preparation routine and multiple regression routines. By splitting them into separate qsub programs, you can submit all regression routines simultaneously (within the limits imposed on the system), as shown below. Even running just two of them in parallel cuts the time you spend waiting for the regression results by up to 50%. Not bad.
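For example, once the data preparation job has finished, you might submit several regression jobs at once (the file names are hypothetical):

qsub 02_analysis_model1.qsub
qsub 02_analysis_model2.qsub
qsub 02_analysis_model3.qsub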
Another Stata example, with intermediate storage and a config file
The programs below (config.do, 01_prep.do, 02_analysis.do, and the qsub programs) split the processing into two steps: the preparation, which stores a file on $SCRATCH, and the analysis, which reads that file. Because $SCRATCH is specific to a particular compute node (not shared), we capture which node 01_prep.qsub ran on, and pass it to the qsub command that submits 02_analysis.qsub.
config.do:

global version "v5.1.1"
global _version "v5_1_1"
global user "spec666"
global INPUT "/rdcprojects/co/co00517/SSB/data/$version"
global USERDATA "/rdcprojects/co/co00517/SSB/data/user/$user"
global OUTPUTS "/rdcprojects/co/c00517/SSB/programs/user/$user"
global SCRATCH "/scratch/$user"
! mkdir $SCRATCH

01_prep.do:

do config.do
log using $OUTPUTS/01_prep.log, text replace
global implicate 1_1

/* only load the necessary variables */
use personid spouse_personid state panel using $INPUT/ssb_${_version}_synthetic${implicate}

... do stuff ...

save $SCRATCH/01_prepped, replace
02_analysis.do:

do config.do
log using $OUTPUTS/02_analysis.log, text replace
global implicate 1_1
global mydebug on

use $SCRATCH/01_prepped, clear

.... do stuff ....

/* delete the intermediate file only if debug = off (rm needs the full file name, including the .dta extension) */
if ( "$mydebug" == "off" ) {
    rm $SCRATCH/01_prepped.dta
}
Now that we have the Stata programs, let's write the qsub programs to schedule them:
01_prep.qsub:

#!/bin/bash
#PBS -N 01_prep
#PBS -l ncpus=1,mem=8000m
#PBS -j oe

source /etc/profile.d/modules.sh
cd /rdcprojects/co/c00517/SSB/programs/user/$PBS_O_LOGNAME
module load stata
export STATATMP=/scratch/

# capture the hostname
hostname > hostname.txt

# run the stata program
stata -q -b do 01_prep.do
Submit this program using
qsub 01_prep.qsub
02_analysis.qsub:

#!/bin/bash
#PBS -N 02_analysis
#PBS -l ncpus=1,mem=8000m
#PBS -j oe

source /etc/profile.d/modules.sh
cd /rdcprojects/co/c00517/SSB/programs/user/$PBS_O_LOGNAME
module load stata
export STATATMP=/scratch/

stata-mp -q -b do 02_analysis.do
When submitting 02_analysis.qsub, use the following command line:
qsub -l nodes=$(cat hostname.txt) 02_analysis.qsub
which will request that the scheduler process the job on the same compute node that 01_prep.qsub ran on.