Working with large datasets and long-running programs on ECCO and SDSx


ECCO and SDSx are clusters shared by a large number of users, with finite resources. To keep the clusters usable for everyone, users are expected to follow a few rules. Some are soft but monitored; others are hard constraints that will affect your own work if you do not take them into account.

File sizes

Some simple rules:

  • $HOME (your home directory, your desktop) is small and should be used only for reduced derivative files and final result files.
  • Quotas may limit your ability to store files in $HOME.
  • $HOME is shared: it is available on the head node and on all compute nodes.
  • $SCRATCH (/scratch or /temporary - they both point to the same filesystem) is designed for large files.
  • Not all $SCRATCH filesystems are created equal, though. The easiest way to find out how much space a node has is to open an interactive qsub job and run 'df -h /temporary/', which reports the size in terabytes (see the snippet after this list).
  • $SCRATCH (or /temporary) is ... well, temporary. We clean it out when a file has not been used in more than two weeks. There is no backup; "cleaning out" means deleting, forever.
  • Most shared data (public-use datasets on ECCO, synthetic datasets on SDSx) are in a common location, for your use.
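
For example, to see how much space the scratch filesystem on a compute node actually has, start an interactive job and inspect it (the resource request below is illustrative; adjust it to your needs):

# start an interactive job on a compute node (resource options are illustrative)
qsub -I -l ncpus=1,mem=1000m
# once the interactive shell opens on the compute node:
df -h /temporary/     # size and free space of that node's scratch filesystem
df -h $HOME           # free space on the filesystem holding your home directory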

The above rules imply the following structure for any of your programs:

  • Read the shared data from the shared location. Do NOT copy them to your desktop.
  • Define a way to reference the $SCRATCH location for your temporary files. This can be used to share storage between two Stata, SAS, or R jobs, or simply within a single program. Store files here that can be reproduced by your programs (which in principle means ALL files you create), even if reproducing them takes a long time, but that are not needed in the long run.
  • Actively clean up at the end of a job: delete unneeded data files.
  • Write to $HOME only analysis results, or greatly reduced files that are of long-term use.

A sample Stata program that cleans up after itself and does not write big files:

global version "v5.1.1"
global _version "v5_1_1"
global user "spec666"
global INPUT "/rdcprojects/co/co00517/SSB/data/$version"
global USERDATA "/rdcprojects/co/co00517/SSB/data/user/$user"
global OUTPUTS "/rdcprojects/co/c00517/SSB/programs/user/$user"
log using $OUTPUTS/mylog.log, text replace
tempfile scratch
global implicate 1_1
/* only load the necessary variables */
use personid spouse_personid state panel using $INPUT/ssb_${_version}_synthetic${implicate}
/* ... do stuff ... */
tab state panel
/* save to temporary file */
sort state
save `scratch'
/* process another file */
use $USERDATA/state_urate
sort state
merge 1:m state using `scratch'
/* run analysis */
reg something urate
exit, clear

(Stata automatically deletes the `scratch' tempfile when the do-file finishes.)

We remind users on ECCO that there is NO BACKUP of any files that you create. You must either commit your work to your own offsite Subversion or Git repository, or make other accommodations to copy files offsite.
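
As an illustration (the remote URL, branch name, and commit message below are placeholders; substitute your own offsite repository), committing and pushing your programs with Git could look like this:

# from your program directory on the cluster
cd /rdcprojects/co/c00517/SSB/programs/user/spec666
git init                               # first time only
git add *.do *.qsub
git commit -m "prep and analysis programs"
# 'origin' must point at a repository you control offsite; this URL is a placeholder
git remote add origin git@example.org:spec666/ssb-programs.git
git push -u origin master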

Programming with limits

Most simple jobs can be run with our 'qcommands', but as soon as your job sequence becomes more complex, you will want to create your own qsub scripts. With custom qsub scripts, you can

  • control which node your program is allocated to. You may want to do this when you have intermediate files on a particular node's $SCRATCH. You do NOT want to do this in general, since it may delay when your job can start.
  • ask for a longer walltime. Most queues have a default wallclock limit (the time your job is allowed to run) as well as a higher maximum that users may request. If your job runs for a very long time, you will want to increase the requested time (PBS directive: "-l walltime=HH:MM:SS"); a minimal example follows this list.
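
A minimal qsub script requesting a longer walltime might look like the following sketch (the 24-hour request, resource values, and program name are illustrative; what your queue actually allows depends on its configuration):

#!/bin/bash
#PBS -N long_job
#PBS -l ncpus=1,mem=8000m
#PBS -l walltime=24:00:00
#PBS -j oe
source /etc/profile.d/modules.sh
# start in the directory the job was submitted from
cd $PBS_O_WORKDIR
module load stata
stata -q -b do long_job.do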

You will also want to cut jobs into smaller programs, both so that they fit within the wallclock limits and to make them robust to restarts.

  • Unless you have a regression routine that by itself pushes the limits, you can at a minimum separate the data preparation steps from the analysis steps, writing the output of the data preparation step to a location on $SCRATCH from which the analysis step can read it.
  • You can have one data preparation routine and multiple regression routines. By separating them into separate qsub programs, you can submit all regression routines simultaneously (within the limits imposed on the system). At a minimum, running two of them in parallel cuts the time you wait for all results roughly in half. Not bad. A simple submission loop is sketched below.
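
For example, if the regressions live in separate qsub scripts (the file names here are hypothetical), a single shell loop submits them all at once:

# submit several independent regression jobs in one go (names are illustrative)
for job in 02_analysis_a.qsub 02_analysis_b.qsub 02_analysis_c.qsub
do
    qsub "$job"
done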

Another Stata example, with intermediate storage and a config file

The programs below (config.do, 01_prep.do, 02_analysis.do, and the qsub programs) split the processing into two steps: the preparation, which stores a file on $SCRATCH, and the analysis, which reads that file. Because $SCRATCH is specific to a particular compute node (not shared), we capture which node 01_prep.qsub ran on, and add it to the qsub command that submits 02_analysis.qsub (see the end of this section).

config.do:
global version "v5.1.1"
global _version "v5_1_1"
global user "spec666"
global INPUT "/rdcprojects/co/co00517/SSB/data/$version"
global USERDATA "/rdcprojects/co/co00517/SSB/data/user/$user"
global OUTPUTS "/rdcprojects/co/c00517/SSB/programs/user/$user"
global SCRATCH "/scratch/$user"
/* create your private scratch directory (-p: no error if it already exists) */
! mkdir -p $SCRATCH
01_prep.do:
do config.do
log using $OUTPUTS/01_prep.log, text replace
global implicate 1_1
/* only load the necessary variables */
use personid spouse_personid state panel using $INPUT/ssb_${_version}_synthetic${implicate}
/* ... do stuff ... */
save $SCRATCH/01_prepped, replace
02_analysis.do:
do config.do
log using $OUTPUTS/02_analysis.log, text replace
global implicate 1_1
global mydebug on
use $SCRATCH/01_prepped, clear
/* ... do stuff ... */
/* delete the intermediate file only if debugging is off */
if ( "$mydebug" == "off" ) {
    rm $SCRATCH/01_prepped.dta
}

Now that we have the Stata programs, let's write the qsub programs to schedule them:

01_prep.qsub:
#!/bin/bash
#PBS -N 01_prep
#PBS -l ncpus=1,mem=8000m
#PBS -j oe
source /etc/profile.d/modules.sh
cd /rdcprojects/co/c00517/SSB/programs/user/$PBS_O_LOGNAME
module load stata
export STATATMP=/scratch/
# record which compute node this job ran on (used later to pin 02_analysis.qsub to the same node)
hostname > hostname.txt
# run the stata program
stata -q -b do 01_prep.do

Submit this program using

qsub 01_prep.qsub
02_analysis.qsub:
#!/bin/bash
#PBS -N 02_analysis
#PBS -l ncpus=1,mem=8000m
#PBS -j oe
source /etc/profile.d/modules.sh
cd /rdcprojects/co/c00517/SSB/programs/user/$PBS_O_LOGNAME
module load stata
export STATATMP=/scratch/
stata-mp -q -b do 02_analysis.do

When submitting 02_analysis.qsub, use the following command line:

qsub -l nodes=$(cat hostname.txt) 02_analysis.qsub

which will request that the scheduler process the job on the same compute node that 01_prep.qsub ran on.
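
If you prefer not to watch the queue yourself, a small wrapper script can handle both submissions: it submits the prep job, waits for it to leave the queue, and then submits the analysis job on the recorded node. This is only a sketch, not part of the cluster tooling: the polling interval is arbitrary, and on some scheduler configurations finished jobs remain visible to qstat for a while, which simply delays the second submission.

#!/bin/bash
# submit_both.sh -- hypothetical wrapper script
# submit the preparation job and remember its job id
JOBID=$(qsub 01_prep.qsub)
# wait until the prep job has left the queue
while qstat "$JOBID" > /dev/null 2>&1
do
    sleep 60
done
# submit the analysis job on the node recorded by 01_prep.qsub
qsub -l nodes=$(cat hostname.txt) 02_analysis.qsub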