Let’s get into making the most out of the cluster…
What is a parallel job?
Running the same job three times would look something like this…
qsub -cwd -S /bin/bash script.sh
qsub -cwd -S /bin/bash script.sh
qsub -cwd -S /bin/bash script.sh
There is a more efficient way of doing this, submitting all three runs as a single array job…
qsub -t 1-3 -cwd -S /bin/bash script.sh
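You can check that all three tasks were scheduled with qstat; the tasks appear under a single job id, with their indices listed in the ja-task-ID column:
qstat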
What if we need to slightly change the script for every iteration?
For example…
qsub -cwd -S /bin/bash script.sh 1
qsub -cwd -S /bin/bash script.sh 2
qsub -cwd -S /bin/bash script.sh 3
To do this, we need to read an argument inside the R script. Let’s have a look at the file script-parallel.R:
args <- commandArgs(trailingOnly = TRUE)  # collect the arguments passed to the script
i <- args[1]                              # the first argument is our job index
Sys.sleep(20)                             # pretend to do 20 seconds of work
message(paste("my job number", i, sep = " "))
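Before involving the scheduler at all, you can test the R script from the command line, passing an index by hand; it should print “my job number 1” after the 20-second pause:
Rscript script-parallel.R 1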
Then change your bash script so that it collects the task index, which the -t flag makes available through the $SGE_TASK_ID environment variable:
#!/bin/bash
Rscript script-parallel.R $SGE_TASK_ID
Now you can call your new bash script by running:
qsub -t 1-3 -cwd -S /bin/bash script-parallel.sh
Have a look at the output files to see the results!
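On a typical SGE setup each task writes its own standard output file, named after the script, the job id and the task index (the exact names depend on your site’s configuration), so something like this should show all three messages:
cat script-parallel.sh.o*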
Sometimes we want to use several cores within the same job.
The following R script would take 10 minutes to complete (100 iterations × 6 seconds each = 600 seconds):
for(i in 1:100){
# your code goes here
# the following instruction takes 6 seconds to complete
Sys.sleep(6)
}
Instead of evaluating each iteration in the loop sequentially, we could tell the computer to run several in parallel. To do that we need to set up a parallel environment in R. If you’re using Unix you can use the doMC and parallel packages for the parallel environment. We won’t cover parallel environments for Windows because you cannot use those on the Abacus cluster.
Run these instructions in your R console (the parallel package already ships with base R, so it does not need to be installed separately):
install.packages("doMC")
install.packages("foreach")
The last package provides a parallel version of a for-loop.
The parallel equivalent of the sequential loop shown before would be:
library(doMC)
library(parallel)
library(foreach)
# use 2 cores in the parallel backend
registerDoMC(cores = 2)
foreach(i=1:100) %dopar% {
# your code goes here
# the following instruction takes 6 seconds to complete
Sys.sleep(6)
}
Here we’re using two cores, so the loop takes 5 minutes instead of 10. Note that foreach returns a list with the result of each iteration, unlike a regular for loop.
Current computers usually have between 2 and 8 cores. Abacus has nodes with up to 48 cores; on one of those, the same loop would take 18 seconds (three rounds of 48, 48 and 4 iterations at 6 seconds each) instead of 10 minutes.
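If you want to check how many cores the machine you are working on actually exposes, you can ask R’s parallel package from the shell:
Rscript -e 'parallel::detectCores()'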
Let’s try to run that on the cluster. First we need to install the required packages there. Start R from the console:
R
Then, inside the R session, point to a CRAN mirror and install the packages (as before, parallel ships with base R):
options(repos = "https://cran.rstudio.com/")
install.packages("doMC")
install.packages("foreach")
Save the parallel loop above as script-multicore.R. The bash script to call it looks very similar:
#!/bin/bash
Rscript script-multicore.R
But we need to tell qsub that our job needs several cores instead of one. To do that we use the option -pe multi_thread followed by the number of cores we want.
You can run the parallel loop typing the following in the Abacus console
qsub -S /bin/bash -pe multi_thread 8 script-multicore.sh
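One detail worth knowing: the number of cores you register in R should match what you request from qsub. SGE exports the number of granted slots in the $NSLOTS environment variable, so one option is to forward it to the script; this is a sketch that assumes script-multicore.R reads its first argument with commandArgs(), as in the earlier example, and passes it to registerDoMC():
#!/bin/bash
Rscript script-multicore.R $NSLOTS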
Take into account that the more cores we request, the harder it is to schedule the job, because the cluster needs to find a node with all those cores free.
Depending on the queue we use and the user’s privileges, each job is limited to 1-2 gigabytes of RAM. To request more memory we need to use the options -l mem_free and h_vmem.
To request 8 gigabytes of RAM for a job we would run:
qsub -l mem_free=8G,h_vmem=8G script.sh
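To see how much memory a job actually used, you can inspect it while it runs or after it finishes; the maxvmem field is the one to look at (how much accounting data is available depends on how Abacus is configured):
qstat -j <job_id>   # while the job is running
qacct -j <job_id>   # after the job has finished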
Abacus has around 100 GB of RAM in each node. But keep in mind that jobs requesting large amounts of RAM are harder to schedule and take up resources other people could use. Only request what you actually need.
Here we will look at the file template.sh…
To call your scripts using this template, you would only need to run:
qsub script-parallel.sh
This works because all the flags and arguments are already defined in template.sh.
Let’s have a look at this template:
#!/bin/bash
###############################
# SGE settings Here
# Basically, if a line starts with "#$", then you can enter any
# qsub command line flags. See the qsub(1) man page.
# Email address, and options, to whom reports should be sent.
# The -M is for the address, the -m is for when to send email.
# a=abort b=begin e=end s=suspend n=none
##$ -M youremail@domain.com
##$ -m abe
# Redirect the STDOUT and STDERR files to the ~/jobs directory
#$ -o /home/usr123/jobs/
#$ -e /home/usr123/jobs/
# This script, ladies and gentlemen, is in bash
#$ -S /bin/bash
# Do some validation checking, and bail on errors
##$ -w e
# Operate in the current directory
##$ -cwd
# End SGE Settings
###############################
# you can put your scripts here
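For example, to reproduce the array job from earlier with this template, you could add the task range as an embedded flag and put your command at the bottom, so the end of the script would look something like this (just a sketch; the rest of the template stays as above):
#$ -t 1-3
Rscript script-parallel.R $SGE_TASK_ID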
Another option is generating the output files using R. Let’s change the file script.R that we used before:
Sys.sleep(20) # this instruction takes 20 seconds to complete
message("my second job in the cluster")
output <- 10
save(output, file="output.RData")
Run this new script with qsub and check the file it generates. To do this, open an R session on the cluster and load the file using:
load("output.RData")
print(output)
To connect from outside the University, we just need to go through a different machine that takes us past the University firewall. This machine is called the abacus gateway, and you already have an account on it:
ssh usr123@abacus-gw.canterbury.ac.nz
The password is gateway2workshop
Change your password again:
passwd
Now you can either access the cluster with:
ssh usr123@abacus
Or access your own computer using its name:
ssh bbr36@biolz210-9
Let’s make a simple python script called script.py:
def fprint(x):
    print(x)
This script defines a function that prints the variable x.
The bash file that calls this script in the cluster would be something like this:
#!/bin/bash
python -c "import script; script.fprint('$1')"
We import the python script and call the function, passing the bash script’s first argument as x.
Let’s say that you want to call the python function and print your name. To do so, you just need to run qsub as:
qsub -cwd -S /bin/bash script-python.sh "Fernat"
If you check the output files you will see that we have just printed your name.
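As before, the output lands in the job’s standard output file, so something like this should show the name (the exact file name depends on the job id):
cat script-python.sh.o*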