Abacus - Part 2

Bernat Bramon & Fernando Cagua

December 15, 2017

Let’s get into the basics of running jobs to the cluster…

Queues

Although the cluster is just like any other computer, the way we run scripts is slightly different.

  • If everyone ran the jobs directly to the cluster, it would be chaos!

  • Jobs would be all running at the same time,

  • it would be harder to prevent the cluster from crushing,

  • it would be hard to manage multiple users…

That’s why we need a queueing system…

The queueing system

  • Distributes jobs across nodes/cores

  • Makes a waiting list

  • Manages job priorities

  • Ensures that jobs are contained and the nodes are shieded from possoble errors

  • Ensures that jobs are run only within allocated resources (cores/RAM memory/time)

The master node hosts the queueing system. That’s why we shouldn’t run jobs there.

If the master node chrashes, the queueing system crashes too. Which means everyone’s jobs might go with it. No one can use the cluster until the whole thing is reset.

SGE

There are multiple quewing systems available. Abacus uses SGE (Sun Grid Engine System).

There are 3 basic commands we’ll cover today

qsub
qstat
qdel

qsub

To use qsub, we will “always” need a bash script that runs our R scripts.

Write with nano a bash script named script.sh:

#!/bin/bash

Rscript script.R

Now you should have an R script named script.R and a bash script named script.sh. Try running the job in the cluster by means of:

qsub -cwd -S /bin/bash script.sh

You will see that two new files have been created in your home directory.

script.sh.o

This is the output file. Unless Steve messes up with the cluster, you should see the output of the R script inside

script.sh.e

This is the errors file. Unless Steve messes up with the cluster, you can use it to debug your code.

You could try changing the Rscript to produce an error:

Sys.sleep(20)
message("my second job in the cluster")
error <- log("This will produce an error because it is a string")

qstat

We can use qstat to check the status of the queue

to get the status of your jobs in the queue use:

qstat

to get the status of any every user in the queue use:

qstat -u "*"

we get something like this:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
1382668 0.50500 ivs_rogini rru36        r     11/20/2017 15:17:55 all.q@compute-0-5.local            1        
1382673 0.50500 ivs_simu.s rru36        r     11/20/2017 21:19:10 all.q@compute-0-12.local           1        
1382685 0.50500 ivs44.sh   rru36        r     11/21/2017 16:43:25 all.q@compute-0-14.local           1        
1382708 0.50500 dts_simu.s rru36        r     11/24/2017 14:45:55 all.q@math-compute-0-0.local       1        
1382751 0.50500 convertToH jmp197       r     12/08/2017 14:14:11 all.q@compute-0-15.local           1        
1382756 0.50500 moc_rrsw.s rru36        r     12/12/2017 18:11:26 all.q@math-compute-0-3.local       1        
1382765 0.60500 run_simula jmp197       r     12/13/2017 15:50:26 all.q@compute-0-11.local          41 1
1382765 0.60500 run_simula jmp197       r     12/13/2017 15:50:26 all.q@compute-0-12.local          41 2
1382765 0.60500 run_simula jmp197       r     12/13/2017 15:50:26 all.q@compute-0-13.local          41 3

try submiting a job and then getting info about it

qsub script.sh
qstat

qdel

qdel is used to delete jobs in the queue

We need to have the job-id first, which we can obtain calling qstat

qsub script.sh
qstat
qdel the_job_id_you_see_on_qstat

Bonus

SGE provides a graphical interface for submitting jobs. Try the following:

On your computer

ssh -X usr123@abacus

And then on abacus:

qmon