Let’s get into the basics of submitting jobs to the cluster…
Although the cluster is just like any other computer, the way we run scripts is slightly different.
If everyone ran their jobs directly on the cluster, it would be chaos!
Jobs would all be running at the same time,
it would be harder to prevent the cluster from crashing,
it would be hard to manage multiple users…
That’s why we need a queueing system…
Distributes jobs across nodes/cores
Makes a waiting list
Manages job priorities
Ensures that jobs are contained and the nodes are shielded from possible errors
Ensures that jobs run only within their allocated resources (cores/RAM/time)
The master node hosts the queueing system. That’s why we shouldn’t run jobs there.
If the master node crashes, the queueing system crashes too, which means everyone’s jobs might go with it and no one can use the cluster until the whole thing is reset.
There are multiple queueing systems available. Abacus uses SGE (Sun Grid Engine).
There are 3 basic commands we’ll cover today:
qsub
qstat
qdel
To use qsub, we will “always” need a bash script that runs our R scripts.
Use nano to write a bash script named script.sh:
#!/bin/bash
Rscript script.R
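If you would rather not open an editor, the same two-line file can be written from the command line. This is only a sketch of an alternative to nano; the `bash -n` check is an extra step not required by the queue, and it only parses the script without running it.

```shell
# Write script.sh without opening an editor (same contents as above).
cat > script.sh <<'EOF'
#!/bin/bash
Rscript script.R
EOF

# Parse the file without executing it; a non-zero exit status means a syntax error.
bash -n script.sh && echo "script.sh: syntax OK"
```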
Now you should have an R script named script.R and a bash script named script.sh. Try running the job on the cluster with:
qsub -cwd -S /bin/bash script.sh
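SGE also reads options from lines that start with `#$` inside the submitted script, so the two flags above can live in the script itself. A minimal sketch, assuming a hypothetical file name `script_with_opts.sh`; the submission is guarded so the snippet does nothing harmful on a machine without SGE.

```shell
# Same job as script.sh, with the qsub flags embedded as "#$" directive lines.
# script_with_opts.sh is a hypothetical name chosen for this example.
cat > script_with_opts.sh <<'EOF'
#!/bin/bash
#$ -cwd
#$ -S /bin/bash
Rscript script.R
EOF

# Submit only where qsub is actually available (i.e. on the cluster).
if command -v qsub >/dev/null 2>&1; then
    qsub script_with_opts.sh
else
    echo "qsub not found: run this on the cluster"
fi
```

With the directives in place, a plain `qsub script_with_opts.sh` should behave like the longer command above.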
You will see that two new files have been created in the directory you submitted from (your home directory, if that is where you ran qsub). SGE appends the job ID to each file name.
script.sh.o<job-id>
This is the output file. Unless Steve messes with the cluster, you should see the output of the R script inside.
script.sh.e<job-id>
This is the error file. Unless Steve messes with the cluster, you can use it to debug your code.
You could try changing the R script to produce an error:
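After a few submissions the directory fills up with output and error files, so a small helper that prints the most recent pair can save some typing. This is a sketch, not part of SGE: `show_job_output` is a hypothetical function name, and it assumes the files follow the pattern script name, then `.o` or `.e`, then the numeric job ID.

```shell
# Print the newest output (.o) and error (.e) file for a submitted script.
# Assumes names of the form <script>.o<job-id> and <script>.e<job-id>.
show_job_output() {
    local script="$1" kind file
    for kind in o e; do
        # ls -t lists newest first; errors are silenced when nothing matches
        file=$(ls -t "${script}.${kind}"[0-9]* 2>/dev/null | head -n 1)
        if [ -n "$file" ]; then
            echo "== $file =="
            cat "$file"
        fi
    done
}

# Prints nothing if no job has finished yet.
show_job_output script.sh
```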
Sys.sleep(20)
message("my second job in the cluster")
error <- log("This will produce an error because it is a string")
We can use qstat to check the status of the queue.
To get the status of your jobs in the queue, use:
qstat
To get the status of every user's jobs in the queue, use:
qstat -u "*"
We get something like this:
job-ID   prior    name        user    state  submit/start at      queue                         slots  ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1382668  0.50500  ivs_rogini  rru36   r      11/20/2017 15:17:55  all.q@compute-0-5.local       1
1382673  0.50500  ivs_simu.s  rru36   r      11/20/2017 21:19:10  all.q@compute-0-12.local      1
1382685  0.50500  ivs44.sh    rru36   r      11/21/2017 16:43:25  all.q@compute-0-14.local      1
1382708  0.50500  dts_simu.s  rru36   r      11/24/2017 14:45:55  all.q@math-compute-0-0.local  1
1382751  0.50500  convertToH  jmp197  r      12/08/2017 14:14:11  all.q@compute-0-15.local      1
1382756  0.50500  moc_rrsw.s  rru36   r      12/12/2017 18:11:26  all.q@math-compute-0-3.local  1
1382765  0.60500  run_simula  jmp197  r      12/13/2017 15:50:26  all.q@compute-0-11.local      41     1
1382765  0.60500  run_simula  jmp197  r      12/13/2017 15:50:26  all.q@compute-0-12.local      41     2
1382765  0.60500  run_simula  jmp197  r      12/13/2017 15:50:26  all.q@compute-0-13.local      41     3
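When the queue is busy, a per-user summary is easier to read than the full listing. The sketch below counts rows per user with awk; `jobs_per_user` is a hypothetical helper name, and it assumes the layout shown above (two header lines, user name in the fourth column). Each task of an array job appears as its own row, so it is counted separately.

```shell
# Count qstat rows per user from `qstat -u "*"` output on standard input.
# Skips the two header lines; the user name is the fourth column.
jobs_per_user() {
    awk 'NR > 2 && NF { n[$4]++ } END { for (u in n) print u, n[u] }' | sort
}

# On the cluster:  qstat -u "*" | jobs_per_user
# Here a few rows from the listing above stand in for real output.
jobs_per_user <<'EOF'
job-ID   prior    name        user    state  submit/start at      queue  slots  ja-task-ID
------------------------------------------------------------------------------------------
1382668  0.50500  ivs_rogini  rru36   r      11/20/2017 15:17:55  all.q@compute-0-5.local   1
1382751  0.50500  convertToH  jmp197  r      12/08/2017 14:14:11  all.q@compute-0-15.local  1
1382765  0.60500  run_simula  jmp197  r      12/13/2017 15:50:26  all.q@compute-0-11.local  41  1
EOF
```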
Try submitting a job and then getting info about it:
qsub script.sh
qstat
qdel is used to delete jobs in the queue.
We need the job ID first, which we can obtain by calling qstat:
qsub script.sh
qstat
qdel the_job_id_you_see_on_qstat
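Instead of copying the ID from qstat by hand, it can be taken from the confirmation message that qsub prints on submission. This is a sketch that assumes the message has the usual form `Your job <id> ("<name>") has been submitted`; `extract_job_id` is a hypothetical helper name, and the sample message stands in for real qsub output.

```shell
# Pull the job ID (third word) out of qsub's confirmation message, e.g.
#   Your job 1382668 ("script.sh") has been submitted
extract_job_id() {
    awk '/^Your job / { print $3 }'
}

# A sample message is used here so the snippet runs without SGE.
job_id=$(echo 'Your job 1382668 ("script.sh") has been submitted' | extract_job_id)
echo "qdel $job_id"
```

On the cluster, the same idea would read `job_id=$(qsub -cwd -S /bin/bash script.sh | extract_job_id)` followed by `qdel "$job_id"`.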
SGE provides a graphical interface for submitting jobs. Try the following:
On your computer:
ssh -X usr123@abacus
And then on abacus:
qmon