This package allows you to send function calls as jobs on a computing cluster with a minimal interface provided by the Q function:
# load the library and create a simple function
library(clustermq)
fx = function(x) x * 2
# queue the function call on your scheduler
Q(fx, x=1:3, n_jobs=1)
# list(2,4,6)
Computations are done entirely on the network and without any temporary files on network-mounted storage, so there is no strain on the file system apart from starting up R once per job. All calculations are load-balanced, i.e. workers that get their jobs done faster will also receive more function calls to work on. This is especially useful if not all calls return after the same time, or one worker has a high load.
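The load balancing described above can be observed with a small experiment. This is a minimal sketch, assuming the "multiprocess" scheduler for a local run; `slow_task` and its `secs` argument are hypothetical names made up for illustration, and `chunk_size=1` is set explicitly so that calls are handed out one at a time.

```r
library(clustermq)

# Assumption: use local processes so the sketch runs without a cluster
options(clustermq.scheduler = "multiprocess")

# Hypothetical task: wait `secs` seconds, then return the worker's process ID
slow_task = function(secs) {
    Sys.sleep(secs)
    Sys.getpid()
}

# One slow call followed by eight fast ones, shared between two workers
secs = c(2, rep(0.2, 8))
pids = Q(slow_task, secs = secs, n_jobs = 2, chunk_size = 1)

# Number of calls each worker handled; the exact split depends on timing
calls_per_worker = table(unlist(pids))
print(calls_per_worker)
```

The worker that picks up the slow call will typically handle fewer calls than the other one, although the exact split depends on when each worker starts up.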
Browse the vignettes here:
Install the clustermq package in R from CRAN (including the bundled ZeroMQ system library):
install.packages('clustermq')
Alternatively, you can use the remotes package to install directly from GitHub. Note that this version needs autoconf/automake and CMake for compilation:
# install.packages('remotes')
remotes::install_github('mschubert/clustermq')
# remotes::install_github('mschubert/clustermq@develop') # dev version
[!TIP] For installation problems, see the FAQ
An HPC cluster’s scheduler ensures that computing jobs are distributed to available worker nodes. Hence, this is what clustermq interfaces with in order to do computations.
We currently support the following schedulers (either locally or via SSH):
options(clustermq.scheduler="multiprocess")
options(clustermq.scheduler="PBS"/"Torque")
options(clustermq.scheduler="ssh", clustermq.ssh.host=<yourhost>)
[!TIP] Follow the links above to configure your scheduler in case it is not working out of the box and check the FAQ if your job submission errors or gets stuck
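As a rough sketch of how the scheduler options might be used in practice: one common pattern is to test calls locally first and switch to the cluster afterwards. Placing these lines in `~/.Rprofile` is an assumption about your setup, and `"myhost"` is a placeholder, not a real host name.

```r
# Assumption: these lines live in ~/.Rprofile or at the top of a script

# Local test run: calls are executed in parallel processes on this machine
options(clustermq.scheduler = "multiprocess")

# Placeholder for a remote run; "myhost" is a hypothetical host name
# options(clustermq.scheduler = "ssh", clustermq.ssh.host = "myhost")

# Confirm which scheduler is currently selected
selected = getOption("clustermq.scheduler")
print(selected)
```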
The most common arguments for Q are:
fun - The function to call. This needs to be self-sufficient (because it will not have access to the master environment)
... - All iterated arguments passed to the function. If there is more than one, all of them need to be named
const - A named list of non-iterated arguments passed to fun
export - A named list of objects to export to the worker environment
The documentation for other arguments can be accessed by typing ?Q. Examples of using const and export would be:
# adding a constant argument
fx = function(x, y) x * 2 + y
Q(fx, x=1:3, const=list(y=10), n_jobs=1)
# exporting an object to workers
fx = function(x) x * 2 + y
Q(fx, x=1:3, export=list(y=10), n_jobs=1)
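The rule that multiple iterated arguments must all be named can be illustrated in the same way. The sketch below assumes the "multiprocess" scheduler for a local run; `fxy` and `title_case` are hypothetical helper names. The second call uses the `pkgs` argument of Q (see `?Q`) to load a package on the worker, which is one way to keep a function self-sufficient.

```r
library(clustermq)

# Assumption: use local processes so the sketch runs without a cluster
options(clustermq.scheduler = "multiprocess")

# Two iterated arguments, both named: call i receives x[i] and y[i]
fxy = function(x, y) x * 2 + y
res = Q(fxy, x = 1:3, y = 4:6, n_jobs = 1)
# list(6,9,12)

# The worker does not see the master environment, so packages the
# function relies on are named explicitly via `pkgs`
title_case = function(s) toTitleCase(s)
titles = Q(title_case, s = c("hello world", "load balancing"),
           pkgs = "tools", n_jobs = 1)
```

Passing `y` through `const` instead would reuse the same value for every call, whereas naming it as an iterated argument pairs it element by element with `x`.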
clustermq can also be used as a parallel backend for foreach. As this is also used by BiocParallel, we can run those packages on the cluster as well:
library(foreach)
register_dopar_cmq(n_jobs=2, memory=1024) # see `?workers` for arguments
foreach(i=1:3) %dopar% sqrt(i) # this will be executed as jobs
library(BiocParallel)
register(DoparParam()) # after register_dopar_cmq(...)
bplapply(1:3, sqrt)
More examples are available in the User Guide.
There are some packages that provide high-level parallelization of R function calls on a computing cluster. We compared clustermq to BatchJobs and batchtools for processing many short-running jobs, and found it to have approximately 1000x less overhead cost.
In short, use clustermq if you want:
Use batchtools if you:
Don’t use batch (last updated 2013) or BatchJobs (issues with SQLite on network-mounted storage).
Contributions are welcome and they come in many different forms, shapes, and sizes. These include, but are not limited to:
log_worker=TRUE.
good first issue tag. Please discuss anything more complicated before putting a lot of work in; I’m happy to help you get started.
[!TIP] Check the User Guide and the FAQ first, maybe your query is already answered there
This project is part of my academic work, for which I will be evaluated on citations. If you would like me to be able to continue working on research support tools like clustermq, please cite the article when using it for publications:
M Schubert. clustermq enables efficient parallelisation of genomic analyses. Bioinformatics (2019). doi:10.1093/bioinformatics/btz284