Using SLURM with Jupyter Notebook

Overview

When working in a Jupyter Notebook session, it can be advantageous to launch jobs that use output from the session, or to conduct parallel processing tasks. This tutorial demonstrates two methods of launching SLURM jobs from inside a Jupyter Notebook.

SLURM-MAGIC

The Python package slurm-magic permits complete job scripts to be constructed inside a Jupyter Notebook and submitted to the cluster resource manager. It is recommended to install slurm-magic in the same conda environment in which you've installed Jupyter Notebook, since that environment and any packages installed in it will be used when executing the scripted job.

To install slurm-magic:

conda activate myenv
module load git-gcc
pip install git+https://github.com/NERSC/slurm-magic.git

where "myenv" is the name of the conda environment you intend to run Jupyter Notebook. After starting a job to begin an interactive SLURM session (for example using "srun"), start Jupyter Notebook, create a new notebook and enter the following in a single cell:

%load_ext slurm_magic
import warnings
warnings.filterwarnings("ignore")

After entering the code above, hit "Shift" and "Return" together. Now you can create the job script inside the next cell. For example, here is a case where we are submitting a script which will run on a GPU node:

%%sbatch
#SBATCH --job-name=myGPUTest
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
#SBATCH --mail-type=ALL
#SBATCH [email protected]

python /home/u/user/gpuTask.py -i /scratch/users/user/myInput.dat -o /scratch/users/user/gpuTask.out

Once again, hit "Shift" and "Return" to submit the job. (Make sure to substitute your actual script or command for the line beginning with "python".)

NOTE: Do NOT try to combine the first and second code blocks, as this may generate an error. Ensure that the "%load_ext slurm_magic" statement is executed BEFORE creating your job script.

You can use essentially any common SLURM directive in this manner, without special keywords or limitations. Be aware, however, that whatever code you execute will be limited to the packages installed in the conda environment you used to start Jupyter Notebook, so loading other Python modules (for example) may cause unexpected errors or other issues.

Once the job script is submitted, you can check on the status using the "squeue -u username" command in another terminal.
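Alternatively, slurm-magic wraps most SLURM commands as line magics, so you can check the queue without leaving the notebook; a brief sketch (with your actual username substituted) is:

%squeue -u username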

Here is a second example showing how to submit a multi-node MPI job using slurm-magic. Suppose we have a script named "mpiTest.py" that contains the following:

import ipyparallel as ipp
def mpi_example():
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    return f"Hello World from rank {comm.Get_rank()}. total ranks={comm.Get_size()}. host={MPI.Get_processor_name()}"

# request an MPI cluster with 24 engines
with ipp.Cluster(controller_ip="*", engines="mpi", n=24) as rc:
    # get a broadcast_view on the cluster which is best
    # suited for MPI style computation
    view = rc.broadcast_view()
    # run the mpi_example function on all engines in parallel
    r = view.apply_sync(mpi_example)
    # Retrieve and print the result from the engines
    print("\n".join(r))
# at this point, the cluster processes have been shutdown

If you have the necessary prerequisites installed in the conda environment you used to launch Jupyter Notebook (for example, ipyparallel, mpi4py, mpich, etc.), you can submit this using slurm-magic with the following:

%%sbatch
#SBATCH --job-name=myMPITest
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=8
#SBATCH --ntasks=24
#SBATCH --mail-type=ALL
#SBATCH [email protected]

python /home/u/user/mpiTest.py

Remember to load the slurm-magic extension before attempting to run the code above. This job will be assigned 3 nodes, with 8 tasks on each (24 total), and will run our MPI script across those nodes (see Using MPI with Jupyter for more information).
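After the MPI job has finished, you can review it from the notebook as well; for example, assuming your version of slurm-magic provides the sacct magic, something along these lines will summarize your recent jobs (substitute your username):

%sacct -u username --format=JobID,JobName,State,Elapsed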

SUBMITIT

The python/conda package "submitit" can also be used to submit jobs to the SLURM resource manager from inside Jupyter Notebook. Unlike "slurm-magic", "submitit" uses its own notation for structuring and submitting jobs. Documentation for "submitit" is not extensive, and some common SLURM directives may not be available. Submitit is also designed to be an "in-line" part of your Python code, rather than a standalone job script.

To install submitit:

conda activate myenv
conda install submitit

Next, start an interactive SLURM session and start Jupyter Notebook. Then create a new notebook, and enter your code in a cell.

In the following example, we create a function called "primes" and then submit it with arguments to the cluster using submitit:

import submitit
import os

def primes(nprimes):
    # Print every prime number up to nprimes
    os.system('module load Python')
    n = nprimes
    for p in range(2, n + 1):
        for i in range(2, p):
            if p % i == 0:
                break
        else:
            print(p)
    print('Done')

# Folder for submitit's job files; %j is replaced with the SLURM job ID
log_folder = "log_test/%j"
executor = submitit.AutoExecutor(folder=log_folder)
executor.update_parameters(slurm_job_name="PrimesTest", tasks_per_node=1, nodes=1, gpus_per_node=1, timeout_min=300, slurm_partition="defq")
job = executor.submit(primes, 1000000)
print(job.job_id)  # ID of your job
output = job.result()  # blocks until the job finishes and returns the function's return value

In this example, we create a function "primes" which prints all prime numbers up to the value of "n". We also need to specify a "log_folder" for submitit; this is where all of the job-related files will be stored, including output, error, and job submission scripts. Next, a submitit object named "executor" is created, which is used to specify job parameters as shown above. Finally, calling executor.submit() creates a job object named "job"; this actually submits our function "primes" to the cluster as a job, with an input value of "1000000" for the function. The "print" statement displays your job ID, and "output" holds the value returned by job.result(), which blocks until the job finishes and returns the function's return value (None here, since "primes" only prints); if the job fails, job.result() raises an exception containing the error information.

Once the code is all entered into a cell, hit "Shift" and "Return" to execute it. Unless your function is designed to write output to a specified location, its printed output will be found in a *.out file under your "log_test" directory, in a subdirectory named after the job ID of the job you just submitted.
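You can also check on the job from inside the notebook using helper methods that submitit provides on the job object; a brief sketch is:

print(job.state)     # current SLURM state, e.g. PENDING, RUNNING, COMPLETED
print(job.done())    # True once the job has finished
print(job.stdout())  # contents of the job's .out log file, once it exists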

More Information


CategoryHPC