Running on Slurm HPC Systems

mythos with the RayOptimizer can scale across multiple nodes on Slurm-managed HPC clusters. This page covers practical advice for writing sbatch scripts and tuning resource allocation.

sbatch Script Example

The following script launches a multi-node Ray cluster on Slurm and runs an optimization using ray symmetric-run (requires Ray 2.50+):

#!/bin/bash
#SBATCH --job-name=mythos-oxdna-lp
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --nodes=4
#SBATCH --partition=cpu
#SBATCH --time=02:00:00

set -x

nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
head_node=${nodes_array[0]}
port=6379
ip_head=$head_node:$port
export ip_head
echo "IP Head: $ip_head"

module load gcc/9.3.0 cmake/3.27.9  # replace with your cluster's module names
. ~/mythos/.venv/bin/activate  # replace with your python virtual/conda environment

srun --nodes="$SLURM_JOB_NUM_NODES" --ntasks="$SLURM_JOB_NUM_NODES" \
  ray symmetric-run \
  --address "$ip_head" \
  --min-nodes "$SLURM_JOB_NUM_NODES" \  # Min nodes waits for all nodes to join before starting
  --num-cpus="${SLURM_CPUS_PER_TASK}" \  # Per-worker logical CPU count
  -- \
  python -u persistence_length_optimization.py

Key sbatch Settings

tasks-per-node

Always set --tasks-per-node=1. Each Slurm node runs one Ray worker process; Ray handles task-level parallelism internally. Setting this to a higher value will launch multiple competing Ray workers on the same node, leading to resource contention and unpredictable behavior.

cpus-per-task

Set --cpus-per-task to the number of CPU cores each node should expose to Ray. This value is passed to ray symmetric-run via --num-cpus so that Ray knows how many CPUs are available per node for scheduling tasks.

nodes

The total number of parallel simulators is determined by the total CPU count across all nodes (cpus-per-task × nodes) divided by the CPUs allocated per simulator via scheduler_hints. For oxDNA, each simulation typically runs best with a single CPU (num_cpus=1), so four nodes with 16 cores each can run up to 64 simulators in parallel.

Note that a single simulator cannot span multiple nodes — if a simulator requires more CPUs than cpus-per-task, it will not be schedulable. Ensure that the per-simulator num_cpus in scheduler_hints does not exceed cpus-per-task.

Ray Cluster Startup

The script above uses ray symmetric-run, which handles starting and connecting all workers to the head node automatically. The key flags are:

  • --address "$ip_head" — the head node address, derived from Slurm’s SLURM_JOB_NODELIST.

  • --min-nodes "$SLURM_JOB_NUM_NODES" — wait for all allocated nodes to join before starting the workload.

  • --num-cpus — tells Ray how many CPUs each worker should advertise (should match --cpus-per-task).

Module Loading

If the cluster nodes require specific compiler toolchains or build tools (e.g., gcc, cmake). The oxDNA simulator recompiles its binary during optimization, so cmake and a C++ compiler must be available on every node.

Memory Considerations

oxDNA Build Memory

The oxDNA source compilation (triggered each optimization step when parameters change) can require significant memory — potentially several gigabytes per build. If you observe out-of-memory errors during the build phase, consider:

  • Requesting higher-memory nodes or partitions (via Slurm’s --mem or --mem-per-cpu options, the default may be insufficient).

  • Reducing n_build_threads on the oxDNASimulator to lower peak memory during parallel compilation.

  • Setting mem_mb in scheduler_hints so Ray avoids scheduling too many concurrent builds on the same node.

from mythos.utils.scheduler import SchedulerHints

simulator = oxdna.oxDNASimulator(
    ...,
    n_build_threads=4,
    scheduler_hints=SchedulerHints(
        num_cpus=4,
        mem_mb=8192,  # reserve 8 GB per task
    ),
)

Trajectory and Gradient Memory

When many simulators feed trajectories into a single objective, memory can grow quickly. See the Memory Considerations section of the Ray Optimizer page for strategies on managing trajectory concatenation and gradient computation memory.

Other considerations

Other simulators, objectives, and callbacks may have their own memory requirements. Typically either the OS or Slurm will kill the job if it exceeds available or allocated memory, and the logs will show OOM kill messages.

Some options for diagnosing:

  • Use Slurm’s memory monitoring tools (e.g., sacct with the MaxRSS field)

  • Check Ray’s dashboard for worker resource usage and correlate which tasks are running when OOM kills occur.

  • Adjust Slurm’s memory allocation and Ray’s scheduler hints iteratively on a scaled-down sample workflow to find a stable configuration

See Observability and Debugging on the Ray Optimizer page for information on using the Ray dashboard to monitor resource usage, including how to access it via SSH port forwarding on Slurm clusters.

Troubleshooting

Further advice on using ray on Slurm can be found in the Ray documentation: Deploying on Slurm.

Workers fail to join

If the job times out waiting for workers, verify that all nodes can reach the head node on port 6379 (or your chosen port). Some clusters have firewall rules between compute nodes that may block Ray’s communication ports, or the port is already bound by another process, and another should be chosen.

Build failures on worker nodes

If oxDNA compilation fails on worker nodes but succeeds locally, ensure that the required modules (gcc, cmake) are loaded in the sbatch script. Worker nodes may not have the same default module set as login nodes.

Out-of-memory kills

If Slurm’s OOM killer terminates your job, check whether the issue is during oxDNA compilation or during gradient computation. Use mem_mb in scheduler_hints to help Ray distribute memory-intensive tasks across nodes, and consider requesting a higher-memory partition.