What's the relationship between Sun Grid Engine (SGE) process number and OpenMPI process number?

Understanding the Relationship Between SGE and OpenMPI Process Numbers

Explore how Sun Grid Engine (SGE) job slots map to OpenMPI process ranks, and learn how to configure your MPI jobs for optimal performance on an SGE cluster.

When running parallel applications using OpenMPI on a Sun Grid Engine (SGE) cluster, a common point of confusion arises regarding the relationship between SGE's allocated process slots and OpenMPI's internal process numbering (ranks). Understanding this mapping is crucial for correctly configuring your MPI jobs, ensuring efficient resource utilization, and debugging performance issues. This article will clarify how SGE communicates resource allocations to OpenMPI and how OpenMPI interprets these to assign ranks to your parallel processes.

SGE's Role in Resource Allocation

Sun Grid Engine (SGE), now often referred to as Oracle Grid Engine or Open Grid Engine, is a workload management system that allocates computational resources (CPU cores, memory, etc.) to jobs submitted by users. For parallel jobs, SGE uses a concept called a 'Parallel Environment' (PE). When you submit an MPI job, you request a certain number of slots from a specific PE. SGE then finds available hosts and assigns the requested slots across them. The key information SGE provides to the job is a 'machine file' or 'host file' that lists the allocated hosts and the number of slots on each.

#!/bin/bash
#$ -N MyMPIJob
#$ -pe mpi 8
#$ -cwd

# SGE sets $PE_HOSTFILE to the path of the machine file for this job.
# An OpenMPI build with SGE (gridengine) support picks up the allocation
# automatically; a build without it needs a machine file passed via -hostfile
# (the PE_HOSTFILE format may need converting to OpenMPI's own hostfile syntax).

mpirun -np $NSLOTS ./my_mpi_program

Example SGE job script for an OpenMPI job.
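
To confirm what SGE actually handed to the job before mpirun runs, a few extra lines in the job script are enough. The sketch below reuses the same hypothetical 'mpi' parallel environment and my_mpi_program binary as above; $NSLOTS and $PE_HOSTFILE are standard variables SGE exports to every parallel-environment job.

#!/bin/bash
#$ -N InspectAllocation
#$ -pe mpi 8
#$ -cwd

# Total number of slots SGE granted to this job
echo "NSLOTS = $NSLOTS"

# Path to the machine file SGE generated, plus its contents
echo "PE_HOSTFILE = $PE_HOSTFILE"
cat "$PE_HOSTFILE"

# Launch across the allocated slots
mpirun -np "$NSLOTS" ./my_mpi_program

Job script sketch that prints the SGE allocation before launching the MPI program.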

OpenMPI's Interpretation of Resources

OpenMPI's mpirun (or orterun) command is responsible for launching your parallel application across the allocated resources. When mpirun starts, it reads the machine file provided by SGE (usually via the $PE_HOSTFILE environment variable). This file tells mpirun which hosts to use and how many processes (slots) it can launch on each host. Based on this information, mpirun then assigns a unique rank (from 0 to N-1, where N is the total number of processes) to each MPI process it launches.

flowchart TD
    A[SGE Job Submission] --> B{Request PE and Slots}
    B --> C[SGE Allocates Resources]
    C --> D["Generates $PE_HOSTFILE"]
    D --> E[mpirun Reads $PE_HOSTFILE]
    E --> F[mpirun Launches MPI Processes]
    F --> G["MPI Processes Get Unique Ranks (0 to N-1)"]
    G --> H[MPI Application Execution]

Flow of resource allocation from SGE to OpenMPI process numbering.
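
To see how mpirun has translated that allocation into ranks, OpenMPI can print the process map it computed before launching the application. The invocation below is a sketch assuming an OpenMPI mpirun with SGE (gridengine) support; my_mpi_program is again a placeholder.

# Print the per-host process map mpirun derived from the SGE allocation,
# then launch the program as usual
mpirun --display-map -np $NSLOTS ./my_mpi_program

Displaying the rank-to-host map computed by mpirun.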

Crucially, the total number of processes (-np argument to mpirun) should match the total number of slots SGE has allocated ($NSLOTS). If these numbers don't match, you might encounter errors or inefficient resource usage. For instance, if you request 8 slots from SGE but tell mpirun to launch 16 processes, mpirun will attempt to launch more processes than SGE has allocated, potentially leading to oversubscription or job failure.
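
The safe pattern is to derive the process count from $NSLOTS rather than hard-coding it. A sketch, assuming a recent OpenMPI where oversubscription must be requested explicitly (the exact flag spelling has varied across releases):

# Correct: the process count follows the SGE allocation
mpirun -np $NSLOTS ./my_mpi_program

# Asking for more processes than allocated slots is normally refused;
# oversubscription has to be requested explicitly, and is rarely a good
# idea on a shared cluster
mpirun -np 16 --oversubscribe ./my_mpi_program

Keeping the mpirun process count in step with the SGE slot count.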

Mapping SGE Slots to MPI Ranks

The relationship is direct: each SGE slot corresponds to one potential MPI process. OpenMPI takes the list of hosts and slots from the $PE_HOSTFILE and distributes the MPI ranks across them. By default, OpenMPI tries to fill up each host with processes before moving to the next host, but this behavior can be controlled with process binding and mapping options (e.g., --map-by, --bind-to).
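
The placement policy can be changed on the mpirun command line. The options below exist in OpenMPI 1.8 and later; treat the exact output of --report-bindings as version-dependent.

# Default: fill each host's slots before moving on to the next host
mpirun -np $NSLOTS ./my_mpi_program

# Round-robin ranks across hosts instead: rank 0 on the first host,
# rank 1 on the second, and so on
mpirun -np $NSLOTS --map-by node ./my_mpi_program

# Bind each rank to a core and report the resulting bindings on stderr
mpirun -np $NSLOTS --bind-to core --report-bindings ./my_mpi_program

Controlling how ranks are mapped and bound with mpirun options.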

# Example $PE_HOSTFILE content for a job requesting 8 slots
# SGE allocates 4 slots on hostA and 4 slots on hostB
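# (a real PE_HOSTFILE typically carries two further columns per line:
#  the queue instance and a processor range, often shown as UNDEFINED)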
hostA.example.com 4
hostB.example.com 4

Example content of an SGE PE_HOSTFILE.

Given the above $PE_HOSTFILE, mpirun -np 8 ./my_mpi_program would typically launch ranks 0, 1, 2, 3 on hostA and ranks 4, 5, 6, 7 on hostB. The exact assignment of ranks to physical cores within a host depends on OpenMPI's binding policies and the available hardware topology.
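
A quick way to verify the placement without writing any MPI code is to run a trivial command under mpirun and let it tag each output line with the rank that produced it. A sketch, assuming OpenMPI's --tag-output option is available:

# Each output line is prefixed with the job and rank that produced it,
# so the printed hostname shows where that rank is running
mpirun -np $NSLOTS --tag-output hostname

Verifying which host each rank runs on.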