CentOS 7 + TORQUE + NUMA

Consider a NUMA machine with two CPUs installed on a dual-socket motherboard, for example two Xeons. Running the lscpu command on such a machine shows the following:

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    1
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz
Stepping:              7
CPU MHz:               1200.048
CPU max MHz:           3900.0000
CPU min MHz:           1200.0000
BogoMIPS:              5600.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              22528K
NUMA node0 CPU(s):     0-15
NUMA node1 CPU(s):     16-31

As the output shows, the Xeon Gold 6242 has 16 cores: NUMA node 0 sees the 16 cores numbered 0 to 15, and NUMA node 1 sees the 16 cores numbered 16 to 31. In other words, the machine offers 32 cores in total. (Hyper-Threading is turned off.)
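The core count can also be derived from the lscpu output instead of being read off by eye. The following is a minimal sketch: `total_cores` is a hypothetical helper name, and it assumes the English field labels `Socket(s)` and `Core(s) per socket` that lscpu prints in the C locale.

```shell
#!/bin/sh
# Sketch: multiply sockets by cores per socket from lscpu-style text on stdin.
# total_cores is a hypothetical helper name, not part of lscpu or TORQUE.
total_cores() {
    awk -F: '
        /^Socket\(s\)/          { gsub(/[ \t]/, "", $2); sockets = $2 }
        /^Core\(s\) per socket/ { gsub(/[ \t]/, "", $2); cores = $2 }
        END { print sockets * cores }
    '
}

# Usage on a live system (LC_ALL=C forces the English labels):
#   LC_ALL=C lscpu | total_cores
```

On the machine described above this should report 32, since there are 2 sockets with 16 cores each.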

Job management system setup

The OS is CentOS 7, and I want to use TORQUE (Terascale Open-source Resource and QUEue Manager) as the job management system.

First, install the TORQUE server, client, and scheduler by following the article "Torque setup on CentOS 7".

If you follow the article above as-is, the pbsnodes command should show only one node (16 cores). This machine has two NUMA nodes, so the settings need to be partially rewritten as follows.

/var/lib/torque/server_priv/nodes


HOSTNAME num_node_boards=2 numa_board_str=16

/var/lib/torque/mom_priv/mom.layout


nodes=0
nodes=1

Here, write the computer's host name in place of HOSTNAME. Set num_node_boards to the number of NUMA nodes, and set numa_board_str to the number of CPU cores per NUMA node.
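For machines with more NUMA nodes, the mom.layout lines can be generated rather than typed by hand. This is only a sketch: `mom_layout_lines` is a hypothetical helper name, and it assumes the same one-line-per-NUMA-node `nodes=<index>` layout used in the example above.

```shell
#!/bin/sh
# Sketch: print one "nodes=<index>" line per NUMA node, as in the mom.layout above.
# mom_layout_lines is a hypothetical helper name, not a TORQUE command.
mom_layout_lines() {
    count=$1
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "nodes=$i"
        i=$((i + 1))
    done
}

# Usage (run as root; review the output before overwriting the real file):
#   mom_layout_lines 2 > /var/lib/torque/mom_priv/mom.layout
```

The argument is the number of NUMA nodes, that is, the same value written to num_node_boards.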

After making these changes, restart TORQUE:

# systemctl restart pbs_server
# systemctl restart pbs_sched
# systemctl restart pbs_mom

Run the pbsnodes command again. If two nodes with np = 16 and state = free are displayed, the setup succeeded. I think the nodes are named something like HOSTNAME-0 and HOSTNAME-1.
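The check can be scripted by counting the free nodes in the pbsnodes output. A minimal sketch, assuming each node entry contains a line of the form `state = free`; `count_free_nodes` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: count node entries whose state line reads "state = free".
# count_free_nodes is a hypothetical helper name; it reads pbsnodes-style text on stdin.
count_free_nodes() {
    grep -c 'state = free'
}

# Usage on the TORQUE server:
#   pbsnodes -a | count_free_nodes
```

With the configuration described here, the expected count is 2, one per NUMA node.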

For example, to run a 32-way parallel calculation with OpenMP, add an option like the following to your job script:

#PBS -l nodes=2:ppn=16

In this example, both NUMA nodes and all of their cores (32 in total) are reserved. While this job is running (status: R), other jobs wait in the queue (status: Q).
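A complete job script built around that option might look like the sketch below. The job name `omp32` and the program `./my_openmp_program` are placeholders assumed for illustration; PBS_O_WORKDIR is the directory from which qsub was invoked.

```shell
#!/bin/bash
#PBS -N omp32
#PBS -l nodes=2:ppn=16

# Move to the submission directory (falls back to "." when run outside TORQUE).
cd "${PBS_O_WORKDIR:-.}"

# One OpenMP thread per reserved core: 2 NUMA nodes x 16 cores.
export OMP_NUM_THREADS=32

# ./my_openmp_program is a placeholder for your own OpenMP binary.
if [ -x ./my_openmp_program ]; then
    ./my_openmp_program
else
    echo "my_openmp_program not found in $(pwd)" >&2
fi
```

Submit it with qsub, for example `qsub job.sh` if the script is saved under that (assumed) file name.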
