HPC Project of the University of Parma and INFN Parma
User Guide (SLURM version)
Click here for the Italian version

Project Description (it)

Access / Login

In order to access the resources, you must be included in the LDAP database of the HPC management server. Requests for access or general assistance must be sent to: es_calcolo@unipr.it.

Once enabled, the login is done through SSH on the login host:

ssh <name.surname>@login.hpc.unipr.it 

Password access is allowed only from within the University network (160.78.0.0/16). From outside, you must use the University VPN or public key authentication.
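For example, from outside the University network, once public key authentication has been configured (see below), a connection may look like this (the key path is illustrative):

ssh -i ~/.ssh/id_rsa <name.surname>@login.hpc.unipr.it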

Password-less access between nodes

In order to use the cluster, password prompts between nodes must be eliminated by using public key authentication. Generate the key pair on login.hpc.unipr.it, without a passphrase, and add the public key to the authorization file (authorized_keys):

Key generation. Accept the defaults by pressing enter:

ssh-keygen -t rsa

Copy of the public key into authorized_keys:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

External Access with public key Authentication

The key pair must be generated with the SSH client. The private key should be protected by an appropriate passphrase (not mandatory but recommended). The public key must be included in your authorized_keys file on the login host.

If you use the PuTTY SSH client for Windows (http://www.putty.org), you need to generate the public and private key pair with PuTTYgen and save them to files. The private key must be specified in the PuTTY (or WinSCP) configuration panel:

Configuration -> Connection -> SSH -> Auth -> Private key file for authentication

The public key must be included in the .ssh/authorized_keys file on login.hpc.unipr.it

Useful links for SSH clients configuration: Linux, MacOS X, PuTTY , Windows SSH Secure Shell

The public key of the client (for example client_id_rsa.pub) must be inserted in the file ~/.ssh/authorized_keys on the login host:

Copy of the public key into authorized_keys:

cat client_id_rsa.pub >> ~/.ssh/authorized_keys
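On Linux and macOS clients the ssh-copy-id utility, if available, performs the same copy in a single step (a sketch; replace the username with your own):

ssh-copy-id -i ~/.ssh/id_rsa.pub <name.surname>@login.hpc.unipr.it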

File transfer

SSH is the only protocol for external communication and can also be used for file transfer.

If you use a Unix-like client (Linux, MacOS X) you can use the scp or sftp commands.
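For example (remote paths and file names are purely illustrative):

scp data.tar.gz <name.surname>@login.hpc.unipr.it:
scp <name.surname>@login.hpc.unipr.it:results.tar.gz .
sftp <name.surname>@login.hpc.unipr.it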

On Windows systems, the most used tool is WinSCP (https://winscp.net/eng/docs/introduction). During the installation of WinSCP it is possible to import Putty profiles.

SSH can also be used to mount a remote file-system using SshFS (see http://www.fis.unipr.it/dokuwiki/doku.php?id=calcoloscientifico:guidautente_slurm_en#sshfs)

Hardware

The current cluster is composed of the following computing nodes.

New computing nodes

  • Cluster1 ( BDW)
    • 8 nodes with 2 Intel Xeon E5-2683v4 (2x16 cores, 2.1GHz, 40MB smartcache), 128 GB RAM (E4)
    • 9 nodes with 2 Intel Xeon E5-2680v4 (2x14 cores, 2.4GHz, 35MB smartcache), 128 GB RAM (DELL R730)
    • 1 node with 2 Intel Xeon E5-2683v4 (2x16 cores, 2.1GHz, 40MB smartcache), 1024 GB RAM (E4 - FAT MEM)
    • 1 node with 4 Intel Xeon E7-8880v4 (4x22 cores, 2.2GHz, 55MB smartcache), 512 GB RAM (HP - FAT CORES)
  • Cluster2 ( GPU)
    • 2 nodes with 2 Intel Xeon E5-2683v4 (2x16 cores, 2.1GHz), 128 GB RAM, 7 GPU NVIDIA P100-PCIE-12GB (Pascal architecture).
  • Cluster3 ( KNL)
    • 4 nodes with 1 Intel Xeon PHI 7250 (1x68 cores, 1.4GHz, 16GB MCDRAM), 192 GB RAM.

Nodes details:

Node list - Usage (intranet only)

Peak performance (double precision):

1 Node BDW -> 2x16 (cores) x 2.1 (GHz) x 16 (AVX2) = 1 TFlops, Max memory Bandwidth = 76.8 GB/s
1 GPU P100 -> 4.7 TFlops
1 node KNL -> 68 (cores) x 1.4 (GHz) x 32 (AVX512) = 3 TFlops, Max memory bandwidth = 115.2 GB/s 

Interconnection with Intel OmniPath

Peak performance:

Bandwidth: 100 Gb/s, Latency: 100 ns.

Benchmarks: IMB, NBODY, HPL

Software

The operating system for all types of nodes is CentOS 7.X.

Environment software (libraries, compilers and tools): List

Some software components must be loaded in order to be used.

To list the available modules:

module avail

To load / unload a module (for example, intel):

module load   intel 
module unload intel

To list the loaded modules:

module list
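To show what a module sets up (paths and environment variables) and to unload all loaded modules, the standard module subcommands can be used:

module show intel
module purge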

Storage

The login node and computing nodes share the following storage areas:

Mount Point                 | Env. Var. | Backup | Quota              | Note                               | Support
/hpc/home                   | $HOME     | yes    | 50 GB              | Programs and data                  | SAN nearline
/hpc/group (/hpc/account ?) | $GROUP    | yes    | 100 GB             | Programs and data                  | SAN nearline
/hpc/share                  |           |        |                    | Application software and databases | SAN nearline
/hpc/scratch                | $SCRATCH  | no     | 1? TB, max 1 month | Run-time data                      | SAN
/hpc/archive                | $ARCHIVE  | no     |                    | Archive                            | NAS/tape/cloud (1)

(1) Archive: foreseen in 2019
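A typical usage pattern, sketched here with illustrative paths and file names, is to run jobs in the scratch area and copy the results back to the home (or group) area at the end:

cd "$SCRATCH"
mkdir -p myrun && cd myrun
# ... job commands run here ...
cp results.dat "$HOME"/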

Private Area

Acknowledgement

Remember to mention the project in your publications, in the Acknowledgements:

This research benefits from the HPC (High Performance Computing) facility of the University of Parma, Italy

Old sentence, do not use: Part of this research is conducted using the High Performance Computing (HPC) facility of the University of Parma.

Authors are requested to communicate the references of their publications, which will be listed on the site.

Job Submission with Slurm

The queues are scheduled with Slurm Workload Manager.

Slurm Partitions

Work in progress
Cluster | Partition | Job resources | TIMELIMIT   | Max running per user
BDW     | bdw       | 2-256 cores   | 10-00:00:00 |
KNL     | knl       | 2- cores      | 10-00:00:00 |
GPU     | gpu       | 1-10 GPU ??   | 0-24:00:00  | 6
        | vrt       | 1 core        | 10-00:00:00 |

Global configurations:

  • Global maximum of running jobs per user: ??
  • ..
  • Other partitions can be defined for special needs (heterogeneous jobs, dedicated resources, ..)

Private area PBSpro - Slurm

Useful commands

Display a summary of the partition status:

sinfo

Display the status of the individual partitions in detail:

scontrol show partition

List of nodes and their status:

sinfo -all

Submission of a job:

srun <options>             # interactive mode
sbatch <options> script.sh # batch mode
squeue                     # Display jobs in the queue:
sprio                      # show dynamic priority
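Some frequently useful variants of these commands (standard Slurm options):

squeue -u $USER            # only your own jobs
squeue -p bdw              # jobs in the bdw partition
scontrol show job <jobID>  # full details of a single job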

Main options

This option selects the partition (queue) to use:

-p <partition name> ( The default partition is bdw ?? )

Other options:

  • -Nx: where x is the number of chunks (groups of cores on the same node)
  • -ny: where y is the number of cores for each node (default 1)
  • --gres=gpu:tesla:X: where X is the number of GPUs for each node (consumable resources)
  • --mem=<size{units}>: requested memory per node
  • --ntasks=Y: where Y is the number of MPI processes for each node
  • --cpus-per-task=Z: where Z is the number of OpenMP threads for each process
  • --exclusive: allocate hosts exclusively (not shared with other jobs)

Example of resource selection:

-p bdw -N1 -n2

-t <days-hours:minutes:seconds> Maximum execution time of the job. This value determines the queue to be used. (Default: 0-00:72:00, to be verified)

Example:

-t 0-00:30:00

-A <account name>

--account=<account name>

Specifies the account to be charged for using resources. (Mandatory ??)

See Teamwork paragraph

Example:

-A T_HPC17A

-oe

redirects the standard error to standard output.

--mail-user=<mail address>

The --mail-user option specifies one or more e-mail addresses, separated by commas, that will receive the notifications from the queue manager.

If the option is not specified, the queue system sends notifications to the user's university email address. In the case of guests, notifications are sent to the user's personal e-mail address.

--mail-type=<FAIL, BEGIN, END, NONE, ALL>

The --mail-type option specifies the events that trigger a notification:

  • FAIL: notification in case of interruption of the job
  • BEGIN: notification when job starts
  • END: notification when job stops
  • NONE: no notification
  • ALL: all notifications
If the option is not specified, the queue system sends a notification only if the job fails.

Example:

--mail-user=john.smith@unipr.it
--mail-type=BEGIN,END

Priority

The priority (from queue to execution) is dynamically defined by three parameters:

  • Timelimit
  • Aging (waiting time in partition)
  • Fair share (amount of resources used in last 14 days)
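The individual priority factors of a pending job and the fair-share usage can be inspected with the standard Slurm commands:

sprio -l     # long format: age, fair-share and other priority factors per job
sshare -a    # fair-share usage per account and user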

Advance reservation

It is possible to define an advance reservation for teaching activities or special requests.

Advance reservation policy: ToDo

For a request send an e-mail to es_calcolo@unipr.it

Accounting

Reporting Example:

accbilling.sh  -a <accountname>   -s 2018-01-01 -e 2018-04-10
accbilling.sh  -u <username>   -s 2018-01-01 -e 2018-04-10
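As an alternative, the standard Slurm accounting command sacct can be queried directly, for example:

sacct -u <username> -S 2018-01-01 -E 2018-04-10 --format=JobID,JobName,Partition,Elapsed,State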

Interactive jobs

To check the list of assigned resources you can use interactive submission (srun). Once in interactive mode, the command echo $SLURM_JOB_NODELIST displays the list of assigned resources. The command squeue -al lists more details about the assigned resources.

srun -N<nodes number> -n<cores number> -q <QOS> -C <node type> -t <wall time> -L <file system>
echo $SLURM_JOB_NODELIST
scontrol show job <jobID>
exit

Examples:

# 1 group (chunk) of 2 CPU type BDW and file system Scratch
srun -N1 -n2 -p bdw -L SCRATCH
 
# 2 chunks of 2 CPU type KNL and file system Scratch (they can stay on the same node)
srun -N2 -n2 -p knl -L SCRATCH
 
# The chunks must be on different nodes
srun -N2 -n2 -p knl
 
# 1 chunk with 2 GPU on GPU Cluster
srun -N1 -p gpu --gres=gpu:2 -L SCRATCH
 
# 2 chunks each with 2 GPU on different nodes
srun -N2 --gres=gpu:2 -p gpu
 
# --ntasks=Y defines how many MPI processes are started for each chunk
srun -N2 --ntasks=1 -p bdw
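To open an interactive shell on a compute node, the standard --pty option can be used (partition and time limit are only examples):

srun -p bdw -N1 -n4 -t 0-01:00:00 --pty bash
hostname    # now runs on the allocated compute node
exit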

Batch job

A shell script must be created that includes the SLURM options and the commands that must be executed on the nodes.

To submit the job and charge the related resources to an account:

sbatch -A <account name> scriptname.sh

Each job is assigned a unique numeric identifier <Job Id>.

At the end of the execution the two files containing stdout and stderr will be created in the directory from which the job was submitted.

By default, the two files are named after the script with an additional extension:

Stdout: <script.sh>.o<job id> 
Stderr: <script.sh>.e<job id>
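The output file names can also be set explicitly with the standard -o and -e options, where %j expands to the job id:

#SBATCH -o myjob.%j.out
#SBATCH -e myjob.%j.err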

Serial jobs, compiler GNU

Compilation of the example mm.cpp for the calculation of the product of two matrices:

cp /hpc/share/samples/serial/mm.* .
g++ mm.cpp -o mm

Script mm.bash for the submission of the serial executable mm:

#!/bin/bash
 
#< Request one chunk with 1 CPU
#SBATCH -p bdw -N1 -n1
 
#< Declares that the job will last at most 30 minutes (days-hours:minutes:seconds)
#SBATCH --time 0-00:30:00
 
#< Charge resources to own account
#SBATCH $SBATCH_ACCOUNT
 
#< Print the assigned node name
echo $SLURM_JOB_NODELIST
 
#< Enter the directory that contains the script
cd "$SLURM_SUBMIT_DIR"
 
#< Executes the program
./mm

Submission:

sbatch mm.bash

See <job id> and the state:

squeue

To cancel the job in progress:

scancel <Job id>

Serial jobs, compiler Intel

Compiling the cpi_mc.c example for the calculation of Pi:

cp /hpc/share/samples/serial/cpi/cpi_mc.c .
module load intel
icc cpi_mc.c -o cpi_mc_int

Script cpi_mc.bash for the submission of the serial executable cpi_mc_int:

#!/bin/bash
 
#< Charge resources to own account
#SBATCH $SBATCH_ACCOUNT
 
#< Print the assigned node name
echo $SLURM_JOB_NODELIST
 
#< Load the compiler module Intel
module load intel
 
#< Enter the directory that contains the script
cd "$SLURM_SUBMIT_DIR"
 
#< Executes the program
 
N=10000000
./cpi_mc_int  -n $N

Submission:

sbatch cpi_mc.bash

Serial job, compiler PGI

Compiling the cpi_sqrt.c example for the computing of Pi:

cp /hpc/share/samples/serial/cpi/cpi_sqrt.c .
module load pgi
pgcc cpi_sqrt.c -o cpi_sqrt_pgi

Script cpi_sqrt_pgi.bash for the submission of the serial executable cpi_sqrt_pgi:

#!/bin/bash
 
#< Options SLURM default. They can be omitted
#SBATCH -p bdw -N1 -n32
#SBATCH --time 0-00:30:00
 
#< Charge resources to own account
#SBATCH $SBATCH_ACCOUNT
 
#< Print the assigned node name
echo $SLURM_JOB_NODELIST
 
module load pgi
#< Enter the directory that contains the script
cd "$SLURM_SUBMIT_DIR"
 
N=10000000
 
./cpi_sqrt_pgi -n $N
sbatch cpi_sqrt_pgi.bash

Job OpenMP with GNU 4.8

cp /hpc/share/samples/omp/omp_hello.c .
gcc -fopenmp omp_hello.c -o omp_hello

Script omp_hello.bash with the request for 32 CPUs in exclusive use.

#!/bin/bash
 
#SBATCH -p bdw -N1 -n32
#SBATCH --exclusive
#SBATCH -t 0-00:30:00
#SBATCH $SBATCH_ACCOUNT
 
#< Merge stderr with stdout
#SBATCH -oe
 
echo $SLURM_JOB_NODELIST
 
echo OMP_NUM_THREADS : $OMP_NUM_THREADS
 
cd "$SLURM_SUBMIT_DIR"
./omp_hello

Job OpenMP with Intel

module load intel 
cp /hpc/share/samples/omp/mm/omp_mm.cpp .

Script mm_omp.bash with the request of 1 whole node with at least 32 cores:

#!/bin/bash
 
#SBATCH -p bdw_debug -N1 -n32
#SBATCH --time 0-00:30:00
#SBATCH -oe
 
#SBATCH --account=<account>
 
echo $SLURM_JOB_NODELIST
cd "$SLURM_SUBMIT_DIR"
 
module load intel
icpc -qopenmp omp_mm.cpp -o omp_mm
 
# To change the number of threads:
export OMP_NUM_THREADS=8
 
echo  OMP_NUM_THREADS : $OMP_NUM_THREADS
 
./omp_mm

Job OpenMP with PGI

cp /hpc/share/samples/omp/mm/omp_mm.cpp .

Script omp_mm_pgi.bash. The BDW cluster consists of nodes with 32 cores. The OMP_NUM_THREADS variable is by default equal to the number of cores. If we want a different number of threads we can indicate it with the --cpus-per-task option:

#!/bin/sh
 
#SBATCH -p bdw_debug -N1 -n32
#SBATCH --cpus-per-task=4
#SBATCH --time 0-00:30:00
#SBATCH -oe
 
echo $SLURM_JOB_NODELIST
cd "$SLURM_SUBMIT_DIR"
 
module load pgi
 
pgc++ -mp omp_mm.cpp -o omp_mm_pgi
 
echo  OMP_NUM_THREADS : $OMP_NUM_THREADS
 
./omp_mm_pgi
sbatch -A <name account> omp_mm_pgi.bash

Job OpenMP with GNU 5.4

cp /hpc/share/samples/omp/cpi/* .
sbatch -A <name account> cpi2_omp.bash
python cpi2_omp.py

Job MPI, GNU OpenMPI

module load gnu openmpi
cp /hpc/share/samples/mpi/mpi_hello.c .
mpicc mpi_hello.c -o mpi_hello

Script mpi_hello.sh for using GNU OpenMPI:

#!/bin/bash
 
# 4 chunks of 16 CPUs each. Executes one MPI process for each CPU
#SBATCH -p bdw_debug -N4 -n16
#SBATCH -n 16
#SBATCH --time 0-00:30:00
#SBATCH -oe
 
echo "### SLURM_JOB_NODELIST ###"
echo $SLURM_JOB_NODELIST
echo "####################"
 
module load gnu openmpi
 
cd "$SLURM_SUBMIT_DIR"
mpirun  mpi_hello
sbatch -A <name account> mpi_hello.sh

Job MPI with Intel MPI

module load intel intelmpi
which mpicc
 
cp /hpc/share/samples/mpi/mpi_mm.c .
mpicc mpi_mm.c -o mpi_mm_int

Script mpi_mm_int.sh for using Intel MPI:

#!/bin/sh
 
# 4 chunks of 16 CPUs each. Executes one MPI process for each CPU
#SBATCH -p bdw_debug -N4 -n16
#SBATCH -n 16
#SBATCH --time 0-00:30:00
#SBATCH -oe
 
echo "### SLURM_JOB_NODELIST ###"
echo $SLURM_JOB_NODELIST
echo "####################"
 
module load intel intelmpi
 
cd "$SLURM_SUBMIT_DIR"
mpirun  mpi_mm_int

Job MPI with PGI

module load pgi openmpi
which mpicc 

cp /hpc/share/samples/mpi/mpi_hello.c .
mpicc mpi_hello.c -o mpi_hello_pgi

Script mpi_hello_pgi.sh for using PGI OpenMPI:

 
#!/bin/sh
 
#SBATCH -p bdw_debug -N4 -n16
#SBATCH --time 0-00:30:00
#SBATCH -oe
 
echo "### SLURM_JOB_NODELIST ###"
echo $SLURM_JOB_NODELIST
echo "####################"
 
NPUSER=$SLURM_JOB_NUM_NODES    # one MPI process per node
 
module load cuda pgi openmpi
 
cd "$SLURM_SUBMIT_DIR"
mpirun --npernode 1 mpi_hello_pgi

Job MPI + OpenMP with GNU OpenMPI

module load gnu openmpi
cp -p /hpc/share/samples/mpi+omp/mpiomp_hello.c .
mpicc -fopenmp mpiomp_hello.c -o mpiomp_hello_gnu

Script for running mpiomp_hello_gnu with GNU OpenMPI:

#!/bin/sh
 
# 4 chunks of 16 CPUs each, 1 MPI process for each chunk, 16 OpenMP threads per process
#SBATCH -p bdw_debug -N4 -n16
#SBATCH -n 4
#SBATCH --cpus-per-task=16      # Number of threads OpenMP for each process MPI
#SBATCH --time 0-00:30:00
#SBATCH -oe
 
echo "### SLURM_JOB_NODELIST ###"
echo $SLURM_JOB_NODELIST
echo "####################"
 
module load gnu openmpi
 
cd "$SLURM_SUBMIT_DIR"
mpirun mpiomp_hello_gnu

Job MPI + OpenMP with Intel MPI

module load intel intelmpi
cp /hpc/share/samples/mpi+omp/mpiomp_hello.c .
mpicc -qopenmp mpiomp_hello.c -o mpiomp_hello_int
#!/bin/sh
 
# 4 chunks of 16 CPUs each, 1 MPI process for each chunk, 16 OpenMP threads per process
#SBATCH -p bdw_debug -N4 -n16
#SBATCH -n 4
#SBATCH --cpus-per-task=16      # Number of threads OpenMP for each process MPI
#SBATCH --time 0-00:30:00
#SBATCH -oe
 
echo "### SLURM_JOB_NODELIST ###"
echo $SLURM_JOB_NODELIST
echo "####################"
 
module load intel intelmpi
 
cd "$SLURM_SUBMIT_DIR"
mpirun mpiomp_hello_int

Use of cluster KNL

The compiler to use is Intel.

The selection of the KNL cluster is done by specifying -p knl_<debug, pro ..> as required resources.

The maximum number of cores (ncpus) selectable per node is 68. Each physical core includes 4 virtual cores with hyperthreading technology, for a total of 272 per node.

#!/bin/sh
 
# 4 whole nodes. Executes one MPI process for each node and 128 OpenMP threads per process
 
#SBATCH -p knl_debug -N4 -n1
#SBATCH -n 4
#SBATCH --cpus-per-task=128      # Number of OpenMP threads for each MPI process
#SBATCH --time 0-00:30:00
#SBATCH -oe
 
echo "### SLURM_JOB_NODELIST ###"
echo $SLURM_JOB_NODELIST
echo "####################"
 
module load intel intelmpi
 
cd "$SLURM_SUBMIT_DIR"
 
cp /hpc/share/samples/mpi+omp/mpiomp_hello.c .
mpicc -qopenmp mpiomp_hello.c -o mpiomp_hello_knl
mpirun  mpiomp_hello_knl

Use of cluster GPU

The GPU cluster consists of 2 machines with 7 GPUs each. The GPUs of a single machine are identified by an integer ID that goes from 0 to 6.

The compiler to use is nvcc:

NVIDIA CUDA Compiler

Compilation example:

cp /hpc/share/samples/cuda/hello_cuda.cu .
module load cuda
nvcc hello_cuda.cu -o hello_cuda

The GPU cluster is selected by specifying -p gpu_<debug, pro, ..> and --gres=gpu:<1-7> among the requested resources.

Example of submission on 1 of the 7 GPUs available on a single node of the GPU cluster:

#!/bin/sh
 
# 1 node with 1 GPU
 
#SBATCH -p gpu_debug -N1
#SBATCH --gres=gpu:tesla:1
#SBATCH --time 0-00:30:00
#SBATCH -oe
 
 
echo "### SLURM_JOB_NODEFILE ###"
cat $SLURM_JOB_NODEFILE
echo "####################"
 
module load cuda
 
cd "$SLURM_SUBMIT_DIR"
./hello_cuda

Example of submission of the N-BODY benchmark on all 7 GPUs available in a single node of the GPU cluster:

#!/bin/sh
 
# 1 node with 7 GPU
 
#SBATCH -p gpu_debug -N1
#SBATCH --gres=gpu:tesla:7
#SBATCH --time 0-00:30:00
#SBATCH -oe
 
echo "### SLURM_JOB_NODEFILE ###"
cat $SLURM_JOB_NODEFILE
echo "####################"
 
module load cuda
 
cd "$SLURM_SUBMIT_DIR"
/hpc/share/tools/cuda-9.0.176/samples/5_Simulations/nbody/nbody -benchmark -numbodies 1024000 -numdevices=7

In the case of N-BODY, the number of GPUs to be used is specified with the -numdevices option (the specified value must not exceed the number of GPUs requested with the --gres option).

In general, the GPU IDs to be used are derived from the value of the CUDA_VISIBLE_DEVICES environment variable.

In the case of the last example we have:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6
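Inside a job script the assigned GPUs can be verified, for example:

echo "CUDA_VISIBLE_DEVICES = $CUDA_VISIBLE_DEVICES"
nvidia-smi    # lists the GPUs visible to the job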

Teamwork

To share files among members of a group, it is necessary to distinguish the type of activity.

In the interactive mode on the login node the command newgrp modifies the primary group and the permissions of the new files:

newgrp <groupname>

The newgrp command on the HPC cluster also automatically executes the command to enter the group directory (/hpc/group/<groupname>):

cd "$GROUP" 

In the Batch mode you must indicate the group to be used with the following directive:

#SBATCH --account=<account>
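A minimal sketch of sharing a directory with the other members of the group in the group area (group name and directory are illustrative):

newgrp <groupname>
cd "$GROUP"
mkdir -p shared_data
chmod g+rwx shared_data    # grant the group read/write access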

Scaling test

To sequentially launch a series of runs in the same job, for example to check the scaling of an algorithm:

cp /hpc/share/samples/serial/cpi/cpi_mc.c .
gcc cpi_mc.c -o cpi_mc

Script launch_single.sh

#!/bin/bash

cd "$SLURM_SUBMIT_DIR"

for N in $(seq 1000000 1000000 10000000)
do
CMD="./cpi_mc -n $N"
echo "# $CMD"
eval $CMD  >> cpi_mc_scaling.dat
done
sbatch -A <account name> launch_single.sh

The outputs of the different runs are written to the cpi_mc_scaling.dat file.

To generate a scaling plot we can use the python matplotlib library:

cp /hpc/share/samples/serial/cpi/cpi_mc_scaling.py .
python cpi_mc_scaling.py

Job Array

Using a single SLURM script it is possible to submit a set of jobs, which can run in parallel, specifying a different numerical parameter for each submitted job.

The --array option specifies the numeric sequence of parameters. At each launch the value of the parameter is contained in the $SLURM_ARRAY_TASK_ID variable.

Example:

Starts N jobs for the computation of Pi with the number of intervals increasing from 100000 to 900000 in steps of 10000:

cp /hpc/share/samples/serial/cpi/cpi_mc.c .
gcc cpi_mc.c -o cpi_mc

Script slurm_launch_parallel.sh

#!/bin/sh
 
#SBATCH --array=100000-900000:10000
cd "$SLURM_SUBMIT_DIR"
CMD="./cpi_mc -n ${SLURM_ARRAY_TASK_ID}"
echo "# $CMD"
eval $CMD
sbatch -A <name account> slurm_launch_parallel.sh

Gather the outputs:

grep -vh '^#' slurm_launch_parallel.sh.o*.*
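With the standard Slurm array syntax the number of array tasks running at the same time can also be limited, for example at most 4 concurrent tasks:

#SBATCH --array=100000-900000:10000%4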

Job MATLAB

Execution of a MATLAB serial program
cp /hpc/share/samples/matlab/pi_greco.m .

Script matlab.sh

#!/bin/sh
 
 
#SBATCH -p bdw_debug -N1 -n1
#SBATCH --time 0-00:30:00
 
cd "$SLURM_SUBMIT_DIR"
 
module load matlab
 
matlab -nodisplay -r pi_greco
sbatch -A <account name> matlab.sh
Execution of a parallel job with MATLAB
cp /hpc/share/samples/matlab/pi_greco_parallel.m .

Script matlab_parallel.sh. The MATLAB version installed on the cluster allows using at most the cores of a single node. The number of usable cores must be specified here; 4 is used for the time being.

#!/bin/sh
 
 
#SBATCH -p bdw_debug -N1 -n4
#SBATCH --time 0-00:30:00
 
cd "$SLURM_SUBMIT_DIR"
 
module load matlab
 
matlab -nodisplay -r pi_greco_parallel
sbatch -A <account name> matlab_parallel.sh
Execution of a MATLAB program on GPU
cp /hpc/share/samples/matlab/matlabGPU.m . # ---- to do ----

Script matlabGPU.sh

#!/bin/bash
 
#SBATCH -p gpu_debug -N1 -n1
#SBATCH --gres=gpu:1
#SBATCH --time 0-00:30:00
 
cd "$SLURM_SUBMIT_DIR"
 
module load matlab cuda
 
matlab -nodisplay -r matlabGPU
sbatch -A <account name> matlabGPU.sh

Job MPI Crystal14

Script crystal14.sh for submitting the MPI version of Crystal14. It requests 4 nodes with 8 cores each and starts 8 MPI processes per node:

#!/bin/sh
 
#SBATCH --job-name="crystal14" #Job name 
#SBATCH -p bdw_debug -N4 -n8 #Resource request
#SBATCH -n8
#SBATCH --time 0-168:00:00
 
# input files directory
CRY14_INP_DIR='input'
 
# output files directory
CRY14_OUT_DIR='output'
 
# input files prefix
CRY14_INP_PREFIX='test'
 
# input wave function file prefix
CRY14_F9_PREFIX='test'
 
source /hpc/share/applications/crystal14

We recommend creating a folder for each simulation. In each folder there must be a copy of the crystal14.sh script.

The script contains the definition of four variables:
  • CRY14_INP_DIR: the input file or files must be in the 'input' subfolder of the current directory. To use the current directory, comment the line with the definition of the CRY14_INP_DIR variable. To change subfolder, change the value of the CRY14_INP_DIR variable.
  • CRY14_OUT_DIR: the output files will be created in the 'output' subfolder of the current folder. To use the current directory, comment the line with the definition of the CRY14_OUT_DIR variable. To change subfolder modify the value of the variable CRY14_OUT_DIR.
  • CRY14_INP_PREFIX: the file or input files have a prefix that must coincide with the value of the CRY14_INP_PREFIX variable. The string 'test' is purely indicative and does not correspond to a real case.
  • CRY14_F9_PREFIX: the input file, with extension 'F9', is the result of a previous processing and must coincide with the value of the variable CRY14_F9_PREFIX. The string 'test' is purely indicative and does not correspond to a real case.

The crystal14.sh script includes, in turn, the system script /hpc/software/bin/hpc-pbs-crystal14. The latter cannot be changed by the user.

Submission of the shell script

Navigate to the folder containing crystal14.sh and run the following command to submit the script to the job scheduler:

sbatch ./crystal14.sh
Analysis of files produced by Crystal14 during job execution

During execution of the job a temporary tmp folder is created which contains the two files:

nodes.par
machines.LINUX

The nodes.par file contains the names of the nodes that participate in the parallel computing.

The machines.LINUX file contains the names of the nodes that participate in the parallel computing with a multiplicity equal to the number of MPI processes started on the node.

To locate the temporary folders produced by Crystal14 during the execution of the job, run the following command directly from the login node:

eval ls -d1 /hpc/node/wn{$(seq -s, 81 95)}/$USER/crystal/* 2>/dev/null
Be careful because the previous command contains the names of the currently available calculation nodes. This list and the corresponding command may change in the future.

To check the contents of the files produced by Crystal14 during the execution of the job, the user can move to one of the folders highlighted by the previous command.

At the end of the execution of the job, the two files machines.LINUX and nodes.par are deleted. The temporary folder tmp is deleted only if it is empty.

It is therefore not necessary to log in with SSH to the nodes participating in the processing to check the contents of the files produced by Crystal14.

Job Gromacs

To define the GMXLIB environment variable, add the following lines to the file $HOME/.bash_profile:

GMXLIB=$HOME/gromacs/top
 
export GMXLIB
The path $HOME/gromacs/top is purely indicative. Modify it according to your preferences.

Job Gromacs OpenMP

Script mdrun_omp.sh to exclusively request a node with 32 cores and start 16 OpenMP threads:

#!/bin/sh
 
#SBATCH -p bdw_debug -N1 -n32
#SBATCH --cpus-per-task=16      # Number of threads OpenMP
#SBATCH --exclusive
#SBATCH --time 0-24:00:00
 
test "$SLURM_ENVIRONMENT" = 'SLURM_BATCH' || exit
 
cd "$SLURM_SUBMIT_DIR"
 
module load gnu openmpi
source '/hpc/share/applications/gromacs/5.1.4/mpi_bdw/bin/GMXRC'
 
gmx mdrun -s topology.tpr -pin on
This will initiate a single MPI process and will result in suboptimal performance.

Job Gromacs MPI and OpenMP

Script mdrun_mpi_omp.sh to exclusively request a node with 32 cores and start 8 MPI processes (the number of OpenMP threads will be calculated automatically):

#!/bin/sh
 
#SBATCH -p bdw_debug -N2 -n32
#SBATCH -n 8
#SBATCH --exclusive
#SBATCH --time 0-24:00:00
 
test "$SLURM_ENVIRONMENT" = 'SLURM_BATCH' || exit
 
cd "$SLURM_SUBMIT_DIR"
 
module load gnu openmpi
source '/hpc/share/applications/gromacs/5.1.4/mpi_bdw/bin/GMXRC'
 
NNODES=$SLURM_JOB_NUM_NODES
NPUSER=$SLURM_NTASKS
export OMP_NUM_THREADS=$((SLURM_CPUS_ON_NODE/(NPUSER/NNODES)))
 
mpirun gmx mdrun -s topology.tpr -pin on
This will initiate multiple MPI processes and will achieve optimal performance.

Job Abaqus

Job Abaqus MPI

Example script abaqus.sh to run Abaqus on 1 node, 32 cores, 0 GPUs:

#!/bin/bash

# walltime --time : estimated execution time, max 240 hours (better to overestimate than underestimate)

#SBATCH -p bdw_debug -N1 -n32
#SBATCH --time 0-240:00:00

echo $SLURM_JOB_NODELIST

# Modules necessary for the execution of Abaqus
module load gnu intel openmpi

cd "$SLURM_SUBMIT_DIR"

abaqus j=testverita cpus=32
# j = filename.inp

Job Abaqus MPI with GPU

Example script abaqus-gpu.sh to run Abaqus on 1 node, 6 cores, 1 GPU:

#!/bin/bash

# walltime --time : estimated running time, max 240 hours (better a slightly higher estimate than the actual time)

#SBATCH -p gpu_debug -N1 -n6
#SBATCH --gres=gpu:1
#SBATCH --time 0-00:30:00

echo $SLURM_JOB_NODELIST

# Modules necessary for the execution of Abaqus
module load gnu intel openmpi cuda

cd "$SLURM_SUBMIT_DIR"

abaqus j=testverita cpus=6 gpus=1
# j= filename.inp

SSHFS

To exchange data with a remote machine on which an SSH server is installed, you can use SSHFS.

SSHFS is a file system for Unix-like operating systems (macOS, Linux, BSD). It allows you to locally mount a folder located on a host running an SSH server. The software is based on the FUSE kernel module.

Currently it is installed only on login.pr.infn.it. Alternatively, it can be installed on a remote Linux machine to access the cluster data from there.

To use it:

mkdir remote # create the mount point
sshfs <remote-user>@<remote-host>:<remote-dir> remote # mount the remote file system
df -h # list mounted file systems
ls remote/
fusermount -u remote # unmount the file system

VTune

VTune is a performance profiler from Intel and is available on the HPC cluster.

General information from Intel: https://software.intel.com/en-us/get-started-with-vtune-linux-os

Local Guide vtune (work in progress)

CINECA guides

Other resources
