Slurm Installation from sources


This section shows how to install and set up the Simple Linux Utility for Resource Management (SLURM) from sources.

SLURM Installation from Sources

Install Requirements

Install munge on every node (compute and frontend). For example, on Debian:

apt-get -y install munge libfreeipmi-dev libhwloc-dev freeipmi libmunge-dev
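
On RHEL/CentOS systems the equivalent packages would be installed with yum instead; the package names below are an assumption based on the EPEL and base repositories:

yum -y install munge munge-devel munge-libs freeipmi freeipmi-devel hwloc hwloc-devel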


Configure Munge

Create the munge key

/usr/sbin/create-munge-key


Copy the munge key from the frontend to the compute nodes

cpush /etc/munge/munge.key /etc/munge/


Set the owner and group of the key on all nodes

cexec chown munge:munge /etc/munge/munge.key
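
cpush and cexec belong to the C3 cluster tools. If they are not available, the key can be distributed and its ownership fixed with scp and ssh instead; the sketch below assumes the compute nodes guane06 to guane16 used later in this guide:

for i in 06 07 08 09 10 11 12 13 14 15 16; do scp /etc/munge/munge.key guane$i:/etc/munge/; ssh guane$i chown munge:munge /etc/munge/munge.key; done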


Start the munge service on all nodes. From the frontend node, execute the following commands

/etc/init.d/munge start
cexec "/etc/init.d/munge start"


Test The Munge Service

From the frontend console execute the following commands:

Frontend

munge -n
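
To also verify that the credential can be decoded locally, the output can be piped to unmunge:

munge -n | unmunge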


Nodes

for i in 06 07 08 09 10 11 12 13 14 15 16; do munge -n | ssh guane$i unmunge | grep STATUS; done


Where guane is the base name of the compute nodes.

The output should be something like:

STATUS:           Success (0)
STATUS:           Success (0)
STATUS:           Success (0)
STATUS:           Success (0)
STATUS:           Success (0)
STATUS:           Success (0)
STATUS:           Success (0)
STATUS:           Success (0)
STATUS:           Success (0)
STATUS:           Success (0)
STATUS:           Success (0)

Download the software
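
The SLURM sources can be downloaded into /usr/local/src, for example from SchedMD's download site (the URL below assumes the 16.05.4 release used in this guide is still published there):

cd /usr/local/src
wget https://download.schedmd.com/slurm/slurm-16.05.4.tar.bz2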


Compile the software

tar xvjf slurm-16.05.4.tar.bz2
cd /usr/local/src/slurm-16.05.4
./configure --prefix=/usr/local/slurm
make -j25
make install


Configure SLURM

Create the Database

mysql -u root -p
mysql> GRANT ALL ON slurmDB.* TO 'slurm'@'localhost';
mysql> exit
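
Depending on the MySQL/MariaDB version, the slurm database user may need to be created explicitly before granting privileges; a minimal sketch, with a placeholder password:

mysql> CREATE USER 'slurm'@'localhost' IDENTIFIED BY '<password>';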


Edit the Configuration Files

Create the configuration files in the following directory:

/usr/local/slurm/etc
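
If the directory does not exist yet, create it first:

mkdir -p /usr/local/slurm/etc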

Please visit https://computing.llnl.gov/linux/slurm/slurm.conf.html for details.

File: slurm.conf
        MpiParams=ports=12000-12999
        AuthType=auth/munge
        CacheGroups=0
        MpiDefault=none
        ProctrackType=proctrack/cgroup
        ReturnToService=2
        SlurmctldPidFile=/var/run/slurmctld.pid
        SlurmctldPort=6817
        SlurmdPidFile=/var/run/slurmd.pid
        SlurmdPort=6818
        SlurmdSpoolDir=/var/spool/slurmd
        SlurmUser=root
        SlurmdLogFile=/var/log/slurmd.log
        SlurmdDebug=7
        SlurmctldLogFile=/var/log/slurmctld.log
        SlurmctldDebug=7
        StateSaveLocation=/var/spool
        SwitchType=switch/none
        TaskPlugin=task/cgroup
        InactiveLimit=0
        KillWait=30
        MinJobAge=300
        SlurmctldTimeout=120
        SlurmdTimeout=10
        Waittime=0
        FastSchedule=1
        SchedulerType=sched/backfill
        SchedulerPort=7321
        SelectType=select/cons_res
        SelectTypeParameters=CR_Core,CR_Core_Default_Dist_Block
        AccountingStorageHost=localhost
        AccountingStorageLoc=slurmDB
        AccountingStorageType=accounting_storage/slurmdbd
        AccountingStorageUser=slurm
        AccountingStoreJobComment=YES
        AccountingStorageEnforce=associations,limits
        ClusterName=guane
        JobCompType=jobcomp/none
        JobAcctGatherFrequency=30
        JobAcctGatherType=jobacct_gather/none
        GresTypes=gpu
        NodeName=guane[01-02,04,07,09-16] Procs=24 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=102000 Gres=gpu:8 State=UNKNOWN
        NodeName=guane[03,05,06,08] Procs=16 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=102000 Gres=gpu:8 State=UNKNOWN
        PartitionName=all Nodes=guane[01-16]  MaxTime=INFINITE State=UP Default=YES
        PartitionName=manycores16 Nodes=guane[03,05,06,08]  MaxTime=INFINITE State=UP 
        PartitionName=manycores24 Nodes=guane[01-02,04,07,09-16]  MaxTime=INFINITE State=UP
File: gres.conf
        Name=gpu Type=Tesla File=/dev/nvidia0
        Name=gpu Type=Tesla File=/dev/nvidia1
        Name=gpu Type=Tesla File=/dev/nvidia2
        Name=gpu Type=Tesla File=/dev/nvidia3
        Name=gpu Type=Tesla File=/dev/nvidia4
        Name=gpu Type=Tesla File=/dev/nvidia5
        Name=gpu Type=Tesla File=/dev/nvidia6
        Name=gpu Type=Tesla File=/dev/nvidia7
File: slurmdbd.conf
        AuthType=auth/munge
        DbdAddr=localhost
        DbdHost=localhost
        SlurmUser=slurm
        DebugLevel=4
        LogFile=/var/log/slurm/slurmdbd.log
        PidFile=/var/run/slurmdbd.pid
        StorageType=accounting_storage/mysql
        StorageHost=localhost
        StorageUser=slurm
        StorageLoc=slurmDB
File: cgroup.conf
        CgroupAutomount=yes
        CgroupReleaseAgentDir="/usr/local/slurm/etc/cgroup"
        ConstrainCores=yes
        TaskAffinity=yes
        ConstrainDevices=yes
        AllowedDevicesFile="/usr/local/slurm/etc/allowed_devices.conf"
        ConstrainRAMSpace=no
File: allowed_devices.conf
        /dev/null
        /dev/urandom
        /dev/zero
        /dev/cpu/*/*
        /dev/pts/*
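
Because slurmdbd.conf holds the accounting storage settings (including the database password, if StoragePass is used), SLURM expects it to be readable only by the user running slurmdbd:

chmod 600 /usr/local/slurm/etc/slurmdbd.conf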

Configure the scripts that manage the cgroup resources

mkdir /usr/local/slurm/etc/cgroup

cp /usr/local/src/slurm-16.05.4/etc/cgroup.release_common.example /usr/local/slurm/etc/cgroup/cgroup.release_common
ln -s /usr/local/slurm/etc/cgroup/cgroup.release_common /usr/local/slurm/etc/cgroup/release_devices
ln -s /usr/local/slurm/etc/cgroup/cgroup.release_common /usr/local/slurm/etc/cgroup/release_cpuset

ln -s /usr/local/slurm/etc/cgroup/cgroup.release_common /usr/local/slurm/etc/cgroup/release_freezer
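
The release agent is executed by the kernel, so make sure the copied script is executable:

chmod +x /usr/local/slurm/etc/cgroup/cgroup.release_common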


Initialize the Services

slurmctld and slurmdbd run on the frontend, and slurmd runs on the compute nodes.
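
If /usr/local/slurm is not shared with the compute nodes (for example via NFS), the installation and its configuration must first be copied to them; a simple sketch using the node names from this guide:

for i in 06 07 08 09 10 11 12 13 14 15 16; do scp -r /usr/local/slurm guane$i:/usr/local/; done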

On the frontend execute the following commands:

/usr/local/slurm/sbin/slurmdbd &
/usr/local/slurm/sbin/slurmctld
cexec /usr/local/slurm/sbin/slurmd
scontrol update NodeName=guane[01-16] State=RESUME


Test SLURM Services

Use the following commands to check that everything is OK:

scontrol show node
sinfo


You can also run the following command to test SLURM:

srun -N2 /bin/hostname


Activate the Accounting

Add the cluster to the database

sacctmgr add cluster guane


Add an accounting category

sacctmgr add account general Description="General Accounting" Organization=SC3


Add the users. For example, gilberto, which must be a valid system user (Linux, LDAP, etc.).

sacctmgr add user gilberto DefaultAccount=general
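
The resulting cluster, account, and user associations can be verified with:

sacctmgr show associations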