Slurm Installation on Debian
Slurm Installation
This section describes the administration tasks for the Slurm Workload Manager on the frontend node (server) and on the compute nodes (clients).
Slurm Installation on Debian from repositories
- Make sure the clocks, users and groups (UIDs and GIDs) are synchronized across the cluster.
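One way to check the UID/GID part of this requirement is to compare the passwd databases of two machines. The following is a minimal sketch, not a complete audit: the function names `ids_of` and `same_ids`, the temporary file paths and the host name `node1` are all illustrative assumptions.

```shell
#!/bin/sh
# Sketch: compare the UID:GID of one account in two passwd-format files.

ids_of() {
    # Print "UID:GID" for user $1 as recorded in passwd-format file $2.
    awk -F: -v u="$1" '$1 == u { print $3 ":" $4 }' "$2"
}

same_ids() {
    # Succeed only if user $1 exists in both files with identical UID:GID.
    a=$(ids_of "$1" "$2")
    b=$(ids_of "$1" "$3")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example (hypothetical host name "node1"):
#   getent passwd > /tmp/passwd.server
#   ssh node1 getent passwd > /tmp/passwd.node1
#   same_ids munge /tmp/passwd.server /tmp/passwd.node1 || echo "munge IDs differ"
```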
- Install MUNGE:
    apt-get install -y libmunge-dev libmunge2 munge
- Generate the MUNGE key. There are various ways to do this, depending on the desired level of key quality. Refer to the MUNGE installation guide for complete details.
  Wait around for some random data (recommended for the paranoid):
    dd if=/dev/random bs=1 count=1024 > /etc/munge/munge.key
  Grab some pseudorandom data (recommended for the impatient):
    dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
  Permissions:
    chown munge:munge /etc/munge/munge.key
    chmod 400 /etc/munge/munge.key
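The key generation and permission steps can be combined so that the key file is never readable by other users, even briefly. This is a sketch under assumptions: the function name `generate_munge_key` is hypothetical, it reads 1024 bytes in a single block instead of byte by byte, and the `chown munge:munge` step still has to be run as root afterwards.

```shell
#!/bin/sh
# Sketch: create a 1024-byte key file with restrictive permissions.
# The path defaults to the one used in this guide; pass another path to try it safely.

generate_munge_key() {
    key=${1:-/etc/munge/munge.key}
    (
        umask 077    # the file is private from the moment it is created
        dd if=/dev/urandom of="$key" bs=1024 count=1 2>/dev/null
    ) || return 1
    chmod 400 "$key" || return 1
    # Fail if the key is not exactly 1024 bytes long.
    [ "$(wc -c < "$key")" -eq 1024 ]
}

# Usage (as root):
#   generate_munge_key && chown munge:munge /etc/munge/munge.key
```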
- Edit the file /etc/passwd:
    vi /etc/passwd
  Modify the munge user entry on each machine:
  File: /etc/passwd
    munge:x:501:501::/var/run/munge:/sbin/nologin
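An /etc/passwd entry has seven colon-separated fields, so a wrong separator silently merges the home directory and the shell into one field. A small check after editing can catch this. The function name `check_passwd_entry` is an assumption made for this sketch.

```shell
#!/bin/sh
# Sketch: verify that a user's passwd entry has seven fields and show home and shell.

check_passwd_entry() {
    user=$1
    file=${2:-/etc/passwd}
    awk -F: -v u="$user" '
        $1 == u {
            found = 1
            if (NF != 7) { print "malformed entry: " $0; bad = 1 }
            else         { print "home=" $6 " shell=" $7 }
        }
        END { if (found && !bad) exit 0; exit 1 }
    ' "$file"
}

# Usage:
#   check_passwd_entry munge
```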
- Start MUNGE:
    /etc/init.d/munge start
- Testing MUNGE
  The following steps can be performed to verify that the software has been properly installed and configured.
  Generate a credential on stdout:
    munge -n
  Check if a credential can be locally decoded:
    munge -n | unmunge
  Check if a credential can be remotely decoded:
    munge -n | ssh somehost unmunge
  Run a quick benchmark:
    remunge
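The remote decode test can be repeated for every node in one go. The loop below is only a sketch: the function name `check_munge_hosts` and the host names in the usage line are placeholders, and it assumes password-less ssh access from the server to the nodes.

```shell
#!/bin/sh
# Sketch: verify that a credential created here decodes on every listed host.

check_munge_hosts() {
    failed=0
    for host in "$@"; do
        # The pipeline succeeds only if the remote unmunge accepts the credential.
        if munge -n | ssh "$host" unmunge >/dev/null 2>&1; then
            echo "$host: OK"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return "$failed"
}

# Usage (placeholder host names):
#   check_munge_hosts guane-1 guane-2
```

A failure on a single host usually points at a key or clock mismatch on that host, which is why the function keeps going instead of stopping at the first error.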
- Install SLURM from repositories (Debian 8):
    apt-get install -y slurm-wlm slurm-wlm-doc
  Install SLURM from repositories (Debian 7):
    apt-get install -y slurm-llnl slurm-llnl-doc
- Create and copy slurm.conf
  There are several ways to generate the slurm.conf file. Slurm provides a web-based configuration tool that builds a simple configuration file, which can then be edited by hand for more complex configurations. The tool is located at /usr/share/doc/slurmctld/slurm-wlm-configurator.html
  Open this file in your browser:
    sftp://ip-server/usr/share/doc/slurmctld/slurm-wlm-configurator.html
  NOTE: Executing the command slurmd -C on each compute node prints its physical configuration (sockets, cores, real memory size, etc.), which can be used in constructing the slurm.conf file.
  Copy the output of the web-based configuration tool to /etc/slurm/slurm.conf and edit it so that it looks like the following. This is an example; build a configuration file customized for your environment (see http://slurm.schedmd.com/slurm.conf.html).
  File: /etc/slurm/slurm.conf
    #
    # slurm.conf file generated by configurator.html.
    #
    # See the slurm.conf man page for more information.
    #
    ClusterName=GUANE
    ControlMachine=guane
    #
    SlurmUser=slurm
    SlurmctldPort=6817
    SlurmdPort=6818
    AuthType=auth/munge
    StateSaveLocation=/tmp
    SlurmdSpoolDir=/var/spool/slurm/slurmd
    SwitchType=switch/none
    MpiDefault=none
    SlurmctldPidFile=/var/run/slurmctld.pid
    SlurmdPidFile=/var/run/slurmd.pid
    ProctrackType=proctrack/pgid
    CacheGroups=0
    ReturnToService=1
    #
    # TIMERS
    SlurmctldTimeout=300
    SlurmdTimeout=300
    InactiveLimit=0
    MinJobAge=300
    KillWait=30
    Waittime=0
    #
    # SCHEDULING
    SchedulerType=sched/backfill
    SelectType=select/linear
    FastSchedule=1
    #
    # LOGGING
    SlurmctldDebug=3
    SlurmdDebug=3
    JobCompType=jobcomp/none
    JobCompLoc=/tmp/slurm_job_completion.txt
    #
    # ACCOUNTING
    JobAcctGatherType=jobacct_gather/linux
    JobAcctGatherFrequency=30
    #
    #AccountingStorageType=accounting_storage/slurmdbd
    #AccountingStorageHost=slurm
    #AccountingStorageLoc=/tmp/slurm_job_accounting.txt
    #AccountingStoragePass=
    #AccountingStorageUser=
    #
    # COMPUTE NODES
    # control node
    NodeName=guane NodeAddr=192.168.1.70 Port=17000 State=UNKNOWN
    # each logical node is on the same physical node, so we need different ports for them
    # name guane-[*] is arbitrary
    NodeName=guane-1 NodeAddr=192.168.1.71 Port=17002 State=UNKNOWN
    NodeName=guane-2 NodeAddr=192.168.1.72 Port=17003 State=UNKNOWN
    # PARTITIONS
    # partition name is arbitrary
    PartitionName=guane Nodes=guane-[1-2] Default=YES MaxTime=8-00:00:00 State=UP
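Before starting the daemons it can help to confirm that the keys this example relies on are present and not commented out. The sketch below is only a quick sanity check: the function name `check_slurm_conf` and the list of keys are illustrative choices, not a statement of what Slurm itself requires.

```shell
#!/bin/sh
# Sketch: report keys from this guide's example that are missing or commented out.

check_slurm_conf() {
    conf=${1:-/etc/slurm/slurm.conf}
    missing=0
    for key in ClusterName ControlMachine SlurmUser AuthType NodeName PartitionName; do
        # Match the key at the start of a line, allowing leading blanks only.
        if ! grep -Eq "^[[:space:]]*${key}=" "$conf"; then
            echo "missing: $key"
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_slurm_conf /etc/slurm/slurm.conf && echo "slurm.conf looks complete"
```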
- Install MUNGE on each node of the cluster:
    apt-get install -y libmunge-dev libmunge2 munge
- Copy the munge.key file from the server to each node of the cluster:
    scp /etc/munge/munge.key root@node:/etc/munge/munge.key
  Then, on each node, set the ownership and permissions of the key:
    chown munge:munge /etc/munge/munge.key
    chmod 400 /etc/munge/munge.key
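With more than a couple of nodes, the copy and the permission fix can be scripted together. This is a sketch that assumes root ssh access to every node; the function name `push_munge_key` and the node names in the usage line are placeholders.

```shell
#!/bin/sh
# Sketch: copy the MUNGE key to each node and fix ownership and mode there.

push_munge_key() {
    key=/etc/munge/munge.key
    for node in "$@"; do
        scp -p "$key" "root@${node}:${key}" || return 1
        # chown and chmod must run on the node itself, not on the server.
        ssh "root@${node}" "chown munge:munge $key && chmod 400 $key" || return 1
    done
}

# Usage (placeholder node names):
#   push_munge_key guane-1 guane-2
```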
- Install the SLURM compute node daemon (Debian 8):
    apt-get install -y slurmd
  Install the SLURM compute node daemon (Debian 7):
    apt-get install -y slurm-llnl
- Start SLURM on the nodes and the server (Debian 8):
    /etc/init.d/slurmd start
    /etc/init.d/slurmctld start
  Start SLURM on the nodes and the server (Debian 7):
    /etc/init.d/slurm-llnl start
    /etc/init.d/slurmctld-llnl start
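Once the daemons are running, `sinfo` on the server shows whether the nodes registered. The filter below is a sketch that lists every node whose state is anything other than idle, allocated or mixed. The function name `unhealthy_nodes` and the choice of which states count as healthy are assumptions of this example.

```shell
#!/bin/sh
# Sketch: list nodes whose state is not idle, allocated or mixed.
# Reads "NODE STATE" pairs on stdin, as printed by: sinfo -N -h -o "%N %T"

unhealthy_nodes() {
    awk '$2 !~ /^(idle|allocated|mixed)$/ { print $1 " " $2 }'
}

# Usage:
#   sinfo -N -h -o "%N %T" | unhealthy_nodes
```

States with a trailing `*` (node not responding) are reported as well, because the match is anchored on the full state name.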
Slurm Installation on Debian from source
CONTROLLER CONFIGURATION
http://wildflower.diablonet.net/~scaron/slurmsetup.html
- Prerequisites:
    apt-get install -y build-essential
- Install MUNGE:
    apt-get install -y libmunge-dev libmunge2 munge
- Generate the MUNGE key. There are various ways to do this, depending on the desired level of key quality. Refer to the MUNGE installation guide for complete details.
  Wait around for some random data (recommended for the paranoid):
    dd if=/dev/random bs=1 count=1024 > /etc/munge/munge.key
  Grab some pseudorandom data (recommended for the impatient):
    dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
  Permissions:
    chown munge:munge /etc/munge/munge.key
    chmod 400 /etc/munge/munge.key
- Edit the file /etc/passwd:
    vi /etc/passwd
  Modify the munge user entry on each machine:
  File: /etc/passwd
    munge:x:501:501::/var/run/munge:/sbin/nologin
- Start MUNGE:
    /etc/init.d/munge start
- Testing MUNGE
  The following steps can be performed to verify that the software has been properly installed and configured.
  Generate a credential on stdout:
    munge -n
  Check if a credential can be locally decoded:
    munge -n | unmunge
  Check if a credential can be remotely decoded:
    munge -n | ssh somehost unmunge
  Run a quick benchmark:
    remunge
- Install MySQL server (for SLURM accounting) and development tools (to build SLURM). We'll also install the BLCR tools so that SLURM can take advantage of its checkpoint-and-restart functionality.
    apt-get install mysql-server libmysqlclient-dev libmysqld-dev libmysqld-pic
    apt-get install gcc bison make flex libncurses5-dev tcsh pkg-config
    apt-get install blcr-util blcr-testsuite libcr-dbg libcr-dev libcr0
- Download the latest version (http://www.schedmd.com/#repos):
    wget http://www.schedmd.com/download/latest/slurm-14.11.6.tar.bz2
- Unpack and build SLURM:
    tar xvf slurm-14.11.6.tar.bz2
    cd slurm-14.11.6
    ./configure --enable-multiple-slurmd
    make
    make install
- There are several ways to generate the slurm.conf file. Slurm provides a web-based configuration tool that builds a simple configuration file, which can then be edited by hand for more complex configurations. The tool is located in doc/html/configurator.html in the source tree. It can be opened in a browser once it has been copied to /usr/share/ (sftp://ip-server/usr/share/doc/slurm/configurator.html):
    mkdir /usr/share/doc/slurm/
    cd [slurm-src]/doc/html/
    cp configurator.* /usr/share/doc/slurm/
  Another way is to copy the example configuration files to /etc/slurm:
    mkdir /etc/slurm
    cd [slurm-src]
    cp etc/slurm.conf.example /etc/slurm/slurm.conf
    cp etc/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
  Executing the command slurmd -C on each compute node prints its physical configuration (sockets, cores, real memory size, etc.), which can be used in constructing the slurm.conf file.
- Set things up for slurmdbd (the SLURM accounting daemon) in MySQL:
    mysql -u root -p
    create database slurm_db;
    create user 'slurm'@'localhost';
    set password for 'slurm'@'localhost' = password('MyPassword');
    grant usage on *.* to 'slurm'@'localhost';
    grant all privileges on slurm_db.* to 'slurm'@'localhost';
    flush privileges;
    quit
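The same database setup can be generated as a reviewable SQL script and piped into the client, which avoids typing the statements interactively. This is a sketch, not the guide's original method: the function name `slurm_db_sql` is hypothetical, it uses `CREATE USER ... IDENTIFIED BY` in place of the separate `set password` statement, and it assumes the password contains no single quotes.

```shell
#!/bin/sh
# Sketch: print the SQL for the accounting database so it can be reviewed
# and then piped into the mysql client.

slurm_db_sql() {
    db=$1
    user=$2
    pass=$3
    cat <<EOF
CREATE DATABASE IF NOT EXISTS ${db};
CREATE USER '${user}'@'localhost' IDENTIFIED BY '${pass}';
GRANT ALL PRIVILEGES ON ${db}.* TO '${user}'@'localhost';
FLUSH PRIVILEGES;
EOF
}

# Usage:
#   slurm_db_sql slurm_db slurm 'MyPassword' | mysql -u root -p
```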
- Configure /usr/local/etc/slurmdbd.conf such that it looks like the following:
  File: /usr/local/etc/slurmdbd.conf
    #
    # Example slurmdbd.conf file.
    #
    # See the slurmdbd.conf man page for more information.
    #
    # Archive info
    #ArchiveJobs=yes
    #ArchiveDir="/tmp"
    #ArchiveSteps=yes
    #ArchiveScript=
    #JobPurge=12
    #StepPurge=1
    #
    # Authentication info
    AuthType=auth/munge
    AuthInfo=/var/run/munge/munge.socket.2
    #
    # slurmDBD info
    DbdAddr=localhost
    DbdHost=localhost
    #DbdPort=7031
    SlurmUser=slurm
    #MessageTimeout=300
    DebugLevel=4
    #DefaultQOS=normal,standby
    LogFile=/var/log/slurm/slurmdbd.log
    PidFile=/var/run/slurmdbd.pid
    #PluginDir=/usr/lib/slurm
    #PrivateData=accounts,users,usage,jobs
    #TrackWCKey=yes
    #
    # Database info
    StorageType=accounting_storage/mysql
    #StorageHost=localhost
    #StoragePort=1234
    StoragePass=MyPassword
    StorageUser=slurm
    StorageLoc=slurm_db
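The StorageUser, StoragePass and StorageLoc values have to match the MySQL user, password and database created earlier. A small helper can read them back for comparison. The function name `conf_value` is an assumption of this sketch. Because the file holds a database password, it is also worth restricting it to the SlurmUser, as the usage lines suggest.

```shell
#!/bin/sh
# Sketch: print the value of an uncommented KEY=value line from a Slurm config file.

conf_value() {
    # $1: key name, $2: config file. Trailing "# ..." comments are stripped.
    sed -n "s/^[[:space:]]*$1=//p" "$2" | sed 's/[[:space:]]*#.*$//' | head -n 1
}

# Usage:
#   conf_value StorageLoc /usr/local/etc/slurmdbd.conf
#   chown slurm:slurm /usr/local/etc/slurmdbd.conf
#   chmod 600 /usr/local/etc/slurmdbd.conf
```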
- Configure /usr/local/etc/slurm.conf such that it looks like the following:
  File: /usr/local/etc/slurm.conf
    #
    # slurm.conf file generated by configurator.html.
    #
    # See the slurm.conf man page for more information.
    #
    ClusterName=GUANE
    ControlMachine=guane
    #ControlAddr=
    #BackupController=
    #BackupAddr=
    #
    SlurmUser=slurm
    #SlurmdUser=root
    SlurmctldPort=6817
    SlurmdPort=6818
    AuthType=auth/munge
    #JobCredentialPrivateKey=
    #JobCredentialPublicCertificate=
    StateSaveLocation=/tmp
    SlurmdSpoolDir=/var/spool/slurm/slurmd
    SwitchType=switch/none
    MpiDefault=none
    SlurmctldPidFile=/var/run/slurmctld.pid
    SlurmdPidFile=/var/run/slurmd.pid
    ProctrackType=proctrack/pgid
    #PluginDir=
    CacheGroups=0
    #FirstJobId=
    ReturnToService=1
    #MaxJobCount=
    #PlugStackConfig=
    #PropagatePrioProcess=
    #PropagateResourceLimits=
    #PropagateResourceLimitsExcept=
    #Prolog=
    #Epilog=
    #SrunProlog=
    #SrunEpilog=
    #TaskProlog=
    #TaskEpilog=
    #TaskPlugin=
    #TrackWCKey=no
    #TreeWidth=50
    #TmpFS=
    #UsePAM=
    #
    # TIMERS
    SlurmctldTimeout=300
    SlurmdTimeout=300
    InactiveLimit=0
    MinJobAge=300
    KillWait=30
    Waittime=0
    #
    # SCHEDULING
    SchedulerType=sched/backfill
    #SchedulerAuth=
    #SchedulerPort=
    #SchedulerRootFilter=
    SelectType=select/linear
    FastSchedule=1
    #PriorityType=priority/multifactor
    #PriorityDecayHalfLife=14-0
    #PriorityUsageResetPeriod=14-0
    #PriorityWeightFairshare=100000
    #PriorityWeightAge=1000
    #PriorityWeightPartition=10000
    #PriorityWeightJobSize=1000
    #PriorityMaxAge=1-0
    #
    # LOGGING
    SlurmctldDebug=3
    #SlurmctldLogFile=
    SlurmdDebug=3
    #SlurmdLogFile=
    JobCompType=jobcomp/none
    JobCompLoc=/tmp/slurm_job_completion.txt
    #
    # ACCOUNTING
    JobAcctGatherType=jobacct_gather/linux
    JobAcctGatherFrequency=30
    #
    #AccountingStorageEnforce=limits,qos
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=slurm
    AccountingStorageLoc=/tmp/slurm_job_accounting.txt
    #AccountingStoragePass=
    #AccountingStorageUser=
    #
    # COMPUTE NODES
    # control node
    NodeName=guane NodeAddr=192.168.1.75 Port=17000 State=UNKNOWN
    # each logical node is on the same physical node, so we need different ports for them
    # name guane-[*] is arbitrary
    NodeName=guane-1 NodeAddr=192.168.1.71 Port=17002 State=UNKNOWN
    NodeName=guane-2 NodeAddr=192.168.1.72 Port=17003 State=UNKNOWN
    # PARTITIONS
    # partition name is arbitrary
    PartitionName=guane Nodes=guane-1 Default=YES MaxTime=8-00:00:00 State=UP
- Create the init scripts in /etc/init.d:
    touch /etc/init.d/slurmctl
    touch /etc/init.d/slurmdbd
    chmod +x /etc/init.d/slurmctl
    chmod +x /etc/init.d/slurmdbd
  File: /etc/init.d/slurmctl
    #!/bin/sh
    #
    # chkconfig: 345 90 10
    # description: SLURM is a simple resource management system which \
    #              manages exclusive access to a set of compute \
    #              resources and distributes work to those resources.
    #
    # processname: /usr/local/sbin/slurmctld
    # pidfile: /var/run/slurm/slurmctld.pid
    #
    # config: /etc/default/slurmctld
    #
    ### BEGIN INIT INFO
    # Provides:          slurmctld
    # Required-Start:    $remote_fs $syslog $network munge
    # Required-Stop:     $remote_fs $syslog $network munge
    # Should-Start:      $named
    # Should-Stop:       $named
    # Default-Start:     2 3 4 5
    # Default-Stop:      0 1 6
    # Short-Description: slurm daemon management
    # Description:       Start slurm to provide resource management
    ### END INIT INFO

    BINDIR="/usr/local/bin"
    CONFDIR="/usr/local/etc"
    LIBDIR="/usr/local/lib"
    SBINDIR="/usr/local/sbin"

    # Source slurm specific configuration
    if [ -f /etc/default/slurmctld ] ; then
        . /etc/default/slurmctld
    else
        SLURMCTLD_OPTIONS=""
    fi

    # Checking for slurm.conf presence
    if [ ! -f $CONFDIR/slurm.conf ] ; then
        if [ -n "$(echo $1 | grep start)" ] ; then
            echo Not starting slurmctld
        fi
        echo slurm.conf was not found in $CONFDIR
        exit 0
    fi

    DAEMONLIST="slurmctld"
    test -f $SBINDIR/slurmctld || exit 0

    # Checking for lsb init function
    if [ -f /lib/lsb/init-functions ] ; then
        . /lib/lsb/init-functions
    else
        echo Can\'t find lsb init functions
        exit 1
    fi

    # setup library paths for slurm and munge support
    export LD_LIBRARY_PATH=$LIBDIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}

    # Function to check for cert and key presence and key vulnerability
    checkcertkey()
    {
        MISSING=""
        keyfile=""
        certfile=""

        if [ "$1" = "slurmctld" ] ; then
            keyfile=$(grep JobCredentialPrivateKey $CONFDIR/slurm.conf | grep -v "^ *#")
            keyfile=${keyfile##*=}
            keyfile=${keyfile%#*}
            [ -e $keyfile ] || MISSING="$keyfile"
        fi

        if [ "${MISSING}" != "" ] ; then
            echo Not starting slurmctld
            echo $MISSING not found
            exit 0
        fi

        if [ -f "$keyfile" ] && [ "$1" = "slurmctld" ] ; then
            keycheck=$(openssl-vulnkey $keyfile | cut -d : -f 1)
            if [ "$keycheck" = "COMPROMISED" ] ; then
                echo Your slurm key stored in the file $keyfile
                echo is vulnerable because has been created with a buggy openssl.
                echo Please rebuild it with openssl version \>= 0.9.8g-9
                exit 0
            fi
        fi
    }

    get_daemon_description()
    {
        case $1 in
            slurmd)
                echo slurm compute node daemon
                ;;
            slurmctld)
                echo slurm central management daemon
                ;;
            *)
                echo slurm daemon
                ;;
        esac
    }

    start() {
        CRYPTOTYPE=$(grep CryptoType $CONFDIR/slurm.conf | grep -v "^ *#")
        CRYPTOTYPE=${CRYPTOTYPE##*=}
        CRYPTOTYPE=${CRYPTOTYPE%#*}
        if [ "$CRYPTOTYPE" = "crypto/openssl" ] ; then
            checkcertkey $1
        fi

        # Create run-time variable data
        mkdir -p /var/run/slurm
        chown slurm:slurm /var/run/slurm

        # Checking if StateSaveLocation is under run
        if [ "$1" = "slurmctld" ] ; then
            SDIRLOCATION=$(grep StateSaveLocation /usr/local/etc/slurm.conf \
                | grep -v "^ *#")
            SDIRLOCATION=${SDIRLOCATION##*=}
            SDIRLOCATION=${SDIRLOCATION%#*}
            if [ "${SDIRLOCATION}" = "/var/run/slurm/slurmctld" ] ; then
                if ! [ -e /var/run/slurm/slurmctld ] ; then
                    ln -s /var/lib/slurm/slurmctld /var/run/slurm/slurmctld
                fi
            fi
        fi

        desc="$(get_daemon_description $1)"
        log_daemon_msg "Starting $desc" "$1"
        unset HOME MAIL USER USERNAME
        #FIXME $STARTPROC $SBINDIR/$1 $2
        STARTERRORMSG="$(start-stop-daemon --start --oknodo \
            --exec "$SBINDIR/$1" -- $2 2>&1)"
        STATUS=$?
        log_end_msg $STATUS
        if [ "$STARTERRORMSG" != "" ] ; then
            echo $STARTERRORMSG
        fi
        touch /var/lock/slurm
    }

    stop() {
        desc="$(get_daemon_description $1)"
        log_daemon_msg "Stopping $desc" "$1"
        STOPERRORMSG="$(start-stop-daemon --oknodo --stop -s TERM \
            --exec "$SBINDIR/$1" 2>&1)"
        STATUS=$?
        log_end_msg $STATUS
        if [ "$STOPERRORMSG" != "" ] ; then
            echo $STOPERRORMSG
        fi
        rm -f /var/lock/slurm
    }

    getpidfile() {
        dpidfile=`grep -i ${1}pid $CONFDIR/slurm.conf | grep -v '^ *#'`
        if [ $? = 0 ]; then
            dpidfile=${dpidfile##*=}
            dpidfile=${dpidfile%#*}
        else
            dpidfile=/var/run/${1}.pid
        fi
        echo $dpidfile
    }

    #
    # status() with slight modifications to take into account
    # instantiations of job manager slurmd's, which should not be
    # counted as "running"
    #
    slurmstatus() {
        base=${1##*/}
        pidfile=$(getpidfile $base)
        pid=`pidof -o $$ -o $$PPID -o %PPID -x $1 || \
             pidof -o $$ -o $$PPID -o %PPID -x ${base}`

        if [ -f $pidfile ]; then
            read rpid < $pidfile
            if [ "$rpid" != "" -a "$pid" != "" ]; then
                for i in $pid ; do
                    if [ "$i" = "$rpid" ]; then
                        echo "${base} (pid $pid) is running..."
                        return 0
                    fi
                done
            elif [ "$rpid" != "" -a "$pid" = "" ]; then
                # Due to change in user id, pid file may persist
                # after slurmctld terminates
                if [ "$base" != "slurmctld" ] ; then
                    echo "${base} dead but pid file exists"
                fi
                return 1
            fi
        fi

        if [ "$base" = "slurmctld" -a "$pid" != "" ] ; then
            echo "${base} (pid $pid) is running..."
            return 0
        fi

        echo "${base} is stopped"
        return 3
    }

    #
    # stop slurm daemons,
    # wait for termination to complete (up to 10 seconds) before returning
    #
    slurmstop() {
        for prog in $DAEMONLIST ; do
            stop $prog
            for i in 1 2 3 4
            do
                sleep $i
                slurmstatus $prog
                if [ $? != 0 ]; then
                    break
                fi
            done
        done
    }

    #
    # The pathname substitution in daemon command assumes prefix and
    # exec_prefix are same. This is the default, unless the user requests
    # otherwise.
    #
    # Any node can be a slurm controller and/or server.
    #
    case "$1" in
        start)
            start slurmctld "$SLURMCTLD_OPTIONS"
            ;;
        startclean)
            SLURMCTLD_OPTIONS="-c $SLURMCTLD_OPTIONS"
            start slurmctld "$SLURMCTLD_OPTIONS"
            ;;
        stop)
            slurmstop
            ;;
        status)
            for prog in $DAEMONLIST ; do
                slurmstatus $prog
            done
            ;;
        restart)
            $0 stop
            $0 start
            ;;
        force-reload)
            $0 stop
            $0 start
            ;;
        condrestart)
            if [ -f /var/lock/subsys/slurm ]; then
                for prog in $DAEMONLIST ; do
                    stop $prog
                    start $prog
                done
            fi
            ;;
        reconfig)
            for prog in $DAEMONLIST ; do
                PIDFILE=$(getpidfile $prog)
                start-stop-daemon --stop --signal HUP --pidfile \
                    "$PIDFILE" --quiet $prog
            done
            ;;
        test)
            for prog in $DAEMONLIST ; do
                echo "$prog runs here"
            done
            ;;
        *)
            echo "Usage: $0 {start|startclean|stop|status|restart|reconfig|condrestart|test}"
            exit 1
            ;;
    esac
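A long init script pasted by hand is easy to damage, for example by losing a closing quote. A quick check before registering it can catch this. The sketch below assumes a function name of our own choosing, `check_init_script`. It only checks the executable bit and parses the script with `sh -n`, without running it.

```shell
#!/bin/sh
# Sketch: verify that an init script is executable and parses cleanly.

check_init_script() {
    script=$1
    [ -x "$script" ] || { echo "$script: not executable"; return 1; }
    # sh -n reads the script and reports syntax errors without executing it.
    sh -n "$script" 2>/dev/null || { echo "$script: syntax error"; return 1; }
    echo "$script: OK"
}

# Usage (update-rc.d registers the script for the default runlevels on Debian):
#   check_init_script /etc/init.d/slurmctl && update-rc.d slurmctl defaults
```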
- Add the SLURM user:
    echo "slurm:x:2000:2000:slurm admin:/home/slurm:/bin/bash" >> /etc/passwd
    echo "slurm:x:2000:slurm" >> /etc/group
    pwconv
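An alternative to appending lines to /etc/passwd and /etc/group by hand is to let `groupadd` and `useradd` create the account, which also makes the step safe to repeat. This is a sketch of that alternative, not the guide's original method. The function name `ensure_slurm_user` is hypothetical, and the UID/GID default of 2000 mirrors the values used in this guide.

```shell
#!/bin/sh
# Sketch: create the slurm group and user only if they do not exist yet.

ensure_slurm_user() {
    id_num=${1:-2000}
    if getent passwd slurm >/dev/null; then
        echo "user slurm already exists"
        return 0
    fi
    # Create the group first so that useradd can reference it by name.
    getent group slurm >/dev/null || groupadd -g "$id_num" slurm || return 1
    useradd -u "$id_num" -g slurm -c "slurm admin" -d /home/slurm -s /bin/bash slurm
}

# Usage (as root, on every machine so that the IDs stay identical):
#   ensure_slurm_user 2000
```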
- Create the SLURM spool and log directories and set permissions accordingly:
    mkdir /var/spool/slurm
    chown -R slurm:slurm /var/spool/slurm
    mkdir /var/log/slurm
    chown -R slurm:slurm /var/log/slurm
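The directory setup can be written so that it does not fail when a directory already exists. The following is a sketch with a hypothetical function name, `make_slurm_dirs`. It uses `mkdir -p` and skips the `chown` when it is not run as root, which is an assumption made so the function can be tried on a scratch path first.

```shell
#!/bin/sh
# Sketch: create directories and hand them to the given owner, idempotently.

make_slurm_dirs() {
    owner=$1
    shift
    for dir in "$@"; do
        mkdir -p "$dir" || return 1
        # chown needs root; when run unprivileged, leave the owner unchanged.
        if [ "$(id -u)" -eq 0 ]; then
            chown -R "$owner" "$dir" || return 1
        else
            echo "not root: leaving owner of $dir unchanged" >&2
        fi
    done
}

# Usage (as root):
#   make_slurm_dirs slurm:slurm /var/spool/slurm /var/log/slurm
```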
COMPUTE NODE CONFIGURATION
- Install MUNGE on each node of the cluster:
    apt-get install -y libmunge-dev libmunge2 munge
- Copy the munge.key file from the server to each node of the cluster:
    scp /etc/munge/munge.key root@node:/etc/munge/munge.key
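After copying, it is worth confirming that the node really holds the same key as the server, since MUNGE only works when the keys are byte-for-byte identical. The helpers below are a sketch. The names `key_digest` and `keys_match` are assumptions, and `node` in the usage lines is the same placeholder host name as in the scp command.

```shell
#!/bin/sh
# Sketch: compare the SHA-256 digest of the local key with one reported by a node.

key_digest() {
    # Print only the hex digest of the file named in $1.
    [ -r "$1" ] || return 1
    sha256sum < "$1" | awk '{ print $1 }'
}

keys_match() {
    # $1: local key file, $2: digest reported by the remote node.
    [ -n "$2" ] && [ "$(key_digest "$1")" = "$2" ]
}

# Usage (as root; "node" is a placeholder host name):
#   remote=$(ssh root@node "sha256sum < /etc/munge/munge.key" | awk '{ print $1 }')
#   keys_match /etc/munge/munge.key "$remote" && echo "key matches on node"
```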