Slurm Installation on Debian

From Supercomputación y Cálculo Científico UIS
 
    <div class="panel panel-darker-white-border"> 

         <div class="panel-heading">
             <h3 class="panel-title">Slurm Installation on Debian from repositories</h3>

         </div>

         <div class="panel-body">

            <ol>
                <li>Make sure the clocks, users and groups (UIDs and GIDs) are synchronized across the cluster.</li>
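A quick way to spot UID/GID drift is to compare the relevant /etc/passwd entries collected from each machine. The sketch below is illustrative: the two inline snapshots stand in for files you would gather with something like ssh node1 'getent passwd munge slurm' > node1.passwd, and the deliberately mismatched munge UID shows what drift looks like.

```shell
# Sketch: detect UID/GID drift between two nodes' passwd snapshots.
# The snapshots are inline samples; collect real ones per node, e.g.:
#   ssh node1 'getent passwd munge slurm' > node1.passwd
cat > node1.passwd <<'EOF'
munge:x:501:501::/var/run/munge:/sbin/nologin
slurm:x:2000:2000:slurm admin:/home/slurm:/bin/bash
EOF
cat > node2.passwd <<'EOF'
munge:x:502:502::/var/run/munge:/sbin/nologin
slurm:x:2000:2000:slurm admin:/home/slurm:/bin/bash
EOF
# Compare only name:uid:gid; any diff lines mean the nodes disagree
cut -d: -f1,3,4 node1.passwd | sort > node1.ids
cut -d: -f1,3,4 node2.passwd | sort > node2.ids
drift_count=$(diff node1.ids node2.ids | grep -c '^[<>]' || true)
echo "UID/GID mismatches: $drift_count"
rm -f node1.passwd node2.passwd node1.ids node2.ids
```

A nonzero count means the user must be recreated with matching numeric IDs before MUNGE and SLURM will interoperate across those nodes.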
                <li>
                    <p>Install MUNGE</p>
                    <p>{{Command|<nowiki>apt-get install -y libmunge-dev libmunge2 munge</nowiki>}}</p>
                </li>
                <li>
                    <p>Generate the MUNGE key. There are various ways to do this, depending on the desired level of key quality. Refer to the MUNGE installation guide for complete details.</p>
                    <p>Wait around for some random data (recommended for the paranoid):</p>
                    <p>{{Command|<nowiki>dd if=/dev/random bs=1 count=1024 > /etc/munge/munge.key</nowiki>}}</p>
                    <p>Grab some pseudorandom data (recommended for the impatient):</p>
                    <p>{{Command|<nowiki>dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key</nowiki>}}</p>
                    <p>Set ownership and permissions:</p>
                    <p>
                        {{Command|<nowiki>chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
</nowiki>}}</p>
                </li>
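The key creation and lock-down can be rehearsed safely in a scratch directory before touching /etc/munge. This sketch uses an illustrative temporary path (the real key belongs in /etc/munge/munge.key, owned by munge:munge) and verifies the size and mode munged expects:

```shell
# Sketch: rehearse key creation in a scratch dir, then verify
# the 1024-byte size and 0400 mode that munged requires.
tmp=$(mktemp -d)
dd if=/dev/urandom bs=1 count=1024 of="$tmp/munge.key" 2>/dev/null
chown "$(id -un):$(id -gn)" "$tmp/munge.key"   # stands in for chown munge:munge
chmod 400 "$tmp/munge.key"
mode_size=$(stat -c '%a %s' "$tmp/munge.key")
echo "$mode_size"
rm -rf "$tmp"
```

If the mode is anything looser than 400, munged will refuse to start.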
                <li>
                    <p>Edit the file /etc/passwd</p>
                    <p>{{Command|<nowiki>vi /etc/passwd</nowiki>}}</p>
                    <p>Modify the munge user entry on each machine</p>
                    <p>{{File|/etc/passwd|<pre><nowiki>munge:x:501:501::/var/run/munge:/sbin/nologin</nowiki></pre>}}</p>
                </li>
                <li>
                    <p>Start MUNGE</p>
                    <p>{{Command|<nowiki>/etc/init.d/munge start</nowiki>}}</p>
                </li>
                <li>
                    <p>Testing MUNGE</p>
                    <p>The following steps can be performed to verify that the software has been properly installed and configured:</p>
                    <p>Generate a credential on stdout:</p>
                    <p>{{Command|<nowiki>munge -n</nowiki>}}</p>
                    <p>Check if a credential can be locally decoded:</p>
                    <p>{{Command|<nowiki>munge -n | unmunge</nowiki>}}</p>
                    <p>Check if a credential can be remotely decoded:</p>
                    <p>{{Command|<nowiki>munge -n | ssh somehost unmunge</nowiki>}}</p>
                    <p>Run a quick benchmark:</p>
                    <p>{{Command|<nowiki>remunge</nowiki>}}</p>
                </li>
                <li>
                    <p>Install SLURM from the repositories - Debian 8</p>
                    <p>{{Command|<nowiki>apt-get install -y slurm-wlm slurm-wlm-doc</nowiki>}}</p>
                    <p>Install SLURM from the repositories - Debian 7</p>
                    <p>{{Command|<nowiki>apt-get install -y slurm-llnl slurm-llnl-doc</nowiki>}}</p>
                </li>
                <li>
                    <p>Create and copy slurm.conf</p>
                    <p>There are several ways to generate the slurm.conf file. SLURM includes a web-based configuration tool which can be used to build a simple configuration file, which can then be manually edited for more complex configurations. The tool is located in <b>/usr/share/doc/slurmctld/slurm-wlm-configurator.html</b></p>
                    <p>Open this file in your browser</p>
                    <p>sftp://ip-server/usr/share/doc/slurmctld/slurm-wlm-configurator.html</p>
                    <p>{{Note|Executing the command slurmd -C on each compute node will print its physical configuration (sockets, cores, real memory size, etc.), which can be used in constructing the slurm.conf file.}}</p>
                    <p>Copy the result from the web-based configuration tool into <b>/etc/slurm/slurm.conf</b> and configure it such that it looks like the following (this is an example - build a configuration file customized for your environment) - http://slurm.schedmd.com/slurm.conf.html</p>
                    <p>{{File|/etc/slurm/slurm.conf|<pre><nowiki>
#
# slurm.conf file generated by configurator.html.

PartitionName=guane Nodes=guane-[1-2] Default=YES MaxTime=8-00:00:00 State=UP
</nowiki></pre>}}</p>
                </li>
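The output of slurmd -C can be pasted into slurm.conf almost verbatim; only the UpTime line it prints does not belong in the config, and the NodeAddr/State fields have to be added by hand. The sketch below works on an illustrative captured sample (field names and the 192.168.1.71 address are examples, not output from your nodes):

```shell
# Sketch: turn captured `slurmd -C` output into a slurm.conf node entry.
# The sample text is illustrative; run `slurmd -C` on your own nodes.
sample='NodeName=guane-1 CPUs=24 RealMemory=96000 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2
UpTime=12-04:27:41'
# Keep only the hardware line, drop UpTime, append address and state
node_line="$(printf '%s\n' "$sample" | grep '^NodeName=') NodeAddr=192.168.1.71 State=UNKNOWN"
echo "$node_line"
```

The resulting line can be appended to the COMPUTE NODES section of slurm.conf.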
 
                <li>
                    <p>Install MUNGE on each node of the cluster</p>
                    <p>{{Command|<nowiki>apt-get install -y libmunge-dev libmunge2 munge</nowiki>}}</p>
                </li>
                <li>
                    <p>Copy the munge.key file from the server to each node of the cluster</p>
                    <p>{{Command|<nowiki>scp /etc/munge/munge.key root@node:/etc/munge/munge.key</nowiki>}}</p>
                    <p>
                        {{Command|<nowiki>chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
</nowiki>}}</p>
                </li>
                <li>
                    <p>Install the SLURM compute node daemon - Debian 8</p>
                    <p>{{Command|<nowiki>apt-get install -y slurmd</nowiki>}}</p>
                    <p>Install the SLURM compute node daemon - Debian 7</p>
                    <p>{{Command|<nowiki>apt-get install -y slurm-llnl</nowiki>}}</p>
                </li>
                <li>
                    <p>Start SLURM on the nodes (slurmd) and on the server (slurmctld) - Debian 8</p>
                    <p>{{Command|<nowiki>/etc/init.d/slurmd start</nowiki>}}</p>
                    <p>{{Command|<nowiki>/etc/init.d/slurmctld start</nowiki>}}</p>
                    <p>Start SLURM on the nodes and on the server - Debian 7</p>
                    <p>{{Command|<nowiki>/etc/init.d/slurm-llnl start</nowiki>}}</p>
                    <p>{{Command|<nowiki>/etc/init.d/slurmctld-llnl start</nowiki>}}</p>
                </li>
            </ol>
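Once slurmctld and the slurmd daemons are up, sinfo should show the partition with its nodes in a usable state. This sketch counts usable nodes from captured output; the two-line sample is illustrative of what sinfo -N -h -o '%N %t' prints for the guane partition above:

```shell
# Sketch: count usable nodes from `sinfo` output.
# Sample text is illustrative; on the cluster run: sinfo -N -h -o '%N %t'
sample='guane-1 idle
guane-2 idle'
up_nodes=$(printf '%s\n' "$sample" | awk '$2=="idle" || $2=="alloc" || $2=="mix"' | wc -l)
echo "usable nodes: $up_nodes"
```

Nodes reported as down or drain usually mean slurmd is not running there or the node's definition in slurm.conf does not match its hardware.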
 
         </div>

     </div>

</div>
 
<div class="col-md-14">
     <div class="panel panel-darker-white">

         <div class="panel-heading">
             <h3 class="panel-title">Slurm Installation on Debian from source</h3>

         </div>

         <div class="panel-body">

            <div class="col-md-12">
                <div class="panel panel-midnight-border">
                    <div class="panel-heading">
                        <h3 class="panel-title">CONTROLLER CONFIGURATION</h3>
                    </div>
                    <div class="panel-body">
 
http://wildflower.diablonet.net/~scaron/slurmsetup.html

                        <ol>
                            <li>
                                <p>Prerequisites</p>
                                <p>{{Command|<nowiki>apt-get install -y build-essential</nowiki>}}</p>
                            </li>
                            <li>
                                <p>Install MUNGE</p>
                                <p>{{Command|<nowiki>apt-get install -y libmunge-dev libmunge2 munge</nowiki>}}</p>
                            </li>
                            <li>
                                <p>Generate the MUNGE key. There are various ways to do this, depending on the desired level of key quality. Refer to the MUNGE installation guide for complete details.</p>
                                <p>Wait around for some random data (recommended for the paranoid):</p>
                                <p>{{Command|<nowiki>dd if=/dev/random bs=1 count=1024 > /etc/munge/munge.key</nowiki>}}</p>
                                <p>Grab some pseudorandom data (recommended for the impatient):</p>
                                <p>{{Command|<nowiki>dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key</nowiki>}}</p>
                                <p>Set ownership and permissions:</p>
                                <p>
                                    {{Command|<nowiki>chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
</nowiki>}}</p>
                            </li>
                            <li>
                                <p>Edit the file /etc/passwd</p>
                                <p>{{Command|<nowiki>vi /etc/passwd</nowiki>}}</p>
                                <p>Modify the munge user entry on each machine</p>
                                <p>{{File|/etc/passwd|<pre><nowiki>munge:x:501:501::/var/run/munge:/sbin/nologin</nowiki></pre>}}</p>
                            </li>
                            <li>
                                <p>Start MUNGE</p>
                                <p>{{Command|<nowiki>/etc/init.d/munge start</nowiki>}}</p>
                            </li>
                            <li>
                                <p>Testing MUNGE</p>
                                <p>The following steps can be performed to verify that the software has been properly installed and configured:</p>
                                <p>Generate a credential on stdout:</p>
                                <p>{{Command|<nowiki>munge -n</nowiki>}}</p>
                                <p>Check if a credential can be locally decoded:</p>
                                <p>{{Command|<nowiki>munge -n | unmunge</nowiki>}}</p>
                                <p>Check if a credential can be remotely decoded:</p>
                                <p>{{Command|<nowiki>munge -n | ssh somehost unmunge</nowiki>}}</p>
                                <p>Run a quick benchmark:</p>
                                <p>{{Command|<nowiki>remunge</nowiki>}}</p>
                            </li>
                            <li>
                                <p>Install the MySQL server (for SLURM accounting) and development tools (to build SLURM). We'll also install the BLCR tools so that SLURM can take advantage of checkpoint-and-restart functionality.</p>
                                <p>
                                    {{Command|<nowiki>apt-get install mysql-server libmysqlclient-dev libmysqld-dev libmysqld-pic
apt-get install gcc bison make flex libncurses5-dev tcsh pkg-config
apt-get install blcr-util blcr-testsuite libcr-dbg libcr-dev libcr0</nowiki>}}
                                </p>
                            </li>
                            <li>
                                <p>Download the latest version. (http://www.schedmd.com/#repos)</p>
                                <p>{{Command|<nowiki>wget http://www.schedmd.com/download/latest/slurm-14.11.6.tar.bz2</nowiki>}}</p>
                            </li>
                            <li>
                                <p>Unpack and build SLURM</p>
                                <p>{{Command|<nowiki>tar xvf slurm-14.11.6.tar.bz2
cd slurm-14.11.6
./configure --enable-multiple-slurmd
make
make install</nowiki>}}
                                </p>
                            </li>
                            <li>
                                <p>There are several ways to generate the slurm.conf file. SLURM includes a web-based configuration tool which can be used to build a simple configuration file, which can then be manually edited for more complex configurations. The tool is located in doc/html/configurator.html; this file can be opened in the browser if it is copied to /usr/share/ (sftp://ip-server/usr/share/doc/slurm/configurator.html)</p>
                                <p>
                                    {{Command|<nowiki>mkdir /usr/share/doc/slurm/
cd [slurm-src]/doc/html/
cp configurator.* /usr/share/doc/slurm/</nowiki>}}
                                </p>
                                <p>Another way is to copy the example configuration files out to /etc/slurm.</p>
                                <p>
                                    {{Command|<nowiki>mkdir /etc/slurm
cd [slurm-src]
cp etc/slurm.conf.example /etc/slurm/slurm.conf
cp etc/slurmdbd.conf.example /etc/slurm/slurmdbd.conf</nowiki>}}
                                </p>
                                <p>Executing the command slurmd -C on each compute node will print its physical configuration (sockets, cores, real memory size, etc.), which can be used in constructing the slurm.conf file.</p>
                            </li>
                            <li>
                                <p>Set things up for slurmdbd (the SLURM accounting daemon) in MySQL</p>
                                <p>
                                    {{Command|<nowiki>mysql -u root -p
create database slurm_db;
create user 'slurm'@'localhost';
set password for 'slurm'@'localhost' = password('MyPassword');
grant usage on *.* to 'slurm'@'localhost';
grant all privileges on slurm_db.* to 'slurm'@'localhost';
flush privileges;
quit</nowiki>}}
                                </p>
                            </li>
                            <li>
                                <p>Configure /usr/local/etc/slurmdbd.conf such that it looks like the following:</p>
                                <p>{{File|/usr/local/etc/slurmdbd.conf|<pre><nowiki>
#
# Example slurmdbd.conf file.
#
# See the slurmdbd.conf man page for more information.
#
# Archive info
#ArchiveJobs=yes
#ArchiveDir="/tmp"
#ArchiveSteps=yes
#ArchiveScript=
#JobPurge=12
#StepPurge=1
#
# Authentication info
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
#
# slurmDBD info
DbdAddr=localhost
DbdHost=localhost
#DbdPort=7031
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=4
#DefaultQOS=normal,standby
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
#PluginDir=/usr/lib/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes
#
# Database info
StorageType=accounting_storage/mysql
#StorageHost=localhost
#StoragePort=1234
StoragePass=MyPassword
StorageUser=slurm
StorageLoc=slurm_db
</nowiki></pre>}}</p>
                            </li>
                            <li>
                                <p>Configure /usr/local/etc/slurm.conf such that it looks like the following:</p>
                                <p>{{File|/usr/local/etc/slurm.conf|<pre><nowiki>
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=GUANE
ControlMachine=guane
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/tmp
SlurmdSpoolDir=/var/spool/slurm/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
CacheGroups=0
#FirstJobId=
ReturnToService=1
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/linear
FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
JobCompType=jobcomp/none
JobCompLoc=/tmp/slurm_job_completion.txt
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
#
#AccountingStorageEnforce=limits,qos
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurm
AccountingStorageLoc=/tmp/slurm_job_accounting.txt
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
# control node
NodeName=guane NodeAddr=192.168.1.75 Port=17000 State=UNKNOWN
# each logical node is on the same physical node, so we need different ports for them
# name guane-[*] is arbitrary
NodeName=guane-1 NodeAddr=192.168.1.71 Port=17002 State=UNKNOWN
NodeName=guane-2 NodeAddr=192.168.1.72 Port=17003 State=UNKNOWN
# PARTITIONS
# partition name is arbitrary
PartitionName=guane Nodes=guane-1 Default=YES MaxTime=8-00:00:00 State=UP
</nowiki></pre>}}</p>
                            </li>
                            <li>
                                <p>Create init scripts in /etc/init.d</p>
                                <p>{{Command|<nowiki>touch /etc/init.d/slurmctl
touch /etc/init.d/slurmdbd
chmod +x /etc/init.d/slurmctl
chmod +x /etc/init.d/slurmdbd
</nowiki>}}</p>
                                <p>{{File|/etc/init.d/slurmctl|<pre><nowiki>
#!/bin/sh
#
# chkconfig: 345 90 10
# description: SLURM is a simple resource management system which \
#              manages exclusive access to a set of compute \
#              resources and distributes work to those resources.
#
# processname: /usr/local/sbin/slurmctld
# pidfile: /var/run/slurm/slurmctld.pid
#
# config: /etc/default/slurmctld
#
### BEGIN INIT INFO
# Provides:          slurmctld
# Required-Start:    $remote_fs $syslog $network munge
# Required-Stop:     $remote_fs $syslog $network munge
# Should-Start:      $named
# Should-Stop:       $named
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: slurm daemon management
# Description:       Start slurm to provide resource management
### END INIT INFO

BINDIR="/usr/local/bin"
CONFDIR="/usr/local/etc"
LIBDIR="/usr/local/lib"
SBINDIR="/usr/local/sbin"

# Source slurm specific configuration
if [ -f /etc/default/slurmctld ] ; then
    . /etc/default/slurmctld
else
    SLURMCTLD_OPTIONS=""
fi

# Checking for slurm.conf presence
if [ ! -f $CONFDIR/slurm.conf ] ; then
    if [ -n "$(echo $1 | grep start)" ] ; then
        echo Not starting slurmctld
    fi
    echo slurm.conf was not found in $CONFDIR
    exit 0
fi

DAEMONLIST="slurmctld"
test -f $SBINDIR/slurmctld || exit 0

# Checking for lsb init functions
if [ -f /lib/lsb/init-functions ] ; then
  . /lib/lsb/init-functions
else
  echo Can\'t find lsb init functions
  exit 1
fi

# setup library paths for slurm and munge support
export LD_LIBRARY_PATH=$LIBDIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}

# Function to check for cert and key presence and key vulnerability
checkcertkey()
{
  MISSING=""
  keyfile=""
  certfile=""

  if [ "$1" = "slurmctld" ] ; then
    keyfile=$(grep JobCredentialPrivateKey $CONFDIR/slurm.conf | grep -v "^ *#")
    keyfile=${keyfile##*=}
    keyfile=${keyfile%#*}
    [ -e $keyfile ] || MISSING="$keyfile"
  fi

  if [ "${MISSING}" != "" ] ; then
    echo Not starting slurmctld
    echo $MISSING not found
    exit 0
  fi

  if [ -f "$keyfile" ] && [ "$1" = "slurmctld" ] ; then
    keycheck=$(openssl-vulnkey $keyfile | cut -d : -f 1)
    if [ "$keycheck" = "COMPROMISED" ] ; then
      echo Your slurm key stored in the file $keyfile
      echo is vulnerable because it has been created with a buggy openssl.
      echo Please rebuild it with openssl version \>= 0.9.8g-9
      exit 0
    fi
  fi
}

get_daemon_description()
{
    case $1 in
      slurmd)
        echo slurm compute node daemon
        ;;
      slurmctld)
        echo slurm central management daemon
        ;;
      *)
        echo slurm daemon
        ;;
    esac
}

start() {
  CRYPTOTYPE=$(grep CryptoType $CONFDIR/slurm.conf | grep -v "^ *#")
  CRYPTOTYPE=${CRYPTOTYPE##*=}
  CRYPTOTYPE=${CRYPTOTYPE%#*}
  if [ "$CRYPTOTYPE" = "crypto/openssl" ] ; then
    checkcertkey $1
  fi

  # Create run-time variable data
  mkdir -p /var/run/slurm
  chown slurm:slurm /var/run/slurm

  # Checking if StateSaveLocation is under run
  if [ "$1" = "slurmctld" ] ; then
    SDIRLOCATION=$(grep StateSaveLocation /usr/local/etc/slurm.conf \
                      | grep -v "^ *#")
    SDIRLOCATION=${SDIRLOCATION##*=}
    SDIRLOCATION=${SDIRLOCATION%#*}
    if [ "${SDIRLOCATION}" = "/var/run/slurm/slurmctld" ] ; then
      if ! [ -e /var/run/slurm/slurmctld ] ; then
        ln -s /var/lib/slurm/slurmctld /var/run/slurm/slurmctld
      fi
    fi
  fi

  desc="$(get_daemon_description $1)"
  log_daemon_msg "Starting $desc" "$1"
  unset HOME MAIL USER USERNAME
  #FIXME $STARTPROC $SBINDIR/$1 $2
  STARTERRORMSG="$(start-stop-daemon --start --oknodo \
                  --exec "$SBINDIR/$1" -- $2 2>&1)"
  STATUS=$?
  log_end_msg $STATUS
  if [ "$STARTERRORMSG" != "" ] ; then
    echo $STARTERRORMSG
  fi
  touch /var/lock/slurm
}

stop() {
    desc="$(get_daemon_description $1)"
    log_daemon_msg "Stopping $desc" "$1"
    STOPERRORMSG="$(start-stop-daemon --oknodo --stop -s TERM \
                    --exec "$SBINDIR/$1" 2>&1)"
    STATUS=$?
    log_end_msg $STATUS
    if [ "$STOPERRORMSG" != "" ] ; then
      echo $STOPERRORMSG
    fi
    rm -f /var/lock/slurm
}

getpidfile() {
    dpidfile=`grep -i ${1}pid $CONFDIR/slurm.conf | grep -v '^ *#'`
    if [ $? = 0 ]; then
        dpidfile=${dpidfile##*=}
        dpidfile=${dpidfile%#*}
    else
        dpidfile=/var/run/${1}.pid
    fi

    echo $dpidfile
}

#
# status() with slight modifications to take into account
# instantiations of job manager slurmd's, which should not be
# counted as "running"
#
slurmstatus() {
    base=${1##*/}

    pidfile=$(getpidfile $base)

    pid=`pidof -o $$ -o $$PPID -o %PPID -x $1 || \
        pidof -o $$ -o $$PPID -o %PPID -x ${base}`

    if [ -f $pidfile ]; then
        read rpid < $pidfile
        if [ "$rpid" != "" -a "$pid" != "" ]; then
            for i in $pid ; do
                if [ "$i" = "$rpid" ]; then
                    echo "${base} (pid $pid) is running..."
                    return 0
                fi
            done
        elif [ "$rpid" != "" -a "$pid" = "" ]; then
#           Due to change in user id, pid file may persist
#           after slurmctld terminates
            if [ "$base" != "slurmctld" ] ; then
                echo "${base} dead but pid file exists"
            fi
            return 1
        fi
    fi

    if [ "$base" = "slurmctld" -a "$pid" != "" ] ; then
        echo "${base} (pid $pid) is running..."
        return 0
    fi

    echo "${base} is stopped"

    return 3
}

#
# stop slurm daemons,
# wait for termination to complete (up to 10 seconds) before returning
#
slurmstop() {
    for prog in $DAEMONLIST ; do
      stop $prog
      for i in 1 2 3 4
      do
          sleep $i
          slurmstatus $prog
          if [ $? != 0 ]; then
            break
          fi
      done
    done
}

#
# The pathname substitution in daemon command assumes prefix and
# exec_prefix are same.  This is the default, unless the user requests
# otherwise.
#
# Any node can be a slurm controller and/or server.
#
case "$1" in
    start)
        start slurmctld "$SLURMCTLD_OPTIONS"
        ;;
    startclean)
        SLURMCTLD_OPTIONS="-c $SLURMCTLD_OPTIONS"
        start slurmctld "$SLURMCTLD_OPTIONS"
        ;;
    stop)
        slurmstop
        ;;
    status)
        for prog in $DAEMONLIST ; do
          slurmstatus $prog
        done
        ;;
    restart)
        $0 stop
        $0 start
        ;;
    force-reload)
        $0 stop
        $0 start
        ;;
    condrestart)
        if [ -f /var/lock/subsys/slurm ]; then
            for prog in $DAEMONLIST ; do
                stop $prog
                start $prog
            done
        fi
        ;;
    reconfig)
        for prog in $DAEMONLIST ; do
            PIDFILE=$(getpidfile $prog)
            start-stop-daemon --stop --signal HUP --pidfile \
              "$PIDFILE" --quiet $prog
        done
        ;;
    test)
        for prog in $DAEMONLIST ; do
            echo "$prog runs here"
        done
        ;;
    *)
        echo "Usage: $0 {start|startclean|stop|status|restart|reconfig|condrestart|test}"
        exit 1
        ;;
esac
</nowiki></pre>}}</p>
                            </li>
                            <li>
 +
                                <p>Add SLURM user:</p>
 +
                                <p>{{Command|<nowiki>echo "slurm:x:2000:2000:slurm admin:/home/slurm:/bin/bash" >> /etc/passwd
 +
echo "slurm:x:2000:slurm >> /etc/group
 +
pwconv</nowiki>}}</p>
 +
                            </li>
 +
                            <li>
 +
                                <p>Create SLURM spool and log directories and set permissions accordingly:</p>
 +
                                <p>{{Command|<nowiki>mkdir /var/spool/slurm
 +
chown -R slurm:slurm /var/spool/slurm
 +
mkdir /var/log/slurm
 +
chown -R slurm:slurm /var/log/slurm</nowiki>}}</p>
 +
                            </li>
 +
                        </ol>
 +
                    </div>
 +
                    <div class="panel-footer">CONTROLLER CONFIGURATION</div>
 +
                </div>
 +
            </div>
  
Executing the command slurmd -C on each compute node will print its physical configuration (sockets, cores, real memory size, etc.), which can be used in constructing the slurm.conf file.
+
            <div class="col-md-12">
 +
                <div class="panel panel-midnight-border">
 +
                    <div class="panel-heading">
 +
                        <h3 class="panel-title">COMPUTE NODE CONFIGURATION</h3>
 +
                    </div>
 +
                    <div class="panel-body">
  
9. Set things up for slurmdbd (the SLURM accounting daemon) in MySQL.
 
  
mysql -u root -p
+
<ol>
create database slurm_db;
+
<li>
create user 'slurm'@'localhost';
+
                    <p>Install munge in each node of cluster</p>
set password for 'slurm'@'localhost' = password('MyPassword');
+
                    <p>{{Command|<nowiki>apt-get install -y libmunge-dev libmunge2 munge</nowiki>}}</p>
grant usage on *.* to 'slurm'@'localhost';
+
                </li>
grant all privileges on slurm_db.* to 'slurm'@'localhost';
+
                <li>
flush privileges;
+
                    <p>Copy munge.key file from server to each node from cluster</p>
quit
+
                    <p>{{Command|<nowiki>scp /etc/munge/munge.key root@node:/etc/munge/munge.key</nowiki>}}</p>
 +
                </li>
 +
</ol>
  
 +
                    </div>
 +
                    <div class="panel-footer">COMPUTE NODE CONFIGURATION</div>
 +
                </div>
 +
            </div>
  
 +
           
 +
       
 
         </div>
 
         </div>
 +
        <div class="panel-footer">Slurm Installation on Debian from source</div>
 
     </div>
 
     </div>
 
</div>
 
</div>

Latest revision as of 14:41, 21 September 2016



Slurm Installation

In this section we describe the administration tasks for the Slurm Workload Manager on the frontend node (server) and on the compute nodes (clients).

Slurm Installation on Debian from repositories

  1. Make sure the clocks, users and groups (UIDs and GIDs) are synchronized across the cluster.
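    A quick way to spot UID/GID drift is to compare the relevant fields of each host's passwd entry. A minimal sketch, using two sample entries in place of real `ssh node getent passwd munge` output:

    ```shell
    # Sample entries standing in for `getent passwd munge` from two hosts
    entry_a="munge:x:501:501::/var/run/munge:/sbin/nologin"   # frontend
    entry_b="munge:x:501:501::/var/run/munge:/sbin/nologin"   # compute node

    # UID and GID are fields 3 and 4 of a passwd entry
    uid_gid() { echo "$1" | cut -d: -f3,4; }

    if [ "$(uid_gid "$entry_a")" = "$(uid_gid "$entry_b")" ]; then
        echo "munge UID/GID consistent"
    else
        echo "munge UID/GID MISMATCH" >&2
    fi
    ```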
  2. Install MUNGE

    apt-get install -y libmunge-dev libmunge2 munge

  3. Generate MUNGE key. There are various ways to do this, depending on the desired level of key quality. Refer to the MUNGE installation guide for complete details.

    Wait around for some random data (recommended for the paranoid):

    dd if=/dev/random bs=1 count=1024 > /etc/munge/munge.key

    Grab some pseudorandom data (recommended for the impatient):

    dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key

    Permissions

    chown munge:munge /etc/munge/munge.key
    chmod 400 /etc/munge/munge.key
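    The key generation and permission steps can be combined into a small script. The sketch below writes to a temporary directory so it can be tried safely; point KEYDIR at /etc/munge (and uncomment the chown, which needs root) for the real key:

    ```shell
    KEYDIR=$(mktemp -d)
    # 1024 bytes of pseudorandom data, as in the urandom variant above
    dd if=/dev/urandom of="$KEYDIR/munge.key" bs=1 count=1024 2>/dev/null
    chmod 400 "$KEYDIR/munge.key"
    # chown munge:munge "$KEYDIR/munge.key"   # needed for the real key; requires root
    ls -l "$KEYDIR/munge.key"
    ```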

  4. Edit the file /etc/passwd

    vi /etc/passwd

    Modify the munge user on each machine

    File: /etc/passwd
    munge:x:501:501::/var/run/munge:/sbin/nologin

  5. Start MUNGE

    /etc/init.d/munge start

  6. Testing Munge

    The following steps can be performed to verify that the software has been properly installed and configured:

    Generate a credential on stdout:

    munge -n

    Check if a credential can be locally decoded:

    munge -n | unmunge

    Check if a credential can be remotely decoded:

    munge -n | ssh somehost unmunge

    Run a quick benchmark:

    remunge

  7. Install SLURM from repositories - Debian 8

    apt-get install -y slurm-wlm slurm-wlm-doc

    Install SLURM from repositories - Debian 7

    apt-get install -y slurm-llnl slurm-llnl-doc

  8. Create and copy slurm.conf

    There are several ways to generate the slurm.conf file. SLURM ships a web-based configuration tool that builds a simple configuration file, which can then be edited manually for more complex setups. The tool is located at /usr/share/doc/slurmctld/slurm-wlm-configurator.html

    Open this file in your browser

    sftp://ip-server/usr/share/doc/slurmctld/slurm-wlm-configurator.html

    NOTE: Executing the command slurmd -C on each compute node will print its physical configuration (sockets, cores, real memory size, etc.), which can be used in constructing the slurm.conf file.
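    That note can be scripted: gather the slurmd -C output from every node, keep the NodeName lines, and drop the UpTime lines. A sketch using hypothetical sample output in place of real nodes:

    ```shell
    # Sample output standing in for: for n in guane-1 guane-2; do ssh "$n" slurmd -C; done
    cat > nodes.raw <<'EOF'
    NodeName=guane-1 CPUs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64315
    UpTime=12-03:41:12
    NodeName=guane-2 CPUs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64315
    UpTime=2-11:05:33
    EOF
    # Keep the NodeName lines, append a State, drop the UpTime noise
    grep '^NodeName=' nodes.raw | sed 's/$/ State=UNKNOWN/' > nodes.conf
    cat nodes.conf
    ```

    The resulting nodes.conf lines can be pasted into the COMPUTE NODES section of slurm.conf.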

    Copy the result from the web-based configuration tool to /etc/slurm/slurm.conf and configure it so that it looks like the following (this is an example; build a configuration file customized for your environment) - http://slurm.schedmd.com/slurm.conf.html

    File: /etc/slurm/slurm.conf
    #
    # slurm.conf file generated by configurator.html.
    #
    # See the slurm.conf man page for more information.
    #
    ClusterName=GUANE
    ControlMachine=guane
    #
    SlurmUser=slurm
    SlurmctldPort=6817
    SlurmdPort=6818
    AuthType=auth/munge
    StateSaveLocation=/tmp
    SlurmdSpoolDir=/var/spool/slurm/slurmd
    SwitchType=switch/none
    MpiDefault=none
    SlurmctldPidFile=/var/run/slurmctld.pid
    SlurmdPidFile=/var/run/slurmd.pid
    ProctrackType=proctrack/pgid
    CacheGroups=0
    ReturnToService=1
    #
    # TIMERS
    SlurmctldTimeout=300
    SlurmdTimeout=300
    InactiveLimit=0
    MinJobAge=300
    KillWait=30
    Waittime=0
    #
    # SCHEDULING
    SchedulerType=sched/backfill
    SelectType=select/linear
    FastSchedule=1
    #
    # LOGGING
    SlurmctldDebug=3
    SlurmdDebug=3
    JobCompType=jobcomp/none
    JobCompLoc=/tmp/slurm_job_completion.txt
    #
    # ACCOUNTING
    JobAcctGatherType=jobacct_gather/linux
    JobAcctGatherFrequency=30
    #
    #AccountingStorageType=accounting_storage/slurmdbd
    #AccountingStorageHost=slurm
    #AccountingStorageLoc=/tmp/slurm_job_accounting.txt
    #AccountingStoragePass=
    #AccountingStorageUser=
    #
    # COMPUTE NODES
    # control node
    NodeName=guane NodeAddr=192.168.1.70 Port=17000 State=UNKNOWN
    
    # each logical node is on the same physical node, so we need different ports for them
    # name guane-[*] is arbitrary
    NodeName=guane-1 NodeAddr=192.168.1.71 Port=17002 State=UNKNOWN
    NodeName=guane-2 NodeAddr=192.168.1.72 Port=17003 State=UNKNOWN
    
    # PARTITIONS
    # partition name is arbitrary
    PartitionName=guane Nodes=guane-[1-2] Default=YES MaxTime=8-00:00:00 State=UP
    
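    Before starting the daemons it is worth a rough syntax pass over the file: slurm.conf is one key=value pair per line, with `#` starting comments. A sketch (not slurmctld's own parser), run here against a small sample file:

    ```shell
    conf=$(mktemp)
    cat > "$conf" <<'EOF'
    ClusterName=GUANE
    ControlMachine=guane
    # a comment
    BadLineWithoutEquals
    EOF
    # Count non-comment, non-blank lines that lack a key=value pair
    bad=$(grep -v '^[[:space:]]*#' "$conf" | grep -v '^[[:space:]]*$' | grep -cv '=')
    echo "lines without key=value: $bad"
    ```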

  9. Install MUNGE on each node of the cluster

    apt-get install -y libmunge-dev libmunge2 munge

  10. Copy the munge.key file from the server to each node of the cluster

    scp /etc/munge/munge.key root@node:/etc/munge/munge.key

    chown munge:munge /etc/munge/munge.key
    chmod 400 /etc/munge/munge.key
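    With more than a couple of nodes, the copy and permission commands can be generated per node. A sketch that only prints the commands (NODELIST is an assumption; review the output, then pipe it to sh):

    ```shell
    # NODELIST is a placeholder; replace with your node names
    NODELIST="guane-1 guane-2"
    for node in $NODELIST; do
        echo "scp -p /etc/munge/munge.key root@$node:/etc/munge/munge.key"
        echo "ssh root@$node 'chown munge:munge /etc/munge/munge.key; chmod 400 /etc/munge/munge.key'"
    done > distribute_key.sh
    cat distribute_key.sh        # review first, then: sh distribute_key.sh
    ```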

  11. Install SLURM compute node daemon - Debian 8

    apt-get install -y slurmd

    Install SLURM compute node daemon - Debian 7

    apt-get install -y slurm-llnl

  12. Start slurm in the nodes and server - Debian 8

    /etc/init.d/slurmd start

    /etc/init.d/slurmctld start

    Start slurm in the nodes and server - Debian 7

    /etc/init.d/slurm-llnl start

    /etc/init.d/slurmctld-llnl start


Slurm Installation on Debian from source


CONTROLLER CONFIGURATION

http://wildflower.diablonet.net/~scaron/slurmsetup.html

  1. Prerequisites

    apt-get install -y build-essential

  2. Install MUNGE

    apt-get install -y libmunge-dev libmunge2 munge

  3. Generate MUNGE key. There are various ways to do this, depending on the desired level of key quality. Refer to the MUNGE installation guide for complete details.

    Wait around for some random data (recommended for the paranoid):

    dd if=/dev/random bs=1 count=1024 > /etc/munge/munge.key

    Grab some pseudorandom data (recommended for the impatient):

    dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key

    Permissions

    chown munge:munge /etc/munge/munge.key
    chmod 400 /etc/munge/munge.key

  4. Edit the file /etc/passwd

    vi /etc/passwd

    Modify the munge user on each machine

    File: /etc/passwd
    munge:x:501:501::/var/run/munge:/sbin/nologin

  5. Start MUNGE

    /etc/init.d/munge start

  6. Testing Munge

    The following steps can be performed to verify that the software has been properly installed and configured:

    Generate a credential on stdout:

    munge -n

    Check if a credential can be locally decoded:

    munge -n | unmunge

    Check if a credential can be remotely decoded:

    munge -n | ssh somehost unmunge

    Run a quick benchmark:

    remunge

  7. Install MySQL server (for SLURM accounting) and development tools (to build SLURM). We'll also install the BLCR tools so that SLURM can take advantage of that checkpoint-and-restart functionality.

    apt-get install mysql-server libmysqlclient-dev libmysqld-dev libmysqld-pic
    apt-get install gcc bison make flex libncurses5-dev tcsh pkg-config
    apt-get install blcr-util blcr-testsuite libcr-dbg libcr-dev libcr0

  8. Download the latest version. (http://www.schedmd.com/#repos)

    wget http://www.schedmd.com/download/latest/slurm-14.11.6.tar.bz2

  9. Unpack and build SLURM

    tar xvf slurm-14.11.6.tar.bz2
    cd slurm-14.11.6
    ./configure --enable-multiple-slurmd
    make
    make install

  10. Create and copy slurm.conf
  11. There are several ways to generate the slurm.conf file. SLURM ships a web-based configuration tool that builds a simple configuration file, which can then be edited manually for more complex setups. The tool is located at doc/html/configurator.html in the source tree and can be opened in a browser once copied to /usr/share/ (sftp://ip-server/usr/share/doc/slurm/configurator.html)

    mkdir /usr/share/doc/slurm/
    cd [slurm-src]/doc/html/
    cp configurator.* /usr/share/doc/slurm/

    Another way is to copy the example configuration files to /etc/slurm.

    mkdir /etc/slurm
    cd [slurm-src]
    cp etc/slurm.conf.example /etc/slurm/slurm.conf
    cp etc/slurmdbd.conf.example /etc/slurm/slurmdbd.conf

    Executing the command slurmd -C on each compute node will print its physical configuration (sockets, cores, real memory size, etc.), which can be used in constructing the slurm.conf file.

  12. Set things up for slurmdbd (the SLURM accounting daemon) in MySQL

    mysql -u root -p
    create database slurm_db;
    create user 'slurm'@'localhost';
    set password for 'slurm'@'localhost' = password('MyPassword');
    grant usage on *.* to 'slurm'@'localhost';
    grant all privileges on slurm_db.* to 'slurm'@'localhost';
    flush privileges;
    quit
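    The same statements can be kept in a file and replayed, which is handy when rebuilding the controller. A sketch that writes the file (database name and password are the examples used in this step); the commented mysql line shows how it would be run:

    ```shell
    cat > slurm_db.sql <<'EOF'
    create database slurm_db;
    create user 'slurm'@'localhost';
    set password for 'slurm'@'localhost' = password('MyPassword');
    grant usage on *.* to 'slurm'@'localhost';
    grant all privileges on slurm_db.* to 'slurm'@'localhost';
    flush privileges;
    EOF
    # mysql -u root -p < slurm_db.sql    # run this on the controller
    cat slurm_db.sql
    ```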

  13. Configure /usr/local/etc/slurmdbd.conf such that it looks like the following:

    File: /usr/local/etc/slurmdbd.conf
    #
    # Example slurmdbd.conf file.
    #
    # See the slurmdbd.conf man page for more information.
    #
    # Archive info
    #ArchiveJobs=yes
    #ArchiveDir="/tmp"
    #ArchiveSteps=yes
    #ArchiveScript=
    #JobPurge=12
    #StepPurge=1
    #
    # Authentication info
    AuthType=auth/munge
    AuthInfo=/var/run/munge/munge.socket.2
    #
    # slurmDBD info
    DbdAddr=localhost
    DbdHost=localhost
    #DbdPort=7031
    SlurmUser=slurm
    #MessageTimeout=300
    DebugLevel=4
    #DefaultQOS=normal,standby
    LogFile=/var/log/slurm/slurmdbd.log
    PidFile=/var/run/slurmdbd.pid
    #PluginDir=/usr/lib/slurm
    #PrivateData=accounts,users,usage,jobs
    #TrackWCKey=yes
    #
    # Database info
    StorageType=accounting_storage/mysql
    #StorageHost=localhost
    #StoragePort=1234
    StoragePass=MyPassword
    StorageUser=slurm
    StorageLoc=slurm_db
    

  14. Configure /usr/local/etc/slurm.conf such that it looks like the following:

    File: /usr/local/etc/slurm.conf
    #
    # slurm.conf file generated by configurator.html.
    #
    # See the slurm.conf man page for more information.
    #
    ClusterName=GUANE
    ControlMachine=guane
    #ControlAddr=
    #BackupController=
    #BackupAddr=
    #
    SlurmUser=slurm
    #SlurmdUser=root
    SlurmctldPort=6817
    SlurmdPort=6818
    AuthType=auth/munge
    #JobCredentialPrivateKey=
    #JobCredentialPublicCertificate=
    StateSaveLocation=/tmp
    SlurmdSpoolDir=/var/spool/slurm/slurmd
    SwitchType=switch/none
    MpiDefault=none
    SlurmctldPidFile=/var/run/slurmctld.pid
    SlurmdPidFile=/var/run/slurmd.pid
    ProctrackType=proctrack/pgid
    #PluginDir=
    CacheGroups=0
    #FirstJobId=
    ReturnToService=1
    #MaxJobCount=
    #PlugStackConfig=
    #PropagatePrioProcess=
    #PropagateResourceLimits=
    #PropagateResourceLimitsExcept=
    #Prolog=
    #Epilog=
    #SrunProlog=
    #SrunEpilog=
    #TaskProlog=
    #TaskEpilog=
    #TaskPlugin=
    #TrackWCKey=no
    #TreeWidth=50
    #TmpFS=
    #UsePAM=
    #
    # TIMERS
    SlurmctldTimeout=300
    SlurmdTimeout=300
    InactiveLimit=0
    MinJobAge=300
    KillWait=30
    Waittime=0
    #
    # SCHEDULING
    SchedulerType=sched/backfill
    #SchedulerAuth=
    #SchedulerPort=
    #SchedulerRootFilter=
    SelectType=select/linear
    FastSchedule=1
    #PriorityType=priority/multifactor
    #PriorityDecayHalfLife=14-0
    #PriorityUsageResetPeriod=14-0
    #PriorityWeightFairshare=100000
    #PriorityWeightAge=1000
    #PriorityWeightPartition=10000
    #PriorityWeightJobSize=1000
    #PriorityMaxAge=1-0
    #
    # LOGGING
    SlurmctldDebug=3
    #SlurmctldLogFile=
    SlurmdDebug=3
    #SlurmdLogFile=
    JobCompType=jobcomp/none
    JobCompLoc=/tmp/slurm_job_completion.txt
    #
    # ACCOUNTING
    JobAcctGatherType=jobacct_gather/linux
    JobAcctGatherFrequency=30
    #
    #AccountingStorageEnforce=limits,qos
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=slurm
    AccountingStorageLoc=/tmp/slurm_job_accounting.txt
    #AccountingStoragePass=
    #AccountingStorageUser=
    #
    
    # COMPUTE NODES
    # control node
    NodeName=guane NodeAddr=192.168.1.75 Port=17000 State=UNKNOWN
    
    # each logical node is on the same physical node, so we need different ports for them
    # name guane-[*] is arbitrary
    NodeName=guane-1 NodeAddr=192.168.1.71 Port=17002 State=UNKNOWN
    NodeName=guane-2 NodeAddr=192.168.1.72 Port=17003 State=UNKNOWN
    
    # PARTITIONS
    # partition name is arbitrary
    PartitionName=guane Nodes=guane-1 Default=YES MaxTime=8-00:00:00 State=UP
    

  15. Create init scripts to /etc/init.d

    touch /etc/init.d/slurmctl
    touch /etc/init.d/slurmdbd
    chmod +x /etc/init.d/slurmctl
    chmod +x /etc/init.d/slurmdbd

    File: /etc/init.d/slurmctl
    #!/bin/sh
    #
    # chkconfig: 345 90 10
    # description: SLURM is a simple resource management system which \
    #              manages exclusive access to a set of compute \
    #              resources and distributes work to those resources.
    #
    # processname: /usr/local/sbin/slurmctld
    # pidfile: /var/run/slurm/slurmctld.pid
    #
    # config: /etc/default/slurmctld
    #
    ### BEGIN INIT INFO
    # Provides:          slurmctld
    # Required-Start:    $remote_fs $syslog $network munge
    # Required-Stop:     $remote_fs $syslog $network munge
    # Should-Start:      $named
    # Should-Stop:       $named
    # Default-Start:     2 3 4 5
    # Default-Stop:      0 1 6
    # Short-Description: slurm daemon management
    # Description:       Start slurm to provide resource management
    ### END INIT INFO
    
    BINDIR="/usr/local/bin"
    CONFDIR="/usr/local/etc"
    LIBDIR="/usr/local/lib"
    SBINDIR="/usr/local/sbin"
    
    # Source slurm specific configuration
    if [ -f /etc/default/slurmctld ] ; then
        . /etc/default/slurmctld
    else
        SLURMCTLD_OPTIONS=""
    fi
    
    # Checking for slurm.conf presence
    if [ ! -f $CONFDIR/slurm.conf ] ; then
        if [ -n "$(echo $1 | grep start)" ] ; then
          echo Not starting slurmctld
        fi
          echo slurm.conf was not found in $CONFDIR
        exit 0
    fi
    
    DAEMONLIST="slurmctld"
    test -f $SBINDIR/slurmctld || exit 0
    
    #Checking for lsb init function
    if [ -f /lib/lsb/init-functions ] ; then
      . /lib/lsb/init-functions
    else
      echo Can\'t find lsb init functions
      exit 1
    fi
    # setup library paths for slurm and munge support
    export LD_LIBRARY_PATH=$LIBDIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
    
    #Function to check for cert and key presence and key vulnerabilty
    checkcertkey()
    {
      MISSING=""
      keyfile=""
      certfile=""
    
      if [ "$1" = "slurmctld" ] ; then
        keyfile=$(grep JobCredentialPrivateKey $CONFDIR/slurm.conf | grep -v "^ *#")
        keyfile=${keyfile##*=}
        keyfile=${keyfile%#*}
        [ -e $keyfile ] || MISSING="$keyfile"
      fi
    
      if [ "${MISSING}" != "" ] ; then
        echo Not starting slurmctld
        echo $MISSING not found
        exit 0
      fi
    
      if [ -f "$keyfile" ] && [ "$1" = "slurmctld" ] ; then
        keycheck=$(openssl-vulnkey $keyfile | cut -d : -f 1)
        if [ "$keycheck" = "COMPROMISED" ] ; then
          echo Your slurm key stored in the file $keyfile
          echo is vulnerable because it has been created with a buggy openssl.
          echo Please rebuild it with openssl version \>= 0.9.8g-9
          exit 0
        fi
      fi
    }
    
    get_daemon_description()
    {
        case $1 in
          slurmd)
            echo slurm compute node daemon
            ;;
          slurmctld)
            echo slurm central management daemon
            ;;
          *)
            echo slurm daemon
            ;;
        esac
    }
    
    start() {
      CRYPTOTYPE=$(grep CryptoType $CONFDIR/slurm.conf | grep -v "^ *#")
      CRYPTOTYPE=${CRYPTOTYPE##*=}
      CRYPTOTYPE=${CRYPTOTYPE%#*}
      if [ "$CRYPTOTYPE" = "crypto/openssl" ] ; then
        checkcertkey $1
      fi
    
      # Create run-time variable data
      mkdir -p /var/run/slurm
      chown slurm:slurm /var/run/slurm
    
      # Checking if StateSaveLocation is under run
      if [ "$1" = "slurmctld" ] ; then
        SDIRLOCATION=$(grep StateSaveLocation /usr/local/etc/slurm.conf \
                           | grep -v "^ *#")
        SDIRLOCATION=${SDIRLOCATION##*=}
        SDIRLOCATION=${SDIRLOCATION%#*}
        if [ "${SDIRLOCATION}" = "/var/run/slurm/slurmctld" ] ; then
          if ! [ -e /var/run/slurm/slurmctld ] ; then
            ln -s /var/lib/slurm/slurmctld /var/run/slurm/slurmctld
          fi
        fi
      fi
    
    desc="$(get_daemon_description $1)"
      log_daemon_msg "Starting $desc" "$1"
      unset HOME MAIL USER USERNAME
      #FIXME $STARTPROC $SBINDIR/$1 $2
      STARTERRORMSG="$(start-stop-daemon --start --oknodo \
                       --exec "$SBINDIR/$1" -- $2 2>&1)"
      STATUS=$?
      log_end_msg $STATUS
      if [ "$STARTERRORMSG" != "" ] ; then
        echo $STARTERRORMSG
      fi
      touch /var/lock/slurm
    }
    
    stop() {
        desc="$(get_daemon_description $1)"
        log_daemon_msg "Stopping $desc" "$1"
        STOPERRORMSG="$(start-stop-daemon --oknodo --stop -s TERM \
                        --exec "$SBINDIR/$1" 2>&1)"
        STATUS=$?
        log_end_msg $STATUS
        if [ "$STOPERRORMSG" != "" ] ; then
          echo $STOPERRORMSG
        fi
        rm -f /var/lock/slurm
    }
    
    getpidfile() {
        dpidfile=`grep -i ${1}pid $CONFDIR/slurm.conf | grep -v '^ *#'`
        if [ $? = 0 ]; then
            dpidfile=${dpidfile##*=}
            dpidfile=${dpidfile%#*}
        else
            dpidfile=/var/run/${1}.pid
        fi
    
        echo $dpidfile
    }
    
    #
    # status() with slight modifications to take into account
    # instantiations of job manager slurmd's, which should not be
    # counted as "running"
    #
    
    slurmstatus() {
        base=${1##*/}
    
        pidfile=$(getpidfile $base)
    
        pid=`pidof -o $$ -o $$PPID -o %PPID -x $1 || \
             pidof -o $$ -o $$PPID -o %PPID -x ${base}`
    
        if [ -f $pidfile ]; then
            read rpid < $pidfile
            if [ "$rpid" != "" -a "$pid" != "" ]; then
                for i in $pid ; do
                    if [ "$i" = "$rpid" ]; then
                        echo "${base} (pid $pid) is running..."
                        return 0
                    fi
                done
            elif [ "$rpid" != "" -a "$pid" = "" ]; then
    #           Due to change in user id, pid file may persist
    #           after slurmctld terminates
                if [ "$base" != "slurmctld" ] ; then
                   echo "${base} dead but pid file exists"
                fi
                return 1
           fi
    
        fi
    
        if [ "$base" = "slurmctld" -a "$pid" != "" ] ; then
            echo "${base} (pid $pid) is running..."
            return 0
        fi
    
        echo "${base} is stopped"
    
        return 3
    }
    
    #
    # stop slurm daemons,
    # wait for termination to complete (up to 10 seconds) before returning
    #
    
    slurmstop() {
        for prog in $DAEMONLIST ; do
           stop $prog
           for i in 1 2 3 4
           do
              sleep $i
              slurmstatus $prog
              if [ $? != 0 ]; then
                 break
              fi
           done
        done
    }
    
    #
    # The pathname substitution in daemon command assumes prefix and
    # exec_prefix are same.  This is the default, unless the user requests
    # otherwise.
    #
    # Any node can be a slurm controller and/or server.
    #
    case "$1" in
        start)
            start slurmctld "$SLURMCTLD_OPTIONS"
            ;;
        startclean)
            SLURMCTLD_OPTIONS="-c $SLURMCTLD_OPTIONS"
            start slurmctld "$SLURMCTLD_OPTIONS"
            ;;
        stop)
            slurmstop
            ;;
        status)
            for prog in $DAEMONLIST ; do
               slurmstatus $prog
            done
            ;;
        restart)
            $0 stop
            $0 start
            ;;
        force-reload)
            $0 stop
            $0 start
            ;;
        condrestart)
            if [ -f /var/lock/subsys/slurm ]; then
                for prog in $DAEMONLIST ; do
                     stop $prog
                     start $prog
                done
            fi
            ;;
        reconfig)
            for prog in $DAEMONLIST ; do
                PIDFILE=$(getpidfile $prog)
                start-stop-daemon --stop --signal HUP --pidfile \
                  "$PIDFILE" --quiet $prog
            done
            ;;
        test)
            for prog in $DAEMONLIST ; do
                echo "$prog runs here"
            done
            ;;
        *)
            echo "Usage: $0 {start|startclean|stop|status|restart|reconfig|condrestart|test}"
            exit 1
            ;;
    esac
    

  16. Add SLURM user:

    echo "slurm:x:2000:2000:slurm admin:/home/slurm:/bin/bash" >> /etc/passwd
    echo "slurm:x:2000:slurm" >> /etc/group
    pwconv
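    A malformed line appended to /etc/passwd can break logins, so it is worth checking the entry first: a valid entry has seven colon-separated fields with numeric UID and GID (see passwd(5)). A minimal sketch of such a check (on a live system, groupadd and useradd are the safer route):

    ```shell
    entry="slurm:x:2000:2000:slurm admin:/home/slurm:/bin/bash"
    nfields=$(echo "$entry" | awk -F: '{print NF}')
    uid=$(echo "$entry" | cut -d: -f3)
    gid=$(echo "$entry" | cut -d: -f4)
    # UID/GID must be purely numeric and the field count exactly seven
    case "$uid$gid" in
        *[!0-9]*) echo "malformed entry" >&2 ;;
        *) [ "$nfields" -eq 7 ] && echo "entry looks well-formed" ;;
    esac
    ```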

  17. Create SLURM spool and log directories and set permissions accordingly:

    mkdir /var/spool/slurm
    chown -R slurm:slurm /var/spool/slurm
    mkdir /var/log/slurm
    chown -R slurm:slurm /var/log/slurm

COMPUTE NODE CONFIGURATION


  1. Install MUNGE on each node of the cluster

    apt-get install -y libmunge-dev libmunge2 munge

  2. Copy the munge.key file from the server to each node of the cluster

    scp /etc/munge/munge.key root@node:/etc/munge/munge.key