Slurm Installation on Debian
In this section we describe the administration tasks for the Slurm Workload Manager on the frontend node (server) and on the compute nodes (clients).

<div class="col-md-14">
<div class="panel panel-darker-white-border">
<div class="panel-heading">
<h3 class="panel-title">Slurm Installation on Debian</h3>
</div>
<div class="panel-body">
<ol>
<li>Make sure the clocks, users and groups (UIDs and GIDs) are synchronized across the cluster.</li>
<li>
<p>Install MUNGE:</p>
<p>{{Command|<nowiki>apt-get install -y libmunge-dev libmunge2 munge</nowiki>}}</p>
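<p>The package creates a dedicated munge system user and group, which can be verified with:</p>
<p>{{Command|<nowiki>id munge</nowiki>}}</p>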
</li>
<li>
<p>Generate the MUNGE key. There are various ways to do this, depending on the desired level of key quality. Refer to the MUNGE installation guide for complete details.</p>
<p>Wait around for some random data (recommended for the paranoid):</p>
<p>{{Command|<nowiki>dd if=/dev/random bs=1 count=1024 > /etc/munge/munge.key</nowiki>}}</p>
<p>Grab some pseudorandom data (recommended for the impatient):</p>
<p>{{Command|<nowiki>dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key</nowiki>}}</p>
<p>Set the key's ownership and permissions:</p>
<p>{{Command|<nowiki>chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
</nowiki>}}</p>
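<p>Alternatively, Debian's munge package ships a helper that generates the key in one step:</p>
<p>{{Command|<nowiki>/usr/sbin/create-munge-key</nowiki>}}</p>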
</li>
<li>
<p>Edit the file /etc/passwd:</p>
<p>{{Command|<nowiki>vi /etc/passwd</nowiki>}}</p>
<p>Modify the munge user entry on each machine:</p>
<p>{{File|/etc/passwd|<pre><nowiki>munge:x:501:501::/var/run/munge:/sbin/nologin</nowiki></pre>}}</p>
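<p>The same change can be made without hand-editing /etc/passwd by using the standard usermod tool:</p>
<p>{{Command|<nowiki>usermod -d /var/run/munge -s /sbin/nologin munge</nowiki>}}</p>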
</li>
<li>
<p>Start MUNGE:</p>
<p>{{Command|<nowiki>/etc/init.d/munge start</nowiki>}}</p>
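<p>If the init script is not already registered to start at boot (the Debian package normally handles this), it can be enabled with:</p>
<p>{{Command|<nowiki>update-rc.d munge defaults</nowiki>}}</p>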
</li>
<li>
<p>Testing MUNGE</p>
<p>The following steps can be performed to verify that the software has been properly installed and configured:</p>
<p>Generate a credential on stdout:</p>
<p>{{Command|<nowiki>munge -n</nowiki>}}</p>
<p>Check if a credential can be locally decoded:</p>
<p>{{Command|<nowiki>munge -n | unmunge</nowiki>}}</p>
<p>Check if a credential can be remotely decoded:</p>
<p>{{Command|<nowiki>munge -n | ssh somehost unmunge</nowiki>}}</p>
<p>Run a quick benchmark:</p>
<p>{{Command|<nowiki>remunge</nowiki>}}</p>
</li>
<li>
<p>Install SLURM from the Debian repositories:</p>
<p>{{Command|<nowiki>apt-get install -y slurm-wlm slurm-wlm-doc</nowiki>}}</p>
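<p>A quick way to confirm the installation is to query the installed version:</p>
<p>{{Command|<nowiki>sinfo --version</nowiki>}}</p>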
</li>
<li>
<p>Create and copy slurm.conf</p>
<p>There are several ways to generate the slurm.conf file. SLURM provides a web-based configuration tool which can be used to build a simple configuration file, which can then be manually edited for more complex configurations. The tool is located at <b>/usr/share/doc/slurmctld/slurm-wlm-configurator.html</b>.</p>
<p>Open this file in your browser:</p>
<p>sftp://ip-server/usr/share/doc/slurmctld/slurm-wlm-configurator.html</p>
<p>{{Note|Executing the command slurmd -C on each compute node will print its physical configuration (sockets, cores, real memory size, etc.), which can be used in constructing the slurm.conf file.}}</p>
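<p>For illustration, slurmd -C prints a line of node parameters that can be pasted into slurm.conf; with hypothetical values for a 24-core node it looks roughly like this:</p>
<p><pre><nowiki>NodeName=guane-1 CPUs=24 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=24000 TmpDisk=100000</nowiki></pre></p>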
<p>Copy the result from the web-based configuration tool into <b>/etc/slurm/slurm.conf</b> and adjust it so that it looks like the following (this is an example; build a configuration file customized for your environment):</p>
<p>{{File|/etc/slurm/slurm.conf|<pre><nowiki>
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=GUANE
ControlMachine=guane
#
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/tmp
SlurmdSpoolDir=/var/spool/slurm/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
CacheGroups=0
ReturnToService=1
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/linear
FastSchedule=1
#
# LOGGING
SlurmctldDebug=3
SlurmdDebug=3
JobCompType=jobcomp/none
JobCompLoc=/tmp/slurm_job_completion.txt
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=slurm
#AccountingStorageLoc=/tmp/slurm_job_accounting.txt
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
# control node
NodeName=guane NodeAddr=192.168.1.70 Port=17000 State=UNKNOWN

# each logical node is on the same physical node, so we need different ports for them
# name guane-[*] is arbitrary
NodeName=guane-1 NodeAddr=192.168.1.71 Port=17002 State=UNKNOWN
NodeName=guane-2 NodeAddr=192.168.1.72 Port=17003 State=UNKNOWN

# PARTITIONS
# partition name is arbitrary
PartitionName=guane Nodes=guane-[1-2] Default=YES MaxTime=8-00:00:00 State=UP
</nowiki></pre>}}</p>
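<p>With the hardware counts from slurmd -C merged in, a NodeName line would look like this (hypothetical values):</p>
<p><pre><nowiki>NodeName=guane-1 NodeAddr=192.168.1.71 Port=17002 CPUs=24 RealMemory=24000 State=UNKNOWN</nowiki></pre></p>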
</li>
<li>
<p>Copy munge.key to each compute node:</p>
<p>{{Command|<nowiki>scp /etc/munge/munge.key root@node:/etc/munge/munge.key</nowiki>}}</p>
</li>
<li>
<p>Start SLURM: slurmctld on the server machine and slurmd on the node machines:</p>
<p>{{Command|<nowiki>/etc/init.d/slurmctld start</nowiki>}}</p>
<p>{{Command|<nowiki>/etc/init.d/slurmd start</nowiki>}}</p>
</li>
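<li>
<p>As a quick sanity check, list the partitions from the control node; once both daemons are up, the compute nodes should report as idle:</p>
<p>{{Command|<nowiki>sinfo</nowiki>}}</p>
</li>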
+ | |||
+ | </ol> | ||
+ | </div> | ||
+ | </div> | ||
+ | </div> | ||
Slurm Installation on Debian (building from source)
Reference: http://wildflower.diablonet.net/~scaron/slurmsetup.html
CONTROLLER CONFIGURATION
Prerequisites
apt-get install -y build-essential
1. Install MUNGE
apt-get install -y libmunge-dev libmunge2 munge
2. Generate MUNGE key. There are various ways to do this, depending on the desired level of key quality. Refer to the MUNGE installation guide for complete details.
Wait around for some random data (recommended for the paranoid):
dd if=/dev/random bs=1 count=1024 > /etc/munge/munge.key
Grab some pseudorandom data (recommended for the impatient):
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
Set ownership and permissions:
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
3. Start MUNGE.
/etc/init.d/munge start
4. Testing Munge
The following steps can be performed to verify that the software has been properly installed and configured:
Generate a credential on stdout:
$ munge -n
Check if a credential can be locally decoded:
$ munge -n | unmunge
Check if a credential can be remotely decoded:
$ munge -n | ssh somehost unmunge
Run a quick benchmark:
$ remunge
5. Install MySQL server (for SLURM accounting) and development tools (to build SLURM). We'll also install the BLCR tools so that SLURM can take advantage of that checkpoint-and-restart functionality.
apt-get install mysql-server libmysqlclient-dev libmysqld-dev libmysqld-pic
apt-get install gcc bison make flex libncurses5-dev tcsh pkg-config
apt-get install blcr-util blcr-testsuite libcr-dbg libcr-dev libcr0
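Optionally, the fresh MySQL server can be hardened with the standard interactive helper before SLURM starts using it:
mysql_secure_installation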
6. Download the latest version. (http://www.schedmd.com/#repos)
wget http://www.schedmd.com/download/latest/slurm-14.11.6.tar.bz2
7. Unpack and build SLURM.
tar xvf slurm-14.11.6.tar.bz2
cd slurm-14.11.6
./configure --enable-multiple-slurmd
make
make install
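Note that a source build, unlike the Debian packages, does not create the slurm system user that the daemons run as (SlurmUser=slurm in slurm.conf); if it is missing, something like the following creates it (adjust to local policy):
useradd -r -s /sbin/nologin slurm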
8. There are several ways to generate the slurm.conf file. SLURM provides a web-based configuration tool which can be used to build a simple configuration file, which can then be manually edited for more complex configurations. The tool is located in doc/html/configurator.html; this file can be opened in the browser if it is copied to /usr/share/ (sftp://ip-server/usr/share/doc/slurm/configurator.html).
mkdir /usr/share/doc/slurm/
cd [slurm-src]/doc/html/
cp configurator.* /usr/share/doc/slurm/
Another way is to copy the example configuration files out to /etc/slurm:
mkdir /etc/slurm
cd [slurm-src]
cp etc/slurm.conf.example /etc/slurm/slurm.conf
cp etc/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
Executing the command slurmd -C on each compute node will print its physical configuration (sockets, cores, real memory size, etc.), which can be used in constructing the slurm.conf file.
9. Set things up for slurmdbd (the SLURM accounting daemon) in MySQL.
mysql -u root -p
create database slurm_db;
create user 'slurm'@'localhost';
set password for 'slurm'@'localhost' = password('MyPassword');
grant usage on *.* to 'slurm'@'localhost';
grant all privileges on slurm_db.* to 'slurm'@'localhost';
flush privileges;
quit
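A minimal /etc/slurm/slurmdbd.conf consistent with the database created above might look like the following sketch (MyPassword is the placeholder password from the grant statements; adjust paths and credentials for your site):
AuthType=auth/munge
DbdHost=localhost
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=MyPassword
StorageLoc=slurm_db
PidFile=/var/run/slurmdbd.pid
LogFile=/var/log/slurmdbd.log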