Slurm Installation on Debian

From Supercomputación y Cálculo Científico UIS
Revision as of 18:40, 6 May 2015 by Ltorres (talk | contribs)



Slurm Installation

This section describes the administration tasks needed to install the Slurm Workload Manager on the frontend node (server) and on the compute nodes (clients).

Slurm Installation on Debian from repositories

  1. Make sure the clocks, users and groups (UIDs and GIDs) are synchronized across the cluster.
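
MUNGE credentials carry a time-to-live, so a large clock difference between machines can make decoding fail. The following is a minimal sketch for spot-checking clock skew; the helper name clock_skew and the node names in the usage comment are hypothetical, and root SSH access to the nodes is assumed.

```shell
# clock_skew: print the absolute difference, in seconds, between two
# epoch timestamps (hypothetical helper, not part of Slurm or MUNGE).
clock_skew() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    echo "$diff"
}

# Usage sketch (node names are placeholders):
#   for node in guane-1 guane-2; do
#       echo "$node: $(clock_skew "$(date +%s)" "$(ssh "$node" date +%s)") s"
#   done
```

The SSH round trip adds a little delay, so differences of a second or two are expected.
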
  2. Install MUNGE

    apt-get install -y libmunge-dev libmunge2 munge

  3. Generate MUNGE key. There are various ways to do this, depending on the desired level of key quality. Refer to the MUNGE installation guide for complete details.

    Wait around for some random data (recommended for the paranoid):

    dd if=/dev/random bs=1 count=1024 > /etc/munge/munge.key

    Grab some pseudorandom data (recommended for the impatient):

    dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key

    Permissions

    chown munge:munge /etc/munge/munge.key
    chmod 400 /etc/munge/munge.key
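
The key generation and permission steps can be combined so that the key is never readable by other users, even briefly. This is a sketch only: the function name make_key is hypothetical, and the path is passed as an argument so it can be tried on a scratch file first.

```shell
# make_key: write 1024 bytes of pseudorandom data to the given path and
# leave it readable by its owner only (hypothetical helper).
make_key() {
    key_path=$1
    (
        umask 077    # the new file is created without group/world access
        dd if=/dev/urandom of="$key_path" bs=1 count=1024 2>/dev/null
    ) || return 1
    chmod 400 "$key_path"
}

# Usage sketch (run as root):
#   make_key /etc/munge/munge.key && chown munge:munge /etc/munge/munge.key
```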

  4. Edit file /etc/passwd

    vi /etc/passwd

    Modify the munge user entry on each machine so that its UID and GID are the same across the cluster

    File: /etc/passwd
    munge:x:501:501::/var/run/munge:/sbin/nologin
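
To confirm that the munge entry really is identical everywhere, the UID and GID fields can be compared across machines. A minimal sketch, assuming root SSH access; passwd_ids is a hypothetical helper and the node names are placeholders.

```shell
# passwd_ids: read passwd-format lines on stdin and print "UID:GID" for
# the named user (hypothetical helper). Prints nothing if the user is
# not found.
passwd_ids() {
    awk -F: -v user="$1" '$1 == user { print $3 ":" $4 }'
}

# Usage sketch (node names are placeholders):
#   ref=$(getent passwd munge | passwd_ids munge)
#   for node in guane-1 guane-2; do
#       ids=$(ssh "$node" getent passwd munge | passwd_ids munge)
#       [ "$ids" = "$ref" ] || echo "munge UID/GID mismatch on $node: $ids (expected $ref)"
#   done
```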

  5. Start MUNGE

    /etc/init.d/munge start

  6. Testing Munge

    The following steps can be performed to verify that the software has been properly installed and configured:

    Generate a credential on stdout:

    munge -n

    Check if a credential can be locally decoded:

    munge -n | unmunge

    Check if a credential can be remotely decoded:

    munge -n | ssh somehost unmunge

    Run a quick benchmark:

    remunge
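
The remote decoding check above can be repeated for every node in one pass. This is a sketch under the assumption that SSH access to each node already works; check_munge_hosts is a hypothetical name and the host names are placeholders.

```shell
# check_munge_hosts: encode a credential locally and try to decode it on
# each host given as an argument. Prints "OK host" or "FAIL host" and
# returns non-zero if any host failed (hypothetical helper).
check_munge_hosts() {
    rc=0
    for host in "$@"; do
        if munge -n | ssh "$host" unmunge >/dev/null 2>&1; then
            echo "OK $host"
        else
            echo "FAIL $host"
            rc=1
        fi
    done
    return $rc
}

# Usage sketch:
#   check_munge_hosts guane-1 guane-2
```

A failure here usually points at a key that differs between machines, a munged daemon that is not running, or clocks that are too far apart.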

  7. Install SLURM from repositories

    apt-get install -y slurm-wlm slurm-wlm-doc

  8. Create and copy slurm.conf

    There are several ways to generate the slurm.conf file. Slurm ships a web-based configuration tool that builds a simple configuration file, which can then be edited by hand for more complex configurations. The tool is located at /usr/share/doc/slurmctld/slurm-wlm-configurator.html

    Open this file in your browser

    sftp://ip-server/usr/share/doc/slurmctld/slurm-wlm-configurator.html

    NOTE: Executing the command slurmd -C on each compute node will print its physical configuration (sockets, cores, real memory size, etc.), which can be used in constructing the slurm.conf file.
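
The node descriptions can be collected from all compute nodes in one pass. The sketch below assumes that slurmd -C prints a line starting with NodeName= followed by an UpTime= line; node_line is a hypothetical helper and the node names are placeholders.

```shell
# node_line: keep only the NodeName=... line from `slurmd -C` output read
# on stdin, dropping any other lines (hypothetical helper).
node_line() {
    grep '^NodeName='
}

# Usage sketch (node names are placeholders):
#   for node in guane-1 guane-2; do
#       ssh "$node" slurmd -C | node_line
#   done > nodes.txt
```

The collected lines can then be reviewed and pasted into the COMPUTE NODES part of slurm.conf.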

    Copy the output of the web-based configuration tool into /etc/slurm/slurm.conf and adjust it so that it resembles the following. This is an example; build a configuration file customized for your environment. See http://slurm.schedmd.com/slurm.conf.html

    File: /etc/slurm/slurm.conf
    #
    # slurm.conf file generated by configurator.html.
    #
    # See the slurm.conf man page for more information.
    #
    ClusterName=GUANE
    ControlMachine=guane
    #
    SlurmUser=slurm
    SlurmctldPort=6817
    SlurmdPort=6818
    AuthType=auth/munge
    StateSaveLocation=/tmp
    SlurmdSpoolDir=/var/spool/slurm/slurmd
    SwitchType=switch/none
    MpiDefault=none
    SlurmctldPidFile=/var/run/slurmctld.pid
    SlurmdPidFile=/var/run/slurmd.pid
    ProctrackType=proctrack/pgid
    CacheGroups=0
    ReturnToService=1
    #
    # TIMERS
    SlurmctldTimeout=300
    SlurmdTimeout=300
    InactiveLimit=0
    MinJobAge=300
    KillWait=30
    Waittime=0
    #
    # SCHEDULING
    SchedulerType=sched/backfill
    SelectType=select/linear
    FastSchedule=1
    #
    # LOGGING
    SlurmctldDebug=3
    SlurmdDebug=3
    JobCompType=jobcomp/none
    JobCompLoc=/tmp/slurm_job_completion.txt
    #
    # ACCOUNTING
    JobAcctGatherType=jobacct_gather/linux
    JobAcctGatherFrequency=30
    #
    #AccountingStorageType=accounting_storage/slurmdbd
    #AccountingStorageHost=slurm
    #AccountingStorageLoc=/tmp/slurm_job_accounting.txt
    #AccountingStoragePass=
    #AccountingStorageUser=
    #
    # COMPUTE NODES
    # control node
    NodeName=guane NodeAddr=192.168.1.70 Port=17000 State=UNKNOWN
    
    # each logical node is on the same physical node, so we need different ports for them
    # name guane-[*] is arbitrary
    NodeName=guane-1 NodeAddr=192.168.1.71 Port=17002 State=UNKNOWN
    NodeName=guane-2 NodeAddr=192.168.1.72 Port=17003 State=UNKNOWN
    
    # PARTITIONS
    # partition name is arbitrary
    PartitionName=guane Nodes=guane-[1-2] Default=YES MaxTime=8-00:00:00 State=UP
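
The step title mentions copying slurm.conf, and every node needs the same file. The following sketch reads the node names from the file itself and prints the scp commands as a dry run, so the list can be reviewed before anything is copied. The helper names are hypothetical, and root SSH access to the nodes is assumed.

```shell
# conf_nodes: print the node names defined by NodeName= lines in a
# slurm.conf-style file. Bracketed ranges such as node[1-4] are printed
# unexpanded; the sketch expects one name per NodeName= line.
conf_nodes() {
    awk '/^[ \t]*NodeName=/ { sub(/^[ \t]*NodeName=/, ""); print $1 }' "$1"
}

# print_copy_commands: dry run that prints, rather than executes, one scp
# command per node found in the file.
print_copy_commands() {
    conf=$1
    for node in $(conf_nodes "$conf"); do
        echo "scp $conf root@$node:$conf"
    done
}

# Usage sketch:
#   print_copy_commands /etc/slurm/slurm.conf
```

With the example file above, the output also includes the control node's own entry, which can simply be skipped.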
    

  9. Install MUNGE on each node of the cluster

    apt-get install -y libmunge-dev libmunge2 munge

  10. Copy the munge.key file from the server to each node of the cluster

    scp /etc/munge/munge.key root@node:/etc/munge/munge.key
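
After the copy, the key may not have the owner and mode that munged expects on the node. A sketch that copies the key, restores the owner and mode, and restarts MUNGE on the node; push_key is a hypothetical helper and root SSH access is assumed.

```shell
# push_key: copy the server's key to one node, then restore the owner and
# mode and restart munged there (hypothetical helper).
push_key() {
    node=$1
    key=/etc/munge/munge.key
    scp "$key" "root@$node:$key" &&
    ssh "root@$node" "chown munge:munge $key && chmod 400 $key && /etc/init.d/munge restart"
}

# Usage sketch (node names are placeholders):
#   for node in guane-1 guane-2; do push_key "$node"; done
```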

  11. Install SLURM compute node daemon

    apt-get install -y slurmd

  12. Start Slurm: slurmd on the compute nodes and slurmctld on the server

    On each compute node:

    /etc/init.d/slurmd start

    On the server:

    /etc/init.d/slurmctld start
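
Once both daemons are running, the node states reported by sinfo give a quick health check. The sketch below filters sinfo output for nodes that are not in a usable state; unhealthy_nodes is a hypothetical helper, and the set of states treated as healthy (idle, alloc, mix) is an assumption to adjust as needed.

```shell
# unhealthy_nodes: read "hostname state" pairs on stdin and print those
# whose state is not exactly idle, alloc, or mix. A state with a trailing
# "*" (node not responding) is therefore reported too.
unhealthy_nodes() {
    awk '$2 !~ /^(idle|alloc|mix)$/ { print $1 " " $2 }'
}

# Usage sketch:
#   sinfo -h -N -o "%n %t" | unhealthy_nodes
```

An empty result suggests that every node registered with the controller. Running a trivial job, for example srun -N2 hostname, is a further functional test.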


Slurm Installation on Debian from source


CONTROLLER CONFIGURATION

http://wildflower.diablonet.net/~scaron/slurmsetup.html

  1. Prerequisites

    apt-get install -y build-essential

  2. Install MUNGE

    apt-get install -y libmunge-dev libmunge2 munge

  3. Generate MUNGE key. There are various ways to do this, depending on the desired level of key quality. Refer to the MUNGE installation guide for complete details.

    Wait around for some random data (recommended for the paranoid):

    dd if=/dev/random bs=1 count=1024 > /etc/munge/munge.key

    Grab some pseudorandom data (recommended for the impatient):

    dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key

    Permissions

    chown munge:munge /etc/munge/munge.key
    chmod 400 /etc/munge/munge.key

  4. Edit file /etc/passwd

    vi /etc/passwd

    Modify the munge user entry on each machine so that its UID and GID are the same across the cluster

    File: /etc/passwd
    munge:x:501:501::/var/run/munge:/sbin/nologin

  5. Start MUNGE

    /etc/init.d/munge start

  6. Testing Munge

    The following steps can be performed to verify that the software has been properly installed and configured:

    Generate a credential on stdout:

    munge -n

    Check if a credential can be locally decoded:

    munge -n | unmunge

    Check if a credential can be remotely decoded:

    munge -n | ssh somehost unmunge

    Run a quick benchmark:

    remunge

  7. Install MySQL server (for SLURM accounting) and development tools (to build SLURM). We'll also install the BLCR tools so that SLURM can take advantage of that checkpoint-and-restart functionality.

    apt-get install mysql-server libmysqlclient-dev libmysqld-dev libmysqld-pic
    apt-get install gcc bison make flex libncurses5-dev tcsh pkg-config
    apt-get install blcr-util blcr-testsuite libcr-dbg libcr-dev libcr0

  8. Download the latest version (http://www.schedmd.com/#repos)

    wget http://www.schedmd.com/download/latest/slurm-14.11.6.tar.bz2

  9. Unpack and build SLURM

    tar xvf slurm-14.11.6.tar.bz2
    cd slurm-14.11.6
    ./configure --enable-multiple-slurmd
    make
    make install

  10. Generate the slurm.conf file. There are several ways to do this. Slurm ships a web-based configuration tool that builds a simple configuration file, which can then be edited by hand for more complex configurations. The tool is located in doc/html/configurator.html; it can be opened in a browser once it is copied to /usr/share/doc/slurm/ (sftp://ip-server/usr/share/doc/slurm/configurator.html).

    mkdir /usr/share/doc/slurm/
    cd [slurm-src]/doc/html/
    cp configurator.* /usr/share/doc/slurm/

    Another way is to copy the example configuration files to /etc/slurm.

    mkdir /etc/slurm
    cd [slurm-src]
    cp etc/slurm.conf.example /etc/slurm/slurm.conf
    cp etc/slurmdbd.conf.example /etc/slurm/slurmdbd.conf

    Executing the command slurmd -C on each compute node will print its physical configuration (sockets, cores, real memory size, etc.), which can be used in constructing the slurm.conf file.

  11. Set things up for slurmdbd (the SLURM accounting daemon) in MySQL

    mysql -u root -p
    create database slurm_db;
    create user 'slurm'@'localhost';
    set password for 'slurm'@'localhost' = password('MyPassword');
    grant usage on *.* to 'slurm'@'localhost';
    grant all privileges on slurm_db.* to 'slurm'@'localhost';
    flush privileges;
    quit
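
The database, user and password created in MySQL have to be mirrored in slurmdbd.conf. The following sketch writes a minimal file whose storage settings match the MySQL statements in this guide. The function name write_slurmdbd_conf is hypothetical, and the values localhost, auth/munge and SlurmUser=slurm are assumptions to adapt to the actual setup.

```shell
# write_slurmdbd_conf: write a minimal slurmdbd.conf to the given path.
# StorageUser, StoragePass and StorageLoc mirror the MySQL user, password
# and database from this guide; the remaining values are assumptions.
write_slurmdbd_conf() {
    (
        umask 077    # the file contains the database password
        cat > "$1" <<'EOF'
AuthType=auth/munge
DbdHost=localhost
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=MyPassword
StorageLoc=slurm_db
EOF
    )
}

# Usage sketch (overwrites the example file copied earlier):
#   write_slurmdbd_conf /etc/slurm/slurmdbd.conf
```

For slurmctld to use the accounting daemon, slurm.conf would also need AccountingStorageType=accounting_storage/slurmdbd, which appears commented out in the example configuration earlier in this page.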

COMPUTE NODE CONFIGURATION


  1. Install MUNGE on each node of the cluster

    apt-get install -y libmunge-dev libmunge2 munge

  2. Copy the munge.key file from the server to each node of the cluster

    scp /etc/munge/munge.key root@node:/etc/munge/munge.key
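
To confirm that every node ended up with the same key as the server, the file digests can be compared. A minimal sketch, assuming sha256sum is available on all machines and root SSH access works; key_digest is a hypothetical helper and the node names are placeholders.

```shell
# key_digest: print only the SHA-256 digest of a file, without the file
# name that sha256sum appends (hypothetical helper).
key_digest() {
    sha256sum "$1" | awk '{ print $1 }'
}

# Usage sketch (node names are placeholders):
#   ref=$(key_digest /etc/munge/munge.key)
#   for node in guane-1 guane-2; do
#       remote=$(ssh "root@$node" sha256sum /etc/munge/munge.key | awk '{ print $1 }')
#       [ "$remote" = "$ref" ] || echo "munge.key differs on $node"
#   done
```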