====== Slurm ======

  * [[https://docs.rc.fas.harvard.edu/kb/convenient-slurm-commands|Convenient Slurm Commands]]
  * [[https://docs.rc.fas.harvard.edu/kb/fairshare|Harvard docs on their fairshare implementation]]
  * [[https://docs.rc.fas.harvard.edu/kb/dual-lab-affiliations-on-cluster/|Harvard Affiliations]]
  * [[https://docs.rc.fas.harvard.edu/kb/running-jobs/|Harvard Running Jobs]]

  * [[https://slurm.schedmd.com/SLUG19/Priority_and_Fair_Trees.pdf|Slurm Priority and Fair Tree presentation]]

  * [[https://github.com/edf-hpc/slurm-llnl-misc-plugins/blob/master/job_submit/job_submit.lua|Job submit override example]]
  * [[https://gist.github.com/mikerenfro/92d70562f9bb3f721ad1b221a1356de5|Job submit override example 2]]
  * [[https://git.webhosting.rug.nl/HPC/pg-playbooks/src/commit/b808fc56b82ecb6fecf28e0a3561f333a0ee4a1b/roles/slurm/files/job_submit.lua?lang=pl-PL|Job submit override example 3]]

  * [[https://slurm.schedmd.com/slurm.conf.html|man slurm.conf]]
  * [[https://slurm.schedmd.com/gres.html|man gres.conf]]
  * [[https://slurm.schedmd.com/srun.html|man srun]]
  * [[https://slurm.schedmd.com/sbatch.html|man sbatch]]
  * [[https://slurm.schedmd.com/priority_multifactor.html|Priority Multifactor Calculation]]

  * [[https://cac.queensu.ca/wiki/index.php/SLURM_Accounting|Hidden Partitions]]

===== Checkpointing =====
  * https://ubccr.freshdesk.com/support/solutions/articles/5000688796
  * https://github.com/dmtcp/dmtcp/blob/master/plugin/batch-queue/job_examples/ccr_buffalo/slurm_dmtcp_serial
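
The CCR Buffalo example linked above wraps a serial job in DMTCP so it can be checkpointed and later restarted. A rough sketch of that pattern is below; the application (./my_app), checkpoint interval, and time limit are placeholders, and it assumes dmtcp_launch and the generated dmtcp_restart_script.sh are available on the compute nodes.

<code>
#!/bin/bash
#SBATCH --job-name=dmtcp-example
#SBATCH --ntasks=1
#SBATCH --time=04:00:00

# Placeholder checkpoint directory and application
CKPT_DIR=$PWD/checkpoints
mkdir -p "$CKPT_DIR"

if [ -e "$CKPT_DIR/dmtcp_restart_script.sh" ]; then
    # A previous run left a checkpoint behind; resume from it
    bash "$CKPT_DIR/dmtcp_restart_script.sh"
else
    # First run: start the application under DMTCP, checkpointing every 30 minutes
    dmtcp_launch --interval 1800 --ckptdir "$CKPT_DIR" ./my_app
fi
</code>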

===== Web Status API =====
  * [[https://github.com/rbogle/slurm-web-api|Slurm Web API]]
  * [[https://github.com/edf-hpc/slurm-web|Slurm Web Monitor]]

====== Database Performance ======
  * https://slurm.schedmd.com/high_throughput.html
  * https://bugs.schedmd.com/show_bug.cgi?id=446

> We noticed sacct (in SLURM 2.6.1) is making unindexed queries[1] on job tables, which take several seconds on an installation with ~2M job_table rows, even after tuning mysqld.
>
> Adding a composite index across some of the more distinctive columns dropped query time to a few milliseconds:

<code>
ALTER TABLE ${clustername}_job_table ADD KEY `sacct` (`id_user`,`time_start`,`time_end`);
</code>

<code>
SET timestamp=1613686302;
select distinct t1.id_wckey, t1.is_def, t1.wckey_name, t1.user from "aicluster_wckey_table" as t1 where t1.deleted=0 && (t1.is_def=1) && (t1.user='kauffman3') order by wckey_name, user;
# User@Host: slurm[slurm] @  [172.20.0.3]
# Thread_id: 1318  Schema: slurmDB  QC_hit: No
# Query_time: 0.020722  Lock_time: 0.000069  Rows_sent: 0  Rows_examined: 2763
# Rows_affected: 2762  Bytes_sent: 60
#
# explain: id   select_type     table   type    possible_keys   key     key_len ref     rows    r_rows  filtered   r_filtered   Extra
# explain: 1    SIMPLE  aicluster_assoc_table   index   NULL    PRIMARY 4       NULL    2763    2763.00 100.00  99.96   Using where
</code>
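
To check whether the index from the bug report exists and is actually picked up, something like the following can be run against the accounting database (shown as slurmDB in the slow-query log above). This is a sketch: the aicluster table prefix and the user/time values in the EXPLAIN are placeholders.

<code>
# Does the composite index exist on the job table?
mysql slurmDB -e "SHOW INDEX FROM aicluster_job_table WHERE Key_name = 'sacct';"

# The sacct key should now show up under possible_keys/key instead of a full table scan
mysql slurmDB -e "EXPLAIN SELECT id_job FROM aicluster_job_table
                  WHERE id_user = 1000 AND time_start >= 1600000000 AND time_end <= 1620000000;"
</code>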


====== QOS ======

Create QOS:
<code>
root@fe01:~# sacctmgr -i add qos high set priority=1000
 Adding QOS(s)
  high
 Settings
  Description    = high
  Priority                 = 1000
</code>
<code>
root@fe01:~# sacctmgr -i add qos medium set priority=500
 Adding QOS(s)
  medium
 Settings
  Description    = medium
  Priority                 = 500
</code>
<code>
root@fe01:~# sacctmgr -i add qos low set priority=100
 Adding QOS(s)
  low
 Settings
  Description    = low
  Priority                 = 100
</code>

Create a group (Slurm account):

<code>
root@fe01:~# sacctmgr create account jonaslab
 Adding Account(s)
  jonaslab
 Settings
  Description     = Account Name
  Organization    = Parent/Account Name
 Associations
  A = jonaslab   C = aicluster
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
</code>

Set the QOS and default QOS on the account (these carry the priority values defined above):
<code>
root@fe01:~# sacctmgr -i modify account jonaslab set qos=low
 Modified account associations...
  C = aicluster  A = jonaslab of root
</code>
<code>
root@fe01:~# sacctmgr -i modify account jonaslab set defaultqos=low
 Modified account associations...
  C = aicluster  A = jonaslab of root
</code>
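
One way to confirm the association picked up the QOS (a sketch; the format fields can be adjusted as needed):
<code>
root@fe01:~# sacctmgr show assoc where account=jonaslab format=cluster,account,user,qos,defaultqos
</code>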


==== Add user account to group ====
Source: https://bugs.schedmd.com/show_bug.cgi?id=1613

This will give 'kauffman3' two user associations: one under its own default account and one under 'jonaslab'.

<code>
root@fe01:~# sacctmgr create user kauffman3 account=jonaslab
 Associations =
  U = kauffman3 A = jonaslab   C = aicluster
 Non Default Settings
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
</code>

<code>
root@fe01:~# sacctmgr show account withassoc kauffman3
   Account                Descr                  Org    Cluster   Par Name       User     Share   Priority GrpJobs GrpNodes  GrpCPUs  GrpMem GrpSubmit     GrpWall  GrpCPUMins MaxJobs MaxNodes  MaxCPUs MaxSubmit     MaxWall  MaxCPUMins                  QOS   Def QOS
---------- -------------------- -------------------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- ------- --------- ----------- ----------- ------- -------- -------- --------- ----------- ----------- -------------------- ---------
 kauffman3            kauffman3            kauffman3  aicluster       root                    1                                                                                                                                                          normal
 kauffman3            kauffman3            kauffman3  aicluster             kauffman3                                                                                                                                                                  normal
</code>
<code>
root@fe01:~# sacctmgr show account withassoc jonaslab
   Account                Descr                  Org    Cluster   Par Name       User     Share   Priority GrpJobs GrpNodes  GrpCPUs  GrpMem GrpSubmit     GrpWall  GrpCPUMins MaxJobs MaxNodes  MaxCPUs MaxSubmit     MaxWall  MaxCPUMins                  QOS   Def QOS
---------- -------------------- -------------------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- ------- --------- ----------- ----------- ------- -------- -------- --------- ----------- ----------- -------------------- ---------
  jonaslab             jonaslab             jonaslab  aicluster       root                    1                                                                                                                                                             low       low
  jonaslab             jonaslab             jonaslab  aicluster             kauffman3                                                                                                                                                                     low       low
</code>

The normal QOS is the default, with priority 0:
<code>
root@fe01:~# sacctmgr list qos Format=name,priority
      Name   Priority
---------- ----------
    normal          0
      high       1000
    medium        500
       low        100
</code>

Check the QOS on a submitted job:
<code>
kauffman3@fe01:~/examples$ sacct -j 381 --format=JobID,JobName,MaxRSS,Elapsed,Qos
       JobID    JobName     MaxRSS    Elapsed        QOS
------------ ---------- ---------- ---------- ----------
381          two_gpu_p+              00:00:31        low
381.batch         batch      4092K   00:00:31
381.extern       extern          0   00:00:32
381.0              bash       520K   00:00:31
</code>
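
To see the computed priority value itself rather than the QOS, sprio reports it while a job is still pending, and squeue can print a priority column for queued jobs. Job 381 is just the example from above; any pending job ID works.
<code>
kauffman3@fe01:~$ sprio -j 381
kauffman3@fe01:~$ squeue -j 381 -o "%.10i %.10Q %.8q %.9P %.8u"
</code>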


===== Containers =====
  * https://containers-at-tacc.readthedocs.io/en/latest/singularity/03.mpi_and_gpus.html#message-passing-interface-mpi-for-running-on-multiple-nodes
  * https://containers-at-tacc.readthedocs.io/en/latest/singularity/02.singularity_batch.html#how-do-hpc-systems-fit-into-the-development-workflow
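
In the spirit of the TACC batch examples above, a minimal sbatch script that runs a command inside a Singularity image could look like the following; the image path, partition, GPU request, and command are placeholders for whatever the local cluster provides.
<code>
#!/bin/bash
#SBATCH --job-name=singularity-example
#SBATCH --partition=general
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00

# --nv maps the host NVIDIA driver and libraries into the container
singularity exec --nv /path/to/image.sif python3 train.py
</code>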


==== Rootless Docker discussion for HPC ====

> TACC hasn't solved this problem either:
>
> https://containers-at-tacc.readthedocs.io/en/latest/singularity/02.singularity_batch.html#how-do-hpc-systems-fit-into-the-development-workflow
>
> Additionally, for their Slurm cluster they use Singularity and not Docker. The `build` portion for the Docker container is expected to happen elsewhere.
>
> https://containers-at-tacc.readthedocs.io/en/latest/singularity/03.mpi_and_gpus.html#singularity-and-gpu-computing
>
> Based on the little I know about Singularity, it was meant to be run on HPC clusters, so I don't think we'll have a problem deploying it everywhere.
>
> Phil

> On 2/9/21 9:37 AM, :
>> OK, this is great, thank you for looking into this so much.
>>
>> Phil, I think your "round-peg square-hole" comment might be correct, but
>> this is also the world we have woken up in.
>>
>> Podman might actually work, although I'm vaguely worried that they appear
>> to use a version of FUSE for their non-root userspace filesystem IO, which
>> may be a performance nightmare.
>>
>> Heavily-multiuser systems like TACC (NSF supercomputer) and ALCF (Argonne)
>> are increasingly adopting containers for end users:
>> https://containers-at-tacc.readthedocs.io/en/latest/
>>
>> I believe the "river" cluster here at UChicago (run by physics) also
>> supports running containers.
>>
>> I'm still trying to figure out where the security contours lie between
>> "building" the container and "running" the container. For example,
>> cluster-level support for running containers (but not building them) could
>> conceivably be OK. This might be what TACC et al. are doing.
>>
>> I'm willing to table this for a bit, but let's be sure to revisit. I'll ask
>> Kyle what the River people are doing.


The conversation then turned to building Docker images for different architectures.

==== Building an amd64 Docker image on an ARM M1 MacBook Air ====

> https://docs.docker.com/docker-for-mac/apple-m1/
>
> On my M1 MacBook Air:
>
> Find the digest entry for amd64:
> m1$ docker manifest inspect ubuntu:20.04
>
> m1$ docker run -it docker.io/library/ubuntu:20.04@sha256:3093096ee188f8ff4531949b8f6115af4747ec1c58858c091c8cb4579c39cc4e
> uname -a
> WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
> root@afc7a92aafeb:/# uname -a
> Linux afc7a92aafeb 4.19.104-linuxkit #1 SMP PREEMPT Sat Feb 15 00:49:47 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
>
>
> https://docs.docker.com/docker-for-mac/multi-arch/
>
> Basically this:
> m1$ docker buildx build --platform linux/amd64 .
>
> I've built a container that Techstaff uses to deploy the `chisubmit` client to linux.cs (amd64) on my M1 MacBook (arm64).
>
>
> Export the container:
> m1$ docker save -o ubuntu-20.04-chisubmit-2.1.0.tar docker.io/techstaff/ubuntu-20.04-chisubmit:2.1.0
>
> Go to an amd64 machine and import it. Using Podman just to make this harder.
> amd64-machine $ podman load < ubuntu-20.04-chisubmit-2.1.0.tar
>
> amd64-machine $ podman image ls
> REPOSITORY                                  TAG     IMAGE ID      CREATED        SIZE
> localhost/techstaff/ubuntu-20.04-chisubmit  2.1.0   05989787458d  5 minutes ago  628 MB
>
> amd64-machine $ podman run -it 05989787458d /bin/bash
> root@718a5928bc4a:/# uname -a
> Linux 718a5928bc4a 5.8.0-36-generic #40+21.04.1-Ubuntu SMP Thu Jan 7 11:35:09 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
>
> I tried using the repo name to run the image but it didn't work. Not sure why at the moment.
>
> Phil

>> This is going to be really interesting going forward when most scientific
>> users are no longer going to have the ability to build containers on their
>> laptops due to architectural issues. Sigh.
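
Putting the pieces of the thread above together, the cross-build-and-ship workflow comes out roughly as below. This is a sketch: the image name and tag are the ones from the email, --load assumes a reasonably recent buildx, and the localhost/ prefix on the podman run line matches how the image shows up in podman image ls above (which may be why the bare repo name did not resolve).
<code>
# On the M1 (arm64) laptop: build an amd64 image and keep it in the local image store
m1$ docker buildx build --platform linux/amd64 \
      -t techstaff/ubuntu-20.04-chisubmit:2.1.0 --load .

# Export it to a tarball and copy that to the amd64 host
m1$ docker save -o ubuntu-20.04-chisubmit-2.1.0.tar techstaff/ubuntu-20.04-chisubmit:2.1.0

# On the amd64 host: import and run it (Podman prefixes loaded images with localhost/)
amd64-machine $ podman load < ubuntu-20.04-chisubmit-2.1.0.tar
amd64-machine $ podman run -it localhost/techstaff/ubuntu-20.04-chisubmit:2.1.0 /bin/bash
</code>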