public:it:hpc:slurm [2021/07/28 18:31] – phil | public:it:hpc:slurm [Unknown date] (current) – removed - external edit (Unknown date) 127.0.0.1 | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Slurm ====== | ||
- | * [[https:// | ||
- | * [[https:// | ||
- | * [[https:// | ||
- | * [[https:// | ||
- | |||
- | * [[https:// | ||
- | |||
- | * [[https:// | ||
- | * [[https:// | ||
- | * [[https:// | ||
- | |||
- | |||
- | * [[https:// | ||
- | * [[https:// | ||
- | * [[https:// | ||
- | * [[https:// | ||
- | * [[https:// | ||
- | |||
- | * [[https:// | ||
- | |||
- | * [[https:// | ||
- | * [[https:// | ||
- | * [[https:// | ||
- | |||
- | * [[https:// | ||
- | |||
- | ===== Checkpointing ===== | ||
- | * https:// | ||
- | * https:// | ||
- | |||
- | ===== Web status API ===== | ||
- | * [[https:// | ||
- | * [[https:// | ||
- | |||
- | |||
- | ====== Database Performance ====== | ||
- | * https:// | ||
- | * https:// | ||
- | |||
- | > We noticed sacct (in SLURM 2.6.1) is making unindexed queries[1] on job tables, which take several seconds on an installation with ~2M job_table rows, even after tuning mysqld. | ||
- | > | ||
- | > Adding a composite index across some of the more distinctive columns dropped query time to a few milliseconds: | ||
- | |||
- | < | ||
- | ALTER TABLE ${clustername}_job_table ADD KEY `sacct` (`id_user`, | ||
- | </ | ||
- | |||
- | < | ||
- | SET timestamp=1613686302; | ||
- | select distinct t1.id_wckey, | ||
- | # User@Host: slurm[slurm] @ [172.20.0.3] | ||
- | # Thread_id: 1318 Schema: slurmDB | ||
- | # Query_time: 0.020722 | ||
- | # Rows_affected: | ||
- | # | ||
- | # explain: id | ||
- | r_filtered | ||
- | # explain: 1 SIMPLE | ||
- | </ | ||
- | |||
- | |||
- | |||
- | ====== QOS ====== | ||
- | |||
- | Create QOS: | ||
- | < | ||
- | root@fe01: | ||
- | | ||
- | high | ||
- | | ||
- | Description | ||
- | Priority | ||
- | </ | ||
- | < | ||
- | root@fe01: | ||
- | | ||
- | medium | ||
- | | ||
- | Description | ||
- | Priority | ||
- | </ | ||
- | < | ||
- | root@fe01: | ||
- | | ||
- | low | ||
- | | ||
- | Description | ||
- | Priority | ||
- | </ | ||
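The three QOS creations above could be scripted in one pass. This is only a rough sketch: it assumes `sacctmgr add qos` with `-i` to skip the commit prompt, the priority numbers are placeholders (the values actually used are not shown here), and the `run` and `create_qos` helpers are made up for illustration. By default it only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Sketch: create the high/medium/low QOS levels in one pass.
# The priority numbers are hypothetical placeholders, not the site's values.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

create_qos() {
    # $1 = QOS name, $2 = priority (assumed value)
    run sacctmgr -i add qos "$1" Priority="$2"
}

create_all_qos() {
    create_qos high 100
    create_qos medium 50
    create_qos low 10
}

create_all_qos
```

Keeping the dry-run default makes it easy to review the exact commands before touching the accounting database.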
- | |||
- | Create group: | ||
- | |||
- | < | ||
- | root@fe01: | ||
- | | ||
- | jonaslab | ||
- | | ||
- | Description | ||
- | Organization | ||
- | | ||
- | A = jonaslab | ||
- | Would you like to commit changes? (You have 30 seconds to decide) | ||
- | (N/y): y | ||
- | </ | ||
- | |||
- | Set priority and default priority: | ||
- | < | ||
- | root@fe01: | ||
- | | ||
- | C = aicluster | ||
- | </ | ||
- | < | ||
- | root@fe01: | ||
- | | ||
- | C = aicluster | ||
- | </ | ||
- | |||
- | |||
- | ==== Add user account to group ==== | ||
- | Source: https:// | ||
- | |||
- | This will give ' | ||
- | |||
- | < | ||
- | root@fe01: | ||
- | | ||
- | U = kauffman3 A = jonaslab | ||
- | Non Default Settings | ||
- | Would you like to commit changes? (You have 30 seconds to decide) | ||
- | (N/y): y | ||
- | </ | ||
- | |||
- | < | ||
- | root@fe01: | ||
- | | ||
- | ---------- -------------------- -------------------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- ------- --------- ----------- ----------- ------- -------- -------- --------- ----------- ----------- -------------------- --------- | ||
- | | ||
- | | ||
- | </ | ||
- | < | ||
- | root@fe01: | ||
- | | ||
- | ---------- -------------------- -------------------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- ------- --------- ----------- ----------- ------- -------- -------- --------- ----------- ----------- -------------------- --------- | ||
- | jonaslab | ||
- | jonaslab | ||
- | </ | ||
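A non-interactive version of the add-user step might look like the sketch below. It assumes `sacctmgr -i` (which skips the "commit changes?" prompt seen above) and a `show assoc` query with a `format=` list chosen here for illustration; the listing commands actually used above are not reproduced. The helper names are hypothetical, and the script prints commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch: attach a user to an account, then list the resulting association.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

add_user_to_account() {
    # $1 = user name, $2 = account name
    run sacctmgr -i add user "$1" Account="$2"
    # The format list below is an assumption, picked to show the QOS column.
    run sacctmgr show assoc user="$1" format=Cluster,Account,User,QOS
}

add_user_to_account kauffman3 jonaslab
```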
- | |||
- | ''normal'' is the default priority: | ||
- | < | ||
- | root@fe01: | ||
- | Name | ||
- | ---------- ---------- | ||
- | normal | ||
- | high | ||
- | medium | ||
- | | ||
- | </ | ||
- | |||
- | Check priority on a submitted job: | ||
- | < | ||
- | kauffman3@fe01: | ||
- | | ||
- | ------------ ---------- ---------- ---------- ---------- | ||
- | 381 two_gpu_p+ | ||
- | 381.batch | ||
- | 381.extern | ||
- | 381.0 bash | ||
- | </ | ||
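To repeat this check for other jobs, something like the following sketch could be used. The `--format` field list is an assumption (the exact columns of the command above are not shown); `JobID`, `JobName`, `QOS`, and `Priority` are standard `sacct` fields. The `job_priority` helper is hypothetical, and the script prints the command unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch: show the QOS and priority recorded for one job.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

job_priority() {
    # $1 = job ID; the field list is an assumed choice of columns.
    run sacct -j "$1" --format=JobID,JobName,QOS,Priority
}

job_priority 381
```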
- | |||
- | |||
- | ===== Containers ===== | ||
- | https:// | ||
- | https:// | ||
- | |||
- | |||
- | |||
- | ==== Rootless Docker discussion for HPC ==== | ||
- | |||
- | > TACC hasn't solved this problem either: | ||
- | > | ||
- | > | ||
- | > | ||
- | > | ||
- | > | ||
- | > https:// | ||
- | > | ||
- | > Based on the little I know about Singularity, it was meant to be run on HPC clusters, so I don't think we'll have a problem deploying it everywhere. | ||
- | > | ||
- | >Phil | ||
- | |||
- | >On 2/9/21 9:37 AM, : | ||
- | >> Ok this is great, thank you for looking into this so much. | ||
- | >> | ||
- | >> Phil I think your " | ||
- | >> this is also the world we have woken up in. | ||
- | >> | ||
- | >> Podman might actually work, although I'm vaguely worried that they appear | ||
- | >> to use a version of fuse for their non-root userspace filesystem IO, which | ||
- | >> may be a performance nightmare. | ||
- | >> | ||
- | >> Heavily-multiuser systems like TACC (NSF supercomputer) and ALCF (Argonne) | ||
- | >> are increasingly adopting containers for end: | ||
- | >> https:// | ||
- | >> | ||
- | >> I believe the " | ||
- | >> supports running containers. | ||
- | >> | ||
- | >> I'm still trying to figure out where the security contours lie between | ||
- | >> " | ||
- | >> cluster-level support for running containers (but not building them) could | ||
- | >> conceivably be ok. This might be what TACC et al are doing. | ||
- | >> | ||
- | >> I'm willing to table this for a bit, but let's be sure to revisit. I'll ask | ||
- | >> Kyle what the River people are doing. | ||
- | |||
- | |||
- | The conversation then turned to building Docker images for different architectures. | ||
- | |||
- | ==== Building an amd64 Docker image on an ARM M1 MacBook Air ==== | ||
- | |||
- | > | ||
- | > | ||
- | >On my M1 MacBook Air: | ||
- | > | ||
- | >Find the digest entry for amd64 | ||
- | >m1$ docker manifest inspect ubuntu: | ||
- | > | ||
- | >m1$ docker run -it docker.io/ | ||
- | >uname -a | ||
- | > | ||
- | > | ||
- | >Linux afc7a92aafeb 4.19.104-linuxkit #1 SMP PREEMPT Sat Feb 15 00:49:47 UTC 2020 x86_64 x86_64 x86_64 > | ||
- | > | ||
- | > | ||
- | > | ||
- | > | ||
- | > | ||
- | >m1$ docker buildx build --platform linux/amd64 . | ||
- | > | ||
- | >I’ve built a container that Techstaff uses to deploy the `chisubmit` client to linux.cs (amd64) on my M1 MacBook (arm64). | ||
- | > | ||
- | > | ||
- | >Export the container: | ||
- | >m1$ docker save -o ubuntu-20.04-chisubmit-2.1.0.tar docker.io/ | ||
- | > | ||
- | >Go to an AMD64 machine and import it. Using Podman just to make this harder. | ||
- | > | ||
- | > | ||
- | > | ||
- | > | ||
- | > | ||
- | > | ||
- | > | ||
- | > | ||
- | >Linux 718a5928bc4a 5.8.0-36-generic # | ||
- | > | ||
- | >I tried using the repo name to run the image but it didn’t work. Not sure why at the moment. | ||
- | > | ||
- | >Phil | ||
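The workflow described in the email (cross-build on arm64, export, load with Podman on amd64) can be summarized as the sketch below. The image tag `example/chisubmit:amd64` is a hypothetical stand-in because the real tag is not shown; the tarball name is the one from the email. `--load` is assumed here so the built image lands in the local Docker image store before `docker save`. The script prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch: cross-build an amd64 image on an arm64 host, export it, and
# load it with Podman on an amd64 machine. IMAGE is a hypothetical tag.
DRY_RUN="${DRY_RUN:-1}"
IMAGE="${IMAGE:-example/chisubmit:amd64}"
TARBALL="${TARBALL:-ubuntu-20.04-chisubmit-2.1.0.tar}"

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

build_and_export() {
    # Run on the arm64 (M1) machine.
    run docker buildx build --platform linux/amd64 -t "$IMAGE" --load .
    run docker save -o "$TARBALL" "$IMAGE"
}

import_and_check() {
    # Run on the amd64 machine after copying the tarball over.
    run podman load -i "$TARBALL"
    run podman run --rm "$IMAGE" uname -m
}

build_and_export
import_and_check
```

Since the email reports that running the image by repo name did not work after the import, the last step may need the image ID from `podman images` instead of the tag.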
- | |||
- | |||
- | >> This is going to be really interesting going forward when most scientific | ||
- | >> users are no longer going to have the ability to build containers on their | ||
- | >> laptops due to architectural issues. Sigh. | ||
- | |||
- | |||
- | |||
- | |||
- | |||
- | ====== Optimize DB ====== | ||
- | < | ||
- | ALTER TABLE `aicluster_assoc_table` ADD INDEX `aicluster_assoc_ta_idx_rgt` (`rgt`); | ||
- | </ | ||
- | < | ||
- | ALTER TABLE `aicluster_assoc_table` ADD INDEX `aicluster_assoc_ta_idx_lft` (`lft`); | ||
- | </ | ||
- | < | ||
- | SHOW INDEX FROM aicluster_assoc_table FROM slurmDB; | ||
- | SELECT DISTINCT TABLE_NAME, INDEX_NAME FROM INFORMATION_SCHEMA.STATISTICS WHERE TABLE_SCHEMA = ' | ||
- | </ |
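For a cluster with a different name, the two `ALTER TABLE` statements above can be generated rather than hand-edited. A minimal sketch, assuming the same table and index naming pattern holds for other clusters; the `assoc_index_sql` helper name is made up for illustration.

```shell
#!/bin/sh
# Sketch: emit the lft/rgt index statements for one cluster's assoc table.
# The naming pattern is copied from the aicluster statements above.
assoc_index_sql() {
    # $1 = cluster name as used in the table prefix
    cluster="$1"
    for col in rgt lft; do
        printf 'ALTER TABLE `%s_assoc_table` ADD INDEX `%s_assoc_ta_idx_%s` (`%s`);\n' \
            "$cluster" "$cluster" "$col" "$col"
    done
}

assoc_index_sql aicluster
```

The output could then be reviewed and piped into the database client, for example `assoc_index_sql aicluster | mysql slurmDB`, assuming credentials are already configured for the `slurmDB` schema.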