WeeklyTelcon_20210105
    - Dialup Info: (Do not post to public mailing list or public wiki)
 
Attendees:

- Aurelien Bouteiller (UTK)
- Austen Lauria (IBM)
- David Bernholdt (ORNL)
- Edgar Gabriel (UH)
- Geoffrey Paulsen (IBM)
- Harumi Kuno (HPE)
- Hessam Mirsadeghi (UCX/NVIDIA)
- Howard Pritchard (LANL)
- Jeff Squyres (Cisco)
- Joseph Schuchart
- Josh Hursey (IBM)
- Matthew Dosanjh (Sandia)
- Thomas Naughton III (ORNL)
- Ralph Castain (Intel)
- Todd Kordenbrock (Sandia)
- William Zhang (AWS)
 
Not attending (regular attendees):

- Akshay Venkatesh (NVIDIA)
- Artem Polyakov (NVIDIA/Mellanox)
- Brian Barrett (AWS)
- Brandon Yates (Intel)
- Brendan Cunningham (Cornelis Networks)
- Charles Shereda (LLNL)
- Christoph Niethammer (HLRS)
- Erik Zeiske
- Geoffroy Vallee (ARM)
- George Bosilca (UTK)
- Joshua Ladd (NVIDIA/Mellanox)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Michael Heinz (Cornelis Networks)
- Nathan Hjelm (Google)
- Noah Evans (Sandia)
- Raghu Raja (AWS)
- Scott Breyer (Sandia?)
- Shintaro Iwasaki
- Tomislav Janjusic
- Xin Zhao (NVIDIA/Mellanox)
- Mohan (AWS)
 
- The Webex link has changed for 2021.  Please see the email from Jeff Squyres to devel-core@lists.open-mpi.org on 12/15/2020 for the new link.
- v4.0.6rc1 - built, please test.
 - Issue 8321 -
 - Issue 8335 may affect v4.0.x also (running with an external PMIx v4.0).
 - Issue 8304 also affects v4.0.x.
 
- Released v4.1.0 in December.
 - Downstream packagers hit an issue with AVX.
   - Really an issue of compiler support; we were doing a subtly wrong thing in configure/make.
   - Fixed.
 
 - Issue 8334 - a performance regression with AVX.  Still digging into it.
   - Blocker for v4.1.1.  Might need to do a quick v4.1.1 turnaround.
   - Probably GCC not generating correct AVX-512 instructions.
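For illustration, here is a minimal sketch (not the actual configure test) of the kind of probe that distinguishes "the compiler accepts AVX-512 flags" from "the generated instructions actually work"; the subtle configure/make bug above was in this class of check:

```c
/* Illustrative AVX-512F probe, in the spirit of a configure check.
 * Build with: gcc -mavx512f avx_probe.c -o avx_probe
 * - If compilation fails, the compiler lacks AVX-512F support.
 * - If the binary dies with SIGILL, the CPU lacks AVX-512F. */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m512i a = _mm512_set1_epi32(1);
    __m512i b = _mm512_set1_epi32(2);
    __m512i c = _mm512_add_epi32(a, b);
    int out[16];
    _mm512_storeu_si512((void *)out, c);
    printf("avx512f ok: %d\n", out[0]); /* expect 3 */
    return (out[0] == 3) ? 0 : 1;
}
```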
 
 - Issue 8335 - trying to run with an external PMIx.
   - Josh Hursey can look at it today.
   - Likely a simple configury fix.
 
 - Michael Heinz is looking at a new PSM2(?) issue from yesterday.  Possibly for v4.1.1.
 - Josh Hursey is working on Issue 8304 (verified in v4.1, v4.0, and v3.1).
 
- Does the community want ULFM PR 7740 for OMPI v5.0?  If so, we need a PRRTE v3.0.
 - Aurelien will rebase.
 - Works with the PRRTE referred to by the ompi master submodule pointer.
 - Currently used in a bunch of places.
 - Run the normal regression tests; we should not see any performance regressions.
 - When this works, can provide other tests.
 - It is behind a configure flag.  The default is to build it in, but it is disabled at runtime.
   - A number of things must be set to enable it (see the usage sketch below).
   - Aurelien is working to get it down to a single parameter.
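For context, a rough sketch of the recovery flow that ULFM enables from user code, using the MPIX_ extensions defined by the ULFM proposal (MPIX_Comm_revoke / MPIX_Comm_shrink). This is illustrative of the feature under review, not the PR's final interface, and it assumes a build with ULFM enabled at run time:

```c
/* Sketch of ULFM-style fault handling (assumes an ULFM-enabled build). */
#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_ fault-tolerance extensions */
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Errors must be returned (not abort the job) for recovery to work. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rc = MPI_Barrier(MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        int eclass;
        MPI_Error_class(rc, &eclass);
        if (eclass == MPIX_ERR_PROC_FAILED) {
            /* Ensure every rank sees the failure... */
            MPIX_Comm_revoke(MPI_COMM_WORLD);
            /* ...then build a smaller communicator without the dead ranks. */
            MPI_Comm shrunk;
            MPIX_Comm_shrink(MPI_COMM_WORLD, &shrunk);
            int nsurvivors;
            MPI_Comm_size(shrunk, &nsurvivors);
            printf("continuing with %d survivors\n", nsurvivors);
            MPI_Comm_free(&shrunk);
        }
    }

    MPI_Finalize();
    return 0;
}
```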
 
 - Let's get some CODE reviews done.
   - Look at the intersections with the core, and ensure that the non-ULFM paths are "clean".
 
 - Also, this has a downstream effect: PMIx and PRRTE changes are needed as well.
 - Let's put a deadline on reviews: in 4 weeks, we'll push the merge button.
   - Jan 26th we'll merge if there are no issues.
 
 
 
- Modified ABI - removed one callback/member function (used for the FT event) from some component structures (BTLs/PMLs).
 - This changes the module structures for all components in those frameworks.
 - Pending the outcome of this discussion.
 - Plan is to version the frameworks that are affected.
   - Not that simple in practice, because a component usually just returns a pointer to a static object.
   - But that isn't possible anymore.
   - We don't support multiple versions of a framework interface.
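To make the "pointer to a static object" problem concrete, a hypothetical sketch (invented names, not the real OMPI structs): removing a member changes the struct layout, and a component that returns one static module can only ever match one layout:

```c
/* Hypothetical illustration of the framework-versioning problem;
 * these are NOT the real Open MPI structures. */

/* v1 of a component's module struct, with the FT-event callback. */
typedef struct example_module_v1 {
    int (*send)(const void *buf, int len);
    int (*ft_event)(int state);   /* removed in v2 */
} example_module_v1_t;

/* v2 drops ft_event, so the layout (and therefore the ABI) changes. */
typedef struct example_module_v2 {
    int (*send)(const void *buf, int len);
} example_module_v2_t;

static int my_send(const void *buf, int len) { (void)buf; return len; }

/* A component typically hands back a pointer to a single static object,
 * so one component binary can satisfy exactly one struct version. */
static example_module_v2_t my_module = { .send = my_send };

example_module_v2_t *example_component_init(void)
{
    /* Supporting v1 AND v2 callers would require two static objects and
     * a query step that selects by version - which is the "version the
     * affected frameworks" work discussed above. */
    return &my_module;
}

int main(void) { return example_component_init()->send("x", 1) == 1 ? 0 : 1; }
```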
 
 
 - Do we think we should allow Open MPI v5.0 to run with MCA components from past versions?
   - Maybe good to protect against it?
   - Unless we know of someone we need to support like this, we shouldn't bend over backwards for it.
   - Josh thinks the container community is experimenting with this.

 - Josh has advised that Open MPI doesn't guarantee that this works.
   - v5.0 is advertised as an ABI break.
   - In this case, the framework doesn't exist anymore.
 
- Still need to coordinate on this.  He'd like this done this week.
 - PMIx v4.0 is working on tools support; hopefully done soon.
   - PMIx tools go through the Python bindings.
   - A new shmem component to replace the existing one.
   - Still working on it.
 
 - Dave Wooten pushed up some PRRTE patches, and is making some progress there.
   - Slow but steady progress.
   - Once the tool work is more stabilized on PMIx v4.0, will add some tool tests to CI.
   - Probably won't start until the first of the year.
 
 - How are the submodule reference updates on Open MPI master going?
   - Probably will be switching OMPI master to PMIx master in the next few weeks.
   - PR 8319 - this failed.  Should it be closed and a new one created?

 - Josh was still looking into adding some cross-checking CI.
   - When making a PRRTE PR, one could add a comment to the PR and it would trigger Open MPI CI with that PR.
 
 
- New Webex link for January.
 
- Slurm (as of 20.11) is now always using a cgroup, and always setting the default number of cores in the cgroup to 1.
 - So when using mpirun with orted/prrted under Slurm, orted/prrted (and the processes it launches) can't use more than that single core.
 - Ralph is working on a PR from a user comment (PR 8288).
 - There is an issue, and possibly a README note (this will catch a lot of people).
 
 - Ralph and Jeff iterated with the Slurm folks and downstream packagers on an FAQ entry.
   - PRs add the env var by default in v4.0.6, v4.1.0, and master (see the sketch below).
   - On OMPI v3.x, users can just export this env var themselves.
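The minutes don't name the variable; assuming it is SLURM_WHOLE (the Slurm 20.11 knob that restores whole-allocation semantics for a step), the described fix amounts to something like this sketch, run before the launcher exec's srun:

```c
/* Hedged sketch of the workaround described above; assumes the env var in
 * question is SLURM_WHOLE (Slurm 20.11), which tells srun to give the step
 * the whole allocation instead of a 1-core cgroup. The argv below is
 * illustrative only. */
#include <stdlib.h>
#include <unistd.h>

static void launch_daemons(char *const srun_argv[])
{
    /* Only set it if the user hasn't already expressed a preference. */
    setenv("SLURM_WHOLE", "1", 0 /* don't overwrite */);
    execvp(srun_argv[0], srun_argv);
    _exit(1); /* execvp only returns on error */
}

int main(void)
{
    char *argv[] = { "srun", "--ntasks-per-node=1", "prted", NULL };
    launch_daemons(argv);
    return 1; /* not reached if exec succeeds */
}
```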
 
 
- Took the latest ROMIO and it failed on both.
 - But then he took LAST week's 3.4 BETA ROMIO and it passed.  But it's a little too new.
 - He gave a bit more info about the stuff he integrates, and the stuff he carries forward.

- ROMIO modernization (don't use MPI-1 based things).

- ROMIO integration items.
 
 
 - We're hesitant to put this into v4.1.0 because it's NOT yet released from MPICH.
 - Hesitant to even update ROMIO in v4.0.6 since it's a big change.
 - If we delay and pick up a newer ROMIO in the next minor release, would there be backwards-compatibility issues?
   - Need to ask about compatibility between ROMIO 3.2.2 and 3.4.
   - If fully compatible, then we only need one ROMIO.
 
 
 - We could ship multiple ROMIOs, but that has a lot of problems.
 
- Edgar just got resources to test, and root-caused the issue in OMPIO.
 - So, given some more time, Edgar will get a fix in, and OMPIO can be the default.
 
- What do we want to do about ROMIO in general?
 - OMPIO is the default everywhere.
 - Gilles is saying the changes we made are integration changes.
   - There have been some OMPI-specific changes put into ROMIO, meaning the upstream maintainers refuse to help us with it.
   - We may be able to work with upstream to define a clear API between the two.
 
 - Since it is a 3rd-party package, should we move it up to the 3rd-party packaging area, to make clear that we shouldn't make changes in this area?
 
 - Need to look at this treematch thing - an upstream package that is now inside of Open MPI.
 - Might want a CI bot to watch a set of files, and flag PRs that violate principles like this.
 
- PR 8329 - convert README, HACKING, and possibly the manpages to reStructuredText.
 - Uses https://www.sphinx-doc.org/en/master/ (a Python tool; can be pip-installed).
 - There is a build from this PR, so we can see what it looks like.
 - Have a look.  It's a different approach: one document that contains the whole thing.
   - FAQ, README, HACKING.
 
 
 - Do people even use manpages anymore? Do we need/want them in our tarballs?
 
- https://github.com/openpmix/prrte/pull/711
 - Please review and give an opinion.
 - Will commit next week if no opinions are voiced.
 
- How's the state of https://github.com/open-mpi/ompi-tests-public/ ?
 - Putting new tests there.
 - Very little there so far, but working on adding some more.
 - Should have some new Sessions tests.
- What's going to be the state of the SM CUDA BTL and CUDA support in v5.0?
 - What's the general state? Any known issues?
 - AWS would like to get an answer.
 - Josh Ladd - will take it internally to see what they have to say.
 - From NVIDIA/Mellanox, CUDA support is through UCX; SM CUDA isn't tested that much.
 - Hessam Mirsadeghi - all CUDA awareness goes through UCX.
 - May ask George Bosilca about this.
 - Don't want to remove a BTL if someone is interested in it.
 - UCX also supports CUDA over TCP.
 - The PRRTE CLI in v5.0 will have some GPU functionality that Ralph is working on.
 
 - Update 11/17/2020:
   - UTK is interested in this BTL, and maybe others are too.
   - Still a gap in the MTL use-case.
   - NVIDIA is not maintaining SM CUDA anymore; all CUDA support will be through UCX.
   - What's the state of the shared-memory code in the BTL?
     - This is the really old generation of shared memory, older than Vader.
   - Was told that after a certain point there would be no more development in SM CUDA.
   - One option might be to
   - Another option might be to bring the shared-memory support in SM CUDA over to Vader (now SM).
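As background, "CUDA awareness" means applications can hand device pointers straight to MPI calls; a minimal sketch (assumes a CUDA-aware build, e.g. via UCX or the SM CUDA BTL, run with 2 ranks that each have a GPU):

```c
/* Minimal CUDA-aware MPI sketch: pass a device pointer directly to MPI.
 * Works only when the underlying transport is CUDA-aware. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *dbuf;
    cudaMalloc((void **)&dbuf, 1024 * sizeof(double));

    if (rank == 0) {
        MPI_Send(dbuf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* No cudaMemcpy staging needed: the library moves GPU memory. */
        MPI_Recv(dbuf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}
```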
 
 - reStructuredText tech docs (more features than Markdown, including cross-references).
   - Jeff had a first stab at this; take a look.  Sent it out to the devel list.
   - All work is for master / v5.0.
   - Might just be useful to do the README for a v4.1 point release (don't block v4.1.0 on this).
 
 - Sphinx is the tool that generates docs from reStructuredText.
   - It can handle the current Markdown manpages together with the new docs.
 
 - readthedocs.io encourages the reStructuredText format over Markdown.
   - They also support a hybrid for projects that have both.
 
 - Thomas Naughton has done reStructuredText work, and it allows
 - LICENSE question - what license would the docs be available under?  The Open MPI BSD license, or
 
 - Ralph tried Instant-On at scale:
   - 10,000 nodes x 32 PPN.
   - Ralph verified Open MPI could do all of that in < 5 seconds with Instant-On.
   - Measured through MPI_Init() (when using Instant-On).
   - TCP and Slingshot (the OFI provider is private for now).
   - PRRTE with PMIx v4.0 support.
   - Slurm has some of the integration, but hasn't taken this patch yet.
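As background, "Instant-On" means the launcher prepositions job information so processes can start without a wireup exchange; a minimal PMIx-level sketch of that fence-free pattern (illustrative, not the actual test harness used here):

```c
/* Sketch: fence-free startup in the Instant-On spirit. Each process reads
 * prepositioned job-level info; no collective exchange is needed. */
#include <pmix.h>
#include <stdio.h>

int main(void)
{
    pmix_proc_t me, wild;
    pmix_value_t *val;

    if (PMIX_SUCCESS != PMIx_Init(&me, NULL, 0)) return 1;

    /* Job-level keys are addressed with the wildcard rank. */
    PMIX_PROC_CONSTRUCT(&wild);
    PMIX_LOAD_PROCID(&wild, me.nspace, PMIX_RANK_WILDCARD);

    if (PMIX_SUCCESS == PMIx_Get(&wild, PMIX_JOB_SIZE, NULL, 0, &val)) {
        printf("rank %u of %u, no modex needed\n",
               (unsigned)me.rank, (unsigned)val->data.uint32);
        PMIX_VALUE_RELEASE(val);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}
```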
 
 - Discussion on: draft request to make components static by default - https://github.com/open-mpi/ompi/pull/8132
   - One con is that many providers hard-link against their libraries, which would then make libmpi dependent on those libraries.
   - Talking about amending the request so each MCA component can declare whether it should be slurped in.
     - (Depending on whether the component hard-links or dlopens its libraries.)

 - Roadrunner experiments: the bottleneck in launching was the I/O of loading all the .so files.
   - Spindle and burst buffers reduce this, but it is still there.

 - Calls still go through function pointers; there is no additional inlining.
   - We can do this today.

 - Still different from a fully STATIC build (sharing one image across processes); it just avoids calling dlopen so many times (see the sketch below this list).
 - The new proposal is a 3rd option where a component decides whether its default is to be slurped into libmpi.
   - It's nice to have fabric providers not bring their dependencies into libmpi, so that the main libmpi can run on nodes that don't have those dependencies installed.
 
 - Low-priority thing anyway; if we get it in for v5.0 it'd be nice, but it's not critical.
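To illustrate the dlopen-vs-slurped tradeoff, a hedged sketch (invented component name and symbol, not the real MCA loader; link with -ldl):

```c
/* Illustrative sketch of the two component-loading modes discussed above. */
#include <dlfcn.h>
#include <stdio.h>

typedef int (*component_init_fn)(void);

/* DSO mode: each component is a separate .so. Launching many processes
 * per node multiplies this file-system traffic (the launch bottleneck). */
static component_init_fn load_dso(const char *path, const char *sym)
{
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return NULL;
    }
    return (component_init_fn)dlsym(handle, sym);
}

/* "Slurped" mode: the component is linked into libmpi itself, so the
 * symbol resolves at link time - no dlopen, but libmpi now also carries
 * whatever libraries the component hard-links against. */
static int example_component_init(void)
{
    puts("built-in component");
    return 0;
}

int main(void)
{
    /* Hypothetical component name, for illustration only. */
    component_init_fn init =
        load_dso("mca_btl_example.so", "example_component_init");
    if (init == NULL) {
        init = example_component_init; /* fall back to the built-in one */
    }
    return init();
}
```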
 
 
- George and Jeff are leading this.
 - No new updates this week (see last week's notes).