HEIC and AVIF are both based on HEIF, which is in turn based on ISOBMFF (as is MP4, for example).
The JPEG XL container (which is optional and only needed if you want to attach metadata to an image) is also based (more directly and with less header overhead) on ISOBMFF.
The ISO media file format is, like GP said, basically QuickTime.
This can be a problem if you're the kind of completionist who needs to implement everything they see and make one C++ class per QuickTime atom - a problem I saw with a lot of mp4 codebases.
But there's no need to do this, because almost nothing in the spec matters. Just skip the atoms that don't matter, handle the rest procedurally, and it'll be fine. It looks like JPEG XL also has too many features (like this animation and patching thing), so maybe just ignore those too.
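To make "handle the rest procedurally" concrete, here is a minimal sketch of a flat box walker. It assumes only the basic ISOBMFF box header layout (32-bit big-endian size, 4-byte type, size 1 meaning a 64-bit size follows, size 0 meaning "runs to the end"). The names `iter_boxes`, `find_boxes` and `make_box` are made up for illustration, and the sample data is synthetic, not a real file.

```python
import struct


def iter_boxes(data, start=0, end=None):
    """Yield (type, payload_start, payload_end) for each box in data[start:end]."""
    if end is None:
        end = len(data)
    pos = start
    while pos + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:
            # A 64-bit size follows the type field.
            if pos + 16 > end:
                break
            (size,) = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif size == 0:
            # Box extends to the end of the enclosing range.
            size = end - pos
        if size < header or pos + size > end:
            break  # truncated or corrupt input: stop instead of guessing
        yield box_type, pos + header, pos + size
        pos += size


def find_boxes(data, wanted):
    """Return payloads of the box types we care about; everything else is skipped."""
    wanted = set(wanted)
    return [(t, data[s:e]) for t, s, e in iter_boxes(data) if t in wanted]


def make_box(box_type, payload=b""):
    """Build a synthetic box for the demo below (hypothetical helper)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


sample = (
    make_box(b"ftyp", b"isom\x00\x00\x00\x00")
    + make_box(b"free", b"\x00" * 4)
    + make_box(b"mdat", b"abc")
)
print([t for t, _, _ in iter_boxes(sample)])  # [b'ftyp', b'free', b'mdat']
print(find_boxes(sample, {b"mdat"}))          # [(b'mdat', b'abc')]
```

Unknown box types cost nothing here: the loop jumps over them by size. To descend into a container box, call `iter_boxes` again with that box's payload range as `start` and `end`. No class per atom needed.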
Ignoring stuff is fine if you make an encoder. When you're making a decoder, it's a big no-go. If implementers of the original JPEG had implemented the whole spec, we would have 12-bit and lossless JPEGs. Instead we're stuck with the de facto JPEG standard, the subset of the spec that everyone ended up implementing.
libjpeg supports 12-bit with a compile-time option. Nobody used it because you can't abstract over bit depth (and sometimes pixel format) without losing all performance: doing it at runtime puts a lot of if statements inside every hot loop, so instead you have to duplicate all the related code. Also, there wasn't a way to view 12-bit images until HDR displays came around recently.
The main issue with implementing all of MPEG-4 is that the spec is overdetermined (the same fields exist at different layers and can disagree), but it's also full of nonsense nobody cares about, like the alternate codec that only animates faces.