
> the consumer cards don't do compute?

Typically, software developers only support a single GPGPU API, and that API is nVidia CUDA.

Technically, consumer AMD cards are awesome at compute. For example, UE5 renders triangle meshes with compute instead of graphics: https://www.youtube.com/watch?v=TMorJX3Nj6U Moreover, AMD cards often outperform nVidia equivalents because nVidia prioritized ray tracing and DLSS over compute power and memory bandwidth.

The issue is, no tech company is interested in adding a D3D or Vulkan backend to AI libraries like PyTorch. nVidia is not doing that because they are happy with the status quo. Intel and AMD are not doing that because both hope to replace CUDA with their proprietary equivalents, instead of an open GPU API.



> The issue is, no tech company is interested in adding a D3D or Vulkan backend to AI libraries like PyTorch.

AMD is interested in making PyTorch take advantage of their chips, and has done so.

> we are delighted that the PyTorch 2.0 stable release includes support for AMD Instinct™ and Radeon™ GPUs that are supported by the ROCm™ software platform.
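
For what it's worth, here's a rough sketch of how you'd check that a ROCm build of PyTorch is actually picking up a Radeon/Instinct card (assuming a ROCm wheel is installed; ROCm builds reuse the torch.cuda API and expose torch.version.hip):

    import torch

    # ROCm builds of PyTorch reuse the torch.cuda namespace via HIP,
    # so the same calls work on AMD GPUs.
    print(torch.version.hip)             # a version string on a ROCm build, None on a CUDA build
    print(torch.cuda.is_available())     # True if an AMD GPU is visible
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))   # e.g. a Radeon or Instinct device name
        x = torch.randn(1024, 1024, device="cuda")
        print((x @ x).sum().item())      # runs the matmul on the AMD GPU via ROCm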

When NVIDIA is charging 1000% profit margins for a GPU aimed at use as an AI accelerator, you can expect competitors to be willing to do what it takes to move into that market.

https://www.tomshardware.com/news/nvidia-makes-1000-profit-o...


I know you are just quoting the tomshardware article, but sheesh, "1000% profit margin" is nonsensical. If the cost of goods is X and they sell it for 10X, then it is a 90% profit margin, not a 1000% profit margin.
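
To make the distinction concrete, a tiny worked example with hypothetical numbers:

    # Hypothetical numbers: cost of goods 100, selling price 1100.
    cost, price = 100.0, 1100.0

    markup = (price - cost) / cost * 100    # profit relative to cost
    margin = (price - cost) / price * 100   # profit relative to selling price

    print(f"markup: {markup:.0f}%")   # 1000% - what the article is actually describing
    print(f"margin: {margin:.1f}%")   # 90.9% - a margin can never exceed 100%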


Yeah, the word they were looking for is "markup". People make this mistake all the time.


People can't do percentages.

Once when I was in an RPG shop I visited frequently, some youths came in and said "We want to order 5 figures." After a short pause they added "Can we get them cheaper, because we're buying 5?" The shop owner said "Fine, I'll give you one for free." They said "No, the other owner promised us 10 percent". After a small pause the shop owner smiled and said "Ok, I'll stick with the promise, you get 10 percent."


Percentages suck. They feel almost designed to confuse. Just about anything is better, really. Doesn't help that people using percentages either use them intentionally to confuse their victims - think shop owners, advertisers, banks, scammers, especially in finance and insurance adjacent areas - or just don't realize how easy it is for everyone, themselves included, to lose track of the base/referent.

I've come to see the use of percentages as a warning sign.


Great satire. I think the greatest satire is the kind that sounds totally real, the kind that makes the reader think twice. But "I've come to see the use of percentages as a warning sign." gave you away.


Yeah. Except I was 90% earnest.


I was hoping for at least 190% earnest.


They're both percentages. Using the wrong term isn't a math failure, like in your example.

"markup" versus "margin" is an arbitrary choice, not something people should be expected to intuit. And they're usually close in value, so people are less likely to be informed or reminded of the two terms and which is which.


You would think so, and yet AMD dropped the ball so hard on the software side of things, the resulting hole in the ground might as well be bottomless.


Is that really just on AMD?

All the companies that are now screaming hellfire because of Nvidia's market maker position are also the companies that gave Nvidia a warchest filled with billions of dollars. How is AMD supposed to compete when the whole market is funding their rival?


The whole market funded Nvidia because AMD failed to provide a viable alternative for 10+ years


AMD was almost bankrupt 10 years ago. Basically AMD had to bet the whole company on either their next CPU or their next GPU; they chose the CPU, put all their effort into that, and it paid off (the Zen architecture).

It has only been the last 4 or 5 years that AMD has had any real money to put into their GPU/AI accelerator sector, and that seems to be developing quite well now (though they seem to be mostly interested in supercomputer/massive data center deployments for now).


GPU languages shouldn't be proprietary. At one point this was a shared understanding and lots of companies were behind initiatives like OpenCL. Meanwhile progress in higher-level, non-C++ GPU languages has stayed in niches without big industry backing, and we're stuck with CUDA, which is bad both in a language-design sense and market-wise.


the problem with opencl was that AMD's runtime was so buggy and incomplete that developers usually had to build a whole separate pathway just for AMD. And if you're doing that, you're not portable anyway, so why bother forcing NVIDIA through a more-mismatched generalized API just for the sake of purported "portability"?

it's turtles (broken AMD technical leadership) all the way down.

this is simply not a field AMD cared about, whether you think they were right or wrong (based on their financials or otherwise). and now that it has turned into a cash fountain, everyone wishes it had gone differently. "should have bought bitcoin" but instead of pollution funbux it's investing into your own product.


This is an AMD-centric view, but the problem is much wider: it includes other OS platforms (e.g. Android and Apple platforms) and other hardware vendors (e.g. Intel, mobile chipsets) as well. After all, game APIs were also victims of the same kind of "code to the driver" circumstance.

I agree that the correctness and robustness of the vendor implementations went poorly, but it would be fixable if there were commitment between vendors; the infighting leading to Apple's departure can't have helped that side either.

In the end I think making the high-level compilers and APIs the responsibility of the proprietary driver stack is unsustainable. Microsoft seems to have a better model, or there could even be a standard intermediate code, consumed (and JIT-compiled) by drivers and emitted by open-source stacks shared across platforms. The latter approach would also better support the development of nicer GPU languages.


> In the end I think making the high-level compilers and APIs the responsibility of the proprietary driver stack is unsustainable. Microsoft seems to have a better model, or there could even be a standard intermediate code, consumed (and JIT-compiled) by drivers and emitted by open-source stacks shared across platforms. The latter approach would also better support the development of nicer GPU languages.

OK, let's talk MS.

What NVIDIA has done with PTX is basically the same thing as MSIL/CIL. There is a meta-ISA/meta-language that has been commonly agreed upon, and as long as you invoke no tokens that the receiving driver doesn't understand, the legacy compiler understands it and emits executable (perhaps not optimal) code.

The legacy hardware stays on a specific driver and a specific CUDA support level. The CUDA support level is everything, that is your entire "feature profile". It's not GFX1030/GFX1031, it's RDNA3.0/3.1/whatever. NVIDIA has been able to maintain their feature support (not implementation details) in a monotonically increasing sequence.

Additionally, they maintain a guarantee that a spec 3.1 implementation can also compile and execute 3.0 code. "Upwards family correctness", I guess. Again, it doesn't have to be optimal, but the contract is that there are no tokens you don't understand. A Tegra 7.0 compiler can compile desktop 7.0 CUDA (byte/)code, etc.

https://en.wikipedia.org/wiki/CUDA#Version_features_and_spec...

Legacy driver support doesn't change, so you can always read PTX; you can always read anything that's compiled for your CUDA Capability Level, even if it's from a future toolkit that wasn't released at the time. PTX is always the ultimate "emit code from 2020 and run on the 2007 GPU that only understands CUDA 1.3" relief valve though. If you don't write code that does stuff that's illegal in 1.3... it'll compile and emit PTX, and the Tesla driver will parse and run it.
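
As a rough illustration of that capability contract, a sketch using PyTorch's CUDA introspection (what it prints depends entirely on your GPU and on the toolkit your build was compiled with):

    import torch

    if torch.cuda.is_available():
        # The compute capability is the coarse "feature profile" described above,
        # e.g. (8, 6) for an Ampere consumer card.
        major, minor = torch.cuda.get_device_capability(0)
        print(f"compute capability: {major}.{minor}")

        # The targets this PyTorch build was compiled for. As long as the build
        # embeds PTX for a capability <= yours, the driver can JIT it forward.
        print(torch.cuda.get_arch_list())   # e.g. ['sm_50', 'sm_60', ..., 'sm_90']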

Anyway, let's talk MS. MS could come up with MSIL/CIL. NVIDIA can come up with PTX. Why can't AMD do this? What is unique about GCN/RDNA as a language that you can't come up with an intermediate language representation?


I agree that AMD could have done a proprietary intermediate instruction format and with good execution it would have made their users lives easier.

But the actual big opportunity, which could make the whole GPU programming field far more dev-friendly and make GPU vendor competition work in favour of users, would be doing the GPU IL in a cross-vendor way, driven by someone other than a single GPU vendor who is playing the proprietary lock-in game (or is certain to play it after gaining sufficient market position).


Have a solution (ROCm) that works on all of their modern cards (RDNA 1, 2 and 3, and CDNA)?


even better, APUs. how about those vega APUs in everyone's AMD laptops and desktops?

oh, right, those are EOL'd, security updates only/separate driver package with a quarterly release cycle, even though they're still on store shelves...


just to emphasize: it is crazy to me that, at this moment of frenzy over AI/ML, AMD is not making better use of their iGPUs. The Vega iGPU is really powerful compared to its contemporaries; it's not WMMA- or DP4a-capable (iirc), but it can still brute-force a lot more than you'd think, and it is certainly capable of doing at least a little bit of inference.

Remember that the "AI core" in the new Meteor Lake and Hawk Point stuff is not really all that gangbusters either... it's a low-power sidekick, for the same types of stuff that smartphones have been doing with their AI cores for ages. Enhancing the cameraphone (these cameras would be complete shit without computational/AI enhancement). Recognizing gestures, recognizing keywords for voice assistant activation.

AMD's pitch for the AI core is enhancing game AI. Windows 11/12 assistant. That type of stuff.

Vega can absolutely just brute-force its way through that stuff, and it gets more people onto the platform and developing for AMD. It is crazy that if nothing else they aren't at least making sure the APUs can run shit.

And again, it's pretty damn unethical imo to be pulling support from products that are still actively marketed and sold. That's a cheesy move. AMD dropped support for TeraScale before NVIDIA dropped support for Fermi, and NVIDIA went back and added Vulkan support to all their older stuff during the pandemic too. Then they dropped the 28nm GCN families, while NVIDIA is still supporting Maxwell. And then they dropped Polaris and Vega, and NVIDIA is still supporting Maxwell (albeit I expect them to drop it very soon imo).

The open driver is great under Linux because it bypasses AMD's craptacular support. But AMD doesn't support consumer GPUs in ROCm under Linux, and de facto they don't seem to support ROCm on the open driver in the first place anyway; you have to use AMDGPU-PRO (according to geohot's investigations).

This is such a crazy miss. Yes, it's not a powerhouse, but in the era of Win12 moving to AI everything, and games moving to AI-driven computer opponents, etc - Vega can do that ok, and it at least would give people something to open the door and get them developing.

If there's a contender for breakout against CUDA, sadly it really seems to be Apple. They've got APUs with PS5-level bandwidth and unified memory, and that's just M1 Max, and they have Metal, and it's supported everywhere across their ecosystem. It lets people dip their toes into AI/ML if nothing else, and lets people dip their toes into metal development. That's the kind of environment that NVIDIA spent a decade fostering, and it's also not a coincidence that the second stop for all these data scientists playing with models is not ROCm but their apple laptops. llama.cpp and so on. Everyone likes the hardware they have in their gaming PC or in their laptop, and it's an absolute miss for AMD to not make themselves available in that fashion when they already have the market penetration. Crazy.

https://twitter.com/Locuza_/status/1450271726827413508/photo...


>The open driver is great under Linux because it bypasses AMD's craptacular support. But AMD doesn't support consumer GPUs in ROCm under Linux, and de facto they don't seem to support ROCm on the open driver in the first place anyway; you have to use AMDGPU-PRO (according to geohot's investigations).

While they don't officially support any consumer GPU aside from the 7900 XT(X) and the VII, I haven't encountered any issues using it on a 6700XT with the open source drivers; pretty much the only tinkering required was to export HSA_OVERRIDE_GFX_VERSION=10.3.0. It was quite a pleasant surprise after never getting my RX480 to work even though it was officially supported back in the day.
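
For anyone who wants to try the same trick without touching their shell profile, a minimal sketch (assuming a ROCm build of PyTorch and an RDNA2 card like the 6700XT; the override has to be in place before the HIP runtime initializes):

    import os

    # Pretend the gfx1031 (6700XT) is a gfx1030, which ROCm ships kernels for.
    # Must be set before importing torch / initializing the HIP runtime.
    os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"

    import torch

    print(torch.cuda.is_available())        # True if the override worked
    print(torch.cuda.get_device_name(0))    # should report the Radeon card
    x = torch.randn(4096, 4096, device="cuda")
    print((x @ x).mean().item())            # quick smoke test on the GPU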


Microsoft has DirectML and a PyTorch backend for it, although it is stale and significantly slower than CUDA on a 3090 (5x+ as of 2 years ago).
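
If anyone wants to poke at it, the usage is roughly this (a sketch assuming the torch-directml package; details may have changed since it went stale):

    # pip install torch-directml
    import torch
    import torch_directml

    dml = torch_directml.device()        # wraps the default DirectX 12 adapter
    x = torch.randn(1024, 1024, device=dml)
    y = torch.randn(1024, 1024, device=dml)
    print((x @ y).sum().cpu().item())    # the matmul runs through DirectML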


Vulkan Compute backends for numerical compute (as typified by both OpenCL and SYCL) are challenging; you can look at the clspv project (https://github.com/google/clspv) for the nitty-gritty details. The lowest-effort path so far is most likely via some combination of ROCm/HIP (for hardware that AMD bothers to support themselves) and the Mesa project's Rusticl backend (for everything else).


> Vulkan Compute backends for numerical compute (as typified by both OpenCL and SYCL) are challenging

Microsoft has an offline dxc.exe compiler which compiles HLSL to SPIR-V. Also, DXVK has a JIT compiler which recompiles DXBC byte code to SPIR-V. Both technologies are old, stable and reliable; for example, DXVK's JIT compiler is a critical software component of the Steam Deck console.

> The lowest-effort path so far is most likely

I agree that's most likely to happen, but the outcome is horrible from a consumer PoV.

Mesa is Linux-only, Rust is too hard to use for the vast majority of developers (myself included), AMD will never support older cards with ROCm, and we now have a third discrete GPU vendor, Intel.


SPIR-V for OpenCL and for Vulkan are substantially different, with the translation between the two being quite non-trivial.

(note that rusticl + zink does deal with it _partially_ nowadays)

+ Vulkan memory management doesn't expose unified address space primitives


Why would you want OpenCL? Pretty sure D3D11 compute shaders are gonna be adequate for a Torch backend, and they even work on Linux with Wine: https://github.com/Const-me/Whisper/issues/42 Native Vulkan compute shaders would be even better.

Why would you want a unified address space? At least in my experience, it's often too slow to be useful. DMA transfers (CopyResource in D3D11, copy command queue in D3D12, transfer queue in VK) are implemented by dedicated hardware inside GPUs, and are way more efficient.


> Why would you want OpenCL?

OpenCL is stricter with the results of floating point operations, and makes different assumptions with respect to memory aliasing. Whether or not this is important in the AI domain, I don't know.

> Why would you want a unified address space?

A unified address space doesn't always imply that the memory can be accessed from anywhere (although that might also be supported with some memory allocation mechanisms), and you still may have to copy between host and device memory. But it makes it much easier to have pointers in your GPU kernels, instead of having to deal with objects like OpenCL buffers.


> Why would you want a unified address space?

Mac APU I guess. Or Jetson/Tegra kind of things.


My laptop has a single GPU inside a Ryzen 5 5600U, i.e. unified memory; all consoles also have unified memory. These devices are fine with the traditional GPU programming model, where shaders only have access to well-shaped pieces of memory accessible through resource views or UAVs.

CPU-style memory access in GPU kernels is technically possible (CUDA did it) but unfortunately rather hard. The feature requires hardware support inside GPUs (you need pointers, they need to be 64 bits, you need 64-bit integer arithmetic instructions), and it's therefore not going to work on many current GPUs. It makes GPU-running code harder to compile and optimize. And on devices without physically unified memory, the performance of these kernels is gonna suck.

Luckily, none of that is needed to implement a D3D11 or Vulkan backend for PyTorch. PyTorch is not a general-purpose GPGPU runtime; it's merely a high-level library which manipulates tensors and implements a few BLAS routines operating on them. It's easy to allocate a GPU resource for each tensor being manipulated.
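
To illustrate the point: even a transformer attention block reduces to a handful of coarse tensor-level primitives, which is all a backend really has to cover. A CPU-only sketch:

    import math
    import torch

    def attention(q, k, v):
        # The whole block is just matmul, scale, softmax, matmul - exactly the kind
        # of coarse tensor ops a D3D11/Vulkan backend would implement as compute
        # shader dispatches, with one GPU resource per tensor.
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
        return torch.softmax(scores, dim=-1) @ v

    q = torch.randn(1, 8, 128, 64)   # (batch, heads, tokens, head_dim)
    k = torch.randn(1, 8, 128, 64)
    v = torch.randn(1, 8, 128, 64)
    print(attention(q, k, v).shape)  # torch.Size([1, 8, 128, 64])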


A Vulkan backend for PyTorch exists. It's mostly tested on Android, but it's there. The PyTorch maintainers, though, are reluctant to advertise that as "support", because complete, reliable support for the zillion 'operators' PyTorch includes is quite a different challenge.


> Intel and AMD are not doing that because both hope to replace CUDA with their proprietary equivalents, instead of an open GPU API

Like AMD did with FreeSync (now VRR) and FSR3 (whatever open standard it will morph into)?

I'll never understand the hate AMD gets from the open source community. Mind-boggling.


> I'll never understand the hate AMD gets from the open source community. Mind-boggling.

A large part of that hate is that, for a very long time, they had the worst of both worlds: a fragile proprietary driver and subpar performance.

Only recently have they fixed both. Indeed I only found out about their newfound openness a few years ago. Not everyone has gotten the memo.


Isn’t Intel’s solution oneAPI, which is supposed to be open and have a sort of translation layer for CUDA?


> The issue is, no tech company is interested in adding a D3D or Vulkan backend

I'm using Apache TVM right now to run LLMs via Vulkan on Intel Arc cards.

Intel has their own PyTorch wheels here for their GPUs: https://developer.intel.com/ipex-whl-stable-xpu
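
Rough usage sketch for those wheels (assuming the intel-extension-for-pytorch XPU build; the exact package and device naming may have shifted between releases):

    import torch
    import intel_extension_for_pytorch as ipex  # noqa: F401 - registers the "xpu" device

    print(torch.xpu.is_available())             # True if an Arc / Xe GPU is visible
    x = torch.randn(2048, 2048, device="xpu")
    print((x @ x).sum().cpu().item())           # the matmul runs on the Intel GPU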


> They aim to replace CUDA with their proprietary equivalents

Neither oneAPI nor ROCm is proprietary?


They're single-vendor-controlled (and vendor-specific implementation-wise). The question of whether the spec is proprietary is less significant.


SYCL (closely related to oneAPI) isn't single-vendor-controlled. It's a Khronos open standard. If you take a look at the spec, you'll see contributions from various universities, Qualcomm, Huawei, Argonne, Altera, and AMD (Xilinx I think). Intel just adopted it (and bought Codeplay, the original contributor).

oneAPI is a set of SYCL-related standards. Originally that was Intel's, but now it's owned by the UXL Foundation, which is part of the Linux Foundation.


>Vulkan backend to AI libraries like PyTorch

How many man hours is a project like this?


A couple of times in the past I wanted to port open-source ML models from CUDA/Python to a better technology stack. I have ported Whisper https://github.com/Const-me/Whisper/ and Mistral https://github.com/Const-me/Cgml/ to D3D11. I don't remember how much time I spent, but given both were unpaid part-time hobby projects, probably under 160 hours each.

These software projects were great to validate the technology choices, but note I only did the bare minimum to implement specific ML models. Implementing a complete PyTorch backend is gonna involve dramatically more work. I can't even estimate how much more, because I'm not an expert in Python or these Python-based ML libraries.


Wow, a very nice reimplementation.

To go on a tangent, I note your custom 'BCML1' 5-bit-per-weight compression codec and your optimised hand-coded AVX2 to encode it... was that really needed? Are the weights encoded on every startup? Why not do it once and save to disk?


> Are the weights encoded on every startup?

Not really, that code only runs while importing the PyTorch format. See readme for the frontend app: https://github.com/Const-me/Cgml/tree/master/Mistral/Mistral... When loading the model from *.cgml, the model file already contains compressed tensors. That’s how that file is only 4.55 GB, versus 13.4 GB in the original model.
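
Back-of-the-envelope check of those sizes (assuming roughly 7.2B parameters for Mistral 7B; a few tensors presumably stay uncompressed and GB vs GiB muddies it, so it won't match exactly):

    params = 7.24e9                      # approximate parameter count of Mistral 7B

    fp16_gb  = params * 2 / 1e9          # 16 bits per weight
    bcml1_gb = params * 5 / 8 / 1e9      # 5 bits per weight

    print(f"fp16 : {fp16_gb:.1f} GB")    # ~14.5 GB, same ballpark as the original weights
    print(f"5-bit: {bcml1_gb:.2f} GB")   # ~4.5 GB, same ballpark as the 4.55 GB .cgml file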

> was that really needed?

For desktops with many CPU cores, a simpler scalar version would probably work equally well. Still, low-end computers don't always have many cores to spare for these background encoding tasks. Also, CPU usage on laptops translates to battery drain.


This is great!


> Typically, software developers only support a single GPGPU API, and that API is nVidia CUDA.

There are very few NVIDIA cards, and little CUDA, outside of x86_64 and maybe OpenPOWER architectures, so I disagree. Also, OpenCL, despite being kind of a "betrayed standard", enjoys quite a lot of popularity even on x86_64 (sometimes even with NVIDIA hardware) - even if it is not as popular there.

> AMD cards often outperform nVidia equivalents

Can you link to benchmarks or other analysis supporting this claim? This has not been my impression in recent years, though I don't routinely look at high-end AMD hardware.

> because nVidia prioritized ray tracing and DLSS over compute power and memory bandwidth.

Has it really?


> Can you link to benchmarks or other analysis supporting this claim?

Current generation nVidia: https://en.wikipedia.org/wiki/GeForce_40_series#Desktop

Current generation AMD: https://en.wikipedia.org/wiki/Template:AMD_Radeon_RX_7000

The key performance characteristics are processing power (TFlops) and memory bandwidth (GB/s).

nVidia 4080 which costs $1200: 43 TFlops FP16, 43 TFlops FP32, 0.672 TFlops FP64, 717 GB/s memory.

AMD 7900 XTX which costs $1000: 93 TFlops FP16, 47 TFlops FP32, 1.5 TFlops FP64, 960 GB/s memory.

Note that for applications which bottleneck on FP16 compute (many ML workloads) or FP64 compute (many traditional HPC workloads: numerical solvers, fluid dynamics, etc), the 7900 XTX even outperforms the 4090 which costs $1600.


In ML workloads, usually the FP16 operations are matrix operations. On RDNA3, these execute at the same rate as normal shader/vector operations, but on Nvidia RTX cards there are Tensor cores which accelerate them. The Ada whitepaper lists 48.7 shader TFlops (not 43 because boost vs base clock), and 195 TFlops for FP16 Tensor with FP16 Accumulate. That's 4 times faster than regular, and almost double what the XTX lists!

Ampere and newer also have native sparsity support, which means that you can skip over the zeroes 'for free'; Nvidia uses this to market double the TFlops, which is kind of misleading imo. But the 195 TFlops figure is even before sparsity is included!

I'm not sure if the 93 TFlops (120 with boost clocks) on AMD are with FP16 or FP32 accumulation, as with FP32 accumulation the 4080 slows down significantly and gets much closer with 97.5 TFlops.

Intel Xe-HPG (used in the Arc A cards) also offers very aggressive matrix acceleration via XMX, with 137.6 FP16 TFlops at base clock, vs. 17.2 FP32 TFlops.


You’re comparing general-purpose computations to a proprietary feature with limited applicability. For example, in my Mistral implementation the most expensive compute shader is handling matrices compressed with a custom codec unknown to tensor cores.

This is not an apples-to-apples comparison. We use FLOPS numbers to compare processors of different architectures because they’re the same FLOPS everywhere.


>This is not an apples-to-apples comparison.

Neither is FLOPS. RDNA3 FLOPS numbers are greatly inflated, because they are only applicable to fairly specific VOPD and WMMA instructions, the former being for two MADs at the same time, and the latter being applicable in the same cases as tensor cores.

Besides, it should be possible to use the tensor cores combined with the codec: you can use vector hardware to decode matrix values to fp16/fp32, then yeet those into the tensor cores. Although most of the time will probably be spent on the decoding part, assuming you're doing matrix-vector multiplication and not matrix-matrix (which might be different with MoE models like Mixtral 8x7B?)


RDNA3 FLOPS numbers are real. The shader compiler happily uses VOPD for quite a few things, including FP32 FMA.

An independent researcher measured 62.9 FP32 TFlops (boost) in Vulkan on a 7900 XTX; see the "Full GPU Throughput – Vulkan" section of that article: https://chipsandcheese.com/2023/01/07/microbenchmarking-amds...


That is interesting, even though I would need to look at non-consumer-GFX cards. What about actual benchmarks, though? Passmark, for example:

https://www.videocardbenchmark.net/directCompute.html

has:

GeForce RTX 4090 28,916

RTX 6000 Ada 23,296

GeForce RTX 4080 22,016

Radeon RX 7900 XTX 18,767

and Geekbench 6:

https://browser.geekbench.com/opencl-benchmarks

has:

GeForce RTX 4090 321928

L40 292357 <- L40S is out and may be better

RTX 6000 Ada 274348

H100 267514

NVIDIA GeForce RTX 239863

Radeon RX 7900 XTX 199412


Neither of those benchmarks is relevant to ML, which is what the GP is talking about.


So what is relevant? As the other poster said, nVidia tends to outpace AMD cards in pretty much every use case in real life.


The only important numbers are processing power (TFlops) and memory bandwidth (GB/s).

If your compute shader doesn't approach the theoretical limit of either computation or memory, it doesn't mean there's anything wrong with the GPU. Here's an incomplete list of possible reasons.

● Insufficient parallelism of the problem. Some problems are inherently sequential.

● Poor HLSL programming skills. For example, a compute shader with 32 threads/group wastes 50% of the compute units on most AMD GPUs; the correct number for AMD is 64 threads/group, or a multiple of 64. BTW, nVidia and Intel are fine with 64 threads/group; they run 1 thread group as 2 wavefronts, which does not waste any resources.

● The problem being too small to compensate for the overhead. For example, CPUs multiply two 4x4 matrices in a small fraction of the time it takes to dispatch a compute shader for that. You're gonna need much larger matrices for GPGPU to win (see the rough timing sketch below).
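
A rough way to see that last point from Python (hypothetical sizes; requires any CUDA- or ROCm-visible GPU, and the absolute numbers will vary a lot between machines):

    import time
    import torch

    def bench(device, n, iters=10):
        a = torch.randn(n, n, device=device)
        b = torch.randn(n, n, device=device)
        if device != "cpu":
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            _ = a @ b
        if device != "cpu":
            torch.cuda.synchronize()   # wait for the queued kernels to actually finish
        return (time.perf_counter() - t0) / iters

    for n in (4, 2048):
        cpu = bench("cpu", n)
        gpu = bench("cuda", n) if torch.cuda.is_available() else float("nan")
        # For n=4 the CPU wins easily (dispatch overhead dominates);
        # for n=2048 the GPU should win by a wide margin.
        print(f"n={n}: cpu {cpu*1e6:.1f} us, gpu {gpu*1e6:.1f} us")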


OpenCL is stuck on C99; that is why most people ignored it and adopted CUDA in the first place.

The attempts to bring C++ to it were never properly done, and that is why OpenCL 3.0 is basically OpenCL 1.0 rebranded.


OpenCL 2.0 was ignored because implementing shared virtual memory was mandatory, but it simply isn't possible to implement without cache coherence all the way into the GPU memory. Now that Intel has developed their own GPUs, they have gone out of their way to completely ignore SVM and went with their own proprietary USM extension of OpenCL, and the Intel driver developers themselves said SVM is difficult to get right. Intel and AMD were the only ones implementing OpenCL 2.0, and the SVM implementation by AMD was basically unusable.

OpenCL 3.0 got rid of the mandatory features of OpenCL 2.0 that nobody was willing to implement.


Basically a wall of text going around the fact that Intel and AMD never delivered anything in OpenCL that would make developers actually care.

They only needed to make something half as usable as CUDA, in polyglot support and graphical development tooling for GPGPU.

I was once in a Khronos panel webinar on HPC, where no one on the panel understood the relevance of Fortran support in OpenCL; no wonder, then, that Fortran workloads get deployed with CUDA.


OpenCL has had a C++ variant for a while. But it is like you said: the consortium members, especially NVIDIA, did not do things properly, acting either half-heartedly or in an intentionally broken manner.


Always blaming NVidia, when Intel and AMD never did anything to properly support it.


>Always blaming NVidia

As LLMs and AI take over the heated topics of discussion on HN, it is hard not to notice the tendency to blame Nvidia for everything, both hardware and software.


I think we should be blaming the OS platform actors and user communities for letting the lunatics run the asylum (hw vendors dictating sw). And then MS & Apple both went and did their own proprietary stuff.


AMD's products have generally failed to gain traction because their implementations are half-assed and buggy and incomplete (despite promising more features, these are often paper features or resume-driven development from now-departed developers). All of the same "developer B" stuff from OpenGL really applies to OpenCL as well.

http://richg42.blogspot.com/2014/05/the-truth-on-opengl-driv...

that's the reason blender kicked their openCL implementation to the curb after years of futzing with it. it wasn't portable because AMD's openCL was so broken they had to have their own special codepath anyway, and it didn't perform well and wasn't worth the effort.

AMD has left a trail of abandoned code and disappointed developers in their wake. These two repos are the same thing for AMD's ecosystem and NVIDIA's ecosystem, how do you think the support story compares?

https://github.com/HSA-Libraries/Bolt

https://github.com/NVIDIA/thrust

in the last few years they have (once again) dumped everything and started over; ROCm supported essentially no consumer cards and rotated support rapidly even in the CDNA world. It offers no binary compatibility story; it has to be compiled for specific chips within a generation, not even just "RDNA3" but "Navi 31 specifically". Etc etc. And nobody with consumer cards could access it until, like, six months ago, and that is still only on windows; consumer cards are not even supported on linux (!).

That's fine for non-commercial and HPC, but it makes it difficult to imagine ever shipping commercial products to end-users on it. I guess you... ship as source, and your users have to figure out ROCm and compile it in-situ? Or ship a binary for every single GPU to ever release (including NVIDIA, Intel, ....)

https://geohot.github.io/blog/jekyll/update/2023/06/07/a-div...

This is on top of the actual problems that still remain, as geohot found out. Installing ROCm is a several-hour process that will involve debugging the platform just to get it to install, and then you will probably find that the actual code demos segfault when you run them.

AMD's development processes are not really open, and actual development is silo'd inside the company with quarterly code dumps outside. The current code is not guaranteed to run on the actual driver itself; they do not test it even in the supported configurations, because all the development is happening on internal builds etc., not release code.

oh and AMD themselves aren't using the open driver of course... they only test and validate on AMDGPU-PRO.

it hasn't got traction because it's a low-quality product and nobody can even access it and run it anyway, let alone ship commercial products on it. and now everyone regrets it because it's turned into a money fountain.


DirectML is part of DirectX 12.



