You'd think AMD would swing in on something like this and fund it with the money needed to succeed. I have no knowledge of it but my guess is no, AMD never misses an opportunity to miss an opportunity - when it comes to GPUs and AI.
AMD pays the bare minimum in software to get a product out the door. The company does not even have working performance testing and regressions routinely get shipped to customers. Benchmarks the executives see are ad hoc and not meaningful.
HipKittens is an improvement but AMD does not have the ability to understand or track kernel performance so it'll be ignored.
This isn't fixable overnight. Company-wide DevOps and infrastructure are outsourced to TCS in India, who have no idea what they're doing. Teams with good leadership maintain their own shadow IT teams. ROCm didn't have such a team until hyperscalers lost their shit over our visibly poor development practices.
Even if AMD did extend an offer to hire all the people in the article, it would be below-market since the company benchmarks against Qualcomm, Broadcom, and Walmart, instead of Google, Nvidia, or Meta.
We haven't had a fully funded bonus in the past 4+ years.
> considering how well it appears AMD is executing from the outside.
The party line is that the stock price is up because the market expects us to perform well in the future, and we won't get a bonus until we actually perform well.
This doesn't sound right. I definitely got yelled at over trivial performance regressions which looked like noise, so people were measuring performance.
They've paid serious amounts in RSUs over the last six years. Not top of market by any stretch but firmly in the category of engineers don't care what the steak costs. Bonus might be team dependent, I remember being annoyed and nicely surprised by it in different years.
The AQL profiler confuses me quite a lot, but it's definitely a tool for measuring performance.
I don't think anon is correct, but I can understand how they'd come to their conclusions. I certainly didn't choose AMD to maximize my pay, though it's always been a comfortable salary.
With regards to performance, there are some things tracked carefully and other things that are not tracked at all. I suspect that is why some folks think we're really good at it and others think we're terrible. There's lots of room for improvement, though. Excitement over trivial performance regressions is more a sign of immaturity than of good tracking.
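For what it's worth, telling a real regression from noise doesn't need much machinery. Here is a minimal sketch, assuming you have repeated timings of the same benchmark before and after a change. The function name `is_regression` and the `min_slowdown` and `noise_factor` thresholds are made up for illustration and are not anything AMD or ROCm actually uses.

```python
import statistics

def is_regression(baseline, candidate, min_slowdown=0.02, noise_factor=3.0):
    """Return True if candidate timings look slower than baseline beyond run-to-run noise.

    baseline, candidate: lists of timings (same unit, lower is better).
    min_slowdown: hypothetical relative threshold (2% by default).
    noise_factor: hypothetical multiple of the baseline spread the slowdown must exceed.
    """
    base_med = statistics.median(baseline)
    cand_med = statistics.median(candidate)
    # Median absolute deviation of the baseline as a robust estimate of noise.
    mad = statistics.median([abs(x - base_med) for x in baseline])
    delta = cand_med - base_med
    # Flag only slowdowns that are both relatively large and outside the noise band.
    return delta > min_slowdown * base_med and delta > noise_factor * mad

baseline = [10.0, 10.1, 9.9, 10.2, 9.8]
within_noise = [10.1, 10.0, 10.2, 9.9, 10.05]
clearly_slower = [11.0, 11.1, 10.9, 11.2, 11.0]
print(is_regression(baseline, within_noise))    # False
print(is_regression(baseline, clearly_slower))  # True
```

Medians and the median absolute deviation are used here because a single outlier run shouldn't be enough to get anyone yelled at. A real setup would also want to pin hardware, clocks, and software versions, which this sketch ignores.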
> I definitely got yelled at over trivial performance regressions which looked like noise, so people were measuring performance.
It depends on the team. We have some testing, and progress is being made. But it's not "working" or comprehensive, as we still get complaints from our big customers. We should be replicating their setups internally and not have them catch problems.
> Not top of market by any stretch but firmly in the category of engineers don't care what the steak costs.
We need to pay top of market to steal people from our competitors. We can't pay less than Nvidia and outcompete them. Paying less is a signal we're aiming for second and to copy the market leader.
The MBAs are in charge, and now AMD is the new Intel?
It's not only not fixable overnight, but it's not fixable at all if the leadership thinks they can coast on simply being not as bad as Intel, and Intel has a helluva lot of inertia and ability to simply sell OEM units on autopilot.
Sounds like the AMD board needs to get their heads out of their asses and shake up leadership.
But the real issue is we don't want to invest in beating Nvidia on quality. Otherwise we wouldn't be doing stock buybacks and instead use the money on poaching engineers.
The mindset is that we maintain a comfortable second place by creating a shittier but cheaper product. That is how AMD has operated since 1969 as a second source to Fairchild Semiconductor and Intel. It's going to remain the strategy of the company indefinitely with Nvidia. Attempting to become better would cost too much.
> Sounds like the AMD board needs to get their heads out of their asses and shake up leadership.
Knocking out Lisa Su would be stupid, since she has the loyalty of the whole company and is generally competent.
What they should do is bump TC by 60-70% and simultaneously lay off 50% of the engineers. Or phase in the same over a longer period of time. The company is full of people that do nothing because we've paid under market for so long. That's fine when competing against Intel, it's not acceptable when competing against Microsoft, Amazon, OpenAI, Google, and Nvidia.
Lisa Su is the only CEO in the S&P500 who can get away with mass layoffs and still have the loyalty of the rest of the employees.
> What they should do is bump TC by 60-70% and simultaneously lay off 50% of the engineers.
I was part of a company with a similar problem. If AMD’s situation is similar to what I dealt with, it’s more complicated. When you start doing deep-cut layoffs at the IC level combined with expectations of big salary increases for those who remain, the office politics escalate to a level I didn’t know was possible.
All of those people who do nothing find a way to join forces with the people who are showing those inflated benchmarks to execs, and before you know it the layoffs are about as accurate as random chance when it comes to cutting the dead weight from the company.
In my experience, the change needs to start closer to the top: Upper layers of management need to be shaken up. Middle management needs to be audited by new upper management hires who have fresh eyes and aren’t afraid to make honest evaluations. High performing teams who are stuck in management hell need to be identified and rotated into other projects that are critical for the company but have become occupied by fiefdom-building managers. Hiring needs to ramp up to bring in new talent that was previously priced out by the low comp.
It’s hard. I wish there was an easy way to cut the low performers, but they have an amazing way of teaming up with the bad managers. Maybe because they have so much free time to do office politics because they’re not doing much work.
> Maybe because they have so much free time to do office politics because they’re not doing much work.
I mean, isn’t that always the way? Honestly, I feel like you could do a lot worse than just firing most of the people who demonstrate above average social skills. Sure, some would be fired unnecessarily, but I can’t think of any engineers that have seemed almost pathologically shy that also didn’t want to work hard.
Came into this thread hoping for good news about GPUs and instead there's some surprisingly thoughtful management discussion!
> What they should do is bump TC by 60-70% and simultaneously lay off 50% of the engineers.
Tell me you're an engineer without telling me you're an engineer. The problem is they don't know which half and they can't know. It's an issue of legibility and transparency - put yourself into the shoes of the C-suite. You're staring down a complete black box of, what, 5,000 people. How can you possibly know who's good and who's not? Think of the information they have at hand - what the chain of command tells them. What if the chain of command itself is the problem? Think about how you yourself could protect a bad employee if you were a manager. You could! How can they possibly find the truth?
People rightly hate stack ranking, but you can see why ideas like that exist - attempts to come up with organizational pruning algorithms that are resistant to the managers themselves being the problem.
And this is also why CEOs incoming with a turnaround mission often do a clean sweep and stack the C-suite with all their friends. Not because they're giving jobs to their mates - although sure, that does happen - but because they're trying to establish at least a single layer of trust, which can then in time hopefully be extended downwards. But it all takes time, and some organizations never manage it. When unlimited orgs all compete for the same limited number of good managers - well, some of them are going to lose.
Ironically I'm bullish on AI being able to greatly help with all of this. Maybe running on AMD GPUs...
> How can you possibly know who's good and who's not? Think of the information they have at hand - what the chain of command tells them. What if the chain of command itself is the problem? Think about how you yourself could protect a bad employee if you were a manager. You could! How can they possibly find the truth?
Senior managers should look at what people are actually doing. It doesn't take that much time. If tickets and PRs/MRs/changes are searchable by author, reviewer, and the files they touch (if they aren't, that's your problem right there) then it takes a few minutes to figure out who did the critical work, and who doesn't do much of anything.
In big tech, I've had senior managers (1-3 levels up) that do this, and ones that don't. The ones that do it are great managers. Under this type of manager, people are usually focused on making things actually work and making projects successful. The ones that don't do it can be good managers, but usually aren't. Under these types is where politics festers and dominates, because why wouldn't it? If you don't let the actual work guide your understanding, you're left with presentations and opinions of others.
When I do this (a few times a year), it takes 10 minutes for the easy cases, 1 hour for the hard cases, and once you do a few of these kinds of investigations in the same work area, you start to understand what the collaborators are doing before even looking at them specifically. So you're talking a few weeks of work for 100s of people. A few weeks a few times a year is not too much to ask someone to spend on their primary responsibility as a senior manager.
Past some point in scale, this does become impractical; I don't expect the CEO of a 10k person company to be doing this. But at that scale, the metrics are different anyways.
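As a rough illustration of the "searchable by author, reviewer, and the files they touch" point, here is a minimal sketch that tallies authored and reviewed changes for one area of a codebase. The record layout (`author`, `reviewers`, `files`), the function name `activity_by_person`, and the sample data are all hypothetical; a real version would pull these records from whatever ticketing or code review system is in use.

```python
from collections import Counter

def activity_by_person(changes, path_prefix=""):
    """Count authored and reviewed changes that touch files under path_prefix.

    changes: iterable of dicts with hypothetical keys
             "author" (str), "reviewers" (list of str), "files" (list of str).
    Returns two Counters: (authored, reviewed).
    """
    authored = Counter()
    reviewed = Counter()
    for change in changes:
        # Skip changes that do not touch the area being investigated.
        if not any(path.startswith(path_prefix) for path in change["files"]):
            continue
        authored[change["author"]] += 1
        for reviewer in change["reviewers"]:
            reviewed[reviewer] += 1
    return authored, reviewed

# Made-up sample records, for illustration only.
changes = [
    {"author": "alice", "reviewers": ["bob"], "files": ["kernels/gemm.cpp"]},
    {"author": "alice", "reviewers": ["carol"], "files": ["docs/readme.md"]},
    {"author": "bob", "reviewers": ["alice", "carol"], "files": ["kernels/attn.cpp"]},
]
authored, reviewed = activity_by_person(changes, "kernels/")
print(authored.most_common())
print(reviewed.most_common())
```

Counts like these only tell you where to start reading. The investigation described above is about looking at the actual changes, not ranking people by a number.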
> AMD never misses an opportunity to miss an opportunity
Well said. Their Instinct parts are, at a hardware level, very capable pieces of kit that - ignoring the software/dev ecosystem - are very competitive with Nvidia.
Problem is, AMD has a terrible history of supporting its hardware (either just outright lack of support, cough Radeon VII; or constantly scrapping things and starting over and thus the ecosystem never matured) and is at a massive deficit behind the CUDA ecosystem. That means a lot of the hardware's potential is squandered by the lack of compatibility with CUDA and/or a lack of investment in a comparable alternative. Those factors have given Nvidia the momentum it has, because most orgs/devs will look at the support/ecosystem delta and ask themselves why they'd expend the resources reinventing the CUDA wheel to leverage AMD hardware when they can just spend that money/time investing in CUDA and Nvidia instead.
To their credit, AMD seems to have learned its lesson: they're actually investing in ROCm and their Instinct ecosystem and seem to be sticking to their guns on it. We're starting to see people pick it up, but they're still far behind Nvidia and CUDA.
One key hardware area where Nvidia is far ahead of AMD is networking.
> constantly scrapping things and starting over and thus the ecosystem never matured
AMD hires talented people at below-market and doesn't promote them or give raises. This causes employees to aim at resume-driven development by reinventing the wheel so they can get a job somewhere else.
It's a similar problem to Google, except at Google it's because promotions are explicitly for people that ship new products.
Our hardware is arguably better (spec for spec) apart from critical areas like memory bandwidth and GPU-to-GPU bandwidth. You can tweak your implementations to get the same if not better performance. We do that, we see this, our customers see this.
ROCm, pre-Rock, suffers from ossification in the engineering organization. The Rock seeks to completely change that, and the team driving it is amazing. Try out the pre-alpha installer. It is already better than the default installer.
Indeed. For clarity, I agree the performance is certainly there. My comment about being behind was in the context of marketshare and ecosystem maturity compared to CUDA. In fact, I'd say there's more than just hope but actual meaningful progress and commitment being made there, and I'm happy to see it.
I wouldn’t even look at it like they are learning their lesson. The total addressable market is 1T according to them, and they are usually very conservative with their approach and projections. They will solve the software issue because there is simply too much money in it.
From the performance comparison table, basically AMD could be NVIDIA right now, but they aren’t because… software?
That’s a complete institutional and leadership failure.
Ironically, building chips is the actual _hard_ part. The software and the compilers are not trivial but the iteration speed is almost infinite by comparison.
It goes to show that some companies just don’t “get” software. Not even AMD!
Funnily enough, AMD was actually the first with GPGPU... they just floundered and managed to start 3 or more completely new software stacks for it, while CUDA focused not just on keeping one backward-compatible stack, but also on making it work from the cheapest NVS card to the high-end parts.
You're saying it like hardware and software are disjoint. You design hardware with software in mind (and vice versa); you need to if you want performance rivaling Nvidia. This codesign, making sure their products are not only usable but actually tailored to maximize resource utilization in real workloads (not driven by whatever benchmarks), is where AMD seems to be lacking.
Why oversimplify the premise and frame your take as some 'proof'? Just call it a counter-argument or counter-example.
AMD have had people contribute optimised ROCm kernels in the past. They closed the PR without merge. ROCm are not interested in this. Baffling behaviour.