GoFetch: New side-channel attack using data memory-dependent prefetchers (gofetch.fail)
297 points by kingsleyopara on March 21, 2024 | 92 comments


As long as we're getting efficiency cores and such, maybe we need some "crypto cores" added to modern architectures, that make promises specifically related to constant time algorithms like this and promise not to prefetch, branch predict, etc. Sort of like the Itanium, but confined to a "crypto processor". Given how many features these things wouldn't have, they wouldn't be much silicon for the cores themselves, in principle.

This is the sort of thing that would metaphorically drive me to drink if I were implementing crypto code. It's an uphill battle at the best of times, but even if I finally get it all right, there's dozens of processor features both current and future ready to blow my code up at any time.


Speaking as a cryptography implementer, yes, these drive us up the wall.

However, crypto coprocessors would be a tremendously disruptive solution: we'd need to build mountains of scaffolding to move work onto and off of these cores, to share memory with them, etc.

Even more critically, you can't just move the RSA multiplication to those cores and call it a day. The key is probably parsed from somewhere, right? Does the parser need to run on a crypto core? What if it comes over the network? And if you even manage to protect all the keys, what if a CPU side channel leaks the message you encrypted? Are you ok with it just because it's not a key? The only reason we don't see these attacks against non-crypto code is that finding targets is very application specific, while in crypto libraries everyone can agree leaking a key is bad.

No, processor designers "just" need to stop violating assumptions, or at least talk to us before doing it.
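
For concreteness, here's the kind of assumption we're talking about. Constant-time code is written in a branch-free, mask-based style roughly like the sketch below (a generic illustration, not any particular library's code), precisely so that neither branches nor memory addresses ever depend on secrets. A prefetcher that dereferences loaded values because they happen to look like pointers pulls secret-dependent addresses back into the picture behind our backs.

    #include <stdint.h>
    #include <stddef.h>

    /* Classic constant-time conditional copy: no secret-dependent branches,
     * no secret-dependent addresses. 'choice' must be 0 or 1. */
    static void ct_cond_copy(uint8_t choice, uint8_t *dst,
                             const uint8_t *src, size_t len) {
        uint8_t mask = (uint8_t)(0u - choice);   /* 0x00 or 0xFF */
        for (size_t i = 0; i < len; i++)
            dst[i] = (uint8_t)((src[i] & mask) | (dst[i] & (uint8_t)~mask));
    }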


Processor designers are very unlikely to do that for you, because everyone not working on constant time crypto gives them a whole lot of money to keep doing this. The best you might get is a mode where the set of assumptions they violate is reduced.


> No, processor designers "just" need to stop violating assumptions, or at least talk to us before doing it.

No, you don't get to say processor designers need to stop violating your assumptions. You need to stop making assumptions about behaviour if that behaviour is important (for cryptographic or other reasons). Your assumptions being faulty is not a valid justification, because by that logic no one could ever have added any caches or predictors at any point, since that would be "violating your assumptions". Also, let's be real here: even if "not violating your assumptions" were a reasonable position to take, it is not reasonable in any way to assume that modern processors (<30 years old) don't cache, predict, buffer, or speculate anything.

If you care about constant-time behaviour you should either write your code such that it is timing agnostic, or read the platform documentation rather than making assumptions. The Apple documentation tells you how to actually get constant-time behaviour.


> you should either be writing your code such that it is timing agnostic, or you could read the platform documentation rather than making assumptions

Have you even read the paper? Especially the part where the attack applies to everyone’s previous idea of “timing agnostic” code, and the part where Apple does not respect the (new) DIT flag on M1/M2?


No, the paper targets "constant time" operations, not timing agnostic.

The paper even mentions that blinding works, and that to me is the canonical "separate the time and power use of the operation from the key material" solution. The complaint about this approach in the paper is that it would be specific to these prefetchers, but this type of prefetcher seems increasingly prevalent across multiple CPUs and architectures, so it is unlikely to stay Apple-specific for long. The paper even mentions that new Intel processors have these prefetchers and so necessarily provide functionality to disable them there too. This is all before we get to the numerous prior articles showing that key extraction via side channels is already possible with these constant-time algorithms (a la last month's (I think?) "get the secrets from the power LED" paper). The solution is to use either specialized hardware (as done for AES) or timing-agnostic code.

Trying to create side channel free code by clever construction based on assumptions about power and performance of all hardware based on a simple model of how CPUs behave is going to just change the side channels, not remove them. If it's a real attack vector that you are really concerned about you should probably just do best effort and monitor for repeated key reuse or the like, and then start blinding at some threshold.
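
To be concrete about the blinding I mentioned above: for RSA it's the classic trick of multiplying the input by r^e for a fresh random r before the private-key operation and stripping r off afterwards, so the exponentiation never runs directly on attacker-chosen data. A rough sketch using OpenSSL's BIGNUM API (illustrative only, error handling trimmed, not hardened code and not the paper's code):

    #include <openssl/bn.h>

    /* Blinded RSA private-key operation: m = c^d mod n, computed as
     * ((c * r^e)^d) * r^{-1} mod n for a fresh random r, so the
     * exponentiation never sees the raw attacker-chosen input. */
    int rsa_blind_decrypt(BIGNUM *m, const BIGNUM *c,
                          const BIGNUM *d, const BIGNUM *e, const BIGNUM *n,
                          BN_CTX *ctx) {
        int ok = 0;
        BIGNUM *r = BN_new(), *r_inv = BN_new(), *blinded = BN_new();
        if (!r || !r_inv || !blinded)
            goto done;

        /* Fresh random blinding factor r in [1, n), invertible mod n. */
        do {
            if (!BN_rand_range(r, n))
                goto done;
        } while (BN_is_zero(r) || !BN_mod_inverse(r_inv, r, n, ctx));

        /* blinded = c * r^e mod n */
        if (!BN_mod_exp(blinded, r, e, n, ctx) ||
            !BN_mod_mul(blinded, blinded, c, n, ctx))
            goto done;

        /* m = blinded^d * r^{-1} mod n */
        if (!BN_mod_exp(m, blinded, d, n, ctx) ||
            !BN_mod_mul(m, m, r_inv, n, ctx))
            goto done;

        ok = 1;
    done:
        BN_clear_free(r);
        BN_clear_free(r_inv);
        BN_free(blinded);
        return ok;
    }

It's per-algorithm work like this that makes blinding unattractive as a blanket fix, which is the complaint I referenced above.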


> processor designers "just" need to stop violating assumptions

"Security" rarely (almost never) seems to be part of any commercially-significant spec.

Almost as if by design...


Wouldn't that "just" allow someone to see that a key was present (and whatever that implies), while dramatically helping to prevent extraction of the secret key itself?


I don't think the security community is also going to become experts in chip design; these are two full skill sets that are already very difficult to obtain.

We must stop running untrustworthy code on modern full-performance chips.

The feedback loop that powers everything is: faster chips allow better engineering and science, creating faster chips. We’re not inserting the security community into that loop and slowing things down just so people can download random programs onto their computers and run them at random. That’s just a stupid thing to do, there’s no way to make it safe, and there never will be.

I mean we’re talking about prefetching. If there was a way to give ram cache-like latencies why wouldn’t the hardware folks already have done it?


I almost gave you an upvote until your third paragraph, but now I have to give a hard disagree. We're running more untrusted code than ever, and we absolutely should trust it less than ever and have hardware and software designed with security in mind. Security should be priority #1 from here on out. We are absolutely awash in performance and memory capacity, but keep getting surprised by bad security outcomes because security has been second fiddle for too long.

Software is now critical infrastructure in modern society, akin to the power grid and telephone lines. Neglecting security is a strategic vulnerability, and addressing it must happen at all levels of the software and hardware stack. By strategic vulnerability I mean an adversary trying to crash an entire society by bricking all of its computers and sending it back to the dark ages in milliseconds. I fundamentally don't understand the mindset of people who want to take that kind of risk for a 10% boost in their games' FPS[1].

Part of that is paying back the debt that decades of cutting corners has yielded us.

In reality, the vast majority of the 1000x increase in performance and memory capacity over the past four decades has come from shrinking transistors and increasing clockspeeds and memory density--the 1 or 5 or 10% gains from turning off bounds checks or prefetching aren't the lion's share. And for the record, turning off bounds checks is monumentally stupid, and people should be jailed for it.

[1] I'm exaggerating to make a point here. What we trade for a little desktop or server performance is an enormous, pervasive risk. Not just melting down in a cyberwar, but the constant barrage of intrusion and leaks that costs the economy billions upon billions of dollars per year. We're paying for security, just at the wrong end.


Turning off bounds checks is like a 5% performance penalty. Turning off prefetching is like using a computer from twenty years ago.


Turning off prefetching just while running crypto code would be a net performance gain, given that otherwise you can't implement the algorithms safely without even more expensive and fragile software mitigations. Just give me the option of configuring parts of the caches (at least data + instructions + TLBs) as scratchpad, and a "run without timing side-channels pretty please" bit with a clearly defined API contract, accessible (by default) to unprivileged userspace code. Lots of cryptographic algorithms have such small working sets that they would profit from a constant-time-accessible scratchpad in the L1d cache if they got to use data-dependent addresses into it again.
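
To make "clearly defined API contract" concrete, something like this hypothetical userspace interface is what I have in mind (every name below is made up; nothing like it ships today):

    /* Hypothetical interface -- none of these functions exist anywhere.
     * This is only what a usable contract might look like to a crypto
     * implementer, per the wish above. */
    #include <stddef.h>

    /* Carve out a chunk of L1d as scratchpad; loads/stores to it are
     * guaranteed constant-time even with data-dependent addresses. */
    void *ct_scratchpad_acquire(size_t bytes);
    void  ct_scratchpad_release(void *pad, size_t bytes);  /* zeroize + return */

    /* Bracket a region in which the core promises data-independent timing:
     * no DMP, no data-dependent prefetch, no value-dependent instruction
     * latencies. Must be callable from unprivileged code. */
    int ct_region_begin(void);
    int ct_region_end(void);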


Happily, there are mechanisms to do just that, specifically for the purpose of implementing cryptography (I commented at the top level and don't want to keep spamming the URL).


I agree that hardware/software codesign is critical to solving things like this, but features like prefetching, speculation, and prediction are absolutely critical to modern pipelines and, broadly speaking, are what enable what we think of as "modern computer performance." This has been true for over 20 years now. In terms of "overhead" it's not in the same ballpark -- or even the same sport, frankly -- as something like bounds checking or even garbage collection. Hell, if the difference were within even one order of magnitude, they'd have done it already.


> I fundamentally don't understand the mindset of people who want to take that kind of risk for a 10% boost in their games' FPS[1]

Me neither. But lots of engineers are out there writing single-threaded Matlab and Python code with lots of data dependencies, just hoping the system manages to do a good job (for those operations that can't be offloaded to BLAS). So I'm glad gamer dollars subsidize the development of fast single-threaded chips that handle branchy code well.

> In reality, the vast majority of the 1000x increase in performance and memory capacity over the past four decades has come from shrinking transistors and increasing clockspeeds and memory density

I disagree; modern designs include deep pipelines, lots of speculation, and complex caches because that's the only way to turn that higher transistor budget into performance at higher clocks and to compensate for the fact that memory latencies haven't kept up.

> Part of that is paying back the debt that decades of cutting corners has yielded us.

It will be tough, but yeah, server and mainframe users need to roll back the decision to repurpose consumer-focused chips like the x86 and ARM families. RISC-V is looking good though, and seems open enough that maybe they can pick and choose which features they take.

> I almost gave you up an upvote until your third paragraph, but I have to now give a hard disagree.

I’m not too worried about votes on this post; this site has lots of web devs and cloud users, pointing out that the ecosystem they rely on is impossible to secure is destined to get lots of downvotes-to-disagree.


How is RISC-V going to solve anything here?


It isn’t a sure thing. Just, since it is a more open ecosystem, maybe the designers of chips that need to be able to safely run untrusted code can still borrow some features from the general population.

I think it is basically impossible to run untrusted code safely or to build sand-proof sandboxes, but I thought the rest of my post was too pessimistic.


It is significantly less complex, without compromising anything. This means a larger portion of a chip's design effort can be put elsewhere, such as into preventing side-channel attacks.


I don't really see how the design of RISC-V avoids the need to have a DMP


>I don't really see how the design of RISC-V avoids the need to have a DMP

Because it does not. I also do not see where, if at all, such a claim was made.


Perhaps you should explain how this design effort spent preventing side-channel attacks is spent, then?


Anything specific?


> download random programs onto their computers and run them at random

To be clear that includes what we're all doing by downloading and running Javascript to read HN.

Maybe I can say "don't run adversarial code on my same CPU" and only care about over-the-network CPU side-channels (of which there are still some), because I write Go crypto, but it doesn't sound like something my colleagues writing browser code can do.


Speak for yourself; I've got JavaScript disabled on news.ycombinator.com and it works just fine.


Is this exploitable through JavaScript?

In general from what I've seen, most of these JS-based CPU exploits didn't strike me as all that practical in real world conditions. I mean, it is a problem, but not really all that worrying.


> Is this exploitable through JavaScript?

Why wouldn't it be?


How is JavaScript going to run a chosen-input attack against one of your cores for an hour?


If you leave a tab open that's running that JS..


Because JS/HTML provides APIs to perform cryptography (I can't recall whether the cryptography specs are part of ES or HTML/DOM) - if you try to implement constant-time cryptography in JS you will run into a world of hurt, because the entire concept of "fast JS" depends on heavy speculation, plus lots of exciting variation in the timing of even "primitive" operations.


No, the attack would be implemented in JS, not the victim code (though, that too, but that's not what's interesting here).


Ah, you’re concerned about person using js to execute the side channel portion of the attack, not the bit creating the side channel :)


FYI malicious JS executing in victim users' browsers is a huge concern. All sorts of vulnerabilities can be exploited via JS in this way -- every local side-channel like Spectre/Meltdown, worse things like Row Hammer, etc.


Unfortunately somebody has tricked users into leaving JavaScript on for every site; it is a really bad situation.


Security and utility are often at odds. The safest possible computer is one buried far underground, with no cables, in a Faraday cage. Not very useful.

> We’re not inserting the security community into that loop and slowing things down just so people can download random programs onto their computers and run them at random. That’s just a stupid thing to do, there’s no way to make it safe, and there never will be.

Setting aside JavaScript, you can see this today with cloud computing, which has largely displaced private clouds. These providers run untrusted code on shared computers. Fundamentally that's what they're doing, because that's what you need for economies of scale, durability, availability, etc. So figuring out a way to run untrusted code on another machine safely is fundamentally a desirable goal. That's why people are trying to do homomorphic encryption - so that the "safely" part can go both ways: neither the HW owner nor the "untrusted" SW needs to trust the other in order to execute said code.


> The feedback loop that powers everything is: faster chips allow better engineering and science, creating faster chips. We’re not inserting the security community into that loop and slowing things down just so people can download random programs onto their computers and run them at random. That’s just a stupid thing to do, there’s no way to make it safe, and there never will be.

Note that in the vast majority of cases, crypto-related code isn't what we spend compute cycles on. If there was a straightforward, cross-architecture mechanism to say, "run this code on a single physical core with no branch prediction, no shared caches, and using in-order execution" then the real-world performance impact would be minimal, but the security benefits would be huge.
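
The closest thing that exists today only covers the "single physical core" part: you can pin a thread to a core (Linux sketch below; macOS has no public equivalent that I know of), but there is no portable way to also ask that core to stop prefetching or speculating, which is exactly the missing piece:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread to one core. This isolates it from other cores'
     * cache sharing but does nothing about prefetchers or speculation on
     * the core itself -- the part that would need new hardware support. */
    static int pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return sched_setaffinity(0 /* calling thread */, sizeof(set), &set);
    }

    int main(void) {
        if (pin_to_core(0) != 0)   /* core index is arbitrary here */
            perror("sched_setaffinity");
        /* ... run the sensitive routine here ... */
        return 0;
    }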


I'm in favor of adding to chips some horrible in-order, no-speculation, no-prefetching, 5-stage-pipeline, Architectures-101 core that can be completely verified and made bulletproof.

But the presence of this bulletproof core would not solve the problem of running bad code on modern hardware, unless all untrusted code is run on it.


I think what's more likely is "mode switching" in which you can disable these components of the CPU for a certain section of executing code (the abstraction would probably be at the thread level).


Isn't that the entire point of the secure enclave[1]?

https://support.apple.com/guide/security/secure-enclave-sec5...


The secure enclave is not a general-purpose/user-programmable processor. It only runs Apple-signed code, and access is only exposed via the Keychain APIs, which only support a very limited set of cryptographic operations.

Presumably latency for any operation is also many orders of magnitude higher than in-thread crypto, so that just doesn't work for many applications.


If you look at the CryptoKit API docs, the Secure Enclave essentially only supports P-256. Which is maybe why they didn't include ECC crypto in the examples.


Encrypted-bus MMUs have existed since the 1990s.

However, the trend toward consumer-grade hardware for cost-optimized cloud architecture ate the CPU market.

Thus, the only real choice now is consumer CPUs, even in scaled applications.


Many modern architectures have crypto extensions, usually to accelerate a few common algorithms. Maybe it would be good to add a few crypto-primitive instructions to enable new algorithms?


See the DIT and DOIT flags referenced in the paper and in the FAQ question about mitigations; newer CPUs apparently provide facilities to do just that.


One option would be for people to stop downloading viruses and then running them.


Except when these vulnerabilities are exploitable from JavaScript in your web browser.


From the paper: "OpenSSL reported that local side-channel attacks (...) fall outside of their threat model. The Go Crypto team considers this attack to be low severity".


At least one Go crypto developer publicly expressed concern about this very issue in 2021: https://github.com/golang/go/issues/49702


The end result of these side-channel attacks would be CPUs that perform no optimizations at all, where every opcode runs in the same number of cycles in every situation. But that will never happen. No one wants a slow CPU.

As long as these effects cannot be exploited remotely, it's not a concern. Of course, multi-tenant cloud-based virtualization would be a no-go.


We need to drop all the untrusted code onto some horrible in-order, no-speculative-execution, no-prefetching, 5-stage-pipeline core straight out of an Architectures 101 class.


It might be preferable.

We have ridiculously fast hardware. In many use cases (client machines in particular) we do not usually really need that. I would gladly drop features for security.


If you account for all of the CPU "features" that can be exploited, you're looking at probably 80% of what makes it "ridiculously fast". If you also account for all of the ways in which the entire modern hardware ecosystem can be exploited, you're probably looking at gross performance loss of over 90% to remove these "holes".

An overclocked 486 PC that can only run a single program at a time and isn't continuously connected to a network might be very secure, but replacing every modern computer with something like it will not be even remotely feasible. In most situations, it would be better to have some risk tolerance, and couple modern hardware with mitigations, disposability, and supply-chain security instead.


It will also be good because users will become more annoyed when people try to sneak full programs into their websites, hopefully resulting in a generally less bloated internet.


If untrusted code includes JavaScript that would make Web apps ridiculously slow. (I know what you're thinking...)


Oh no, a totally unexpected side effect, less complex webpages.


> multi-tenant cloud-based virtualization

And that's why I'm not as worried about this as I was about the same vulnerability in Intel chips a few years ago.

There are a few cloud service providers that will rent you clock cycles on a rack-mounted Mac Mini, but not many, and even then they're for highly-specific workloads or build tasks. I suppose that's a problem for people paying far out the butt for that kind of service, but the vast majority of Apple Silicon devices are never, ever going to host cloud services.


This is why high core counts and isolation matter. Isolate the code to a specific core. Assuming everything is working as intended, an exploit won’t compromise other tenants.


> Can the DMP be disabled?

> Yes, but only on some processors. We observe that the DIT bit set on m3 CPUs effectively disables the DMP. This is not the case for the m1 and m2.

Surely there is a chicken bit somewhere to do this?


I've often wondered how these bits are set.

Like, can you do it from Swift? Or do you need assembly?


It's probably in an MSR accessible from the kernel?


It seems to be userspace accessible: https://developer.apple.com/documentation/xcode/writing-arm6...

The kernel would have to be aware of it in order to be able to restore its state across context switches though, unless it's part of a set of registers that is automatically persisted. But given that Apple is publicly documenting this flag, I suppose it is.

Here's an interesting conversation among the Go developers, who were already suspicious of DIT as early as 2021: https://github.com/golang/go/issues/49702


No, that’s something else. I’m talking about the thing that disables DMP, which would not be part of the standard architecture.


Given the quote, I don't understand why you think these are different.


While that may be what they observed, in theory the DMP and DIT can be orthogonal, since leaking data through the DMP happens after the fact, from caches that might have been populated by code running in constant time. More generally, you can't really know whether such effects are eliminated, because DIT specifies some architectural level of "things take the same time" and doesn't actually tell you more about what is going on in the chip. If Apple mistakenly decides that the DMP is not sensitive and forgets to wire it up to DIT, then you'll be stuck.


So, what you are saying is that you believe the authors were incorrect when they stated "We observe that the DIT bit set on m3 CPUs effectively disables the DMP."; like, your response to my question is (effectively) "I don't believe that quote"?


My response to your specific question is "I believe them when they say that, but there is no need for this to be true; in fact Apple apparently didn't do it in older chip revisions, and I'm not sure that is a bug". However, I do believe the authors were incorrect when they said "there is no way to disable the DMP on M1 and M2" (surely there is one, it just doesn't involve DIT).


No one claimed it needed to be true, merely that it is true: if we believe the first part of the quote (as you claim you do), Apple clearly decided at some point -- maybe due to a dawning realization of this very kind of attack, even if the organization didn't model it as such -- to have DIT also (if putting it that way makes you feel better) disable this feature. At that point the mechanism is available to userland... which you claimed it would not be. That assumption honestly doesn't make sense anyway, since nothing prevents a new bespoke M-specific mechanism / register / whatever -- even an undocumented one! -- from being available to userland.


Apple typically does not make these kinds of things (namely, special Apple silicon stuff) accessible to userland. I think they probably have some specific agreement with ARM to not do it.


On reading, it seems a library like libsodium could simply set the disable bit prior to sensitive cryptographic operations on M3 and above.

It also looks like the attacker needs to predetermine aspects of the key.

Very cool, but I don't think it looks particularly practical.


Reminded me of the Augury attack[1] from 2022, which also exploits the DMP prefetcher on Apple Silicon CPUs.

[1]: https://www.prefetchers.info


BTW, three of the authors of GoFetch were also behind Augury.


Yes, they specifically mention that in the article and FAQ.


Why does Apple have so many hardware backd... innocent bugs?


why do we even need caches?

why do we need prefetchers?

But in answer to your bullshit backdoor conspiracy theory (JFC, processors have caches and timing variance because people want fast CPUs, you cannot have both constant time and fast, and Apple is not the only company with prefetchers), here's some Apple-provided documentation on how to disable the hardware backd... enable constant-time operations, specifically for the purpose of cryptography, almost like it's designed into the hardware. So weird. https://developer.apple.com/documentation/xcode/writing-arm6...


The M1 and M2 don't have that bit.


Same reason Intel and AMD had Meltdown and Spectre.


If you're writing cryptographic routines you should either use the platform cryptography libraries, or follow the documentation:

https://developer.apple.com/documentation/xcode/writing-arm6...
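
The gist of that documentation, as I understand it: check that the CPU has the DIT feature, set PSTATE.DIT around the sensitive code, and restore it afterwards. A rough sketch of the shape of it (assumes clang on an arm64 Apple target and that the hw.optional.arm.FEAT_DIT sysctl name is right; the linked page is the authoritative version, and per the paper DIT only appears to rein in the DMP on M3):

    #include <stddef.h>
    #include <sys/types.h>
    #include <sys/sysctl.h>   /* sysctlbyname (macOS) */

    /* Does this CPU report the DIT feature? */
    static int dit_supported(void) {
        int v = 0;
        size_t len = sizeof(v);
        if (sysctlbyname("hw.optional.arm.FEAT_DIT", &v, &len, NULL, 0) != 0)
            return 0;
        return v;
    }

    /* Set/clear PSTATE.DIT (ARMv8.4-A "MSR DIT, #imm"). A real implementation
     * would save and restore the previous value instead of clearing blindly. */
    static inline void dit_enable(void)  { __asm__ volatile ("msr DIT, #1" ::: "memory"); }
    static inline void dit_disable(void) { __asm__ volatile ("msr DIT, #0" ::: "memory"); }

    /* Placeholder for your constant-time routine. */
    static void do_secret_key_op(void) { }

    void secret_key_op_with_dit(void) {
        if (dit_supported()) {
            dit_enable();
            do_secret_key_op();
            dit_disable();
        } else {
            do_secret_key_op();  /* fall back; consider software mitigations */
        }
    }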


So malware scanning and virus scanners just became relevant for Macs and iPads.

(The attacking code must be running on the same hardware.)


Is it naive to ask whether implementing this mitigation would impact performance and memory interaction speed?


What's the attack vector here? Access to an encryption oracle and co-location on the target machine?


Why does every attack need its own branding, marketing page, etc.? Genuine question.


Science isn't just about discovering information. Dissemination is critical. Communicating ideas is just as important as discovering them and promotion is part of effective communication. It's natural and healthy for researchers to promote their ideas.


Names are critical to enable discussion.

The "marketing" page is where documentation is. Summaries that don't require reading a whole academic papers are a good thing, and they are the place where all the different links are collected. Same reason software has READMEs.

Logos... are cute and take 10-60 minutes? If you spend months on some research might as well take the satisfaction of giving it a cute logo, why not.


Well, names are useful for the same reason people's names are useful. The rest just kinda happens naturally, I think.


Yes, it saves time vs. starting a discussion on "that crypto cache sidechannel attack that one team in China found".


Name makes enough sense. "Branding, marketing page, etc..." was my question.

"Happens naturally" isn't really an answer.


Is your position that any write-up about an attack must be plain text only, and must not use its own URL?

I truly cannot understand why this is brought up so often. You aren't paying for it, it doesn't hurt you in any way, it detracts nothing from the findings (in fact, it makes the findings easier to discuss), etc. There is no downside I can think of.

Can you share what the downsides of a picture of a puppy and a $5 domain are? Sorry, "branding" and "marketing page"?

Or at least, maybe you can share what you think would be a more preferable way?


Dunno, but I'm glad they do it. In other fields of research, researchers often purposely hold off on naming something, so that the community kind of has no choice but to name it after the authors themselves.

Eg in my field, they would have called Spectre "the Horn-Genkin-Hamburg vulnerability" or something. Which one of these is hard-to-remember jargon, and which one is catchy and evocative?


It's science these days. They need funding, and one way to get it is to get people to recognize the importance of their work.


So people talk about it


Why does the comments of every such attack need a question about why it has its own branding, marketing page, etc…? Genuine question.

(Seriously, this comes up every time, just do a search for it if you actually want to figure out why.)


Because it makes it feel like you need a marketing department if you want to publish your work. Rather than giving merit only to the work itself, we give too much merit to its colorful presentation. That shouldn't be the case.


Good communication has always been a part of making sure your work is influential.



