
If you were anticipating the stock to drop sharply in the future, this may be cheaper.

Their only accountability is to the stock price. The insanity will continue.

As long as our stock price continues to... Continues to rise... Which... Hmm... I'm just now reading our balance sheet. Is this number right? Great, thanks.

As I was saying, you're all fired.


I’m willing to bet that most of us here are capable of acquiring pitchforks and torches.

I predict that will be their comeuppance; it will begin a new era in history.


Because nobody reputable reported on it?

Reputable reporters know that publishing those stories leads to break-in burglaries where everyone is killed and nothing is stolen.

Or with hands tied and two gunshot wounds to the back of the head, and it's ruled a suicide (Gary Webb).

Or crawled into a suitcase and zipped it up = suicide.

https://en.wikipedia.org/wiki/Death_of_Gareth_Williams


> A subsequent Metropolitan Police re-investigation concluded that Williams's death was "probably an accident"

My sides


You think that reputation was earned without submission to intelligence agencies?

Capital, politicians, conservatives, libertarians,


We’ve gotta add American Liberals, majority of Democratic Party to the list. The Sanders faction is unfortunately not yet the prevailing force.


At the very top, yes. Unfortunately, many workers on both sides of the fence run interference or collaborate for those at the top; it's one of the reasons the Molly Maguires never win, or rarely win for too long.


Around the world "liberal" is synonymous with "capitalist". US is pretty unique in that it considers liberalism a leftwing ideology


Left/Right alignment is relative, and the American political center is...where it is, and has been drifting rightward since Bill Clinton's "Third Way".


No longer true; the left wing in the U.S. started splitting from the liberals over a decade ago and that's more or less complete at this point.


Yeah for sure, speaking purely on the American common framing of big L Liberals as akin to social liberals rather than classical/economic liberals.


Don't forget liberals too!


I doubt it's the greatest given all that's happened in the past year. But it's certainly up there, no pun intended.


It's hard to say how much it contributed to the pre-eminence of modern-day China. But overall the rise of China surely dominates anything that's happened in the last year. No other nation even comes close to vying for hegemony with the US. We could have another full-on Vietnam-esque quagmire in Iran and it wouldn't even be a blip in comparison.


Server load, for sure.


How do you know this? I'm not trying to attack your statement, I am genuinely curious how anyone knows anything about model performance outside of benchmarks that are already in the training set.


Using them, you kind of get a feeling for skill level and can extrapolate from that better than from juiced benchmarks.


Wtf is a policy? Is this some sort of RL thing that I'm too ML to understand?

Gemini tells me it's the probability of the next token for an LLM. Okay then.


The policy is how you select your actions -- in this case, the next token. It can be random, but it doesn't have to be. "Deterministically choose the best action" is a valid policy (we would call it the greedy policy), as long as you have some other means of injecting stochasticity so the model explores the space. Uniform random is also a valid policy, as is always selecting the same token (it obviously wouldn't be very performant, and would defeat the purpose here, but it might be fine in, for example, a multi-armed bandit scenario). Most of the time, the policy is a parameterized distribution, and we want to learn the model parameters that maximize some measure of success (the reward component).

Off-policy versus on-policy refers to what data the model is trained on. On-policy training is where the training data is collected by the policy. Off-policy training is where the data was collected by a different sampling process (e.g. we have a standard dataset that we're going to use for supervised training).
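To make that concrete, here is a rough sketch with a toy three-token vocabulary. Everything in it (`VOCAB`, `toy_policy`, the "favor the last token" rule, the fixed dataset) is made up for illustration and is not how a real LLM or the paper does it; the point is only that a policy is a function from context to a distribution, and that greedy, sampled, and uniform selection are all valid ways to act.

```python
import random

VOCAB = ["a", "b", "c"]  # toy vocabulary, purely illustrative


def toy_policy(context):
    """Hypothetical policy: maps a context (list of tokens) to a
    probability distribution over the next token."""
    # Toy rule: favor repeating the last token. Not how a real LLM works.
    last = context[-1] if context else None
    weights = [3.0 if tok == last else 1.0 for tok in VOCAB]
    total = sum(weights)
    return {tok: w / total for tok, w in zip(VOCAB, weights)}


def uniform_policy(context):
    """Uniform random is also a valid policy."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}


def greedy_action(policy, context):
    """Deterministically choose the most probable action."""
    dist = policy(context)
    return max(dist, key=dist.get)


def sample_action(policy, context, rng):
    """Draw an action at random according to the policy's distribution."""
    dist = policy(context)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


def collect_on_policy(policy, prompts, rng):
    """On-policy data: continuations sampled from the policy being trained."""
    return [(p, sample_action(policy, p, rng)) for p in prompts]


# Off-policy data: examples produced by some other process,
# e.g. a fixed supervised dataset (made-up entries here).
OFF_POLICY_DATA = [(["a"], "c"), (["b"], "a")]
```

Training on `collect_on_policy(toy_policy, ...)` would be on-policy; training on `OFF_POLICY_DATA` would be off-policy, because the policy had no hand in producing it.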


It’s quite common these days to treat an LLM as a policy in the sense that it takes as a “state” the previous context, and its task is to choose a continuation, as an “action”. It gets a “reward” from a reward model that was trained on human preferences, or from a verifiable source, such as passing test cases.

This framing has been active for several years, as it’s the framing that enables RLHF and RLVR. RLHF itself is quite old; I think it has been in use since the original ChatGPT.


What is this comment? It’s an RL paper, these are standard RL terms


It's a comment. On Hacker News. Not the RL subreddit, or whatever. I'm just amazed at the jargon. I'm sure it's useful, but one could just call it model output.



> one could just call it model output.

That would be incorrect. My other reply attempts to address this.


But the probability vector is the output of the LLM, no?


> But the probability vector is the output of the LLM, no?

In some contexts yes, but that's not actually the policy. As I wrote in my other comment (quoting because I think it's worth highlighting):

> "the policy is a function that, given some context, assigns probabilities to possible next tokens."

In the same sentence, I also incorrectly referred to this as a "probability distribution", but that's not accurate: it's a function that produces a probability distribution. The policy instantiated at a specific context produces a probability distribution.

In fact, you'd be closer to the mark if you called the policy "the model", but the two terms emphasize different aspects - as I said, "policy" views it from an RL perspective. From that perspective, the policy is a function, the model is an implementation of that function.

Besides, "output of the LLM" is ambiguous. It commonly means the actual generated token(s) (or text), not the probability distribution. Depending on context, "output of the LLM" could refer to (1) logits, (2) the probability distribution, (3) a single selected token, (4) the full generated text.
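A tiny sketch of those four meanings, using made-up logits over a three-word vocabulary (the numbers and names are assumptions for illustration only, and greedy selection is just one possible way to pick the token):

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["yes", "no", "maybe"]           # toy vocabulary
logits = [2.0, 1.0, 0.0]                 # (1) raw scores, made-up values
probs = softmax(logits)                  # (2) the probability distribution
token = vocab[probs.index(max(probs))]   # (3) a single selected token (greedy)
text = " ".join([token])                 # (4) the generated text, one token long here
```

All four of `logits`, `probs`, `token`, and `text` could reasonably be called "the output", which is the ambiguity being described.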

"Policy" has no such ambiguity - it has a precise definition. That's why technical subjects rely on jargon in the first place, but it results in the exact issue we're running into here: "Jargon enables quick and precise communication among insiders, but it is usually confusing or unintelligible to outsiders."


Yes, I understand one function of jargon, which can be useful to insiders in that it conveys a precise meaning. But, it can be confusing to outsiders, and that is also a useful thing for insiders. In the context of LLMs, what other function can produce p(next token) if not the LLM? And, you just about make my point for me with regards to jargon being confusing by misidentifying what the policy actually is (something i never would have noticed :) In any case, it's an interesting paper. Thanks for all your down votes everyone.


The LLM is the whole car and policy is a specific part.


> In the context of LLMs, what other function can produce p(next token) if not the LLM?

You're thinking about it from a specific implementation-oriented perspective. Policy is a well-defined theoretical concept that generalizes beyond LLMs - as we've discussed, it comes from RL. If one is discussing the use of RL techniques on LLMs, it can make sense to use well-defined RL terminology.

Here's a definition from Sutton & Barto's RL intro (https://web.stanford.edu/class/psych209/Readings/SuttonBarto...):

> "At each time step, the agent implements a mapping from states to probabilities of selecting each possible action. This mapping is called the agent’s policy and is denoted π_t, where π_t(a|s) is the probability that A_t = a if S_t = s. Reinforcement learning methods specify how the agent changes its policy as a result of its experience. The agent’s goal, roughly speaking, is to maximize the total amount of reward it receives over the long run."

If you apply this definition to an LLM, you find that the model itself becomes the implementation of a policy. But narrowing one's thinking about this to purely thinking about it in terms of what an LLM happens to do to implement a policy is not necessarily a good idea for a researcher.

As Sutton & Barto go on to write:

> "This framework is abstract and flexible and can be applied to many different problems in many different ways. For example, the time steps need not refer to fixed intervals of real time; they can refer to arbitrary successive stages of decision-making and acting. The actions can be low-level controls, such as the voltages applied to the motors of a robot arm, or high-level decisions, such as whether or not to have lunch or to go to graduate school. Similarly, the states can take a wide variety of forms. [...]"

Referring to this as a policy connects it to a much broader body of work that's highly relevant to the problem being studied.

---

> And, you just about make my point for me with regards to jargon being confusing by misidentifying what the policy actually is (something i never would have noticed :)

It's quite the opposite. The jargon exists to make things precise, so that it becomes easier to identify when some nuance has been accidentally dropped, as in this case. It's bad faith to claim that a mistake in my attempt to simplify things for you proves your point.

> But, it can be confusing to outsiders, and that is also a useful thing for insiders.

You should be careful that you're not using anti-intellectual conspiracy theorizing to justify your refusal to try to understand the purpose of terminology you happen to be unfamiliar with.


But me asking questions is in fact my trying to understand, is it not? I ask a stupid simple question with a slightly rude tone, and then I get downvoted by a bunch of pedantic insiders. Although to be fair it appears some are trying to help.

Look dude, every field develops its own terminology. It's not a conspiracy, just an emergent property. But it always makes getting into the field, or understanding what's in the field, much harder than it needs to be.


Gemini didn't really say that exactly, did it? Because it's oversimplified to the point of being wrong.

“Policy” here refers to a probability distribution, i.e. a function that, given some context, assigns probabilities to possible next tokens. It's what a model’s behavior looks like when viewed through an RL lens.

The paper discusses “on-policy” and “off-policy” training, which is central to its idea.

Off-policy training is what happens in standard supervised fine-tuning (SFT): the model is trained on examples that were produced independently of the model. This means that the examples have a different distribution than what the model produces. This can have a negative effect on previously learned capabilities.

On-policy training (in this context) uses data generated by the model itself. It samples the model’s own outputs, scores them against whatever results are being trained for, and updates the model based on those scores. This reinforces certain aspects of the model's own pretrained behavior, so is a "gentler" way to change the model's behavior. The authors claim that this reduces "catastrophic forgetting" and other negative consequences of SFT.


> “Policy” here refers to a probability distribution, i.e. a function that, given some context, assigns probabilities to possible next tokens.

This should say "...refers to a function that produces a probability distribution." The latter half of the quoted sentence describes it correctly.


Thanks, very good explanation. One question: one could mix both kinds of policies; are there hybrid policies (with samples from both the inner and outer distributions)? If so, what are they called?


Policies are not of two types. There is just _a_ policy. On- and off- policy are properties of the training process. If you learn a policy using data which was generated using another policy, it is off-policy. If the data was generated using the same policy, it is on-policy. The distinction matters because (very loosely) the nudges that the other policy's data tell you to make are based on the other policy's existing shape, which might be different from your current policy's shape. Typically, an algorithm itself is called off-policy if it does not care about the source of the data. Example: Q-learning. An algorithm is called on-policy if it requires the source of the data to be the policy itself. In practice, you always use a mixture of both, and apply techniques such as importance sampling to mitigate the off-policy data mismatch.

To answer your question, yes, you can use any mixture of data for your training process. Whenever you use off-policy data, depending on your objective, you might have to use some technique to "fix" your updates.
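As a rough illustration of the importance-sampling "fix" mentioned above, here is a toy one-step example. The two policies and the rewards are made-up numbers, not anything from the paper: data is collected under `behavior`, but we want the expected reward under `target`, so each sample is reweighted by target probability over behavior probability.

```python
import random

ACTIONS = ["left", "right"]
REWARD = {"left": 0.0, "right": 1.0}     # made-up rewards
behavior = {"left": 0.5, "right": 0.5}   # policy that generated the data
target = {"left": 0.2, "right": 0.8}     # policy we actually care about


def sample(policy, rng):
    """Draw one action according to the given policy."""
    return rng.choices(ACTIONS, weights=[policy[a] for a in ACTIONS], k=1)[0]


def importance_sampling_estimate(n, rng):
    """Estimate the expected reward under `target` from `behavior` samples."""
    total = 0.0
    for _ in range(n):
        a = sample(behavior, rng)        # off-policy sample
        weight = target[a] / behavior[a]  # importance weight
        total += weight * REWARD[a]
    return total / n


rng = random.Random(0)
estimate = importance_sampling_estimate(20_000, rng)

# Exact expected reward under the target policy, for comparison.
true_value = sum(target[a] * REWARD[a] for a in ACTIONS)
```

Without the weight, the average reward would drift toward the behavior policy's value (0.5 here) rather than the target's (0.8), which is the mismatch the reweighting is meant to correct.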


> some sort of RL thing I'm too ML to understand

Oh boy.


"This is notably fast given that this is the first time that an Android driver bug I reported was patched within 90 days of the vendor first learning about the vulnerability."

This makes me feel better about Google, but also makes me kind of frightened of the rest of Android. I wonder what Apple's response time is?


Android vendors have been notorious about updates for a long time. Part of that is supposedly because all of the phone companies want to distinguish themselves from each other, and so they all want to fork the default Android UI so they can offer some psychedelic UI vision with some brand-specific features. But that means that when an update to stock Android comes out, it's a lot of work to migrate.


I don't think Android UI customization is the main issue. Many vendors are not even able to keep device firmware and Linux kernels in sync. Qualcomm and others are doing monthly bulletins:

https://docs.qualcomm.com/securitybulletin/may-2026-bulletin...

Since a lot of vendors are months or even years behind, their phones are full of known holes.

When it comes to security, basically: GrapheneOS > iOS > PixelOS >> Samsung OneUI >>>>>>>> everybody else.

Sadly, Samsung lets anyone who pays enough push bloatware and analytics on their phones. E.g. AppCloud from an Israeli company, Meta services that stay even when you remove Meta apps (only removable with ADB/UAD), etc. So there are only three somewhat serious options (and for two of them, you still give a lot of analytics to Apple or Google).


How is GrapheneOS able to get around the issue of SoC firmware blobs being slow to roll out?


they aren't, but they often push kernel/system patches faster than Google. they also have more kernel hardening in place, which makes some classes of exploits ineffective.


Mainly by only supporting devices with consistently fast firmware updates (which is how PixelOS is also on the list). Samsung is also mostly on top of their shit, but multiple security features are unavailable to third-party operating systems, so it's unviable.


I've reported security bugs to Apple before. It was a couple of years back, but I remember it taking around 6 months to patch (there were a couple of back-and-forths for me to get a more reliable POC). Maybe 2 months from when I submitted a POC with 100% reproducibility.


At least in the past there have been instances where Apple sat on security bugs for years until they were fixed, one example: https://jonbottarini.com/2021/12/09/dont-reply-a-clever-phis...

I've heard they cleaned up their program recently to respond much quicker nowadays


Not sure how much it helps, but I just run all my Apple devices in "Lockdown mode", don't install apps (use Safari), and try to mostly use Safari in private sandboxed mode.


This makes sense if you’re a human-rights journalist working in a dangerous country, with the threat of state-level actors looking to compromise you.

If you’re not then this seems quite paranoid, bordering on LARPing.


I turned it on a week ago to see what it was like. I expected it to be significantly annoying, but I found basically nothing changed other than a bit of text in Safari that says it's in lockdown mode. Otherwise I wouldn't have been able to tell the modes apart. I was expecting the browser to be slower without JIT or use more battery but I haven't noticed any change, it's all still snappy.

Apple over hypes the "you need to be in significant danger" part. Basically anyone can turn this on and it's fine. The UX seems mostly exactly the same either way.


I take it that you mostly communicate with other people using services that are not iMessage.


I’m not a heavy iMessage user but I use it a bit and I haven’t noticed a difference there either. Photos still load, maybe pdfs wouldn’t work?


It basically degrades back to SMS if you turn this on. Obviously, this is fine for a lot of things, but most people generally expect more than that out of their messaging app in this day and age.


BRING BACK EMOTICONS!


I thought it was common knowledge that all kinds of Americans (not to mention other nations) are routinely compromised with zero-clicks, mostly developed in the US and Israel.


This is the kind of assertion without evidence that just muddies the waters. “All kinds” of people is so vague as to be an almost entirely vacuous category and “routine” means almost nothing without an actual quantification of how prevalent and frequent the problem is.

It’s undeniable that the proverbial guns for hire make it easy (if not cheap) to target basically anyone — but just because the vibes are bad doesn’t mean we can just say “it’s common knowledge that …”

The fact is mitigations are costly in terms of convenience and ease of use. Helping people make informed choices about whether to enable mitigations and bear that cost requires more than platitudes imo


LARPing is imagining that Lockdown mode protects you from state-level actors. It is frankly baffling why an industry that has been laughing for literal decades at even the possibility of stopping state-level actors just turns around and uncritically believes Apple's marketing team with literally zero support, evidence or proof except for a long track record of failure. You would think that extraordinary claims would demand extraordinary evidence.

We have seen multiple software hacks resulting in >10 million dollar payouts. Apple's bug bounty program only pays out 4 million dollars (2 million dollars (2x) more than non-Lockdown) for a zero-click total compromise that can trivially worm to take down hundreds of millions of iPhones simultaneously. Even at the low end of that cyberattack payout range that is still a >2x ROI if your successful cyberattack depends on an iPhone zero-click, with many publicly known attacks being in the 10x ROI range. Lockdown mode, at best, raises the bar slightly for commercial profit-motivated attackers and reduces their profit margin from wildly profitable to slightly less, but still, wildly profitable.
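For what it's worth, the arithmetic behind the ">2x" figure can be sketched like this. It assumes, as this comment does, that the bounty is a fair stand-in for the cost of acquiring the exploit, which is itself debatable; the variable names are only for illustration.

```python
LOCKDOWN_BOUNTY = 4_000_000                     # zero-click bounty with Lockdown mode
STANDARD_BOUNTY = LOCKDOWN_BOUNTY - 2_000_000   # 2 million less without Lockdown
LOW_END_PAYOUT = 10_000_000                     # low end of the ">10 million" payouts


def roi_multiple(payout, exploit_cost):
    """Payout expressed as a multiple of what the exploit cost."""
    return payout / exploit_cost


lockdown_roi = roi_multiple(LOW_END_PAYOUT, LOCKDOWN_BOUNTY)   # 2.5
standard_roi = roi_multiple(LOW_END_PAYOUT, STANDARD_BOUNTY)   # 5.0

# Payout that would correspond to a 10x multiple at the Lockdown bounty price.
payout_for_10x = 10 * LOCKDOWN_BOUNTY                          # 40_000_000
```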

And of course I am using the Apple bug bounty program as merely an available metric with at least some semblance of objective support. There are zero certifications, audits, or analyses that Apple has even attempted that would confirm any claim of protection against state-level actors.


I strongly disagree that there is no evidence that Lockdown mode is effective; there have been numerous exposed, active iOS exploitation campaigns of which none have worked against Lockdown mode. When we're trying to prove a negative, that's actually some of the strongest evidence we can get.

The economics of the device exploitation industry are completely orthogonal to bug bounty payouts; the markets only overlap at the _extreme_ fringes. Trying to use one as a proxy for the other is meaningless.


I don't necessarily disagree but a lot of chains will bail out if they find like the Norton Antivirus app on your phone so


In this case the body of evidence is still quite powerful though, given that not only do we not have any forensic evidence of compromise from a phone with Lockdown Mode, but in all public cases where chains were RE'd back out of the forensic evidence, they don't work when tested on Lockdown Mode! So, there's even signal that the lack of forensics indicating Lockdown Mode compromises is not due to artificial targeting or detonation gates, but rather successful mitigation.

(as an aside): I'm not trying to say Lockdown Mode is infallible; I am sure phones in Lockdown Mode are or will be compromised. But it's clearly a very powerful tool, and to try to argue that it is some kind of marketing-driven conspiracy, against the body of evidence of its success, using bug bounty payout numbers (???), as the grandparent post did, is ridiculous.


I wouldn't say that it's useless but I do want to consider the option that the chains that get caught are probably the ones that are less competent.


That is a total strawman. The standard of “effective” being used by the person I was responding to and Apple themselves is “protects against state actors targeting you”, not “has any benefit whatsoever” or even “has a material benefit”.

Protecting against state actors is not an instantaneous property of the present. It demands durable protection against compromise by state actors who can easily spend tens to hundreds of millions of dollars on teams of hundreds for multiple years to develop novel, durable exploits known only to them. To count as "effective against state actors targeting you", any compromise that exists would have to require an expected resource expenditure beyond what state actors can deploy, or beyond the value they could derive from it, which is going to be in the range of hundreds of millions to billions of dollars.

Protecting against state actors means secure against Iran, Saudi Arabia, China, and the NSA. That is the unsupported marketing bullshit I am calling out.


Apple is almost certainly spending hundreds of millions if not billions on software security.


Sure, but that is not related to anything I said.

I said that “protects against state actors” means the cost of finding an exploit as generally applicable and powerful as a zero-click RCE needs to be on the order of hundreds of millions to billions of dollars per exploit to be problematic for state actors to field.

That amount of resources is a competent team of 100 skilled individuals finding zero zero-click RCEs after 3 years of full time investigation. That could credibly be called secure against state actors, though would still not be out of reach of a real military operation as a hundred million dollars is still just the cost of a single jet fighter.


> We have seen multiple software hacks resulting in >10 million dollar payouts

This sets a nice price bar for exploitation. Is someone willing to pay 10+ million dollars to get access to your phone?

The obvious caveat here is that for a lot less than 10 million dollars someone can be hired to hit you with a metal pipe until you give up your passcode.

> click total compromise that can trivially worm to take down hundreds of millions of iPhones simultaneously

Where is the profit motive in doing this? Possibility is one thing, but a realistic threat is another.


> Is someone willing to pay 10+ million dollars to get access to your phone?

Not yours specifically usually, but there is a lot of money in a general tool that law enforcement can use to read out phones. Of course, most of them focus on physical access. In the few Cellebrite reports/presentations that have leaked, iPhones would fall after a relatively short time (IIRC a few months), but did better than most Android phones (except GrapheneOS).

Also, sometimes you do not need the 10M exploit, you can buy many cheaper exploits and make a chain yourself.

> The obvious caveat here is that for a lot less than 10 million dollars someone can be hired to hit you with a metal pipe until you give up your passcode

If they hit you with a metal pipe, it's likely that you won't survive even if you give up your passcode. So most likely you are protecting something or someone else. Set up a duress PIN so that you have options in that case.


... really? Zero-click RCEs can be used on arbitrarily many phones until they are discovered which usually takes on the order of months. You do not need to burn them on every individual target.

As an example of how they might be used in that fashion for profit, NSO group had a revenue of 240 million dollars in 2020. Many of their customers were governments who wanted to spy on activists and journalists. NSO group was in the business of economies of scale to democratize access to journalist devices by reusing a small stockpile of exploits across many targets with enough revenue to assure a steady stream of new exploits as fast as they were burned.


You’re right, I misstated. It’s not 10 million per exploitation; instead, it limits the pool of people who can exploit you to those willing and able to spend 10 million+ on an exploit.

That is still quite a small pool, and there are other network effects preventing any Joe Bloggs with that much capital from launching an exploitation campaign.


Again, no. You do not need to spend 10 million on an exploit if you are working with a company like NSO Group who sells white-glove access to target individuals as a service. The cost lower bound is going to be on the order of ((cost of exploit) / (number of times exploit can be used)) and the denominator there is going to easily be in the hundreds to thousands. Of course prices are likely to be higher than the minimum due to profit margins.

To, once again, use the same example of NSO Group as it is infamous and well-documented [1]. In 2016 it was 500,000 $ upfront and 650,000 $/year for 10 devices. That article claims Saudi Arabia was monitoring 15,000 phones at an average cost of 10,000 $/phone. In [2] it was 7 million $ for 15 devices, but the upfront versus marginal cost per device is not broken down. And this was a relatively "above-board" company in the sense that they were a legitimate business entity with government deals which commands a premium relative to random unknown blackhat organization with no reputation.
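Spelling out the per-device arithmetic, taking the figures exactly as quoted in this comment (not re-checked against the linked articles). The 10 million $ exploit cost and 1,000 uses in the last line are hypothetical inputs chosen only to illustrate the lower-bound formula; the names are made up.

```python
def per_use_cost_floor(exploit_cost, times_used):
    """Lower bound on cost per target: exploit cost spread over its uses."""
    return exploit_cost / times_used


# 2016 pricing as quoted: 500,000 $ upfront plus 650,000 $/year for 10 devices.
UPFRONT_2016 = 500_000
ANNUAL_2016 = 650_000
DEVICES_2016 = 10
first_year_per_device = (UPFRONT_2016 + ANNUAL_2016) / DEVICES_2016  # 115000.0

# Claimed Saudi deployment: 15,000 phones at an average of 10,000 $/phone.
saudi_total = 15_000 * 10_000  # 150_000_000

# The deal cited in [2]: 7 million $ for 15 devices.
deal_per_device = 7_000_000 / 15  # roughly 466,667

# Hypothetical: a 10 million $ exploit reused against 1,000 targets.
example_floor = per_use_cost_floor(10_000_000, 1_000)  # 10000.0
```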

And again, my original comment was discussing commercial profit-motivated attackers for which 1 million $ is easily within reach and just a cost of doing business to unlock greater amounts of profit. That is less than the cost of setting up a McDonalds. There is a vast, vast gap spanning factors of millions between Joe Schmo and commercial actors and an even vaster gap to state actors. There is no evidence that Lockdown mode is adequate against even commercial actors, let alone the vastly more capable state actors.

[1] https://prodefence.io/news/pegasus-spyware-operating-costs-c...

[2] https://www.reuters.com/business/media-telecom/meta-suit-aga...


No root and no way to firewall them. "Lockdown" mode - a lot of inconveniences.


"If you’re not then this seems quite paranoid, bordering on LARPing."

There are sooooooo many other situations where such device lockdown is warranted. Government intrusion, sensitive industry, journalism, anything ITAR/EAR covered, and more. Your reduction to a single issue is absurd.


Are you at an above average risk of being targeted by a state level threat actor?


No, I just keep the usual tax/financial/health data on my devices.

I consider Anthropic's Mythos security bug finder mostly marketing, but other things make me worry that there might be a global hack contagion: for example, a few months ago I saw in the news that an executive at a US security company was caught selling information to a hacking group.

Except for disabled JavaScript compilation possibly slowing down websites, not getting some attachments in messages, and some graphics not showing up on some websites, having Lockdown mode set doesn't seem to affect anything I do. For dev I use VPSs with SSH configured so that agent forwarding is strictly disabled, as are reverse tunnels.

It seems like doing little things like this make sense because it is such a tiny hassle to be a little safer.


For the most part "AI Exploit Research" is just lots of automated fuzzing. It's nothing new, it just takes time, and they're just throwing a lot of CPU/GPU at that


Given that 42% of Android devices are unpatched as of now [1] it's an interesting decision on their part to release their research and make them all vulnerable

[1] https://gs.statcounter.com/android-version-market-share

[2] https://www.cybersecurity-insiders.com/survey-reveals-over-1...


That's perennially the case. A big portion of the world buys bargain-basement android devices that are unsupported right out of the box.

Search "android phone" on aliexpress and there's top selling phones on the first page running android 8, android 10, etc. They're not getting security updates of any sort, let alone driver updates.


It frustrates me no end that there's so many fly-by-night Android phones available from China. But with zero way to change the software on them. It's not even like they're running weird chips either.

It would be nice to find one where the bootloader is unlockable, and you can just build a standard Android image and flash it..


The old way of keeping security bugs private is just completely broken now. If you aren't on a device that gets security updates you are in significant danger, regardless of what Google decides to publish. No-name hackers are sitting on stacks of exploits these days and are actively using them.


"Now"

Everything you describe is absolutely nothing new. It's literally where the name "0day" comes from.


On brand-name android devices you can count on getting OS security updates. The first-party vendor can build and push these themselves. Driver and firmware security updates are a maybe. These often have to come from an upstream vendor, who may or may not care to fix the issues.

Smaller brands often ship budget android devices and never update them.


Which comments?

