capnrefsmmat's comments

I work on research studying LLM writing styles, so I am going to have to steal this. I've seen plenty of lists of LLM style features, but this is the first one I noticed that mentions "tapestry", which we found is GPT-4o's second-most-overused word (after "camaraderie", for some reason).[1] We used a set of grammatical features in our initial style comparisons (like present participles, which GPT-4o loved so much that they were a pretty accurate classifier on their own), but it shouldn't be too hard to pattern-match some of these other features and quantify them.
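
For a sense of what that pattern-matching looks like, here's a minimal sketch with spaCy (an illustration, not our actual pipeline; note that the VBG tag lumps gerunds in with present participles, so a real tagger needs dependency context to separate them):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def vbg_rate(text: str) -> float:
        # Count VBG-tagged tokens (present participles and gerunds)
        # per sentence, a crude proxy for the participle feature.
        doc = nlp(text)
        n_vbg = sum(1 for tok in doc if tok.tag_ == "VBG")
        n_sents = max(sum(1 for _ in doc.sents), 1)
        return n_vbg / n_sents

    print(vbg_rate("Weaving a rich tapestry, the model keeps generating participles."))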

If anyone who works on LLMs is reading, a question: When we've tried base models (no instruction tuning/RLHF, just text completion), they show far fewer stylistic anomalies like this. So it's not that the training data is weird. It's something in instruction-tuning that's doing it. Do you ask the human raters to evaluate style? Is there a rubric? Why is the instruction tuning pushing such a noticeable style shift?

[1] https://www.pnas.org/doi/10.1073/pnas.2422455122, preprint at https://arxiv.org/abs/2410.16107. Working on extending this to more recent models and other grammatical features now


You may be interested in my collection of links about AI's writing style: https://dbohdan.com/ai-writing-style. I've just added your preprint and tropes.fyi. It has "hydrogen jukeboxes: on the crammed poetics of 'creative writing' LLMs" by nostalgebraist (https://www.tumblr.com/nostalgebraist/778041178124926976/hyd...), which features an example with "tapestry".

> Why is the instruction tuning pushing such a noticeable style shift?

Gwern Branwen has covered this: https://gwern.net/doc/reinforcement-learning/preference-lear....


RLHF is what creates these anomalies. See "delve", which has been attributed to RLHF annotators in Kenya and Nigeria.

Interestingly, because the pretraining objective is to minimize perplexity (equivalently, the average per-token cross-entropy on the training text), the pretrained models should reflect the least surprising outputs of all.
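
For the curious, perplexity is just the exponentiated average per-token cross-entropy, so you can read it off any causal LM. A rough sketch with Hugging Face transformers, using GPT-2 purely as a stand-in (not any lab's actual training code):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("The cat sat on the mat.", return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # next-token cross-entropy; exponentiating gives perplexity.
        loss = model(ids, labels=ids).loss
    print(torch.exp(loss).item())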


No, that doesn't really work so well. A lot of the LLM style hallmarks are still present when you ask them to write in another style, so a good quantitative linguist can find them: https://hdsr.mitpress.mit.edu/pub/pyo0xs3k/release/2

That was with GPT-4, but my own work with other LLMs shows they have very distinctive styles even if you specifically prompt them with a chunk of human text to imitate. I think instruction-tuning with tasks like summarization predisposes them to certain grammatical structures, so their output is always more information-dense and formal than human writing.


The first sentence is a reference to prior research work that has found those productivity gains, not a summary of the experiment conducted in this paper.


In that case it should not be stated as a fact; it should instead be something like the following.

While prior research found significant productivity gains, we find that AI use is not delivering significant efficiency gains on average while also impairing conceptual understanding, code reading, and debugging abilities.


Outside of disciplines that use LaTeX, the ability of authors to do typesetting is pretty limited. And there are other typesetting requirements that no consumer tool makes particularly easy; for instance, due to funding requirements, many journals deposit biomedical papers with PubMed Central, which wants them in JATS XML. So publishers have to prepare a structured XML version of papers.

Accessibility in PDFs is also very difficult. I'm not sure any publishers are yet meeting PDF/UA-2 requirements for tagged PDFs, which include things like embedding MathML representations of all mathematics so screenreaders can parse the math. LaTeX only supports this experimentally, and few other tools support it at all.


I bet if you offer to waive a $1500 fee for authors who submit a LaTeX version, a lot of grad students will learn it pretty fast.


At least in my experience, grad students don't pay submission fees. The fee usually comes out of an institutional account, typically one assigned to the student's advisor (who is generally the corresponding author on the submission). (Not that the waiver isn't a good idea; I just don't think the grad students are the ones who would feel relieved by that arrangement.)

Also, I'm pretty sure my SIG requires LaTeX submissions anyway... I feel like I remember reading that at some point when I submitted once, but I'm not confident in that recollection.


> Outside of disciplines that use LaTeX, the ability of authors to do typesetting is pretty limited.

Since this is obviously true, and yet since most journals (with some exceptions) demand you follow tedious formatting requirements or highly restrictive templates, this suggests, in fact, that journals are outsourcing the vast majority of their typesetting and formatting to submitters, and doing only the bare minimum themselves.


Most of the tedious formatting requirements do not match what the final typeset article looks like. The requirements are instead theoretically to benefit peer reviewers, e.g., by having double-spaced lines so they can write their comments on the paper copy that was mailed to them back when the submission guidelines were written in the 1950s.

The smarter journals have started accepting submissions in any format on the first round, and then only require enough formatting for the typesetters to do their job.


...really? (Incredulous, not doubtful.)

For my area, everybody uses LaTeX styles that more or less produce PDFs identical to the final versions published in proceedings. Or, at least, it's always looked close enough to me that I haven't noticed any significant differences, other than some additional information in the margins.


It didn't "survey" devs. It paid them to complete real tasks while they were randomly assigned to use AI or not, and measured the actual time taken to complete the tasks vs. just the perception. It is much higher quality evidence than a convenience sample of developers who just report their perceptions.


Sure, if you're learning to write and want lots of examples of a particular style, LLMs can generate that for you. Just don't assume that is a normal writing style, or that it matches a particular genre (say, workplace communication, or academic writing, or whatever).

Our experience (https://arxiv.org/abs/2410.16107) is that LLMs like GPT-4o have a particular writing style, including both vocabulary and distinct grammatical features, regardless of the type of text they're prompted with. The style is informationally dense, features longer words, and favors certain grammatical structures (like participles; GPT-4o loooooves participles).

With Llama we're able to compare base and instruction-tuned models, and it's the instruction-tuned models that show the biggest differences. Evidently the AI companies are (deliberately or not) introducing particular writing styles with their instruction-tuning process. I'd like to get access to more base models to compare and figure out why.


Go vibe check Kimi-K2. One of the weirdest models out there now, and it's open weights, with both "base" and "instruct" versions available.

The language it uses is peculiar. It's like the entire model is a little bit ESL.

I suspect that this pattern comes from SFT and RLHF, not the optimizer or the base architecture or the pre-training dataset choices, and the base model itself would perform much more "in line" with other base models. But I could be wrong.

Goes to show just how "entangled" those AIs are, and how easy it is to affect them in unexpected ways with training. Base models have a vast set of "styles" and "language usage patterns" they could draw from - but instruct-tuning makes a certain set of base model features into the "default" persona, shaping the writing style this AI would use down the line.


Kimi tends to be very... casual in my usage, like an informal millennial style, without being prompted to do so.


I definitely know what you mean; each model has its own style. I find myself mentally framing them as horses with different personalities and riding quirks.

Still, perhaps saying "copy" was a bit misleading. "Influence" would have been a more precise way of putting it. After all, there is no such thing as a "normal" writing style in the first place.

So long as you communicate with anything or anyone, I find people naturally absorb the parts they like, most of the time without even noticing.


I don't think the AI companies are systematically working to make their models sound more human. They're working to make them better at specific tasks, but the writing styles are, if anything, even more strange as they advance.

Comparing base and instruction-tuned models, the base models are vaguely human in style, while instruction-tuned models systematically prefer certain types of grammar and style features. (For example, GPT-4o loves participial clauses and nominalizations.) https://arxiv.org/abs/2410.16107

When I've looked at more recent models like o3, there are other style shifts. The newer OpenAI models increasingly use bold, bulleted lists, and headings -- much more than, say, GPT-3.5 did.

So you get what you optimize for. OpenAI wants short, punchy, bulleted answers that sound authoritative, and that's what they get. But that's not how humans write, and so it'll remain easy to spot AI writing.


That's interesting. I had not heard that. I wonder, though, whether making them sound more human and making them better at specific tasks are mutually exclusive. (Or if perhaps making them sound more human is in fact also a valid task.)


In our studies of ChatGPT's grammatical style (https://arxiv.org/abs/2410.16107), it really loves past and present participial phrases (2-5x more usage than humans). I didn't see any here in a glance through the lightfastness section, though I didn't try running the whole article through spaCy to check. In any case it doesn't trip my mental ChatGPT detector either; it reads more like classic SEO writing you'd see all over blogs in the 20-teens.

edit: yeah, ran it through our style feature tagger and nothing jumps out. Low rate of nominalizations (ChatGPT loves those), only a few present participles, "that" as subject at a usual rate, usual number of adverbs, etc. (See table 3 of the paper.) No contractions, which is unusual for normal human writing but common when assuming a more formal tone. I think the author has just affected a particular style, perhaps deliberately.
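
If you want to spot-check one of those features yourself, a crude nominalization counter along these lines gets you most of the way (a suffix heuristic for illustration, not our exact tagger; it overcounts a little, e.g. "city"):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    SUFFIXES = ("tion", "sion", "ment", "ness", "ity", "ance", "ence")

    def nominalization_count(text: str) -> int:
        # Nouns derived from verbs or adjectives, e.g. "utilization",
        # "assessment"; str.endswith accepts a tuple of suffixes.
        doc = nlp(text)
        return sum(1 for tok in doc
                   if tok.pos_ == "NOUN" and tok.text.lower().endswith(SUFFIXES))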


Tangent, but I'm curious about how your style feature tagger got "no contractions" when the article is full of them. Just in the first couple of paras we have it's, that's, I've, I'd...


Probably because the article uses the Unicode right single quotation mark instead of apostrophes, due to some automated smart-quote machinery. I'll have to adjust the tagger to handle those.
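
A minimal fix, assuming the curly apostrophe (U+2019) is the only culprit:

    def normalize_apostrophes(text: str) -> str:
        # Map the typographic right single quotation mark (U+2019),
        # which smart-quote software emits, to a plain apostrophe so
        # contractions like "it's" tokenize correctly.
        return text.replace("\u2019", "'")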


If the output is interpreting sources rather than just regurgitating quotes from them, you need to exert judgment to verify they support its claims. When the LLM output is about some highly technical subject, it can require expert knowledge just to judge whether the source supports the claims.


Courts have always had the power to compel parties to a current case to preserve evidence. (For example, this was an issue in the Google monopoly case, since Google employees were using chats set to erase after 24 hours.) That becomes an issue in the discovery phase, well after the defendant has an opportunity to file a motion to dismiss. So a case with no specific allegation of wrongdoing would already be dismissed.

The power does not extend to any of your hypotheticals, which are not about active cases. Courts do not accept cases on the grounds that some bad thing might happen in the future; the plaintiff must show some concrete harm has already occurred. The only thing different here is how much potential evidence OpenAI has been asked to retain.


> Courts have always had the power to compel parties to a current case to preserve evidence.

Not just that, even without a specific court order parties to existing or reasonably anticipated litigation have a legal obligation that attaches immediately to preserve evidence. Courts tend to issue orders when a party presents reason to believe another party is out of compliance with that automatic obligation, or when there is a dispute over the extent of the obligation. (In this case, both factors seem to be in play.)


Lopez v. Apple (2024) seems to be a recent and useful example of this; my lay understanding is that Apple was found to have failed in its duty to switch from auto-deletion (even if that auto-deletion was contractually promised to users) to an evidence-preservation level of retention, immediately when litigation was filed.

https://codiscovr.com/news/fumiko-lopez-et-al-v-apple-inc/

https://app.ediscoveryassistant.com/case_law/58071-lopez-v-a...

Perhaps the larger lesson here is: if you don't want your service provider to end up being required to retain your private queries, there's really no way to guarantee it, and the only real mitigation is to choose a service provider who's less likely to be sued!

(Not a lawyer, this is not legal advice.)


So if Amazon sues Google, claiming that it is being disadvantaged in search rankings, a court should be able to force Google to log all search activity, even when users delete it?


Yes. That's how the US court system works.

Google can (and would) file to keep that data private and only the relevant parts would be publicly available.

A core aspect of civil lawsuits is that everyone gets to see everyone else's data. It's that way to ensure everything is on the up and up.


A great model – in a world without the Internet and LLMs (or honestly just full text search).


Maybe you misunderstood. The data is required to be retained, but there is no requirement to make it accessible to the opposition. OpenAI already has this data and presumably mines it themselves.

Courts generally require far more data to be retained than shared, even if this ask is much more lopsided.


If Amazon sues Google, a legal obligation to preserve all evidence reasonably related to the subject of the suit attaches immediately when Google becomes aware of the suit, and, yes, if there is a dispute about the extent of that obligation and/or Google's actual or planned compliance with it, the court can issue an order relating to it.


At Google's scale, what would be the hosting costs of this I wonder. Very expensive after a certain point, I would guess.


> At Google's scale, what would be the hosting costs of this I wonder. Very expensive after a certain point, I would guess.

Which would be chump change[0] compared to the costs of an actual trial with multiple lawyers/law firms, expert witnesses and the infrastructure to support the legal team before, during and after trial.

[0] https://grammarist.com/idiom/chump-change/


It can be just anonymised search history in this case.


> It can be just anonymised search history in this case.

Depending on the exact issues in the case, a court might allow that (more likely, it would allow only turning over anonymized data in discovery, if the issues were such that that there was no clear need for more) but generally the obligation to preserve evidence does not include the right to edit evidence or replace it with reduced-information substitutes.


We found out that one was a bad idea in the earliest days of the web, when AOL thought "what could the harm be?" and turned over anonymised search queries to researchers.


How did you jump from a court order to preserve evidence to dumping that data raw into the public record?

Courts have been dealing with discovery including secrets that litigants never want to go public for longer than AOL has existed.


That sounds impossible to do well enough without being accused of tampering with evidence.

Just erasing the userid isn’t enough to actually anonymize the data, and if you scrubbed location data and entities out of the logs you might have violated the court order.

Though it might be in our best interests as a society, we should probably be honest about the risks of this tradeoff; anonymization isn't some magic wand.


So then the courts need to find who is setting their chats to be deleted and order them to stop. Or find specific infringing chatters and order OpenAI to preserve those specified users' logs. OpenAI is doing the responsible thing here.


OpenAI is the custodian of the user data, so they are responsible. If you wanted the court (i.e., the plaintiffs) to find specific infringing chatters, first they'd have to get the data from OpenAI to find who it is -- which is exactly what they're trying to do, and why OpenAI is being told to preserve the data so they can review it.


So the courts should start ordering all ISPs, browsers, and OSs to log all browsing and chat activity going forward, so they can find out which people are doing bad things on the internet.


No, they should not.

However, if the ISP, for instance, is sued, then it (immediately and without a separate court order) becomes illegal for them to knowingly destroy evidence in their custody relevant to the issue for which they are being sued, and if there is a dispute about their handling of particular such evidence, a court can and will order them specifically to preserve relevant evidence as necessary. And, with or without a court order, their destruction of relevant evidence once they know of the suit can be the basis of both punitive sanctions and adverse findings in the case to which the evidence would have been relevant.


If those entities were custodians in charge of the data at hand in the court case, the court would order that.

This post appears to be full of people who aren't actually angry at the results of this case, but angry at how the US legal system has been working for decades, possibly centuries; I don't know when this precedent was first set.


Is it not valid to be concerned about overly broad invasions of privacy regardless of how long such orders have been occurring?


What privacy specifically? The courts have always been able to compel people to recount things they know, which could include a conversation between you and your plumber if it was somehow related to a case.

The company records and uses this stuff internally; retention is about keeping information accurate and accessible.

Lawsuits allow, in a limited context, the sharing of non-public information held by the individuals/companies in the lawsuit. But once you submit something to OpenAI, it's their information, not just your information.


I think that some of the people here dislike (or are alarmed by) the way that the court can compel parties to retain data which would otherwise have vanished into the ether.


> I think that some of the people here dislike (or are alarmed by) the way that the court can compel parties to retain data which would otherwise have vanished into the ether.

Maybe so, but this has been the case for hundreds of years.

After all, how on earth do you propose getting a fair hearing if the other party is allowed to destroy the evidence you asked for in your papers?

Because this is what would happen:

You: Your Honour, please ask the other party to turn over all their invoices for the period in question

Other Party: We will turn over only those invoices we have

*Other party goes back to the office and deletes everything.*

The thing is, once a party in a suit asks for a certain piece of evidence, the other party can't turn around and say "Our policy is to delete everything, and our policy trumps the orders of this court".


I think your points are all valid. On the other hand, this sort of preservation does substantially reduce user privacy, disclosing personal information to unauthorized parties, with no guarantees of security, no audits, and few safeguards.

This is much more concerning (from a privacy perspective) than a company using cookies to track which pages on a website they’ve visited.


> On the other hand, this sort of preservation does substantially reduce user privacy,

Yes, that's by design and already hundreds of years old in practice.

You cannot refuse a court evidence to protect your or anyone else's privacy.

I see no reason to make an exception for rich and powerful companies.

I don't want a party to a suit having the ability to suppress evidence due to privacy concerns. There is no privacy once you get to a civil court other than what the court, at its discretion, allows, such as anonymisation.


I disagree, because the information has already been recorded, and users don't have a say in who is "authorized" to view the data, whether at the company or at some random third party the company sells that data to.

It's the collection itself that's the problem, not how soon the data is deleted once it's economically worthless.

> with no guarantees of security, no audits, and few safeguards.

The courts pay far more attention to that stuff than profit maximizing entities like OpenAI.


I agree that your assessment of the legal state-of-play is likely accurate. That said it is one thing for data to be cached in the short-term, and entirely different for it to be permanently stored and then sent out to parties which the user has only a distant and likely adversarial relationship with.

There are many situations in which the deletion/destruction of ‘worthless’ data is treated as a security protection. The one that comes to mind is how some countries destroy fingerprint data after it has been used for the creation of a biometric passport. Do you really think this is a futile act?

>”The courts pay far more attention to that stuff than profit maximizing entities like OpenAI.”

I would be interested to see evidence of this. The courts claim to value data security, but I have never seen an audit of discovery-related data storage, and I suspect there are substantial vulnerabilities in the legal system, including the law firms. Can a user hold the court or opposing law firm financially accountable if they fail to safeguard this data? I’ve never seen this happen.


> That said it is one thing for data to be cached in the short-term

Cached data isn't necessarily available for data retention to apply in the first place. Just because an ISP has parts of a message in some buffer doesn't mean that counts as a recording of the data. If Google never stores queries beyond what's needed to serve a response, then it likely wouldn't qualify.

Also, it's on the entity providing data for the discovery process to do redaction as appropriate. The only way it ends up at the other end is if it gets sent in the first place. There can be a lot of back and forth here, and, as evidence that the courts do care: https://www.law.cornell.edu/rules/frcp/rule_5.2


That is helpful, thanks, but I think it is not practical to redact LLM request information beyond GDPR personally-identifiable-information standards without just deleting everything. My (admittedly quick) read of those rules is that their 'redacted' information would still be readily identifiable anyway (not directly, but using basic data analysis). Their redaction standards for CC# and SIN are downright pathetic, and allow for easy recovery with modern techniques.


It's not an "invasion of privacy" for a company that already had the data to be prohibited from destroying it when it is sued in a case where that data is evidence.


Yeah, sure. But understanding the legal system tells us who the players are and which systems exist that we might be mad at.

For me, one company being obligated to retain business records during civil litigation against another company, reviewed within the normal discovery process, is tolerable. Considering that the alternative is lawlessness, I'm fine with it.

Companies that make business records out of invading privacy? They, IMO, deserve the fury of 1000 suns.


It’s not private. You handed over the data to a third party.


If you cared about your privacy, why are you handing all this stuff to Sam Altman? Did he represent that OpenAI would be privacy-preserving? Have they taken any technical steps to avoid this scenario?


> So the courts should start ordering all ISPs, browsers, and OSs to log all browsing and chat activity going forward, so they can find out which people are doing bad things on the internet.

Not "all", just the ones involved in a current suit. They already routinely do this anway (Party A is involved in a suit and is ordered to retain any and all evidence for the duration of the trial, starting from the first knowledge that Party A had of the trial).

You are mischaracterising what happens; you are presenting it as "any court, at any time, can order any party who is not involved in any suit in that court to forever hold user data".

That is not what is happening.


Either you didn't read what the other commenter wrote, or you're just arguing in bad faith, which is even weirder because they were only explaining how the system has always worked.


> So then the courts need to find who is setting their chats to be deleted and order them to stop.

No, actually, it doesn't. Ordering a party to stop destroying evidence relevant to a current case (which is its obligation even without a court order) irrespective of whether someone else asks it to destroy that evidence is both within the well-established power of the court, and routine.

> Or find specific infringing chatters and order OpenAI to preserve these specified users’ logs.

OpenAI is the alleged infringer in the case.


Under this theory, if a company had employees shredding incriminating documents at night, the court would have to name those employees before ordering them to stop.

That is ridiculous. The company itself receives that order, and is IMMEDIATELY legally required to comply, from the CEO to the newest-hired member of the cleaning staff.


The Times does not need user logs to prove such a thing, if it's true. The Times can show that it is possible by showing how its own users can access the text. Why would it need other users' data?


> The Times does not need user logs to prove such a thing, if it's true.

No, it needs to show how often it happens, to prove how much impact it has had.


Why would that matter? If people didn't use it as much, does that mean it doesn't matter because there were only a few of them?


> Why would that matter

Because it's a copyright infringement case, the existence and scale of the infringement are relevant both to whether there is liability and, if so, to how much; the issue isn't that it is possible for infringement to occur.


You have to argue damages. It actually has to have cost the NYT some money, and for that you need to know the extent.


We don't even know whether the Times uses AI to get information from other sources. They could get a hint of news and then produce their own material.


OpenAI is also entitled to discovery. They can literally get every email and chat the Times has, and require that from this point on it preserve such logs.


> We don't even know whether the Times uses AI to get information from other sources

Which is irrelevant at this stage. It's a legal principle that both sides can fairly discover evidence. Since finding out how much OpenAI has infringed copyright is pretty critical to the case, they need to find out.

After all, if it's only once or twice, that's a couple of dollars; if it's millions of times, that's hundreds of millions.


Who cares? That's not a legal argument and it doesn't mean anything to this case.


Oh, I was unaware that the Times was inventing a novel technology with novel legal questions.

It's very impressive that they managed such innovation in their spare time while running a newspaper and a site.


For the most part (there are a few exceptions), lawsuits in the US are based not on "possible" harm but on actual observed harm. To show that, you need actual observed user behavior.


> The Times can show that it is possible

The allegation is not merely that infringement is possible; the actual occurrence and scale are relevant to the case.

