
I don't understand exactly why it's necessary to release usernames along with the passwords, or why it's ethical to do so. Stripping the domain portion of email addresses does absolutely nothing when you can find the real email, and other accounts of the victim, by Googling the unique part of the email address.

How does tying each password to its corresponding username help with password research, and does the value gained outweigh the cost of someone using this list for malicious purposes?

I'm not saying this should be illegal, but I'm struggling to understand the intent here.



What about research to determine to what extent usernames containing words in a certain language tend to come with passwords containing words from the same language? (More generally, is there any connection between the bi- or trigram distribution on usernames and the one on passwords? In fact, do they just look the same, or could you tell given a string whether it's more likely a username or a password?)
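A minimal sketch of the bigram comparison, assuming the data is available as (username, password) pairs. The toy pairs and the choice of cosine similarity are illustrative assumptions, not something taken from the dataset or the article.

```python
from collections import Counter
import math


def bigram_distribution(strings):
    """Return relative frequencies of character bigrams across all strings."""
    counts = Counter()
    for s in strings:
        s = s.lower()
        counts.update(s[i:i + 2] for i in range(len(s) - 1))
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()} if total else {}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency dicts (0.0 to 1.0)."""
    dot = sum(v * q.get(k, 0.0) for k, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


# Hypothetical toy pairs, not taken from the real dataset.
pairs = [("mickael", "mickael69"), ("carguy", "mustang1"), ("anna", "qwerty")]
username_dist = bigram_distribution(u for u, _ in pairs)
password_dist = bigram_distribution(p for _, p in pairs)
similarity = cosine_similarity(username_dist, password_dist)
```

A similarity near 1.0 would suggest the two distributions look alike; a low value would suggest a string's bigrams could help tell usernames from passwords.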

Do usernames of people with weaker passwords have something in common? How do they differ from people with stronger passwords? In France there is a practice of picking names like "foobar42" or "foobardu42", where "foobar" is a first name and 42 a "département" (country subdivision) number, which I would associate with casual users. Here I could quantify whether people with usernames of this form tend to pick weaker passwords. Insert your favorite prejudice here about lame and skilled username patterns, and quantify how the password diversity of this group fares in comparison with others.

Is it true that the most common passwords were associated with usernames that were also common? Does username frequency correlate with password frequency? Are there more people with unique usernames or people with unique passwords?

In some countries it is customary to annotate usernames with the user's year of birth. Filtering on such usernames could give insight about the correlation between age and password quality, or identify which passwords are more or less popular given the user's age. You could try to check the correctness of the filter using the fact that some of those people may have used their birthdate (including the year) as a password.
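A rough sketch of such a filter, assuming the year appears as a four-digit suffix between 1930 and 2009 that is not preceded by another digit. Both the range and the suffix rule are assumptions made for illustration; real usernames will need looser or locale-specific rules.

```python
import re

# Assumption: a trailing 4-digit year (1930-2009) not preceded by a digit,
# so "anna1985" matches but the ambiguous "carguy551978" does not.
YEAR_SUFFIX = re.compile(r"(?<!\d)(19[3-9]\d|200\d)$")


def birth_year(username):
    """Return the guessed birth year encoded in a username, or None."""
    match = YEAR_SUFFIX.search(username)
    return int(match.group(1)) if match else None


def password_contains_year(username, password):
    """Sanity check on the filter: does the password repeat the guessed year?"""
    year = birth_year(username)
    return year is not None and str(year) in password
```

The fraction of filtered users for whom `password_contains_year` is true gives a weak lower bound on how often the guessed year really is a birth year.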

If a seemingly rare password in the dataset only occurs for two distinct usernames, then maybe those two usernames actually correspond to the same user. Do such usernames have a low edit distance? Could you use this to learn general rules to determine, given two usernames, whether they seem to correspond to the same person?
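To make this concrete, here is a small sketch using Levenshtein distance. The function names and the `max_distance=2` threshold are hypothetical choices for illustration; a password counts as "rare" here simply when exactly two distinct usernames share it.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def same_user_candidates(pairs, max_distance=2):
    """Find username pairs that share a password used by exactly two names."""
    by_password = defaultdict(set)
    for username, password in pairs:
        by_password[password].add(username)
    candidates = []
    for users in by_password.values():
        if len(users) == 2:
            a, b = sorted(users)
            if edit_distance(a, b) <= max_distance:
                candidates.append((a, b))
    return candidates
```

Pairs that pass this check could serve as noisy positive examples when trying to learn more general "same person" rules.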

I just gave those off the top of my head, and I'm not at all working in this field, but I'd have no trouble imagining interesting applications for this data that would not have been possible with the passwords alone.


I feel like most of those research questions could be answered if it was a "username -> password strength" mapping, in addition to a hash to study duplicate trends, rather than just "username -> password". Obviously there is no objective ranking of "password strength", but a decent approximation could be provided.
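A rough sketch of what such a redacted release could look like. The strength estimate is a deliberately crude character-pool heuristic, and the keyed hash with a secret that is thrown away afterwards is my own assumption about how to keep the duplicate tags from being brute-forced; neither is taken from the article.

```python
import hashlib
import hmac
import math
import string


def strength_bits(password):
    """Crude strength estimate: length * log2(size of character pool used)."""
    pool = 0
    if any(c in string.ascii_lowercase for c in password):
        pool += 26
    if any(c in string.ascii_uppercase for c in password):
        pool += 26
    if any(c in string.digits for c in password):
        pool += 10
    if any(c not in string.ascii_letters + string.digits for c in password):
        pool += 32  # rough count of printable ASCII symbols
    return len(password) * math.log2(pool) if pool else 0.0


def duplicate_tag(password, secret_key):
    """Keyed, truncated hash: equal passwords get equal tags."""
    digest = hmac.new(secret_key, password.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def redact(pairs, secret_key):
    """Map (username, password) to (username, strength estimate, tag)."""
    return [(u, round(strength_bits(p)), duplicate_tag(p, secret_key))
            for u, p in pairs]
```

An unkeyed hash would not be enough, since common passwords could be recovered by hashing a dictionary and comparing.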

There are serious risks to having your username and password in a public list. Yes, all of these usernames and passwords were already technically publicly released, but to a lazy and ignorant script kiddie, finding or even being aware of those lists can be outside their grasp.

By aggregating everything into one list, you 1) increase the search engine visibility for all credentials, which means someone Googling the username of, say, an Internet commenter who pissed them off may find a plaintext password they could use to impact the person's life with much higher probability (I work in information security and have seen that happen on many occasions), 2) encourage script kiddies and fraudsters to spend time working through the list to find working accounts that other criminals have missed in the past decade, and 3) undo any work that paste sites like Pastebin and file sharing sites like Mediafire have done to remove copies of the database dumps. 1) may not apply if it strictly remains a torrent, but it'll probably be floating around public paste sites within a few days, which would likely mean search engine visibility for every username on it.

If even 0.01% of the users on this list have accounts compromised due to its release, then I don't think that cost justifies the research benefits relative to a more redacted version of the list.


> I feel like most of those research questions could be answered if

If the person who releases this kind of information has the foresight to know what the questions are going to be, they could provide the answers directly rather than go half-way and modify the data. It would likely be less work than trying to produce anonymized data that is both useful and secure.

What I see used in cases like this is one of two options: either full public access, or restricted access where only a select few get the chance to do the research. The 0.01% misuse should thus be weighed against that choice, rather than against the theoretical case of anonymized data.


As I explained in the article, I seriously doubt that any more than a tiny number of these passwords are still valid. And there is no reason for them to be, having already been widely available, indexed (and cached) by every search engine, archived at archive.org, and downloaded by thousands or tens of thousands of people. Anyone who would use this data maliciously probably already has it.

Much of this data is the same data monitored by sites like haveibeenpwned.com and a dozen others. Facebook scrapes these. Lastpass will send you alerts. The risk here is minimal; the research value is much more than you realize.


>Anyone who would use this data maliciously probably already has it.

You might be surprised. The fact that these dumps are supposedly quite old certainly mitigates the risk, but I've seen cases of primary email accounts being taken over from a plaintext password in a dump 5+ years old. No one ever tried it on the email because it wasn't in the dump and wasn't identical to the username, though it was very close.

Aggregators like haveibeenpwned.com and Lastpass responsibly use the passwords they scrape; they don't release them all in a big batch like this. Many cybercriminals do the same kind of scraping and share these aggregated lists privately, but they're always going to be missing things, so there's no question they're all going to be pulling in your list, too. And odds are there's going to be at least one dump that a lot of them missed which yours has.

I do understand there is some research benefit here, but even in the best possible scenario I don't think the value from the research outweighs the costs.


First of all, a good number of these passwords were simply gathered through Google. Some were gathered via the archive.org archive of pastebin pastes and their normal web page archive. Some were from forums that were located via Google. This data is already out there; being aggregated doesn't make it any easier to hack these people.

Try searching for "Cucum01:Ber02" or "shawman:badman" and you will see how many passwords are indexed. I have hundreds of searches like these that I monitor and scrape.

Second, I regularly share my data with the owners of password checking sites such as haveibeenpwned to make sure users can find out about these breaches. Releasing this data isn't something I have taken lightly; I debated it for years. I have weighed the risks and felt it was important to release the raw data, although not everyone will agree with me on this. I made a good effort to minimize the risks to actual users.

Finally, keep in mind that most users are already at risk simply because they have bad passwords. Ten percent of users have a password on the top 1000 list. A large percentage of users are at risk because the websites they are on don't have proper security. This is how people get hacked, not because of a password found on this list.


Still, the whole purpose of a password is to remain secret. He's certainly doing these users a disservice by releasing this list regardless of the hypothetical likelihood of the data already being available. Basically the arguments for doing this all seem to boil down to "they should already know their passwords are compromised" which nobody can guarantee is the case.

I agree that having a crappy password puts you at risk, but what about the people who genuinely tried to use some common sense but are on this list anyway? Is it their fault for not religiously keeping up with the latest indexed password lists?


OK, I'll bite: can you give us some ideas on how this would lead to a genuine advancement in user authentication (that we wouldn't have with username/pw de-linked)?


Example:

Username: mickael

Password: mickael69

EDIT: Just to be more precise, there is a correlation here, and with so much data a lot can be known. Patterns can then be forbidden from password fields so the website is less prone to dictionary attacks.
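A minimal sketch of such a check, assuming a simple case-insensitive substring test. The reversed-username test and the `min_len` cutoff are extra assumptions added for illustration.

```python
def derived_from_username(username, password, min_len=3):
    """Return True if the password embeds the username, forwards or reversed."""
    u = username.lower()
    p = password.lower()
    if len(u) < min_len:
        # Very short usernames would match too many unrelated passwords.
        return False
    return u in p or u[::-1] in p
```

With the example above, `derived_from_username("mickael", "mickael69")` returns `True`, so a signup form could ask for a different password.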


So what would you do here? Disallow "mickael" from the password? That's pretty user-hostile and almost completely pointless.


Is it pointless to reduce the attack vector against your website? And, no, for a banking system, it is not that user-hostile to say things like "we have found that using <pattern> in your password makes it easy for people to guess, please choose a more complicated password".


All possibly interesting questions (certainly not to me) but I fail to see how they would lead to any genuine advancements in authentication.


A list of 10 million passwords alone answers almost no questions. In fact, it's probably possible to programmatically predict, with a depressing level of accuracy, what a great deal of such a list will look like, given the already available research about the distribution of complexity, the parts of speech and numbers commonly used and in what patterns, etc.

So, the next interesting question is: given the already plaintext-available lists of usernames and passwords, just how much coverage is there in the known space? Are your passwords known? Are your users' and clients' passwords known?

This document is perfect for a true positive on the matter of needing to deprecate particular combinations of username and password, and, as an obvious corollary, presenting evidence for consultation advice about the same. (Of course, being only a sample, it doesn't say anything about a true negative.)


Before I go into the research aspect of it, there is no reason to separate the usernames from the passwords. They are already out there. The bad guys have them. So why not release them so that everyone can look at them?

Also I am sure there are some research aspects to the usernames. At the very least, behavioral deductions can be drawn from these combinations.


Probably to find out how many people do stuff like type their username backwards as a password, and what kind of patterns they use. Whether that is useful enough information to warrant publishing data like this is debatable, yes.


Also interesting, how features of a username might correlate with password strength. Who do you think uses a stronger password, someone with the username "carguy551978" or someone with the username "w1ntermute"?


carguy followed by the 24th n such that 1 + n + n^13 is prime, followed by the 34th such n? I would expect a very, very strong password from someone who picks their username like that.

(see https://oeis.org/search?q=__%2C+551%2C+__%2C+978&sort=&langu...)



I dunno if he should have said "released", because he's not releasing any new data. Everything he's posted is already available to anyone with a search engine and a bit of curiosity.

So if you're concerned that information which wasn't previously public is now public, you can be at ease -- all of this data was not only public already, but less "cleaned up".


I'm curious to see if any of my accounts/passwords have been compromised.


Wouldn't be surprised if one of these sites already has it

https://breachalarm.com/ https://haveibeenpwned.com/

The author does not seem like the type of person who did the hacking himself to obtain these, but rather curated leaks into his database


Exactly why I'm curious. haveibeenpwned listed a username I often use as being pwned in a "battlefield heroes" leak, but I couldn't find the "release" for it.


> I'm struggling to understand the intent here.

A desire for a particular type of attention his ego seems to need.

Which, combined with either a moronic lack of appreciation for the hassle and damage he's going to cause to end-users who've already been hosed once before, or an arrogance that makes him not care, makes him difficult to fit for a white hat.

FTA:

> This is completely absurd that I have to write an entire article justifying the release of this data out of fear of prosecution

What's absurd is his assumption that stripping domain names is somehow sufficient.

Edit: I'm getting downvoted like crazy here. Which is fine, but people seem to think it's ad hominem because I'm narrowing the reasons behind why someone would release a data set with a considerable price of collateral damage attached to it, while doing very little to mitigate that damage.

Just because the likely options for why someone would do such a thing don't speak favorably of the person, doesn't make it ad hominem. An ad hominem attack is seeking to undermine someone's argument by attacking their character.

I'm saying Mark Burnett made it difficult to assume good things about him after a stunt like that. If he actually made a real argument that what he did was sufficient, or that the harm he's going to cause is more than offset by the greater good it'll do (or some such argument), then we'd have something to try to undermine (whether legitimately or fallaciously), but as it stands, he hasn't even justified his actions.


>Ad hominem + ad hominem

Research requires data. If I want to do research on how best to implement my bank system, I would like to know what passwords are more likely to be contained in a dictionary attack. Usernames may have a high correlation with passwords and thus are useful. Considering all of these passwords can be obtained from obscure forums/websites and that the websites where the IDs are used are not specified, I don't see why he could not release it to the public for researchers to use.


> Research requires data.

There's a lot of research that could be performed if we were willing to generate data without due regard for the inherent downsides.

Saying research requires data is just insufficient justification in this case.

> I don't see why he could not release it to the public for researchers to use.

Because the collateral damage doesn't justify it. That aspect of it seems to be little more than a side note to him.

He could quietly and securely give the data to established researchers.

Or, he could very publicly release a torrent for everyone's use, with almost no concern for how it'll be used.

There's a massive difference there, and the likely reasons behind his decision to do the latter leave very little room for one to make favorable judgements about either his motives or his ability to responsibly mitigate risk.

I'm sorry if you believe any of that to be ad hominem, but it just isn't.

> Usernames may have a high correlation with passwords and thus are useful.

And that's precisely why the likelihood of collateral damage stemming directly from his actions is much higher than it should reasonably be in this instance.

At some point what you're giving up to further research isn't worth the tradeoff. He's selling innocent bystanders up the river to further his own cause, with little evidence that he's done everything possible to limit collateral damage.

I don't understand why this line of thinking is a hard sell here.

When a government or corporation releases lightly-redacted, personally-identifying information about people, the outcry is (rightly) massive. White knight does it and, well, to question his motives is ad hominem?

Really?


> A desire for a particular type of attention his ego seems to need.

> moronic lack of appreciation

> or an arrogance

This is ad hominem.

Here's a reference: http://en.wikipedia.org/wiki/Ad_hominem


Sorry, nope. I'd have to be attacking the character of the person making the argument, and do so in an attempt to undermine their argument, for it to be ad hominem.

I'm questioning the motives of someone who just released a data set that's going to cause very real harm to very real people, who've done nothing to deserve it.

For the record, given his credentials, it's highly unlikely that he didn't fully appreciate the ramifications of his actions. Which narrows down the other options on the table. (Did I mention he's selling books?)

Just because I'm not blowing sunshine at the guy, doesn't make it ad hominem.


Yeah, I wish people would quit using "ad hominem", it's turning into a tell for "people who spend too much time online and still don't know how to disagree".

Still, I think you're really overstating the risk here. The data set doesn't have email addresses and it doesn't list the specific services involved. How would you propose causing real harm to these real people using the data here, in a way that hasn't already been done or tried?

It sounds like he did put a lot of thought into his decision. You seem to be arguing that he thought about it, and then decided to do it anyway to help his book sales, which would make him a pretty indecent person. Do you really want your opinion to boil down to, "I think this guy is greedy and bad"?

As far as the value of research goes ... well, we don't really know yet. This particular dump, yeah, probably won't add much value to the current body of research. (I personally have much larger dumps, and don't consider myself a researcher ... so it's not like there's a shortage of data available.)

That's the thing about research though. You start off by investigating something and seeing where it leads. Maybe this will be the dump that encourages developers to start maintaining password blacklists ("Please do not use this password, it is too common"); that would be valuable. Maybe this will just be another straw on the camel's back that eventually leads to everybody giving up on the idea of passwords entirely.
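A blacklist check of that kind could be as small as the sketch below. The four-entry list is a hypothetical stand-in for a real "top N" list built from dumps like this one, and the function names are made up for illustration.

```python
def load_blacklist(lines):
    """Build a case-insensitive set of banned passwords, skipping blank lines."""
    return {line.strip().lower() for line in lines if line.strip()}


def check_password(password, blacklist):
    """Return a rejection message for blacklisted passwords, else None."""
    if password.lower() in blacklist:
        return "Please do not use this password, it is too common"
    return None


# Hypothetical stand-in for a real list of the most common passwords.
COMMON = load_blacklist(["123456", "password", "qwerty", "letmein", ""])
```

A signup handler would call `check_password(candidate, COMMON)` and show the message whenever the result is not `None`.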

Who knows? It might be valuable, it might not, but it's not dangerous.


Do you think it might cause harm if the domain names were retained?


I'm not sure.

Given what the author says about the data (it's all gathered from public sources, a lot of it is very old), it shouldn't matter whether the domain names or service names were there or not.

But then the data would go from being mostly anonymous to somewhat personal, and I couldn't defend that as much. Practically speaking, the risk of harm should still be really really low, but it just seems like a bad practice to distribute information that might be used to identify someone that's had their password leaked somewhere.



