
Yet another approach is to send the entire list of all malware URLs to each client and let the client do all the processing on their end.

This way no data (hashed, anonymized, truncated, or otherwise) would need to be sent to Tencent, Apple, Google, or anyone else.



Why is this getting downvoted? I'm also interested in why this approach isn't taken.


Chrome does something like this, you can learn more about it here: https://codereview.chromium.org/6286072/


A couple possibilities I can think of:

* the list may be prohibitively large

* it exposes to the bad actors exactly which of their scams is detected, so they can simply refine their methods until their sites don’t make “the list”


Bad actors can also occasionally poll the safebrowsing API.


Bloom filters take care of the first. There will always be an arms race between attack and defense, so I'm not concerned about the second issue.


Bloom filters can give false positives, and to eliminate them, you'd need to send "data (hashed, anonymized, truncated, or otherwise)" to some entity that has the full list. That's exactly how Google's Safe Browsing API works.
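A minimal sketch of that flow in Python (URLs and filter parameters are illustrative, not from any real blocklist): a local Bloom filter answers definite negatives offline with no data sent anywhere, and only a hit, which may be a false positive, triggers a remote confirmation.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array."""
    def __init__(self, m_bits=1 << 20, k=7):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # Derive k independent bit positions from salted SHA-256 digests
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
bf.add("https://evil.example/login")   # hypothetical bad URL

def check(url):
    if url not in bf:
        return "safe"        # Bloom filters never give false negatives
    return "ask server"      # a hit may be a false positive; confirm remotely
```

The "ask server" branch is exactly where the privacy question reappears, which is the point being made above.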


Can't the bad actors already check each of their sites individually by pretending to be a normal user?


Exactly how large is it?

OS vendors already make a habit of regularly sending gigantic OS updates. I'd have a hard time believing that a compressed list of malware URLs would add noticeably to that, by comparison.

Also, once the list is sent the first time (or just included with the OS so it'd be already present on your device when you bought it), they could just send the deltas as the list changed, and those deltas (especially once compressed) should be relatively small even compared to the original (probably not that large) list.


The Google Safe Browsing transparency report lists ~40k new bad URLs per week - how large do you think the list is by now? It is far, far too large for local processing, but k-anonymity is perfectly trustworthy when used with cryptographic hashing.
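The k-anonymity idea can be sketched like this (a simplified model, not the actual Safe Browsing wire protocol; the URLs and the 4-byte prefix length are assumptions): the client sends only a short hash prefix, the server returns every full hash sharing it, and the final comparison happens on-device.

```python
import hashlib

def url_hash(url):
    return hashlib.sha256(url.encode()).digest()

# Server side (illustrative data): full 32-byte hashes of known-bad URLs
BAD_HASHES = {url_hash(u) for u in ("https://evil.example/a", "https://evil.example/b")}

def server_lookup(prefix):
    # The server sees only a 4-byte prefix, shared by many unrelated URLs,
    # so it can't tell which URL (if any) the client actually visited
    return [h for h in BAD_HASHES if h.startswith(prefix)]

def client_check(url):
    h = url_hash(url)
    candidates = server_lookup(h[:4])  # only the prefix leaves the device
    return h in candidates             # full-hash comparison stays local
```

With 2^32 possible prefixes and billions of URLs in existence, any single prefix matches a large anonymity set of unrelated URLs, which is where the privacy claim comes from.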


Wouldn’t it be pretty easy for the bad actors to check the database anyway? I can’t imagine they would need to query often enough to hit any rate limits.


Local handling seems more Appley too.


Downvoters who think that might be too much data don't know about Bloom filters.


Bloom filters are likely useless in this situation. Consider the following facts (for phishing alone):

1. Phishing sites have a lifecycle of about 15 hours.

2. Most malicious links are hidden within benign domains.

3. About 400,000 phishing sites are created each month.

From: https://www.itgovernance.co.uk/blog/4-eye-opening-facts-abou...

I haven't run the numbers, but I'm guessing a client-side solution would consume a lot of bandwidth, and avoiding false positives is very important.

Also, with a client-side solution, how are new phishing URLs detected?

PS: perhaps try to assume HNers know what a Bloom filter is (I've seen them come up lots of times in comments).


Google's Safe Browsing API is probabilistic too. The idea is that you do multiple rounds of checking, getting closer and closer to the mark. You start with a check that has a fairly high false-positive probability but high privacy; then, if you get a positive, you try a check with a lower false-positive rate that also costs you some privacy. The trade-off is that you don't have to keep the full malicious-site DB with you at all times (and keep it up to date).

Why did you assume I'd not know about false positives?


400,000 sounds like a lot, but I wonder how many new URLs Tencent adds to its database each month. I expect they don't add every phishing URL but some small subset of them (possibly even a very small subset; we'll probably never know).

But let's say it is 400,000. I took the URL you linked and made a file of 400,000 copies of it. The file size was 28 MB. I didn't bother compressing that particular file, since the URL is the same in each instance, but I expect a file of actual distinct phishing URLs would still compress pretty well, so the result would likely be significantly less than 28 MB.

Considering that OS vendors regularly ship multi-gigabyte size updates, having to download less than 28 MB extra every month shouldn't even be noticeable. If updates needed to be done more frequently, the client could subscribe to get regular updates as they become available.
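The same estimate can be done without building the file (the 70-byte average URL length is an assumption roughly matching the linked URL's length):

```python
avg_url_len = 70               # bytes per URL, an assumed average
urls_per_month = 400_000
raw_bytes = avg_url_len * urls_per_month
megabytes = raw_bytes / 1_000_000   # 28.0 MB uncompressed, before compression wins
```

Storing only short hashes or hash prefixes instead of full URLs would shrink this further, at the cost of needing a confirmation step for matches.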


> 28 MB extra every month shouldn't even be noticeable

Parent comment suggests a phishing-site life cycle of <15 hrs; at 400k a month that's ~8,333 every 15 hrs. To give an idea of how frequency-sensitive this is, assume URLs are added uniformly in time: that would be a new one roughly every 6.5 seconds. For such time-critical information it makes no sense to attempt to synchronize clients; it would require constant polling or push updates to have _any_ chance of catching a malicious URL.
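Spelling out that arithmetic (assuming a 30-day month and uniform arrival, both simplifications):

```python
seconds_per_month = 30 * 24 * 3600          # 2,592,000 s in a 30-day month
new_urls_per_month = 400_000
per_15h = new_urls_per_month / (seconds_per_month / (15 * 3600))  # ~8,333 per 15 h
interval_s = seconds_per_month / new_urls_per_month               # ~6.48 s apart
```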

At such a frequency, efficiency becomes less about bandwidth and more about the overhead of continuously synchronising so many clients (think of that 28 MB spread out over 400k separate messages over one month, one every ~6.5 seconds; that not only inflates the size, but causes constant network usage and processing that is far less efficient than a single 28 MB download).

Or you could just send the URL hash when you visit a URL... (do you request anywhere near 8k URLs every 15 hrs, or 1 URL every ~6.5 seconds? No.) It's so clearly a simpler solution that will be faster for everyone, without letting bad URLs slip through before a latent sync.


There's no need to sync every time a new phishing URL is added - only every time a URL is visited by a client.

The delta can be derived just from the version number of the client's URL database, and should total about 1 MB for a whole day's worth of updates. So ~1 MB for the first URL visited in a day, and considerably less afterwards. Compared to average webpage size, that's nothing.

Really, the only thing that changes is that instead of sending a URL hash, you send the URL DB version, and the reply is the list of changes since that version.
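A sketch of that version/delta exchange, with in-memory stand-ins for the client and server (all names and data are hypothetical):

```python
# Server: an ordered log of updates, one entry per published DB version
UPDATE_LOG = {
    1: {"add": {"https://evil.example/a"}, "remove": set()},
    2: {"add": {"https://evil.example/b"}, "remove": {"https://evil.example/a"}},
}
LATEST = max(UPDATE_LOG)

def deltas_since(client_version):
    # Reply to "my DB is at version N": every change published since then
    return [(v, UPDATE_LOG[v]) for v in range(client_version + 1, LATEST + 1)]

class Client:
    def __init__(self):
        self.version = 0
        self.bad = set()

    def sync(self):
        for v, delta in deltas_since(self.version):
            self.bad |= delta["add"]
            self.bad -= delta["remove"]
            self.version = v

    def is_bad(self, url):
        self.sync()          # sync before each visit, as the comment suggests
        return url in self.bad
```

Nothing about the visited URL ever leaves the client; only the version number does.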


> Really, the only thing that changes is that instead of sending a URL hash, you send the URL DB version, and the reply is the list of changes since that version.

Or none at all, just a simple confirmation that the list is up to date. Yes, this is a way better idea.

Although it's always going to be less efficient. For instance, I'm not sure how it would scale into the future. Checking URLs server-side is optimal: the cost is always going to be roughly constant, in proportion to the URL size. But with DB deltas, each URL lookup now depends on both the URL size and the DB update frequency, i.e. as the malicious-URL rate increases over time, individual URL lookups will incur greater network cost. This is probably not a big deal for the client, but it would make a significant difference for the provider of the deltas - or maybe network caching would dissolve it again? I mean, there would be a lot of duplicate deltas flying around every minute... basically a content-distribution problem, but with a high-frequency twist.


Do you really think Tencent is detecting a new phishing site every 154ms?

I'd seriously question how many of the total new phishing sites they detect to start with, and then how frequently they do so.

If a user only downloads the deltas periodically, they'd risk being out of sync with the master list (which might not be updated even once a month, or at all, for all we know), but that's the price they'd need to pay to avoid sending any information about their web browsing to parties they don't trust.

One other thing to consider is the likelihood that the URL you happen to be surfing is both a phishing URL to begin with and one of the ones that just appeared since the last delta download you did, compared to the likelihood that it's one of those already in the entire phishing URL database you've already downloaded. I'd expect those odds to be very low.


> the likelihood that the URL you happen to be surfing is both a phishing URL to begin with and one of the ones that just appeared since the last delta download you did, compared to the likelihood that it's one of those already in the entire phishing URL database you've already downloaded. I'd expect those odds to be very low.

Ignoring the first condition (otherwise why bother with a list at all)... Consider that this information is very transient (average 15 hrs = 54,000 seconds); the odds are then pretty simple: deltaT / 54000, where deltaT is the seconds since your last sync.
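That back-of-the-envelope model can be made concrete (the 15-hour lifetime and uniform creation times are both assumptions carried over from upthread):

```python
LIFETIME_S = 15 * 3600    # assumed average phishing-site lifetime: 54,000 s

def miss_probability(seconds_since_sync):
    # Chance a currently-live phishing site appeared after your last sync,
    # i.e. is missing from your local list; capped at 1 for very stale lists
    return min(seconds_since_sync / LIFETIME_S, 1.0)

hourly = miss_probability(3600)    # hourly sync: ~6.7% of live sites unknown
```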

This is still horrible, because your safety is determined by how frequently you can sync with the DB.

> If a user only downloads the deltas periodically they'd risk being out of sync with the master list (which might not be updated even once a month or at all, for all we know).

Being updated monthly doesn't match the statistic of an average 15 hr lifecycle, because any given entry would be useless after that length of time. And while I don't claim to know 15 hrs as a fact, it is intuitive that the average will only get shorter as malicious-URL checkers update ever faster.

> but that's the price they'd need to pay to avoid sending any information about their web browsing to parties they don't trust.

Full URL information need not be sent; a hash of the URL domain and path would probably suffice... if that's not enough, then it's a dilemma, but that doesn't make continuous syncing a good or fail-safe replacement.


"Being updated monthly doesn't match the statistic of average 15hr lifecycle, because it would be useless after that length of time."

And maybe it is useless. We don't actually know, but we should at least recognize that there may be a difference between how frequently phishing sites allegedly appear and how frequently they appear in Tencent's malware URL database.

"This is still horrible, because your safety is determined by how frequently you can sync with the DB."

And being identified by the Chinese government as someone who surfs to forbidden websites might be even more horrible, for some.


People who think they know about Bloom filters should consider what attack vectors a false positive would allow.

(The end result of the thought experiment will be basically what Google does now.)




