
Wouldn't it have been easier to simply increment a counter for each visit and set a short-lived cookie in the browser for that post? And put the spam detection system before the counter increment.
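Roughly, as a sketch (a Flask-style handler; the cookie name, TTL, and spam_check() are placeholders, not anything Reddit actually does):

```python
import redis
from flask import Flask, request, make_response

app = Flask(__name__)
r = redis.Redis()

def spam_check(req) -> bool:
    # Placeholder: real spam detection would run here, before the increment.
    return True

@app.route("/post/<post_id>")
def view_post(post_id):
    resp = make_response(f"post {post_id}")
    cookie = f"viewed_{post_id}"
    if spam_check(request) and cookie not in request.cookies:
        r.incr(f"views:{post_id}")                 # count the visit once
        resp.set_cookie(cookie, "1", max_age=600)  # short-lived: 10 minutes
    return resp
```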


I think they said the product decision was made primarily to prevent abuse. I don't think a cookie could stop a sophisticated abuser.


A browser cookie that can be trivially deleted by the client? What's the purpose of the cookie?


Indeed, it can easily be deleted; that's why I said to put the spam detection before the counter.

I assume most of the visits are from normal users, not spammers. If a normal user already has the cookie for the post set, it means it's a page refresh, so don't increase the counter.

Identifying sophisticated spammers accurately is more complicated though. You can't rely on any client-side info (user agent, cookies, browser history, screen resolution, OS, etc.) because it can all be modified. You can't rely on IP address either, because there are public hotspots used by genuine users too. I think their spam detector is more complicated than this, and they have to use it with the HLL approach as well.

So, for the genuine users, a counter increased based on the cookie mechanism would've worked just fine.


> I assume most of the visits are from normal users

That's not a safe assumption. If spam prevention controls aren't working well enough, bots start to outnumber humans very quickly.


I suspect the number of Reddit users who run AdBlockers and/or aggressively block cookies is non-trivial.

Indeed, one of the main reasons for this architecture is to measure the difference between the server-side count and what a cookie-based method would give.


How do you concurrently update a counter?


I think the main problem is not concurrently updating the counter; after all, the HLL must also be updated concurrently.

Their primary motivation for using HLL is given in the intro.

>In order to maintain an exact count in real time we would need to know whether or not a specific user visited the post before. To know that information, we would need to store the set of users who had previously visited each post, and then check that set every time we processed a new view on a post. A naive implementation of this solution would be to store the unique user set as a hash table in memory, with the post ID as the key.

>This approach works well for less trafficked posts, but is very difficult to scale once a post becomes popular and the number of viewers rapidly increases. Several popular posts have over one million unique viewers! On posts like these, it becomes extremely taxing on both memory and CPU to store all the IDs and do frequent lookups into the set to see if someone has already visited before.
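The naive approach they describe would look something like this (a sketch; all names are illustrative):

```python
# Naive exact unique-view counter, per the quoted intro: a per-post set
# of user IDs. Memory grows with the number of unique viewers per post.
from collections import defaultdict

viewers: dict[str, set[str]] = defaultdict(set)

def record_view(post_id: str, user_id: str) -> int:
    """Record a view; return the exact unique-viewer count for the post."""
    viewers[post_id].add(user_id)  # O(1) amortized, but O(viewers) memory
    return len(viewers[post_id])
```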


Redis writes are atomic: you just use the INCR command.
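E.g., with redis-py (the key name is made up):

```python
import redis

r = redis.Redis()
# INCR is atomic on the server, so concurrent clients never lose updates.
new_count = r.incr("views:post:12345")
```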


Writes are atomic in Redis because Redis is single-threaded, so you are bounded by how fast a single Redis instance can write. If you try to write faster than Redis can handle, you'll get queueing or errors.


The wonderful thing about HyperLogLogs is that you can split the counter across N servers and "merge" the registers later, in case you want an architecture that shards the same counter across multiple servers. But sharding directly by resource looks simpler, actually...
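Redis exposes this merge directly via PFADD/PFMERGE/PFCOUNT; a sketch with both shard keys on one node to keep it runnable (key names are made up):

```python
import redis

r = redis.Redis()

# In a sharded setup, each shard would keep its own HLL for the post;
# here both live on one node so the sketch runs as-is.
r.pfadd("views:post:1:shard_a", "alice", "bob")
r.pfadd("views:post:1:shard_b", "bob", "carol")

# Merge the registers; users seen on both shards are deduplicated.
r.pfmerge("views:post:1", "views:post:1:shard_a", "views:post:1:shard_b")
print(r.pfcount("views:post:1"))  # ~3 (HLL counts are approximate)
```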


Run enough Redis servers to handle the load. Choose a server by hashing a user ID. Total = sum of counts from all servers.
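A sketch of that layout (ports and key names are made up). Since each user hashes to exactly one shard, the per-shard unique counts can be summed without double-counting:

```python
import hashlib
import redis

# Hypothetical pool of shards; each user maps to exactly one of them.
shards = [redis.Redis(port=p) for p in (6379, 6380, 6381)]

def shard_for(user_id: str) -> redis.Redis:
    h = int(hashlib.sha1(user_id.encode()).hexdigest(), 16)
    return shards[h % len(shards)]

def record_view(post_id: str, user_id: str) -> None:
    # SADD is a no-op for repeat views, so each shard's count stays exact.
    shard_for(user_id).sadd(f"viewers:{post_id}", user_id)

def total_unique_views(post_id: str) -> int:
    return sum(s.scard(f"viewers:{post_id}") for s in shards)
```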


It's also got Cassandra behind the scenes, which has fast, concurrent, distributed counters.
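A hypothetical sketch with the DataStax Python driver, assuming a counter table (keyspace, table, and post ID are made up):

```python
from cassandra.cluster import Cluster

# Counter columns absorb concurrent increments across the cluster.
session = Cluster(["127.0.0.1"]).connect("analytics")
session.execute(
    "UPDATE post_views SET views = views + 1 WHERE post_id = %s",
    ("t3_abc123",),
)
```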


Yes, that was my idea as well. They must have some sort of cache system for serving basic user metadata at scale when a page is loaded, and they could add a time-expiring list of the post IDs viewed by a user to do detection on a per-user basis on the backend.

I think they want to break it into different services for (whatever) reasons.

Running a counter across a sharded in-memory cache implementation (like Redis) for something like post views should be OK, but they might have some weird rule sets for spam detection that are slow.


I'm not following how your solution handles the unique visitor counting. You have to know when to increment the counter; that's why they used the HLL.


Ah, sorry if it wasn't clear. By storing an expiring list of post IDs in the cache that sits close to the user-metadata lookup, you could filter out the posts that were already visited by that user.
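A minimal sketch of that dedup, assuming Redis as the cache (key format and TTL are made up):

```python
import redis

r = redis.Redis()

def is_first_view(user_id: str, post_id: str, ttl: int = 600) -> bool:
    # SET ... NX EX is atomic: True only for the first view within the
    # TTL window, so refreshes inside it are filtered out server-side.
    return bool(r.set(f"seen:{user_id}:{post_id}", 1, ex=ttl, nx=True))
```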


That might work, but the cookie would be huge for people who read a lot of reddit threads, no?


That depends on their definition of 'short-lived'.

How many posts can a user visit in, let's say, an hour?

They said they want to avoid increasing the views when a user refreshes the page within a short interval.



