Hacker News: nonsequitarian's comments

I don't know how many folks will see this, and of those that do I don't expect many will necessarily be moved by what I say here. I'm going to say it anyway, and then I may never look at this thread again. I'm the person who designed the download token scheme discussed in this article, and, while I understand all of the concerns and suspicions, I believe that the way we designed this and the way we handle our telemetry data mean that this is not the privacy violation some of you are claiming it is. To be clear, I am speaking for myself here; these are my own thoughts and opinions, and I am not representing Mozilla in any official capacity.

So, a download token is a UUID associated with a unique download event. It gets generated when you click the 'download' link, added to the installer, and then passed through to the installed browser. It is returned to us in the telemetry pings that the browser sends back to our telemetry ingestion endpoints. When the download happens, on the server side we capture the download token and the GA session ID and store those in a table. There is nothing else stored in this table.

Having access to this table means that you can correlate the user's activity on the Mozilla website that GA provides with the telemetry data that Firefox sends us. The website activity contains URLs that the user visited, so we consider this "category 3" data (see https://wiki.mozilla.org/Data_Collection#Data_Collection_Cat...), quite sensitive. For that reason this table has highly restricted access: only a small number of individuals are able to get to it.

Access restrictions offer no protection against subpoenas, of course. But I believe you can safely maintain your anonymity by opting out of our telemetry gathering, because when you opt out of telemetry we delete all of the historical telemetry data we have collected for your Firefox profile. Everything, including all of the records that contain the download token.

If this happens, all we are left with is that original record with the download token and a GA session. The download token can no longer be correlated with your telemetry data, and we have no way of associating your Firefox installation with your GA session, not even under subpoena. And this is all assuming that you haven't blocked GA, or that you haven't specified 'Do Not Track' before visiting our website. If you've done either of those things, we won't have a GA session ID for you to begin with.

Oh, incidentally, we never store any IP addresses or other PII in our telemetry data. That all gets scrubbed during ingestion.

Again, I don't expect this to have much impact, but I'm sharing what I know to counter some of the more extreme claims that this removes the ability for Firefox users to remain anonymous.

Finally, we have the obvious question: Why would we even do this? Believe it or not, understanding your user base does actually have some value in serving that user base. For most of Firefox's existence, there has been no trustable feedback loop. Sure, folks out there in the world have opinions, and share them, but opinions differ, and anecdotes are not data. If one person thinks most users will like a particular change, and someone else thinks they won't, nobody can prove their point in any meaningful way. The folks making decisions about Firefox have been flying blind. And, as many of you in this thread have pointed out, it hasn't necessarily been going that well.

In Firefox's early years, there was lots of low hanging fruit, and the competition was a poorly maintained Internet Explorer, so it was easy to win a bunch of market share. Then Chrome came on the scene with their effectively limitless budget and famously data driven product process. We'll never match their budget, but we can try to make choices based on data instead of just letting whoever has the most organizational power decide. My team has spent the last few years building out a data infrastructure that we hope will support better decision making going forward while still trying to honor user privacy and choice. This is a tough balance to strike, and we're far from perfect, but we do our best.

You can learn about our data collection infrastructure and policies in great detail on our docs site (https://docs.telemetry.mozilla.org/index.html), and you can see nearly all of the code that handles our data ingestion and processing in our public repositories (https://github.com/mozilla/gcp-ingestion and https://github.com/mozilla/bigquery-etl).


While it's true that Mozilla pays its people well (necessary, to compete for high end tech talent) and we (I work for Mozilla) are in no way free from commercial concerns, I think you're missing a big piece of why being a nonprofit is a distinguishing factor: Mozilla doesn't have the insane pressure for growth that most startups and all publicly traded companies have to reckon with. Any company that has gone public, wants to go public, or wants to get acquired has a never ending pressure for user and/or revenue growth at all costs. Mozilla doesn't have that pressure for growth. Of course we need to maintain enough market share to stay relevant, and we'd love to have more and more and more users, but ultimately as long as we can make enough money to pay for our operations then we're golden. This gives us a lot more freedom of choice when making decisions about what and how to make money.


> This gives us a lot more freedom of choice when making decisions about what and how to make money.

I'd argue the opposite. Mozilla seems too scared to try anything substantial because they're afraid of losing the money they do get; they aren't actively going after other sources or trying to grow into new revenue streams. Think of e.g. tracking protection: Apple are the ones actually making moves there, not Mozilla. I wonder why...


Mozilla created their own high performance programming language in order to build a faster web engine from scratch (all of which has been largely successful). They developed and shipped a mobile OS well into the reign of Android and iOS. You could argue about the placement of their priorities, but they have definitely thrown their weight behind ambitious projects.


Ambitious in a technical sense - yes. But not ambitious in a "protect the users" sense (see e.g. tracking protection/ad blocking), and also not ambitious in a business sense (i.e. try to do something beyond sell users/traffic to the highest bidding search engine).

(Perhaps they could have made money by selling devices with the OS, but that didn't seem to be the aim there, as far as devices being only sold by partners.)

This is very much the stereotypical engineer approach: trying to engineer your way into revenue. Yes you need strong technology to keep users, but there's a lot more to business than the tech stack and quality.


Well stated.


Sorry, where's the part where I complained that the language is broken? I said that Go doesn't protect you from mutating shared state across multiple concurrent goroutines. Not a very controversial statement, that.


What you go on to say is:

> But how does Go’s approach to concurrency fare when viewed through the lens of encouraging code that supports local reasoning? Not very well, I’m afraid. Goroutines all have access to the same shared memory space

The article is flawed because you: 1. Base it on Go failing to protect you, when in actuality you fail to use any of Go's "very useful concurrency primitives baked right into the language". 2. When you do choose to use these primitives, you choose the wrong primitive, which results in a much more verbose and complex solution than is necessary.

It would have been much better had you: 1. Shown why the solution with no protection is unsafe (and mentioned it's unsafe in just about any mainstream language). 2. Shown a mutex solution. 3. Shown a channel-based solution.


He uses channels, which are the main Go primitive. "Share by communicating" and all that.


> which are the main Go primitive.

When it comes to protecting a structs state, in most Go code (including the standard library) RW/Mutexes are the "go to" synchronization primitive. Channels are more commonly used for communication between longer running goroutines.


Snark aside, I added a bit to clarify that the first example is contrived, implemented as such for illustrative purposes, and a mutex would actually be a better choice in the trivial case.


Thank you, I appreciate that.

I would have understood snarking at my original (admittedly somewhat snarky) comment. I'm not sure why you felt the need to snark at my constructive criticism, which you actually took.


It's maybe a bit silly having a conversation about tone on Hacker News, but generally people tend to respond better when you don't state opinions, especially critical ones, as though they're fact. See "The article is flawed because you..." and "It would have been much better had you..." I statements FTW.

I don't think the original version was particularly flawed, and your description of why you think it was doesn't make a lot of sense to me. But I added a note anyway to make it clear that I do in fact know about the alternative approaches, and that I made the choices I made deliberately and not out of ignorance.

Regardless, I appreciate you taking the time to read it, and to give your feedback. Cheers.


Thanks for explaining to me the error of my ways.


Absolutely true. And if I were writing this for actual real world use then I would have done so. The main technique I was trying to get to was tightly coupling a struct w/ an internal event loop, though, so I allowed myself some awkward code (called out as such in the post) to build up to that point.


Unfortunately, the diagram you linked to is misleading; we don't have all of those plugins built yet. We've got a (quite rich) Python client (https://github.com/mozilla-services/heka-py) and a (rudimentary) Go client (https://github.com/mozilla-services/heka/tree/master/client) but Memcached and PostgreSQL connectors aren't yet in place. Fleshing out our plugin set (especially the inputs) is one of our highest priorities, so those should be coming Real Soon Now(tm). Contributions welcome!


Not really. Piwik is focused on web traffic analytics. You could build a piwik-type system using Heka, but Heka is a lower level tool w/ a wider focus.


It's certainly not available out-of-box, and it would take a bit of work, but it should be possible to build something like this with Heka. You could write a dynamic filter plugin that watched for peaks in a specific graph (or any other arbitrary trigger) and, if found, generate a message to trigger the activation of a bunch of new dynamic filters, quickly turning on a more comprehensive set of data analyzers.


It's definitely possible, at least with RRDtool and 30 lines of Python (only 20 lines to react to a measured peak rather than forecast one).


The question is whether reacting to peaks identified this way can be helpful or not. If you have hundreds of thousands of metrics, how many peaks get detected per minute and how many of those indicate something that actually requires attention?


There are many ways to sort peaks out. For instance, there is a script for RRDtool that removes obviously irrelevant spikes [1].

The vast majority of those hundreds of thousands of metrics are common to any node/server, so predefined recipes/settings can be used for them. Only application-level metrics for in-house applications have to be tuned manually.

[1] http://oss.oetiker.ch/rrdtool/pub/contrib/removespikes-20080...


Actually, there is a fair amount of overlap between collectd and Heka. And Heka provides mechanisms for in-flight data processing and graphing output, in addition to logging and routing. But you're also right that Heka might in some cases also be used in place of syslog.

Of course, both syslog and collectd have been around and battle-hardened for many(!) years, whereas this first Heka release is being called "0.2-beta-1" for a reason. I wouldn't go rush into replacing any mission-critical infrastructure just yet. ;)


Heka and Circus have very different goals. Heka is about generic data gathering, processing, and routing. Circus is about task and process monitoring and management. There's some overlap in that Circus needs to gather and process a bit of data to do its job, and it would certainly be possible to use Heka as a part of that, but we're not doing so. We'll probably use Circus to manage at least a few of our `hekad` processes, though.

