Love the idea! Maybe a suggestion for the "bit rot" problem (tours getting out of sync with the code): code tours will likely cover the most important parts of the codebase, which should be well tested in large projects (and large projects are the ones that need the tours most). You could link the tour comments to those tests, and if any of those tests change, you'd be advised to take a look at the tour comment. Those tests could also be discovered automatically, since code coverage tools exist. A rough sketch of what such a check might look like is below.
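To make the idea concrete, here's a minimal sketch, assuming a made-up "linkedTests" array in the tour file and that the tour records the commit it was authored against (I believe CodeTour uses a "ref" field and a ".tours/" directory, but treat the specifics as illustrative, not as the actual API):

```python
import json
import subprocess
from pathlib import Path

def tests_changed_since(tour_path: str) -> list[str]:
    """Return linked test files touched since the tour's recording commit."""
    tour = json.loads(Path(tour_path).read_text())
    ref = tour.get("ref")                            # commit/tag the tour was recorded on
    linked_tests = set(tour.get("linkedTests", []))  # hypothetical field
    if not ref or not linked_tests:
        return []
    changed = subprocess.run(
        ["git", "diff", "--name-only", f"{ref}..HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return sorted(linked_tests.intersection(changed))

# Example: warn about a single (hypothetical) tour file
for test in tests_changed_since(".tours/getting-started.tour"):
    print(f"Linked test changed, consider reviewing the tour: {test}")
```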
Ah, I really like this idea! Currently, a tour can be associated with a specific commit or tag, which makes it resilient to code changes over time (e.g. minor refactorings that don't fundamentally change the value of the tour).
That said, there isn't a way to automatically know when a tour should be updated after the code has changed significantly enough. Being able to bind a tour to one or more tests is a really interesting idea, and something that I'll try to explore this upcoming week. Thanks so much for the feedback here!
Great question! I'm not currently sure :) I've been fairly deliberate about building CodeTour in a language-agnostic way, in order to ensure it can be applied to any file type within a codebase. This is partly why I haven't based the tour definition experience on code comments: https://github.com/vsls-contrib/codetour/issues/38.
That said, I'm trying to keep an open mind when it comes to increasing tour resiliency, since it may require language/platform-specific solutions. I'm not sure. If you have any thoughts, I'd love to hear them!
Another potential solution is a CI task that checks how far the code has deviated from the original commit the tour was recorded on, and notifies you when the deviation crosses some threshold. Maybe that's a terrible idea, but something like that would have the benefit of being language-agnostic.
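A rough sketch of that kind of CI check, just to show the shape of it: count lines changed (via `git diff --numstat`) in the files a tour's steps reference, between the tour's recording commit and the current HEAD, and fail once the total crosses a threshold. The field names ("ref", "steps", "file") follow my reading of the .tour format, and the threshold is arbitrary:

```python
import json
import subprocess
import sys
from pathlib import Path

THRESHOLD = 50  # lines changed before we ask for a tour review; tune to taste

def lines_changed(ref: str, files: list[str]) -> int:
    """Total added + deleted lines in `files` between `ref` and HEAD."""
    out = subprocess.run(
        ["git", "diff", "--numstat", f"{ref}..HEAD", "--", *files],
        capture_output=True, text=True, check=True,
    ).stdout
    total = 0
    for line in out.splitlines():
        added, deleted, _path = line.split("\t")
        if added != "-":  # "-" marks binary files; skip those
            total += int(added) + int(deleted)
    return total

def main() -> int:
    exit_code = 0
    for tour_file in Path(".tours").glob("*.tour"):
        tour = json.loads(tour_file.read_text())
        ref = tour.get("ref")
        files = sorted({step["file"] for step in tour.get("steps", []) if "file" in step})
        if not ref or not files:
            continue
        changed = lines_changed(ref, files)
        if changed > THRESHOLD:
            print(f"{tour_file.name}: {changed} lines changed since {ref}; please review the tour")
            exit_code = 1
    return exit_code

if __name__ == "__main__":
    sys.exit(main())
```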
The post talks about coin flipping, i.e. 0/1 classification. Many competitions use different metrics, however: multiclass classification, finding bounding boxes of objects, etc. There it is much less likely to get "good" answers on the test set by chance. I think the points in the article are important, but in that context they become a non-issue, since a random answer is unlikely to be correct.
The article is not about models being indistinguishable from random classifiers; that difference should be very significant even on the tasks it discussed. Instead, the problem originates from the small differences in test set performance between the top N models. While that difference may well increase when moving from binary classification to a more technically involved regression task, that is by no means guaranteed, and the main points of the article still apply.
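A toy simulation (not from the article, numbers are arbitrary) illustrates the point: take two classifiers with identical true accuracy and see how often one appears meaningfully better than the other on a finite test set purely by chance.

```python
import numpy as np

rng = np.random.default_rng(0)
true_acc = 0.90        # both models are genuinely 90% accurate
test_size = 10_000     # size of the hidden test set
trials = 10_000        # number of simulated leaderboards

# Each model's measured accuracy is a binomial draw around the true accuracy.
acc_a = rng.binomial(test_size, true_acc, trials) / test_size
acc_b = rng.binomial(test_size, true_acc, trials) / test_size

gap = np.abs(acc_a - acc_b)
print(f"mean observed gap between the two identical models: {gap.mean():.4f}")
print(f"fraction of trials with a gap >= 0.5 percentage points: {(gap >= 0.005).mean():.1%}")
```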