> IMHO the bigger issue with NaN-boxing is that on 64-bit systems it relies on the address space only needing <50 bits or so, as the discriminator is stored on the high bits.
Is this right? You get 51 tag bits, of which you must use one to distinguish pointer-to-object from other uses of the tag bits (assuming Huffman-ish coding of tags). But objects are presumably a minimum of 8-byte sized and aligned, and on most platforms I assume they'd be 16-byte sized and aligned, which means the low three (four) bits of the address are implicit, giving 53 (54) bit object addresses. This is quite a few years of runway...
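To make the bit budget concrete, here's a toy Python sketch of one possible NaN-boxing layout (made-up tag assignments and a 48-bit payload; not any particular engine's scheme):

```python
import struct

# Toy NaN-boxing layout (illustrative only; not any real engine's scheme):
#   bits 63..51 : sign + exponent + quiet bit, all set, so boxed values live in
#                 the quiet-NaN space and never collide with ordinary doubles
#   bits 50..48 : 3-bit type tag (0 reserved so the canonical NaN stays unboxed)
#   bits 47..0  : payload -- here, an object pointer assumed to fit in 48 bits
QNAN         = 0xFFF8_0000_0000_0000
TAG_SHIFT    = 48
TAG_MASK     = 0x7 << TAG_SHIFT
PAYLOAD_MASK = (1 << TAG_SHIFT) - 1
TAG_PTR, TAG_INT, TAG_BOOL = 1, 2, 3       # made-up tag assignments

def double_bits(x: float) -> int:
    """Reinterpret a float's IEEE-754 bits as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def box_ptr(addr: int) -> int:
    assert addr % 8 == 0, "assumes objects are at least 8-byte aligned"
    assert addr < (1 << TAG_SHIFT), "assumes addresses fit in 48 bits"
    return QNAN | (TAG_PTR << TAG_SHIFT) | addr

def is_boxed(bits: int) -> bool:
    # Anything with the full quiet-NaN prefix and a nonzero tag is one of ours.
    return (bits & QNAN) == QNAN and (bits & TAG_MASK) != 0

def unbox_ptr(bits: int) -> int:
    assert is_boxed(bits) and (bits & TAG_MASK) >> TAG_SHIFT == TAG_PTR
    return bits & PAYLOAD_MASK

print(hex(double_bits(3.14)))            # ordinary double: outside the boxed range
print(hex(box_ptr(0x7F00_DEAD_BEE8)))    # pointer carried in the NaN payload
print(hex(unbox_ptr(box_ptr(0x7F00_DEAD_BEE8))))
```

The point being: every bit you spend on tags (or reserve above bit 47) is a bit the address space can never grow into without reshuffling the layout.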
There's a bit of time, yes, but for an engine that relies on this format (e.g. SpiderMonkey), the assumptions associated with the value-boxing format will have leaked into the codebase all over the place. It's the kind of thing that's far less painful to take care of before you're forced to than after.
But fair point on the aligned pointers - that would give you some free bits to keep using, but it gets ugly.
You're right about the 51 bits - I always get mixed up about whether it's 12 bits of exponent or whether the 12 includes the sign. The point is that it puts hard constraints on a fairly large number of the high bits of a pointer being free, as opposed to an alignment requirement for low-bit tagging, which will never run out of bits.
This was 20+ years ago, so the "sophisticated" baseline wasn't ML or AI.
I was looking into an initial implementation and use of order files for a major platform. Quick recap: C (and similar languages) define that every function must have a unique address, but place no constraints on the relative order of those addresses. Choosing the order in which functions appear in memory can have significant performance impact. For example, suppose that you access 1,000 functions over a run of a program, each of which is 100 bytes in size. If each of those functions is mixed in with the 100,000 functions you don't call, you touch (and have to read from disk) 1000 pages; if they're all directly adjacent, you touch 25 pages. (This is a superficial description -- the thousand "but can't you" and "but also"s in your mind right now are very much the point.)
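For concreteness, the arithmetic behind those numbers, assuming 4 KiB pages (the function counts and sizes are the hypothetical ones from above):

```python
import math

PAGE = 4096          # assumed page size
called, size = 1000, 100   # 1,000 hot functions of 100 bytes each

# Worst case: each hot function sits on its own otherwise-cold page.
scattered_pages = called
# Best case: all hot code is packed contiguously.
adjacent_pages = math.ceil(called * size / PAGE)

print(scattered_pages, adjacent_pages)   # -> 1000 25
```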
I went into this with moderately high confidence that runtime analysis was going to be the "best" answer, but figured I'd start by seeing how much of an improvement static analysis could give -- this would provide a lower bound for the possible improvement to motivate more investment in the project, and would give immediate improvements as well.
So, what are all the ways you can use static analysis of a (large!) C code base to figure out order? Well, if you can generate a call graph, you can do depth-first or breadth-first, both of which have theoretical arguments for them -- or you can factor in function sizes, page size, page read-ahead size, etc., and do a mixture based on chunking to those sizes... and then you can do something like an annealing pass, since a 4097-byte sequence is awful and you're better off swapping something out for a slightly-less-optimal-but-single-page sequence, etc.
And to test the tool chain, you might as well do a trivial one. How about we just alphabetize the symbols?
Guess which static approach performed best? Alphabetization, by a large margin. This was entirely due to the fact that (a) the platform in question used symbol name prefixes as namespaces; (b) callers that used part of a namespace tended to use significant chunks of it; and (c) call graph generation across multiple libraries wasn't accurate so some of these patterns from the namespaces weren't visible to other approaches.
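A toy illustration of the effect, with made-up symbol names (a real order file would come from nm/linker-map output, not a hard-coded list):

```python
# With name prefixes acting as namespaces, a plain sort clusters each
# namespace's functions together, so a caller that touches much of one
# namespace touches a contiguous run of pages.
symbols = [
    "netbuf_alloc", "strcache_lookup", "netbuf_free", "netbuf_send",
    "strcache_insert", "timer_arm", "netbuf_recv", "timer_cancel",
]

order_file = sorted(symbols)   # the "trivial" static ordering
print("\n".join(order_file))
# netbuf_* end up adjacent, strcache_* adjacent, timer_* adjacent --
# exactly the locality the call-graph-based orderings missed when the
# graph couldn't see across library boundaries.
```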
The results were amazingly good. I felt amazingly silly.
(Runtime analysis did indeed exceed this performance, significantly.)
> 2. Become very good (top 25%) at two or more things.
Is this idea that the top 25% is "very good" at something innumeracy, or a subtle insight I'm missing? There have got to be a million skills you could assess your rank at -- writing embedded C code, playing basketball, identifying flora, PacMan, archery, bouldering… I can't imagine ever running out of entries for this list -- and you should expect to be in the top 25% of roughly a quarter of those skills, obviously heavily biased towards the ones you've tried, and even more biased towards the ones you care about. It's hard to imagine anyone who isn't in the top 25% at a dozen things, let alone two or more…
Ignore the numbers - the gist is that being good enough at the right two or three things can create similar value for you as being the best at one specific thing.
Everyone (for the sake of my argument) wants to be an engineer at a FAANG, but there are tons of folks making more money with more autonomy because they've found a niche that combines their good-enough technical ability with an understanding of, or interest in, an underserved market.
It depends on the population you are taking from. Being the top quartile embedded C developer in the world is perhaps unimpressive (there are up to 2 billion people better than you at embedded C programming), but being the top quartile embedded C developer within the population of professional embedded C developers is much more impressive.
I think it's generally accepted that at a high level being in the top quartile is considered very good. Not excellent. Not unicorn. Just very good.
Beyond that, it's not about becoming very good at two completely orthogonal things, it's about becoming very good at two things that are complementary in some way that is of value to others. Being good at PacMan and bouldering is only particularly valuable if you are competing for opportunities to participate in a hypothetical mixed-reality video game, or perhaps a very niche streaming channel. Being in the top quartile at embedded C code and at flora identification could result in building software/hardware tools to identify flora, which is a niche that currently has multiple competing products of high value to those interested.
If you consider your denominator to be the population of practitioners, rather than "everybody", top quartile would be pretty good. To use chess as an example, the 75th percentile of the global population probably knows the rules and nothing else. The 75th percentile of chess players would be an Elo of 1800 and change.
It's (obviously) a random number pulled out of someone's ass. However, I think top 25% isn't that far off. It means top 25% of people who actually tried.
If it still sounds easy, try to reach the top 25% rank in a video game you're not familiar with (Diamond in StarCraft II or whatever). You'll find it's basically the workload of a full-time job.
A [chemist, biologist, mathematician, DSP researcher] who can code at a professional level (that 25%) is worth far more in the right position than either of those skills individually.
If you haven't tried Hidden Rose apples, give them a try. Besides being gorgeous, they have a tart:sweet ratio that's similar to Granny Smith, but with a texture that's further away from a baking apple and a thinner skin. Absolutely my favorite lately.
Suppose you train two models with similar parameters the same way, one on 1800-1875 data and one on 1800-2025 data. Running both models, we get probability distributions across tokens; let's call the distributions 1875' and 2025'. We also get a finite difference between the distributions, (2025' - 1875'). What would we get if we sampled from 1.1*(2025' - 1875') + 1875'? I don't think this would actually be a decent approximation of 2040', but it would be a fun experiment to see. (Interpolation rather than extrapolation seems just as unlikely to be useful and less likely to be amusing, but what do I know.)
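Something like this numpy sketch is what I have in mind (vocab size and the two distributions are stand-ins; the extrapolated vector has to be clipped and renormalized so it's still a valid distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = 50_000

# Stand-ins for the two models' next-token distributions for one context.
p1875 = rng.dirichlet(np.ones(vocab))
p2025 = rng.dirichlet(np.ones(vocab))

alpha = 1.1                              # the "2040'" guess: 1875' + 1.1*(2025' - 1875')
p_extrap = p1875 + alpha * (p2025 - p1875)
p_extrap = np.clip(p_extrap, 0.0, None)  # extrapolation can go negative
p_extrap /= p_extrap.sum()               # renormalize to a distribution

next_token = rng.choice(vocab, p=p_extrap)
```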
These probability shifts would only account for the final output layer (which may also have some shift), but I expect the largest shift to be in the activations in the intermediate latent space. There are a bunch of papers out there that try to get some offset vector using PCA or similar to tune certain model behaviours like vulgarity or friendliness. You don't even need much data for this as long as your examples capture the essence of the difference well. I'm pretty certain you could do this with "historicalness" too, but projecting it into the future by turning the "contemporariness" knob way up probably won't yield an accurate result. There are too many outside influences on language that won't be captured in historical trends.
On whether this accounts for only the final output layer -- once the first token is generated (i.e. selected according to the modified sampling procedure), and assuming a different token is selected compared to standard sampling, all layers of the model would be affected during generation of subsequent tokens.
Done this way, it wouldn't be much better than instructing the model to elicit a particular behaviour via the system prompt. Limiting tokens to a subset of outputs is already common (and mathematically equivalent to a large shift in the output vector), e.g. for structured outputs, but it doesn't change the actual world representation inside the model. It would also be very sensitive to your input prompt.
> No, LIDAR is relatively trivial to render immune to interference from other LIDARs.
For rotating pulsed lidar, this really isn't the case. It's possible, but certainly not trivial. The challenge is that eye safety is determined by the energy in a pulse, but detection range is determined by the power of a pulse, driving towards minimum pulse width for a given lens size. This width is under 10 ns, and closer to 2-4 ns for more modern systems. With laser diode currents in the tens-of-amps range, producing a Gaussian pulse this narrow is already a challenging inductance-minimization problem -- think GaN, thin PCBs, wire-bonded LDs, etc. to get loop area down. And an inductance-limited pulse is inherently Gaussian. Playing any anti-interference games means being able to modulate the pulse more finely than that, without increasing the effective pulse width enough to make you uncompetitive on range. This is hard.
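To put rough numbers on the energy/power tension (purely illustrative, not actual eye-safety figures):

```python
# For a fixed per-pulse energy budget, peak power -- and hence detection
# range -- scales inversely with pulse width, which is what pushes designs
# toward the shortest pulse the drive electronics can produce.
energy_per_pulse = 100e-9            # joules; placeholder budget, not a real limit

for width_ns in (10, 4, 2):
    width = width_ns * 1e-9
    peak_power = energy_per_pulse / width   # rough, assumes a boxy pulse shape
    print(f"{width_ns:>3} ns pulse -> ~{peak_power:.0f} W peak")
# 10 ns -> ~10 W, 4 ns -> ~25 W, 2 ns -> ~50 W
```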
I think we may have had this discussion before, but from an engineering perspective, I don't buy it. For coding, the number of pulses per second is what matters, not power.
Large numbers of bits per unit of time are what it takes to make two sequences correlate (or not), and large numbers of bits per unit of time are not a problem in this business. Signal power limits imposed by eye safety requirements will kick in long after noise limits imposed by Shannon-Hartley.
> For coding, the number of pulses per second is what matters, not power.
I haven't seen a system that does anti-interference across multiple pulses, as opposed to by shaping individual pulses. (I've seen systems that introduce random jitter across multiple pulses to de-correlate interference, but that's a bit different.) The issue is you really do get a hell of a lot of data out of a single pulse, and for interesting objects (thin poles, power lines) there's not a lot of correlation between adjacent pulses -- you can't always assume properties across multiple pulses without having to throw away data from single data-carrying pulses.
Edit: Another way of saying this -- your revisit rate to a specific point of interference is around 20 Hz. That's just not a lot of bits per unit time.
> Signal power limits imposed by eye safety requirements will kick in long after noise limits imposed by Shannon-Hartley.
I can believe this is true for FMCW lidar, but I know it to be untrue for pulsed lidar. Perhaps we're discussing different systems?
> I haven't seen a system that does anti-interference across multiple pulses...
My naive assumption would be that they would do exactly that. In fact, offhand, I don't know how else I'd go about it. When emitting pulses every X ns, I might envision using a long LFSR whose low-order bit specifies whether to skip the next X-ns time slot or not. Every car gets its own lidar seed, just like it gets its own key fob seed now.
Then, when listening for returned pulses, the receiver would correlate against the same sequence. Echoes from fixed objects would be represented by a constant lag, while those from moving ones would be "Doppler-shifted" in time and show up at varying lags.
So yes, you'd lose some energy due to dead time that you'd otherwise fill with a constant pulse train, but the processing gain from the correlator would presumably make up for that and then some. Why wouldn't existing systems do something like this?
I've never designed a lidar, but I can't believe there's anything to the multiple-access problem that wasn't already well-known in the 1970s. What else needs to be invented, other than implementation and integration details?
Edit re: the 20 Hz constraint: that's one area where our assumptions probably diverge. The output might be 20 Hz, but internally, why wouldn't you be working with millions of individual pulses per frame? Lasers are freaking fast, and so are photodiodes, given synchronous detection.
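If I were hacking up a proof of concept of the idea above, it might look something like this toy numpy sketch (slot counts, lags, and noise levels are all invented; it ignores the pulse-energy and self-interference constraints raised elsewhere in the thread):

```python
import numpy as np

rng = np.random.default_rng(42)
slots = 4096

my_code    = rng.integers(0, 2, slots)   # stand-in for one car's LFSR output
other_code = rng.integers(0, 2, slots)   # another car's code (different seed)

true_lag = 137                           # round-trip delay of "our" target, in slots
rx = np.zeros(slots)
rx += np.roll(my_code, true_lag)         # our echo
rx += np.roll(other_code, 55)            # interfering lidar's echo
rx += rng.normal(0, 0.5, slots)          # receiver noise

# Circular correlation against our own code; the peak lag is our target,
# while the interferer's code averages out.
corr = np.array([np.dot(rx, np.roll(my_code, lag)) for lag in range(slots)])
print(int(np.argmax(corr)))              # -> 137
```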
I suggest looking at a rotating lidar with an infrared scope... it's super, super informative and a lot of fun. Worth just camping out in SF or Mountain View and looking at all the different patterns on the wall as different lidar-equipped cars drive by.
A typical long-range rotating pulsed lidar rotates at ~20 Hz, has 32 - 64 vertical channels (with spacing not necessarily uniform), and fires each channel's laser at around 20 kHz. This gives vertical channel spacing on the order of 1°, and horizontal channel spacing on the order of 0.3°. The perception folks assure me that having horizontal data orders of magnitude denser than vertical data doesn't really add value for them; and going to a higher pulse rate runs into the issue of self-interference between channels, which is much more annoying to deal with than interference from other lidars.
If you want to take that 20 kHz to 200 kHz, you first run into the fact that there can now be 10 pulses in flight at the same time... and that you're trying to detect low-photon-count events with an APD or SPAD outputting nanoamps within a few inches of a laser driver generating nanosecond pulses at tens of amps. That's a lot of additional noise! And even then, you have a 0.03° spacing between pulses, which means that successive pulses don't even overlap at max range with a typical spot diameter of 1" - 2" -- so depending on the surfaces you're hitting, on their continuity as seen by you, you still can't really say anything about the expected time alignment of adjacent pulses. Taking this to 2 MHz would let you guarantee some overlap for a handful of pulses, but only some... and that's still not a lot of samples to correlate. And of course your laser power usage and thermal challenges just went up two orders of magnitude...
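For anyone who wants to sanity-check the geometry, here's the rough arithmetic (spin rate, fire rates, and the 200 m "long range" are illustrative):

```python
import math

spin_hz = 20.0
range_m = 200.0          # assumed long-range distance for the comparison

for fire_rate_hz in (20e3, 200e3, 2e6):
    deg_per_pulse = 360.0 * spin_hz / fire_rate_hz
    spot_spacing_m = range_m * math.radians(deg_per_pulse)
    print(f"{fire_rate_hz/1e3:>6.0f} kHz: {deg_per_pulse:.4f} deg "
          f"-> {spot_spacing_m*100:.1f} cm between spots at {range_m:.0f} m")
# ~0.36 deg / ~126 cm at 20 kHz, ~0.036 deg / ~12.6 cm at 200 kHz,
# ~0.0036 deg / ~1.3 cm at 2 MHz -- only at the last does a ~2.5-5 cm
# (1"-2") spot diameter start to overlap between adjacent pulses.
```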