I love my ESP32-S series. I use them in tons of LED projects[1]. Sadly, it seems the WS2812b is fundamentally incompatible with wifi under FreeRTOS, due to timing glitches from interrupts.
In my next two projects I'm going to have to run two 30 ft cables, because I can't figure out how to get low-latency wifi to work on a Raspberry Pi RP2040 (where PIO is fabulous, but 500 ms request latency is killing me) or how to drive the LEDs without glitches on an ESP32.
Been using ESP32 to drive WS2812b LEDs for years. You need to use the SPI peripheral to drive the LEDs, using a driver that outputs the data format the LEDs understand. The SPI peripheral uses DMA and does not glitch when wifi is accessed at the same time.
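For anyone wondering what "outputs the data format the LEDs understand" means, here is a minimal sketch of the encoding idea (the function name, the ~2.4 MHz SPI clock, and the 3-SPI-bits-per-LED-bit scheme are illustrative choices, not the actual ESP32 driver): each WS2812b data bit is stretched over three SPI bits, so a long high pulse reads as a '1' and a short high pulse as a '0', and the resulting buffer is streamed out by the SPI peripheral over DMA without the CPU touching the timing.

def ws2812_spi_encode(pixels):
    """Expand (r, g, b) tuples into an SPI bitstream (assumes ~2.4 MHz SPI clock)."""
    bits = []
    for r, g, b in pixels:
        for byte in (g, r, b):              # WS2812b expects GRB order, MSB first
            for i in range(7, -1, -1):
                if (byte >> i) & 1:
                    bits.extend((1, 1, 0))  # long high pulse  -> LED sees '1'
                else:
                    bits.extend((1, 0, 0))  # short high pulse -> LED sees '0'
    bits.extend([0] * 64)                   # idle low time so the strip latches
    out = bytearray()
    for i in range(0, len(bits), 8):        # pack bits into bytes, MSB first
        value = 0
        for bit in bits[i:i + 8]:
            value = (value << 1) | bit
        out.append(value)
    return bytes(out)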
from pprint import pprint

# Body of a vocabulary-building function; raw_corpus and min_frequency
# are parameters of the enclosing function.
word_bag: dict[str, int] = dict()  # Multiset
for line in raw_corpus:
    words = line.split()
    for word in words:
        if word in word_bag:
            word_bag[word] += 1
        else:
            word_bag[word] = 1
keys_to_drop = []
for k, v in word_bag.items():
    if v < min_frequency:
        keys_to_drop.append(k)
for k in keys_to_drop:
    del word_bag[k]
pprint(word_bag)  # debug output of the counts
print(len(word_bag))
vocabulary = word_bag.keys()
return set(vocabulary)
Can be a Python one/two-liner:
from collections import Counter

# word_bag = Counter()  # or defaultdict(int)
word_bag = Counter(word for line in raw_corpus for word in line.split())
return set(word for word, count in word_bag.items() if count >= min_frequency)
I politely disagree. Both pieces of code have teaching value, but of different kinds.
The iterative code is busy and long, but it lets you track exactly how the pretty trivial calculation happens, down to elementary(-ish) operations.
The comprehension-based code is more declarative; it succinctly shows what is happening, in almost plain English, without the minute details cluttering up the purpose of the code.
For anyone who is not a Python beginner, but is an ML beginner, the shorter version is much more approachable, as it puts the subject matter more front-and-center.
(Imagine that every matrix multiplication would be written using explicit loops, instead of one "multiply" operation. Would it clarify linear algebra for you, or the other way around?)
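For concreteness, here is that contrast in code; a toy sketch (the names and shapes are made up), not something from the article:

import numpy as np

def matmul_loops(a, b):
    """Matrix multiply with explicit loops: every elementary operation is visible."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2
    c = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i, j] += a[i, p] * b[p, j]
    return c

a = np.random.rand(3, 4)
b = np.random.rand(4, 2)
assert np.allclose(matmul_loops(a, b), a @ b)  # the single "multiply" operation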
> For anyone who is not a Python beginner, but is an ML beginner, the shorter version is much more approachable, as it puts the subject matter more front-and-center.
It certainly depends on the audience. Interestingly, I had the opposite conclusion about Python beginners in my head before reaching this line!
I think it's more about the learner's prior background. Lately, I've mostly been helping friends who do a lot of scientific computing get started in ML. For that audience, the "nested loops" presentation is typically much easier to grok.
> (Imagine that every matrix multiplication would be written using explicit loops, instead of one "multiply" operation. Would it clarify linear algebra for you, or the other way around?)
Obviously "every" would be terrible. But there's a real question here if we flip "every" to "first". For a work-a-day mathematician who doesn't write code often, certainly not! For a work-a-day programmer who didn't take or doesn't remember linear algebra, the loopy version is probably worth showing once before moving on.
On a related note: I sometimes find folds easier to understand than loops. Other times find loops easier to understand than folds. I'm not particularly sure why. Probably having both versions stashed away and either exercising judgement based on the learner at hand -- or just showing both -- is the best option.
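For example, here's the word count from earlier written both ways; a purely illustrative sketch, with functools.reduce standing in for a generic fold:

from functools import reduce
from collections import Counter

lines = ["the cat sat", "the cat"]

# loop version: mutate an accumulator step by step
counts_loop = Counter()
for line in lines:
    counts_loop.update(line.split())

# fold version: combine per-line Counters into one result
counts_fold = reduce(lambda acc, line: acc + Counter(line.split()), lines, Counter())

assert counts_loop == counts_fold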
I sympathize with your position, but in this case the two liner is significantly more readable. I had no problem digesting it. I then looked at the longer version and it was a much higher cognitive load to digest that. I have to look at multiple for loops to realize they're merely counting the words in a corpus. The shorter version lets me see it immediately.
It's probably a matter of audience. For the folks I've been teaching lately, who mostly know some combination of Java/C++/MATLAB, the Pythonic version is probably harder to follow.
Anyways, now I have two ways of saying the same thing, which is always nice to have when teaching.
The original code looked like someone had learned old-school C++ and just shoehorned it into Python. This phenomenon is all over physics. The fixed code isn't just shorter; it's idiomatic and clear (YMMV), and hence much easier to understand.
Reasonably modern C++ would allow you to define a Counter class, and to define maps and filters, if the stdlib versions don't work for you for whatever reason.
Moderately different, particularly with respect to performance.
If I read this correctly, immortal objects stop changing the refcount, which prevents cache-line invalidation, while `gc.freeze()` just stops cleaning up objects with zero refcount.
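A quick sketch of where gc.freeze() is typically used (the pre-fork pattern; CONFIG and the worker body are made-up placeholders):

import gc
import os

CONFIG = {"model": "large", "threads": 8}   # stand-in for long-lived module state

gc.collect()   # drop garbage created during startup
gc.freeze()    # move the survivors to a permanent generation: the cyclic GC stops
               # examining them (and writing its bookkeeping flags into them);
               # refcount updates can still dirty pages, which is what
               # immortal objects address

if os.fork() == 0:            # POSIX only
    print("worker sees", CONFIG)
    os._exit(0)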
Apparently some Ryzen models have no fixed microcode available. You can boot with clearcpuid=xsaves as a workaround, probably at some performance cost.
As I understood the email thread, they do have microcode updates, but they weren't actually released anywhere except in some crusty vendor's BIOS update, so you can only get them if someone fished them out of there.
However, for some inexplicable reason AMD doesn't tend to update the microcode in that repo particularly often, leaving it up to BIOS vendors and to users updating their BIOS.
The reality is most consumer motherboards rarely post updates especially after the first year or so. You'll tend to get updates to fix CPU compatibility with newer CPUs if the motherboard is still on sale, but otherwise long-term BIOS updates seem to come largely from enterprise vendors (Dell, Lenovo, etc.) and are much less common on consumer or gaming hardware.
I think most people rely on the operating system to (amazingly) hot-patch it during boot. Intel and AMD both publish the updates, which are regularly integrated into most distros (and the linux-firmware git tree). Surprising/weird that they haven't released the Renoir ones.
It also seems Tavis hit a bug where Debian wasn't applying them on boot for some reason, but he didn't give details. Wonder what it was.
You have that like exactly backwards. You'll get a lot fewer bios updates from Dell or Lenovo than you will from MSI, Asus, Gigabyte, etc.. consumer / gaming motherboard lines. My 5 year old X370-F GAMING is still getting BIOS updates. Others, like MSI, practically forced AMD to continue issuing AGESA updates for X370 & X470 chipsets after AMD had announced official end of support - they got AMD to change course and add new CPU support to those old chipsets.
But otherwise all the major consumer / gaming motherboards pick up new AGESA updates quickly & consistently, even when they're EOL platforms.
> The reality is most consumer motherboards rarely post updates especially after the first year or so
I can't confirm that. My current board is the MSI X570-A PRO. The first BIOS was 2019-06-20, the latest 2022-08-19, and it's still getting updated versions and settings after 3 years; I'm expecting more. This has also been my experience with other boards: MB updates tend to last several years.
I had to look up the difference between XSAVES and XSAVEC:
"Execution of XSAVES is similar to that of XSAVEC. XSAVES differs from XSAVEC in that it can save state components corresponding to bits set in the IA32_XSS MSR and that it may use the modified optimization."
As an outsider to the hardware world, I find it astounding that it's possible to fix the behaviour of a CPU instruction by changing code (assuming I understand correctly).
In my mind a CPU instruction is hardwired on the chip, and it blows my mind that we keep finding workarounds to already released hardware.
Only one small part of the CPU actually understands the "x86_64 language". Most of the CPU executes a completely different, much simpler language, where instructions are called "micro-operations" (or µops). There's a hardware component called the "decoder" (part of what we call the "front-end") which is responsible for parsing the x86_64 instructions and emitting these µops. One x86_64 instruction often produces multiple µops.
You can change the mapping from x86_64 instruction to sequence of micro-operations during boot on modern CPUs. That's what we mean by updating the microcode.
At least that's my understanding, as someone who has implemented a few toy CPUs in digital logic simulation tools and consumed a bunch of material on the topic as a hobbyist, but who has no actual knowledge of the particulars of how AMD and Intel do stuff.
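A toy model of that idea (nothing like the real decode tables; the µop names are invented): the decoder is basically a table from architectural instructions to µop sequences, and a microcode update swaps in corrected sequences.

# Toy "decoder": each architectural instruction maps to a sequence of µops.
MICROCODE = {
    "PUSH rax": ["sub rsp, 8", "store [rsp], rax"],
    "XSAVES":   ["check IA32_XSS", "save x87 state", "save SSE state", "..."],
}

def decode(instruction):
    # simple instructions pass through as a single µop
    return MICROCODE.get(instruction, [instruction])

# A microcode update amounts to patching the table for the buggy instruction:
MICROCODE["XSAVES"] = ["check IA32_XSS", "save x87 state", "save SSE state",
                       "save supervisor state (fixed)", "..."]
print(decode("XSAVES"))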
Micro-ops aren’t simpler than the AMD64 instructions; the complexity is about the same. For instance, the following instruction
vfmadd231ps ymm3, ymm1, YMMWORD PTR [rax-256]
does quite a few things (a memory load, and an 8-wide fused multiply+accumulate), yet it decodes into a single micro-op.
Most AMD64 instructions decode into a single micro-op. Moreover, there’s a thing called “macro-op fusion”, where two AMD64 instructions are fused into a single micro-op. For example, scalar comparison + conditional jump instructions are typically fused when decoding.
That's an important detail: not all macro-ops are more complex than micro-ops, and most of our everyday x86 instructions are simpler than the more complex micro-ops.
But we can agree that the complexity ceiling is much higher on macro-ops than micro-ops, right? The µop you mentioned does one (vector) FMA operation on two (vector) registers and stores the result to RAM. Meanwhile in x86 we have things like the rep prefix, which repeats an instruction until ECX is zero, or the ENTER and LEAVE instructions to set up and tear down a stack frame. Those are undoubtedly implemented in terms of lots of micro-ops.
> complexity ceiling is much higher on macro-ops than micro-ops, right?
Other examples are crc32, sha1rnds4, aesdec, aeskeygenassist - the math they do is rather complicated, yet on modern CPUs they are single micro-op each.
> one (vector) FMA operation on two (vector) registers and stores the result to RAM.
It loads from there.
> Those are undoubtedly implemented in terms of lots of micro-ops.
Indeed, but I don't think it's about complexity. I think they use microcode for two things: instructions which load or store more than one value (a value is up to 32 bytes on AVX, 64 bytes on AVX-512 processors), and rarely used instructions.
So the decoder is like an emulator. If so, it would theoretically be possible to provide a different ISA and have it executed as µops as well. Not saying it would be fast, or practically possible, given how locked down it is.
Transmeta couldn't bring their product to market fast enough because Intel was suing them.
It had nothing to do with the quality of the product itself.
The CPU only pretends to be a CPU. In reality, it is a small datacenter composed of several small special-purpose computers doing all the work. I gave up on understanding CPUs in depth around the time I read an introduction to Intel's then-new i860 CPU in a magazine's April issue and it turned out to be a real device, not an April Fools' joke.
It's true that the distinction is a bit vague; the term JIT is overloaded enough that it has stopped being a useful technical term.
Compared to 'JVM JIT' or 'LuaJIT': there is no instrumentation to detect what is hot or not. The CPU frontend will crack x86 instructions into micro-ops, while looking for some patterns to merge certain instructions or uops into larger micro-ops. The micro-coded instructions (like many of the legacy instructions) are likely just lookups.
Most of this is my speculation, mind. Modern CPU frontends are still kind of a black-magic box to me, but I think they are limited to relatively simple transformations by virtue of being on the critical execution path.
The chip has a quasi-compiler that compiles the stream of assembler instructions into µops and dispatches them to various parts to run, often in parallel.
That part is driven via microcode (kind of like firmware), and fixes to it can fix some CPU bugs.
This also means that one core can essentially run multiple assembler instructions in parallel (say, fetching memory at the same time a floating-point operation is running, at the same time some other integer operation is running, etc.) while making it look like everything was done serially.
It's nothing new. Most of us might think one instruction only does one thing, but it's actually much more complicated than that. Instructions can be broken down into multiple steps, and some of those steps are shared. Thus modern CPUs have the concept of a µOp, which refers to such a step. What an instruction does, and especially how it does it, can be updated by uploading new firmware to the CPU.
In short, the microcode instructions are a bunch of flags that enable different parts of the processor during that clock cycle (e.g. is data being loaded off the bus into a register? Is the adder active? Etc.). So to implement an instruction that says "add the value from memory a to the value from memory b and store the result in memory c", the microcode might be: copy memory a onto the bus, store the bus into a register, copy memory b onto the bus, store it into another register, add both registers and put the result on the bus, store the value on the bus to memory c. (In a hypothetical simple CPU like the one Ben built; a real one is obviously much more sophisticated.) So in Ben's toy CPU, the instructions are just indices into an EEPROM that stores the control-logic bit pattern ("microcode") for each instruction, and IIRC each instruction takes however many cycles the longest instruction requires (in real life that would be optimised, of course).
This is also how some processors like the 6502 have “undocumented” instructions: they’re just bit patterns enabling parts of the processor that weren’t planned or intended.
So you can see that it may be possible to fix a bug in instructions by changing the control logic in this way, even though the actual units being controlled are hard wired. I guess it very much depends on what the bug is. Of course I only know how Ben’s microcode works and not how an advanced processor like the one in question does it, but I imagine the general theme is similar.
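A toy version of that EEPROM-style control store in code (the signal names, the opcode, and the cycle breakdown are all invented for illustration):

# Each control signal is one bit of the control word.
MEM_TO_BUS, BUS_TO_REG_A, BUS_TO_REG_B, ALU_ADD, ALU_TO_BUS, BUS_TO_MEM = (
    1 << i for i in range(6)
)

# "Microcode ROM": each opcode indexes a list of control words, one per clock cycle.
CONTROL_STORE = {
    "ADD_MEM": [                            # c = a + b, all operands in memory
        MEM_TO_BUS | BUS_TO_REG_A,          # cycle 1: first operand -> register A
        MEM_TO_BUS | BUS_TO_REG_B,          # cycle 2: second operand -> register B
        ALU_ADD | ALU_TO_BUS | BUS_TO_MEM,  # cycle 3: add, put result on bus, write back
    ],
}

for step, word in enumerate(CONTROL_STORE["ADD_MEM"], 1):
    print(f"cycle {step}: control word {word:06b}")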
Slightly off-topic, but I highly recommend Inside The Machine by Jon Stokes if you'd like to understand a bit more about how CPUs work... it's an extremely accessible book (I also knew next to nothing about the hardware world)
The instructions that you the user see are in themselves little sequences of code. Think about it this way - you like code reuse, right? DRY? If you want a bit of hardware that can add two numbers in registers, why would you want to have another copy of the same thing that can add a value to the program counter? It's just a register, even if it's a bit special.
The thing is, the microcode is often using instructions that are a very different "shape" from sensible machine-code instructions, because quite often they have to drive gates within the chip directly and not all combinations might make sense. So you might have an instruction that breaks down as "load register A into the ALU A port, load register X into the ALU B port, carry out an ADD and be ready to latch the result into X but don't actually latch it for another clock cycle in case we're waiting for carry to stabilise", much of which you simply don't want to care about. The instructions might be many many bits long, with a lot of those bits "irrelevant" for a particular task.
The 6502 CPU was a directly-wired CPU where everything was decoded from the current opcode. It doesn't really have "microcode" but it does have a state machine that'll carry out instructions in phases across a few clocks. It does actually have a lot of "undefined" instructions, which are where the opcode decodes into something nonsensical like "load X and Y at the same time into the ALU" which returns something unpredictable.
CPUs internally are made up of various components connected to various busses.
Take a simple example: the registers are made up of latches that hold onto values and have a set of transistors that switch their latches to connect to the BUS lines or disconnect from them, along with a line that makes them emit their latched value or take a new value to latch. This forms a simple read/write primitive.
If the microcode wants to move the result of an ADD out of the ALU into register R1 then it will assert the relevant control lines:
1. It drives the ALU's hidden SUM register's WRITE line high, which connects the output of its latches to the lines of the bus. For a 64-bit chip there would be 64 lines, one per bit. Each bit line then goes high or low to match the contents of SUM.
2. It will also set R1's READ line high, meaning the transistors that connect R1's bit latch inputs to the bus lines will switch ON, allowing the voltages on each bus line to force R1's latch input lines high or low (for 1 or 0).
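A toy rendering of those two steps (register and signal names invented; the real thing is transistors, not code):

registers = {"SUM": 0b1010_1100, "R1": 0}   # latched values

bus = registers["SUM"]        # step 1: SUM's WRITE line high -> its latches drive the bus
registers["R1"] = bus         # step 2: R1's READ line high -> bus levels latched into R1

assert registers["R1"] == registers["SUM"]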
In a real modern CPU things are vastly more complex than this, but it is just nicer abstractions built on top of these kinds of simple ideas. Microcode doesn't actually control the cache with control lines; it issues higher-level instructions to the cache unit, which takes responsibility. The cache unit itself may have a microcode engine that itself delegates operations to even simpler functional units, until you eventually get to something that is managing control lines to connect/disconnect/trigger things. Much as in software, higher-level components offer their "API" and internally break operations down into multiple simpler steps until you get to the lowest layer doing the actual work.
This particular instruction - XSAVES - isn't the sort of simple building block that most user code is full of like ADD or MOV. It does quite a bit of work (saving a chunk of the CPU state) and is implemented more like calling a subroutine within the CPU than the way the normal number-crunching instructions are executed. These updates basically just change that subroutine code within the CPU.
> I find it astounding that it's possible to fix the behaviour of a CPU instruction by changing code.
Sometimes, CPU vendors run out of space for such bug fixes. They have to re-introduce another bug to free up space to fix a more serious one. That one kinda blew my mind.
I remember one of their old guidebooks describing a lot of struggle to keep their 64-machine (512-GPU) cluster running; this was probably 4x the machines and 4x the number of cluster dropouts.
At CentML, we profiled GPU utilization on a larger AI/ML research institute's cluster. It was in the 10% to 45% range, mostly around 10%. We then offered them software optimizers (which do not affect model accuracy) to get the GPUs to 90% utilization.
90% sustained utilization is quite amazing, and 10% is shockingly typical. I am quite skeptical that this holds for training and very large data sets, of the sort where data placement comes into play, but if so, congratulations, and I hope things go well for you.
A lot of it appears to be non-streaming approaches to data distribution, resulting in actual job behavior that looks a lot more like stage-process-clear batch jobs than the kind of pipeline you'd want in order to hide the latency of data moves.
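A minimal sketch of the streaming idea (load_batch and train_step are placeholders): prefetch the next batch on a background thread so the processing step never waits on a data move.

import queue
import threading

def load_batch(i):
    return f"batch-{i}"            # placeholder for the actual data read/transfer

def train_step(batch):
    print("processing", batch)     # placeholder for the actual GPU work

def prefetcher(num_batches, q):
    for i in range(num_batches):
        q.put(load_batch(i))       # overlaps with train_step on the main thread
    q.put(None)                    # sentinel: no more batches

q = queue.Queue(maxsize=2)         # small buffer keeps the pipeline full, bounds memory
threading.Thread(target=prefetcher, args=(8, q), daemon=True).start()

while (batch := q.get()) is not None:
    train_step(batch)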
[1] https://anima.haus/events/seacompression