I love my ESP32-S series. I use them in tons of LED projects[1]. Sadly, it seems the WS2812b is fundamentally incompatible with wifi under FreeRTOS, due to timing glitches from interrupts.
In my next two projects I'm going to have to run two 30 ft cables, because I can't figure out how to get low-latency wifi to work on a Raspberry Pi RP2040 (where PIO is fabulous, but 500 ms request latency is killing me) or how to drive the LEDs without glitches on an ESP32.
Been using ESP32 to drive WS2812b LEDs for years. You need to use the SPI peripheral to drive the LEDs, using a driver that outputs the data format the LEDs understand. The SPI peripheral uses DMA and does not glitch when wifi is accessed at the same time.
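For anyone wondering what "outputs the data format the LEDs understand" means, here is a minimal sketch of the encoding idea (the function name, the ~2.4 MHz SPI clock, and the 3-SPI-bits-per-LED-bit scheme are illustrative choices, not the actual ESP32 driver): each WS2812b data bit is stretched over three SPI bits, so a long high pulse reads as a '1' and a short high pulse as a '0', and the resulting buffer is streamed out by the SPI peripheral over DMA without the CPU touching the timing.

def ws2812_spi_encode(pixels):
    """Expand (r, g, b) tuples into an SPI bitstream (assumes ~2.4 MHz SPI clock)."""
    bits = []
    for r, g, b in pixels:
        for byte in (g, r, b):              # WS2812b expects GRB order, MSB first
            for i in range(7, -1, -1):
                if (byte >> i) & 1:
                    bits.extend((1, 1, 0))  # long high pulse  -> LED sees '1'
                else:
                    bits.extend((1, 0, 0))  # short high pulse -> LED sees '0'
    bits.extend([0] * 64)                   # idle low time so the strip latches
    out = bytearray()
    for i in range(0, len(bits), 8):        # pack bits into bytes, MSB first
        value = 0
        for bit in bits[i:i + 8]:
            value = (value << 1) | bit
        out.append(value)
    return bytes(out)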
from pprint import pprint

# Body of a vocabulary-building function; raw_corpus and min_frequency
# are parameters of the enclosing function.
word_bag: dict[str, int] = dict()  # Multiset
for line in raw_corpus:
    words = line.split()
    for word in words:
        if word in word_bag:
            word_bag[word] += 1
        else:
            word_bag[word] = 1
keys_to_drop = []
for k, v in word_bag.items():
    if v < min_frequency:
        keys_to_drop.append(k)
for k in keys_to_drop:
    del word_bag[k]
pprint(word_bag)  # debug output of the counts
print(len(word_bag))
vocabulary = word_bag.keys()
return set(vocabulary)
Can be a Python one/two-liner:
from collections import Counter

# word_bag = Counter()  # or defaultdict(int)
word_bag = Counter(word for line in raw_corpus for word in line.split())
return set(word for word, count in word_bag.items() if count >= min_frequency)
I politely disagree. Both pieces of code have teaching value, but of different kinds.
The iterative code is busy and long, but it lets you track exactly how the pretty trivial calculation happens, down to elementary(-ish) operations.
The comprehension-based code is more declarative; it succinctly shows what is happening, in almost plain English, without the minute details cluttering up the purpose of the code.
For anyone who is not a Python beginner, but is an ML beginner, the shorter version is much more approachable, as it puts the subject matter more front-and-center.
(Imagine that every matrix multiplication would be written using explicit loops, instead of one "multiply" operation. Would it clarify linear algebra for you, or the other way around?)
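For concreteness, here is that contrast in code; a toy sketch (the names and shapes are made up), not something from the article:

import numpy as np

def matmul_loops(a, b):
    """Matrix multiply with explicit loops: every elementary operation is visible."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2
    c = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i, j] += a[i, p] * b[p, j]
    return c

a = np.random.rand(3, 4)
b = np.random.rand(4, 2)
assert np.allclose(matmul_loops(a, b), a @ b)  # the single "multiply" operation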
> For anyone who is not a Python beginner, but is an ML beginner, the shorter version is much more approachable, as it puts the subject matter more front-and-center.
It certainly depends on the audience. Interestingly, I had the opposite conclusion about Python beginners in my head before reaching this line!
I think it's more about the learner's prior background. Lately, I've mostly been helping friends who do a lot of scientific computing get started in ML. For that audience, the "nested loops" presentation is typically much easier to grok.
> (Imagine that every matrix multiplication would be written using explicit loops, instead of one "multiply" operation. Would it clarify linear algebra for you, or the other way around?)
Obviously "every" would be terrible. But there's a real question here if we flip "every" to "first". For a work-a-day mathematician who doesn't write code often, certainly not! For a work-a-day programmer who didn't take or doesn't remember linear algebra, the loopy version is probably worth showing once before moving on.
On a related note: I sometimes find folds easier to understand than loops. Other times find loops easier to understand than folds. I'm not particularly sure why. Probably having both versions stashed away and either exercising judgement based on the learner at hand -- or just showing both -- is the best option.
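For example, here's the word count from earlier written both ways; a purely illustrative sketch, with functools.reduce standing in for a generic fold:

from functools import reduce
from collections import Counter

lines = ["the cat sat", "the cat"]

# loop version: mutate an accumulator step by step
counts_loop = Counter()
for line in lines:
    counts_loop.update(line.split())

# fold version: combine per-line Counters into one result
counts_fold = reduce(lambda acc, line: acc + Counter(line.split()), lines, Counter())

assert counts_loop == counts_fold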
I sympathize with your position, but in this case the two liner is significantly more readable. I had no problem digesting it. I then looked at the longer version and it was a much higher cognitive load to digest that. I have to look at multiple for loops to realize they're merely counting the words in a corpus. The shorter version lets me see it immediately.
It's probably a matter of audience. For the folks I've been teaching lately, who mostly know some combination of Java/C++/MATLAB, the Pythonic version is probably harder to follow.
Anyways, now I have two ways of saying the same thing, which is always nice to have when teaching.
The original code looked like someone had learned old-school C++ and just shoehorned it into Python. This phenomenon is all over physics. The fixed code isn't just shorter; it's idiomatic and clear (YMMV), and hence much easier to understand.
Reasonably modern C++ would allow you to define a Counter class, and to define maps and filters, if the stdlib versions don't work for you for whatever reason.
Moderately different, particularly with respect to performance.
If I read this correctly, immortal objects stop changing the refcount, which prevents cache-line invalidation, while `gc.freeze()` just stops cleaning up objects with zero refcount.
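A quick sketch of where gc.freeze() is typically used (the pre-fork pattern; CONFIG and the worker body are made-up placeholders):

import gc
import os

CONFIG = {"model": "large", "threads": 8}   # stand-in for long-lived module state

gc.collect()   # drop garbage created during startup
gc.freeze()    # move the survivors to a permanent generation: the cyclic GC stops
               # examining them (and writing its bookkeeping flags into them);
               # refcount updates can still dirty pages, which is what
               # immortal objects address

if os.fork() == 0:            # POSIX only
    print("worker sees", CONFIG)
    os._exit(0)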
Apparently some Ryzen models have no fixed microcode available. You can boot with clearcpuid=xsaves as a workaround, probably at some performance cost.
As I understood the email thread, they do have microcode updates, but they weren't actually released anywhere except in some crusty vendor's BIOS update, so you can only get them if someone fished them out of there.
However, for some inexplicable reason AMD doesn't tend to update the microcode in that repo particularly often, leaving it up to BIOS vendors and to users updating their BIOS.
The reality is most consumer motherboards rarely post updates especially after the first year or so. You'll tend to get updates to fix CPU compatibility with newer CPUs if the motherboard is still on sale, but otherwise long-term BIOS updates seem to come largely from enterprise vendors (Dell, Lenovo, etc.) and are much less common on consumer or gaming hardware.
I think most people rely on the operating system to (amazingly) hot-patch it during boot. Intel and AMD both publish the updates, which are regularly integrated into most distros (and the linux-firmware git tree). Surprising/weird that they haven't released the Renoir ones.
It also seems Tavis hit a bug where Debian wasn't applying them on boot for some reason, but he didn't give details. Wonder what it was.
You have that like exactly backwards. You'll get a lot fewer bios updates from Dell or Lenovo than you will from MSI, Asus, Gigabyte, etc.. consumer / gaming motherboard lines. My 5 year old X370-F GAMING is still getting BIOS updates. Others, like MSI, practically forced AMD to continue issuing AGESA updates for X370 & X470 chipsets after AMD had announced official end of support - they got AMD to change course and add new CPU support to those old chipsets.
But otherwise all the major consumer / gaming motherboards pick up new AGESA updates quickly & consistently, even when they're EOL platforms.
> The reality is most consumer motherboards rarely post updates especially after the first year or so
I can't confirm that. My current board is the MSI X570-A PRO. The first BIOS was 2019-06-20, the latest 2022-08-19, and it's still getting updated versions and settings after 3 years; I'm expecting more. This has also been my experience with other boards: MB updates tend to last several years.
I had to look up the difference between XSAVES and XSAVEC:
"Execution of XSAVES is similar to that of XSAVEC. XSAVES differs from XSAVEC in that it can save state components corresponding to bits set in the IA32_XSS MSR and that it may use the modified optimization."
As an outsider to the hardware world, I find it astounding that it's possible to fix the behaviour of a CPU instruction by changing code (assuming I understand correctly).
In my mind a CPU instruction is hardwired on the chip, and it blows my mind that we keep finding workarounds to already released hardware.
Only one small part of the CPU actually understands the "x86_64 language". Most of the CPU executes a completely different, much simpler language, where instructions are called "micro-operations" (or µops). There's a hardware component called the "decoder" (part of what we call the "front-end") which is responsible for parsing the x86_64 instructions and emitting these µops. One x86_64 instruction often produces multiple µops.
You can change the mapping from x86_64 instruction to sequence of micro-operations during boot on modern CPUs. That's what we mean by updating the microcode.
At least that's my understanding, as someone who has implemented a few toy CPUs in digital logic simulation tools and consumed a bunch of material on the topic as a hobbyist, but who has no actual knowledge of the particulars of how AMD and Intel do stuff.
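A toy model of that idea (nothing like the real decode tables; the µop names are invented): the decoder is basically a table from architectural instructions to µop sequences, and a microcode update swaps in corrected sequences.

# Toy "decoder": each architectural instruction maps to a sequence of µops.
MICROCODE = {
    "PUSH rax": ["sub rsp, 8", "store [rsp], rax"],
    "XSAVES":   ["check IA32_XSS", "save x87 state", "save SSE state", "..."],
}

def decode(instruction):
    # simple instructions pass through as a single µop
    return MICROCODE.get(instruction, [instruction])

# A microcode update amounts to patching the table for the buggy instruction:
MICROCODE["XSAVES"] = ["check IA32_XSS", "save x87 state", "save SSE state",
                       "save supervisor state (fixed)", "..."]
print(decode("XSAVES"))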
Micro-ops aren’t simpler than the AMD64 instructions; the complexity is about the same. For instance, the following instruction
vfmadd231ps ymm3, ymm1, YMMWORD PTR [rax-256]
does quite a few things (a memory load, and an 8-wide fused multiply+accumulate), yet it decodes into a single micro-op.
Most AMD64 instructions decode into a single micro-op. Moreover, there’s a thing called “macro-op fusion”, where two AMD64 instructions are fused into a single micro-op. For example, scalar comparison + conditional jump instructions are typically fused when decoding.
That's an important detail: not all macro-ops are more complex than micro-ops, and most of our everyday x86 instructions are simpler than the more complex micro-ops.
But we can agree that the complexity ceiling is much higher on macro-ops than micro-ops, right? The µop you mentioned does one (vector) FMA operation on two (vector) registers and stores the result to RAM. Meanwhile in x86 we have things like the rep prefix, which repeats an instruction until ECX is zero, or the ENTER and LEAVE instructions to set up and tear down a stack frame. Those are undoubtedly implemented in terms of lots of micro-ops.
> complexity ceiling is much higher on macro-ops than micro-ops, right?
Other examples are crc32, sha1rnds4, aesdec, aeskeygenassist - the math they do is rather complicated, yet on modern CPUs they are single micro-op each.
> one (vector) FMA operation on two (vector) registers and stores the result to RAM.
It loads from there.
> Those are undoubtedly implemented in terms of lots of micro-ops.
Indeed, but I don't think it's about complexity. I think they use microcode for two things: instructions which load or store more than one value (a value is up to 32 bytes on AVX, 64 bytes on AVX-512 processors), and rarely used instructions.
So the decoder is like an emulator. If so, it would theoretically be possible to provide a different ISA and have it executed as µops as well. Not saying it would be fast, or practically possible, given how locked down it is.
Transmeta couldn't bring their product to market fast enough because Intel was suing them.
It had nothing to do with the quality of the product itself.
The CPU only pretends to be a CPU. In reality, it is a small datacenter composed of several small special-purpose computers doing all the work. I gave up on understanding CPUs in depth around the time I read an introduction to Intel's then-new i860 CPU in a magazine's April issue and it turned out to be a real device, not an April Fools' joke.
It's true that the distinction is a bit vague; the term JIT is overloaded enough that it has stopped being a useful technical term.
Compared to 'JVM JIT' or 'LuaJIT': there is no instrumentation to detect what is hot or not. The CPU frontend will crack x86 instructions into micro-ops, while looking for some patterns to merge certain instructions or uops into larger micro-ops. The micro-coded instructions (like many of the legacy instructions) are likely just lookups.
Most of this is my speculation, mind. Modern CPU frontends are still kind of a black-magic box to me, but I think they are limited to relatively simple transformations by virtue of being on the critical execution path.
The chip has a quasi-compiler that compiles the stream of assembler instructions into µops and dispatches them to various parts to run, often in parallel.
That part is driven via microcode (kind of like firmware), and fixes to it can fix some CPU bugs.
This also means that one core can essentially run multiple assembler instructions in parallel (say, fetching memory at the same time a floating-point operation is running, at the same time some other integer operation is running, etc.) while making it look like everything was done serially.
It's nothing new. Most of us might think one instruction only does one thing, but it's actually much more complicated than that. Instructions can be broken down into multiple steps, and some of those steps are shared. Thus modern CPUs have the concept of a µOp, which refers to such a step. What an instruction does, and especially how it does it, can be updated by uploading new firmware to the CPU.
In short, the microcode instructions are a bunch of flags that enable different parts of the processor during that clock cycle (e.g. is data being loaded off the bus into a register? Is the adder active? Etc.). So to implement an instruction that says "add the value from memory a to the value from memory b and store the result in memory c", the microcode might be: copy memory a onto the bus, store the bus into a register, copy memory b onto the bus, store it into another register, add both registers and put the result on the bus, store the value on the bus to memory c. (In a hypothetical simple CPU like the one Ben built; a real one is obviously much more sophisticated.) So in Ben's toy CPU, the instructions are just indices into an EEPROM that stores the control-logic bit pattern ("microcode") for each instruction, and IIRC each instruction takes however many cycles the longest instruction requires (in real life that would be optimised, of course).
This is also how some processors like the 6502 have “undocumented” instructions: they’re just bit patterns enabling parts of the processor that weren’t planned or intended.
So you can see that it may be possible to fix a bug in instructions by changing the control logic in this way, even though the actual units being controlled are hard wired. I guess it very much depends on what the bug is. Of course I only know how Ben’s microcode works and not how an advanced processor like the one in question does it, but I imagine the general theme is similar.
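A toy version of that EEPROM-style control store in code (the signal names, the opcode, and the cycle breakdown are all invented for illustration):

# Each control signal is one bit of the control word.
MEM_TO_BUS, BUS_TO_REG_A, BUS_TO_REG_B, ALU_ADD, ALU_TO_BUS, BUS_TO_MEM = (
    1 << i for i in range(6)
)

# "Microcode ROM": each opcode indexes a list of control words, one per clock cycle.
CONTROL_STORE = {
    "ADD_MEM": [                            # c = a + b, all operands in memory
        MEM_TO_BUS | BUS_TO_REG_A,          # cycle 1: first operand -> register A
        MEM_TO_BUS | BUS_TO_REG_B,          # cycle 2: second operand -> register B
        ALU_ADD | ALU_TO_BUS | BUS_TO_MEM,  # cycle 3: add, put result on bus, write back
    ],
}

for step, word in enumerate(CONTROL_STORE["ADD_MEM"], 1):
    print(f"cycle {step}: control word {word:06b}")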
Slightly off-topic, but I highly recommend Inside The Machine by Jon Stokes if you'd like to understand a bit more about how CPUs work... it's an extremely accessible book (I also knew next to nothing about the hardware world)
The instructions that you the user see are in themselves little sequences of code. Think about it this way - you like code reuse, right? DRY? If you want a bit of hardware that can add two numbers in registers, why would you want to have another copy of the same thing that can add a value to the program counter? It's just a register, even if it's a bit special.
The thing is, the microcode is often using instructions that are a very different "shape" from sensible machine-code instructions, because quite often they have to drive gates within the chip directly and not all combinations might make sense. So you might have an instruction that breaks down as "load register A into the ALU A port, load register X into the ALU B port, carry out an ADD and be ready to latch the result into X but don't actually latch it for another clock cycle in case we're waiting for carry to stabilise", much of which you simply don't want to care about. The instructions might be many many bits long, with a lot of those bits "irrelevant" for a particular task.
The 6502 CPU was a directly-wired CPU where everything was decoded from the current opcode. It doesn't really have "microcode" but it does have a state machine that'll carry out instructions in phases across a few clocks. It does actually have a lot of "undefined" instructions, which are where the opcode decodes into something nonsensical like "load X and Y at the same time into the ALU" which returns something unpredictable.
CPUs internally are made up of various components connected to various busses.
Take a simple example: the registers are made up of latches that hold onto values and have a set of transistors that switch their latches to connect to the BUS lines or disconnect from them, along with a line that makes them emit their latched value or take a new value to latch. This forms a simple read/write primitive.
If the microcode wants to move the result of an ADD out of the ALU into register R1 then it will assert the relevant control lines:
1. It drives the ALU's hidden SUM register's WRITE line high, which connects the output of its latches to the lines of the bus. For a 64-bit chip there would be 64 lines, one per bit. Each bit line then goes high or low to match the contents of SUM.
2. It will also set R1's READ line high, meaning the transistors that connect R1's bit latch inputs to the bus lines will switch ON, allowing the voltages on each bus line to force R1's latch input lines high or low (for 1 or 0).
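A toy rendering of those two steps (register and signal names invented; the real thing is transistors, not code):

registers = {"SUM": 0b1010_1100, "R1": 0}   # latched values

bus = registers["SUM"]        # step 1: SUM's WRITE line high -> its latches drive the bus
registers["R1"] = bus         # step 2: R1's READ line high -> bus levels latched into R1

assert registers["R1"] == registers["SUM"]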
In a real modern CPU things are vastly more complex than this, but it is just nicer abstractions built on top of these kinds of simple ideas. Microcode doesn't actually control the cache with control lines; it issues higher-level instructions to the cache unit, which takes responsibility. The cache unit itself may have a microcode engine that itself delegates operations to even simpler functional units, until you eventually get to something that is managing control lines to connect/disconnect/trigger things. Much as in software, higher-level components offer their "API" and internally break operations down into multiple simpler steps until you get to the lowest layer doing the actual work.
This particular instruction - XSAVES - isn't the sort of simple building block that most user code is full of like ADD or MOV. It does quite a bit of work (saving a chunk of the CPU state) and is implemented more like calling a subroutine within the CPU than the way the normal number-crunching instructions are executed. These updates basically just change that subroutine code within the CPU.
> I find it astounding that it's possible to fix the behaviour of a CPU instruction by changing code.
Sometimes, CPU vendors run out of space for such bug fixes. They have to re-introduce another bug to free up space to fix a more serious one. That one kinda blew my mind.
I remember one of their old guidebooks describing a lot of struggle to keep their 64-machine (512-GPU) cluster running; this was probably 4x the machines and 4x the number of cluster dropouts.
At CentML, we profiled GPU utilization on a larger AI/ML research institute's cluster. It was in the 10% to 45% range, mostly around 10%. We then offered them software optimizers (which do not affect model accuracy) to get the GPUs to 90% utilization.
90% sustained utilization is quite amazing, and 10% is shockingly typical. I am quite skeptical that this holds for training and very large data sets, of the sort where data placement comes into play, but if so, congratulations, and I hope things go well for you.
A lot of it appears to be non-streaming approaches to data distribution, resulting in actual job behavior that looks a lot more like stage-process-clear batch jobs than the kind of pipeline you'd want in order to hide the latency of data moves.
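A minimal sketch of the streaming idea (load_batch and train_step are placeholders): prefetch the next batch on a background thread so the processing step never waits on a data move.

import queue
import threading

def load_batch(i):
    return f"batch-{i}"            # placeholder for the actual data read/transfer

def train_step(batch):
    print("processing", batch)     # placeholder for the actual GPU work

def prefetcher(num_batches, q):
    for i in range(num_batches):
        q.put(load_batch(i))       # overlaps with train_step on the main thread
    q.put(None)                    # sentinel: no more batches

q = queue.Queue(maxsize=2)         # small buffer keeps the pipeline full, bounds memory
threading.Thread(target=prefetcher, args=(8, q), daemon=True).start()

while (batch := q.get()) is not None:
    train_step(batch)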
[1] https://anima.haus/events/seacompression