Good question. A bit of experience, some guesswork, and a lot of testing and benchmarking along the way. I separated out global and local opcodes originally because I wanted individual variable accesses to be fast (why do at runtime what you can do at compile time?). As I mention in the article, originally I had all the builtin functions as separate opcodes, but that was slower due to the sheer number of opcodes and Go's current binary tree approach to "switch". I also used Go's profiling tools a bunch to see where the hotspots were. I'm sure there's significant room for improvement, but it's hard, because sometimes when you improve one benchmark, something else suffers.
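To make the global/local split concrete, here's a minimal sketch of the idea. It is not GoAWK's actual code: the opcode names, the "run" function, and the two toy builtins are all made up for illustration. The point is that the compiler decides up front whether a variable is global or local, so the interpreter just indexes a slice, and that builtins share one opcode with the function selected by an operand, which keeps the main "switch" small.

```go
package main

import "fmt"

// Opcode is a hypothetical instruction type for this sketch.
type Opcode int

const (
	Num         Opcode = iota // push the operand as a constant
	LoadGlobal                // push globals[operand]
	LoadLocal                 // push locals[operand]
	Add                       // pop two values, push their sum
	CallBuiltin               // operand selects which builtin to apply
)

// Hypothetical builtin IDs, used as the operand of CallBuiltin.
const (
	BuiltinDouble = iota
	BuiltinNegate
)

// run executes code on a small stack machine and returns the top of stack.
// The scope of each variable is already baked into the opcode, so there is
// no name lookup or scope check at runtime.
func run(code []int, globals, locals []float64) float64 {
	var stack []float64
	for ip := 0; ip < len(code); {
		op := Opcode(code[ip])
		ip++
		switch op {
		case Num:
			stack = append(stack, float64(code[ip]))
			ip++
		case LoadGlobal:
			stack = append(stack, globals[code[ip]])
			ip++
		case LoadLocal:
			stack = append(stack, locals[code[ip]])
			ip++
		case Add:
			n := len(stack)
			stack[n-2] += stack[n-1]
			stack = stack[:n-1]
		case CallBuiltin:
			fn := code[ip]
			ip++
			n := len(stack)
			// A second, smaller switch: only reached for builtin calls.
			switch fn {
			case BuiltinDouble:
				stack[n-1] *= 2
			case BuiltinNegate:
				stack[n-1] = -stack[n-1]
			}
		}
	}
	return stack[len(stack)-1]
}

func main() {
	// Computes double(globals[0] + locals[0]).
	code := []int{
		int(LoadGlobal), 0,
		int(LoadLocal), 0,
		int(Add),
		int(CallBuiltin), BuiltinDouble,
	}
	fmt.Println(run(code, []float64{10}, []float64{3})) // prints 26
}
```

Whether the nested switch actually wins depends on the opcode mix and the Go version, so it's the kind of thing you'd want to measure rather than assume.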
Just the Go benchmarking tools built into "go test", plus a couple of scripts of my own that use "time ./goawk 'BEGIN { ... test script ... }'" and that kind of thing. Nothing fancy!
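For anyone who hasn't used them, a Go benchmark looks roughly like the sketch below. The function being measured ("sumLoop") is a made-up stand-in, not anything from GoAWK. Normally the "Benchmark..." function would live in a "_test.go" file and be run with "go test -bench=."; here it's driven through "testing.Benchmark" from "main" only so that the example is a single runnable file.

```go
package main

import (
	"fmt"
	"testing"
)

// sumLoop is a hypothetical workload standing in for the code under test.
func sumLoop(n int) int {
	total := 0
	for i := 0; i < n; i++ {
		total += i
	}
	return total
}

// sink keeps the result live so the compiler can't discard the call.
var sink int

// BenchmarkSumLoop follows the standard shape: run the workload b.N times
// and let the testing package choose b.N to get a stable timing.
func BenchmarkSumLoop(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = sumLoop(1000)
	}
}

func main() {
	// testing.Benchmark runs the function outside of "go test" and
	// returns the iteration count and timing.
	result := testing.Benchmark(BenchmarkSumLoop)
	fmt.Println(result)
}
```

The numbers vary from machine to machine and run to run, so they're mostly useful for comparing before and after a change on the same box.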