The problem is rather that most data structure tutorials and books don't even think to introduce a purely functional version; they merely present the imperative ones. Coming up with functional versions of data structures can be difficult. Papers about them can be hard to understand and often require you to already know some niche language that a researcher used. Even if you can find a functional implementation of a data structure, there is often no good explanation, and you need to more or less reverse engineer it and translate it to the language you are using.
In short, it seems relatively few people have the skills to implement them, and even fewer have the skills to come up with functional versions of ordinary data structures. They are almost completely absent from university lectures as well, as far as I am aware. For AVL trees, for example, I could only find a document from an ETH lecture that no longer exists or is no longer taught. The language is Isabelle, and I need to understand its syntax first before I can translate it to Scheme.
If anyone has an obscure source for implementations one can learn from and implement oneself in another language, please share.
[1] are the exercises, with solutions (!), of a lecture [2] at ETH Zurich, which include AVL trees in Isabelle. The whole list of exercise PDFs is at [3]. I started porting the AVL trees to Scheme [4], but lately have not worked on it further. I also try to make the implementation easy to understand by providing explanatory comments.
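Not the actual code from [4], but to give an idea of what a purely functional AVL tree looks like, here is a minimal sketch (in Python for brevity): nodes are immutable, insertion rebuilds only the path from the root to the new leaf, and untouched subtrees are shared between the old and new trees.

    # Persistent AVL tree sketch: a node is an immutable tuple
    # (key, left, right, height); the empty tree is None.

    def height(t):
        return t[3] if t is not None else 0

    def node(key, left, right):
        return (key, left, right, 1 + max(height(left), height(right)))

    def balance(t):
        return height(t[1]) - height(t[2])

    def rotate_right(t):
        key, left, right, _ = t
        lkey, ll, lr, _ = left
        return node(lkey, ll, node(key, lr, right))

    def rotate_left(t):
        key, left, right, _ = t
        rkey, rl, rr, _ = right
        return node(rkey, node(key, left, rl), rr)

    def rebalance(t):
        if balance(t) > 1:                       # left-heavy
            if balance(t[1]) < 0:                # left-right case
                t = node(t[0], rotate_left(t[1]), t[2])
            return rotate_right(t)
        if balance(t) < -1:                      # right-heavy
            if balance(t[2]) > 0:                # right-left case
                t = node(t[0], t[1], rotate_right(t[2]))
            return rotate_left(t)
        return t

    def insert(t, key):
        # Returns a new tree containing key; the old tree stays valid.
        if t is None:
            return node(key, None, None)
        k, left, right, _ = t
        if key < k:
            return rebalance(node(k, insert(left, key), right))
        if key > k:
            return rebalance(node(k, left, insert(right, key)))
        return t                                 # key already present

    t = None
    for k in [3, 1, 4, 1, 5, 9, 2, 6]:
        t = insert(t, k)   # every intermediate tree remains usable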
Okasaki's is basically the Bible for this stuff. Anyone writing data structure libraries in a functional language will have read this or have it on their to-read list.
I have the book, but it doesn't contain that many data structures either. Some of the arguably most-used ones (AVL trees?) are not in there.
Not all exercises have solutions. I get stuck on some exercises and have a hard time finding solutions that I can compare with my implementations in Scheme. Not being a Haskell user (yet), I also have some issues translating Haskell to Scheme. In languages like Haskell one often controls flow by pattern matching on types. In Scheme one needs to define structures or records explicitly and use their predicates to check the type. It's all possible, but not exactly great for learning; the road has, so to speak, plenty of sticks and stones to stumble over.
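To illustrate the difference with a toy example (in Python rather than Haskell or Scheme, just to sketch the idea, not taken from the exercises):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Leaf:
        pass

    @dataclass(frozen=True)
    class Node:
        key: int
        left: object
        right: object

    # Pattern-matching style, roughly what Haskell function clauses look like:
    def size(t):
        match t:
            case Leaf():
                return 0
            case Node(_, left, right):
                return 1 + size(left) + size(right)

    # Explicit-predicate style, closer to what Scheme records force you to write:
    def size2(t):
        if isinstance(t, Leaf):
            return 0
        elif isinstance(t, Node):
            return 1 + size2(t.left) + size2(t.right)

    print(size(Node(1, Leaf(), Node(2, Leaf(), Leaf()))))   # 2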
You can even build them with basically one line of code by sorting points using Morton / Z-curve order. It's linear time if you use a counting/radix sort.
Edit: lol, downvoted for this post. Never change, HN.
Here's an example from AWS, where lat/long pairs are put into a Z-index, which is used as a DynamoDB sort key, letting you efficiently query for items near a point.
I would like to know more about this, too. Is there code anywhere, ideally with comments? I'm fine without comments too; I would just like to see the code, possibly with an example usage.
Okay, I was intrigued and did some digging. Morton / Z-order is all about interleaving the individual bits of the x and y coordinates. You end up grouping by quadrants. Python one-liner:
points.sort(key=lambda p: sum(((p[0]>>i&1)<<(2*i))|((p[1]>>i&1)<<(2*i+1)) for i in range(16)))
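Or unrolled into a more readable helper (same idea; assumes non-negative integer coordinates that fit in 16 bits):

    def morton(x, y, bits=16):
        # Interleave the bits: bit i of x goes to position 2*i,
        # bit i of y to position 2*i + 1. Nearby points tend to get nearby codes.
        code = 0
        for i in range(bits):
            code |= ((x >> i) & 1) << (2 * i)
            code |= ((y >> i) & 1) << (2 * i + 1)
        return code

    points = [(3, 5), (2, 2), (7, 1), (0, 6)]
    points.sort(key=lambda p: morton(p[0], p[1]))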
Yeah, good point. Those are downsides for sure, but it's a simple enough approach, and most of all it can be shoved into a database (or a B-tree, or any 1D-sorted data structure).
And for a z-curve, the order is basically a depth-first traversal of a quadtree.
It’s super easy to click the downvote button by accident on mobile when you meant to upvote. And this UI will never be fixed because this is HN after all.
This is precisely the reason I do not log in to HN on my phone. My phone is read-only, and if I want to upvote or comment I have to switch to my laptop. Pretty easy with Firefox because I can send tabs to other devices.
There are a couple of tricks you can do if you fiddle with the bits of a floating-point value using integer arithmetic and binary logic.
That was a thing back in the '90s..
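The most famous one, sketched in Python on float32 bit patterns (the old 0x5f3759df inverse-square-root hack):

    import struct

    def fast_inv_sqrt(x):
        # Reinterpret the float32 bits as an integer, halve the "logarithm"
        # that the bit pattern approximates, subtract from a magic constant...
        i = struct.unpack('<I', struct.pack('<f', x))[0]
        i = 0x5f3759df - (i >> 1)
        y = struct.unpack('<f', struct.pack('<I', i))[0]
        # ...then polish the estimate with one Newton-Raphson step.
        return y * (1.5 - 0.5 * x * y * y)

    print(fast_inv_sqrt(4.0))   # ~0.499, vs. the exact 0.5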
I wonder how hard the performance hit from moving values between the integer and float pipelines is nowadays.
The last time I looked into that was on the Cortex-A8 (first-iPhone era). Doing that kind of trick cost around 26 cycles (back and forth) due to pipeline stalls back then.
There are basic integer operations in the FP/SIMD units on most CPUs, so there's generally no need to "move back and forth" unless you need to branch on the result of a comparison, use a value as an address, or do some more specialized arithmetic.
On x86 there is sometimes (depending on the specific microarchitecture) an extra cycle of latency when using an integer operation on an xmm register last used by a float operation.
I have seen it explained as the integer and float ALUs being physically distant, so the forwarding network needs an extra cycle to transport the operands.
This is correct, but it’s happily pretty rare for it to matter in practice (because the domain bypass penalty is small and does not directly impact throughput, only latency).
(For that matter, though, most modern FP/SIMD units have a direct approximate-reciprocal instruction that is single-cycle throughput or better and much more accurate, generally around 10-12 bits, so there's no need for this sort of thing. See, e.g., FRECPE on ARM NEON and [V]RCPPS on x86.)
These kinds of tricks are still used today. They're not so useful if you need a reciprocal or square root, since CPUs now have dedicated hardware for that, but it's different if you need a _cube_ root or x^(1/2.4).
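A rough sketch of the general form of the trick (an approximation for illustration only, not production code): the float32 bit pattern is close to an affine function of log2(x), so scaling it with integer arithmetic estimates x**p for exponents like 1/3 or 1/2.4, and a Newton step or two can then polish the result.

    import struct

    def approx_pow(x, p):
        # For positive finite x, the float32 bit pattern is roughly
        # 2**23 * (log2(x) + 127), i.e. an affine function of log2(x).
        # Scaling in that domain approximates exponentiation:
        #   bits(x**p) ~ p * bits(x) + (1 - p) * (127 << 23)
        i = struct.unpack('<I', struct.pack('<f', x))[0]
        j = int(p * i + (1.0 - p) * (127 << 23))
        return struct.unpack('<f', struct.pack('<I', j))[0]

    print(approx_pow(27.0, 1.0 / 3.0))   # ~3.1, vs. the exact 3.0
    print(approx_pow(2.0, 1.0 / 2.4))    # ~1.42, vs. the exact ~1.33 (coarse, but a usable seed)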
The big cores do. They essentially pump division through something like an FMA (fused multiply-add) unit, possibly the same unit that is used for multiplication and addition. That's for the Newton-Raphson steps, or Goldschmidt steps.
In hardware it's much easier to do a LUT-based approximation for the initial estimate rather than the subtraction trick, though.
It's common for CPUs to give 6-8 accurate bits in the approximation. x86 gives 13 accurate bits. Back in 1975, the Cray 1 gave 30 (!) accurate bits in the first approximation, and it didn't even have a division instruction (everything about that machine was big and fast).
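The refinement itself is tiny. A sketch of the Newton-Raphson step such units iterate (just the math, nothing hardware-specific; the crude seed below stands in for the hardware's table lookup):

    def refine_reciprocal(x, y):
        # One Newton-Raphson step for f(y) = 1/y - x.
        # Each step roughly doubles the number of accurate bits, which is why
        # a table estimate reaches full precision in a couple of iterations.
        return y * (2.0 - x * y)

    x = 3.0
    y = 0.3                       # pretend table estimate of 1/3
    for _ in range(3):
        y = refine_reciprocal(x, y)
        print(y)                  # 0.33, 0.3333, 0.33333333... -> 1/3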
I've worked in the payment industry and among other things built a payment terminal, so I know a thing or two about it.
1st: The message overhead during communication is not an issue. It is tiny compared to the time it takes for the credit card to do its crypto thing.
2nd: This won't ever be adopted. There is simply way too much working infrastructure out there built on the APDU protocol. And there are so many companies and stakeholders involved that making any change will take years. If they start aligning on a change, it would be something that makes a real difference, not something that saves on the order of 5 to 20 milliseconds per payment.
Hi there, author here! I think you've highlighted a couple of things worth clarifying in the doc:
1. Apple having just announced it is opening up NFC to developers means that both major mobile platforms can now act as responding devices; so widely distributing new NFC protocols to consumer devices has become very fast and inexpensive through an update to the OS or installing a third party app.
2. Mobile consumer hardware is sufficiently fast for the application operations (e.g. cryptographic operations) that the round-trip and metadata overheads of APDU make a meaningful contribution to the total time it takes to complete a transaction. Experiencing this in my development efforts was the motivation for designing this alternative.
3. A1 is interoperable with APDU infrastructure and can therefore be adopted by terminals immediately: a reader terminal can attempt an A1 initiation, treat any APDU response from a legacy device as a failure, and fall back to its default APDU message exchange.
I will update the doc to clarify these points. What do you think?
Given your experience I'd be interested in your detailed feedback, maybe we could jump on a call soon if you have time?
> 1. Apple having just announced it is opening up NFC
> to developers means that both major mobile
> platforms can now act as responding devices;
> 2. Mobile consumer hardware is sufficiently fast for the application
> operations (eg. Cryptographic operations)
You are right here. It is possible to emulate a card using mobile phones. We've been able to shim/emulate any card for much longer.
The thing is: To connect to the payment system you need a certificate. And you simply don't get it unless you can prove that you have all kinds of security measures applied.
For Android and Apple, the actual payment stuff runs inside an isolated little microcontroller that has been certified and is tamper-proof. This little thing is fortified to the point that it will destroy itself if you try to open it up. There are alarm meshes, light sensors, and much more inside the chip to detect any intrusion, just to protect the certificate.
If you don't have that security, the payment providers will deny you their certificate, plain and simple.
You can build your own thing using card emulation via apps, but you will take all the risks.
How it works in practice is the following: these tamper-proof microcontrollers can run Java Card code. You can write an applet for them and get it installed (if you have the key from Apple/Google, which is not easy). Once it's installed you have two-way communication: your applet inside the secure world talking to the payment world, and your ordinary mobile app showing things on the screen.
"You can build your own thing using card emulation via apps, but you will take all the risks." right! This is exactly what I've developed. I've built a new NFC application, with its own PKI infrastructure, that's deployed onto users' mobile devices via an app install. It works over APDU, but it would be more efficient if A1 was made available by the two major mobile OS. It would likely shave off >10% of the total time to complete which makes a material difference to the failure rate (ie. customer's lifting away too early or other interference) and therefore improves the overall UX.
CRC32 is the same speed as an integer multiply, going all the way back to Nehalem (2008). 3 cycles latency, and you can start a new one each cycle (or more than one, on Zen 5).
Another bonus quirk, from the 486 and Pentium era..
BSWAP EAX converts from little endian to big endian and vice versa. It was a 32-bit instruction to begin with.
However, there is the 0x66 prefix that switches between 16- and 32-bit operand size. If you apply it to BSWAP EAX, undefined, funky things happen.
On some CPUs (Intel vs. AMD) the prefix was simply ignored. On others it did something I call an "inner swap": of the four bytes stored in EAX, bytes 1 and 2 are swapped.
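To make the two behaviours concrete (a pure illustration of what I described above, not a claim about any specific CPU):

    def bswap32(x):
        # Full 32-bit byte swap, what BSWAP EAX does: [b3 b2 b1 b0] -> [b0 b1 b2 b3].
        return int.from_bytes(x.to_bytes(4, 'little'), 'big')

    def inner_swap(x):
        # The "inner swap" described above: only the two middle bytes trade places.
        b = bytearray(x.to_bytes(4, 'little'))
        b[1], b[2] = b[2], b[1]
        return int.from_bytes(b, 'little')

    print(hex(bswap32(0x11223344)))     # 0x44332211
    print(hex(inner_swap(0x11223344)))  # 0x11332244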
Unfortunately I have not found any evidence, or reason to believe, that this "inner swap" behaviour you mention exists on any CPU, except perhaps in some flawed emulators?
Sure, you can mess up your performance by picking bad compiler options, but most of the time you are fine with just the default optimizations enabled, letting the compiler do its thing. No need to understand the black magic behind it.
This is only really necessary if you want to squeeze the last bit of performance out of a piece of code. And honestly, how often does this occur in day-to-day coding, unless you are writing a video or audio codec?
* -mtune/-march - specifying a value of native optimizes for the current machine; x86-64-v1/v2/v3/v4 target the generations, or you can specify a specific CPU (ARM has different naming conventions). Recommendation: use the generation if distributing binaries, native if building and running locally, unless you can get much more specific
* -O2 / -O3 - turn on most optimizations for speed. Alternatively Os/Oz for smaller binaries (sometimes faster, particularly on ARM)
* -flto=thin - get most of the benefits of LTO with minimal compile time overhead
* pgo - if you have a representative workload you can use this to replace compiler heuristics with real world measurements. AutoFDO is the next evolution of this to make it easier to connect data from production environments to compile time.
* math: -fno-math-errno and -fno-trapping-math are “safe” subsets of ffast-math (i.e. don’t alter the numerical accuracy). -fno-signed-zeros can also probably be considered if valuable.
Agreed. I like to compile most translation units with -O3 and then only compile the translation units that I want to debug with -O0. That way I can often end up with a binary that's reasonably fast but still debuggable for the parts that I care about.
Yup, that's what I've resorted to (in Rust I do it at the crate level instead of the translation unit). The only downside is forgetting about it and then wondering why stepping through isn't working properly.
And every graphics programmer worth their salt had a copy of "Computer Graphics: Principles and Practice" on their desk and followed whatever came out of the "Graphics Gems" series.
We knew about BSP trees, octrees, and all that stuff. It was common knowledge.
They knew: if Linux made it in Munich, it would likely spread, and they would lose tons of contracts with other German states.