Great video! Love the concepts discussed, and the tigerstyle mentioned by adityaathealye. This is such a hard problem.
For Halo 4, I architected a system for our telemetry
that that had to support subsecond roundtrip user lifetime aggregation. With up to 400ms one-way latencies, that left us with 200ms client + server time to manage lifetime counts. 60ms on the client for batching and network stack delays, and 80ms in the cloud for aggregation after routing delays.
On the client side we leveraged a lot of those TigerStyle ideas (static-allocation, simplified function surface area, explicit limits, strongly asserted). And for the cloud we effectively wrote Diagonal Scaling (Stage 7).
For gamedevs to use this, it had to be as lightweight as possible and introduce no risk of priority inversion or hitching. I wrote a many-reader/many-writer lockless circular queue with performance measured in the microseconds, it was a beast.
I was so proud of our tiny team for pulling off the vision. When we launched, there was a bug in the game: someone left in telemetry in a tight render loop during a cutsceen, so our 20k/s client-side throttles kicked in.
We hit an average of 700k/s on launch day, with peak ingestion throughput of 831k/s (2 weeks to hit a trillion). And we didn't break a sweat.
Partner teams that were trying to provide our reporting capabilities started slowing down, though, haha, and brought it to our attention ("We, uh, can't scale anymore, there's no more servers on the eastcoast data center we can allocate")... so we hit the killswitch on that category of event.
That was another piece I was proud of writing: each instrumentation point did a quick binary mask check for 64 categories and 64 subcategories to see if it should emit... one reason why the instrumentation times were so blazing fast, we had minimal branching based of a hotly-cached variable that would hang around L1 because it was touched so frequently.
Querying that day for insights, though... aggregation queries touching the launch day (that weren't on the per-user hotpath layer that was our primary use-case) would 30x query duration. XD
OH! And I forgot to mention. The tech was impressive enough, that it was quickly adopted as the backbone by every title that Microsoft published, from Gears of War to Solitaire, Forza to Minecraft.
And for comparison: The official, initial Xbox One telemetry stack used pipe-delimited strings, and supported 1-5 transactions/console/second (vs our 20k).
Thanks! It was definitely a highlight working on such bleeding-edge technologies with crazy smart people. Pouring over custom PowerPC CPU Errata was divine.
Oh, and we only had (if I recall) 1ms per frame on one core to do all our payload packaging and dequeue messages from the circular buffer. Thats' where the 20k/s hard limit came in... we could have handled SO much more. Our entire message usually landed around 100-150 bytes if I recall, using bitpacked structures.
One thing I didn't anticipate: Memory stomping would result in everyone pointing fingers at our department, because we would inevitably be the ones that would crash (usually with our hardening asserts). We had to start flagging our memory blocks as unwritable when our thread was idle during debug mode, so that offenders would crash when they touched our memory.
For Halo 4, I architected a system for our telemetry that that had to support subsecond roundtrip user lifetime aggregation. With up to 400ms one-way latencies, that left us with 200ms client + server time to manage lifetime counts. 60ms on the client for batching and network stack delays, and 80ms in the cloud for aggregation after routing delays.
On the client side we leveraged a lot of those TigerStyle ideas (static-allocation, simplified function surface area, explicit limits, strongly asserted). And for the cloud we effectively wrote Diagonal Scaling (Stage 7).
For gamedevs to use this, it had to be as lightweight as possible and introduce no risk of priority inversion or hitching. I wrote a many-reader/many-writer lockless circular queue with performance measured in the microseconds, it was a beast.
I was so proud of our tiny team for pulling off the vision. When we launched, there was a bug in the game: someone left in telemetry in a tight render loop during a cutsceen, so our 20k/s client-side throttles kicked in.
We hit an average of 700k/s on launch day, with peak ingestion throughput of 831k/s (2 weeks to hit a trillion). And we didn't break a sweat.
Partner teams that were trying to provide our reporting capabilities started slowing down, though, haha, and brought it to our attention ("We, uh, can't scale anymore, there's no more servers on the eastcoast data center we can allocate")... so we hit the killswitch on that category of event.
That was another piece I was proud of writing: each instrumentation point did a quick binary mask check for 64 categories and 64 subcategories to see if it should emit... one reason why the instrumentation times were so blazing fast, we had minimal branching based of a hotly-cached variable that would hang around L1 because it was touched so frequently.
Querying that day for insights, though... aggregation queries touching the launch day (that weren't on the per-user hotpath layer that was our primary use-case) would 30x query duration. XD
OH! And I forgot to mention. The tech was impressive enough, that it was quickly adopted as the backbone by every title that Microsoft published, from Gears of War to Solitaire, Forza to Minecraft.
And for comparison: The official, initial Xbox One telemetry stack used pipe-delimited strings, and supported 1-5 transactions/console/second (vs our 20k).