Hacker News | adrianco's comments

When we migrated Netflix to AWS in 2009-2011 we set up a separate archive account on AWS for backups, and also made an extra copy on GCP as our “off-prem” equivalent. We also did a weekly restore from archive to refresh the test account data and make sure backups were working. I’ve documented that pattern many times; some people have even implemented it…
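For a feel of the weekly restore check, here's a minimal sketch assuming S3-style backups. The profile names, bucket names, and steps are hypothetical illustrations of the pattern, not what Netflix actually ran:

  # Hypothetical sketch of a weekly "restore from archive" check.
  # Profile names, bucket names, and the steps are made up for illustration.
  import boto3

  archive = boto3.Session(profile_name="archive").client("s3")  # archive account credentials
  test = boto3.Session(profile_name="test").client("s3")        # test account credentials

  def latest_backup(bucket, prefix):
      """Return the key of the most recent backup object under prefix."""
      objects = archive.list_objects_v2(Bucket=bucket, Prefix=prefix)["Contents"]
      return max(objects, key=lambda o: o["LastModified"])["Key"]

  def weekly_restore_check(archive_bucket, test_bucket, prefix="backups/"):
      # If any of these steps fail, the backups aren't restorable and
      # someone should get paged before the backups are actually needed.
      key = latest_backup(archive_bucket, prefix)
      archive.download_file(archive_bucket, key, "/tmp/restore-check")
      test.upload_file("/tmp/restore-check", test_bucket, key)
      return key

  if __name__ == "__main__":
      print(weekly_restore_check("example-archive-bucket", "example-test-bucket"))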


This is a water-cooled system, so AWS is going to have to figure out a new datacenter architecture to install and run it. It’s a big investment and commitment from AWS and I expect that part of the deal was that NVIDIA would be a guaranteed customer at scale for the system, in return for things like AWS getting preferential early access to the technology.


Based on the 4x 66kW power shelves in these (albeit an A+B design), this number of systems is basically an entire datacenter for AWS, assuming they stuck to the ~35MW buildings they’ve built historically. It will likely be even a little larger than a normal building…

It’s only 288 racks of GPUs, based on 72 per rack as NVIDIA disclosed elsewhere, plus whatever ancillary network is required. Pretty wild to use that much power over so few racks.

Seems like the critical load at 100% for the GPU racks is supposed to be 34.5MW; assuming another 10% for the 800GE networking to plumb it all together, and a 1.2 PUE, gets you to about 45MW overall?
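Back-of-the-envelope version of that math (the per-rack figure is just 34.5MW spread over 288 racks, not an AWS spec):

  # Rough facility power estimate from the numbers in this thread.
  racks = 288                        # 72 GPUs per rack, per NVIDIA
  kw_per_rack = 34_500 / racks       # ~120 kW/rack of critical GPU load
  gpu_mw = racks * kw_per_rack / 1000   # 34.5 MW critical GPU load
  it_mw = gpu_mw * 1.10                 # assume +10% for the 800GE fabric
  total_mw = it_mw * 1.2                # assume a 1.2 PUE for cooling/overhead
  print(f"GPU critical load: {gpu_mw:.1f} MW")   # 34.5 MW
  print(f"Total facility:    {total_mw:.1f} MW") # ~45.5 MW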


I use OVMS on my 2010 Tesla Roadster and it’s likely a good option for people who don’t mind a bit of DIY to get it installed and to buy a SIM card. It supports most of the functionality of the car, so I assume it can do the same on a Leaf; the Leaf is on the supported car list.


Thanks! That’s cool. I’m glad I found time to write down what I remembered and the names of many of the people I met along the way. I have notes but haven’t had time to tell the story of the Netflix years yet…


These were simpler RISC implementations in those days: the compiler’s optimizer stage was in charge of deciding whether to set the branch-hint bit or not, and the prefetch unit would do what the bit said, stalling if it got it wrong.
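A toy model of that scheme, just to illustrate the cost; the stall penalty and branch mix are made up, not taken from any particular machine:

  # Static branch hints: the "compiler" marks each branch once, the front end
  # always prefetches along the hint, and a wrong hint costs a fixed stall.
  STALL_CYCLES = 3  # hypothetical refetch penalty

  def run(branches):
      """branches: list of (hint_taken, actually_taken) pairs."""
      cycles = 0
      for hint, actual in branches:
          cycles += 1                  # issue the branch
          if hint != actual:
              cycles += STALL_CYCLES   # prefetched down the wrong path
      return cycles

  # A loop-closing branch hinted "taken": right 99 times, wrong once on exit.
  loop_branch = [(True, True)] * 99 + [(True, False)]
  print(run(loop_branch))  # 100 + 3 = 103 cycles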


For example, the mmap read test used:

  /opt/iozone/bin/iozone -w -s 2G -r 4k -i 2 -B -t 60 \
    -F test0 test1 test2 test3 test4 test5 test6 test7 test8 test9 test10 test11 test12 test13 test14 test15 test16 test17 test18 test19 test20 test21 test22 test23 test24 test25 test26 test27 test28 test29 test30 test31 test32 test33 test34 test35 test36 test37 test38 test39 test40 test41 test42 test43 test44 test45 test46 test47 test48 test49 test50 test51 test52 test53 test54 test55 test56 test57 test58 test59
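Roughly what those flags do, per the iozone docs:

  -w      keep (don't unlink) the test files when finished
  -s 2G   file size per thread
  -r 4k   record (I/O request) size
  -i 2    test 2: random read / random write
  -B      use mmap() for file I/O
  -t 60   throughput mode with 60 threads/processes
  -F ...  one test file per thread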


The point you should take away from this benchmark is that a three-year-old CentOS 5 kernel, with no kernel or filesystem tuning and default mdadm options, did 100K IOPS or 1GByte/sec pretty much regardless of the benchmark options in use. It just needed enough concurrency to load up the disks. There are many other iozone options for using different variants of POSIX I/O and threading; I showed the basic options and the mmap option. So with careful tuning you could do IOPS more efficiently, but what I found was very accessible out-of-the-box performance.


"Technical debt" is a nice way of saying it had bugs. It was mostly a configuration problem, if it had been setup better we would have had no outage or a much shorter one. The work to test all our zone level resilience (Chaos Gorilla) was underway but hadn't got far enough to uncover this bug.


Visit slideshare.net/netflix and read my architecture slides; there's plenty of detail available about how Netflix works if you have a few hours to look through it.


Fascinating stuff... the latest architectural overview is particularly interesting (http://www.slideshare.net/adrianco/netflix-architecture-tuto...). If I had one criticism, I'd love to see a separate overview of the fundamental (CS?) problems vs. the ephemeral engineering problems (AWS). We all know AWS will go the way of the mainframe (though we may disagree as to timeframes!), but I think content recommendation algorithms and architectures, for example, will forever remain an interesting problem.

Though I'd love to see the monitoring solution open-sourced :-)


Monitoring is done with two systems: one in-house, in-band system that we might open source one day (it was called Epic, currently called Atlas), and AppDynamics running as a SaaS application with no dependencies on AWS. There is some useful overlap for when one or the other breaks; we merge the alerts from both (plus Gomez etc.), but they have very different strengths as tools.
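The merging itself is conceptually simple. Here's a hypothetical sketch of the idea, not the actual Atlas or AppDynamics integration, and the field names are made up:

  # Hypothetical sketch of merging alerts from independent monitoring feeds.
  from dataclasses import dataclass

  @dataclass(frozen=True)
  class Alert:
      source: str    # "atlas", "appdynamics", "gomez", ...
      service: str   # which service fired the alert
      signal: str    # e.g. "error_rate" or "latency_p99"

  def merge_alerts(*feeds):
      """Collapse duplicate alerts on (service, signal) so one incident pages
      once, while remembering which monitoring systems saw it."""
      merged = {}
      for feed in feeds:
          for alert in feed:
              merged.setdefault((alert.service, alert.signal), set()).add(alert.source)
      return merged

  atlas_feed = [Alert("atlas", "api", "error_rate")]
  appd_feed = [Alert("appdynamics", "api", "error_rate"),
               Alert("appdynamics", "billing", "latency_p99")]
  print(merge_alerts(atlas_feed, appd_feed))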


I ran one of the recommendation algorithm teams for a few years before we did the cloud migration. The techblog summaries of the algorithms are pretty good. The implementation is lots of fine-grained services and data sources, changing continuously. It's hard to stick a fork in it and call it done for long enough to document how it works.


Some stuff is in real time, some is pre-calculated. There is an enormous amount of research and testing going on in this space all the time; it's complex and it's evolving fast.

