
(SpiderOak / Nimbus.io cofounder here)

In addition to reflecting the founders' personal ethics about software freedom, we feel an open source backend is important simply for the sake of confidence.

Some people will want to purchase the minimum of 10 machines and host a Nimbus.io storage cluster themselves (and we are also making our hardware specs open source.) Other cloud storage providers may even do this. We hope a few people will consider the hosted option, paying Nimbus.io $0.06 per GB.

In any case, all of these are a win for us. We're already spending money every day to maintain a reliable storage backend for our encrypted Backup & Sync business at SpiderOak.com. Nimbus.io is an evolution from that. Community involvement here is most welcome. :)

Aside from that, it's just a design we are excited to share. Every other distributed storage system I could find uses replication instead of parity. A system based on parity sacrifices latency but can deliver higher throughput on individual requests (at about 1/3 the cost.) There are use cases even outside of archival storage where this is attractive.
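To make the replication-vs-parity tradeoff concrete, here is a toy illustration (not the Nimbus.io implementation; parameters are made up): with simple XOR parity, k data shares plus one parity share tolerate the loss of any single share at (k+1)/k storage overhead, versus 3x overhead for triple replication.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

data_shares = [b"AAAA", b"BBBB", b"CCCC"]  # k = 3 data shares
parity = xor_blocks(data_shares)           # 1 parity share

# Lose any one data share; rebuild it from the survivors plus parity.
lost = data_shares[1]
survivors = [data_shares[0], data_shares[2], parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == lost

# Storage overhead: 4/3x here, vs. 3x for triple replication.
```

Real systems use Reed-Solomon-style codes to tolerate multiple simultaneous losses, but the arithmetic advantage over replication is the same idea.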



I don't see how a parity based implementation can work in a meaningful way across multiple datacenters. You certainly couldn't rebuild if you lost an entire datacenter due to disaster. Replication is the only way here.

So any comparison to S3 in that regard is meaningless - Nimbus can't achieve that level of durability, correct?

Additionally, if you're just doing parity across multiple chassis in a single datacenter and lost a couple of racks due to a power outage, it would seem the network would likely shit the bed trying to rebuild, potentially bringing the whole system down. Have you guys worked through nastier failure cases that architectures like S3 can avoid?


Excellent points.

Geographic redundancy with parity complements the network topology we find in many cities: a metro area fiber ring connecting many data centers with low-cost site-to-site (not internet) bandwidth. It's even cheaper to just buy excess capacity at lower QoS.

Every archival storage provider I've talked to has a write-heavy workload. Write traffic may be more than 3x read traffic. So, for example, replicating between two sites requires a site-to-site connection with capacity equal to the incoming write rate. Since site-to-site connections are full duplex, the parity system provides bandwidth for both reads and writes at a similar price to what replication would spend on write bandwidth alone.
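A back-of-envelope sketch of that bandwidth argument, with made-up numbers (not Nimbus.io's actual provisioning, and the share placement fraction is an assumption):

```python
# Write-heavy archival workload: writes ~3x reads, per the comment above.
write_gbps = 3.0
read_gbps = 1.0

# Two-site replication: every write also crosses the site-to-site link,
# so the link must carry the full write rate in one direction.
replication_link_out = write_gbps  # 3.0 Gbps outbound

# Parity across sites: writes send only the shares stored remotely
# (assume half of each object's shares live at the far site), and reads
# pull remote shares back over the reverse direction of the same
# full-duplex link -- so read bandwidth rides the otherwise idle side.
remote_fraction = 0.5                           # assumed placement
parity_link_out = write_gbps * remote_fraction  # 1.5 Gbps outbound
parity_link_in = read_gbps * remote_fraction    # 0.5 Gbps inbound

assert parity_link_out < replication_link_out
```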

That said, the first iterations of Nimbus.io won't provide geo-redundancy beyond what creating an offsite backup inherently provides. We expect to add geo-redundant storage as an upgrade option at a slightly higher price (still well under S3).

Replying to your second point: under transient conditions, like a couple of racks losing power, the system wouldn't trigger an automatic rebuild right away. It would continue to service requests with parity and hinted handoff until the machines come back online. When the system does decide a full rebuild is needed, the rebuild rate is balanced against servicing new requests (similar to how a RAID controller gives tunable priority to rebuild vs. live traffic).
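A minimal hinted-handoff sketch, for readers unfamiliar with the term (illustrative only; the class and function names are assumptions, not the Nimbus.io design): writes meant for an offline node go to a stand-in node along with a "hint," and when the original node returns, the hinted writes are replayed to it.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.online = True
        self.store = {}   # key -> value
        self.hints = []   # (intended_node, key, value)

    def write(self, key, value):
        self.store[key] = value

def write_with_handoff(key, value, target, standby):
    if target.online:
        target.write(key, value)
    else:
        # Store on the stand-in, remembering the intended owner.
        standby.write(key, value)
        standby.hints.append((target, key, value))

def replay_hints(standby):
    """Deliver hinted writes to owners that have come back online."""
    for target, key, value in standby.hints:
        if target.online:
            target.write(key, value)
    standby.hints = [h for h in standby.hints if not h[0].online]

a, b = Node("a"), Node("b")
a.online = False
write_with_handoff("obj1", b"data", target=a, standby=b)
a.online = True
replay_hints(b)
assert a.store["obj1"] == b"data"
```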


> I don't see how a parity based implementation can work in a meaningful way across multiple datacenters. You certainly couldn't rebuild if you lost an entire datacenter due to disaster.

Sure you can. Given a system that can tolerate loss of N shares, you need to ensure that no datacenter holds more than N shares. In practice, this means you need many smaller datacenters, not two or three; whether that is economically feasible depends on the provider.
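That placement constraint can be sketched as follows (the parameters are hypothetical): with 10 shares per object and tolerance for losing any 3, no datacenter may hold more than 3 of an object's shares, which requires at least 4 datacenters.

```python
def place_shares(num_shares, datacenters, max_loss):
    """Round-robin shares across datacenters, enforcing the cap of
    max_loss shares per datacenter so any single-site loss is survivable."""
    if num_shares > max_loss * len(datacenters):
        raise ValueError("too few datacenters to honor the loss tolerance")
    dcs = list(datacenters)
    placement = {dc: 0 for dc in dcs}
    for i in range(num_shares):
        placement[dcs[i % len(dcs)]] += 1
    assert all(count <= max_loss for count in placement.values())
    return placement

layout = place_shares(num_shares=10,
                      datacenters=["dc1", "dc2", "dc3", "dc4"],
                      max_loss=3)
# Losing any one datacenter destroys at most 3 shares, which the
# erasure code can rebuild from the surviving shares elsewhere.
```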



