Distributed Systems

October 03, 2007

"Dealing with failures...is our standard mode of operation"

Nick Carr discusses Amazon's Dynamo system, "used to support many of the most critical elements of Amazon's operation including shopping-cart processing", focusing on the paper "Dynamo: Amazon’s Highly Available Key-value Store" by Amazon's CTO, Werner Vogels, and a number of coauthors.

I mention it here because of this wonderful quote from the paper's introduction on page 1:

Dealing with failures in an infrastructure comprised of millions of components is our standard mode of operation; there are always a small but significant number of server and network components that are failing at any given time. As such Amazon’s software systems need to be constructed in a manner that treats failure handling as the normal case without impacting availability or performance.

Dealing with failures as a standard mode of operation.

That's something worth thinking about.