Regarding Heroku's plan to continue relying on a company, Amazon, that failed them before:
If Heroku evolves to an architecture in which they utilize multiple AWS regions (as they mention in lesson #1 of their post-mortem) and if each region has a distinctly partitioned API "control plane," this should result in a materially improved availability situation for Heroku. EC2 Availability Zones guard against machine, power, and building failures. EC2 Regions should theoretically guard against API infrastructure and AWS software code failures.
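The multi-region idea can be sketched in a few lines: prefer a primary region, and fall back to another region whose control plane still answers. This is a minimal illustration, not Heroku's or AWS's actual mechanism; all endpoint URLs and names here are hypothetical.

```python
# Hypothetical multi-region failover sketch. Each region is assumed to
# expose an independent health endpoint for its API control plane.
import urllib.request

REGION_ENDPOINTS = {
    "us-east-1": "https://api.us-east-1.example.com/health",
    "us-west-2": "https://api.us-west-2.example.com/health",
}

def healthy(url, timeout=2):
    """Return True if the region's health endpoint answers HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def pick_region(endpoints, preferred, check=healthy):
    """Prefer the primary region; fall back to the first healthy alternate."""
    if check(endpoints[preferred]):
        return preferred
    for region, url in endpoints.items():
        if region != preferred and check(url):
            return region
    return None  # every region's control plane is unreachable
```

The point of the sketch is the partitioning: because each region's control plane is checked independently, an API outage in one region doesn't block choosing another.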
Heroku need not necessarily ditch their current single-IaaS-provider architecture in order to achieve significantly better control over their service's uptime.
On the other hand, when downtime does occur, Heroku's ability to direct its incident-response manpower toward paying customers first is limited by its downstream dependencies. If all the broken bits are inside Amazon's black box, Heroku has little control over prioritization (Amazon fixes your stuff whenever it gets around to fixing your stuff). If Heroku operated over multiple cloud providers, even with the added complexity of such an approach, it would at least control which of its most important customers to migrate first to a working cloud, away from a broken, black-box one.
In the end, I certainly don't see these considerations as simple. It's easy to complain when things go wrong, but the level of scalability and availability achieved to date is quite noteworthy.