No redundancy what-so-ever? What an amateur operation. I still say Joel is a fra...

tinco · on Oct 30, 2012

Ofcourse they have redundancy, just not cross-datacenter redundancy. And if you knew anything about cross-datacenter redundancy you'd know that cross-datacenter redundancy is something you do not decide upon lightly.

Then again, having cross-datacenter backups that can easily be taken online would be a bit more professional than 'we want to physically move the servers'.

pc86 · on Oct 30, 2012

I'll be the first to admit I don't really know anything about cross-datacenter redundancy; however, I always thought that was pretty high on the list once you had SaaS products that were pulling in enough revenue to warrant full-time employees outside of the founders. What are the reasons why you would choose not to do it? Are they all financial or are there other implications?

rscale · on Oct 30, 2012

I think the biggest argument against complex cross-DC redundancy is that it can add complexity and failure modes, not just during the emergency, but every day.

As a simple example, I've seen at least a half dozen people who had issues because they thought it was as simple as throwing a mysql node into each datacenter, only to discover (much later) that the databases had become inconsistent and that failing over created bigger problems than it solved.

Similarly, I've seen complex high-availability infrastructures where the complexity of that infrastructure created more net downtime than a simpler infrastructure would've, it just went down at slightly different times.

And you really need to think about the implications of various failure modes. If you go down in the middle of a transaction, is that a problem for your application? Is it okay to roll back to data that's 3 hours old? 3 minutes? 3 seconds?

There are any number of situations where it's reasonable to say "we expect our datacenter will fail once every couple decades and when it does, we'll be down for a couple days."

pc86 · on Oct 30, 2012

Great explanation, thank you.

Supreme · on Oct 30, 2012

Are you kidding me? If you run big sites like FogBugz then ofcourse you have cross-datacenter redundancy. It's not complicated to host your staging site in another physical location and point the DNS records to it when things go pear-shaped.

tinco · on Oct 30, 2012

Yes, so this staging site of you has exactly the same databases as your production site? Without customer data Fogbugz and Trello are useless. This means that this simple staging site of yours needs to have all data replicated to it, which means it also needs the same hardware provisioned for it, effectively doubling your physical costs, your maintenance cost and reducing the simplicity of your architecture. Ofcourse, if you're big enough you can afford to do this, and one could argue fogcreek is big enough. I'm just saying it's not a simple no-brainer.

What is a simple no-brainer how ever is to have offline offsite backups that can easily brought online. A best practice is to have your deployment automated in such a way that deployment to a new datacenter that already has your data should be a trivial thing.

But yeah, if you're running a tight ship something things like that go overboard without anyone noticing.

Remember the story of the 100% uptime banking software, that ran for years without ever going down, always applying the patches at runtime. Then one day a patch finally came in that required a reboot, and it was discovered that in all the years of runtime patches without reboots, it was never tested if the machine could actually still boot, and ofcourse it couldn't :)

Supreme · on Oct 30, 2012

Data should be backed up to staging nightly anyway. There should also be scripts in place to start this process at an arbitrary point in time and to import the data into the staging server. You do not need to match the hardware if you use cloud hosting since you can scale up whenever you want.

Here's where it gets really simple. Resize the staging instance to match live. Put live into maintenance mode and begin the data transfer to staging (with a lot of cloud providers, step #1 and #2 can be done in parallel). As soon as it finishes copying, take live down, point the DNS records at staging and wait for a few minutes. Staging is now live, with all of live's data. Problem solved. Total downtime: hardly anything compared to not being prepared. Total dataloss: none.

tinco · on Oct 30, 2012

I fully agree that this is how it could, and perhaps should be done. But you assume they are already on cloud hosting, which they obviously aren't. Ofcourse this is also a choice that has to be made consciously. Especially since fogcreek has been around a lot longer than the big cloud providers.

You can look to Amazon to see that cloud architecture brings with it hidden complexity that also increases risk of downtime while you relinguish a lot of control on for example the latency and bandwidth between your nodes.

What I don't know by the way, is wether the total cost of ownership is larger for colocation or for cloud hosting.

dbecker · on Oct 30, 2012

Why do you think they aren't doing this?

Possible explanations

1) Their engineers never thought of it

2) They considered it, and it is as simple as you think... but they don't care about uptime.

3) Implementing geographic redundancy is harder than you think given whatever other constraints or environment they face.

4) Some other explanation

#3 seems like the most likely explanation to me.

almost · on Oct 30, 2012

So which of your big sites have cross-datacenter redundancy? Why don't you talk about the decision process that lead to that and costs associated?

Unless you're just talking out of your arse of course and you have no experience with that sort of thing at all.

michaelhoffman · on Oct 30, 2012

The relationship between willingness to opine on a topic and knowledge of that topic:

http://www.smbc-comics.com/?id=2475

burningion · on Oct 30, 2012

There's a huge difference between code you've written in your spare time, and code that exists in production.

Code that exists in production is often buggy and unwieldy, and doesn't necessarily make a lot of sense. Because when you have a product that makes money, your priorities also change.

You need to become more defensive about your maneuvers, and you have to have a real reason to justify changing code.

To commit to doing redundancy well, you need a lot of resources, and you need to have a justify diverting resources that could otherwise be used to build a better product.

There's a common misconception that you can just throw stuff at the cloud (AWS, Heroku, etc), and things will just stay up. In practice, between cacheing, database server backups, heavy writes, and crazy growth, there's a lot to deal with. It's not nearly a solved or a simple problem.

So people are probably down voting you because your opinion seems naive to them. I've personally migrated a top 80,000 global eCommerce operation, and everything broke in a million different places, and we spent 2 weeks afterwards getting things working properly again.

There's a big difference between the way things are in your head, and the way things are in the production. Don't say people don't know what they're doing because they don't have a perfect system. No system is perfect.

benjaminwootton · on Oct 30, 2012

FWIW I agreed with you but downvoted because of the posting style.

The decision to avoid cross data center replication was probably a carefully considered one instead of amateurish. They probably have multiple layers of redundancy in their setup and decided that the cost and overhead of cross data centre replication was not justified.

In hindsight this doesn't seem like such a good decision, but I don't see how that makes someone an amateur or a fraud.

Supreme · on Oct 30, 2012

Sorry, should have linked to previous evidence of the fact: http://www.codinghorror.com/blog/2006/09/has-joel-spolsky-ju...

Tyrannosaurs · on Oct 30, 2012

Quoting Jeff in an attack on Joel has got to be irony yes?

Whatever this post says Jeff clearly didn't share your view of Joel being an amateur and a fraud given that he went on to start a pretty successful business with him.

Supreme · on Oct 30, 2012

Zynga has (up until now) made a lot of money and they write shit code. Hell, most of the companies I've seen have made money while writing shit code. Making money indicates that a person knows how to make money. Writing good code indicates that a person knows how to write good code. Since the two are disconnected, I stand by my statement that this man is a fraud. You simply don't start a programming blog when you created a new language just to address a small concern in the project spec. Start a blog on how to make money or run a business, sure, but don't tread into a field where people are trying to produce something of quality and try to 'teach' them something.

The argument you have just presented is irrational since it's central point rests upon the fallacy of false cause.

ding!

Another satisfied customer. Next!

Tyrannosaurs · on Oct 30, 2012

You're right, why would we want Joel to step into the field and teach us stuff when we've got you with vast knowledge and your winning manner?

After all, all we get from Joel is a decade of sharing what he's worked on and why he's done stuff a particular way, in a relatively transparent manner that allows us to maybe learn stuff but importantly to put it all in a context that allows us each to make a judgement on whether what he says is useful / interesting to us.

By contrast with you we have the rich tapestry of an anonymous account on an internet message board, a superior manner bordering on trolling and a series of aggressively worded posts.

I don't know what I was thinking. Death to Spolsky!

Just one thing. Now that you too have taken to the internet to teach the rest of us how things should be done, if someone spots any errors in what you say it's fine to term you a fraud I take it? What's good for the goose and all.

Supreme · on Oct 30, 2012

I'm not running a blog or expecting anyone to take what they read in a comment on some site on the internet seriously.

Label me however you want, it's a free internet (for now, anyway).

I still find it funny how anyone can start a blog and become famous for it. Maybe I should do the same and cash in on all that buttery goodness of advertising revenue...

dbecker · on Oct 30, 2012

Not just anyone can start a blog and become famous for it. People have to want to read your blog.

Supreme · on Oct 31, 2012

So it's essentially a marketing problem. I get the feeling that you were trying to suggest that it's worthy because it's popular. If so, argumentum ad populum is irrational. If not, apologies.

Tyrannosaurs · on Oct 30, 2012

So are you saying that your comments shouldn't be taken seriously?

dreish · on Oct 30, 2012

I would suggest the following mental exercise the next time you want to make a comment on HN:

Imagine you are at a dinner party at Paul Graham's house. He's there, obviously, along with several startup founders, aspiring founders, and a few established industry figures, including the person you are about to disagree with or criticize.

It will undoubtedly take more effort to figure out how to frame your criticism so that it doesn't make you a pariah, but the advantage will be that you will leave open the possibility of forming beneficial business and personal relationships.

In this case, I would try describing your own successes with building redundant services, and describe some of the other approaches you found while researching ones that you have built.

Supreme · on Oct 30, 2012

I've outlined how we solved this problem in another comment - http://news.ycombinator.com/item?id=4717713

Incidentally, I'm not here to form relationships - personal or otherwise. The primary goal of social media sites is to indulge in procrastination while advertisers bombard us with new products, not to improve one's life. For the latter, there are books, actions and real people made of flesh and blood. This reminds me a lot of some of the people I encountered in my gaming days - they tend to forget about the context of the platform they are using.