> Unlike most applications, we are much heavier on writes than reads.
I think what they meant to say is that unlike most web applications, they are heavier on writes than reads.
Write-heavy workloads -- aka OLTP -- are the bread and butter of relational databases. To the point that people complain they aren't as good at read-heavy workloads.
I found this article incredibly interesting, because I've only worked on low-traffic Rails apps so far. I've always wanted to learn about sharding, but haven't needed to, and didn't know where to start.
Now I'm inspired to start up a few VMs and have a play with the gems they mentioned, and try to get replication and failover happening.
It's not scaling if it's not chock full of numbers (and maybe even graphs). There's not one mention of requests per second or latency in this, nor of users/capacity per shard, bottlenecks run into, etc.
Sarcasm/trolling aside, it's not four years of scaling, because they didn't need more than one server until late last year.
Also, people, please stop trying to do "automatic failover". There is no such thing, in the total sense. You have to have humans checking to make sure everything is ok, and lots and lots of tripwires to prevent one of a multitude of catastrophic problems from occurring. Just automate pieces in totally reliable and non-harmful ways and let a human oversee the cut-over.
For the first few years, we scaled vertically rather than horizontally. We still went through many rounds of scaling our single server, changing both hardware and software configs.
We were reluctant to do automatic failover as well, since we all hear horror stories about automatic failover causing more problems than it solves. However, we changed our minds this summer for a couple of reasons. One, our traffic continues to grow and people rely on us for payments. If our database server crashes in the middle of the night, we don't want our customers to have to wait for us to do manual failover. Two, we felt a lot more comfortable with Pacemaker and PostgreSQL streaming replication failover than we used to with DRBD. We did extensive testing and tweaking, and we believe we understand the failure scenarios.
"Sarcasm/trolling aside, it's not four years of scaling because they didn't need more than one server until late last year."
There are many axes on which you can scale and many different interpretations. People have reported that postgres scales very well to many cores and many concurrent users (something ridiculous like 300K read queries per second[1]). It also scales nicely for concurrent writes and avoids locking in situations where most other databases don't.
There may be cases where postgres adapted plans (automatically, using statistics and the cost-based optimizer, of course) for them as their data grew. Responding to data growth is a kind of scaling -- a database without that capability may be stuck with a horrible plan that falls over as the data grows.
Scaling to multiple servers is, no doubt, important; but it's not the only important definition of "scaling" (if it were, then multi-core would just be a fad). Lots of technologies scale to many servers beautifully, but lose focus on other aspects of scalability or are just plain inefficient.
Yeah, DRBD is terrible for write-heavy performance. It was probably the source of many of your MySQL issues as well, but it's good that you've found a new home regardless.
To the author: for how long could you have avoided the need to shard if you had off-loaded all the read-only queries to multiple hot-standby slaves instead? Your scaling story is almost identical to ours: same pg versions in the same years, DRBD, etc. But we haven't split the database into multiple shards yet. I'm hoping to postpone it for at least a year by off-loading read-only queries to a lot of synchronous slaves.
I would imagine given their product they have a very write heavy work load and the read load, while non-zero, isn't a major concern. In typical web app environments however, you can definitely increase your "time to sharding" with read slaves and increased caching.
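The read-slave routing idea above can be sketched in a few lines. This is a hedged toy, not anything from the article: it classifies a statement by its first keyword and sends reads to a random replica, everything else to the primary. The `QueryRouter` name and the server labels are made up, and a real router must also account for replication lag (e.g. pinning transactions and read-after-write queries to the primary).

```python
import random

class QueryRouter:
    """Toy read/write splitter: reads go to a hot-standby replica,
    writes and DDL go to the primary. Illustrative only."""

    READ_PREFIXES = ("select", "show", "explain")

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)

    def target(self, sql):
        # Classify by the statement's first keyword. INSERT/UPDATE/
        # DELETE/DDL must hit the primary; reads can be spread out.
        first_word = sql.lstrip().split(None, 1)[0].lower()
        if first_word in self.READ_PREFIXES and self.replicas:
            return random.choice(self.replicas)
        return self.primary
```

In practice the tricky part isn't the routing but the semantics: a SELECT inside a write transaction, or one that must see a row the app just wrote, still has to go to the primary.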
One thing that isn't mentioned, and which we are currently facing problems with, is the change of timeline once a failover occurs. The code to push the timeline update over the stream isn't merged yet, so all slaves must read from the shared archive.
Maybe I've implemented things incorrectly, but this is a frustrating issue for the moment.
To be safe, we always rebuild the old master after we fail over to a synchronous standby server. We have capistrano tasks to automate it, so it's not too bad. The async servers merely follow the new server after the IP moves, so we don't need to do anything there.
Edit: We also don't increment the timeline on failover. Instead, we stop PostgreSQL, remove the recovery.conf, and restart.
I've got a three-level setup, with a primary/secondary pair that are big machines, and a tertiary that's enough for an emergency, but not much more. They're all running as hot spares.
I didn't feel safe blatting the new primary back to the secondary till we had the third level in there, since once you've failed over, you're at the mercy of that one machine and your latest backup is on the one that you're overwriting.
The third level one will follow the change in the recovery timeline if it's incremented in recovery.conf, and you make sure that it gets the appropriate history file.
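For reference, the relevant recovery.conf knob is `recovery_target_timeline` -- you can give it a specific timeline number or `'latest'`. A sketch of a standby's recovery.conf (the host and archive path are placeholders):

```
standby_mode = 'on'
primary_conninfo = 'host=10.0.0.2 port=5432 user=replicator'
restore_command = 'cp /mnt/archive/%f %p'   # must also serve the .history files
recovery_target_timeline = 'latest'         # follow the promoted master's new timeline
```

The standby can only follow the switch if the timeline history file from the promotion is reachable through `restore_command`, which is why the shared archive matters here.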
Ah I see. We use a keepalived pair to shoot the old node in the head and promote a synchronous slave to a master in case of failure.
There is a patch going into 9.3, I believe, which allows you to handle timeline shifts through walsender instead of the archive. We're using gluster for the shared archive at the moment, and it will be nice to reduce reliance upon it.
If you are lucky enough to be able to use Postgres 9.2, consider running pg_stat_statements all the time. It is one of my favorite pieces of work: it canonicalizes queries and can tell you a number of useful statistics about them.
One of the advantages is that it can help you identify queries that are individually very short but very frequent (and could use a cache, or whatnot), and would fly below the radar of most log aggregations, because most people turn off logging for very short statements. See also: identifying n+1 query pathologies.
It really blows away log analysis and pg_fouine for all but the most heavyweight tasks, and its approach is a lot more sound than log analysis tools because it uses the semantic representation -- not the syntactic one -- of the query submitted.
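To make the short-but-frequent point concrete, here's a crude sketch of the idea in Python. pg_stat_statements does this properly on the parse tree; the regex-based `canonicalize` below is a deliberately naive stand-in, and the threshold is arbitrary.

```python
import re
from collections import Counter

def canonicalize(sql):
    """Naive query canonicalizer: strip out literals so that repeated
    statements differing only in parameters group together."""
    sql = re.sub(r"'[^']*'", "?", sql)   # string literals -> placeholder
    sql = re.sub(r"\b\d+\b", "?", sql)   # numeric literals -> placeholder
    return re.sub(r"\s+", " ", sql).strip().lower()

def frequent_queries(statements, threshold=100):
    """Surface canonical forms seen at least `threshold` times --
    classic n+1 suspects that per-statement logging would miss."""
    counts = Counter(canonicalize(q) for q in statements)
    return [(q, n) for q, n in counts.most_common() if n >= threshold]
```

An ORM firing `SELECT * FROM users WHERE id = 1`, `... id = 2`, and so on collapses to one canonical form with a large count, which is exactly the shape of an n+1 pathology.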
It allows you to get statistics per top-level function, which can reveal abnormalities otherwise impossible to detect, such as a function which is usually very fast but very slow for a small number of calls relative to the total. Consider fa()->fx() and fb()->fx(): if fa()->fx() is always fast and you have a million such calls, but fb()->fx() is very slow and you only have a few such calls, then fx() will look like a very fast function without any performance problems if you only look at pg_catalog.pg_stat_user_functions. call_graph, however, will reveal fx() being slow when called by the top-level function fb().
This is only relevant for systems always accessing the database through Stored Procedures though.
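The fa()/fb() scenario can be reproduced with a toy aggregation. This is a hedged Python sketch, not the call_graph extension itself: the tuples stand in for recorded (parent, child, seconds) call timings, and the numbers are invented for illustration.

```python
from collections import defaultdict

def per_parent_stats(calls, child="fx"):
    """Aggregate a child function's runtime by the top-level function
    that invoked it: parent -> (call_count, mean_seconds)."""
    totals = defaultdict(lambda: [0, 0.0])  # parent -> [count, total_time]
    for parent, fn, secs in calls:
        if fn == child:
            totals[parent][0] += 1
            totals[parent][1] += secs
    return {p: (n, t / n) for p, (n, t) in totals.items()}
```

With a thousand fast fa()->fx() calls and three slow fb()->fx() calls, the overall mean makes fx() look harmless, while the per-parent breakdown shows fb()'s path is orders of magnitude slower.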
One of the best metrics is the log of slow queries. We set our threshold at 250 milliseconds, and we investigate queries that take longer. Sometimes, we'll see INSERT or COMMIT statements show up, which is usually an indication that our write performance isn't what it needs to be. We generate graphs from these logs (using graphite) and track them over time. We also use Munin (including the PostgreSQL plugins) to monitor disk IO and various PostgreSQL stats.
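A 250 ms threshold like the one above maps directly onto PostgreSQL's `log_min_duration_statement` output. As a hedged illustration, here's a minimal Python filter over such log lines; the exact log format varies with your `log_line_prefix`, so the regex here assumes a simplified line shape.

```python
import re

# Modeled on PostgreSQL's duration logging, e.g.
# "LOG:  duration: 312.45 ms  statement: COMMIT"
DURATION_RE = re.compile(r"duration: ([\d.]+) ms\s+statement: (.*)")

def slow_statements(lines, threshold_ms=250.0):
    """Return (ms, statement) pairs at or over the threshold, slowest first."""
    hits = []
    for line in lines:
        m = DURATION_RE.search(line)
        if m:
            ms = float(m.group(1))
            if ms >= threshold_ms:
                hits.append((ms, m.group(2).strip()))
    return sorted(hits, reverse=True)
```

Slow COMMITs surfacing in such a filter are the write-pressure signal the comment describes: the statement itself is trivial, so the time is going into WAL flush or I/O contention.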
PostgreSQL is a pretty mature, reliable piece of kit. Audit logs in a financial sense are generally a matter of extreme importance and are not performance-sensitive on the read side.
Conservatism is the watchword when $$$ are involved.
We have a few different types of auditing information. At that point in the timeline, it was all in PostgreSQL. Today, we keep only the critical auditing information in the main database.