Our current architecture works extremely well for the type of business our company currently conducts (a number of bespoke solutions and some non-SaaS products). Unfortunately, for our new product ventures, this architecture doesn’t really fit, so we need to come up with an entirely new “Infrastructure V2”.
For me, the best way to do this is to look at where we want to be 5 years down the line, assuming one of our products takes off. Obviously, we’re not going to start with all of this on day one, but it should stop us making choices now that would prevent us doing any of this further down the line.
Right now, we have “warm” standby databases with manual failover. Any successful product is going to need more than this. Specifically, I’d be looking for:
- Hot standby to a database in the same data centre. This would allow for automatic failover in the case of primary database failure and, with the standby in the same data centre, should be doable without incurring too much overhead from log shipping, log apply, etc.
- Cold standby to a database in a remote data centre. This would remove any single point of failure in terms of the data centre we are running in. Given this is going to involve log shipping over a WAN, I wouldn’t expect this to be “hot” and certainly wouldn’t look to implement automatic failover.
Whatever database technology we pick therefore has to support hot and cold failover. Other mandatory requirements:
- Hot backups with throttling so we can back up without impacting production too much
- Instrumentation so we can see any bottlenecks (i.e. a Statspack/AWR equivalent)
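To make the "backups with throttling" requirement concrete, here is a minimal Python sketch of the general idea: copy data in chunks and pause whenever the copy gets ahead of a bandwidth cap. The function name `throttled_copy` and its parameters are made up for this illustration; a real database would provide its own backup tool, and this only shows the rate-limiting principle.

```python
import time


def throttled_copy(src, dst, max_bytes_per_sec, chunk_size=64 * 1024,
                   clock=time.monotonic, sleep=time.sleep):
    """Copy src to dst, pausing so the average rate stays under the cap.

    `clock` and `sleep` are injectable (an assumption of this sketch)
    so the behaviour can be checked without really waiting.
    """
    start = clock()
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        copied += len(chunk)
        # Earliest moment at which `copied` bytes may have been written
        # without exceeding the cap.
        earliest = start + copied / max_bytes_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)
    return copied
```

With a cap of 100 bytes per second, copying 1000 bytes takes roughly ten seconds however fast the underlying disks are, which is the trade-off we want: a slower backup in exchange for leaving I/O headroom for production traffic.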
To allow for simple scalability and high availability, we really want our services to be as decoupled as possible. As an example, long term we could look at:
- Several load balancers, receiving traffic and distributing the load evenly between
- Several “app servers” which run the web servers and application software needed to generate the responses. Load balancers should be able to detect that a machine is down and route to another machine instead (so single machine failure doesn’t cause unavailability) and monitoring should be able to provision another machine automatically if load grows. These app servers connect to
- Several database machines. Long term, we may need “sharding” if any single product becomes popular enough, but I would avoid it for as long as possible, as it adds significant complexity. Instead, to start off with, I’d assume a single database machine, replicated to a hot standby with automatic failover.
- This configuration is then set up in a second data centre / availability zone, ready to be brought up in the case of data centre failure. Databases will need to ship logs or use a block replication technology to keep the remote database in sync and consistent (although the remote copy will probably lag a little behind the primary rather than being current right up to the point of failure).
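As a rough illustration of the load balancer behaviour described above (spread requests evenly, and route around a machine that is down), here is a minimal Python sketch. The class name `RoundRobinBalancer`, the server names and the `is_healthy` callback are all hypothetical; a real load balancer would run its own health checks, and this only shows the routing logic.

```python
import itertools


class RoundRobinBalancer:
    """Hand out app servers in turn, skipping any that fail the health check."""

    def __init__(self, servers, is_healthy):
        self.servers = list(servers)
        self.is_healthy = is_healthy  # callable: server name -> bool
        self._cycle = itertools.cycle(self.servers)

    def pick(self):
        # Try each server at most once per request, so a fully failed
        # pool raises an error instead of looping forever.
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if self.is_healthy(server):
                return server
        raise RuntimeError("no healthy app servers available")


if __name__ == "__main__":
    down = {"app2"}
    lb = RoundRobinBalancer(["app1", "app2", "app3"], lambda s: s not in down)
    print([lb.pick() for _ in range(4)])
```

The point of the sketch is that a single machine failure only changes which servers receive traffic; callers never see the failure unless every app server is down at once.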
All this is well and good, but it makes the technology stack a lot more complex so we need to check this redundancy works as expected. This means we need to look into tools like
- Netflix’s Chaos Monkey – the best way to make sure your infrastructure copes with failures is to cause failures yourself on a regular basis.
- Auto Scaling Groups – Amazon’s solution for automatically provisioning new servers based on demand
- Scriptable or push-button server deployment – at the moment, we can get away with this being a document explaining how to set up a server, as we only do it once every couple of years ( http://xkcd.com/1205/ ), but if machines need to be provisioned quickly, we need an automated build system.
- Built-in collection of our own “application specific” metrics, to help us judge whether the system is actually healthy. Knowing the CPU is at 10% is all well and good, but if the problem is locking at the application level, your users might still be waiting 30 seconds per page.
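To show what an “application specific” metric might look like in practice, here is a minimal Python sketch that records how long a named operation takes, independently of CPU or other machine-level figures. The class `AppMetrics` and the metric name `render_page` are invented for this example; in a real system the timings would be shipped to whatever monitoring tool we end up choosing rather than held in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class AppMetrics:
    """Collect wall-clock timings for named application operations."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so the sketch can be tested
        self._timings = defaultdict(list)

    @contextmanager
    def timed(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the operation raised.
            self._timings[name].append(self._clock() - start)

    def count(self, name):
        return len(self._timings.get(name, []))

    def slowest(self, name):
        # None when nothing has been recorded under this name.
        return max(self._timings.get(name, []), default=None)


if __name__ == "__main__":
    metrics = AppMetrics()
    with metrics.timed("render_page"):
        time.sleep(0.05)  # stand-in for real page-rendering work
    print(metrics.count("render_page"), metrics.slowest("render_page"))
```

A page that spends 30 seconds waiting on an application-level lock would show up here as a 30-second `render_page` timing, even while the CPU graph looks perfectly healthy.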
In the next post I’ll cover the sort of things we might want to look at for the application layer (frameworks, architectures, etc) and then hopefully use these posts as a way to store thoughts and analysis as we review various potential technologies.