I was recently assigned to review a customer’s availability SLA (Service Level Agreement) against their current infrastructure, and I noticed that a number of their systems relied upon single points of failure. After going through their infrastructure with a fine-tooth comb, from WAN links to servers, I found that the worst-protected area of their IT estate was their vital LOB (Line of Business) applications. Without these applications, any customer would struggle to operate properly, and this would affect their ability to grow as a business.
The issue with LOB applications is that they’re often installed by third parties without the involvement of the IT department or their MSP’s (Managed Service Provider’s) pre-sales and design consultants. As a result, they often end up on non-resilient infrastructure, installed with the sole aim of getting the application up and running as fast as possible.
Nowadays most LOB applications rely on a database of some form, whether that be MS SQL, MySQL, PostgreSQL or one of the many other flavours of database. The main issue here was that each LOB application either shared a single SQL Server instance or had its own standalone SQL Server dedicated to it. This is a major risk, because if one of these standalone SQL Servers goes down for any reason, the entire application goes offline until the fault is resolved and the SQL instance is back online.
Acora have recently been working on an infrastructure transformation project, which involved building an entirely new MS SQL environment alongside multiple new Hyper-V 2012 R2 clusters, managed by SCVMM (System Center Virtual Machine Manager) 2012 R2, spanning multiple data centres all over the globe.
The customer had standalone MS SQL servers with daily full backups alongside transaction log backups every 15 minutes. At Acora, we see this fairly often when we inherit a client’s infrastructure rather than building a greenfield site from the ground up. However, it does not provide any production resiliency for the customer’s environment in the event of an issue with one of their many standalone SQL servers.
Bringing the SQL databases back online, along with their LOB applications, would require either resolving whatever issue the current SQL server had and bringing it back online, or restoring the backups of the affected databases to another SQL server and then reconfiguring all affected LOB applications to point at the new SQL instance.
As you can guess, this is not a 5-minute task! In reality, recovery may not work at all if your backups haven’t been checked recently: they may not have been running, or they may have been running but be corrupt. The possibilities are endless!
On a recent deployment for one of our clients, this painful, time-consuming and nerve-racking process has been avoided quite easily, and the ‘recovery’ process is all automated. I say ‘recovery’ but I actually mean resiliency, and I will explain how we did this below.
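To give a rough feel for what the restore route involves, here is a minimal Python sketch based on the schedule described above (one daily full backup plus a transaction log backup every 15 minutes). The function name and the simplifications are ours: it assumes no differential backups, and that whatever was written after the last log backup cannot be recovered from the failed server.

```python
def restore_chain(minutes_since_full, log_interval_minutes=15):
    """Estimate the restore work after losing a standalone SQL server.

    Assumes one full backup followed by log backups at a fixed interval,
    with no differential backups (a simplification for illustration).
    """
    # Complete log backups taken since the last full backup; each one has
    # to be restored, in order, on top of the full backup.
    log_backups = minutes_since_full // log_interval_minutes
    # Changes made after the most recent log backup only exist on the
    # failed server, so they are at risk of being lost.
    at_risk = minutes_since_full % log_interval_minutes
    return {"log_backups_to_replay": log_backups,
            "minutes_of_data_at_risk": at_risk}


# A failure 23 hours and 50 minutes after the nightly full backup.
print(restore_chain(23 * 60 + 50))
```

In this example the full backup has to be followed by 95 log restores, and the last 5 minutes of changes are exposed, all before any LOB application has been repointed.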
How to create production resiliency in your LOB applications
We have deployed a 4-node SQL Server 2012 cluster on a 6-node Hyper-V 2012 R2 cluster. In the 4-node cluster we have installed 3 SQL instances, which means any of the 4 nodes can own and run any of the SQL instances at any time (obviously an instance can only be running on one node at a time, so this is more of an active/passive resiliency method). We have also installed 2 SQL Server 2012 nodes at a second data centre, which is connected to the production data centre over a layer 2 stretched circuit and also connected into the customer’s MPLS cloud, so all remote sites can access both data centres at any time if required.
This allowed us to unlock a new feature made available in SQL 2012: SQL AAGs (AlwaysOn Availability Groups). AAGs are effectively an evolution of database mirroring, which has been around since SQL Server 2005, but with the pain points and limitations addressed, making it much more flexible.
The 2 nodes in the other data centre each have their own standalone SQL instance on them, and are also members of the 4-node cluster in the production data centre. We then created 3 AAGs, one per production SQL instance, each of which is set up to partner with one of the two SQL instances in the other data centre. After this, we configured each DB to be part of its respective AAG. Once the initial synchronisation of each database to the partner side has completed, the DBs constantly replicate their transactions, keeping the databases up to date and available in multiple locations.
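As an illustration of what one of these AAGs looks like when defined, the Python sketch below generates a T-SQL CREATE AVAILABILITY GROUP statement for a two-replica group. Every name in it (server and instance names, the database, the domain) is hypothetical, as are the endpoint port 5022 and the synchronous commit mode, which we have not stated above. The failover mode is fixed to MANUAL because a replica hosted on a failover cluster instance cannot use automatic availability group failover.

```python
def create_ag_sql(ag_name, databases, replicas, endpoint_port=5022):
    """Build a CREATE AVAILABILITY GROUP statement.

    replicas is a list of (server_instance, endpoint_host, availability_mode)
    tuples. All values passed in here are illustrative assumptions.
    """
    db_list = ", ".join(f"[{db}]" for db in databases)
    clauses = []
    for server_instance, endpoint_host, availability_mode in replicas:
        clauses.append(
            f"    N'{server_instance}' WITH (\n"
            f"        ENDPOINT_URL = N'TCP://{endpoint_host}:{endpoint_port}',\n"
            f"        AVAILABILITY_MODE = {availability_mode},\n"
            # Clustered SQL instances only support manual AAG failover.
            "        FAILOVER_MODE = MANUAL)"
        )
    return (
        f"CREATE AVAILABILITY GROUP [{ag_name}]\n"
        f"FOR DATABASE {db_list}\n"
        "REPLICA ON\n"
        + ",\n".join(clauses)
        + ";"
    )


# Hypothetical pairing: one production instance and its partner in the
# other data centre.
sql = create_ag_sql(
    "AAG_INST1",
    ["LobAppDb"],
    [
        (r"SQLPROD1\INST1", "sqlprod1.example.local", "SYNCHRONOUS_COMMIT"),
        (r"DRNODE1\DRINST1", "drnode1.example.local", "SYNCHRONOUS_COMMIT"),
    ],
)
print(sql)
```

This only covers the definition on the primary side. The partner instance still has to join the group and receive an initial copy of each database before replication starts, which is the initial synchronisation step described above.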
Please see the diagram below for an overview of the server setup:
Once the AAGs are configured, we point all LOB applications at the SQL AAG name to access their DBs, and that’s it: everything is configured.
So, in the event of a server failure in the production site (left-hand side of the diagram), because we are using Windows Failover Clustering for the SQL instances, the instance will just start right back up on another node in the production site. Each node is sized to take the load of all instances if required.
In the event of a data centre failure, whether that be a communications issue, a power issue or something else, the nodes in the other data centre can bring their copies of the databases online in just a few clicks. They then take ownership of the AAG objects in the cluster, so that all DNS requests for the AAG name go to the respective node in the other data centre.
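From the application side, this usually comes down to a connection string change. The sketch below builds an ODBC connection string in Python, assuming the AAG name is an availability group listener called aag-inst1 (a made-up name), the default port 1433, Windows authentication and Microsoft’s ODBC Driver 17 for SQL Server. MultiSubnetFailover=Yes is the setting Microsoft’s driver documentation recommends when connecting to a listener.

```python
def listener_connection_string(listener, database, port=1433):
    """Build an ODBC connection string that targets the AAG name
    instead of an individual SQL server."""
    parts = {
        "Driver": "{ODBC Driver 17 for SQL Server}",
        # The application only ever knows the AAG name, never a node name.
        "Server": f"tcp:{listener},{port}",
        "Database": database,
        "Trusted_Connection": "yes",
        # Lets the driver reconnect more quickly after a failover.
        "MultiSubnetFailover": "Yes",
    }
    return ";".join(f"{key}={value}" for key, value in parts.items()) + ";"


# Hypothetical listener and database names.
print(listener_connection_string("aag-inst1", "LobAppDb"))
```

Because the string never mentions a physical server, nothing in the application needs to change when the databases move between nodes or data centres.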
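The two failure scenarios can be summarised in a small Python model. This is a toy sketch rather than how Windows Failover Clustering actually chooses an owner: the node and instance names are invented, the pairing of production instances with partner nodes is an assumption, and the "first surviving node" rule stands in for the cluster’s real placement logic.

```python
PROD_NODES = ["PROD-N1", "PROD-N2", "PROD-N3", "PROD-N4"]

# Which production node currently runs each clustered SQL instance.
owners = {"INST1": "PROD-N1", "INST2": "PROD-N2", "INST3": "PROD-N3"}

# Which node in the other data centre holds each instance's AAG partner.
dr_partner = {"INST1": "DR-N1", "INST2": "DR-N2", "INST3": "DR-N1"}


def after_node_failure(owners, failed_node, nodes=PROD_NODES):
    """Server failure: affected instances restart on a surviving node."""
    survivors = [node for node in nodes if node != failed_node]
    if not survivors:
        raise RuntimeError("no surviving node in the production site")
    return {
        instance: survivors[0] if node == failed_node else node
        for instance, node in owners.items()
    }


def serving_node(instance, owners, dr_partner, prod_site_up=True):
    """Where requests for an instance's AAG name end up."""
    # Data centre failure: the partner copy is brought online instead.
    return owners[instance] if prod_site_up else dr_partner[instance]


owners = after_node_failure(owners, "PROD-N2")
print(owners["INST2"])                                                # PROD-N1
print(serving_node("INST3", owners, dr_partner, prod_site_up=False))  # DR-N1
```

In both cases the applications keep using the same AAG name; only the node behind that name changes.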
Please see the diagram below for a high-level overview of an example AAG configuration:
As you can see, this provides a lot more production resiliency to the client across different types of failure and, more importantly, improves the speed with which their LOB applications can be brought back online. This keeps the client working and able to function as a business, whilst IT worry about why it’s failed over.
I hope this has provided an insight into the design and engineering thoughts of those at Acora and, as always, if you have any questions please comment below.