05/03/2013
High Availability and Disaster Recovery at ServiceU: A SQL Server 2008 Technical Case Study
SQL Server 2008
SQL Server Technical Article
Writer: David P. Smith (ServiceU Corporation)
Contributors: Ron Talmage (Solid Quality Mentors); Sanjay Mishra, Prem Mehra
Technical Reviewers: James Podgorski, Mike Weiner, Peter Carlin
Published: August 2009
Applies to: SQL Server 2008
Summary: ServiceU Corporation is a leading provider of event management software. An essential part of our IT mission is maintaining high availability and disaster recovery capability. This technical case study shows how we use Windows failover clustering and SQL Server 2008 database mirroring to eliminate single points of failure in our data centers and enable fast recovery from a possible disaster at our primary data center. These strategies and solutions will be of interest to database administrators, senior IT managers, project leads, and architects.
Introduction
ServiceU Corporation, based in Memphis, Tennessee, is a leading provider of online and on-demand event management software. Our software services are used by churches, schools, universities, theaters, and businesses to manage events such as concerts and conferences as well as online payments. We have customers in all 50 states of the United States and in 15 countries worldwide.

Our software services are built and deployed using the Microsoft Application Platform, including the Microsoft .NET connection software, the Microsoft SQL Server 2008 database software, and the Windows Server 2008 operating system. The Microsoft Application Platform helps us provide a seamless user experience and maximum availability of our applications to users. The applications use both the Software as a Service (SaaS) model and the Microsoft Software + Services architecture. From a security standpoint, we maintain Payment Card Industry (PCI) Level 1 compliance to protect credit card holder and Automated Clearing House (ACH) information. (Details of our PCI Compliance are not covered in this case study.)

Achieving maximum availability and near immediate recovery from a disaster is essential for maintaining our revenue stream. We have worked hard to eliminate all single points of failure in our architecture, and we have developed procedures for patching servers, upgrading software, and implementing application changes that preserve high availability. Based on these efforts, we have achieved 99.99 percent uptime, including both planned and unplanned downtime.

This case study examines the decisions that we made and the procedures we employed to maintain maximum availability with minimal downtime. This information will be of interest to senior IT managers, project leads, architects, and database administrators (DBAs).
Figure 1: A logical view of the ServiceU application architecture showing the application tier layers

Note the following about our architecture:
Our customers can access our application directly through their browsers, their own Web servers, and from their own e-commerce servers.
All customer activity is processed through our Web farm that holds the middle-tier layer.
The end-user application is built with Microsoft technologies through a series of layers, all of which eventually go through the Data Access Layer to contact the application databases.
The data layer consists of SQL Server 2008 databases.

In order to maintain Level 1 PCI Compliance, rigorous security measures are enforced to protect user cardholder data. In addition, we maintain service level agreements (SLAs) with customers that specify required levels of performance and availability of the application.
Availability Goals
Our revenue stream is based on customer activity. Consequently, it is vitally important that our application maintain maximum uptime and availability. We keep the following general goals in mind for our availability solutions:
Ensure that all PCI Compliance security measures are applied throughout the network: If a standby data center is used for disaster recovery, it must also be PCI compliant.
Eliminate all single points of failure: from the Internet presence to the data center, including network, Web and database servers, and data storage.

To help achieve our uptime goals and meet desired service level agreements (SLAs), we created specific guidelines for allowable data loss and service downtime. These objectives were defined by recovery point objectives (RPOs) and recovery time objectives (RTOs) as discussed in the following list:
Unplanned downtime:
    Loss of a database server:
        RPO = 0; that is, no data loss
        RTO = 60 seconds maximum
    Loss of the primary data center, or the entire database storage unit in the primary data center:
        RPO = 3 minutes maximum; lost data may be recovered if the primary data center can be made available
        RTO = 15 minutes total, including evaluation of the issue; 5 minutes maximum for making the necessary changes to bring the standby data center online
Planned downtime:
    RPO = 0 (no data loss)
    RTO = 60 seconds maximum; some database changes may require a longer downtime than 60 seconds; in those cases every effort is made to minimize the service interruption
High Availability
To implement high availability within the data center, we decided to use Windows failover clustering with storage area network (SAN) database storage:
Windows failover clustering provides database server redundancy within a data center.
Each failover cluster has fully redundant SAN storage for data protection within a data center.
We use three nodes in each cluster to preserve high availability during patches and upgrades.

Figure 2 shows our high availability architecture.
http://msdn.microsoft.com/en-us/library/ee355221(v=SQL.100).aspx
Figure 2: ServiceU uses a three-node Windows failover cluster with one clustered SQL Server instance

The next two sections describe the redundant server and storage strategies illustrated in Figure 2.
only one vote. This means that the entire cluster will be offline. We require that even if only one node of the cluster is available, the cluster must remain online. As a result, we chose to configure our three-node clusters using the No Majority: Disk Only quorum mode. In this mode, a cluster will still run with only one node available. This is equivalent to the quorum disk model of failover clustering on previous versions of Windows Server.

To protect the quorum disk, we place it on the SAN with its own logical unit number (LUN). The database server connects to a SAN with fully redundant hardware by using multiple redundant paths. On the SAN, the quorum disk's LUN volume is mirrored using RAID. Because we have protected the quorum disk, we ignore the warning Windows Server 2008 gives in the Failover Cluster Management utility, stating that the Quorum Disk Only option may be a single point of failure.

We made the following decisions when building our clusters:
On each three-node cluster, two nodes are designated as preferred owners and have identical memory and CPU configuration.
The third node, used primarily as a backup during patches and upgrades, has less memory. We implement a startup stored procedure to set the SQL Server memory based on detection of the active node.
All resources are set to fail over if a restart is unsuccessful.
Failover between cluster nodes is automatic, but the cluster is configured to prevent failback to the preferred node. We fail back to the preferred node manually when convenient.
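The startup stored procedure mentioned in the list above could look roughly like the following. This is a minimal sketch, not the procedure ServiceU runs: the procedure name, the node name SQLNODE3, and the memory values are all hypothetical. It relies on SERVERPROPERTY('ComputerNamePhysicalNetBIOS') to report the cluster node that currently hosts the instance.

```sql
USE master;
GO
-- Hypothetical procedure name; startup procedures must live in master
-- and cannot take parameters.
CREATE PROCEDURE dbo.usp_SetMemoryForActiveNode
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @node sysname, @maxMemMB int;

    -- Name of the physical node the clustered instance is running on.
    SET @node = CAST(SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS sysname);

    -- Hypothetical values: SQLNODE3 stands for the smaller third node.
    SET @maxMemMB = CASE WHEN @node = N'SQLNODE3' THEN 12288 ELSE 28672 END;

    EXEC sys.sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sys.sp_configure 'max server memory (MB)', @maxMemMB;
    RECONFIGURE;
END;
GO
-- Mark the procedure to run each time the instance starts,
-- which includes every start after a failover to another node.
EXEC sys.sp_procoption @ProcName = N'usp_SetMemoryForActiveNode',
                       @OptionName = 'startup',
                       @OptionValue = 'on';
GO
```

Because a failover restarts the SQL Server service on the new node, a startup procedure is a natural place for this check; the memory setting then follows the instance without manual intervention.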
We have converted most Web servers to virtual servers using the Hyper-V technology. We have found that for configuring virtual servers, performance is better when the virtual guest machine VHD files reside on a physical disk volume that is separate from those used for the host operating system. This observation has led us to host our Web server VHDs on a local mirrored RAID disk volume that is separate from the operating system disk volume.

In our Web farm, several servers can fail or be removed with little or no impact. As a result, upgrades can be applied to one Web server at a time:
1. The Web server is removed from the Web farm.
2. Code is applied.
3. Testing is performed on the Web server.
4. The server is placed back in the Web farm.
Disaster Recovery
To protect against the potential loss of a primary data center, we located a second standby data center in a different geographical location. The standby data center serves as the disaster recovery (DR) site should a natural disaster or other disruptive event result in the primary data center becoming nonoperational. The standby data center is used only in the case of emergencies, when the primary data center is unavailable. When the primary data center becomes available again, we reestablish it as the primary data center and the standby data center takes on its role of protecting the primary data center. Data from the primary data center is sent to the standby data center in near real-time, and the standby data center is a functional duplicate of the primary data center's hardware, software, and infrastructure. In the event of the loss of the primary data center, the standby data center can be brought online almost immediately, with minimal disruption of customer activity. The following sections detail our disaster recovery strategies.
Figure 3: Each data center contains redundant hardware and multiple connectivity paths

We use redundant hardware and multiple connectivity paths to eliminate single points of failure within each data center. Figure 3 shows some, but not all, of the efforts that we have made to remove single points of failure:
Active/Passive firewalls exist between each network.
An NLB cluster balances incoming traffic across the Web farm.
Multiple Web servers host the application code.
At least two Domain Name System (DNS) servers exist at each data center. Each Web server uses a DNS alias for the server name when connecting to the SQL Server 2008 instance.
The SQL Server 2008 instances are clustered using Windows Server 2008 failover clustering. If one of the nodes fails, the SQL Server 2008 instance will fail over to another node in the cluster.
Database data and the Windows failover cluster quorum disk resource are stored on a SAN. The failover cluster has duplicate paths to the SAN, and the SAN LUNs are provisioned using RAID striping and mirroring.

In addition:
The database servers and SAN have redundant power supplies with uninterruptible power supplies (UPSs).
Each data center has its own local power generator to protect against temporary loss of power from the electrical grid.

For more information about the data centers, see Appendix A, "Data Center Infrastructure Details".
Figure 4: ServiceU implements disaster recovery between data centers by using asynchronous database mirroring

Database mirroring ensures a near real-time copy of all mission critical database data at the standby data center: The principal databases are located in the primary data center in Memphis, and the mirror databases are in the standby data center in Atlanta.

Low latency between the data centers benefits database mirroring in two major ways:
The principal server can send a large volume of transaction log records to the mirror quickly.
The time required to ship transaction logs to the mirror server is decreased when initializing database mirroring.

We chose asynchronous database mirroring because we do not want bandwidth-related delays of synchronous database mirroring to affect application performance. In asynchronous database mirroring, the principal server does not wait for acknowledgement from the mirror server before committing transactions. Therefore delays in sending transaction log records to the mirror databases will not affect the completion of user transactions.

Because we have chosen asynchronous database mirroring, in the event of the loss of the primary data center, some unsent transactions may not be present on the mirror database. If that happens, and the data can be recovered when the Memphis databases come back online, we will retrieve the unsent data from the old primary data center at Memphis. The previously unsent data can be loaded into the standby (Atlanta) data center databases from the Memphis databases without any primary key conflicts. This is possible because care was taken in assigning the keys in Atlanta during the period it assumed the production server role.

After the failover, when the Atlanta data center assumes the production role, we use a script to skip a generous range of keys that may have been used by the transactions whose log records were unsent from the primary Memphis data center at the time of the failure (for more information, see the section "Identity Column Increment" later in this paper). A gap in the key sequence will exist between the last key used at Memphis and the first new key assigned at Atlanta, but this is acceptable to the application. If the databases at the primary data center cannot be recovered, the unsent data will be lost.

There are nearly 30 databases on the main database instance, some of them interrelated, and all are mirrored to the standby data center. We base our database mirroring configuration on extensive testing before deployment.
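For readers less familiar with the mirroring operating modes, the following sketch shows how a session is placed in asynchronous (high-performance) mode and how the mode can be verified. SalesDB is a hypothetical database name, and the statements are illustrative rather than taken from ServiceU's scripts.

```sql
-- Run on the principal server instance, from the master database.
-- SalesDB is a hypothetical database name.
ALTER DATABASE SalesDB SET PARTNER SAFETY OFF;   -- asynchronous mirroring

-- Confirm role, operating mode, and state for every mirrored database.
SELECT DB_NAME(database_id)          AS database_name,
       mirroring_role_desc,
       mirroring_safety_level_desc,  -- OFF = asynchronous, FULL = synchronous
       mirroring_state_desc
FROM sys.database_mirroring
WHERE mirroring_guid IS NOT NULL;

-- Disaster only: if the principal is lost, service can be forced on the
-- mirror server instance, accepting the loss of any unsent log records.
-- Left commented out so the script is safe to run as written.
-- ALTER DATABASE SalesDB SET PARTNER FORCE_SERVICE_ALLOW_DATA_LOSS;
```

Forcing service is what makes the unsent-transaction window described above a real exposure, which is why the key ranges are skipped after such a failover.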
Transact-SQL
IF dbo.cf_IsMirrorInstance() = 1 RETURN
The user-defined function cf_IsMirrorInstance() queries the sys.database_mirroring catalog view and returns a value of 1 when executed on the mirror instance. As a result, the SQL Server Agent jobs on the mirror instance that reference mirrored databases can remain active: they succeed but do nothing while their server remains the mirror server. (See Appendix D, "Scripts", for the cf_IsMirrorInstance() source code.)

New or changed database permissions are also scripted, and scripts are kept in a secured location. If a failover to the standby data center occurs, these scripts are run on the standby data center's SQL Server instance after the databases have been recovered.
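To illustrate the idea behind the mirror-instance check, here is one way such a function might be written. It is a sketch under assumptions and is not the Appendix D source: the function name is deliberately different, and it treats the instance as the mirror if any database on it currently holds the MIRROR role.

```sql
-- Hypothetical stand-in for cf_IsMirrorInstance(); not the Appendix D code.
CREATE FUNCTION dbo.cf_IsMirrorInstance_Sketch()
RETURNS bit
AS
BEGIN
    DECLARE @isMirror bit;
    SET @isMirror = 0;

    -- sys.database_mirroring has one row per database; mirroring_guid is
    -- NULL for databases that are not mirrored.
    IF EXISTS (SELECT 1
               FROM sys.database_mirroring
               WHERE mirroring_guid IS NOT NULL
                 AND mirroring_role_desc = N'MIRROR')
        SET @isMirror = 1;

    RETURN @isMirror;
END;
GO

-- First statement of a SQL Server Agent job step:
IF dbo.cf_IsMirrorInstance_Sketch() = 1 RETURN;
-- ... the real work of the job step follows here ...
```

With a guard like this, identical jobs can be deployed and left enabled at both data centers, so nothing has to be switched on during a failover.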
Mirroring wizard (or running a script).
8. Remove log shipping for each database.
9. Enable the hourly transaction log backup jobs on the principal server.

For more information about using database mirroring with log shipping, see the Microsoft white paper Database Mirroring and Log Shipping Working Together (http://sqlcat.com/whitepapers/archive/2008/01/21/databasemirroring-and-log-shipping-working-together.aspx).
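The underlying mechanics of preparing a mirror from backups can be sketched as follows. This is a hand-written equivalent for a single database, not the log shipping configuration itself; the database name, backup share, and endpoint addresses are hypothetical, and mirroring endpoints are assumed to exist already on both instances.

```sql
-- On the principal instance: take compressed full and log backups.
BACKUP DATABASE SalesDB
    TO DISK = N'\\backupshare\mirroring\SalesDB_full.bak'
    WITH COMPRESSION, INIT;
BACKUP LOG SalesDB
    TO DISK = N'\\backupshare\mirroring\SalesDB_log.trn'
    WITH COMPRESSION, INIT;

-- On the future mirror instance: restore without recovery so that
-- further log records can still be applied.
RESTORE DATABASE SalesDB
    FROM DISK = N'\\backupshare\mirroring\SalesDB_full.bak'
    WITH NORECOVERY;
RESTORE LOG SalesDB
    FROM DISK = N'\\backupshare\mirroring\SalesDB_log.trn'
    WITH NORECOVERY;

-- On the mirror instance first, point at the principal's endpoint.
ALTER DATABASE SalesDB SET PARTNER = 'TCP://principal.example.com:5022';

-- Then on the principal instance, point at the mirror's endpoint
-- and switch the new session to asynchronous mode.
ALTER DATABASE SalesDB SET PARTNER = 'TCP://mirror.example.com:5022';
ALTER DATABASE SalesDB SET PARTNER SAFETY OFF;
```

Log shipping automates the repeated backup, copy, and restore cycle, which keeps the restored copy close to the principal until the moment mirroring is started.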
becomes available following a disaster, we will first find and retrieve any unsent data in the primary data center's database tables, and then load that into the appropriate databases at the standby data center. Because the identity columns of the tables at the standby data center will have used values from a higher range, the unsent primary data center data can be loaded directly, keeping all identity values intact. If the primary data center's data cannot be accessed because of damage due to disaster, that unsent data will be lost.

The standby data center is not meant as a permanent substitute for the primary data center, but as a backup in case of emergencies. When the primary data center comes back online, we will perform the previous steps to reverse the roles so that the Memphis data center becomes the primary data center, and then we will reestablish the Atlanta data center in the role of standby. We will use log shipping from the standby data center to the primary data center in order to prepare the databases for database mirroring. After database mirroring is operational from the standby data center to the primary, the direction of mirroring will be reversed during a low-traffic period, and all DNS aliases adjusted so that the primary data center again assumes its original role.
Monitoring
We have implemented appropriate monitoring to send alerts quickly when potential problems are detected.
Age of oldest unsent transaction (set up at the principal server): reports the age in minutes of the oldest unsent transaction in the send queue at the principal server. We set an alert at three minutes for each mirrored database.
    We run an initial script on the principal that looks for any databases participating in database mirroring, and sets a baseline value for the "Age of oldest unsent transaction" for each database. All mirrored databases initially get the same setting.
    We run a second script, also on the principal server, that adjusts the threshold value for any databases that may need a different value. We may assign differing threshold values for databases based on varying patterns of update activity.
    We use this counter to monitor potential data loss in the event of an unplanned loss of the primary data center. The "Age of oldest unsent transaction" counter helps us ensure that each database stays within its recovery point objective (RPO) of three minutes (see "Availability Goals" earlier in this paper).
Unrestored log threshold (set up at the mirror server): helps estimate how long it would take the mirror server to roll forward the log records remaining in its redo queue. We send an alert if the redo queue exceeds a certain threshold, usually between 250 kilobytes (KB) and 500 KB. The actual value may change for each database depending upon the database's workload and behavior patterns.
    We run an initial script on the mirror that looks for any databases participating in database mirroring. It sets a baseline log threshold value for each database. Each mirrored database gets the same initial setting.
    We run a second script on the mirror server that adjusts the threshold value for any databases that may need a different value due to differing patterns of update activity.

Because we use asynchronous mirroring, we do not monitor the "Mirror commit overhead" counter.

In addition to the database mirroring monitoring counters, we also monitor the space used and available free space for log and data volumes at each server. We use Windows Management Instrumentation (WMI) Alerts to monitor lock escalation and deadlocks. To minimize lock escalation issues that occur during reporting, we are currently testing Read Committed Snapshot Isolation (RCSI).
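A baseline script of the kind described for the principal server might resemble the following sketch. It is an assumption about the approach, not ServiceU's script: it loops over every mirrored database and sets the "Oldest unsent transaction" warning threshold (alert ID 1, in minutes) through the database mirroring monitor procedure sp_dbmmonitorchangealert, and it assumes mirroring monitoring is already configured on the instance.

```sql
USE msdb;
GO
SET NOCOUNT ON;
DECLARE @db sysname;

-- Every database on this instance that participates in mirroring.
DECLARE mirrored_dbs CURSOR LOCAL FAST_FORWARD FOR
    SELECT DB_NAME(database_id)
    FROM sys.database_mirroring
    WHERE mirroring_guid IS NOT NULL;

OPEN mirrored_dbs;
FETCH NEXT FROM mirrored_dbs INTO @db;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Arguments: database, alert ID, threshold, enabled.
    -- Alert ID 1 = oldest unsent transaction; threshold of 3 minutes
    -- matches the RPO stated in "Availability Goals".
    EXEC sp_dbmmonitorchangealert @db, 1, 3, 1;

    FETCH NEXT FROM mirrored_dbs INTO @db;
END;

CLOSE mirrored_dbs;
DEALLOCATE mirrored_dbs;
```

The same loop run on the mirror server with alert ID 3 (unrestored log, in KB) and a threshold in the 250 to 500 range would cover the redo queue baseline; a second pass can then override individual databases by name.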
Suspended Mirroring
We have observed that under certain conditions, database mirroring sessions may enter a suspended state. When a database mirroring session is suspended, the principal database's transaction log records cannot be sent to the mirror. Because transaction log backups on the principal will no longer truncate the transaction logs if transaction log records cannot be sent to the mirror, the log files grow. We have established SQL Server Agent jobs that monitor and alert if any database participating in mirroring is in a state other than SYNCHRONIZING or SYNCHRONIZED (see the query in Appendix D.) After the underlying issue has been addressed, the mirroring session can easily be resumed using a script or SQL Server Management Studio.
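A check along these lines could drive such a job. The query below is a sketch and not necessarily the Appendix D query; SalesDB is a hypothetical database name.

```sql
-- Mirrored databases that are not in a healthy state.
SELECT DB_NAME(database_id) AS database_name,
       mirroring_role_desc,
       mirroring_state_desc
FROM sys.database_mirroring
WHERE mirroring_guid IS NOT NULL
  AND mirroring_state_desc NOT IN (N'SYNCHRONIZING', N'SYNCHRONIZED');

-- Databases whose log cannot be truncated because of mirroring.
SELECT name, log_reuse_wait_desc
FROM sys.databases
WHERE log_reuse_wait_desc = N'DATABASE_MIRRORING';

-- After the underlying issue is fixed, resume the session
-- (run on either partner):
-- ALTER DATABASE SalesDB SET PARTNER RESUME;
```

Pairing the state check with the log reuse check helps distinguish a session that is merely catching up from one that is actively causing log growth.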
prevented from being applied to the mirrored databases. This could result in inconsistencies between databases that are interrelated from an application perspective. The missing data still exists in the primary data center in Memphis, but it may be days or weeks before it can be applied to the standby data center in Atlanta.

Our databases have many tables that include identity columns. Because the databases are interrelated, one database may refer to identity values in another database. After a failover to the standby data center due to a disaster, unapplied log records that can no longer be sent from the principal could mean that a database may refer to an identity value that does not exist in the table of another database. We consider this extremely significant and have developed a methodology to prevent data integrity issues that could be caused by using the identity values that have not been sent or applied.

During recovery of the mirrored databases at the failed-over site, a script runs on every table in every database, reseeding the identity value to increment it by a certain number, and logging the change with the new value and the highest old value. When the primary data center comes online again, we can query the former principal database server's data (assuming it is readable) and retrieve the appropriate rows to populate the missing values, bringing the tables into consistency across all the databases on the new principal.
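The reseeding step can be sketched as follows for a single database. This is an illustration under assumptions, not ServiceU's script: the gap of 100000 is hypothetical, identity increments are assumed to be positive, and identity values are assumed to fit in bigint. The change is logged to a table variable here; a permanent log table would serve the later reconciliation better.

```sql
-- Run in each mirrored database after it is recovered at the standby site.
SET NOCOUNT ON;

DECLARE @gap bigint;
SET @gap = 100000;   -- hypothetical margin; size it from peak insert rates

DECLARE @log TABLE (table_name nvarchar(600), old_value bigint, new_value bigint);
DECLARE @tbl nvarchar(600), @old bigint, @new bigint, @sql nvarchar(max);

-- Every user table that has an identity column.
DECLARE identity_tables CURSOR LOCAL FAST_FORWARD FOR
    SELECT QUOTENAME(SCHEMA_NAME(t.schema_id)) + N'.' + QUOTENAME(t.name)
    FROM sys.tables AS t
    JOIN sys.identity_columns AS ic ON ic.object_id = t.object_id;

OPEN identity_tables;
FETCH NEXT FROM identity_tables INTO @tbl;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @old = CAST(IDENT_CURRENT(@tbl) AS bigint);  -- highest value issued so far
    SET @new = @old + @gap;

    -- Skip ahead so that keys possibly used by unsent transactions stay free.
    SET @sql = N'DBCC CHECKIDENT (''' + REPLACE(@tbl, N'''', N'''''')
             + N''', RESEED, ' + CAST(@new AS nvarchar(20))
             + N') WITH NO_INFOMSGS;';
    EXEC sys.sp_executesql @sql;

    INSERT INTO @log (table_name, old_value, new_value) VALUES (@tbl, @old, @new);

    FETCH NEXT FROM identity_tables INTO @tbl;
END;

CLOSE identity_tables;
DEALLOCATE identity_tables;

-- The logged ranges identify which key values to look for on the old principal.
SELECT table_name, old_value, new_value FROM @log ORDER BY table_name;
```

Any row found later on the former principal with a key between old_value and new_value can then be loaded into the new principal without colliding with rows created after the failover.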
Steps for Upgrading to SQL Server 2008 from SQL Server 2005
When we decided to upgrade from SQL Server 2005 to SQL Server 2008, we also decided to upgrade from Windows Server 2003 to Windows Server 2008. After extensive planning, we accomplished the upgrade with minimal downtime.

When upgrading from Windows Server 2003 to Windows Server 2008, we decided also to reformat the storage LUNs for the Windows Server 2003 failover cluster at the primary data center, and to upgrade to new database servers at the standby data center. As a result, we chose to rebuild the failover cluster at the primary data center, and build a new failover cluster at the standby data center. We built a temporary SQL Server 2008 clustered instance on spare servers and used it to keep the SQL Server databases available while the primary data center's failover cluster was rebuilt.

This section lists the steps we took to perform the upgrade, showing how we were able to preserve high availability while minimizing user downtime. In these steps, the following abbreviations will be used. Each of the following SQL Server instances is clustered:
primarySQL2005: the legacy SQL Server 2005 instance at the primary data center
standbySQL2005: the legacy SQL Server 2005 instance at the standby data center
tempSQL2008: a temporary SQL Server 2008 instance at the primary data center
primarySQL2008: the new SQL Server 2008 instance at the primary data center
standbySQL2008: the new SQL Server 2008 instance at the standby data center

The following steps illustrate the process our team used to upgrade to SQL Server 2008.
cluster. For the temporary SQL Server 2008 cluster, called tempSQL2008, only two cluster nodes were used. The instance would only be online for a few off-peak hours. The servers for tempSQL2008 were configured with Windows Server 2008, clustered, and a clustered instance of SQL Server 2008 installed. For tempSQL2008 data storage, we added Disk Array Enclosures (DAEs) with additional disk drives to the existing EMC CX-Series array. The tempSQL2008 server used a Fibre Channel path via the same equipment as the production SQL Server 2005 clustered instance, going through the same fiber optic switches. The tempSQL2008 server level settings were configured and thoroughly tested.
2. Stopped database mirroring to the standby data center.
3. Set up log shipping from primarySQL2005 to tempSQL2008. We use log shipping to help prepare databases for database mirroring. (For more information, see "Using Log Shipping to Help Set Up Database Mirroring" earlier in this paper.) We could not use backup compression to assist in setting up log shipping in this step, because backup compression is only available between SQL Server 2008 instances, and the primarySQL2005 instance was running SQL Server 2005.
4. Initialized asynchronous database mirroring from primarySQL2005 to tempSQL2008. Accomplished by converting from log shipping to database mirroring.
5. Waited for a very low traffic period before beginning the upgrade process.
6. Converted all database mirroring sessions to synchronous database mirroring. Waited for synchronization to occur.
7. Used the firewall to redirect all incoming traffic to a scheduled downtime Web site. All Web servers have the same configuration, and each hosts a Web site for the purpose of handling downtime messages. This Web site responds appropriately to Web service requests. Application downtime now starts.
8. Removed all Web servers from the Web farm except one. This Web server continued to serve the "scheduled downtime" Web site. Because this Web server was not immediately rebooted, it is temporarily called the StaleWebServer in these steps.
9. Rebooted all the remaining Web servers to remove any cached or pooled connections.
10. Simultaneously changed the DNS connection alias and reversed the database mirroring roles. Changed the DNS connection alias to redirect the application to the tempSQL2008 instance. For details about how we use DNS connection aliases, see "Data Center Infrastructure" earlier in this paper. At the same time, we ran a SQL script to manually fail over database mirroring, reversing the database mirroring roles and making tempSQL2008 the principal for all database mirroring sessions.
11. Removed database mirroring. We ran a script to remove all database mirroring sessions because tempSQL2008 could not mirror to primarySQL2005. (A SQL Server 2008 instance cannot mirror to a SQL Server 2005 instance.)
12. Tested all systems with one of the rebooted Web servers. We took one of the rebooted Web servers that was outside the Web farm (call it the TestWebServer), and used it for testing the application, which now connected to the tempSQL2008 database server. This was the final test to ensure that all application functionality was present when connecting to the new tempSQL2008 database. If the testing had failed, we could have reverted to the SQL Server 2005 instance (by issuing a RESTORE command on each of the SQL Server 2005 databases on primarySQL2005 to bring each database from a loading, nonrecovered state into a recovered state). This was effectively the Go/No-go decision point. After we made the decision to allow users back into the application and to connect to tempSQL2008, user updates to the databases would start. After that point, new data in the SQL Server 2008 database would be lost if we decided to roll back to SQL Server 2005 based on restoring from database backups. Because we made the decision to proceed, we put the remaining rebooted Web servers back into the Web farm. We performed the following two actions simultaneously and as quickly as possible:
    Placed TestWebServer into the Web farm, making it an active Web server.
    Removed StaleWebServer from the Web farm, rebooted it in order to remove any cached or pooled connections, and placed it back in the Web farm.
    All the Web servers were now active, in the Web farm, and ready to connect to tempSQL2008.
13. Redirected traffic (via the firewall) back to the application IP addresses. The Web servers now were connecting to tempSQL2008. At this point the system was back up, and users were now able to use the application. This first downtime period lasted approximately 10 minutes.
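The database-side commands behind steps 6, 10, 11, and the fallback in step 12 can be sketched for one database as follows. SalesDB is a hypothetical name, and the actual scripts looped over all mirrored databases; this is only an outline of the statements involved.

```sql
-- Step 6, on the principal (primarySQL2005): switch to synchronous mode.
ALTER DATABASE SalesDB SET PARTNER SAFETY FULL;

-- Wait until the session reports SYNCHRONIZED before going further.
SELECT DB_NAME(database_id) AS database_name, mirroring_state_desc
FROM sys.database_mirroring
WHERE mirroring_guid IS NOT NULL;

-- Step 10, on the principal, from the master database: manual failover.
-- The mirror (tempSQL2008) becomes the new principal.
ALTER DATABASE SalesDB SET PARTNER FAILOVER;

-- Step 11, on the new principal (tempSQL2008): remove the session.
ALTER DATABASE SalesDB SET PARTNER OFF;

-- Fallback before the Go/No-go decision only, on primarySQL2005:
-- bring the former principal's copy back online.
-- RESTORE DATABASE SalesDB WITH RECOVERY;
```

Manual failover requires a synchronized session, which is why the sessions were converted to synchronous mode and allowed to catch up before the downtime window opened.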
Phase 2: Redirected Application Users to the Permanent SQL Server 2008 Instance at the Primary Data Center
1. Built a new SQL Server 2008 cluster (primarySQL2008) at the primary data center.
    Reconfigured the original primarySQL2005 servers with Windows Server 2008 and SQL Server 2008, applying the appropriate drivers and critical updates. Other IT personnel continued to monitor and test tempSQL2008, currently the production instance.
    Reconfigured the primarySQL2005 server's LUNs on the SAN and reformatted them using Windows Server 2008. We reconfigured the LUNs because we changed the number of disks from the older Windows Server 2003 configuration. If reconfiguration had not been required, just a Quick Format using Windows Server 2008 to clean up the drives and maintain proper LUN disk partition alignment would have been sufficient.
    Created the new Windows Server 2008 cluster as a three-node cluster (using an integrated install), and then installed SQL Server 2008 Enterprise. We added the first SQL Server node using the SQL Server Setup program interactively, and added the other SQL Server nodes using Setup's command-line installation options. We found this faster than using Setup interactively for all nodes. We then configured the SQL Server settings and tested a variety of failover situations to make sure everything was functioning correctly.
2. Set up log shipping from tempSQL2008 to primarySQL2008. We were able to use backup compression when setting up log shipping between these two SQL Server 2008 instances, making the log shipping setup process faster than the previous setup of log shipping from primarySQL2005 to tempSQL2008.
3. Initialized asynchronous database mirroring from tempSQL2008 to primarySQL2008. Accomplished by using log shipping to initialize database mirroring. Converted the mirroring sessions to synchronous database mirroring, and waited for all mirror databases to synchronize. At this point, we were ready to move the application to the primarySQL2008 instance, but it required a second downtime period.
4. Used the firewall to redirect all incoming traffic to the scheduled downtime Web page. Users were effectively offline again at this point. The second downtime period starts.
5. Simultaneously changed the DNS connection alias and reversed the database mirroring roles. We changed the DNS connection alias at the data center DNS servers to point connections to the primarySQL2008 server. At the same time, we ran a script to reverse the database mirroring roles, making primarySQL2008 the principal and tempSQL2008 the mirror for all database mirroring sessions. We then repeated the processes for testing the application, as well as rebooting all Web servers, as outlined in step 12 of Phase 1.
6. Redirected traffic (via the firewall) back to the application IP addresses. Users could now access the application, and the Web servers were connecting to the primarySQL2008 instance. Downtime duration for this second phase was about six minutes.

At this point, the major part of the upgrade process was finished and the application was now using the desired primarySQL2008 instance. The following steps in Phase 3 did not need to occur immediately and no user downtime was required.
Phase 3: Prepared a New SQL Server 2008 Instance at the Standby Data Center and Set Up Database Mirroring to it from the Primary Data Center
1. Prepared the standby data center SQL Server 2008 instance (standbySQL2008). We left mirroring from primarySQL2008 to tempSQL2008 active temporarily, in case any issues arose with the primarySQL2008 cluster. We then replaced the standbySQL2005 cluster with new servers, installing Windows Server 2008 and SQL Server 2008, as well as upgrading to a new SAN. This was part of a planned equipment upgrade process.
2. Set up log shipping from primarySQL2008 to standbySQL2008. We again were able to use backup compression to improve the speed of the log shipping setup process.
3. Established database mirroring to the standby data center. Removed database mirroring from primarySQL2008 to tempSQL2008 instances. Removed log shipping and set up asynchronous database mirroring from primarySQL2008 to standbySQL2008.
At this point, both data centers were live with SQL Server 2008 and the upgrade process was complete. For several weeks after the upgrade, we left the databases in SQL Server compatibility mode 90. This allowed us to troubleshoot potential database issues without the additional concern of having changed to the new SQL Server 2008 compatibility level as a factor in troubleshooting. After no issues were found, we changed the compatibility level of the databases to 100.
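As an illustration of the final compatibility-level change, the following sketch first lists the current level of each user database and then raises one database to level 100. It assumes the statements are run on the principal instance, because a mirror database is not accessible for this kind of change. The <db name> placeholder follows the convention used in Appendix D and must be replaced with an actual database name.

```sql
-- Review current levels (90 = SQL Server 2005, 100 = SQL Server 2008).
SELECT name, compatibility_level
FROM sys.databases
WHERE database_id > 4;   -- skip master, tempdb, model, and msdb

-- Raise one database to the SQL Server 2008 compatibility level.
-- Replace <db name> with an actual database name.
ALTER DATABASE [<db name>] SET COMPATIBILITY_LEVEL = 100;
```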
4. Run the patch installation on the other unused node, Node 3, and when finished, reboot Node 3.
5. Move the SQL Server 2008 resource group from Node 1 to the other preferred node (Node 2). This normally takes 30-60 seconds, and is the only downtime in this process.
6. Run the patch installation on Node 1, and when finished, reboot the node.
7. Verify that the SQL Server instance has the correct version number for the patch by running SELECT @@VERSION on the SQL Server 2008 instance.
8. Repeat steps 2-7 for the principal SQL Server 2008 instance (the failover cluster at the primary data center).
For Windows Server 2008 updates (including patches, drivers, and other software updates), we use the following steps:
1. Start at the standby data center, on the mirror instance failover cluster.
2. Again, assume the cluster nodes are named Node 1, Node 2, and Node 3, and that the SQL Server 2008 instance is running on Node 1.
3. Pause an inactive node, for example, Node 2.
4. Install any updates and make any required changes.
5. Reboot the node.
6. Resume the node.
7. Repeat steps 2-6 for Node 3.
8. Move the SQL Server 2008 resource group from Node 1 to Node 2.
9. Repeat steps 2-6 for Node 1.
10. Repeat steps 2-9 on the principal instance failover cluster at the primary data center.
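The version check in the SQL Server patch steps can be made a little more specific than SELECT @@VERSION. The following sketch is one possible verification query, not part of ServiceU's documented procedure. It returns the build number and service pack level, and also shows which physical node currently hosts the clustered instance, which may be useful after each move of the resource group.

```sql
-- Build and patch level of the instance, plus the node it is running on.
SELECT
    SERVERPROPERTY('ProductVersion')              AS product_version,
    SERVERPROPERTY('ProductLevel')                AS product_level,
    SERVERPROPERTY('Edition')                     AS edition,
    SERVERPROPERTY('IsClustered')                 AS is_clustered,
    SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS active_node;

-- Nodes that belong to the failover cluster hosting this instance.
SELECT NodeName
FROM sys.dm_os_cluster_nodes;
```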
change to the databases must be coordinated with changes to the application. Generally, updates to the application can be done in a matter of seconds. In general, we have two strategies for deploying changes to Web servers. In our application, Web content is replicated using Windows Distributed File System (DFS), so the deployment strategy depends on whether there are changes to Web content. We perform the following steps using a team of people:
When there are no changes to Web content, we remove all Web servers except one from the Web farm. Then we simultaneously deploy the code on one of the removed servers and to SQL Server. After testing, we swap these two Web servers by putting the one with changed code into the farm and removing the existing one. Then we deploy the new code to all the remaining servers and put them back into the farm.
The case is more difficult if the content gets replicated. In this case we go down to a single Web server, taking all of the other Web servers out of the Web farm. We apply the changes to that Web server and simultaneously apply the SQL Server changes. At this point, the Web applications and the SQL Server databases should work together. We ensure that the new Web content has replicated and then add the other servers back into the Web farm.
When only SQL Server changes are being deployed, we determine how long the SQL Server database changes will take. If only stored procedures or views will be changed, and there are no schema changes, the SQL Server changes typically finish in a matter of seconds. In those cases, it is not necessary to redirect users to a "scheduled downtime" Web site. If the SQL Server deployment is more time-consuming due to schema changes, we direct users to a "scheduled downtime" site, as illustrated in "Steps for Upgrading to SQL Server 2008 from SQL Server 2005" previously, until the changes are successfully deployed and verified.
Index Maintenance
We rebuild and reorganize indexes in a selective and balanced manner. Rebuilding or reorganizing indexes generates large amounts of transaction log records, and those log records must then be sent to the remote mirror. With asynchronous mirroring, such a condition can cause the mirror to fall behind the principal significantly. As a result, we allow index maintenance only during low traffic times. We have a periodic Transact-SQL job to reorganize as well as rebuild indexes. Both actions cause transaction log load that must be sent to the mirror. To reduce and even out this load, the script does not pick all tables and indexes at once, but spreads the task across multiple days and at low usage times. The script uses a threshold to determine whether to rebuild or reorganize an index, depending on the use of the table (lookup tables, for example, would not require frequent rebuilds) as well as fragmentation percentages. This is a multitenant system, and all customers have data in the same tables. Different usage patterns by different customers can cause some tables to require index rebuilding or reorganizing. The script rebuilds indexes online whenever possible. Maintenance on tables that cannot be reindexed online is done only during the lowest traffic times.
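The threshold-based selection described above might look something like the following. This is only a sketch under stated assumptions: the fragmentation thresholds, the minimum page count, and the per-run limit are hypothetical values (the paper does not state the ones ServiceU uses), and the query only generates the ALTER INDEX commands for review instead of executing them. It does not take table usage into account, and it does not detect indexes that cannot be rebuilt online, such as those containing large object columns in SQL Server 2008.

```sql
-- Hypothetical tuning values; adjust to the workload.
DECLARE @reorganize_pct float, @rebuild_pct float, @max_indexes int;
SET @reorganize_pct = 10.0;  -- below this, leave the index alone
SET @rebuild_pct    = 30.0;  -- at or above this, rebuild instead of reorganize
SET @max_indexes    = 20;    -- cap per run to spread log volume across days

SELECT TOP (@max_indexes)
    N'ALTER INDEX ' + QUOTENAME(i.name)
    + N' ON ' + QUOTENAME(s.name) + N'.' + QUOTENAME(o.name)
    + CASE WHEN ps.avg_fragmentation_in_percent >= @rebuild_pct
           THEN N' REBUILD WITH (ONLINE = ON);'
           ELSE N' REORGANIZE;'
      END AS maintenance_command,
    ps.avg_fragmentation_in_percent,
    ps.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id AND i.index_id = ps.index_id
JOIN sys.objects AS o ON o.object_id = ps.object_id
JOIN sys.schemas AS s ON s.schema_id = o.schema_id
WHERE ps.index_id > 0                          -- skip heaps
  AND ps.alloc_unit_type_desc = N'IN_ROW_DATA'
  AND ps.page_count >= 1000                    -- ignore very small indexes
  AND ps.avg_fragmentation_in_percent >= @reorganize_pct
  AND i.is_disabled = 0
  AND o.is_ms_shipped = 0
ORDER BY ps.avg_fragmentation_in_percent DESC;
```

Capping the number of indexes per run is one simple way to even out the transaction log volume that must be sent to the mirror.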
Conclusion
ServiceU has successfully implemented a sophisticated high-availability and disaster-recovery solution for our applications. Database high availability within each data center is achieved by placing a clustered SQL Server 2008 instance on a three-node Windows Server 2008 failover cluster. A three-node failover cluster maintains high availability during cluster patches and upgrades. Disaster recovery is achieved using SQL Server 2008 database mirroring from a primary to a standby data center. Thoroughly tested procedures are used to maintain maximum availability with minimal downtime during both planned and unplanned downtime scenarios. The application is continuously tested and scanned for vulnerabilities at the standby data center, ensuring it is always ready to go into production should the primary data center become unavailable.
For more information, see the following documents:
Database Mirroring and Log Shipping Working Together
How to: Minimize Downtime for Mirrored Databases When Upgrading Server Instances
Using Warning Thresholds and Alerts on Mirroring Performance Metrics
Power
All power is filtered to provide reliable current. Data centers have backup generators with large fuel tanks. Both data centers have emergency contracts to assure a continued fuel supply in case of a disaster. Batteries provide additional backup in case of any delays or problems with the generators. Power-switching equipment detects not only the availability of utility company power, but also the quality of that power before it switches back from the generator. Multiple electrical circuits are used to further mitigate any power risk. Each device's power supply is connected to a different electrical circuit. Detailed power diagrams help ensure that mistakes are not made. All equipment, when available from the manufacturer, has redundant power supplies.
Air Conditioning
Multiple air conditioning units provide redundant temperature and humidity control. Capacity is oversized so that even if up to 50 percent of the units are nonfunctional, the data center still maintains acceptable temperature and humidity.
Security
Multiple badges, or badge plus code access, are required to enter the facility. Badge access and video logs are kept for a minimum of 90 days (a PCI Data Security Standard (DSS) requirement). Both data centers maintain PCI Compliance 24x7x365. In this way, if we ever have to fail over to the standby data center, we do not have to make any changes to be PCI compliant. It is a PCI Compliance requirement that a standby data center facility also be compliant before beginning to process any transactions. Key access is required to access the servers. Servers use a password protected KVM. Servers lock after 15 minutes of inactivity.
Offsite Backups
Databases and transaction logs are backed up to disk files. Those backup files are included in the daily tape backups that are transported offsite and stored in a climate-controlled vault. In the case of a disaster, those tapes are flown to the nearest major airport. They are then shipped overnight to the standby data center. While the tapes should never be needed, this is an additional layer of protection.
Satellite Phones
Key members of the company carry satellite phones at all times in the case of a disaster. This allows them to communicate with vendors, service providers, and other employees. Key service providers have the satellite phone numbers.
Firewalls
We use firewalls in an active/passive high availability configuration. If a firewall fails, the other firewall takes over with no loss of connectivity to the client. The client never realizes that there was a problem. The firewalls share state, so if one goes down or has to be rebooted, the user should not notice any packet loss or connectivity problems.
Switches
All switches have very few moving parts (this is part of the company specification). All have redundant power supplies.
Backup
We have established the following backup procedures:
Databases are backed up daily.
Database transaction logs are backed up hourly during the active part of the day.
Backups are made to disk and kept on disk for three days.
Tape backups are stored off-site and on a regular rotation.
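A daily full backup and an hourly log backup to disk could be written as shown below. This is a sketch only: the X:\SQLBackup\ folder and the file naming scheme are hypothetical, and using compression and checksums for these routine backups is an assumption (the paper mentions backup compression only for the log shipping setup). Removing disk files after three days is not shown.

```sql
-- Replace <db name> with an actual database name.
-- X:\SQLBackup\ is a hypothetical backup folder.
DECLARE @db sysname, @stamp nvarchar(15), @file nvarchar(260);
SET @db = N'<db name>';

-- Timestamp in the form yyyymmdd_hhmiss for unique file names.
SET @stamp = CONVERT(nvarchar(8), GETDATE(), 112) + N'_'
           + REPLACE(CONVERT(nvarchar(8), GETDATE(), 108), N':', N'');

-- Daily full database backup.
SET @file = N'X:\SQLBackup\' + @db + N'_full_' + @stamp + N'.bak';
BACKUP DATABASE @db TO DISK = @file WITH COMPRESSION, CHECKSUM, INIT;

-- Hourly transaction log backup (run from a separate scheduled job).
SET @file = N'X:\SQLBackup\' + @db + N'_log_' + @stamp + N'.trn';
BACKUP LOG @db TO DISK = @file WITH COMPRESSION, CHECKSUM, INIT;
```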
Appendix C. Documentation
Documentation Procedures
ServiceU has implemented documentation procedures that are crucial to both SLAs and high availability.
Policies
We have a policy that no updates are applied past noon on Thursday. This helps to prevent errors or problems occurring over the weekend, when the response time by IT personnel may be longer. Extensive code reviews and testing processes ensure accuracy of code before its deployment to the production environment.
Appendix D. Scripts
The cf_IsMirrorInstance() Function
Transact-SQL
CREATE FUNCTION [dbo].[cf_IsMirrorInstance] ()
RETURNS bit
AS
BEGIN
    -- This function determines whether a server is the mirror
    -- instance or the principal.
    -- Assumption: All databases reside at either one location
    -- or the other.
    -- Replace <db name> with an actual database name.
    DECLARE @mirroring_role_desc nvarchar(60)
    DECLARE @IsMirrorInstance bit

    -- Choose a single critical database and test to see whether
    -- it is PRINCIPAL or MIRROR.
    -- Because the databases are interrelated, all must be the same;
    -- other jobs test for this.
    SELECT @mirroring_role_desc = mirroring_role_desc
    FROM sys.database_mirroring m
    JOIN sys.databases d ON m.database_id = d.database_id
    WHERE d.name = '<db name>'

    -- Evaluate the result
    IF (@mirroring_role_desc IS NULL) OR (@mirroring_role_desc = 'PRINCIPAL')
        SET @IsMirrorInstance = 0
    ELSE
        SET @IsMirrorInstance = 1

    -- Return the result
    RETURN @IsMirrorInstance
END
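A function like cf_IsMirrorInstance() is typically called at the start of a scheduled job so that the same job definition can exist at both data centers but do its work only on the principal. The following usage sketch is an illustration, not a job taken from ServiceU's environment. It assumes the function was created in a database that is online on both instances, because a database in the mirror role cannot be queried.

```sql
-- First statements of a hypothetical SQL Server Agent job step.
IF dbo.cf_IsMirrorInstance() = 1
BEGIN
    -- This instance currently holds the mirror role: do nothing.
    PRINT 'Mirror instance detected; skipping job step.';
    RETURN;
END;

-- Work that should run only on the principal instance goes here.
PRINT 'Principal instance detected; running job step.';
```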
Appendix E. Feedback
Did this paper help you? Please give us your feedback. Tell us on a scale of 1 (poor) to 5 (excellent), how would you rate this paper and why have you given it this rating? For example: Are you rating it high due to having good examples, excellent screen shots, clear writing, or another reason? Are you rating it low due to poor examples, fuzzy screen shots, or unclear writing? This feedback will help us improve the quality of white papers we release. Send feedback.