Table of contents

Executive summary
Getting started with VCEM profile failover
    Key components
    Profile failover example
    Initiating profile failover
    Profile failover internal operations
    Capabilities and limitations of profile failover
    Recommendations and best practices
Important notes concerning hardware compatibility
Installation and configuration checklist
    All installations
    Pre-failover installation and configuration checklist
Configuring and managing spare bays and servers
Initiating failover manually
Initiating Failover using HP SIM Automatic Event Handling
    Concepts
    Choosing HP Systems Insight Manager Events for failover
    Using HP Systems Insight Manager action on events with failover
    Configuring HP SIM Automatic Event Handling script to initiate failover
    Using customized scripts to initiate failover
Using a remote system to initiate failover
    Remote monitoring and VCEM profile failover
    Initiate Failover using enclosure and bay
    Limitations of using a remote system to monitor systems
Summary
Executive summary

Virtual Connect Enterprise Manager (VCEM) profile failover provides a way to quickly move a Virtual Connect (VC) profile from one system to another. Profile failover can be triggered using the VCEM GUI or via the VCEM failover command line. The failover command line may also be used in conjunction with Systems Insight Manager event monitoring to automatically trigger a failover if a particular event is received for a specific system. VCEM profile failover is not disaster recovery but rather a fast-recovery tool that can help you minimize unplanned downtime.
The basic steps of VCEM failover are:
1. Spares are designated via the VCEM GUI.
2. VCEM profile failover is invoked via the CLI or the GUI.
3. VCEM profile failover moves the VC profile from the source system to a designated spare system that is the same blade model type. The following occurs:
   a) The source system is powered down.
   b) An appropriate spare is chosen from the designated spares within the same Virtual Connect Domain Group as the source system.
   c) The spare system is assigned the VC profile.
   d) The spare system is powered up.
The outcome is that the failed system is powered off and the OS and application stack is now running on the spare blade.
VCEM failover can be invoked by:
- CLI. The CLI may be invoked using:
  o The failed system's hostname.
  o The failed system's IP address.
  o The failed system's enclosure and bay number.
- GUI. The GUI has failover buttons on:
  o The Profiles tab.
  o The Bays tab.
This whitepaper describes how you can use VCEM failover. It explains the following:
- Usage: Covers preparation, initiation, and expected results.
- Limitations: Includes constraints and dependencies. The most important limitation is that the boot LUN associated with the VC profile must be portable enough to function on the original system as well as potential spare systems. Pre-testing a boot LUN across possible spares is the best way to address the risk associated with this limitation.
This failover functionality is meant to help you in your day-to-day administration tasks and, in particular, with the fast recovery of problematic systems. It is built to be flexible and robust to adapt to different needs and environments.
Getting started with VCEM profile failover

This document provides detailed information on how to use the Virtual Connect Enterprise Manager (VCEM) profile failover feature and how to initiate failover with HP SIM Automatic Event Handling.

VCEM profile failover provides a fast method to recover a SAN-boot server that has sustained a critical hardware failure. The process of profile failover moves the failed server's SAN and IP network resources to a spare server. The spare server is then started with the failed server's boot LUN. VCEM allows failover to be initiated manually or automatically through events.

VCEM profile failover depends on the ability of an image to run on different systems. VCEM profile failover performs some basic compatibility checks, but it is important that you understand the limitations of profile mobility.

Key components

A brief description of the key components of failover and their respective roles:
- Virtual Connect (VC) virtualizes server-attached Ethernet and fibre channel networks from an individual server blade using a server profile. When a server profile is assigned or moved to an HP BladeSystem c-Class server bay, any resident server blade will present the MAC and WWN addresses contained in the profile to the external Ethernet and fibre channel networks. The profile also contains the virtual serial number and virtual UUID which the server will use.
- VCEM allows you to control failover and perform profile failover operations. VCEM is part of the Central Management System (CMS). It is installed as a standalone component or as a plug-in to HP Systems Insight Manager. VCEM aggregates multiple VC Domains into one or more Virtual Connect Domain Groups, allowing single ranges of MAC and WWN addresses to be shared across all VC Domain Groups. Further, all server blades in any VC Domain Group share the same set of Ethernet and fibre channel connections.

For more information, see the HP Virtual Connect Enterprise Manager User Guide (footnote 1).
Footnote 1: See the Introduction of the latest version of the HP Virtual Connect Enterprise Manager User Guide at http://h20000.www2.hp.com/bizsupport/TechSupport/DocumentIndex.jsp?lang=en&cc=us&taskId=101&prodClassId=10008&contentType=SupportManual&docIndexId=64255&prodTypeId=18964&prodSeriesId=3601866
Profile failover example

An example of profile failover follows. The infrastructure depicted in Figure 1 includes:
- One VC Domain Group with two HP BladeSystem enclosures.
- Virtual Connect with Ethernet and SAN interconnects.
- A production server blade, represented by the red box (left enclosure).
- A spare server blade, represented by the blue box (right enclosure).
- A server profile that contains the MAC and WWN assignments. You can assign a profile to a bay or leave it unassigned.
Figure 1: VCEM profile failover example
The following describes the profile failover operation shown in Figure 1.
- A workload (a SAN-boot system drive containing an operating system, or an application and the application's data) is running on the production server (red box).
- There is a need to move the workload to another server. For example, there is a hardware problem or server maintenance is scheduled.
- A profile failover is initiated on the production server (red box). As a result, VCEM executes the following automated actions:
  o The production server (red box) is powered off.
  o A spare server blade (blue box) is selected.
  o The production server's profile is moved from the production bay (red box) to the spare bay (blue box). After the move, the spare server has the Ethernet and SAN assignments previously held by the production server.
  o The spare server is powered up. It boots the system drive image that had been running on the production server, and the production workload is restarted on the spare server.

Initiating profile failover

VCEM provides both GUI and CLI interfaces for initiating failover. The GUI is accessible only to VCEM users. The CLI is accessible to users and serves as an API for other software, such as user-developed scripts and HP Systems Insight Manager Automatic Event Handling. To perform a failover, you must have VCEM Group Limited Operator (or higher) permission for the group where the failover will occur.

Using the GUI, the source server is always identified by its enclosure and bay location. Using the CLI, the source server can be identified by its enclosure name and bay location, by its hostname, or by any of its IP addresses. Full Systems Insight Manager server discovery must be running to identify the source server by either hostname or IP address.

There are three basic operational approaches to initiating failover:
- You initiate failover from the GUI or CLI.
- You receive an alert about a possible problem from HP SIM Automatic Event Handling. After verifying the problem, you initiate failover using the GUI or CLI.
- HP SIM Automatic Event Handling automatically initiates failover via the CLI and sends you an alert that this action has occurred.

Initiation using HP SIM Automatic Event Handling is discussed below. See Initiating Failover using HP SIM Automatic Event Handling for more information.

Profile failover internal operations

VCEM performs the same internal actions whether failover is initiated by the GUI or the CLI. This section explains what profile failover does when invoked.
The VCEM Job Manager

The Job Manager is a VCEM internal resource that executes a series of configuration actions as a unified process. Only one failover job per VC Domain Group can execute at a time. Most of the work of failover is done within the context of the failover job. The progress of a failover job can be monitored in the GUI or by using the CLI as described in Failover initiation and Actions performed by the failover job below.

Failover initiation

When failover is initiated through the CLI, the command line is validated and, if a hostname or IP address is used to identify the source server, it is translated into the server's unique enclosure-bay location. The failover job is then scheduled and its job number is returned so you can check the job status.

Actions performed by the failover job

The failover job performs the following actions:
- Evaluates VC Domain status for any abnormal conditions, such as unmanaged or unlicensed domains or a domain that has a status of incompatible.
- Locks the VC Domain Group of the source server to other operations in VCEM for the remainder of the job. This prevents possible conflicting actions by other users or processes.
- Checks the source server product name (ProLiant and Integrity products are treated differently). If the product is not identified, the job exits.
- Powers off the source server using press and hold (hard power off).
- Picks a qualified spare server from the source server's VC Domain Group. If no spare server is found, the job exits. A qualified spare server is defined as the first spare found that meets the following criteria:
  o The spare bay has no assigned profile.
  o The spare bay server is physically present.
  o The model of the spare server is the same as the model of the source server.
  o The spare server is powered off.
  Note: The server model generation is not a differentiator; for example, a BL465-c G1 and a BL465-c G5 are considered to be the same model.
- Moves the profile from the source to the spare bay. If errors are encountered during the move operation, the job exits. As part of the move operation, VCEM:
  o Removes the source server's host name and IP address information from Systems Insight Manager.
  o Removes the spare designation from the spare bay.
- Powers on the spare server, thereby booting the SAN-attached system drive image. As part of this operation, HP Systems Insight Manager discovery is scheduled to run on the spare server in ten minutes. This associates the host name and IP addresses of the spare server with HP Systems Insight Manager. To be effective, the operating system booted on the spare server must have established its network communications by the time HP Systems Insight Manager discovery runs. If your system takes longer than 10 minutes to boot due to large disk or memory configurations, you may need to rerun discovery once the system has fully completed its boot process.
- Completes the job by unlocking the resources of the VC Domain Group.

Capabilities and limitations of profile failover

An understanding of the capabilities and limitations of profile failover helps you make the best use of this feature. When profile failover is complete, the workload of the source (production) server has been restarted on a different (spare) server blade and the memory-resident software and data have been refreshed from disk.

Profile failover can correct the following types of problems:
- Server hardware malfunctions which can be fixed by replacing server hardware.
- Aberrant software or data states which can be corrected by refreshing software and data from disk.

Ensure that you are aware of the following points during failover:
- User sessions established at the time of the failover might be interrupted.
- In HP Systems Insight Manager, hostname and IP address associations are removed from the source server and reestablished with the spare about ten minutes after failover completes.
- Problems that profile failover does not address include:
  o Loss of connectivity with dependent resources residing on an attached SAN or Ethernet.
  o Data or software corruption or related loss of state integrity.
  o Operating errors and system configuration errors.

Profile failover depends on the following:
- Servers must be pre-configured to boot from SAN on power-up.
- Servers must have no dependency on a local drive.
- The source server's system drive image must be able to run on the hardware and firmware configuration of any spare server in the VC Domain Group that is the same model as the source server. This includes server models of different generations, for example, the BL465-c G1 and BL465-c G5.

Recommendations and best practices

A small difference in the hardware or firmware configuration of two servers can create an incompatibility, such that a system drive image that works on one server does not work on the other, or performs at a degraded level. Therefore, to ensure the best failover results:
- Standardize on one hardware configuration for all servers of the same model for each VC Domain Group.
- For each model, implement procedures to keep all firmware at the same revision levels across the server population. It is best to upgrade the OA, iLO, and server firmware to the latest revisions.
- Avoid placing different generations of the same model server in the same VC Domain Group. The internal composition of the hardware and firmware items across different generations of the same model is often significantly different.
- Test to ensure boot images reliably start and run on all the server configurations that could possibly be used, especially if you need to use different hardware and firmware configurations for the same server model within a VC Domain Group.
- Take a step-by-step approach to direct initiation of failover using HP Systems Insight Manager Automatic Event Handling. If you choose to implement event-initiated failover, be certain that your selected collection of failover events has a level of problem detection accuracy that is consistent with your operating procedures and service level objectives. For more information, see Initiating Failover using HP Systems Insight Manager Automatic Event Handling.
Important notes concerning hardware compatibility

VCEM profile failover is built on the concept of VC Domain Groups. VC Domain Groups ensure that a VC server profile running within them will have access to the same IP and SAN networks, regardless of which VC Domain it currently resides in. However, within any given VC Domain Group, the HP BladeSystem servers may not be interchangeable. A VC Domain Group can include servers which have different processor types and memory configurations. Servers may also be equipped with different mezzanine cards, potentially providing different LAN and SAN connectivity. Other differences between source and target servers may impact the operation of the operating system or applications after a move.

When moving an image-based Matrix OE logical server on to a blade server, HP recommends the following:
- Run the latest versions of firmware on each source and target system component, particularly system ROM, iLO, and HBA firmware.
- Include drivers in each system image for all mezzanine cards on each system. Mezzanine cards on the target server which are not on the source server can interfere with a boot process by causing the OS to prompt for the installation of different drivers. Mezzanine cards should be in the same slot for both the source and target systems.
- Standardize hardware configurations as much as possible. Subtle differences in system hardware which are not reflected in the model or version number, such as changes within network interface cards or HBAs, can interfere with a boot process by requiring different drivers or configuration changes.

For maximum portability, some datacenter architects choose to configure each server blade in the enterprise according to a single specification chosen to accommodate the demands of most application workloads. The architect can then test likely profile moves before putting the systems into production.
HP recommends testing your workload on all target hardware platforms prior to production deployment.

Installation and configuration checklist

This section covers what is installed and configured in advance of running VCEM.

All installations

All items must be installed and configured as described in the HP Virtual Connect Enterprise Manager User Guide. See chapters 2 Installing and Configuring VCEM, and 3 Managing VC Domains for specific information.

Pre-failover installation and configuration checklist

Installation and configuration requirements for profile failover depend on which methods of initiation you wish to use. If you wish to use all methods of initiating failover, configuration items 2 through 7 must be implemented.

Table 1. Profile failover configuration support and requirements by initiation method

Configuration item                                                  | GUI; CLI: VCEM failover bay | CLI: VCEM failover host; CLI: VCEM failover ip | HP SIM Automatic Event Handling
1. VCEM is installed standalone                                     | Supported                   | Supported                                      | Not supported
2. VCEM is installed as a part of HP Insight Control                | Supported                   | Supported                                      | Required
3. Full HP Systems Insight Manager server discovery is in operation | N/A                         | Required                                       | Required
4. DNS is properly configured in the environment                    | N/A                         | Required                                       | Required
5. CLI user has HP Systems Insight Manager administrative privilege | Required (CLI only)         | Required                                       | N/A
6. Windows Administrator account is usable on the VCEM system       | N/A                         | N/A                                            | Required
7. HP Systems Insight Manager agents running on managed systems     | N/A                         | N/A                                            | Required
Configuring and managing spare bays and servers

Designating spare bays and populating them with servers is the final step before user-initiated failover is ready to use. Profile failover always selects a qualified spare from the source server's VC Domain Group. A qualified spare is a bay designated as a spare that:
- Has no assigned profile.
- Has a server blade that is physically present and powered off.
- Is the same model as the source server.

If failover finds no qualified spare, the failover operation fails. For more granular control of spare selection, you can create multiple VC Domain Groups, each having an identical VC configuration and populated with the desired combinations of servers and spares. For more information, see Recommendations and best practices above.

To designate spares, use the VCEM GUI Bays tab:
1. Set the filter for the VC Domain to which spares are to be applied.
2. Click the Show more details check box in the lower left of the screen to see the blade model and power status of installed servers. The display refreshes.
3. Click the Spare check box for each bay to be designated as a spare. A bay with a profile cannot be designated a spare. A bay with a non-server blade, such as a storage or expansion blade, cannot be designated as a spare. HP recommends that a server be present in the bay when the bay is designated a spare, but this is not mandatory.
   Note: Removing the check from the Spare check box removes the bay's designation as a spare. Assigning or moving a server profile to a bay also removes its spare designation.
4. Click Apply Spares.

Note: Clicking twice on the Spare column heading brings the spare bays to the top of the list.
Figure 2: Designating spares
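The qualified-spare criteria described above can be read as a simple first-match filter. The following Python sketch illustrates that reading only; it is not VCEM code. The Bay record, the function names, and the way a generation suffix such as "G1" is stripped from a model string are assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bay:
    """Hypothetical record describing one enclosure bay (not a VCEM type)."""
    enclosure: str
    number: int
    is_spare: bool               # bay carries the spare designation
    profile: Optional[str]       # assigned profile name, or None
    server_model: Optional[str]  # e.g. "BL465-c G5"; None if the bay is empty
    powered_on: bool = False


def base_model(model: str) -> str:
    """Drop a trailing generation suffix (" G1", " G5"), which the
    whitepaper says is not a differentiator. The suffix format is assumed."""
    return re.sub(r"\s+G\d+$", "", model.strip(), flags=re.IGNORECASE)


def pick_spare(source: Bay, group_bays: List[Bay]) -> Optional[Bay]:
    """Return the first bay in the VC Domain Group meeting every criterion."""
    if source.server_model is None:
        return None
    wanted = base_model(source.server_model)
    for bay in group_bays:
        if (bay.is_spare
                and bay.profile is None              # no assigned profile
                and bay.server_model is not None     # server physically present
                and not bay.powered_on               # server powered off
                and base_model(bay.server_model) == wanted):
            return bay
    return None  # no qualified spare: the failover operation fails


if __name__ == "__main__":
    production = Bay("Enc1", 3, False, "prod_profile", "BL465-c G1", True)
    spares = [Bay("Enc2", 1, True, None, "BL460-c G1"),
              Bay("Enc2", 2, True, None, "BL465-c G5")]
    print(pick_spare(production, spares))
```

Because the first match wins, the order in which bays are examined decides which spare is used; the whitepaper does not state that order, so the sketch simply follows list order.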
Initiating failover manually Profile failover may be initiated from the Server Profiles tab. Select a profile and click Failover in the bottom right hand corner. You must have at least VCEM Group Limited Operator permissions to be shown the Failover button.
Figure 3: Press Failover
Profile failover may also be initiated from the Bays tab. Check a specific bay with an assigned profile and then select Failover.
Figure 4: Pressing Failover from the Bays tab
VCEM profile failover may also be initiated using the CLI. Use one of the following command forms:
- Using the failed system's host name:
      vcem failover host hostname
- Using the failed system's IP address:
      vcem failover ip ip_address
- Using the failed system's enclosure and bay number:
      vcem failover bay enclosure_name:bay_number

Once the command has been executed, a job ID number is displayed. This number can be used as input to another VCEM CLI command to monitor the status of the failover. The following command displays job details:
      vcem list details job job_id_number

Initiating Failover using HP SIM Automatic Event Handling

This section describes how to configure and use HP SIM Automatic Event Handling to automatically initiate failover and to notify users.

Concepts

To initiate failover, configure the following items for Automatic Event Handling:
- Select the events that you want to use to trigger failover. HP recommends a set of events for you to consider; see The set of failover events installed with VCEM for more information. You may use these events or select your own. The selected events are automatically placed into an HP Systems Insight Manager event collection.
- Create an action-on-event task in HP Systems Insight Manager. In creating this task, the failover event collection is applied to a collection of managed systems. One or more actions are configured to execute whenever any event in the collection occurs. These actions include e-mail notification and execution of a custom tool.

VCEM provides two custom tools to initiate failover:
  o VCEM Profile Failover by Hostname
  o Profile Failover by IP address

Each custom tool invokes the VCEM failover CLI command with the corresponding argument. One of these custom tools is selected and configured with action-on-event. The procedure for configuring Automatic Event Handling is described in detail below.

Figure 5 depicts how automatic failover initiation works once configured.
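A user-developed script can assemble the CLI forms shown earlier in this section before handing them to the operating system. The following Python sketch builds the argument lists for vcem failover host, vcem failover ip, vcem failover bay, and vcem list details job exactly as they are written above. The function names, the validation rules, and the assumption that the executable is named vcem and is on the search path are illustrative and are not part of VCEM.

```python
import ipaddress
from typing import List


def failover_argv(kind: str, target: str) -> List[str]:
    """Build the argument list for one of the three failover forms.

    kind is "host", "ip", or "bay". The checks below are assumptions
    added for the example; VCEM performs its own command validation.
    """
    if kind == "host":
        if not target or any(ch.isspace() for ch in target):
            raise ValueError("hostname must be a single non-empty word")
    elif kind == "ip":
        ipaddress.ip_address(target)  # raises ValueError if malformed
    elif kind == "bay":
        enclosure, sep, bay = target.rpartition(":")
        if not sep or not enclosure or not bay.isdigit():
            raise ValueError("expected enclosure_name:bay_number")
    else:
        raise ValueError("kind must be 'host', 'ip', or 'bay'")
    return ["vcem", "failover", kind, target]


def job_details_argv(job_id: int) -> List[str]:
    """Build the argument list that displays the details of a failover job."""
    return ["vcem", "list", "details", "job", str(job_id)]


if __name__ == "__main__":
    # "Enclosure1" and job number 42 are made-up values for illustration.
    print(" ".join(failover_argv("bay", "Enclosure1:3")))
    print(" ".join(job_details_argv(42)))
```

Returning a list rather than a single string lets the caller pass the result to a process launcher without shell quoting concerns. How the job ID is read back from the command output is not specified in this document, so the sketch leaves that step to the caller.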
Figure 5: Automatic initiation of profile failover using HP SIM Automatic Event Handling
In Figure 5 above:
- HP Systems Insight Manager events originate from the managed systems and are posted to HP Systems Insight Manager via Ethernet.
- HP Systems Insight Manager processes the events and acts on the selected failover events. On the first occurrence of any failover event, HP Systems Insight Manager runs the custom tool.
  Note: It is possible to modify the first-occurrence behavior through user-provided scripting, as described below. For more information, see Configuring HP Systems Insight Manager Action on Events script to initiate failover.
- The custom tool invokes the VCEM failover command and passes either the host name or IP address of the system that posted the failover event. The host name or IP address is mapped to its enclosure and bay location. A failover job is started for that enclosure and bay.
- The failover job runs as described above. Communicating via Ethernet with the VC interconnects for the enclosures containing the source and spare bays, the job moves the server profile from the source to the spare bay and then powers on the spare server.

Choosing HP Systems Insight Manager Events for failover

You must select the HP Systems Insight Manager events you wish to use to initiate failover. HP has recommended a collection of events for your consideration. These recommendations are listed after several practical topics on using events to automatically initiate failover. See How HP selected the set of installed failover events for more information.
HP Systems Insight Management monitoring

HP Insight Management (IM) agents monitor the health of ProLiant systems. These agents require a host operating system, such as Windows or Linux. IM agents monitor server hardware, largely at the component level, and send events (SNMP traps or WBEM events; see footnote 2) to HP Systems Insight Manager to report changes in the server's health status. For this communication, HP Systems Insight Manager requires a working IP network connection to the server.

In addition to receiving events from servers, HP Systems Insight Manager polls each server for its health status every few minutes. This server health reflects a combination of all the subsystems monitored by the IM agent. For each server it monitors, HP Systems Insight Manager displays an icon indicating its health status, as well as reflecting the category of the most severe event received.

HP Systems Insight Manager groups events by category, such as server, storage, NIC and so forth. HP Systems Insight Manager also assigns a severity level to each event. The component-level health information reported does not consider the importance of the individual components to the overall ability of the system to meet its service level objectives. System administrators must review the events to determine appropriate actions.

Determining a critical server hardware failure

When a server hardware component fails:
- the server may fail, thereby ceasing to deliver services; or
- the server may continue to operate and meet its service level objectives.

For purposes of failover, when the failure of a hardware component materially impacts a server's ability to meet its service level objective, the component is critical and its failure becomes a critical server hardware failure. Failover should provide effective remediation. On the other hand, if the server continues to operate acceptably, then the component is not critical and immediate failover is most often unnecessary.
For example, failure of a redundant power supply does not indicate a critical failure, since the server continues to operate and, with HP BladeSystem, the power supply can be replaced without impacting the server's operation. Since HP Systems Insight Manager does not know which components might be redundant or unused, it rates all component failures as critical events, leaving you to determine the most effective remediation.

When a component is operating in a degraded state that threatens the server or threatens the integrity of its retained data, there is cause to fail over the server. Examples of this degraded state include certain CPU and memory error conditions. HP Systems Insight Manager rates conditions that indicate impending failures as major events; however, for failover, these events can also be considered critical. Also, the server configuration and workload can further qualify what a critical component is for any individual server.
Footnote 2: See the latest version of the HP Virtual Connect Enterprise Manager User Guide at http://h20000.www2.hp.com/bizsupport/TechSupport/DocumentIndex.jsp?lang=en&cc=us&taskId=101&prodClassId=10008&contentType=SupportManual&docIndexId=64255&prodTypeId=18964&prodSeriesId=3601866. Also see the HP Systems Insight Manager document, Part Number 347870-003: The Microsoft Windows Event ID and SNMP Traps Reference Guide, http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00293064/c00293064.pdf.
For example, if a local array controller fails but the server is SAN-boot configured and the workload does not access local drives, the array controller is not a critical component. When components are configured redundantly, a partner component can assume a failed partner's load.

Reporting critical events from a server

When a non-redundant component, such as a single CPU, fails instantly, there is no opportunity for the HP Systems Insight Manager agents on that server to report the failure. There are many examples of component failures that are not usually reported by HP Systems Insight Manager agents. Reporting component status also requires that the server's operating system is running and that network communications are working between the system and HP Systems Insight Manager.

A server can fail to deliver services because its hardware or its operating system (hang or crash) failed critically. But if the root cause of the failure is unknown, you cannot be certain that replacing the server will remedy the problem. For example, if the boot image has become unusable, replacing the server will not remedy the problem. In other cases, replacement may not be necessary at all, for example, where a reboot would restore functionality. In any case, a system experiencing this type of issue is not able to send an event to HP Systems Insight Manager.

The HP Systems Insight Manager System Unreachable event

System Unreachable is an optional but commonly used system status change event. When it is configured, HP Systems Insight Manager raises the System Unreachable event whenever a system does not respond to a ping issued by the HP Systems Insight Manager Hardware Status Polling task. The System Unreachable event can be configured to trigger failover of the system. The drawback is that System Unreachable is not a root cause event. It occurs whether the root cause is a critical hardware failure, workload software failure, or failure of the intervening communications network.
(In the case of a network failure, many servers could be triggered to perform failover.)

Failover events and service level ratings

The ideal goal for failover events is 100% accuracy, so that failover is triggered for:
- All situations in which a critical server hardware failure occurs.
- Zero conditions in which the server remains capable of meeting its service level objectives.

If event detection overlooks a critical server hardware failure, then services will not be restored until the failure is discovered by other means, such as a call to the help desk. If the need to fail over is falsely detected, services are interrupted during the failover and system restart, adversely impacting the service level and perhaps making the service unavailable during a critical period.

Before relying on event-initiated failover, ensure that the accuracy of the selected failure detection events will result in acceptable service level ratings, given your local installation, configurations, workloads, service level objectives, and operations policies.
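One way to make the accuracy discussion above concrete is to count outcomes over a review period and compute two ratios: the share of critical failures that triggered failover, and the share of failovers that were unnecessary. The Python sketch below is only an illustration of that arithmetic. The function name and the counts in the example are hypothetical, and neither VCEM nor HP Systems Insight Manager is stated to report these figures.

```python
from typing import Tuple


def failover_event_accuracy(correct_triggers: int,
                            missed_failures: int,
                            false_triggers: int) -> Tuple[float, float]:
    """Return (detection_rate, false_trigger_share) as fractions of 1.

    correct_triggers: critical hardware failures that triggered failover
    missed_failures:  critical hardware failures with no failover
    false_triggers:   failovers of servers still meeting their objectives
    """
    failures = correct_triggers + missed_failures
    triggers = correct_triggers + false_triggers
    # With nothing to detect or nothing triggered, report the ideal value.
    detection_rate = correct_triggers / failures if failures else 1.0
    false_trigger_share = false_triggers / triggers if triggers else 0.0
    return detection_rate, false_trigger_share


if __name__ == "__main__":
    # Hypothetical counts gathered from a pilot period.
    detected, false_share = failover_event_accuracy(18, 2, 5)
    print(f"critical failures detected: {detected:.0%}")
    print(f"failovers that were unnecessary: {false_share:.0%}")
```

The ideal described in the text corresponds to a detection rate of 1.0 and a false trigger share of 0.0; how far from those values is acceptable depends on local service level objectives.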
How HP selected the set of installed failover events

The collection of recommended events includes those events HP believes will be consistently useful to customers. You may customize this collection or add additional collections for your local installation. HP used the following factors and rationale to select the recommended events:
• Exclude events that do not indicate a server hardware failure.
• Exclude events related to components that are usually configured redundantly. (For example, the HP BladeSystem c7000 Enclosure accommodates redundant power supplies and fans. Enclosure events for power and fan failures are reported to HP Systems Insight Manager and immediate action should be taken to hot-fix failed components. In most data centers timely replacement is completed without impacting service delivery.)
• Exclude events related to system components that are not needed to deliver services. It is assumed that system configurations do not depend upon local storage for failover. Therefore, related events will not impact service delivery capabilities.
• Exclude events that are not likely to be received by HP Systems Insight Manager because the reporting system is not healthy enough to transmit them. These events include HBA failures and most CPU and NIC failures.

The set of failover events installed with VCEM

The following recommended events are pre-configured in the VCEM Failover Events collection. All of these events can be characterized as pre-failure events, meaning that a failure has not yet occurred but should be expected at any time.
Table 2: List of recommended, consistently useful failover events

Event number  Event type                                                  Event category
1001          (SNMP) CPU error threshold passed                           ProLiant System and Environmental Events
2005          (SNMP) Excessive Correctable Memory Errors                  ProLiant System and Environmental Events
6015          (SNMP) Correctable Memory Error Occurred                    ProLiant System and Environmental Events
6029          (SNMP) Corr Mem Errors Require a Replacement Memory Module  ProLiant System and Environmental Events
6056          Corrected Memory Errors Replace Memory Module               ProLiant System and Environmental Events
For more information, see Configuring HP SIM Automatic Event Handling script to initiate failover.
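If you process events in your own tooling before deciding whether to request a failover, the recommended set in Table 2 can be expressed as a simple lookup. This is a minimal sketch: the event numbers and descriptions are copied from Table 2, while the function name and the `extra_events` parameter are hypothetical and not part of VCEM or HP Systems Insight Manager.

```python
# Event numbers and descriptions copied from Table 2 of this document.
RECOMMENDED_FAILOVER_EVENTS = {
    1001: "CPU error threshold passed",
    2005: "Excessive Correctable Memory Errors",
    6015: "Correctable Memory Error Occurred",
    6029: "Corr Mem Errors Require a Replacement Memory Module",
    6056: "Corrected Memory Errors Replace Memory Module",
}


def is_failover_trigger(event_number: int,
                        extra_events: frozenset = frozenset()) -> bool:
    """Return True if the event belongs to the recommended set or to a
    locally defined extension of it (extra_events is a hypothetical hook)."""
    return (event_number in RECOMMENDED_FAILOVER_EVENTS
            or event_number in extra_events)


if __name__ == "__main__":
    for number in (1001, 6015, 9999):
        print(number, is_failover_trigger(number))
```

Keeping local additions in a separate set mirrors the advice above to customize the collection without losing track of which events HP recommended.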
Using HP Systems Insight Manager action on events with failover

HP Systems Insight Manager can monitor SNMP traps and WBEM events. These events can be used to notify you or to perform more complex operations, such as triggering executables on the Central Management System or on remote systems. This section describes how you can take advantage of this functionality to trigger VCEM profile failover. VCEM provides ease-of-use enhancements that help administrators use this functionality: links to wizards, custom-tool-enabled CLIs, and collections of events that you can customize. All of these combine to create automated failover in a VCEM environment.

In summary, there are three major ways to use HP SIM Automatic Event Handling with the VCEM failover feature:
• Event notifies the administrator but does not initiate failover.
• Event initiates failover but does not notify the administrator.
• Event both notifies the administrator and initiates failover.

Configuring HP SIM Automatic Event Handling script to initiate failover

This section explains how to create a sample HP SIM Automatic Event Handling script to invoke a VCEM CLI. While this is an example, the steps are the same regardless of which of the three Event Handling approaches listed above you select. The VCEM home page contains a link under Failover to HP SIM Automatic Event Handling Manage Tasks. This is a convenient link to the HP SIM Automatic Event Handling script.
Figure 6: Using failover with HP SIM Automatic Event Handling
The link displays the following page. From this GUI you can configure Automatic Event Handling. To create a new action on event:
1. Click New to initiate the wizard.
Figure 7: Creating a new action
2. Provide a meaningful name for the task. The second step allows you to select the events which will trigger the task. This means that if these events are received for any of the monitored systems, the task will be triggered. The example shown in figure 7 will use the VCEM provided collection of events entitled Server Profile Failover Trigger Events. However, you can create a collection of events or choose individual events to use for failover initiation.
Figure 8: Selecting events
3. Use View Definition to display what is contained in an event collection. This collection contains CPU and Memory error SNMP traps.
Figure 9: View definition
4. Choose the second radio button labeled use event attributes that I will specify to select individual events. The wizard will advance to a new page that allows you to choose events by severity, event category, or even the event type. Selecting the event type displays all the available SNMP traps which can be used for failover purposes. A common category is ProLiant System and Environmental Events which contains many of the system specific warnings and failures. The following example uses the Server Profile Failover Trigger Events collection.
Figure 10: Selecting individual events
5. Select systems or collections of systems that this particular task applies to. The following example uses the second radio button in order to designate a specific system.
Figure 11: Selecting events
6. Click Next to display the system attributes you can use to select specific systems. The following example uses the system name, which in this case is the IP address 170.50.0.69.
Figure 12: Using system attributes
7. Check any actions you want to initiate. A number of actions are available. When one of the previously selected events occurs for the selected systems, all of the actions checked here will be initiated. You can choose to send a page or email if you want to be notified of the event occurrence. These actions can be combined with a custom tool, or they may be used on their own as monitoring tools.
Figure 13: Selecting actions
The following example shows the two custom tools installed with VCEM. These custom tools execute the VCEM CLI using the parameters available when a selected event is received on the Systems Insight Manager CMS. When an event is received, the originating system is indicated and can be used to execute custom tools. This example uses the VCEM Profile Failover by IP Address custom tool. Choose either IP Address or Hostname and click Next.
Figure 14: VCEM profile failover by IP address
8. Enable or disable Automatic Event Handling during certain time periods. For example, you may want to enable VCEM profile failover during weekends, when you are not in the office.
Figure 15: Enabling actions during certain time periods
9. Review the Automatic Event Handling task summary. The automatic event handler will be invoked whenever one of the trigger events occurs for the specified system. Click Finish to return to the Manage Tasks screen.
Figure 16: Reviewing the summary
Using customized scripts to initiate failover

You may choose to create customized scripts or batch files to initiate failover. These scripts can be logically constructed to perform tasks before and after a VCEM profile failover. The basic construction of a custom script might be:
• Pre-failover commands. Some examples include:
  o Evacuating virtual machines from an ESX host
  o Gracefully shutting down an application
  o Writing volatile data to disk
• VCEM failover command. Use one of the following:
  o vcem failover host hostname
  o vcem failover ip ip_address
  o vcem failover bay enclosure_name:bay_number
• Post-failover commands. Some examples include:
  o Importing virtual machines to an ESX host
  o Creating a trouble ticket for source system maintenance
  o Sending email to administrators notifying them that a failover has occurred
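The three-part structure above can be sketched as a small Python wrapper. This is an illustration under stated assumptions, not an HP-supplied script: the `vcem failover host|ip|bay` forms are copied from this document as printed and should be checked against the CLI installed on your CMS, and the function names and the pre-failover and post-failover commands are hypothetical placeholders.

```python
import subprocess
from typing import Callable, List, Sequence


def build_failover_command(kind: str, target: str) -> List[str]:
    """Build the VCEM CLI call in the form shown in this document.

    kind is "host", "ip", or "bay"; target is the host name, IP address,
    or enclosure_name:bay_number.
    """
    if kind not in ("host", "ip", "bay"):
        raise ValueError(f"unsupported failover target type: {kind}")
    return ["vcem", "failover", kind, target]


def run_failover(kind: str,
                 target: str,
                 pre: Sequence[Sequence[str]] = (),
                 post: Sequence[Sequence[str]] = (),
                 run: Callable = subprocess.run) -> None:
    """Run pre-failover commands, the failover itself, then post commands.

    check=True stops the sequence at the first command that fails, so the
    failover is not attempted if a pre-failover step did not succeed.
    """
    for command in pre:
        run(list(command), check=True)
    run(build_failover_command(kind, target), check=True)
    for command in post:
        run(list(command), check=True)


if __name__ == "__main__":
    # Placeholder commands; replace with your own site-specific scripts.
    print(build_failover_command("bay", "Databases:8"))
```

Passing the runner in as a parameter lets you dry-run the sequence, for example by supplying a function that only logs each command, before pointing it at a production system.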
These custom scripts can be used to create a Systems Insight Manager Custom Task and combined with the Automatic Event Handling described earlier, so that they can be initiated in response to system health events. See the HP Systems Insight Manager User Guide for more detailed information concerning custom tasks.

Using a remote system to initiate failover

Remote monitoring and VCEM profile failover

If you are using a remote system to monitor servers, VCEM profile failover is still possible. An example could be where one system is being used to monitor server health and another system is running the VCEM implementation. Monitoring systems remotely implies that Systems Insight Manager's discovery is not set to run automatically at periodic intervals on the VCEM system. This means that VCEM will not have access to reliable information about servers, such as host names or IP addresses. In this situation you can still use the VCEM CLI, specifically the enclosure:bay nomenclature.

Initiate Failover using enclosure and bay

The enclosure:bay CLI command can be applied if another system is used to monitor and inventory systems information. You will need some way to access the CLI running on the VCEM system; SSH would be a logical choice. Once the remote system has logged into the VCEM system, it can invoke the VCEM CLI with the enclosure and bay of the defective server. The following example uses an enclosure named Databases with a defective server in bay 8:

vcem failover bay Databases:8

This would initiate a VCEM profile failover on the profile that is assigned to bay 8.
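A remote monitoring system could assemble the SSH invocation along the following lines. This is a sketch under assumptions: the host name, the user name, and the function name are hypothetical, it presumes an SSH server is available on the VCEM system and that the remote shell accepts POSIX-style quoting, and the `vcem failover bay` form is taken from this document as printed.

```python
import shlex
from typing import List


def build_remote_failover(vcem_host: str, user: str,
                          enclosure: str, bay: int) -> List[str]:
    """Return the argument list for an ssh call that runs the VCEM CLI
    on the VCEM system for the given enclosure and bay."""
    if bay < 1:
        raise ValueError("bay numbers start at 1")
    remote_command = ["vcem", "failover", "bay", f"{enclosure}:{bay}"]
    # shlex.join quotes each argument so the remote shell sees it unchanged.
    return ["ssh", f"{user}@{vcem_host}", shlex.join(remote_command)]


if __name__ == "__main__":
    # Hypothetical host and account names, for illustration only.
    argv = build_remote_failover("vcem-cms.example.com", "admin",
                                 "Databases", 8)
    print(argv)
```

The returned list can be passed to `subprocess.run(argv, check=True)` once the host name, account, and key-based authentication have been verified for your environment.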
Limitations of using a remote system to monitor systems:

Accurate data is crucial to a successful execution of VCEM failover from a remote system. The system that performs monitoring and data collection needs to be able to:
• Identify that a problem has occurred.
• Invoke the VCEM CLI remotely.
• Provide accurate data concerning the enclosure name and the appropriate bay number of the affected server.

If the remote system has difficulties with the above tasks, HP recommends that Systems Insight Manager be used for system monitoring, as VCEM is designed to closely integrate with Systems Insight Manager monitoring and data collection capabilities.
Summary

VCEM profile failover is a capable tool to help you maintain system uptime and reliability. It can fail over VC profiles to designated spares, either manually or automatically. Furthermore, it can be incorporated into customized scripts so that you can build even more powerful solutions using VCEM as a trusted component. The key limitation of the failover functionality is that the associated VC profile must be able to operate on different blades. HP highly recommends that you test this portability prior to relying on VCEM profile failover. If the OS image associated with the profile is portable, VCEM profile failover is an excellent addition to your server management toolbox.

For more information, see the HP VCEM website, http://www.hp.com/GO/VCEM

To help us improve our documents, please provide feedback at www.hp.com/solutions/feedback
Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein. Trademark acknowledgments, if needed. 460924-003, February 2012