You might have already read my earlier post, "7 social engineering tips that new system admins should know in a new team", where I discussed the social engineering points that help you sync quickly with a new team. This post continues the same topic: here I will discuss the technical aspects every system admin should know during the initial phase of a new job. These rules are the same for everyone, irrespective of the experience he/she gained in the old organization.
Before going to the actual topic, I would like to highlight one important point: when we work for an organization for a long time, we feel comfortable with the job, and most of the time we assume it is our technical skill that makes us comfortable. In fact, it is not technical skill alone; it is our historical knowledge of the environment, added to that skill, that makes us comfortable in the current job.
In simple terms, if you treat your technical skill as a 1, then every piece of information you know about the history of the environment adds a 0 (zero) next to that 1, and having more 0s next to the 1 improves your value in the job. When you join a new job, you carry only the 1 with you from the old job; the 0s you have to regain in the new one. So, during the initial stage of a new job, focus on understanding the historical information about the environment from the existing team, whenever you get a chance to discuss it.
1. Know your team scope
Team scope is very important to know immediately after you join a new job, because it gives you an idea of how to prioritize your learning for the new job.
For example, if you join a team in a large organization whose scope is to support a set of servers that run only databases and nothing else, then your priority immediately shifts to understanding how the database works on Unix and the basics of DB terminology. If, at the same time, your team does not support any DNS, NIS, or DHCP servers because those are under the control of a different team, you need not worry about those servers in your initial learning.
2. Know about Technical architecture of environment
e. What storage is in use right now, and what sort of console systems are we using to connect to the servers remotely? e.g. EMC, NetApp, Cyclades consoles, etc.
f. What storage management software is in use on which operating systems? e.g. LVM, VxVM, ZFS, etc.
3. Know the support operations and the rules around them
Ideally, any system administrator deals with three types of operations:
a. Break/fix activities (widely known as incidents)
This mainly involves fixing issues encountered in a properly working environment, e.g. a disk failure on a server, a Unix server that crashed due to overload, or a network outage caused by a bad network port.
b. Changes and service requests
Change operations mainly involve introducing a configuration, hardware, or application change into the currently running environment, either for improved stability or for improved security.
Service requests involve performing operations on specific user requests, such as creating user accounts, changing permissions, installing a new server (called server commissioning), or removing a server (called server decommissioning).
c. Auditing the server environment to identify the quality of service (QoS)
This mainly involves periodically checking all servers to identify any configuration or security vulnerabilities that compromise the stability of the server environment, and remediating such vulnerabilities by requesting configuration changes.
To perform the above three kinds of operations, every organization has internal rules that define how to act, when to act, and what to act on. These rules vary from job to job; during the initial stage of your job you should understand them and perform your duties accordingly.
Note: ITIL (Information Technology Infrastructure Library) provides guidelines for defining the above rules in a standard way in any IT organization. Nowadays, major companies are streamlining their procedures to meet these ITIL guidelines, so that the environment remains easy to manage even after the people who created it leave the organization. Learning ITIL is always beneficial to system admins (or any infrastructure support person).
4. Supporting tools/applications and your access to them
To perform the support operations discussed above, organizations need proper tools/applications that allow their employees and support staff to request and respond in an automated way, as per the procedures defined in the organization, e.g. the Remedy ticketing tool, HP Service Manager, etc.
Once you join a new team, make sure you have requested access to all the related tools in time and have tested that access.
5. Contact details of other support teams
As a system admin, a major part of your day job involves communication with other support teams: the database team, network team, application team, hardware vendors, data center support team, etc.
For successful service delivery, it is important for system administrators to have all of their contact details (phone, email, and internal chat IDs) handy. Gather this information and build a good document you can use in your job. Writing this information down and keeping it safe is very important, because minor issues often turn into major problems when we don't know whom to contact the moment we notice an issue.
6. Team documentation
Every team has some kind of documentation explaining the operations it performs, and this documentation gives you more information than any individual can share with you. Unfortunately, reading all these documents does not really help you understand what is actually going on during your initial stage in the team, but the same documents might save your life once you start working actively.
During the initial stage, just find out where the documentation is stored and get access to it. Then go through the entire documentation quickly (you don't need to remember everything you read), so that later you will know where to look when you need a specific piece of information about a specific issue.
7. Infrastructure servers
Ideally, system administrators classify their servers into two groups: first, the servers used by users (e.g. database servers and application servers), and second, the infrastructure servers used to manage the first set effectively (e.g. Jumpstart remote installation servers, and DHCP, DNS, NIS, and LDAP servers).
As I explained in point 1, you may or may not manage these infrastructure servers depending on the scope of your team, but you must know their details, because every other server in your environment depends on them.
Below are important questions you should try to answer during the initial stage of the job:
a. What name servers (DNS / NIS / LDAP) are we using, and what are the names, aliases, and IPs of those servers?
b. What remote installation (Jumpstart/Kickstart) servers are we using, and do we have access to them?
c. Is there a DHCP server in the environment, or is it managed by customized tools, e.g. QIP?
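On the host itself, question (a) can often be answered from the resolver configuration. Below is a minimal sketch; the sample file and its addresses are hypothetical, and on a real host you would read /etc/resolv.conf directly:

```shell
# List the configured DNS name servers by parsing a resolv.conf-style file.
# Sample data keeps the sketch self-contained; on a real host use /etc/resolv.conf.
cat > /tmp/resolv.sample <<'EOF'
domain example.com
nameserver 10.16.8.1
nameserver 10.16.8.2
EOF

# Print the IP of every configured name server.
awk '/^nameserver/ { print $2 }' /tmp/resolv.sample
```

On Solaris hosts, the same question for NIS and LDAP can usually be answered with ypwhich and ldapclient list, respectively.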
8. Access to basic facilities
Every Unix administrator starts his work by requesting access to a Windows product (desktop access / Outlook). Request access to your desktop PC login, VoIP phone (with international dialing if your job requires calling overseas), email account, internal chat messenger, data center access (if your job requires physical access to the DC), and smart cards / security tokens, etc.
The moment you get email access, you may have to manage the flood of messages coming to your team every day. You might have to create appropriate Outlook rules to filter out emails you don't have to respond to during the first one or two months of the new job. Later, you can slowly start reading and responding to them once you are actually ready to work on the floor.
9. Automation scripts
A system administrator cannot survive his job if he doesn't know how to automate repetitive work (using scripting). Whenever you join a new team, specifically ask for information about any automated scripts that are in place and used to perform the day-to-day job.
Most of the time, system admins write scripts to perform daily/weekly system health checks, and these may run regularly from specific servers via the cron scheduler. It is better to know about them beforehand; this will help if you later want to introduce your own scripts for the team's benefit.
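As an illustration of the kind of health check such cron jobs run, here is a minimal sketch; the 90% threshold is an arbitrary choice, and a real team script would mail its findings rather than just print them:

```shell
# Minimal disk-space health check: warn when any filesystem crosses
# the usage threshold. Schedule from cron, e.g. once daily.
THRESHOLD=90

df -P | awk -v limit="$THRESHOLD" 'NR > 1 {
    use = $5
    sub(/%/, "", use)                     # strip the trailing % sign
    if (use + 0 >= limit)
        printf "WARNING: %s is %s%% full (mounted on %s)\n", $1, use, $6
}'
```

Using df -P (POSIX output format) keeps the column positions stable across Solaris and Linux, which is why the awk fields above are reliable.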
10. Monitoring alerts
As I explained in point 8, you will receive tons of mail the moment your email ID is added to the team DL (email distribution list), and a major part of it may come from automated monitoring systems that check the health of your server environment and inform the system admin team the moment they notice an issue. If you start receiving such mails, don't just ignore them because you don't know what to do with them. Instead, note these alerts and keep raising questions with your team to learn how to respond to them.
Also keep automatic reminders in your Outlook for the important alerts that are critical and urgent in nature, so that you won't miss them.
Network connectivity checks for a server without an OS (just racked hardware, powered up)
Example :
ok> watch-net-all
/pci@7c0/pci@0/network@4,1
1000 Mbps full duplex
Link up
. is a Good Packet.
X is a Bad Packet.
# ethtool eth0
Supported link modes:
10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Half 1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes:
10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Half 1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 100Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: g
Wake-on: d
Current message level: 0x000000ff (255)
Link detected: yes
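When you have to run this check across many servers, the interesting fields can be pulled out of saved ethtool output programmatically. A small sketch follows; the sample file below stands in for real `ethtool eth0` output (real output indents the fields, which the unanchored patterns still match):

```shell
# Pull link state, speed and duplex out of saved ethtool output.
cat > /tmp/ethtool.sample <<'EOF'
Speed: 100Mb/s
Duplex: Full
Link detected: yes
EOF

awk -F': ' '
    /Speed:/         { speed  = $2 }
    /Duplex:/        { duplex = $2 }
    /Link detected:/ { link   = $2 }
    END { printf "link=%s speed=%s duplex=%s\n", link, speed, duplex }
' /tmp/ethtool.sample
```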
Ping the system from another host on the same network in one session, and watch the IP traffic on the x64 system with tcpdump in another session. A successful ping has a request and a reply. The host option is used to filter network data for the specific host; otherwise tcpdump will flood the console.
# tcpdump -i eth0 host 10.16.8.21
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
16:34:53.133085 IP 10.16.8.21 > 10.16.8.63: ICMP echo request, id 13641, seq
0, length 64
16:34:53.133250 IP 10.16.8.63 > 10.16.8.21: ICMP echo reply, id 13641, seq 0,
length 64
Alternatively, on Solaris you can capture the traffic on the interface to a file in the current directory with snoop (e.g. # snoop -o snoop.out -d e1000g0). You have to terminate the command manually with ^C, otherwise it keeps appending to the file. Later you can view the capture with the command below:
# /usr/sbin/snoop -i snoop.out -D | grep -v "drops: 0"
The -D option displays the number of packet drops. With the above command you can easily figure out whether packets on the interface are being dropped; if all of them are, there is likely a cabling/patching issue.
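The grep -v trick above (hide the zero counters so that only real drops stand out) can be sketched against saved capture summary lines; the sample lines below are illustrative, not real snoop output:

```shell
# Show only capture summary lines that report a non-zero drop count.
cat > /tmp/capture.sample <<'EOF'
interface e1000g0 received 120 packets, drops: 0
interface e1000g0 received 98 packets, drops: 12
interface e1000g0 received 110 packets, drops: 0
EOF

# grep -v excludes every "drops: 0" line, leaving only real drops.
grep -v 'drops: 0$' /tmp/capture.sample
```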
Once physical connectivity has been established, the interface can be observed for errors. The Solaris Operating System provides the kstat interface for this type of monitoring. For example, to watch the statistics of an hme interface, instance 0, every 5 seconds, one would use:
# kstat -m hme -i 0 5
and monitor statistics such as collisions and alignment errors. If these error counters are increasing on a switched network, further investigation is warranted. The most likely cause of such issues is a bad cable or incorrect switch settings: replace the cable, ensure the switch is set to auto-negotiate, and re-test.
Cable tester
A cable tester is one of the quickest and easiest ways to check a cable and its connections through patch boards to the target switch. If one is available, connect the tester to each end of the cable and verify that the cable has connectivity through all 8 pins. Normally the data center operations team will have these devices.
If you don't have one, you should take the traditional approach and verify from the server end.
http://www.gurkulindia.com/main/2014/10/ten-technical-tips-that-every-system-admin-should-know-when-joining-into-a-new-team/
http://www.gurkulindia.com/main/2012/05/network-physical-connectivity-check-for-solaris-and-linux/
This article describes how to configure link-based IPMP interfaces in Solaris 10. IPMP eliminates single-network-card failure and ensures the system is always accessible via the network. You can configure the failure detection time in the /etc/default/mpathd file; the default value is 10 seconds. This file also has an option called FAILBACK to specify the IP behavior when the primary interface recovers from a fault. in.mpathd is the daemon that handles IPMP (Internet Protocol Multi-Pathing) operations. There are two types of IPMP configuration available in Solaris 10: link-based and probe-based.
Scenario: configure IP address 192.168.2.50 on e1000g1 and e1000g2 using link-based IPMP.
Step 1:
Find out the installed NICs on the system and their status. Verify the ifconfig output as well, and make sure the NICs are up and not in use.
Arena-Node1#dladm show-dev
e1000g0    link: up    duplex: full
e1000g1    link: up    duplex: full
e1000g2    link: up    duplex: full
Arena-Node1#ifconfig -a
lo0: flags=2001000849 mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
e1000g0: flags=1000843 mtu 1500 index 2
inet 192.168.2.5 netmask ffffff00 broadcast 192.168.2.255
ether 0:c:29:ec:b3:af
Arena-Node1#
Step 2:
Add the IP address in /etc/hosts and specify the netmask value in /etc/netmasks as shown below.
Arena-Node1#cat /etc/hosts |grep 192.168.2.50
192.168.2.50    arenagroupIP
Arena-Node1#cat /etc/netmasks |grep 192.168.2.0
192.168.2.0     255.255.255.0
Arena-Node1#eeprom "local-mac-address?=true"
Step 3:
Plumb the interfaces you are going to use for the new IP address and check their status in the ifconfig output.
Arena-Node1#ifconfig e1000g1 plumb
Arena-Node1#ifconfig e1000g2 plumb
Arena-Node1#ifconfig -a
lo0: flags=2001000849 mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
e1000g0: flags=1000843 mtu 1500 index 2
inet 192.168.2.5 netmask ffffff00 broadcast 192.168.2.255
ether 0:c:29:ec:b3:af
e1000g1: flags=1000842 mtu 1500 index 3
inet 0.0.0.0 netmask 0
ether 0:c:29:ec:b3:b9
e1000g2: flags=1000842 mtu 1500 index 4
inet 0.0.0.0 netmask 0
ether 0:c:29:ec:b3:c3
Step 4:
Configure the IP on the primary interface and add both interfaces to the IPMP group with your own group name, for example:
Arena-Node1#ifconfig e1000g1 192.168.2.50 netmask 255.255.255.0 broadcast + group arenagroup-1 up
Arena-Node1#ifconfig e1000g2 group arenagroup-1 up
Step 5:
Now we have to ensure IPMP is working fine. This can be done in two ways.
Test 1: Remove the primary LAN cable and check. Here I have removed the LAN cable from e1000g1; let's see what happens.
Arena-Node1#ifconfig -a
lo0: flags=2001000849 mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
e1000g0: flags=1000843 mtu 1500 index 2
inet 192.168.2.5 netmask ffffff00 broadcast 192.168.2.255
ether 0:c:29:ec:b3:af
e1000g1: flags=19000802 mtu 0 index 3
inet 0.0.0.0 netmask 0
groupname arenagroup-1
ether 0:c:29:ec:b3:b9
e1000g2: flags=1000842 mtu 1500 index 4
inet 0.0.0.0 netmask 0
groupname arenagroup-1
ether 0:c:29:ec:b3:c3
e1000g2:1: flags=1000843 mtu 1500 index 4
inet 192.168.2.50 netmask ffffff00 broadcast 192.168.2.255
Now reconnect the cable to e1000g1 and check the link status again.
Arena-Node1#dladm show-dev
e1000g0    link: up    duplex: full
e1000g1    link: up    duplex: full
e1000g2    link: up    duplex: full
Arena-Node1#ifconfig -a
lo0: flags=2001000849 mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
e1000g0: flags=1000843 mtu 1500 index 2
inet 192.168.2.5 netmask ffffff00 broadcast 192.168.2.255
ether 0:c:29:ec:b3:af
e1000g1: flags=1000843 mtu 1500 index 3
inet 192.168.2.50 netmask ffffff00 broadcast 192.168.2.255
groupname arenagroup-1
ether 0:c:29:ec:b3:b9
e1000g2: flags=1000842 mtu 1500 index 4
inet 0.0.0.0 netmask 0
groupname arenagroup-1
ether 0:c:29:ec:b3:c3
Here the configured IP moves back to the original interface where it was running before. I had specified FAILBACK=yes; that is why the IP moves back to the original interface. In the same way, you can also specify the failure detection time for mpathd using the FAILURE_DETECTION_TIME parameter (in ms).
Arena-Node1#cat /etc/default/mpathd |grep -v "#"
FAILURE_DETECTION_TIME=10000
FAILBACK=yes
TRACK_INTERFACES_ONLY_WITH_GROUPS=yes
Arena-Node1#
Test 2: Normally, most Unix admins sit at a remote site, so you will not be able to perform the above test. In this case, you can use the if_mpadm command to disable the interface at the OS level. First I am going to disable e1000g1; let's see what happens.
Arena-Node1#if_mpadm -d e1000g1
Arena-Node1#ifconfig -a
lo0: flags=2001000849 mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
e1000g0: flags=1000843 mtu 1500 index 2
inet 192.168.2.5 netmask ffffff00 broadcast 192.168.2.255
ether 0:c:29:ec:b3:af
e1000g1: flags=89000842 mtu 0 index 3
inet 0.0.0.0 netmask 0
groupname arenagroup-1
ether 0:c:29:ec:b3:b9
e1000g2: flags=1000842 mtu 1500 index 4
inet 0.0.0.0 netmask 0
groupname arenagroup-1
ether 0:c:29:ec:b3:c3
e1000g2:1: flags=1000843 mtu 1500 index 4
inet 192.168.2.50 netmask ffffff00 broadcast 192.168.2.255
In the same way, you can manually fail the IP back from one interface to the other. In both tests we can clearly see the IP moving from e1000g1 to e1000g2 automatically, without any issues. So we have successfully configured link-based IPMP on Solaris.
These failover events are logged in /var/adm/messages, as shown below.
Jun 26 20:57:24 node1 in.mpathd[3800]: [ID 215189 daemon.error] The link has gone down on e1000g1
Jun 26 20:57:24 node1 in.mpathd[3800]: [ID 594170 daemon.error] NIC failure detected on e1000g1 of group arenagroup-1
Jun 26 20:57:24 node1 in.mpathd[3800]: [ID 832587 daemon.error] Successfully failed over from NIC e1000g1 to NIC e1000g2
Jun 26 20:57:57 node1 in.mpathd[3800]: [ID 820239 daemon.error] The link has come up on e1000g1
Jun 26 20:57:57 node1 in.mpathd[3800]: [ID 299542 daemon.error] NIC repair detected on e1000g1 of group arenagroup-1
Jun 26 20:57:57 node1 in.mpathd[3800]: [ID 620804 daemon.error] Successfully failed back to NIC e1000g1
Jun 26 21:03:59 node1 in.mpathd[3800]: [ID 832587 daemon.error] Successfully failed over from NIC e1000g1 to NIC e1000g2
Jun 26 21:04:07 node1 in.mpathd[3800]: [ID 620804 daemon.error] Successfully failed back to NIC e1000g1
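When troubleshooting, it is handy to summarize these events instead of reading them one by one. Here is a sketch run against a saved copy of the messages file; the sample heredoc reproduces log lines of the kind shown above:

```shell
# Count IPMP failover and failback events in a saved messages file.
cat > /tmp/messages.sample <<'EOF'
Jun 26 20:57:24 node1 in.mpathd[3800]: [ID 832587 daemon.error] Successfully failed over from NIC e1000g1 to NIC e1000g2
Jun 26 20:57:57 node1 in.mpathd[3800]: [ID 620804 daemon.error] Successfully failed back to NIC e1000g1
Jun 26 21:03:59 node1 in.mpathd[3800]: [ID 832587 daemon.error] Successfully failed over from NIC e1000g1 to NIC e1000g2
EOF

# Keep only in.mpathd lines, then tally the two event types.
grep 'in.mpathd' /tmp/messages.sample |
awk '/failed over/ { over++ } /failed back/ { back++ }
     END { printf "failovers=%d failbacks=%d\n", over, back }'
```

On a real system, point the grep at /var/adm/messages instead of the sample file. A failover count that keeps climbing without matching failbacks usually means a flapping link worth investigating.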
To make the above configuration persistent across reboots, create configuration files for both network interfaces.
Arena-Node1#cat /etc/hostname.e1000g1
arenagroupIP netmask + broadcast + group arenagroup-1 up
Arena-Node1#cat /etc/hostname.e1000g2
group arenagroup-1 up
Thank you for reading this article. Please leave a comment if it is useful for you.
http://www.unixarena.com/2013/06/how-to-configure-solaris-10-ipmp.html
The failure detection and repair method used by the in.mpathd daemon differentiates IPMP as probe-based or link-based. In the case of link-based IPMP:
- The in.mpathd daemon uses the interface's kernel driver to check the status of the interface.
- in.mpathd observes changes to the IFF_RUNNING flag on the interface to determine failure.
- No test addresses are required for failure detection.
- It is enabled by default (if supported by the interface).
One of the advantages of link-based IPMP is that it does not depend on external sources sending ICMP replies to determine link status; it also saves IP addresses, as it does not require any test addresses for failure detection.
mpathd Configuration file
# cat /etc/default/mpathd
#
#pragma ident   "@(#)mpathd.dfl 1.2     00/07/17 SMI"
#
# Time taken by mpathd to detect a NIC failure in ms. The minimum time
# that can be specified is 100 ms.
#
FAILURE_DETECTION_TIME=10000
#
# Failback is enabled by default. To disable failback turn off this option
#
FAILBACK=yes
#
# By default only interfaces configured as part of multipathing groups
# are tracked. Turn off this option to track all network interfaces
# on the system
#
TRACK_INTERFACES_ONLY_WITH_GROUPS=yes
Meanings of FLAGs
You will see flags such as NOFAILOVER, DEPRECATED, STANDBY, etc. in the output of the ifconfig -a command. The meanings of these flags, and the parameters that enable them, are:
deprecated -> the address can only be used as a test address for IPMP, not for actual data transfer by applications.
-failover -> the address does not fail over when the interface fails.
standby -> marks the interface to be used as a standby.
Testing IPMP failover
We can check the failure and repair of an interface very easily using if_mpadm
command. -d detaches the interface whereas -r reattaches it.
# if_mpadm -d ce0
# if_mpadm -r ce0
1. Single interface
This configuration does not give increased availability; it can only be used to get notified when an interface fails.
Command line :
/etc/hostname.e1000g0
192.168.1.2 netmask + broadcast + group IPMPgroup up
Before failure :
# ifconfig -a
lo0: flags=2001000849[UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL] mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
e1000g0: flags=1000843[UP,BROADCAST,RUNNING,MULTICAST,IPv4] mtu 1500 index 13
inet 192.168.1.2 netmask ffffff00 broadcast 192.168.1.255
groupname IPMPgroup
ether 0:c:29:f6:ef:67
After failure :
# ifconfig -a
lo0: flags=2001000849[UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL] mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
e1000g0: flags=11000803[UP,BROADCAST,MULTICAST,IPv4,FAILED] mtu 1500 index 13
inet 192.168.1.2 netmask ffffff00 broadcast 192.168.1.255
groupname IPMPgroup
ether 0:c:29:f6:ef:67
2. Active-Active
/etc/hostname.e1000g0
192.168.1.2 netmask + broadcast + group IPMPgroup up
/etc/hostname.e1000g1
group IPMPgroup up
Before Failure :
# ifconfig -a
lo0: flags=2001000849[UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL] mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
e1000g0: flags=1000843[UP,BROADCAST,RUNNING,MULTICAST,IPv4] mtu 1500 index 14
inet 192.168.1.2 netmask ffffff00 broadcast 192.168.1.255
groupname IPMPgroup
ether 0:c:29:f6:ef:67
After Failure
# ifconfig -a
lo0: flags=2001000849[UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL] mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
e1000g0: flags=19000802[BROADCAST,MULTICAST,IPv4,NOFAILOVER,FAILED] mtu 0 index 14
inet 0.0.0.0 netmask 0
groupname IPMPgroup
ether 0:c:29:f6:ef:67
e1000g1: flags=1000843[UP,BROADCAST,RUNNING,MULTICAST,IPv4] mtu 1500 index 15
inet 0.0.0.0 netmask ff000000
groupname IPMPgroup
ether 0:c:29:f6:ef:71
e1000g1:1: flags=1000843[UP,BROADCAST,RUNNING,MULTICAST,IPv4] mtu 1500 index 15
inet 192.168.1.2 netmask ffffff00 broadcast 192.168.1.255
3. Active-Standby
/etc/hostname.e1000g0
192.168.1.2 netmask + broadcast + group IPMPgroup up
/etc/hostname.e1000g1
group IPMPgroup standby up
Before failure
# ifconfig -a
lo0: flags=2001000849[UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL] mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
e1000g0: flags=1000843[UP,BROADCAST,RUNNING,MULTICAST,IPv4] mtu 1500 index 20
inet 192.168.1.2 netmask ffffff00 broadcast 192.168.1.255
groupname IPMPgroup
ether 0:c:29:f6:ef:67
e1000g0:1: flags=1000842[BROADCAST,RUNNING,MULTICAST,IPv4] mtu 1500 index 20
inet 0.0.0.0 netmask 0
e1000g1: flags=69000842[BROADCAST,RUNNING,MULTICAST,IPv4,NOFAILOVER,STANDBY,INACTIVE] mtu 0 index 21
inet 0.0.0.0 netmask 0
groupname IPMPgroup
ether 0:c:29:f6:ef:71
After failure
# ifconfig -a
lo0: flags=2001000849[UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL] mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
e1000g0: flags=19000802[BROADCAST,MULTICAST,IPv4,NOFAILOVER,FAILED] mtu 0 index 20
inet 0.0.0.0 netmask 0
groupname IPMPgroup
ether 0:c:29:f6:ef:67
e1000g1: flags=21000842[BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY] mtu 1500 index 21
inet 0.0.0.0 netmask 0
groupname IPMPgroup
ether 0:c:29:f6:ef:71
e1000g1:1: flags=21000843[UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY] mtu 1500 index 21
inet 192.168.1.2 netmask ffffff00 broadcast 192.168.1.255
The failure detection method used by the in.mpathd daemon differentiates IPMP as probe-based or link-based. Probe-based IPMP uses two types of addresses in its configuration:
1. Test address - used by the in.mpathd daemon for detecting failure (also called the probe address).
2. Data address - used by applications for actual data transfer.
In the case of probe-based IPMP:
- The in.mpathd daemon sends out ICMP probe messages on the test address to one or more target systems on the same subnet.
- in.mpathd determines the target systems to probe dynamically. It uses the all-hosts multicast address (224.0.0.1) to determine the target systems to probe.
- Examples of target systems: all default routes on the same subnet, and all host routes on the same subnet (configured with the route -p add command, e.g. # route -p add -host 192.168.1.1 192.168.1.1 -static).
- All test addresses should be in the same subnet.
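The same-subnet rule can be verified mechanically. Below is a small sketch; the addresses and the /24 mask are taken from the examples in this article, and the first-three-octets shortcut only holds for a 255.255.255.0 netmask:

```shell
# Check that all IPMP test addresses fall in the same /24 network.
# For a 255.255.255.0 mask the network is simply the first three octets.
for ip in 192.168.1.3 192.168.1.4; do
    echo "$ip" | awk -F. '{ printf "%s.%s.%s.0\n", $1, $2, $3 }'
done | sort -u | awk 'END {
    if (NR == 1) print "OK: all test addresses share one subnet"
    else         print "WARNING: test addresses span " NR " subnets"
}'
```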
Most commonly used Probe-Based IPMP configurations
1. Active-Active
Groupname:            ipmp0
Active interface(s):  e1000g0, e1000g1
Standby interface(s): (none)
Data IP address(es):  192.168.1.2
Test IP address(es):  192.168.1.3, 192.168.1.4
Command line :
/etc/hostname.e1000g0:
192.168.1.2 netmask + broadcast + group ipmp0 up \
addif 192.168.1.3 netmask + broadcast + deprecated -failover up
/etc/hostname.e1000g1:
192.168.1.4 netmask + broadcast + deprecated -failover group ipmp0 up
Before failure :
# ifconfig -a
lo0: flags=2001000849[UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL] mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
e1000g0: flags=1000843[UP,BROADCAST,RUNNING,MULTICAST,IPv4] mtu 1500 index 9
inet 192.168.1.2 netmask ffffff00 broadcast 192.168.1.255
groupname ipmp0
ether 0:c:29:f6:ef:67
e1000g0:1: flags=9040843[UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER] mtu 1500 index 9
inet 192.168.1.3 netmask ffffff00 broadcast 192.168.1.255
e1000g1: flags=9040843[UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER] mtu 1500 index 10
inet 192.168.1.4 netmask ffffff00 broadcast 192.168.1.255
groupname ipmp0
ether 0:c:29:f6:ef:71
After failure :
# ifconfig -a
lo0: flags=2001000849[UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL] mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
e1000g0: flags=19000802[BROADCAST,MULTICAST,IPv4,NOFAILOVER,FAILED] mtu 0 index 9
inet 0.0.0.0 netmask 0
groupname ipmp0
ether 0:c:29:f6:ef:67
e1000g0:1: flags=19040803[UP,BROADCAST,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,FAILED] mtu 1500 index 9
inet 192.168.1.3 netmask ffffff00 broadcast 192.168.1.255
2. Active-Standby
The only difference in an active-standby configuration is that the interface configured as standby is not used to send any outbound traffic, which disables the load-balancing feature of an active-active configuration.
Groupname:            ipmp0
Active interface(s):  e1000g0
Standby interface(s): e1000g1
Data IP address(es):  192.168.1.2
Test IP address(es):  192.168.1.3, 192.168.1.4
Command line :
/etc/hostname.e1000g0:
192.168.1.2 netmask + broadcast + group ipmp0 up \
addif 192.168.1.3 netmask + broadcast + deprecated -failover up
/etc/hostname.e1000g1:
192.168.1.4 netmask + broadcast + deprecated -failover group ipmp0 standby up
Before failure :
# ifconfig -a
lo0: flags=2001000849[UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL] mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
e1000g0: flags=1000843[UP,BROADCAST,RUNNING,MULTICAST,IPv4] mtu 1500 index 11
inet 192.168.1.2 netmask ffffff00 broadcast 192.168.1.255
groupname ipmp0
ether 0:c:29:f6:ef:67
e1000g0:1: flags=9040843[UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER] mtu 1500 index 11
inet 192.168.1.3 netmask ffffff00 broadcast 192.168.1.255
e1000g1: flags=69040843[UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,STANDBY,INACTIVE] mtu 1500 index 12
inet 192.168.1.4 netmask ffffff00 broadcast 192.168.1.255
groupname ipmp0
ether 0:c:29:f6:ef:71
After failure :
# ifconfig -a
lo0: flags=2001000849[UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL] mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
e1000g0: flags=19000802[BROADCAST,MULTICAST,IPv4,NOFAILOVER,FAILED] mtu 0 index 11
inet 0.0.0.0 netmask 0
groupname ipmp0
ether 0:c:29:f6:ef:67
e1000g0:1: flags=19040803[UP,BROADCAST,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,FAILED] mtu 1500 index 11
inet 192.168.1.3 netmask ffffff00 broadcast 192.168.1.255
e1000g1: flags=29040843[UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,STANDBY] mtu 1500 index 12
inet 192.168.1.4 netmask ffffff00 broadcast 192.168.1.255
groupname ipmp0
ether 0:c:29:f6:ef:71
e1000g1:1: flags=21000843[UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY] mtu 1500 index 12
inet 192.168.1.2 netmask ffffff00 broadcast 192.168.1.255
The in.mpathd daemon is responsible for detecting and repairing IPMP failures. Check whether the process is running on the system:
# ps -ef | grep in.mpathd
    2222   0 20:41:10 ?     0:06 /usr/lib/inet/in.mpathd
If it is not running, simply run the command below to start it:
# /usr/lib/inet/in.mpathd
The first and foremost thing to do is to check the /var/adm/messages file and look for mpathd-related errors. You may find different errors (as well as messages) related to IPMP; the errors in the messages file can easily tell you the problem in the IPMP configuration.
The ifconfig -a command output displays the various flags related to IPMP and interface configuration. Typical symptoms:
1. Interfaces configured for IPMP are missing the "UP" and/or "RUNNING" flag in the ifconfig -a output.
2. Interfaces configured for IPMP show as "FAILED" in the ifconfig -a output.
If an interface is not showing the RUNNING flag, check the output of commands such as dladm show-dev, kstat, or ndd to ensure that you have a working link between the server and the switch port. Ensure that the switch port is set to auto-negotiate, and disconnect and reconnect the Ethernet cable on the server side to renegotiate the link speed with the switch port.
If an interface is not showing the UP flag, bring it up with ifconfig <interface> up.
Probe-based IPMP will use any on-link routers as targets to send ICMP probes to and listen for responses. We can monitor the snoop command output to ensure that the on-link router is responding to the pings. The in.mpathd daemon uses test addresses to exchange ICMP probes, also called probe traffic, with other targets on the IP link. Probe traffic helps determine the status of the interface and its NIC, including whether the interface has failed; the probes verify that the send and receive paths to the interface are working correctly.
In the first window :
Here 192.168.1.1 is the default router. You can check the default router in the netstat
-nrv output.
Now in the first window you should be able to see the traffic :
Here the first line is the outgoing ICMP request (the ping) and the second line is
the ICMP reply.
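The two-window check described above can be sketched as follows, with ce0 as a placeholder test interface and 192.168.1.1 as the on-link router from the example:

```shell
# Window 1: watch ICMP between the test interface and the router
snoop -d ce0 icmp and host 192.168.1.1

# Window 2: generate traffic toward the router
ping 192.168.1.1
```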
If you are using probe-based IPMP (an interface marked with -failover), then use
pkill to make in.mpathd write a debug snapshot, and check for "probes lost"
messages in /var/adm/messages:
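The snapshot-and-grep step can be sketched end to end. The log lines below are a made-up sample standing in for /var/adm/messages (hostname and PID are illustrative); only the filtering step is reproducible outside Solaris:

```shell
# On a live system the snapshot is triggered with:  pkill -USR1 mpathd
# in.mpathd then logs its probe statistics to /var/adm/messages.
# Here we filter a made-up sample the same way we would filter the real file.
cat > /tmp/messages.sample <<'EOF'
Mar  5 15:06:23 host27 in.mpathd[6338]: Number of probes sent 419987
Mar  5 15:06:23 host27 in.mpathd[6338]: Number of probes/acks lost 296324
EOF
grep -i 'lost' /tmp/messages.sample
```

A non-zero "lost" count here is what points to a failing probe path.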
If the netstat -gn outputs show interfaces that cannot respond to ALL-SYSTEMS
multicast (224.0.0.1), then add the host route using the route -p command.
Is VCS Multi-NIC In use with IPMP?
VCS uses a resource type called MultiNIC to configure IPMP using the Solaris
mpathd daemon. Check whether VCS is in use by looking for VCS-related errors in the
/var/adm/messages file:
# ps -ef|grep -i multi
# grep -i LLT /var/adm/messages
# grep -i GAB /var/adm/messages
If you are using VCS, check the main.cf file for the configuration details, and use the
hastatus command to check whether the MultiNIC resource is configured properly and
running fine.
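A quick way to do both checks (MultiNICA is the standard bundled resource type name; adjust the pattern to your configuration):

```shell
# grep -i MultiNIC /etc/VRTSvcs/conf/config/main.cf
# hastatus -sum
```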
Contact support with data
The last option, if everything else fails, is to contact Oracle support. Provide the below
data to Oracle support for troubleshooting.
1. snoop
2. Explorer
Sun Explorer output :
# explorer
3. dladm
Symptoms:
* mpathd error messages in /var/adm/messages:
Test address address is not unique; disabling probe based failure detection on
<interface_name>
* interfaces configured for IPMP missing an UP and/or RUNNING flag in the ifconfig -a output
* interfaces configured for IPMP showing as FAILED in ifconfig -a output
STEP 1: Check and validate the IPMP configuration.
The ifconfig -a output for the interfaces in the IPMP group MUST indicate UP *AND*
RUNNING.
If UP is missing from the output:
# ndd -get /dev/<interface> adv_autoneg_cap
# kstat -p | grep e1000g:0 | grep auto
# dladm show-dev
The proper setting for adv_autoneg_cap is 1, meaning that the Sun interface is advertising its
autonegotiation capability to the link partner (switch).
If adv_autoneg_cap is set to 0, correct with ndd for an immediate change:
Note: ce and hme devices require the instance to be set before any commands. Other devices
identify the instance in the /dev/ argument e.g. to retrieve information on the first instance of
bge: ndd -get /dev/bge0 adv_autoneg_cap.
# ndd -set /dev/ce instance (device instance)
# ndd -set /dev/ce adv_autoneg_cap 1
to check:
# ndd -set /dev/ce instance (device instance)
# ndd -get /dev/ce adv_autoneg_cap
1
If the setting shows 1 after running the ndd command, but the link is not restored:
- ensure the switchport is set to autonegotiate.
- disconnect and reconnect the cable from the interface to the switch to allow the link partners to re-negotiate.
Use OBP watch-net-all to test Sun interfaces on SPARC hardware:
If you need further assistance to verify your network or switch connections, please consult your
local network administrator.
# pkill -USR1 mpathd
Mar 5 15:06:23 solarishost27 in.mpathd[6338]: [ID 942985 daemon.error] Missed sending total of 0 probes spread over 0 occurrences
Mar 5 15:06:23 solarishost27 in.mpathd[6338]: Probe stats on (inet aggr1)
Mar 5 15:06:23 solarishost27 in.mpathd[6338]: Number of probes sent 419987
Mar 5 15:06:23 solarishost27 in.mpathd[6338]: Number of probe acks received 419987
Mar 5 15:06:23 solarishost27 in.mpathd[6338]: Probe stats on (inet aggr6)
Mar 5 15:06:23 solarishost27 in.mpathd[6338]: Number of probes sent 419923
Mar 5 15:06:23 solarishost27 in.mpathd[6338]: Number of valid probes/acks received 373034
Mar 5 15:06:23 solarishost27 in.mpathd[6338]: Number of probe acks received 123490
Mar 5 15:06:23 solarishost27 in.mpathd[6338]: Number of probes/acks lost 296324  <<- unacknowledged probes
If the stats show lost/unacknowledged probes, proceed with step #6.
solarishost# netstat -g | grep ALL-SYSTEMS.MCAST.NET
lo0    ALL-SYSTEMS.MCAST.NET    1
hme0   ALL-SYSTEMS.MCAST.NET    1
solarishost# netstat -gn | grep 224.0.0.1
lo0    224.0.0.1    1
hme0   224.0.0.1    1
If the netstat -gn outputs show interfaces that cannot respond to ALL-SYSTEMS multicast, the
configuration must be corrected by adding the host route with route -p, as noted earlier.
Check whether VCS is in use:
# ps -ef | grep -i multi
# grep -i LLT /var/adm/messages
STEP 6: Gather the troubleshooting and configuration data specified below and contact Sun Support.
At this point, if you have validated that each troubleshooting step above is true for your
environment, and the issue still exists, further troubleshooting is required:
I. collect snoop output for each network interface in the IPMP group.
note: explorer should be run with the -w localzones option to collect information on any
configured local zones.
II. collect the following outputs to a file using these commands:
# dladm show-dev > show-dev.out
# dladm show-link > show-link.out
The following command outputs will be collected for machines up to Solaris 10 update 4:
1.dladm_show-link.out
2.dladm_show-dev.out
3.dladm_show-aggr_-L.out
And the following command outputs will be collected for machines from Solaris 10 update 4 onwards:
1.dladm_show-link.out
2.dladm_show-dev.out
3.dladm_show-aggr_-L.out
4.dladm_show-linkprop.out
http://www.gurkulindia.com/main/2011/06/solaris-ipmp-diagnosis-andtroubleshooting/
BLOG FOR UNIX ADMIN, VCS FUNDAMENTALS, VERITAS CLUSTER SERVICES
The purpose of this post is to make the cluster concept easy for those youngsters
who have just started their career as System Administrators. While writing this post
I had only one thing in mind, i.e. explain the entire cluster concept with minimum
usage of technical jargon and make it as simple as possible. That's all about the
introduction; let us go to the actual lesson.
In any organisation, every server in the network will have a specific purpose in
terms of its usage, and most of the time these servers are used to provide a
stable environment to run the software applications that are required for the
organisation's business. Usually, these applications are very critical for the
business, and organisations cannot afford to have them down even for minutes.
For Example: A bank having an application which takes care of its internet
banking.
From the below figure you can see an application running on a standalone
server which is configured with a Unix Operating System and a database (oracle /
sybase / db2 / mssql etc). The organisation chose to run it as a
standalone application because it was not critical in terms of business;
in other words, whenever the application is down it won't impact the
actual business.
Usually, the application clients for these applications will connect to the
application server using the server name, server IP or a specific application IP.
Let us assume the organisation has an application which is very critical for its business,
and any impact to the application will cause huge loss to the organisation. In that case, the
organisation has one option to reduce the impact of application failure due to
Operating System or Hardware failure: purchasing a secondary server with the same hardware
configuration, installing the same kind of OS & Database, and configuring it with the same
application in passive mode, then failing the application over from the primary server to this
secondary server whenever there is an issue with the underlying hardware/operating system of the primary server.
What is failover?
Whenever there is an issue related to the primary server which makes the application unavailable to
the client machines, the application should be moved to another available server in the network,
either by manual or automatic intervention. Transferring the application from the primary server to the
secondary server, and making the secondary server active for the application, is called a failover
operation. The reverse operation (i.e. restoring the application on the primary server) is called
failback.
Now we can call this configuration an application HA (Highly Available) setup, compared to the
earlier standalone setup. Do you agree with me?
Now the question is, how does this manual failover work when there is an application issue due to
Hardware/Operating System failure?
Manual failover basically involves the below steps:
1. Stop the application (and its dependent resources) on the primary server.
2. Move the required resources, such as storage and the application IP, to the secondary server.
3. Start the application on the secondary server.
Manual failover has its drawbacks:
1. Requires manual intervention, so recovery waits for an administrator.
2. Time consuming.
3. Technically complex when it involves more dependent components for the application.
Oracle RAC - an application-level cluster for the Oracle database that works on different
Operating Systems.
Veritas Cluster Services - third-party cluster software that works on different Operating
Systems like Solaris / Linux / AIX / HP-UX.
And In this post, we are actually discussing about VCS and its Operations. This post is not
going to cover the actual implementation part or any command syntax of VCS, but will cover the
concept how VCS makes application Highly Available(HA).
Note: So far, I managed to explain the concept without using much complex terminology, but
now it's time to introduce some new VCS terminology, which we use in everyday
operations of VCS. Just keep a little more focus on each new term.
VCS Components
VCS has two types of components: 1. Physical Components 2. Logical Components
Physical Components:
1. Nodes
VCS nodes host the service groups (managed applications). Each system is connected to
networking hardware, and usually also to storage hardware. The systems contain components
to provide resilient management of the applications, and start and stop agents.
Nodes can be individual systems, or they can be created with domains or partitions on
enterprise-class systems. Individual cluster nodes each run their own operating system and
possess their own boot device. Each node must run the same operating system within a single
VCS cluster.
Clusters can have from 1 to 32 nodes. Applications can be configured to run on specific nodes
within the cluster.
2. Shared storage
Storage is a key resource of most applications services, and therefore most service groups. A
managed application can only be started on a system that has access to its associated data
files. Therefore, a service group can only run on all systems in the cluster if the storage is
shared across all systems. In many configurations, a storage area network (SAN) provides this
requirement.
You can use I/O fencing technology for data protection. I/O fencing blocks access to shared
storage from any system that is not a current and verified member of the cluster.
3. Networking Components
Networking in the cluster is used for the following purposes:
Communications between the cluster nodes and the Application Clients and external
systems.
Logical Components
1. Resources
Resources are hardware or software entities that make up the application. Resources include
disk groups and file systems, network interface cards (NIC), IP addresses, and applications.
1.1. Resource dependencies
Resource dependencies indicate resources that depend on each other because of application or
operating system requirements. Resource dependencies are graphically depicted in a hierarchy,
also called a tree, where the resources higher up (parent) depend on the resources lower down
(child).
A failover service group runs on one system in the cluster at a time. Failover groups are used for
most applications that do not support multiple systems to simultaneously access the
applications data.
A parallel service group runs simultaneously on more than one system in the cluster. A parallel
service group is more complex than a failover group. Parallel service groups are appropriate for
applications that manage multiple application instances running simultaneously without data
corruption.
A hybrid service group is for replicated data clusters and is a combination of the failover and
parallel service groups. It behaves as a failover group within a system zone and a parallel group
across system zones.
3. VCS Agents
Agents are multi-threaded processes that provide the logic to manage resources. VCS has one
agent per resource type. The agent monitors all resources of that type; for example, a single IP
agent manages all IP resources.
When the agent is started, it obtains the necessary configuration information from VCS. It then
periodically monitors the resources, and updates VCS with the resource status.
4. Cluster Communications and VCS Daemons
Cluster communications ensure that VCS is continuously aware of the status of each systems
service groups and resources. They also enable VCS to recognize which systems are active
members of the cluster, which have joined or left the cluster, and which have failed.
4.1. High availability daemon (HAD)
The VCS high availability daemon (HAD) runs on each system. Also known as the VCS engine,
HAD is responsible for building the running cluster configuration from the configuration
files, distributing the information when new nodes join the cluster, responding to operator
input, and taking corrective action when something fails.
The engine uses agents to monitor and manage resources. It collects information about
resource states from the agents on the local system and forwards it to all cluster members. The
local engine also receives information from the other cluster members to update its view of the
cluster.
The hashadow process monitors HAD and restarts it when required.
4.2. HostMonitor daemon
VCS also starts HostMonitor daemon when the VCS engine comes up. The VCS engine creates
a VCS resource VCShm of type HostMonitor and a VCShmg service group. The VCS engine
does not add these objects to the main.cf file. Do not modify or delete these components of
VCS. VCS uses the HostMonitor daemon to monitor the resource utilization of CPU and Swap.
VCS reports to the engine log if the resources cross the threshold limits that are defined for the
resources.
4.3. Group Membership Services/Atomic Broadcast (GAB)
The Group Membership Services/Atomic Broadcast protocol (GAB) is responsible for cluster
membership and cluster communications.
Cluster Membership
GAB maintains cluster membership by receiving input on the status of the heartbeat from each
node by LLT. When a system no longer receives heartbeats from a peer, it marks the peer as
DOWN and excludes the peer from the cluster. In VCS, memberships are sets of systems
participating in the cluster.
Cluster Communications
GABs second function is reliable cluster communications. GAB provides guaranteed delivery of
point-to-point and broadcast messages to all nodes. The VCS engine uses a private IOCTL
(provided by GAB) to tell GAB that it is alive.
4.4. Low Latency Transport (LLT)
VCS uses private network communications between cluster nodes for cluster maintenance.
Symantec recommends two independent networks between all cluster nodes. These networks
provide the required redundancy in the communication path and enable VCS to discriminate
between a network failure and a system failure. LLT has two major functions.
Traffic Distribution
LLT distributes (load balances) internode communication across all available private network
links. This distribution means that all cluster communications are evenly distributed across all
private network links (maximum eight) for performance and fault resilience. If a link fails, traffic is
redirected to the remaining links.
Heartbeat
LLT is responsible for sending and receiving heartbeat traffic over network links. The Group
Membership Services function of GAB uses this heartbeat to determine cluster membership.
4.5. I/O fencing module
The I/O fencing module implements a quorum-type functionality to ensure that only one cluster
survives a split of the private network. I/O fencing also provides the ability to perform SCSI-3
persistent reservations on failover. The shared disk groups offer complete protection against
data corruption by nodes that are assumed to be excluded from cluster membership.
5. VCS Configuration files.
5.1. main.cf
/etc/VRTSvcs/conf/config/main.cf is the key file in terms of VCS configuration. The main.cf file
basically provides the below information to the VCS agents/VCS daemons:
What are the resources available in each Service Group, the types of resources and
its attributes?
What are the dependencies each service group having on other Service Groups?
5.2. types.cf
The file types.cf, which is listed in the include statement in the main.cf file, defines the VCS
bundled types for VCS resources. The file types.cf is also located in the folder
/etc/VRTSvcs/conf/config.
5.3. Other Important files
/etc/llttab - describes the local system's private network links to the other nodes in the
cluster.
Why do the cluster nodes need shared storage?
All nodes can see the storage devices from their local operating systems, but at a time only one
node (the active node) can make write operations to the storage.
Why does each server need two storage paths (connected to two HBAs)?
To provide redundancy for the server's storage connection and to avoid a single point of failure in
the storage connection. Whenever you notice multiple storage paths connected to a server, you
can safely assume that there is some storage multipathing software running on the Operating
System, e.g. multipathd, EMC PowerPath, HDLM, MPIO etc.
Why does each server need two network connections to the physical network?
This is, again, to provide redundancy for the server's network connection and to avoid a single
point of failure in the server's physical network connectivity. Whenever you see dual physical network
connections, you can assume that the server is using some kind of IP multipathing software to manage
the dual paths, e.g. IPMP in Solaris, NIC bonding in Linux, etc.
Why do we need a minimum of two heartbeat connections between the cluster nodes?
When VCS has lost all its heartbeat connections except the last one, the condition is
called cluster jeopardy. When the cluster is in the jeopardy state, either of the below things could
happen:
1) The loss of the last available interconnect link
In this case, the cluster cannot reliably identify and discriminate if the last interconnect link is
lost or the system itself is lost and hence the cluster will form a network partition causing two or
more mini clusters to be formed depending on the actual network partition. At this time, every
Service Group that is not online on its own mini cluster, but may be online on the other mini
cluster will be marked to be in an autodisabled state for that mini cluster until such time that
the interconnect links start communicating normally.
2) The loss of an existing system which is currently in jeopardy state due to a problem
In this case, the situation is exactly the same as explained in step 1 forming two or more mini
clusters.
In the case where both the LLT interconnect links disconnect at the same time and we do not
have any low-pri links configured, the cluster cannot reliably identify whether it is the interconnects
that have disconnected, and will assume that the other system is down and now unavailable.
Hence in this scenario, the cluster treats this like a system fault, and the service groups
will be attempted to be onlined on each mini cluster, depending upon the AutoStartList
defined on each Service Group. This may lead to possible data corruption due to applications
writing to the same underlying data on storage from different systems at the same time. This
scenario is well known as the Split Brain condition.
This is all about the introduction to VCS. Please stay tuned for the next posts, where I am
going to discuss the actual administration of VCS.
http://www.gurkulindia.com/main/2011/07/beginners-lesson-veritas-cluster-servicesfor-solaris/
Normally after creating a filesystem, we add it in vfstab to mount automatically
across server reboots. But this will be different if your system is part of a VCS
cluster. Normally all the application filesystems will be managed by VCS, which mounts the
filesystems whenever the cluster starts, and we shouldn't add them in vfstab. Here we are
going to see how to add a new filesystem to an existing VCS cluster on the fly.
It's a tricky job, because if the filesystem resource is set as critical and it's not mounted on the
system, it will bring down the entire service group once you enable the resource. So
before enabling the resource, we need to make sure the resource attribute is set as non-critical.
This is an example of how to add a new filesystem on a two-node VCS cluster without any
downtime.
Environment:
Cluster Nodes: Node1,Node2
Diskgroup name:ORAdg
Volume Name:oradata01
Mount Point:/ORA/data01
Service Group:ORAsg
Volume Resource Name:oradata01_Vol
Mount Resource Name:oradata01_Mount
Diskgroup Resource Name:ORADG
Creating the new volume:
#vxassist -g ORAdg make oradata01 100g ORA_12 ORA_12 layout=mirror
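A hedged sketch of the VCS steps that follow volume creation, using the resource and group names from the environment list above (attribute names follow standard VCS Volume and Mount agent usage; verify against your VCS version):

```shell
# Open the configuration read-write
haconf -makerw

# Add the volume resource as non-critical so a fault cannot pull down ORAsg
hares -add oradata01_Vol Volume ORAsg
hares -modify oradata01_Vol Critical 0
hares -modify oradata01_Vol DiskGroup ORAdg
hares -modify oradata01_Vol Volume oradata01

# Add the mount resource, also non-critical
hares -add oradata01_Mount Mount ORAsg
hares -modify oradata01_Mount Critical 0
hares -modify oradata01_Mount MountPoint /ORA/data01
hares -modify oradata01_Mount BlockDevice /dev/vx/dsk/ORAdg/oradata01
hares -modify oradata01_Mount FSType vxfs
hares -modify oradata01_Mount FsckOpt %-y

# Wire the dependencies: Mount -> Volume -> DiskGroup
hares -link oradata01_Mount oradata01_Vol
hares -link oradata01_Vol ORADG

# Enable, bring online on the active node, then save the configuration
hares -modify oradata01_Vol Enabled 1
hares -modify oradata01_Mount Enabled 1
hares -online oradata01_Mount -sys Node1
haconf -dump -makero
```

Once the resources prove stable, the Critical attribute can be set back to 1 so a fault fails the group over as intended.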
Thank you for reading this article. Please leave a comment if you have any doubt; I will
get back to you as soon as possible.
http://www.unixarena.com/2012/07/how-to-add-new-filesystem-in-vcs-cluster.html
How do you start the VCS cluster if it's not started automatically after a server reboot?
Have you ever faced such issues? If not, just see how we can fix these kinds of issues
on a Veritas cluster. I have been asking this question in Solaris interviews, but most
candidates fail to impress me, saying things unrelated to VCS. If you
know the basics of Veritas cluster, it will be easy to troubleshoot in real time and
easy to explain in interviews too.
VCS troubleshooting
Scenario:
Two nodes are clustered with Veritas cluster, and you have rebooted one of the servers.
The rebooted node has come up, but the VCS cluster was not started (the HAD daemon). You are
trying to start the cluster using the hastart command, but it's not working. How do you
troubleshoot?
Here we go.
1.Check the cluster status after the server reboot using hastatus command.
# hastatus -sum |head
Cannot connect to VCS engine
2.Try to start the cluster using hastart. No luck? Still getting the same message as
above? Proceed with Step 3.
3.Check the LLT and GAB services. If they are in a disabled state, just enable them.
[root@UA~]# svcs -a |egrep "llt|gab"
online
Jun_27 svc:/system/llt:default
online
Jun_27 svc:/system/gab:default
[root@UA~]#
4.Check the heartbeat links status using the lltstat -nvv command.
[root@UA ~]# lltstat -nvv | head
LLT node information:
Node          State    Link   Status   Address
* 1 UA        OPEN
                       HB1    UP       00:91:28:99:74:89
                       HB2    UP       00:91:28:99:74:BF
  0 UA2       OPEN
                       HB1    UP       00:71:28:9C:2E:OF
                       HB2    UP       00:71:28:9C:2F:9F
[root@UA ~]#
5.If the LLT is down, then try to configure the private links using the lltconfig -c
command. If you still have any issue with the LLT links, you need to check with the network
team to fix the heartbeat links.
6.Check the GAB status using the gabconfig -a command.
7.As per the above command output, memberships are not seeded. We have to seed
the membership manually using the gabconfig command.
[root@UA ~]# gabconfig -cx
[root@UA ~]#
8.Check the GAB status again using gabconfig -a. The output should now show Port a with
membership on both the nodes (0, 1). To know which node is 0 and which node is 1, refer to
the /etc/llthosts file.
9.Try to start the cluster using the hastart command. It should work now.
10.Check the Membership status using gabconfig.
[root@UA ~]# gabconfig -a
GAB Port Memberships
===============================================================
Port a gen 6d0607 membership 01
Port h gen 6d0607 membership 01
Above output indicates that HAD (Port h) is online on both the nodes (0, 1).
11.Check the cluster status using hastatus command. System should be back to
business.
[root@UA ~]# hastatus -sum |head
-- SYSTEM STATE
-- System          State      Frozen
A  UA              RUNNING    0
A  UA2             RUNNING    0
-- GROUP STATE
-- Group            System   Probed   AutoDisabled   State
B  ClusterService   UA       Y        N              ONLINE
B  ClusterService   UA2      Y        N              OFFLINE
[root@UA ~]#
This is a very small thing, but many VCS beginners fail to fix these start-up issues.
In interviews too, they are not able to say: "If HAD is not starting with the hastart
command, I will check the LLT & GAB services, fix any issues with them, and then
start the cluster using hastart." As an interviewer, that is the answer everybody
expects.
Hope this article is informative to you .
http://www.unixarena.com/2014/07/troubleshoot-vcs-cluster-starting.html
cfg2html is a very useful script to take a backup of all the system configuration in text format
and html format. This script is available for Solaris, various Linux flavors and HP-UX.
For more information about the script, please visit
http://groups.yahoo.com/group/cfg2html.
Once you run the script, by default it will generate three files.
1. System configuration in text format
2. System configuration in html format
3. Script Error log
These configuration backup files are very useful to rebuild the server from scratch. But we
have to make sure we have the latest configuration backup, by running cfg2html periodically
and keeping the output in another location or on a web portal for future reference.
Here is the script, which you can download and use for Solaris 10.
Download cfg2html
From the google drive, Click on File tab- > Select Download
bash-3.00# ./cfg2html_solaris_10v1.0
------------------------------------------------Starting
2012-07-18 14:46:51
bash-3.00# ls -lrt
total 337
-rwx------1 root root 24796 Jul 18 14:46 cfg2html_solaris_10v1.0
drwx------2 root root
-rw-r--r-- 1 root
-rw-r--r-- 1 root
-rw-r--r-- 1 root
bash-3.00# uname -a
SunOS sfos 5.10 Generic_142910-17 i86pc i386 i86pc
# crontab -e
Add the below lines in the end of the file.
00 23 15 * * /var/tmp/cfg2html_solaris10_v1.0/cfg2html_solaris_10v1.0 > /dev/null 2> /dev/null
00 23 01 * * /var/tmp/cfg2html_solaris10_v1.0/cfg2html_solaris_10v1.0 > /dev/null 2> /dev/null
Save the file & exit. The above jobs will run cfg2html on the 1st and 15th of the month at 11 PM.
Thank you for reading this article. Please leave a comment if you have any doubt; I will
get back to you as soon as possible.
http://www.unixarena.com/2012/07/cfg2html-on-solaris-os-configuration.html
Is your Solaris environment secure enough? How can we tighten the system security?
Here we will see some basic hardening steps for Solaris OS. Every organization should
maintain hardening checklists for each operating system they use. Before a
server is brought into operation/production, the hardening checklist needs to be verified by the
support team who supports the server.
Actually, the OS hardening part begins before the system is built, because you need to choose a
customized OS image according to your environment. By reducing the OS image
size, the possibility of risk (security and reliability) is much less, and a smaller OS image
speeds up the boot process and consumes less disk space.
1.Apply the Recommended Patch Cluster bundle regularly. It has very important bug fixes
and security patches. Visit https://support.oracle.com to check the latest
additional security patches and install them if applicable to your environment.
2.Disable all the services which are not being used anymore. There are many services
which will put your system at high risk. Disable services like RPC-based services, NFS, NIS,
Sendmail, Apache, SNMP, printer services and internet-based services if no longer used on
the server.
3.Disable inetd services and use ssh for remote login and file transfer.
It's better not to use the telnet, ftp and rlogin services.
4.There are many parameters in the Solaris kernel which can be tuned to increase
system security. Network parameters can be tuned using the ndd
command; other kernel parameters can be modified using the /etc/system file.
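As an illustration of /etc/system tuning, two widely used hardening entries (verify applicability to your Solaris release before deploying):

```
* /etc/system entries: make user stacks non-executable and log attempts
set noexec_user_stack=1
set noexec_user_stack_log=1
```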
Network tweaks:
Disable IP forwarding on OS
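For example, IP forwarding can be disabled immediately with ndd, and persistently with routeadm on Solaris 10:

```shell
# ndd -set /dev/ip ip_forwarding 0     # immediate, lost at reboot
# routeadm -d ipv4-forwarding -u       # persistent on Solaris 10
```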
This article will help you to understand some of the basic troubleshooting instructions for NFS
problems
1. Check the NFS versions and transports supported:
OS Version                    NFSv2      NFSv3      NFSv4
SunOS                         UDP        -          -
Solaris[TM] 2.5,2.6,7,8,9     UDP/TCP    UDP/TCP    -
Solaris[TM] 10                UDP/TCP    UDP/TCP    TCP*
*The UDP transport is not supported in NFSv4, as it does not contain the required congestion
control methods
2. Check the Connectivity for NFS Server from NFS client:
1. Check that the NFS server is reachable from the client by running:
#/usr/sbin/ping
2. If the server is not reachable from the client, make sure that the local name service is
running. For NIS+ clients:
#/usr/lib/nis/nisping -u
3. If the name service is running, make sure that the client has received the correct host
information:
# /usr/bin/getent hosts
4. If the host information is correct, but the server is not reachable from the client, run the ping
command from another client.
5. If the server is reachable from the second client, use ping to check connectivity of the first
client to other systems on the local network. If this fails, check the networking configuration on
the client. Check the following files:
/etc/hosts, /etc/netmasks, /etc/nsswitch.conf,
/etc/nodename, /etc/net/*/hosts etc.
6. If the software is correct, check the networking hardware.
Additionally, you can refer to the comparison of NFS hard mounts vs soft mounts.
To display statistics for each NFS-mounted file system, use the command 'nfsstat -m'. This
command will also tell you which options were used when the file system was mounted. You can
also check the contents of /etc/mnttab; it should show what is currently mounted. Lastly,
check the dates on the server and the client. An incorrect date may show a file as created
in the future, causing confusion.
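The /etc/mnttab check mentioned above can be scripted. The sketch below filters a made-up sample file standing in for the real /etc/mnttab (server and mount-point names are illustrative); on a live system, point awk at /etc/mnttab itself:

```shell
# List NFS mounts the way you would from /etc/mnttab.
# /tmp/mnttab.sample stands in for the real /etc/mnttab here.
cat > /tmp/mnttab.sample <<'EOF'
/dev/dsk/c0t0d0s0 / ufs rw 1342600000
nfsserver:/export/data /mnt/data nfs rw,soft 1342600100
EOF
awk '$3 == "nfs" { print $1, "mounted on", $2 }' /tmp/mnttab.sample
```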
http://www.gurkulindia.com/main/2011/05/nfs-troubleshooting/
A common problem which we sometimes see in VxVM is DGs in a disabled state. In this
post I will try to provide a solution to this problem.
1.) Check out the outputs of df, vxdisk and vxdg to identify the state of DGs and filesystems.
yogesh-test# df -h
Filesystem              size   used  avail  capacity  Mounted on
/dev/md/dsk/d10          ...    ...    ...       52%  /
swap                     14G   120K    14G        1%  /var/run
dmpfs                     7G     0K     7G        0%  /dev/vx/dmp
dmpfs                     7G     0K     7G        0%  /dev/vx/rdmp
yogesh-test# vxdisk list
DEVICE       TYPE          DISK      GROUP    STATUS
c1t0d0s2     auto:sliced   disk01    rootdg   online
c1t1d0s2     auto:sliced   disk02    rootdg   online
c3t1d0s2     auto:sliced   mydg02    mydg     online dgdisabled
c3t1d1s2     auto:sliced   mydg01    mydg     online dgdisabled
c3t1d2s2     auto:sliced   yogdg01   yogdg    online dgdisabled
yogesh-test# vxdg list
NAME         STATE         ID
rootdg       enabled       1090964640.15.yogesh-test
mydg         disabled      1090904042.16.yogesh-test
yogdg        disabled      1197441805.17.yogesh-test
Note: The DGs are showing in a disabled state, but the volumes are still mounted. We need to
umount the filesystems, which are in a stale state. Also, you can check the volume state with
vxinfo -pg <DG>; I missed taking the output of this command to present here.
yogesh-test# fuser -cu /myvol1
yogesh-test# fuser -cu /yogvol
yogesh-test# fuser -ck /myvol1
yogesh-test# fuser -ck /yogvol
yogesh-test# umount /myvol1
yogesh-test# umount /yogvol
2.) Now to get rid of the DGs from disabled state, we need to deport and import the DGs as
shown below:
yogesh-test# vxdg deport mydg
yogesh-test# vxdg import mydg
yogesh-test# vxdg deport yogdg
yogesh-test# vxdg import yogdg
yogesh-test# vxdg list
NAME         STATE     ID
rootdg       enabled   1090964640.15.yogesh-test
mydg         enabled   1090904042.16.yogesh-test
yogdg        enabled   1197441805.17.yogesh-test
yogesh-test# vxdisk list
DEVICE       TYPE          DISK      GROUP    STATUS
c1t0d0s2     auto:sliced   disk01    rootdg   online
c1t1d0s2     auto:sliced   disk02    rootdg   online
c3t1d0s2     auto:sliced   mydg02    mydg     online
c3t1d1s2     auto:sliced   mydg01    mydg     online
c3t1d2s2     auto:sliced   yogdg01   yogdg    online
Note: Sometime we have to use force option for importing & deporting DGs i.e vxdg -f import
<dg> & vxdg -f deport <dg>.
3.) The next step is to proceed with starting the volumes and mounting them using the vxvol & mount
commands.
yogesh-test# vxvol -g yogdg startall
yogesh-test# vxvol -g mydg startall
yogesh-test# mount /yogvol
yogesh-test# mount /myvol1
yogesh-test# df -h | grep vol
/dev/vx/dsk/yogdg/yogvol    134G   975M    44G     3%    /yogvol
/dev/vx/dsk/mydg/myvol1     124G    83G    41G    68%    /myvol1
Note: Sometimes you may encounter problems during the mounts; in that case, kindly proceed with
fsck to clean the bad blocks in the FS and then try to mount the FS again.
http://www.gurkulindia.com/main/2012/02/disk-groups-in-vxvm-are-in-disabledstate/
http://www.gurkulindia.com/main/category/unix-administration/veritas/veritasvolume-manager/veritas-volume-manager-troubleshooting/
Volumes can be striped, mirrored or RAID-5'ed. Mirrored volumes are made up of equally-sized collections of subdisks known as plexes. Each plex is a mirror copy of the data in the
volume. The Veritas File System (VxFS) is an extent-based file system with advanced
logging, snapshotting, and performance features.
VxVM provides dynamic multipathing (DMP) support, which means that it takes care of path
redundancy where it is available. If new paths or disk devices are added, one of the steps to
be taken is to run vxdctl enable to scan the devices, update the VxVM device list, and
update the DMP database. In cases where we need to override DMP support (usually in
favor of an alternate multipathing software like EMC Powerpath), we can run vxddladm
addforeign.
Here are some procedures to carry out several common VxVM operations. VxVM has a
Java-based GUI interface as well, but I always find it easiest to use the command line.
Task                         Procedure
Create a volume              vxassist -g <dg> make <vol-name> <size>
Remove a volume              vxassist -g <dg> remove volume <vol-name>, or vxedit -rf rm <vol-name>
List disks                   vxdisk list
View the configuration       vxprint -ht
Add a disk                   vxdiskadm or vxdiskadd
Scan for new devices         drvconfig; disks, then vxdiskconfig, then vxdctl enable
                             (or vxdctl enable followed by vxdisk scandisks)
Rename disks                 vxedit rename old-disk-name new-disk-name
Rename subdisks              vxsd mv old-subdisk-name new-subdisk-name
Show volume statistics       vxstat
Resize a volume              vxassist growto|growby|shrinkto|shrinkby <volume-name> <size>
Change a volume's layout     vxassist relayout <volume-name> layout=<layout>
The progress of many VxVM tasks can be tracked by setting the -t flag at the time the
command is run: utility -t tasktag. If the task tag is set, we can use vxtask to list, monitor,
pause, resume, abort or set the task labeled by the tasktag.
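A sketch of task tracking (the disk group, volume, and tag names here are illustrative):

```shell
# Tag a long-running operation with a task label
vxassist -g mydg -t myresize growby myvol1 2g

# In another session: list tagged tasks and watch the tagged one
vxtask list
vxtask monitor myresize

# Pause or abort the tagged task if needed
vxtask pause myresize
vxtask resume myresize
```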
Physical disks which are added to VxVM control can either be initialized (made into a native
VxVM disk) or encapsulated (disk slice/partition structure is preserved). In general, disks
should only be encapsulated if there is data on the slices that needs to be preserved, or if it
is the boot disk. (Boot disks must be encapsulated.) Even if there is data currently on a non-boot disk, it is best to back up the data, initialize the disk, create the file systems, and
restore the data.
When a disk is initialized, the VxVM-specific information is placed in a reserved location on
the disk known as a private region. The public region is the portion of the disk where the
data will reside.
VxVM disks can be added as one of several different categories of disks:
sliced: Public and private regions are on separate physical partitions. (Usually s3 is
the private region and s4 is the public region, but encapsulated boot disks are the reverse.)
simple: Public and private regions are on the same disk area.
cdsdisk: (Cross-Platform Data Sharing) This is the default, and allows disks to be
shared across OS platforms. This type is not suitable for boot, swap or root disks.
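Initializing a disk in a particular format can be sketched as follows (the device names are illustrative):

```shell
# Initialize a disk with the sliced format (separate public/private partitions)
/etc/vx/bin/vxdisksetup -i c1t2d0 format=sliced

# The default cdsdisk format needs no format= option
/etc/vx/bin/vxdisksetup -i c1t3d0
```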
If there is a VxFS license for the system, as many file systems as possible should be
created as VxFS file systems to take advantage of VxFS's logging, performance and
reliability features.
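Creating and mounting a VxFS file system on a volume can be sketched as (disk group, volume, and mount-point names are illustrative):

```shell
# Make a VxFS file system on the raw volume device
mkfs -F vxfs /dev/vx/rdsk/mydg/myvol1

# Mount it via the block device
mount -F vxfs /dev/vx/dsk/mydg/myvol1 /myvol1
```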
At the time of this writing, ZFS is not an appropriate file system for use on top of VxVM
volumes. Sun warns that running ZFS on VxVM volumes can cause severe performance
penalties, and that it is possible that ZFS mirrors and RAID sets would be laid out in a way
that compromises reliability.
VxVM Maintenance
The first step in any VxVM maintenance session is to run vxprint -ht to check the state of
the devices and configurations for all VxVM objects. (A specific volume can be specified
with vxprint -ht volume-name.) This section includes a list of procedures for dealing with
some of the most common problems. (Depending on the naming scheme of a VxVM
installation, many of the below commands may require a -g dg-name option to specify the
disk group.)
Volumes which are not starting up properly will be listed as DISABLED or DETACHED. A
volume recovery can be attempted with the vxrecover -s volume-name command.
If all plexes of a mirror volume are listed as STALE, place the volume in maintenance
mode, view the plexes and decide which plex to use for the recovery:
vxvol maint volume-name (The volume state will be DETACHED.)
vxprint -ht volume-name
vxinfo volume-name (Display additional information about unstartable plexes.)
vxmend off plex-name (Offline bad plexes.)
vxmend on plex-name (Online a plex as STALE rather than DISABLED.)
vxvol start volume-name (Revive stale plexes.)
vxplex att volume-name plex-name (Recover a stale plex.)
If, after the above procedure, the volume still is not started, we can force a plex to a
clean state. If the plex is in a RECOVER state and the volume will not start, use a -f option
on the vxvol command:
vxmend fix clean plex-name
vxvol start volume-name
vxplex att volume-name plex-name
If a subdisk status is listed as NDEV even when the disk is shown as available by
vxdisk list, the problem can sometimes be resolved by running
vxdg deport dgname; vxdg import dgname
to re-initialize the disk group.
To remove a disk:
Copy the data elsewhere if possible.
Unmount file systems from the disk or unmirror plexes that use the disk.
vxvol stop volume-name (Stop volumes on the disk.)
vxdg -g dg-name rmdisk disk-name (Remove disk from its disk group.)
vxdisk offline disk-name (Offline the disk.)
vxdiskunsetup c#t#d# (Remove the disk from VxVM control.)
Physically remove the disk, then run drvconfig; disks or perform a reconfiguration reboot.
To replace a failed or removed disk:
In vxdiskadm, choose option 5: Replace a failed or removed disk. Follow the prompts and
replace the disk with the appropriate disk.
To replace a failed boot disk:
Use the eeprom command at the root prompt or the printenv command at the ok> prompt
to make sure that the nvramrc devalias entries and the boot-device parameter are set to allow a boot
from the mirror of the boot disk. If the boot paths are not set up properly for both mirrors of
the boot disk, it may be necessary to move the mirror disk physically to the boot disk's
location. Alternatively, the devalias command at the ok> prompt can set the mirror disk path
correctly; then use nvstore to write the change to the NVRAM. (It is sometimes necessary
to run nvunalias aliasname to remove a stale alias from the nvramrc, then
nvalias aliasname devicepath to re-create it.)
VxVM Mirroring
Most volume manager availability configuration is centered around mirroring. While RAID-5
is a possible option, it is infrequently used due to the parity calculation overhead and the
relatively low cost of hardware-based RAID-5 devices.
In particular, the boot device must be mirrored; it cannot be part of a RAID-5 configuration.
To mirror the boot disk:
eeprom use-nvramrc?=true
Before mirroring the boot disk, set use-nvramrc? to true in the EEPROM settings. If you
forget, you will have to go in and manually set up the boot path for your boot mirror disk.
(See To replace a failed boot disk in the VxVM Maintenance section for the procedure.) It
is much easier if you set the parameter properly before mirroring the disk!
The boot disk must be encapsulated, preferably in the bootdg disk group. (The
bootdg disk group membership used to be required for the boot disk. It is still a standard,
and there is no real reason to violate it.)
If possible, the boot mirror should be cylinder-aligned with the boot disk. (This means
that the partition layout should be the same as that for the boot disk.) It is preferred that 1-2MB of unpartitioned space be left at either the very beginning or the very end of the
cylinder list for the VxVM private region. Ideally, slices 3 and 4 should be left unconfigured
for VxVM's use as its public and private regions. (If the cylinders are aligned, it will make OS
and VxVM upgrades easier in the future.)
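Cylinder alignment can be checked, and the boot disk's label copied to the mirror, with standard Solaris tools; a sketch with illustrative device names:

```shell
# Compare the partition layouts of the boot disk and the intended mirror
prtvtoc /dev/rdsk/c0t0d0s2
prtvtoc /dev/rdsk/c0t1d0s2

# Copy the boot disk's VTOC to the mirror so the layouts match
prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c0t1d0s2
```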
(Before bringing the boot mirror into the bootdg disk group, I usually run an
installboot command on that disk to install the boot block in slice 0. This should no longer be
necessary; vxrootmir should take care of this for us. I have run into circumstances in the
past where vxrootmir has not set up the boot block properly; Veritas reports that those bugs
have long since been fixed.)
Mirrors of the root disk must be configured with "sliced" format and should live in the
bootdg disk group. They cannot be configured with cdsdisk format. If necessary, remove the
disk and re-add it in vxdiskadm.
In vxdiskadm, choose option 6: Mirror Volumes on a Disk. Follow the prompts from
the utility. It will call vxrootmir under the covers to take care of the boot disk setup portion
of the operation.
When the process is done, attempt to boot from the boot mirror. (Check the
EEPROM devalias settings to see which device alias has been assigned to the boot mirror,
and run boot device-alias from the ok> prompt.)
Procedure to create a Mirrored-Stripe Volume: (A mirrored-stripe volume mirrors several
striped plexes; for most purposes it is better to set up a Striped-Mirror Volume instead.)
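Both layouts can be requested directly from vxassist; a sketch with illustrative disk group, volume, and size values:

```shell
# Mirrored-stripe: stripe first, then mirror the striped plexes
vxassist -g mydg make stripevol 2g layout=mirror-stripe ncol=3

# Striped-mirror: mirror below the striping layer (better recovery behavior)
vxassist -g mydg make smvol 2g layout=stripe-mirror ncol=3
```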
vxvol stop new-volume-name (To re-associate this plex with the old volume.)
vxedit rm new-volume-name
To unencapsulate the boot disk and return it to OS control, run /etc/vx/bin/vxunroot
http://solaristroubleshooting.blogspot.in/2013/06/veritas-volume-managernotes.html
Create Device Tree: The hardware device tree will be built. This device tree can be
explored using PROM monitor commands at the ok> prompt, or by using prtconf once the
system has been booted.
Extended Diagnostics: If diag-switch? and diag-level are set, additional diagnostics will
appear on the system console.
auto-boot?: If the auto-boot? PROM parameter is set, the boot process will begin.
Otherwise, the system will drop to the ok> PROM monitor prompt, or (if sunmon-compat?
and security-mode are set) the > security prompt.
The boot process will use the boot-device and boot-file PROM parameters unless diag-switch? is set. In this case, the boot process will use the diag-device and diag-file.
bootblk: The OBP (Open Boot PROM) program loads the bootblk primary boot program
from the boot-device (or diag-device, if diag-switch? is set). If the bootblk is not present
or needs to be regenerated, it can be installed by running the installboot command after
booting from a CDROM or the network. A copy of the bootblk is available
at /usr/platform/`arch -k`/lib/fs/ufs/bootblk
ufsboot: The secondary boot program, /platform/`arch -k`/ufsboot, is run. This
program loads the kernel core image files. If this file is corrupted or missing, a "bootblk:
can't find the boot program" or similar error message will be returned.
kernel: The kernel is loaded and run. For 32-bit Solaris systems, the relevant files are:
/platform/`arch -k`/kernel/unix
/kernel/genunix
For 64-bit Solaris systems, they are:
/platform/`arch -k`/kernel/sparcv9/unix
/kernel/genunix
As part of the kernel loading process, the kernel banner is displayed to the screen. This
includes the kernel version number (including patch level, if appropriate) and the copyright
notice.
The kernel initializes itself and begins loading modules, reading the files with
the ufsboot program until it has loaded enough modules to mount the root filesystem itself.
At that point, ufsboot is unmapped and the kernel uses its own drivers. If the system
complains about not being able to write to the root filesystem, it is stuck in this part of the
boot process.
The boot -a command single-steps through this portion of the boot process. This can be a
useful diagnostic procedure if the kernel is not loading properly.
/etc/system: The /etc/system file is read by the kernel, and the system parameters are
set.
The following types of customization are available in the /etc/system file:
rootfs: Specify the file system type for the root file system. (ufs is the default.)
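As a sketch, a small /etc/system fragment using this directive alongside two other common ones (the values shown are illustrative, not recommendations):

```
* /etc/system comments begin with an asterisk
* File system type for the root file system (ufs is the default)
rootfs:ufs

* Force a driver module to load at boot
forceload: drv/vxio

* Set a tunable kernel parameter
set maxusers=512
```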
http://solaristroubleshooting.blogspot.in/2013/03/solaris-sparc-boot-sequence.html