11g R2 RAC: NODE EVICTION DUE TO MEMBER KILL ESCALATION
November 9, 2012 | 11g R2 RAC, Uncategorized
If the Oracle Clusterware itself is working perfectly but one of the RAC instances is hanging, the database LMON process will request a member kill escalation and ask the CSS process to remove the hanging database instance from the cluster.
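These background processes can be observed at the OS level; a minimal sketch (instance name orcl1 as in the example below):
# List the LMON (Global Enqueue Service Monitor) and LMS (Global Cache
# Service) background processes of the local instance
[oracle@host01 ~]$ ps -ef | egrep 'ora_(lmon|lms[0-9])_orcl1'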
The following example demonstrates this in a two-node cluster:
SQL> col host_name for a20
SQL> select instance_name, host_name from gv$instance;
INSTANCE_NAME    HOST_NAME
---------------- --------------------
orcl1            host01.example.com
orcl2            host02.example.com
- On the host02 server, stop the execution of all RDBMS processes by sending them the STOP signal.
Find the current database processes:
[root@host02 ~]# ps -ef | grep ora_ | grep orcl2
oracle 6215 1 0 11:20 ? 00:00:00 ora_pmon_orcl2
oracle 6217 1 0 11:20 ? 00:00:00 ora_vktm_orcl2
oracle 6221 1 0 11:20 ? 00:00:00 ora_gen0_orcl2
oracle 6223 1 0 11:20 ? 00:00:00 ora_diag_orcl2
oracle 6225 1 0 11:20 ? 00:00:00 ora_dbrm_orcl2
oracle 6227 1 0 11:20 ? 00:00:00 ora_ping_orcl2
oracle 6229 1 0 11:20 ? 00:00:00 ora_psp0_orcl2
oracle 6231 1 0 11:20 ? 00:00:00 ora_acms_orcl2
oracle 6233 1 0 11:20 ? 00:00:00 ora_dia0_orcl2
oracle 6235 1 0 11:20 ? 00:00:00 ora_lmon_orcl2
oracle 6237 1 0 11:20 ? 00:00:02 ora_lmd0_orcl2
Stop the execution of all the RDBMS processes by sending the STOP signal:
[root@host02 ~]# ps -ef | grep ora_ | grep orcl2 | awk '{print $2}' | while read PID
do
kill -STOP $PID
done
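It can be verified that the processes are really suspended: a process that has received SIGSTOP reports state T in the ps STAT column. A minimal sketch, checking PMON as a representative:
# A stopped process shows "T" in the STAT column
[root@host02 ~]# ps -o pid,stat,cmd -C ora_pmon_orcl2
(The processes could later be resumed with kill -CONT, but here they are deliberately left suspended so that the eviction proceeds.)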
- From the client's point of view, the Real Application Clusters database is now hanging on both nodes; no queries or DML are possible. Try to execute a query; it will hang.
SQL> select instance_name, host_name from gv$instance;
(no output; the query hangs)
- Due to missing heartbeats, the healthy RAC instance on node host01 will remove the hanging RAC instance by requesting a member kill escalation.
Check the database alert log file on host01: the LMS process issues a request to CSSD to reboot the node. The node is evicted, and the instance is restarted after the node rejoins the cluster.
[root@host01 trace]# tailf /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/alert_orcl1.log
LMS0 (ospid: 31771) has detected no messaging activity from instance 2
LMS0 (ospid: 31771) issues an IMR to resolve the situation
Please check LMS0 trace file for more detail.
Fri Nov 09 11:15:04 2012
Remote instance kill is issued with system inc 30
Remote instance kill map (size 1) : 2
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x20000000
The instance eviction map is 2
Fri Nov 09 11:15:13 2012
IPC Send timeout detected. Sender: ospid 6308 [oracle@host01.example.com (PZ97)]
Receiver: inst 2 binc 429420846 ospid 6251
Waiting for instances to leave:
2
Reconfiguration started (old inc 4, new inc 8)
List of instances:
1 (myinst: 1)
..
Recovery of instance 2 starts:
Global Resource Directory frozen
..
All grantable enqueues granted
Post SMON to start 1st pass IR
..
Instance recovery: looking for dead threads
Beginning instance recovery of 1 threads
Started redo scan
IPC Send timeout to 2.0 inc 4 for msg type 12 from opid 42
Completed redo scan
read 93 KB redo, 55 data blocks need recovery
Started redo application at
Thread 2: logseq 9, block 42
Recovery of Online Redo Log: Thread 2 Group 3 Seq 9 Reading mem 0
Mem# 0: +DATA/orcl/onlinelog/group_3.266.798828557
Mem# 1: +FRA/orcl/onlinelog/group_3.259.798828561
Completed redo application of 0.05MB
Completed instance recovery at
Thread 2: logseq 9, block 228, scn 1069404
52 data blocks read, 90 data blocks written, 93 redo k-bytes read
Thread 2 advanced to log sequence 10 (thread recovery)
Fri Nov 09 12:18:55 2012
..
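The same alert log can also be tailed through adrci, assuming the default ADR layout visible in the path above; a minimal sketch:
# Follow the alert log of instance orcl1 through the ADR command interpreter
[oracle@host01 ~]$ adrci exec="set homepath diag/rdbms/orcl/orcl1; show alert -tail -f"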
Check the clusterware alert log of host01.
The node is evicted and rebooted to rejoin the cluster:
[grid@host01 host01]$ tailf /u01/app/11.2.0/grid/log/host01/alerthost01.log
[cssd(14493)]CRS-1607:Node host02 is being evicted in cluster incarnation
247848838; details at (:CSSNM00007:) in
/u01/app/11.2.0/grid/log/host01/cssd/ocssd.log.
2012-11-09 11:15:56.140
[ohasd(12412)]CRS-8011:reboot advisory message from host: host02, component: mo103324, with time stamp: L-2012-11-09-11:15:56.580
[ohasd(12412)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-11-09 11:16:17.365
[cssd(14493)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 .
2012-11-09 11:16:17.400
[crsd(14820)]CRS-5504:Node down event reported for node host02.
Node host02 rejoins the cluster:
[cssd(14493)]CRS-1601:CSSD Reconfiguration complete. Active nodes are host01 host02 .
2012-11-09 12:18:52.713
[crsd(14820)]CRS-2772:Server host02 has been assigned to pool Generic.
2012-11-09 12:18:52.713
[crsd(14820)]CRS-2772:Server host02 has been assigned to pool ora.orcl.
- After the node rejoins the cluster and the instance is restarted, re-execute the query; it now succeeds:
SQL> conn sys/oracle@orcl as sysdba
SQL> col host_name for a20
SQL> select instance_name, host_name from gv$instance;
INSTANCE_NAME    HOST_NAME
---------------- --------------------
orcl1            host01.example.com
orcl2            host02.example.com
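The restarted instance can also be confirmed from the clusterware side; a minimal sketch using srvctl:
# Report the status of all instances of the orcl database
[oracle@host01 ~]$ srvctl status database -d orcl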
References:
http://www.unbreakablecloud.com/wordpress/2010/11/02/understanding-cluster-node-eviction/
Related links:
Home
11g R2 RAC Index
Node Eviction Due To Missing Network Heartbeat
Node Eviction Due To Missing Disk Heartbeat
Node Eviction Due To CSSD Agent Stopping
11g R2 RAC: Reboot-less Node Fencing
11g R2 RAC: Reboot-less Fencing With Missing Disk Heartbeat
11g R2 RAC: Reboot-less Fencing With Missing Network Heartbeat
===========
11g R2 RAC: NODE EVICTION DUE TO MISSING NETWORK HEARTBEAT
November 17, 2012 | 11g R2 RAC, Uncategorized
In this post, I will demonstrate node eviction due to a missing network heartbeat, i.e. a node will be evicted from the cluster if it can't communicate with the other nodes in the cluster. To simulate this, I will stop the private network on one of the nodes and then scan the alert logs of the surviving nodes.
Current scenario:
No. of nodes in the cluster : 3
Names of the nodes : host01, host02, host03
Name of the cluster database : orcl
I will stop the private network on host03 so that it is evicted.
Find the private network interface:
[root@host03 ~]# oifcfg getif
eth0 192.9.201.0 global public
eth1 10.0.0.0 global cluster_interconnect
Stop the private network on host03 so that it can't communicate with host01 and host02 and will be evicted:
[root@host03 ~]# ifdown eth1
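It can be confirmed that the interconnect interface is really down; a minimal sketch:
# The RUNNING flag should no longer be reported for eth1
[root@host03 ~]# ifconfig eth1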
OCSSD log of host03
It can be seen that the CSSD process of host03 can't communicate with host01 and host02 at 09:43:52. Hence the voting-disk timeout is set to the Short Disk Timeout (SDTO) = 27000 ms (27 seconds).
2012-11-19 09:43:52.714: [ CSSD][843736976]clssnmPollingThread: node host01 (1)
at 50% heartbeat fatal, removal in 14.880 seconds
2012-11-19 09:43:52.714: [ CSSD][843736976]clssnmPollingThread: node host01 (1)
is impending reconfig, flag 132108, misstime 15120
2012-11-19 09:43:52.714: [ CSSD][843736976]clssnmPollingThread: node host02 (2)
at 50% heartbeat fatal, removal in 14.640 seconds
2012-11-19 09:43:52.714: [ CSSD][843736976]clssnmPollingThread: node host02 (2)
is impending reconfig, flag 132108, misstime 15360
2012-11-19 09:43:52.714: [ CSSD][843736976]clssnmPollingThread: local
diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig
status(1)
2012-11-19 09:43:52.927: [ CSSD][2833247120]clssnmSendingThread: sending
status msg to all nodes
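The removal countdowns above are driven by the CSS misscount, i.e. the network heartbeat timeout (30 seconds by default on Linux). The configured values can be checked as below; a minimal sketch:
# Network heartbeat timeout in seconds
[root@host01 ~]# crsctl get css misscount
# Disk heartbeat timeout in seconds
[root@host01 ~]# crsctl get css disktimeout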
At 09:43:52, the CSSD process of host03 identifies that it can't communicate with the CSSD on host01 and host02:
[cssd(5124)]CRS-1612:Network communication with node host01 (1) missing for 50% of
timeout interval. Removal of this node from cluster in 14.880 seconds
2012-11-19 09:43:52.714
[cssd(5124)]CRS-1612:Network communication with node host02 (2) missing for 50% of
timeout interval. Removal of this node from cluster in 14.640 seconds
2012-11-19 09:44:01.880
[cssd(5124)]CRS-1611:Network communication with node host01 (1) missing for 75% of
timeout interval. Removal of this node from cluster in 6.790 seconds
2012-11-19 09:44:01.880
[cssd(5124)]CRS-1611:Network communication with node host02 (2) missing for 75% of
timeout interval. Removal of this node from cluster in 6.550 seconds
2012-11-19 09:44:06.536
[cssd(5124)]CRS-1610:Network communication with node host01 (1) missing for 90% of
timeout interval. Removal of this node from cluster in 2.780 seconds
2012-11-19 09:44:06.536
[cssd(5124)]CRS-1610:Network communication with node host02 (2) missing for 90% of
timeout interval. Removal of this node from cluster in 2.540 seconds
2012-11-19 09:44:09.599
At 09:44:16, the CSSD process of host03 reboots the node to preserve cluster integrity:
[cssd(5124)]CRS-1609:This node is unable to communicate with other nodes in the
cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in
/u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
2012-11-19 09:44:16.697
[/u01/app/11.2.0/grid/bin/orarootagent.bin(5713)]CRS-5822:Agent
/u01/app/11.2.0/grid/bin/orarootagent_root disconnected from server. Details at
(:CRSAGF00117:) in
/u01/app/11.2.0/grid/log/host03/agent/crsd/orarootagent_root/orarootagent_root.log.
2012-11-19 09:44:16.193
[ctssd(5285)]CRS-2402:The Cluster Time Synchronization Service aborted on host
host03. Details at (:ctsselect_mmg5_1: in
/u01/app/11.2.0/grid/log/host03/ctssd/octssd.log.
2012-11-19 09:44:21.177
Related links:
Home
11g R2 RAC Index
Node Eviction Due To Missing Disk Heartbeat
Node Eviction Due To Member Kill Escalation
Node Eviction Due To CSSD Agent Stopping
11g R2 RAC: Reboot-less Node Fencing
11g R2 RAC: Reboot-less Fencing With Missing Disk Heartbeat
11g R2 RAC: Reboot-less Fencing With Missing Network Heartbeat
=========
11g R2 RAC: NODE EVICTION DUE TO MISSING DISK HEARTBEAT
Current scenario:
No. of nodes in the cluster : 3
Names of the nodes : host01, host02, host03
Name of the cluster database : orcl
Stop the iSCSI service on host03 so that it can't access the shared storage and hence the voting disks.
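A minimal sketch of this step, assuming the shared storage is presented through the standard Linux iSCSI initiator service (the service name may differ between distributions):
# Stop the iSCSI initiator so that the voting disks become inaccessible
[root@host03 ~]# service iscsi stop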
Note that the ocssd process of host03 is not able to access the voting disks:
[cssd(5149)]CRS-1649:An I/O error occured for voting file: ORCL:ASMDISK01;
details at (:CSSNM00059:) in /u01/app/11.2.0/grid/log/host03/cssd/ocssd.log.
..
[client(8050)]CRS-10001:6323 6363 6391 6375 6385 6383 6402 6319 6503 6361 6377
6505 6389 6369 6335 6367 6333 6387 6871 6325 6381 6327 6496 6498 6552 6373
7278 6339 6400 6357 6500 6329 6365
[client(8052)]CRS-10001:ACFS-9113: These processes will now be terminated.
[client(8127)]CRS-10001:ACFS-9114: done.
[client(8178)]CRS-10001:ACFS-9114: done.
2012-11-17 03:33:34.050
[/u01/app/11.2.0/grid/bin/orarootagent.bin(5682)]CRS-5016:Process
/u01/app/11.2.0/grid/bin/acfssinglefsmount spawned by agent
At 03:34, the voting disks can't be accessed even after waiting for the timeout:
2012-11-17 03:34:10.718
[cssd(5149)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting
file ORCL:ASMDISK01 will be considered not functional in 99190 milliseconds
2012-11-17 03:34:10.724
[cssd(5149)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting
file ORCL:ASMDISK02 will be considered not functional in 99180 milliseconds
2012-11-17 03:34:10.724
[cssd(5149)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting
file ORCL:ASMDISK03 will be considered not functional in 99180 milliseconds
2012-11-17 03:35:10.666
[cssd(5149)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting
file ORCL:ASMDISK01 will be considered not functional in 49110 milliseconds
2012-11-17 03:35:10.666
[cssd(5149)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting
file ORCL:ASMDISK02 will be considered not functional in 49110 milliseconds
2012-11-17 03:35:10.666
[cssd(5149)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting
file ORCL:ASMDISK03 will be considered not functional in 49110 milliseconds
2012-11-17 03:35:46.654
[cssd(5149)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting
file ORCL:ASMDISK01 will be considered not functional in 19060 milliseconds
2012-11-17 03:35:46.654
[cssd(5149)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting
file ORCL:ASMDISK02 will be considered not functional in 19060 milliseconds
2012-11-17 03:35:46.654
[cssd(5149)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting
file ORCL:ASMDISK03 will be considered not functional in 19060 milliseconds
..
CSSD of host03 reboots the node, as the number of voting disks available (0) is less than the minimum required (2):
2012-11-17 03:36:15.645
..
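The configured voting files can be listed from a surviving node; with three voting disks, a majority of two must remain accessible for a node to stay in the cluster. A minimal sketch:
# List the configured voting files and their current states
[root@host01 ~]# crsctl query css votedisk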
I/O fencing for the orcl database is carried out by CSSD at 03:32 (the same time as when host02 got the message that orcl has failed on host03):
2012-11-17 03:32:10.356: [ CSSD][997865360]clssgmFenceClient: fencing client
(0xaa14990), member 2 in group DBORCL, no share, death fence 1, SAGE fence 0
..
After network communication can't be established for the timeout interval, the node is removed from the cluster:
2012-11-17 03:36:46.572
..
Note that the ocssd process of host01 discovers the missing disk heartbeat from host03 at 03:32:16.
At 03:32, the CRSD process of host02 receives a message that the orcl database has failed on host03:
2012-11-17 03:32:44.303
..
The CRSD process of host02 receives a message that ACFS has failed on host03:
2012-11-17 03:36:16.981
..
Note that the ocssd process of host02 discovers the missing host03 only after it has been rebooted, at 03:36:
[root@host02 ~]# tailf /u01/app/11.2.0/grid/log/host02/cssd/ocssd.log
..
References:
http://www.unbreakablecloud.com/wordpress/2010/11/02/understanding-cluster-node-eviction/
Related links:
Home