Unless otherwise noted, articles © 2005-2008 Doug Spencer, SecurityBulletins.com. Linking to articles is welcomed. Articles on this site are general information and are NOT GUARANTEED to work for your specific needs. I offer paid professional consulting services and will be happy to develop custom solutions for your specific needs. View the consulting page for more information.
Veritas Cluster Server (VCS) Troubleshooting
From SecurityBulletins.com
Written by Doug Spencer
The following may be helpful for troubleshooting Veritas Cluster problems. If you need Veritas Cluster expertise, I offer consulting services.
Commands to check cluster status
hastatus -sum # Show a summary of resources
hastatus # Show the running status of resources. VCS keeps tracking resources even in frozen resource groups, so you can verify that VCS correctly detects the status of a resource when you manually bring it online or offline.
hagrp -clear GROUP_NAME # Clear a faulted resource group
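On a cluster with many groups it can help to filter the summary down to the faulted ones before clearing anything. The following is a minimal sketch, assuming the usual hastatus -sum layout in which group lines start with B and the state is the sixth field. The function name faulted_groups is hypothetical.

```shell
# faulted_groups: read `hastatus -sum` output on stdin and print
# "GROUP SYSTEM" for every group line whose state includes FAULTED.
# Assumes group lines look like:
#   B  GROUP  SYSTEM  PROBED  AUTODISABLED  STATE
faulted_groups() {
    awk '$1 == "B" && $6 ~ /FAULTED/ { print $2, $3 }'
}

# Typical use on a cluster node:
#   hastatus -sum | faulted_groups
```

Each group it reports can then be cleared with hagrp -clear GROUP_NAME once the underlying problem is fixed.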
Using gabconfig -a output to determine problems
gabconfig -a # Shows the state of the VCS resources required to implement clustering.
The letters returned from gabconfig -a mean the resource is available on a particular node:
- a: GAB driver
- b: I/O fencing (designed to guarantee data integrity)
- d: ODM (Oracle Disk Manager)
- f: CFS (Cluster File System)
- h: VCS (VERITAS Cluster Server: high availability daemon)
- o: VCSMM driver (kernel module needed for Oracle and VCS interface)
- q: QuickLog daemon
- v: CVM (Cluster Volume Manager)
- w: vxconfigd (module for CVM)
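To avoid looking the letters up by hand, the mapping can be wrapped in a small helper. This is a sketch only: the function names gab_port_name and gab_ports are hypothetical, and it assumes gabconfig -a prints lines of the form "Port a gen a36e0003 membership 01".

```shell
# gab_port_name: translate a GAB port letter into the component it
# represents, using the letters listed above.
gab_port_name() {
    case "$1" in
        a) echo "GAB driver" ;;
        b) echo "I/O fencing" ;;
        d) echo "ODM (Oracle Disk Manager)" ;;
        f) echo "CFS (Cluster File System)" ;;
        h) echo "VCS (high availability daemon)" ;;
        o) echo "VCSMM driver" ;;
        q) echo "QuickLog daemon" ;;
        v) echo "CVM (Cluster Volume Manager)" ;;
        w) echo "vxconfigd (module for CVM)" ;;
        *) echo "unknown port" ;;
    esac
}

# gab_ports: read `gabconfig -a` output on stdin and print one line per
# port that reports a membership, for example "a GAB driver".
gab_ports() {
    awk '$1 == "Port" && $5 == "membership" { print $2 }' |
    while read -r port; do
        echo "$port $(gab_port_name "$port")"
    done
}

# Typical use on a cluster node:
#   gabconfig -a | gab_ports
```

A port that is missing from the result on one node but present on the others is a good place to start looking.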
GAB driver (Port a)
The /etc/gabtab file will contain the number of nodes defined in the cluster. During an initial build, the cluster won't fully start until all nodes are seen. The gabtab is in the following format:
/sbin/gabconfig -c -n2
Where -n2 specifies there are 2 nodes required to "seed" the cluster. That number should reflect the actual number of nodes in the cluster. Once that number of nodes is seen, the "Port a" membership is established. Running gabconfig -a | grep "Port a" will show the current membership ID and count for the Port a membership. This check is in place to prevent split-brain conditions and the resulting data corruption that occurs if the cluster starts two or more mini-clusters and related resources.
If you are certain that no split-brain condition is happening, gabconfig -cx can be used to manually bypass the protection from pre-existing partitions.
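Before bypassing the protection, it is worth confirming that the seed count in /etc/gabtab matches the number of nodes you expect. The sketch below assumes the gabtab format shown above. The function names gab_seed_count and check_seed_count are hypothetical.

```shell
# gab_seed_count: read gabtab content on stdin and print the number given
# to -n (the number of nodes required to seed the cluster).
gab_seed_count() {
    sed -n 's/.*-n *\([0-9][0-9]*\).*/\1/p'
}

# check_seed_count EXPECTED: compare the gabtab seed count on stdin with
# the number of nodes you expect in the cluster.
check_seed_count() {
    expected=$1
    actual=$(gab_seed_count)
    if [ "$actual" = "$expected" ]; then
        echo "ok: gabtab seeds with $actual nodes"
    else
        echo "mismatch: gabtab has -n${actual:-?}, expected $expected"
        return 1
    fi
}

# Typical use on a two-node cluster:
#   check_seed_count 2 < /etc/gabtab
```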
IOFencing driver (Port b)
Port b/IOFencing is started as a result of the /etc/rc2.d/S97vxfen start script. It performs the following actions:
- reads /etc/vxfendg to determine name of the diskgroup (DG) that contains the coordinator disks
- parses "vxdisk -o alldgs list" output for list of disks in that DG
- performs a "vxdisk list diskname" for each to determine all available paths to each coordinator disk
- uses all paths to each disk in the DG to build a current /etc/vxfentab
The goal of these steps is for the IOFencing driver to find the same shared disk on every node to use as the coordinator disk.
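The first two steps above can be reproduced by hand to check what a node actually sees. This is a minimal sketch, assuming the usual vxdisk list column order (DEVICE, TYPE, DISK, GROUP, STATUS) and that deported groups appear in parentheses. The function name disks_in_dg is hypothetical.

```shell
# disks_in_dg DG: read `vxdisk -o alldgs list` output on stdin and print
# the device names that belong to disk group DG. Both "DG" and "(DG)" are
# matched, since deported groups are assumed to be shown in parentheses.
disks_in_dg() {
    awk -v dg="$1" '$4 == dg || $4 == "(" dg ")" { print $1 }'
}

# Typical use on a cluster node, following the steps above:
#   dg=$(cat /etc/vxfendg)
#   vxdisk -o alldgs list | disks_in_dg "$dg"
```

If a node prints nothing here, it probably cannot see the coordinator disk group at all, which would explain Port b failing to start on that node.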
Oracle Disk Manager/ODM (Port d)
This port is started by the commands in /etc/rc2.d/S92odm
Cluster File System/CFS (Port f)
CFS can be reloaded in several ways if required, but doing so means unloading much of VxFS, and it is rarely necessary.
Veritas Cluster Server/VCS (Port h)
This is the cluster daemon itself.
CVM (ports v and w)
Cluster Volume Manager allows disk groups to be imported as shared so that multiple nodes in the Veritas cluster can use them at the same time. You must have the IOFencing driver running before you can start CVM. You can check CVM status with the following commands:
- gabconfig -a | egrep "Port v|Port w"
- vxdctl -c mode
- vxclustadm -v nodestate
For debugging purposes, you can start CVM manually with the following command on each node:
vxclustadm -m vcs -t gab startnode
vxclustadm: initialization completed
All diskgroups with disks marked with "shared flag" should now automatically be imported shared. You can check their status with:
vxdg list
and look for "enabled,shared" in the result for each shared disk group.
To see if a disk has the shared flag, run:
vxdisk -o alldgs list | grep shared
and
vxdisk list DISKNAME
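The vxdg list check above can be scripted so that only the shared disk groups are printed. This is a sketch, assuming vxdg list prints NAME, STATE and ID columns. The function name shared_dgs is hypothetical.

```shell
# shared_dgs: read `vxdg list` output on stdin and print the names of
# disk groups whose STATE column contains both "enabled" and "shared".
shared_dgs() {
    awk '$2 ~ /enabled/ && $2 ~ /shared/ { print $1 }'
}

# Typical use on a cluster node:
#   vxdg list | shared_dgs
```

Any disk group you expected to be shared that does not show up in the result was not imported shared on that node.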
QuickLog daemon (Port q)
To reload the QuickLog daemon:
# ps -ef| grep qlog
root 2099 1 0 13:04:44 ? 0:00 /opt/VRTSvxfs/sbin/qlogckd
# kill -9 2099
# modinfo | grep qlog
195 7821e000 17fc7 208 1 qlog (VxQLOG 3.5_REV-MP1f QuickLog dr)
# modunload -i 195
# /opt/VRTSvxfs/sbin/qlogckd
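The module id passed to modunload changes from boot to boot, so it has to be read from modinfo each time. A small helper for that step might look like the following sketch, which assumes the modinfo column layout shown above. The function name qlog_module_id is hypothetical.

```shell
# qlog_module_id: read `modinfo` output on stdin and print the module id
# (first column) of the line whose module name (sixth column) is qlog.
qlog_module_id() {
    awk '$6 == "qlog" { print $1 }'
}

# Typical use, after the qlogckd process has been killed:
#   id=$(modinfo | qlog_module_id)
#   [ -n "$id" ] && modunload -i "$id"
```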
VCSMM (Port o)
VCSMM is required for RAC communications. It is loaded by the /etc/rc2.d/S98vcsmm start script.
Changing cluster status
hagrp -online RESOURCE_GROUP -sys SYSTEM # Bring a resource group online on a particular system
hagrp -switch RESOURCE_GROUP -to SYSTEM # Move a resource group to a particular system
hagrp -autoenable RESOURCE_GROUP # Enable a group that has been autodisabled.
Editing cluster configuration
/etc/VRTSvcs/conf/config/main.cf # The main configuration file for VCS.
I usually copy the config directory elsewhere and work on the copy. In the copied directory, run hacf -verify . first, then hacf -cftocmd . followed by hacf -cmdtocf . to rebuild the dependency mapping in main.cf. When the result looks good, put the main.cf in place and activate it. Running only hacf -verify misses some problems in main.cf and does not rebuild the dependency tree diagram in the file.
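The copy, verify and regenerate sequence can be wrapped in a function so the live directory is never touched until you are satisfied. This is only a sketch of the steps described above. The function name rebuild_main_cf and the VCS_CONF and HACF override variables are hypothetical, added so the sketch can be tried away from a live cluster.

```shell
# rebuild_main_cf WORKDIR: copy the VCS config directory into WORKDIR,
# then run the three hacf steps there.
# VCS_CONF (source directory) and HACF (command to run) are hypothetical
# overrides for this sketch; by default the real paths are used.
rebuild_main_cf() {
    workdir=$1
    conf=${VCS_CONF:-/etc/VRTSvcs/conf/config}
    hacf=${HACF:-hacf}

    mkdir -p "$workdir" || return 1
    cp -r "$conf"/. "$workdir"/ || return 1
    (
        cd "$workdir" || exit 1
        "$hacf" -verify .  || exit 1   # check the copied main.cf
        "$hacf" -cftocmd . || exit 1   # convert the configuration to commands
        "$hacf" -cmdtocf . || exit 1   # regenerate main.cf from the commands
    ) || return 1
    echo "rebuilt config is in $workdir; review it before putting it in place"
}

# Typical use on a cluster node:
#   rebuild_main_cf /var/tmp/vcs-config-work
```

The function stops at the first failing hacf step, so a non-zero return means the copy should not be put in place.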
tail /var/VRTSvcs/log/engine_A.log # The VCS engine log file
vxdctl -c mode # Determine current node status when using CVM
lltstat # Prints Low Latency Transport (LLT) statistics similar to the following, which help diagnose LLT problems:
LLT statistics:
15903 Snd data packets
469 Snd retransmit data
4384 Snd connect packets
2999 Snd independent ACKs
10355 Snd piggyback ACKs
0 Snd independent NACKs
0 Snd piggyback NACKs
4138 Snd loopback packets
15749 Rcv data packets
586 Rcv out of window
0 Rcv duplicates
0 Rcv datagrams dropped
0 Rcv multiblock data
0 Rcv misaligned data
LLT errors:
0 Rcv not connected
0 Rcv unconfigured
0 Rcv bad dest address
0 Rcv bad source address
0 Rcv bad generation
0 Rcv no buffer
0 Rcv malformed packet
0 Rcv bad dest SAP
0 Rcv bad STREAM primitive
0 Rcv bad DLPI primitive
0 Rcv DLPI error
26 Snd not connected
0 Snd no buffer
0 Snd stream flow drops
26 Snd no links up
0 Rcv bad checksum
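Most of the error counters are normally zero, so the interesting lines are the ones that are not. The following sketch filters the output above down to those lines. The function name llt_nonzero_errors is hypothetical, and it assumes the error counters follow an "LLT errors:" line as shown.

```shell
# llt_nonzero_errors: read `lltstat` output on stdin and print every
# counter after the "LLT errors:" line whose value is greater than zero.
llt_nonzero_errors() {
    awk '/LLT errors:/ { in_errors = 1; next }
         in_errors && $1 ~ /^[0-9]+$/ && $1 > 0 { print }'
}

# Typical use on a cluster node:
#   lltstat | llt_nonzero_errors
```

Run against the sample output above, it prints the two "Snd not connected" and "Snd no links up" lines, both at 26.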
Running lltstat -nvv shows a verbose status of each Low Latency Transport (LLT) interface. Use it to check that each interface is plugged into the right destination: it shows what the node thinks its own interface name is and what it thinks the remote interface names are. Running the command on all nodes gives a map of the overall LLT network.
Files
/etc/gabtab
/etc/llttab
Other common problems
SCSI reservations on a RAC or other cluster file system install sometimes cause problems when a reservation remains held by a node that is no longer available.
Consulting
Put my experience to work to improve your Veritas Cluster Server infrastructure. I offer professional consulting services. E-mail sales@securitybulletins.com to set up a service contract.
