Thursday, September 9, 2010

Oracle 11gR2 Grid: root.sh fails on node2, asmlib issue

A very interesting problem that took me quite a while to resolve.
While running root.sh on node 2 during an 11gR2 grid software installation, I received the following error.
Error in root.sh:

DiskGroup DG_SYS01 creation failed with the following message:
ORA-15018: diskgroup cannot be created
ORA-15031: disk specification 'ORCL:DISK0' matches no disks


Configuration of ASM failed, see logs for details
Did not succssfully configure and start ASM
CRS-2500: Cannot stop resource 'ora.crsd' as it is not running
CRS-4000: Command Stop failed, or completed with errors.
Command return code of 1 (256) from command: /oragrid/product/11.2/bin/crsctl stop resource ora.crsd -init
Stop of resource "ora.crsd -init" failed
Failed to stop CRSD


Error in ASM alert log:

ORA-15183: ASMLIB initialization error [driver/agent not installed]
WARNING: FAILED to load library: /opt/oracle/extapi/64/asm/orcl/1/libasm.so
ERROR: diskgroup DG_SYS01 was not mounted
NOTE: cache deleting context for group DG_SYS01 1/-239075992
WARNING: Disk Group DG_SYS01 containing configured OCR is not mounted
ORA-15032: not all alterations performed
ORA-15017: diskgroup "DG_SYS01" cannot be mounted
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "DG_SYS01"
ERROR: ALTER DISKGROUP ALL MOUNT
Wed Sep 08 22:09:42 2010
SQL> CREATE DISKGROUP DG_SYS01 EXTERNAL REDUNDANCY DISK 'ORCL:DISK0' ATTRIBUTE 'compatible.asm'='11.2.0.0.0' /* ASMCA */
ORA-15018: diskgroup cannot be created
ORA-15031: disk specification 'ORCL:DISK0' matches no disks
ERROR: CREATE DISKGROUP DG_SYS01 EXTERNAL REDUNDANCY DISK 'ORCL:DISK0' ATTRIBUTE 'compatible.asm'='11.2.0.0.0' /* ASMCA */
kfdp_dismount(): 3
kfdp_dismountBg(): 3
ERROR: diskgroup DG_SYS01 was not created


What puzzled me was why node2 was trying to run CREATE DISKGROUP at all, when root.sh on node1 had already run successfully and created the diskgroup.
Other errors, in cssd.log:

2010-09-08 20:06:14.856: [ SKGFD][1151920448]ERROR: -14(asmlib /opt/oracle/extapi/64/asm/orcl/1/libasm.so version failed with 2)
...
2010-09-08 20:06:14.856: [ SKGFD][1151920448]Discovery skipping bad asmlib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so:

2010-09-08 20:06:14.856: [ CSSD][1151920448]clssnmvDiskVerify: Successful discovery of 0 disks
2010-09-08 20:06:14.856: [ CSSD][1151920448]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2010-09-08 20:06:14.857: [ CSSD][1151920448]clssnmvFindInitialConfigs: No voting files found
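
The two telling lines in this log are "Discovery skipping bad asmlib" and "No voting files found". As a minimal sketch, the helper below greps a log file for those two messages. The function name asmlib_log_symptoms is made up for illustration, and the log path is whatever your cssd log happens to be.

```shell
# Hypothetical helper: report ASMLib-related symptoms in a CSS daemon log.
# Usage: asmlib_log_symptoms /path/to/cssd.log
asmlib_log_symptoms() {
    log_file=$1
    # Both patterns are copied from the messages quoted in this post.
    if grep -q 'Discovery skipping bad asmlib' "$log_file"; then
        echo "asmlib library was rejected during disk discovery"
    fi
    if grep -q 'No voting files found' "$log_file"; then
        echo "no voting files were discovered"
    fi
}
```

If both messages are reported, the voting disks are most likely invisible only because the library could not be loaded, not because the disks themselves are missing.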


This pointed to a problem with the oracleasm library.
But I was able to execute all the /etc/init.d/oracleasm commands without any problems,
e.g. /etc/init.d/oracleasm listdisks --> this showed all the disks correctly.

I tried rerunning /etc/init.d/oracleasm configure, but I still got the same error.

I used the steps described in http://jarneil.wordpress.com/2008/07/07/asmlib-troubleshooting/ to make sure the libraries were installed properly. That note was really helpful, and I would like to thank the author.
The libraries were installed properly.

I had also reinstalled the oracleasm packages:

[oracle@node2]/% rpm -qa |grep oracleasm
oracleasm-support-2.1.3-1.el5
oracleasm-2.6.18-128.el5debug-2.0.5-1.el5
oracleasm-2.6.18-128.el5-2.0.5-1.el5
oracleasm-2.6.18-128.el5xen-2.0.5-1.el5
oracleasm-2.6.18-128.el5-debuginfo-2.0.5-1.el5
oracleasmlib-2.0.4-1.el5
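
One standard check on a listing like this is that a kernel driver package exists for the kernel that is actually running (uname -r). The sketch below does that comparison. The function name has_asm_driver_for is invented for illustration, and it assumes the oracleasm-<kernel release>-<driver version> naming seen in the listing above.

```shell
# Hypothetical helper: read `rpm -qa` output on stdin and succeed only if an
# oracleasm kernel driver package matches the given kernel release.
# Usage: rpm -qa | has_asm_driver_for "$(uname -r)"
has_asm_driver_for() {
    kernel=$1
    while IFS= read -r pkg; do
        case "$pkg" in
            # Assumed naming: oracleasm-<kernel release>-<driver version>.
            # Requiring a digit after the release skips the -debuginfo package.
            "oracleasm-$kernel"-[0-9]*) return 0 ;;
        esac
    done
    return 1
}
```

For example, rpm -qa | has_asm_driver_for "$(uname -r)" || echo "no oracleasm driver for this kernel" prints the warning only when no matching package is installed.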


Finally, after a lot of searching (Google, Metalink, etc.), I found the Metalink note "FAQ ASMLIB CONFIGURE,VERIFY, TROUBLESHOOT [ID 359266.1]" and worked through all the checks it mentions.
That is how I figured out that the problem was with my /etc/sysconfig/oracleasm file. On the other servers this file is a symlink:

[oracle@node1]/% ls -lrt /etc/sysconfig/oracle*
-rw-r--r-- 1 root root 774 Sep 8 23:32 /etc/sysconfig/oracleasm-_dev_oracleasm
lrwxrwxrwx 1 root root 24 Sep 8 23:36 /etc/sysconfig/oracleasm -> oracleasm-_dev_oracleasm

But in my case it was:

[root@node2 sysconfig]# ls -lrt oracle*
-rw-r--r-- 1 root root 574 Mar 18 2009 oracleasm
lrwxrwxrwx 1 root root 24 Sep 7 23:30 oracleasm.rpmsave -> oracleasm-_dev_oracleasm
-rw-r--r-- 1 root root 774 Sep 8 23:32 oracleasm-_dev_oracleasm

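And all the parameters inside oracleasm were blank. As I understand it, libasm.so reads its settings from the oracleasm file, which would explain the ASMLIB initialization error.
So I replaced the blank file with a symlink (the old regular file has to be moved out of the way first, otherwise ln -s fails because the name already exists):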

[root@node2 sysconfig]# ln -s oracleasm-_dev_oracleasm oracleasm
[root@node2 sysconfig]# ls -lrt oracle*
lrwxrwxrwx 1 root root 24 Sep 7 23:30 oracleasm.rpmsave -> oracleasm-_dev_oracleasm
-rw-r--r-- 1 root root 774 Sep 8 23:32 oracleasm-_dev_oracleasm
lrwxrwxrwx 1 root root 24 Sep 8 23:36 oracleasm -> oracleasm-_dev_oracleasm
[root@node2 sysconfig]# rm oracleasm.rpmsave
rm: remove symbolic link `oracleasm.rpmsave'? y
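
The same comparison can be scripted so that every node can be checked before root.sh is run. This is only a sketch: the function name check_asm_sysconfig is made up, and it assumes the healthy layout is the one shown for node1 above, with oracleasm linking to oracleasm-_dev_oracleasm.

```shell
# Hypothetical helper: check that <dir>/oracleasm is a symlink to
# oracleasm-_dev_oracleasm, as it was on the healthy node.
# Usage: check_asm_sysconfig /etc/sysconfig
check_asm_sysconfig() {
    dir=$1
    link=$dir/oracleasm
    expected=oracleasm-_dev_oracleasm
    if [ ! -L "$link" ]; then
        echo "BAD: $link is not a symbolic link"
        return 1
    fi
    target=$(readlink "$link")
    if [ "$target" != "$expected" ]; then
        echo "BAD: $link points to $target, expected $expected"
        return 1
    fi
    echo "OK: $link -> $target"
}
```

Running check_asm_sysconfig /etc/sysconfig on node2 before the fix would have reported that oracleasm was not a symbolic link.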


Since I had already run root.sh once and it had failed, I was unable to simply run it again:

[root@node2 11.2]# ./root.sh
Running Oracle 11g root.sh script...

The following environment variables are set as:
ORACLE_OWNER= oracle
ORACLE_HOME= /oragrid/product/11.2

Enter the full pathname of the local bin directory: [/usr/local/bin]:
The file "dbhome" already exists in /usr/local/bin. Overwrite it? (y/n)
[n]: y
Copying dbhome to /usr/local/bin ...
The file "oraenv" already exists in /usr/local/bin. Overwrite it? (y/n)
[n]: y
Copying oraenv to /usr/local/bin ...
The file "coraenv" already exists in /usr/local/bin. Overwrite it? (y/n)
[n]: y
Copying coraenv to /usr/local/bin ...

Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root.sh script.
Now product-specific root actions will be performed.
2010-09-08 23:37:12: Parsing the host name
2010-09-08 23:37:12: Checking for super user privileges
2010-09-08 23:37:12: User has super user privileges
Using configuration parameter file: /oragrid/product/11.2/crs/install/crsconfig_params
CRS is already configured on this node for crshome=0
Cannot configure two CRS instances on the same cluster.
Please deconfigure before proceeding with the configuration of new home.


So first we have to deconfigure the previous run of root.sh:

[root@node2 11.2]# crs/install/rootcrs.pl -verbose -deconfig -force
2010-09-08 23:37:48: Parsing the host name
2010-09-08 23:37:48: Checking for super user privileges
2010-09-08 23:37:48: User has super user privileges
Using configuration parameter file: crs/install/crsconfig_params
PRCR-1035 : Failed to look up CRS resource ora.cluster_vip.type for 1
PRCR-1068 : Failed to query resources
Cannot communicate with crsd
PRCR-1070 : Failed to check if resource ora.gsd is registered
Cannot communicate with crsd
PRCR-1070 : Failed to check if resource ora.ons is registered
Cannot communicate with crsd
PRCR-1070 : Failed to check if resource ora.eons is registered
Cannot communicate with crsd

ACFS-9200: Supported
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Stop failed, or completed with errors.
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'node2'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'node2'
CRS-2677: Stop of 'ora.drivers.acfs' on 'node2' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'node2' has completed
CRS-4133: Oracle High Availability Services has been stopped.
error: package cvuqdisk is not installed
Successfully deconfigured Oracle clusterware stack on this node


After this I restarted oracleasm:

[root@node2 11.2]# /etc/init.d/oracleasm stop
Dropping Oracle ASMLib disks: [ OK ]
Shutting down the Oracle ASMLib driver: [ OK ]
[root@node2 11.2]# /etc/init.d/oracleasm start
Initializing the Oracle ASMLib driver: [ OK ]
Scanning the system for Oracle ASMLib disks: [ OK ]


Then I checked that the symlink was still in place, basically to see whether restarting oracleasm had changed anything:

lrwxrwxrwx 1 root root 24 Sep 8 23:36 oracleasm -> oracleasm-_dev_oracleasm

With everything looking fine, I reran root.sh and it completed successfully.
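
To recap the order of operations, here is a small sketch that only prints the recovery sequence and does not run anything. The function name print_recovery_plan is invented for illustration. The commands are the ones used in this post, all run as root on the failed node, and the Grid home is passed in as an argument.

```shell
# Hypothetical helper: print (not run) the recovery sequence used in this post.
# Usage: print_recovery_plan /oragrid/product/11.2
print_recovery_plan() {
    grid_home=$1
    cat <<EOF
cd $grid_home
crs/install/rootcrs.pl -verbose -deconfig -force
/etc/init.d/oracleasm stop
/etc/init.d/oracleasm start
ls -l /etc/sysconfig/oracleasm
./root.sh
EOF
}
```

Printing the plan first makes it easy to review the steps, or paste them into a change record, before touching the cluster.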

Some commands that I ran while hunting for the problem:

/etc/init.d/oracleasm listdisks
/etc/init.d/oracleasm start
/etc/init.d/oracleasm status
ls -rlt /dev/oracleasm/disks/
/etc/init.d/oracleasm querydisk DISK0
rpm -ql oracleasm-support
df -ha |grep asm
rpm -ql oracleasmlib
/usr/sbin/oracleasm-discover
/usr/sbin/oracleasm-discover 'ORCL:*'

3 Comments:

CJ Travis said...

This saved my butt! I ran into this issue today. Thank you, Apun!

Anonymous said...

Much thanks. Ran into a different issue but your deconfigure stuff saved my bacon.

Thanks for the details. Much appreciated.

Anonymous said...

This saved my life. Not once but twice. Very very good information, on how to back out things and restart. We did not have the sym link issue, and who knows what the real issue was, but after doing the deconfig and restarting oracleasm the root.sh ran. thanks for taking the time to put this together.

John