> rac crs issue
jamie055
post Apr 10 2012, 07:33 PM
Post #1


Newbie
*

Group: Members
Posts: 8
Joined: 3-January 12
From: Cameron, NC
Member No.: 46,523



I have a 3 node RAC server on Windows Server 2008. Last week the hard drive went out on one of the nodes and I have had to rebuild as I could not recover anything.

I removed the old node from the cluster and have just finished adding the new node back, following the documentation. Once I created the new instance on that server, DBCA attempted to start it and failed, giving me a CRS error. I found out later that the other 2 nodes went down and the new one, which did not start correctly, was the only one up!! I stopped the new instance and restarted the first 2. The associated services did not start with the instances, so I had to start each one manually. The trace files show an ORA-29702 error in the Cluster Group Service and the instance being stopped on both of the existing nodes. No other error messages stood out.

Now I cannot get any CRS services to start on that 3rd node, even if I attempt to start them manually. I have also tried stopping everything and restarting, and that does not work. I found another post of yours on this forum and followed it. The ASM service was fine the entire time according to all the logs. I don't know how to verify LMON in Windows, but I didn't see any LMON errors in the alert log. Also, the voting disks are online. Each node has its own and they are mirrored. Where else should I look?

Thanks for the help!
burleson
post Apr 11 2012, 06:53 AM
Post #2


Advanced Member
***

Group: Members
Posts: 11,730
Joined: 26-January 04
Member No.: 13



Hi Jamie,

>> Last week the hard drive went out on one of the nodes and I have had to rebuild as I could not recover anything.

Why not? You should have been able to fully recover from a lost disk.

****************************************
>> I have had to rebuild as I could not recover anything.

No no! This is the WRONG approach!

Losing data is NOT an option!

It sounds like you don't know enough RAC to perform this operation.

If you have purchased Oracle support, call them at 800-223-1711.

Else, call BC at 800-766-1884 and get a RAC expert to recover this for you.


--------------------
Hope this helps. . .

Donald K. Burleson
Oracle Press author
Author of Oracle Tuning: The Definitive Reference
jamie055
post Apr 11 2012, 07:28 AM
Post #3


Newbie
*

Group: Members
Posts: 8
Joined: 3-January 12
From: Cameron, NC
Member No.: 46,523



The disk that failed on the server corrupted the mirror disk as well. Dell had me perform various testing but were at a loss as to why that happened.

The other two nodes were completely functional during this and the vip address transferred over as it was supposed to. I opened an SR with Oracle and they recommended deleting and re-adding the node once it was rebuilt.

I have been looking through the logs again. The other two nodes are still completely stable this morning. Logs show when I added the new node to the cluster and the instance starting. Then I get this and it is repeated through all three nodes over and over.

Error: KGXGN polling error (15)
Errors in file f:\oracle\product\10.2.0\admin\flms\bdump\flms3_lmon_1320.trc:
ORA-29702: error occurred in Cluster Group Service operation

Then LMON terminates the instance due to the error, and further along I see the list of nodes as 2 with Global Resource Directory Frozen. The instance then attempts to start multiple times until I shut it down.

I can open another SR with Oracle and see what they have to say.
burleson
post Apr 11 2012, 07:50 AM
Post #4


Advanced Member
***

Group: Members
Posts: 11,730
Joined: 26-January 04
Member No.: 13



Hi,



>> The disk that failed on the server corrupted the mirror disk as well.

Wow! I have clients using Dell mirrored disks too.

Do you have a model number?

*************************************************
>> Dell had me perform various testing but were at a loss as to why that happened.

The hardware failed; it happens all the time. Remember, a hard disk is one of the few parts of a computer with moving parts.

Can you post the disk model number and server type?

**************************************************

>> The other two nodes were completely functional during this and the vip address transferred over as it was supposed to.

Oh sorry, I misunderstood. I thought you said "hard drive went out on one of the nodes and I have had to rebuild as I could not recover anything."

When you say "recover" to a DBA, they think you are talking about recovering the data, not the RAC node!

Let me research this for you . . . .


--------------------
Hope this helps. . .

Donald K. Burleson
Oracle Press author
Author of Oracle Tuning: The Definitive Reference
burleson
post Apr 11 2012, 07:54 AM
Post #5


Advanced Member
***

Group: Members
Posts: 11,730
Joined: 26-January 04
Member No.: 13



Hi Jamie,

This is most likely caused by cloning a RAC ORACLE_HOME to a server with no CRS defined.



*************************************
>> ORA-29702

Can you post the whole error, like this?


CODE
kjxgmpoll: terminate the CGS reconfig.
Error: KGXGN polling error (15)
error 29702 detected in background process
ORA-29702: error occurred in Cluster Group Service operation
ksuitm: waiting up to [5] seconds before killing DIAG


Please read this:

http://www.dba-oracle.com/t_ora_29702_ksuitm_waiting.htm


QUOTE
ORA-29702: error occurred in Cluster Group Service operation

Cause: An unexpected error occurred while performing a CGS operation.

Action: Verify that the LMON process is still active. Also, check the Oracle LMON trace files for errors.
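
As a rough illustration of the "check the LMON trace files" step, the Python sketch below lists the LMON trace files in a dump directory, newest first. On Windows, Oracle background processes such as LMON run as threads inside oracle.exe rather than as separate processes, so the trace files are often the easier place to start. The `*_lmon_*.trc` pattern is only inferred from the file names quoted in this thread (for example flms3_lmon_1320.trc), and the function name and directory argument are hypothetical; adjust both to your own bdump location.

```python
from pathlib import Path


def newest_lmon_traces(bdump_dir, limit=5):
    """Return up to `limit` LMON trace files from bdump_dir, newest first.

    Assumes trace files are named <instance>_lmon_<id>.trc, as in the
    file names quoted in this thread; this is not an Oracle API.
    """
    traces = Path(bdump_dir).glob("*_lmon_*.trc")
    # Sort by modification time so the most recent trace comes first.
    ordered = sorted(traces, key=lambda p: p.stat().st_mtime, reverse=True)
    return ordered[:limit]
```

Calling `newest_lmon_traces(r"f:\oracle\product\10.2.0\admin\flms\bdump")` on the affected node would give the traces to read first; the path here is simply the one shown in the alert log excerpt and may differ on your system.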



*******************************************

>> "crsctl start crs"

What message do you get when you run this command?
Also run this command:

crsctl check crs


--------------------
Hope this helps. . .

Donald K. Burleson
Oracle Press author
Author of Oracle Tuning: The Definitive Reference
burleson
post Apr 11 2012, 08:24 AM
Post #6


Advanced Member
***

Group: Members
Posts: 11,730
Joined: 26-January 04
Member No.: 13



Hi Jamie,

I see you are from Raleigh, I'm right up the road in Kittrell . . . .

If this is production, make sure to log a SEV1 SR on MOSC:

http://support.oracle.com


Also see MOSC note 1050908.1 "How to troubleshoot RAC CRS startup issues".

The very first thing to try is to re-boot the whole server and see if the CRS starts automatically.

See here, I wrote this for you:

http://www.dba-oracle.com/t_rac_crs_start_failure.htm

Try this, starting the CRS resources separately:


root> crsctl start resources
Starting resources.
Successfully started CRS resources



******************************************

Did you forget to run root.sh or root102.sh (depending on what you were trying to do)?

These lines in /etc/inittab (on UNIX and Linux; there is no inittab on Windows) are necessary to start CRS correctly:

h1:3:respawn:/sbin/init.d/init.evmd run >/dev/null 2>&1 </dev/null
h2:3:respawn:/sbin/init.d/init.cssd fatal >/dev/null 2>&1 </dev/null
h3:3:respawn:/sbin/init.d/init.crsd run >/dev/null 2>&1 </dev/null

For complete details on installing and configuring CRS (with working scripts), see the book Oracle 10g RAC and Grid, by Rampant TechPress.


--------------------
Hope this helps. . .

Donald K. Burleson
Oracle Press author
Author of Oracle Tuning: The Definitive Reference
jamie055
post Apr 11 2012, 11:43 AM
Post #7


Newbie
*

Group: Members
Posts: 8
Joined: 3-January 12
From: Cameron, NC
Member No.: 46,523



Mr. Burleson,

The server was a Dell R610 and I am working to see if someone has the model number of the disk. Sorry about the confusion!! Yes, only the server failed, and what was odd was that not only did disk 1 fail, but it also corrupted the mirrored disk.

What does bother me is that the new node took over, the other 2 failed, and we had no redundancy during that time!!

The procedures Oracle gave me were "RAC on Windows: How to Cleanup When a Node Has Been Disconnected or The OS Rebuilt" [ID 742737.1] and "Using Cloning in CRS/RAC Windows Environments to add a node" [ID 407086.1].

Each of the steps completed successfully and I thought everything was good.

Here is the whole error I am getting throughout the nodes:
Error: KGXGN polling error (15)
Tue Apr 10 12:55:21 2012
Errors in file
f:\oracle\product\10.2.0\admin\flms\bdump\flms1_lmon_6424.trc:
ORA-29702: error occurred in Cluster Group Service operation

LMON: terminating instance due to error 29702
Tue Apr 10 12:55:21 2012
Errors in file
f:\oracle\product\10.2.0\admin\flms\bdump\flms1_lms4_5820.trc:
ORA-29702: error occurred in Cluster Group Service operation
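
When the same error repeats across three alert logs, it can help to collect the trace file names per ORA- code before opening each trace. The Python sketch below does that for alert-log text in the two layouts seen in this thread ("Errors in file" with the path on the same line, or wrapped onto the next line). It is a minimal sketch, not an Oracle tool: the function name is hypothetical, and it assumes each "Errors in file" entry is followed by the ORA- line it belongs to.

```python
import re
from collections import defaultdict

# Matches an Oracle error code such as ORA-29702 at the start of a line.
ORA_CODE = re.compile(r"^(ORA-\d{5})\b")


def traces_by_error(alert_log_text):
    """Map each ORA- code to the trace files reported just before it.

    Handles "Errors in file <path>:" on one line as well as the path
    wrapped onto the following line.
    """
    found = defaultdict(list)
    current_trace = None
    expecting_path = False
    for raw in alert_log_text.splitlines():
        line = raw.strip()
        if line.startswith("Errors in file"):
            # Path may follow on this line or on the next one.
            rest = line[len("Errors in file"):].strip().rstrip(":")
            current_trace = rest or None
            expecting_path = not rest
        elif expecting_path and line:
            current_trace = line.rstrip(":")
            expecting_path = False
        else:
            match = ORA_CODE.match(line)
            if match and current_trace:
                found[match.group(1)].append(current_trace)
                current_trace = None
    return dict(found)
```

Feeding it the contents of each node's alert log and comparing the results shows which background processes (LMON, LMS and so on) reported the error on which node.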

When I run crsctl check crs, everything comes back normal on all of the nodes. I read through the documentation you attached and cannot find the LMON process in taskmgr.

I am worried about restarting the server as I do not need it to kick the other nodes off again! Thanks for the help!!


burleson
post Apr 11 2012, 02:47 PM
Post #8


Advanced Member
***

Group: Members
Posts: 11,730
Joined: 26-January 04
Member No.: 13



Hi Jamie,

See here, the solution!

http://www.google.com/search?sourceid=ie7&...024&bih=550


http://www.peasland.net/?p=25


--------------------
Hope this helps. . .

Donald K. Burleson
Oracle Press author
Author of Oracle Tuning: The Definitive Reference
