Wednesday, April 1, 2009

What is Voting Disk & Split Brain Syndrome in RAC

Voting Disk


Oracle Clusterware uses the voting disk to determine which instances are members of a cluster. The voting disk must reside on shared storage. Basically, all nodes in the RAC cluster register their heartbeat information on the voting disks, and this information determines which nodes are currently active members of the cluster. The voting disks are also used to check the availability of instances in RAC and to remove unavailable nodes from the cluster. This helps prevent a split-brain condition and keeps the database information intact. The split-brain syndrome, its effects, and how Oracle manages it are described below.
For high availability, Oracle recommends that you have a minimum of three voting disks. If you configure a single voting disk, then you should use external mirroring to provide redundancy. You can have up to 32 voting disks in your cluster. My understanding of why an odd number of voting disks is used is that a node must be able to see a majority (more than half) of the voting disks to continue functioning; with only 2 voting disks, a node that can see just 1 has exactly half, not a majority. I am still trying to find out more about this concept.
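As a quick sketch (the exact syntax differs a little between Clusterware versions, and the path below is just a placeholder), you can list the configured voting disks and add another one with crsctl:

    crsctl query css votedisk
    crsctl add css votedisk /u02/oradata/votedisk3

On pre-11gR2 clusterware, adding or deleting a voting disk usually has to be done with the cluster stack down (using the -force option), whereas in 11gR2 it can be done online; check the documentation for your release before trying this.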

Split Brain Syndrome:


In an Oracle RAC environment, all the instances/servers communicate with each other over high-speed interconnects on the private network. This private network interface, or interconnect, is redundant and is used only for inter-instance Oracle data block transfers. In an Oracle RAC system, split brain occurs when the instances in the cluster fail to ping/connect to each other via this private interconnect, while the servers are all physically up and running and the database instance on each of these servers is also running. These individual nodes are running fine and can conceptually accept user connections and work independently. Because of the lack of communication, each instance thinks that the other instance it cannot reach is down and that it needs to do something about the situation. The problem is that if we leave these instances running, the same block might be read and updated in the individual instances, causing data integrity issues: blocks changed in one instance will not be locked and could be overwritten by another instance. Oracle has efficiently implemented checks for the split-brain syndrome.

What does RAC do in case a node becomes inactive:


In RAC, if any node becomes inactive, or if other nodes are unable to ping/connect to a node in the cluster, then the node which first detects that one of the nodes is not accessible will evict that node from the RAC group. For example, if there are 4 nodes in a RAC cluster and node 3 becomes unavailable, and node 1 tries to connect to node 3 and finds it not responding, then node 1 will evict node 3 out of the RAC group, leaving only node 1, node 2 and node 4 in the group to continue functioning.
The split-brain scenario can become more complicated in large RAC setups. For example, suppose there are 10 RAC nodes in a cluster and 4 nodes are not able to communicate with the other 6, so 2 groups are formed in this 10-node cluster (one group of 4 nodes and the other of 6 nodes). The nodes will quickly try to affirm their membership by locking the controlfile, and the node that locks the controlfile will then check the votes of the other nodes. The group with the larger number of active nodes gets preference and the other nodes are evicted. That said, I have only seen this node eviction issue with a single node getting evicted while the rest keep functioning, so I cannot testify from experience that this is exactly how it works, but this is the theory behind it.
When a node is evicted, Oracle RAC will usually reboot that node and then perform a cluster reconfiguration to bring the evicted node back into the cluster.
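If you want to check which nodes are currently part of the cluster after such a reconfiguration, a couple of standard clusterware commands help (run them from any surviving node; the exact output format varies by version):

    olsnodes -n        # lists the cluster nodes with their node numbers
    crsctl check crs   # checks that the CRS/CSS/EVM daemons are healthy on the local node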
You will see the Oracle error ORA-29740 when there is a node eviction in RAC. There are many reasons for a node eviction, such as the heartbeat not being received via the controlfile, or being unable to communicate with the clusterware.
A good MetaLink (My Oracle Support) note on understanding node evictions and how to address them is Note ID 219361.1.
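When chasing an ORA-29740 eviction, the usual places to start looking (assuming a typical 10gR2/11g clusterware log layout; adjust the paths for your environment) are the database alert log of each instance, the clusterware alert log and the ocssd log, for example:

    $ORA_CRS_HOME/log/<hostname>/alert<hostname>.log
    $ORA_CRS_HOME/log/<hostname>/cssd/ocssd.log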

The CSS (Cluster Synchronization Service) daemon in the clusterware maintains the heartbeat to the voting disk.
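A simple way to confirm that the CSS daemon is alive on a node (these commands are available in 10g and later, though the output text differs by release):

    crsctl check css          # reports whether CSS appears healthy
    ps -ef | grep ocssd.bin   # the CSS daemon process itself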

17 Comments:

Anonymous said...

Great! Simple enough to understand. Keep up the good work.

Anonymous said...

Simple to understand. Good work, and keep going.

dbadon said...

Very good and simple. It would be good if you could write an article on GRD, GCS, GES and TAF. It's a request.

Jagjeet Singh -- said...

Good Article Arun.

You said only one node always gets evicted. Was that on a 2-node cluster?

If yes, then it might be that only the second node gets evicted.

Oracle gives priority to the first node.

Apun Hiran said...

Well, in a 2-node setup the split-brain concept doesn't really come into play. In a 2-node RAC I have seen that if there is any real problem with the instance/server, node eviction happens. No node groups are formed, as there are only 2 nodes. If one node is unable to contact the other, it will evict it: whichever node finds out first evicts the other node.
Regards
Apun

Anuj said...

Hi,

It cleared my doubt.. thanks.. very well explained..

Anuj

oracleandffun said...

Hey, if we have 3 voting disks and 2 OCRs, then what is the redundancy called?
External redundancy or normal redundancy?

Apun Hiran said...

External redundancy is when you have storage-level redundancy, like a RAID setup. In that case you would have 1 OCR and 1 voting disk.
To answer your question, 2 OCRs and 3 voting disks is normal redundancy.
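To see how many OCR locations are actually configured and whether they are consistent, you can run ocrcheck (as root for the full integrity check); it prints each OCR location along with the total and used space and an integrity check:

    ocrcheck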

Regards
Apun

Oracle DBA said...

similar post here :
http://chandu208.blogspot.com/2011/04/ocr-file-and-voting-disk-administration.html
or
http://chandu208.blogspot.com/2011/04/oracle-rac-overview.html

Apun Hiran said...

Hi Abinas,
Could you please let me know the Oracle version you are running? Secondly, what is the reason for the instance reboots? Could you please share the CRS alert log/ocssd log details as well?

Regards
Apun

Anand Chavan said...

Excellent and informative explanation, but it did not cover the role of the voting disk in split brain: what information is written to the voting disk? Also, there are network and disk heartbeats. When there is no network heartbeat, the split brain is resolved using the disk (voting disk) information, i.e. which group of servers will survive. Can you please explain in detail how this happens internally?

Chris said...

Can we have a normal redundancy level for OCR_DG (voting and OCR) and external redundancy for DATA_DG (data)? I was able to set it up that way in 11gR2. However, I am concerned whether it will create problems later.

Aman said...

Good one... simple and clear enough to understand.

Anonymous said...

Simple and Clear

Anonymous said...

Simple and easy to understand.

Anonymous said...

So it should be 3 OCR files and 5 voting disks for high redundancy, right?

Apun Hiran said...

OCR files don't need to be an odd number; you can have 2 OCR files, and make sure you back them up and store the backups offline. Voting disks need to be an odd number, spread across devices for redundancy. For voting disks, 3, 5, etc. are good configurations.
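For the OCR backups mentioned above, the clusterware already takes automatic backups for you; the commands below (available in 10g and 11g, give or take the exact output, and with a placeholder export path) list those backups and take a manual logical export that you can store offline:

    ocrconfig -showbackup               # lists the automatic OCR backups and their locations
    ocrconfig -export /backup/ocr.exp   # manual logical backup of the OCR (run as root)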