There was an identical incident posted at:
http://www.microsoft.com/communities...a-a0ff03f55183
However, I thought I would pose it as a separate question in the hope of
bringing additional attention to this problem. I am basically copying and
editing the previous poster's information, as it is so close to our issue.
We have a problem with a Windows 2008 SP1 cluster. It has Exchange 2007 SP1
with Cumulative Rollup 4 installed. The second node is continually going
down. Here is what is going on:
• Cluster servers are IBM xSeries 3650s
• Using an IBM DS4800 SAN for shared storage
• The NIC configuration on the nodes is as follows:
o Onboard Broadcom adapter - v4.4.16.0
o 2 Intel PCI-X adapters
o 3 network connections set up: public - 10.2.105.x (Intel adapter,
switch VLAN 1), private - 10.2.109.x (Intel adapter, switch VLAN 2),
private - 192.168.1.x (Broadcom adapter, crossover cable)
o We set up the 3 network connections to help eliminate the
network as the issue.
o IPv4 Connectivity only, no teaming
o Windows cluster validation does not report any issues.
The issue that we are seeing is that intermittently Node 2 gets kicked out
of the cluster and shuts down the cluster service generating an 1177 error in
the event log. Basically, this means that it lost quorum due to losing
connectivity with the other cluster node. This sometimes happens 3 times an
hour, but at other times not for a few hours. The cluster service will always
automatically restart and everything is fine again for a period of time.
The problem is NOT isolated to Node 2, however. If we make Node 2 the cluster
owner, then Node 1 exhibits the problem; if Node 1 is the owner, then Node 2
exhibits the problem. We are using Node and Disk Majority for the quorum setting.
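To make the quorum behaviour concrete, here is a minimal sketch of the vote arithmetic for a 2-node Node and Disk Majority cluster as I understand it. The function names are made up for illustration, and it assumes the node that currently owns the cluster group also holds the witness disk. Under that assumption, the non-owner is the node that drops out when the nodes stop seeing each other, which would match what we observe.

```python
# Illustrative only: names are hypothetical, not part of any cluster API.
TOTAL_VOTES = 3  # two nodes plus one disk witness


def has_quorum(votes_held: int, total_votes: int = TOTAL_VOTES) -> bool:
    """A partition keeps quorum only with a strict majority of all votes."""
    return votes_held > total_votes // 2


def node_survives(sees_peer: bool, owns_witness: bool) -> bool:
    """Count the votes one node can gather and check them against quorum."""
    votes = 1  # the node's own vote
    if sees_peer:
        votes += 1  # the other node's vote
    if owns_witness:
        votes += 1  # the disk witness vote (assumed to sit with the owner)
    return has_quorum(votes)


# Nodes lose sight of each other: the owner keeps 2 of 3 votes,
# the non-owner is left with 1 of 3 and has to stop.
owner_ok = node_survives(sees_peer=False, owns_witness=True)
non_owner_ok = node_survives(sees_peer=False, owns_witness=False)
```

This only models the vote count. It says nothing about why the nodes stop seeing each other in the first place.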
It looks like the nodes are losing network connectivity to each other based
on the cluster logs indicating the routes as down, but we now have 3 network
connections between the 2 nodes using 3 different adapters from 2 different
vendors. So I doubt the network itself is the issue.
MS believes the issue to be storage related due to "error 170" appearances
in the cluster logs and indicates these are related to persistent reservation
problems. We have installed the latest MPIO from IBM which supposedly
resolves some of these types of issues. However, the problem continues. IBM
is also looking into this, but we await a solution.
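One way to test the storage theory would be to check whether the "error 170" entries actually occur shortly before each 1177 event. Below is a rough sketch of that comparison. It assumes the timestamps have already been pulled out of the cluster log and the event log by hand or by some other means, since I am not going to guess at a log format here. The function name and the 60-second window are arbitrary choices for illustration, and both sources need to be in the same time zone before comparing.

```python
from datetime import datetime, timedelta


def preceded_by(events, candidates, window=timedelta(seconds=60)):
    """Map each event time to the candidate times found within `window` before it."""
    result = {}
    for ev in events:
        result[ev] = [
            c for c in candidates
            if timedelta(0) <= ev - c <= window
        ]
    return result


# Made-up timestamps purely to show the call shape, not real data.
stops_1177 = [datetime(2009, 1, 1, 10, 0, 30)]
errors_170 = [datetime(2009, 1, 1, 10, 0, 5), datetime(2009, 1, 1, 9, 0, 0)]
matches = preceded_by(stops_1177, errors_170)
```

If most 1177 events have no error 170 in the window before them, that would weaken the storage explanation. If nearly all of them do, it would support it.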
Has anyone else run into this problem? Suggestions? Any help is greatly
appreciated.