lost switch fabric on one node causes total cluster failure

Discussion in 'Clustering' started by Dave's not here, Jul 21, 2004.

  1. I have 2 separate W2K3 clusters, each with 2 nodes. We
    have 2 separate fabrics attached to a DMX1000. A node
    from each cluster is on a separate fabric. We lost one of
    the switches Monday night, and instead of the "lost" disk
    resources failing over to the remaining up node, the event
    logs show the disk as lost until it finally times out. I've
    pored through the cluster.log and the event logs trying
    to piece this together, but it simply appears that the
    clustering failover functionality just did not try to
    fail over to the other nodes. Why would a fibre channel
    path going dark cause MSCS to think the disk was lost
    without trying the other node to see if it can grab the
    disk? I tested this manually many months ago and the
    failover worked flawlessly.
    Thanks much for any insight into this,
    Dave's not here, Jul 21, 2004

  2. Sounds like old problems creeping back into W2K3. Usually this is a problem
    with HBA firmware/driver combinations not passing events properly to
    MSCS. Check out the following KB article for more details:


    BTW, I'm guessing that the node that did not own the quorum resource was the
    one that failed. In my experience, when you pull the cable on the node that
    owns the quorum, it always fails over, but this is not always true if the
    node does not own the quorum device. Here's another article that explains
    why this occurs:


    John Toner [MVP], Jul 22, 2004

  3. You are right on. After I called Emulex support, they
    told me to try the latest drivers. Funny, 4 months ago I
    had the latest drivers and those were the ones that were
    crap. I just updated the firmware and drivers, and the
    failover is working perfectly again. I would advise
    anyone using the Emulex 9802 (9000 series) cards to
    upgrade unless they have tested their systems recently to
    see if they will fail over as expected. And I am using
    the "full" driver, not the miniport.

    Dave S, Jul 22, 2004
  4. Did you use the Storport or the SCSIport driver?

    Months ago I had a lot of problems with the first
    Storport drivers.

    Another question: You are using two fabrics.
    Do you also use two HBAs in each node, one connected to
    each fabric? (I use PowerPath software.)


    wilfried.lang, Jul 26, 2004