Cluster node failover takes very long only from Node A to Node B

Discussion in 'Clustering' started by arrowman, May 10, 2006.

  1. arrowman

    arrowman Guest

    Hi,

    I have built a cluster using:

    2 x IBM X345
    2 x ServeRAID 6M cards
    Windows 2003 Enterprise

    I followed IBM and Microsoft instructions for this implementation. I am now
    testing and have run into a problem.

    When I Move Group from Node B to Node A, the failover takes approximately
    15-20 seconds. When I Move Group from Node A to Node B, the failover takes 3
    minutes. No application or system log entries appear to explain this
    problem. I have found errors in my cluster.log that correspond to the 3
    minute gap but I have been unable to locate any documentation describing the
    errors or providing information about how to correct them.

    Here are the two errors that I receive repeatedly over the span of 3 minutes:

    000005bc.00000c54::2006/05/01-17:27:08.424 ERR ServeRAID Logical Disk:
    IPSHAReadAdapterConfiguration FAILED by Windows NT. LastError = 1117.
    000005bc.00000c54::2006/05/01-17:27:08.424 ERR ServeRAID Logical Disk:
    IPSHAReadUserPage4 FAILED by Windows NT. LastError = 1117.
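
To check whether these errors really span the whole 3 minute gap, one option is to pull them out of cluster.log and measure the time between the first and last occurrence. The following is only a rough sketch in Python. The function names are hypothetical, the line layout is assumed from the two entries quoted above (each unwrapped onto a single line), and the name given for code 1117 (ERROR_IO_DEVICE) is the standard Win32 meaning, not something taken from IBM documentation.

```python
import re
from datetime import datetime

# Assumed cluster.log layout, inferred from the entries quoted above:
# <process id>.<thread id>::<YYYY/MM/DD-HH:MM:SS.mmm> <LEVEL> <message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-fA-F]{8})\.(?P<tid>[0-9a-fA-F]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<msg>.*)$"
)
LAST_ERROR_RE = re.compile(r"LastError = (\d+)")

# Standard Win32 system error name for the code seen in this log.
WIN32_NAMES = {1117: "ERROR_IO_DEVICE"}

# The two entries from the post, unwrapped to one line each.
SAMPLE = [
    "000005bc.00000c54::2006/05/01-17:27:08.424 ERR ServeRAID Logical Disk: "
    "IPSHAReadAdapterConfiguration FAILED by Windows NT. LastError = 1117.",
    "000005bc.00000c54::2006/05/01-17:27:08.424 ERR ServeRAID Logical Disk: "
    "IPSHAReadUserPage4 FAILED by Windows NT. LastError = 1117.",
]


def parse_errors(lines, needle="ServeRAID Logical Disk"):
    """Return (timestamp, message) for every ERR line that mentions needle."""
    found = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("level") == "ERR" and needle in match.group("msg"):
            stamp = datetime.strptime(match.group("ts"), "%Y/%m/%d-%H:%M:%S.%f")
            found.append((stamp, match.group("msg")))
    return found


def error_span_seconds(errors):
    """Seconds between the earliest and latest error, or 0.0 if there are none."""
    if not errors:
        return 0.0
    stamps = [stamp for stamp, _ in errors]
    return (max(stamps) - min(stamps)).total_seconds()


def last_error_code(message):
    """Extract the numeric LastError value from a message, or None if absent."""
    match = LAST_ERROR_RE.search(message)
    return int(match.group(1)) if match else None
```

Passing the lines of the full log to parse_errors and then calling error_span_seconds on the result gives the length of the error burst. A value close to 180 seconds would point to a timeout rather than a slow operation.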

    I only have these resources configured:

    Cluster IP (Cluster Group - no problem moving)
    Cluster Name (Cluster Group - no problem moving)
    IPSHA Disk Q - Quorum (Drive Q group - 3 minutes to move)
    IPSHA Disk E - Blank logical drive (Drive E group - 3 minutes to move)

    This 3 minute pause almost feels like a timeout.

    Any help would be appreciated.

    TIA,

    Arrowman
     
    arrowman, May 10, 2006
    #1

  2. Don Wilwol

    Don Wilwol Guest

    Are you using a domain user to run the cluster service? Make sure that user
    has local admin rights on each node. Also try putting the Q resource in the
    cluster group and try it.

    --
    ----------
    Hope It Helps!

    dw
    _______________________________
    Don Wilwol
    Distributed Application Technologies.
    dwilwol(DELETE)@datbusiness.com
    www.AtTheDataCenter.com
    www.skysphere.com
    www.skyphere.com
     
    Don Wilwol, May 10, 2006
    #2

  3. arrowman

    arrowman Guest

    Hi Don,

    I verified that my domain user has local administrator privileges on both
    nodes.

    I also tested your suggestion and found no change.

    Cheers,

    Arrowman
     
    arrowman, May 10, 2006
    #3
  4. Yep, it's the default 180 sec timeout. If these were Physical Disk
    resources, this would not happen. IPSHA Disks? Cannot speak for them except
    to tell you to collaborate with IBM and switch to Fibre as soon as you can.

    --
    Chuck Timon, Jr.
    Microsoft Corporation
    Longhorn Readiness Team
    This posting is provided "AS IS" with no
    warranties, and confers no rights.
     
    Chuck Timon [Microsoft], May 10, 2006
    #4
  5. Don Wilwol

    Don Wilwol Guest

    It sounds like a RAID driver issue, and it sounds like you have some new
    equipment. I'd give the vendor a call. They may have some insight into the
    errors you are seeing.

    --
    ----------
    Hope It Helps!

    dw
    _______________________________
    Don Wilwol
    Distributed Application Technologies.
    dwilwol(DELETE)@datbusiness.com
    www.AtTheDataCenter.com
    www.skysphere.com
    www.skyphere.com
     
    Don Wilwol, May 10, 2006
    #5