Cluster Concept Question

Discussion in 'Clustering' started by RA, May 26, 2005.

  1. RA

    RA Guest

    I have set up an MS cluster on two nodes running Windows 2003 in an
    active/passive environment. The cluster has three partitions, all on the
    same LUN on an HP SAN: partition Q: holds the quorum, partition Y: holds
    SQL, and partition Z: holds data. We wanted to test the cluster. I can
    manually move all the groups, or any of the three individually, with no
    problems. We decided to test a power failure while we only had test data in
    place. I had the quorum and the data owned by server1 but SQL owned by
    server2. Then I turned off the power on server1. The cluster did not fail
    over. I had to start that server back up and reboot server2 before it would
    work.

    I am brand new to clustering. Did I miss a concept? Any information would
    be greatly appreciated.
     
    RA, May 26, 2005
    #1

  2. Hi RA,

    You cannot have three partitions on a LUN in different Resource Groups. In
    fact, there is no such thing as a Partition resource. You can have one or
    more PHYSICAL disks in a group, and a physical disk can only be in ONE group.

    If a group doesn't fail over, you should check if the other node is a
    "possible owner" for all the resources in that group. (Check the Properties.)
     
    Pascal Damman, May 27, 2005
    #2

  3. What errors, if any, are you seeing on Server2 in the event log for cluster
    service?

    Regards,
    John
     
    John Toner [MVP], May 27, 2005
    #3
  4. RA

    RA Guest

    Event ID 17055

    Description

    17053 :

    LogWriter: Operating system error 21(The device is not ready.) encountered.

    Then event id 17055

    18052 :

    Error: 9001, Severity: 21, State: 4.

    Then event id 17055 again

    17053 :

    fcb::close-flush: Operating system error 21(The device is not ready.)
    encountered.

    17052 :

    Device activation error. The physical file name
    'Y:\SQLData\MSSQL\data\msdblog.ldf' may be incorrect.
     
    RA, May 27, 2005
    #4
  5. RA

    RA Guest

    I have an EMA 12000e with an HSG80 controller pair (HP SAN), and I have 3
    disks presented to this cluster server (so in Disk Management the cluster
    server can see 3 physical disks). They are on one RAID 5 LUN that contains 6
    physical disks and has access to two hot spares located in my SAN. Each
    cluster group has only one physical disk in it.
    Sorry if I was unclear earlier.
     
    RA, May 27, 2005
    #5
  6. Does cluster service stop on Server2 when you power down Server1? Does the
    whole cluster fail or is it just the SQL database? The errors you've posted
    seem to indicate SQL only.

    Regards,
    John
     
    John Toner [MVP], May 27, 2005
    #6
  7. RA

    RA Guest

    The cluster manager on server2 locked up and did not start working again
    until server1 was brought back up and server2 was rebooted while the
    cluster on server1 was running.
     
    RA, May 27, 2005
    #7
  8. Did you check the cluster service to verify that it was still running? Did
    you check the cluster.exe command line to see if any of the cluster
    resources were offline or failed?

    You might want to re-test this scenario and check on these. It might just be
    that your cluster IP or network name failed so that your cluster
    administrator GUI did not respond properly.

    Regards,
    John
     
    John Toner [MVP], May 27, 2005
    #8
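
The checks suggested in the post above can be scripted. What follows is a minimal sketch, not a definitive tool: it assumes the `cluster.exe` utility that ships with Windows Server 2003 is on the PATH and that its `node`, `group`, and `resource` objects accept the `/status` option. The function names and the optional cluster name are hypothetical, introduced only for illustration.

```python
import shutil
import subprocess


def status_commands(cluster_name=None):
    """Build the argument lists for the three status queries.

    cluster_name is a hypothetical, optional name; when omitted,
    cluster.exe is assumed to talk to the local cluster.
    """
    base = ["cluster"]
    if cluster_name:
        base.append(f"/cluster:{cluster_name}")
    return [base + [kind, "/status"] for kind in ("node", "group", "resource")]


def run_status(cluster_name=None):
    """Run each query and return {command line: output text}.

    Returns an empty dict when cluster.exe is not found on the PATH.
    """
    if shutil.which("cluster") is None:
        return {}
    results = {}
    for argv in status_commands(cluster_name):
        done = subprocess.run(argv, capture_output=True, text=True)
        results[" ".join(argv)] = done.stdout
    return results


if __name__ == "__main__":
    for command, output in run_status().items():
        print(command)
        print(output)
```

Running this on the surviving node right after powering off the other one would show whether the cluster service still answers at all, and which groups or resources it reports as offline or failed.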
  9. Your registry is corrupt on that node. 17052 means that it thinks your
    Virtual SQL was renamed. Try evicting the node and joining the cluster again
    or restoring the system state from backup.

    Cheers,

    Rod

    MVP - Windows Server - Clustering
    http://www.nw-america.com - Clustering Website
    http://www.msmvps.com/clustering - Blog
     
    Rodney R. Fournier [MVP], May 28, 2005
    #9
  10. RA

    RA Guest

    This is a brand new install. I also tried the same test the other way
    around: I left the server with the quorum up and running and failed the one
    with the SQL virtual server (by turning off the power). The SQL virtual
    server never failed over to the node with the quorum and the data virtual
    disk.
     
    RA, May 31, 2005
    #10
  11. RA

    RA Guest

    I tried another scenario. Server2 has the quorum and the data virtual
    server. Server1 has only the SQL virtual server. I powered off Server1.
    Server2 (the one without the quorum) tried to bring SQL online, but the
    cluster service was hung in a starting state.

    Event log as follows
    ID 1123 Warning network link down. Both heartbeat and lan
    ID 1135 Warning removed server1 from active cluster
    ID 1200 Info trying to bring SQL virtual server online
    ID 118 Error cluster service requesting reset on bus
    ID 15 (8) Errors device harddisk not ready for access yet
    ID 1038 Error Disk q (quorum) has been lost.
    ID 10.8 Error Cluster reservation for cluster disk has been lost.
     
    RA, May 31, 2005
    #11
  12. If the node that owns the quorum is losing its SCSI reservation on the
    device when the other node is powered down, you've got something flaky going
    on in your SAN.

    Verify that you have the latest, supported drivers for your HBAs. If you're
    using STORport drivers, make sure that you have the latest STORport hotfixes
    applied. See the following KB article for the hotfix:
    http://support.microsoft.com/kb/891793

    If it's not the HBA drivers, you'll need to get HP involved to investigate.

    Regards,
    John
     
    John Toner [MVP], May 31, 2005
    #12
  13. RA

    RA Guest

    John, I was not waiting long enough: it takes 14 minutes for the cluster
    service to restart in either scenario. I saw some timeouts under the quorum
    properties, such as a 5000 "looks alive" polling interval and a 60000 "is
    alive" polling interval on the quorum's physical disk. So if all the
    resources have polling intervals that take this long, it could all add up
    to a long wait time. Is it feasible to adjust these polling intervals so the
    server fails over in a more reasonable four or five minutes, or is that
    something I should leave alone?
     
    RA, May 31, 2005
    #13
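
As a rough check of the numbers in the post above, here is a minimal arithmetic sketch. It assumes the 5000 and 60000 values are milliseconds, and it uses a deliberately pessimistic model in which every resource is only noticed as failed after a full "is alive" interval and those delays add up one after another. The resource count of 3 and the function names are hypothetical.

```python
LOOKS_ALIVE_MS = 5000        # "looks alive" interval quoted in the post
IS_ALIVE_MS = 60000          # "is alive" interval quoted in the post
OBSERVED_DELAY_S = 14 * 60   # the 14-minute wait reported in the post


def worst_case_poll_delay_s(resource_count, is_alive_ms=IS_ALIVE_MS):
    """Pessimistic bound: one full IsAlive interval per resource, in series."""
    return resource_count * is_alive_ms / 1000


def resources_needed(observed_s, is_alive_ms=IS_ALIVE_MS):
    """How many serial IsAlive intervals would it take to fill observed_s?"""
    return observed_s * 1000 / is_alive_ms


if __name__ == "__main__":
    # Hypothetical count: three disk resources, one per group.
    print(worst_case_poll_delay_s(3))          # seconds
    print(resources_needed(OBSERVED_DELAY_S))  # serial intervals required
```

Even under this pessimistic model, three resources account for about three minutes, and it would take fourteen back-to-back "is alive" intervals to reach 14 minutes, so polling intervals alone look like an unlikely explanation for the delay.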
  14. Based upon what you've written thus far, MSCS should not be stopping at all
    on your Server2. If Server2 is reporting that it is losing its reservation,
    this is likely an issue in your SAN. Adjusting polling intervals likely
    would not make any difference to your scenario.

    Regards,
    John
     
    John Toner [MVP], May 31, 2005
    #14
  15. RA

    RA Guest

    You were right. I found a Microsoft document that said it would be best to
    run a Storport driver on my HBA. I installed the latest HBA Storport driver
    (I was running an old SCSIport driver) and my failover time went from 14
    minutes to 43 seconds. Thanks for the help.
     
    RA, Jun 1, 2005
    #15