Help Again! Cluster Resources Moving Takes Ages & unreliable

Discussion in 'Clustering' started by Simon, May 17, 2006.

  1. Simon

    Simon Guest

    Hi again everyone :)
    I am in need of assistance, and am happy to discuss it here or directly
    with anyone willing to help. I am concerned about our cluster misbehaving.
    I have not found any good resources in the United Kingdom (1 book, which I
    am still waiting on!!), and the White Papers are not fully helping me
    understand what is going on.

    Spec is as follows
    2 Servers x HP Proliant DL380 G3 Dual Processor 3Ghz and 2GB RAM
    1 x HP EVA SAN Hardware Solution
    OS installed on C:
    The Quorum is on a separate disk but, from my understanding, it's configured
    as Shared. In other words, it can move to another NODE.

    All hardware is fibre channel using HBAs, and all drivers and firmware are
    up to date. Secure Path has been installed and configured, and we also use
    Veritas Enterprise Administrator as the disk administration application
    (this replaces the Microsoft Disk Admin tool which is part of the OS).

    1st Node hosts 9 VDisks in sizes of 250GB. On this Node, there are 65
    Resources configured in Cluster Administrator.

    2nd Node hosts 8 VDisks in sizes of 250GB. On this node, there are 31
    Resources configured in Cluster Administrator.

    The main problem I have is the uptime, which currently stands at 14 days
    (Max has only been 22 so far!!) and the Resources taking a LONG time to move,
    5-7 mins in most cases.

    I have monitored this and noticed that some resources stay in a pending
    state for some time, so it takes a long time for them to move over to the
    other node.

    Also, if the resources have not come back online CORRECTLY, it takes the
    entire group down and moves it back. I think I may have resolved this by
    disabling the option "Affect the Group", which was ticked. The lesson
    learned here: a Shared Resource was removed, the cluster tool could not
    find it, and it took the ENTIRE group down!

    I am not too sure where to start. It's in production, so taking it down is
    not easy, but I want to help the company with the limited skills I have. I'm
    not sure if it's a permissions problem with the resources that is causing
    the issue, or hardware, but if anyone is able to share any additional info,
    this would truly help.

    I'm the first to admit I am NOT a cluster expert, but I want to be, and I
    want to know what is happening so I can understand and correct it.

    I also appreciate everyone is busy, but in times like this, I am willing to
    do almost anything to help sort this out.

    Thank you
    Simon, May 17, 2006

  2. Simon

    MarkFox Guest

    Simon, have you checked whether the delay in coming online is due to a
    chkdsk running? You can determine this by looking in the event logs at the
    time of the delay: on the server where the resource is in online pending,
    you will see a chkdsk process running. You can also look in
    C:\Windows\Cluster\ for a file called ChkDsk_Disk10_Sig....Log. This could
    explain the long delay in a resource coming online, as chkdsk can take quite
    a while to run depending on the amount of space in use etc. The cluster
    checks for corruption every time a resource is moved.

    Hope that helps.
    MarkFox, May 17, 2006
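
    The check for chkdsk logs suggested in the post above can be scripted. This
    is only a minimal sketch: the function name find_chkdsk_logs is made up for
    illustration, and it assumes the logs sit directly in the cluster directory
    with names starting "ChkDsk_" and ending ".log", as the post describes.

```python
from pathlib import Path


def find_chkdsk_logs(cluster_dir):
    """Return chkdsk log files found directly in cluster_dir, newest first.

    Assumption: log names start with "ChkDsk_" and end with ".log"
    (matched case-insensitively). A missing directory yields an empty list.
    """
    root = Path(cluster_dir)
    if not root.is_dir():
        return []
    logs = [
        p for p in root.iterdir()
        if p.is_file()
        and p.name.lower().startswith("chkdsk_")
        and p.suffix.lower() == ".log"
    ]
    # Most recently modified first, so the latest failover is at the top.
    return sorted(logs, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    # Directory taken from the post above; run this on each node.
    for log in find_chkdsk_logs(r"C:\Windows\Cluster"):
        print(log.name, log.stat().st_size, "bytes")
```

    An empty result on both nodes would suggest chkdsk is not the cause of the
    delay, which points the investigation back at the resources themselves.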

  3. Simon

    Simon Guest

    No, there is no file of that name, nor any chkdsk log file. The only file I
    see is the cluster.log file.

    Simon, May 17, 2006
  4. Simon,

    Specifically, which resources are taking the most time to go offline/online?
    You might start by explaining where the delay is in the failover
    process...are they spending most of the time going offline or coming online?

    Another item that I'd look at is why you're experiencing downtime. What is
    crashing and why is it crashing?

    BTW, 5 - 7 minutes is not really an excessive amount of time to move a
    cluster group. Some applications can take a while to stop and start their
    services (like Exchange), so this is not an unreasonable failover time. If
    it's taking 5 minutes for your disk resources or IP resources to go offline
    and back online, this would certainly be excessive.

    John Toner [MVP], May 18, 2006
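
    One way to answer "where is the delay" is to note the time of each state
    change while watching a group move, then total the time each resource
    spends in a pending state. The sketch below assumes hand-recorded
    observations; the function name, the tuple layout, and the sample values
    are all hypothetical, and it does not parse cluster.log.

```python
from collections import defaultdict
from datetime import datetime

PENDING_STATES = {"Offline Pending", "Online Pending"}


def pending_durations(events):
    """Total the seconds each resource spent in each pending state.

    events: iterable of (time "HH:MM:SS", resource name, new state), in
    chronological order and within a single day (hypothetical layout).
    Returns {(resource, pending_state): seconds}.
    """
    totals = defaultdict(float)
    current = {}  # resource -> (state, time that state was entered)
    for stamp, resource, state in events:
        now = datetime.strptime(stamp, "%H:%M:%S")
        if resource in current:
            prev_state, since = current[resource]
            if prev_state in PENDING_STATES:
                totals[(resource, prev_state)] += (now - since).total_seconds()
        current[resource] = (state, now)
    return dict(totals)


# Made-up observations, for illustration only.
observed = [
    ("10:00:00", "Disk Group", "Offline Pending"),
    ("10:02:30", "Disk Group", "Offline"),
    ("10:02:35", "Disk Group", "Online Pending"),
    ("10:06:05", "Disk Group", "Online"),
    ("10:06:05", "File Share", "Online Pending"),
    ("10:06:10", "File Share", "Online"),
]

for (resource, state), seconds in sorted(pending_durations(observed).items()):
    print(f"{resource:12} {state:16} {seconds:6.0f} s")
```

    Sorting the totals makes it obvious whether the time goes into disk
    resources (excessive, per the post above) or into application resources.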
  5. Simon

    Simon Guest

    The 2 nodes are file servers - that's all! I have another cluster at another
    site with the same setup, hardware and software as mentioned above, yet it
    takes 90 seconds for their resources to move over.

    As you can see, much better compared to ours. The resources go offline, but
    several stay in offline pending - mainly the groups stay in this state.

    Also, when attempting to go online, they stay in online pending for ages -
    3-4 mins at least - and then attempt to come up. Most resources fail to
    come up!

    The blue screens are inconclusive - I posted them here before and no one
    has yet come to a real conclusion as to why it failed. I must point out
    that it has not blue screened for 3 weeks now!

    Simon, May 18, 2006
  6. Simon,

    Specifically which RESOURCES are taking a long time to go offline or online
    pending? Resources will go offline and online in a specific order based on
    your dependencies. For example, a correctly configured file share resource
    will not even attempt to go online until the NetName and Physical Disk
    resources are online. Use the Cluster Administrator GUI and watch the way
    the resources go offline and online.

    Which resources are failing? Is it the File Share resources, disk resources,
    all of the above?

    Getting a memory dump analysis probably isn't going to happen in the
    newsgroups. We can give you suggestions as to what might cause a blue
    screen, but you'll likely want to open a case with MS at this point and have
    them pinpoint what is causing the host to BSOD.

    John Toner [MVP], May 19, 2006
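
    The dependency ordering described in the post above can be modelled as a
    topological sort. This is an illustrative sketch, not how the cluster
    service is implemented: the resource names and the dependency map are
    assumptions for a typical file share group, and graphlib needs Python 3.9
    or later.

```python
from graphlib import TopologicalSorter


def online_order(dependencies):
    """Return one valid order for bringing resources online.

    dependencies maps each resource to the resources it depends on; a
    resource appears only after everything it depends on.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Assumed dependency map for a file share group (illustrative only).
deps = {
    "Physical Disk": [],
    "IP Address": [],
    "Network Name": ["IP Address"],
    "File Share": ["Network Name", "Physical Disk"],
}

order = online_order(deps)
print("online order: ", order)
# Going offline happens in the reverse direction: dependents first.
print("offline order:", list(reversed(order)))
```

    In this model, a disk resource stuck in a pending state holds up every
    resource that depends on it, which is why it matters which resource
    stalls first.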
  7. Simon

    Simon Guest

    Hi John
    Thank you for the reply! The resources hanging are mainly the file
    resources, but the one that especially sticks in offline pending or online
    pending is the main Volume Manager Disk Group (we call it groupcluster).

    Incidentally, one of our nodes is set to SYSTEM MANAGED SIZE for the
    Pagefile - I was always under the impression you had to specify a size, i.e.
    if you have 2GB RAM, it should be 4096 min and 4096 max!

    Any ideas?
    Simon, May 22, 2006
  8. Simon

    Simon Guest

    Just to add insult to injury, one of our nodes is complaining about lack of
    resources on the Server - in fact it's the server with a system managed
    pagefile.

    VSS also runs on both nodes, but only recently started (2 weeks ago).

    Event id 7001 - VssAdmin: Unable to create a shadow copy: Ran out of
    resources while running the command.

    Simon, May 22, 2006
  9. I'm not at all shocked to hear that the Veritas resources are causing
    issues. In my experience, Veritas + MSCS = headaches. Unfortunately, there's
    not much that anyone here can do to help resolve why your Veritas disk
    groups are taking an excessive amount of time to go offline and online. Your
    best bet would be to contact Veritas and have them explain why this process
    is delayed.

    John Toner [MVP], May 22, 2006
  10. System managed page file should have nothing to do with your issues. It
    sounds to me like you might have an application/driver with a memory leak.
    You'll want to have your memory dumps analysed to help pinpoint what exactly
    is causing your issues. You really need to contact PSS at this point.

    John Toner [MVP], May 22, 2006
  11. Simon

    Simon Guest

    Thanks - can I just clarify about the managed pagefile? I was always led
    to believe it's essential that a pagefile exists on C: and is given a
    correct min and max size setting.

    The Veritas Disk Administrator is the tool behind the scenes that seems to
    have replaced the built-in tool that comes under Manage My Computer. I do not
    know fully what it is doing.

    It has not blue screened for a while now - just one node complaining of lack
    of memory - hence why I felt the pagefile set at system managed may not be
    good for the amount of resources that the node hosts.

    Any clarification on the pagefile will be good.

    Thanks, Simon
    Simon, May 22, 2006