Unexpected Cluster Switch (due to error 5 from cluster log)

Discussion in 'Clustering' started by Matthias, May 16, 2008.

  1. Matthias

    Matthias Guest

    Hello all,
    Yesterday one of our cluster systems performed an unexpected cluster switch (failover).

    System information:

    HW: ProLiant DL585 G1 / 2x AMD Opteron 2.2 GHz / 16 GB RAM
    OS: Microsoft Windows Server 2003 Enterprise x64 Edition
    OS Version: 5.2.3790 Service Pack 2 Build 3790

    HP ProLiant Support Pack 7.90

    Attached to a SAN via FC

    Main Software: SAP CRM 5.0 SP15 / on MS SQL Server 2005
    Support Software: DataProtector / McAfee (Enterp. 8.0.0 Patch 15) / SNARE
    3.0.0

    MSCS configuration:

    User LAN (teamed)
    Server LAN (not teamed)
    Private LAN (crossover)

    Cluster Group / MSDTC-Group / SAP-Group / SQL-Group


    ___________________________________________________________-

    The Clusterlog:

    0000098c.00000a64::2008/05/15-15:16:43.912 INFO [DM] DmpGetSnapShotCb:
    DmpGetDatabase returned 0x00000000
    0000098c.00000a64::2008/05/15-15:16:43.912 INFO [Qfs] QfsGetTempFileName
    Q:\MSCS\, chkpt, 8011 => Q:\MSCS\chk1F4B.tmp, status 0
    0000098c.00000a64::2008/05/15-15:16:43.912 INFO [LM] DmpGetSnapshotCb:
    Checkpoint file name=Q:\MSCS\chk1F4B.tmp Seq#=8011
    0000098c.00000a64::2008/05/15-15:16:43.912 INFO [Qfs] QfsMoveFileEx
    Q:\MSCS\chk619F.tmp=>Q:\MSCS\chk1F4B.tmp
    0000098c.00000a64::2008/05/15-15:16:43.912 WARN [LM] DmpGetSnapShotCb:
    Failed to move the temp file to checkpoint file,
    TempFileName=Q:\MSCS\chk619F.tmp, ChkPtFileName=Q:\MSCS\chk1F4B.tmp,
    Error=0x00000005
    0000098c.00000a64::2008/05/15-15:16:43.912 INFO [Qfs] QfsDeleteFile
    Q:\MSCS\chk619F.tmp, status 0
    0000098c.00000a64::2008/05/15-15:16:43.912 WARN [LM] LogCheckPoint: Callback
    failed to return a checkpoint
    0000098c.00000a64::2008/05/15-15:16:43.912 WARN [LM] LogpReset:: Callback
    failed to return a checkpoint, error=5
    0000098c.00000a64::2008/05/15-15:16:43.912 INFO [LM] LogClose : Entry
    LogFile=0x02ad7df0
    0000098c.00000a64::2008/05/15-15:16:43.912 INFO [LM] LogFlush :
    pLog=0x02ad7df0 writing the 1024 bytes for active page at offset 0x00000400
    0000098c.00000a64::2008/05/15-15:16:43.912 INFO [Qfs] WriteFile 99c (....)
    1024, status 0 (0=>0)
    0000098c.00000a64::2008/05/15-15:16:43.912 INFO [Qfs] QfsFlushBuffers 99c,
    status 0
    0000098c.00000a64::2008/05/15-15:16:43.912 INFO [Qfs] QfsCloseHandle 99c,
    status 0
    0000098c.00000a64::2008/05/15-15:16:43.912 INFO [LM] LogClose : Exit
    returning success
    0000098c.00000a64::2008/05/15-15:16:43.912 INFO [Qfs] QfsDeleteFile
    Q:\MSCS\tqu619E.tmp, status 0
    0000098c.00000a64::2008/05/15-15:16:43.912 INFO [LM] LogpReset exit,
    returning 0x00000005
    0000098c.00000a64::2008/05/15-15:16:43.912 INFO [LM] LogReset exit,
    returning 0x00000005
    0000098c.00000a64::2008/05/15-15:16:43.912 ERR [DM]DmpCheckpointTimerCb -
    Failed to reset log, error=5
    0000098c.00000a64::2008/05/15-15:16:44.005 ERR Cluster service suffered an
    unexpected fatal error at line 2324 of source module
    d:\nt\base\cluster\service\dm\dmlog.c. The error code was 5.
    00000f58.00000f5c::2008/05/15-15:16:45.004 WARN [RM] Going away, Status = 1,
    Shutdown = 0.
    00000f58.00000f5c::2008/05/15-15:16:45.004 ERR [RM] Active Resource =
    00000000
    00000f58.00000f5c::2008/05/15-15:16:45.004 ERR [RM] Resource State is 1, ""
    00000f58.00000f5c::2008/05/15-15:16:45.004 INFO [RM] Posting shutdown
    notification.
    00000f38.00000f3c::2008/05/15-15:16:45.004 WARN [RM] Going away, Status = 1,
    Shutdown = 0.
    00000f38.00000f3c::2008/05/15-15:16:45.004 ERR [RM] Active Resource =
    00000000
    00000f38.00000f3c::2008/05/15-15:16:45.004 ERR [RM] Resource State is 1, ""
    00000f38.00000f3c::2008/05/15-15:16:45.004 INFO [RM] Posting shutdown
    notification.
    00000f18.00000f1c::2008/05/15-15:16:45.004 WARN [RM] Going away, Status = 1,
    Shutdown = 0.
    00000f18.00000f1c::2008/05/15-15:16:45.004 ERR [RM] Active Resource =
    00000000
    00000f18.00000f1c::2008/05/15-15:16:45.004 ERR [RM] Resource State is 1, ""
    00000f18.00000f1c::2008/05/15-15:16:45.004 INFO [RM] Posting shutdown
    notification.
    00000b70.00000b74::2008/05/15-15:16:45.004 WARN [RM] Going away, Status = 1,
    Shutdown = 0.
    00000b70.00000b74::2008/05/15-15:16:45.004 ERR [RM] Active Resource =
    00000000
    00000b70.00000b74::2008/05/15-15:16:45.004 ERR [RM] Resource State is 1, ""
    00000b70.00000b74::2008/05/15-15:16:45.004 INFO [RM] Posting shutdown
    notification.
    00000b70.00000b74::2008/05/15-15:16:45.004 INFO SAP Resource <SAP CPR 00
    Instance>: ResourceControl request.
    00000f18.00000f34::2008/05/15-15:16:45.019 INFO [RM] NotifyChanges shutting
    down.
    00000f38.00000f54::2008/05/15-15:16:45.019 INFO [RM] NotifyChanges shutting
    down.
    00000f58.00000f74::2008/05/15-15:16:45.019 INFO [RM] NotifyChanges shutting
    down.
    00000b70.00000f08::2008/05/15-15:16:45.035 INFO [RM] NotifyChanges shutting
    down.
    00000b70.00000f10::2008/05/15-15:16:45.050 INFO Physical Disk <Disk H:>:
    [DiskArb] CompletionRoutine, status 0.
    00000b70.00000f10::2008/05/15-15:16:45.050 INFO Physical Disk <Disk H:>:

    There are also errors in the event log:

    Event Type: Error
    Event Source: ClusSvc
    Event Category: Log Mgr
    Event ID: 1016
    Date: 15.05.2008
    Time: 17:16:43
    User: N/A
    Computer: NODE1
    Description:
    Cluster service failed to obtain a checkpoint from the server cluster
    database for log file Q:\MSCS\tqu619E.tmp.

    Next:

    Event Type: Error
    Event Source: ClusSvc
    Event Category: Database Mgr
    Event ID: 1000
    Date: 15.05.2008
    Time: 17:16:43
    User: N/A
    Computer: NODE1
    Description:
    Cluster service suffered an unexpected fatal error at line 2324 of source
    module d:\nt\base\cluster\service\dm\dmlog.c. The error code was 5.

    A lot of:

    Event Type: Warning
    Event Source: Ftdisk
    Event Category: Disk
    Event ID: 57
    Date: 15.05.2008
    Time: 17:16:45
    User: N/A
    Computer: NODE1
    Description:
    The system failed to flush data to the transaction log. Corruption may occur.

    And:

    Event Type: Error
    Event Source: Service Control Manager
    Event Category: None
    Event ID: 7031
    Date: 15.05.2008
    Time: 17:16:45
    User: N/A
    Computer: NODE1
    Description:
    The Cluster Service service terminated unexpectedly. It has done this 1
    time(s). The following corrective action will be taken in 60000
    milliseconds: Restart the service.

    I found KB article http://support.microsoft.com/kb/321531/en-us, but I cannot
    believe that our virus scanner is the cause, because we exclude all recommended
    drives and files (e.g. quorum drive, database drives, database log drives,
    SQL executables, pagefile, C:\Windows\Cluster, ..\NTDS, ..ntfsr, ..SYSVOL,
    *.chk, *.ebd, *.ldf, *.log, *.mdf, *.ndf, *.stm) from read and write scans.


    Does anyone have an idea?


    br, Matthias
    ____________________________________________
    Matthias Schweifer - Austria
     
    Matthias, May 16, 2008
    #1

  2. Error 5 is "access denied", and it occurred while the cluster service was
    checkpointing the cluster registry to the quorum drive. Check that the cluster
    service account has both the 'Back up files and directories' and 'Restore
    files and directories' user rights. Also, make sure your antivirus is NOT
    scanning the quorum. If it was scanning a quorum file at the time of a
    checkpoint, that may explain the error 5.
    --
    Jeff Hughes, MCSE
    Support Escalation Engineer
    Microsoft Enterprise Platforms Support (Server Core/Cluster)


     
    Jeff Hughes [MSFT], May 16, 2008
    #2
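
To make the reading of the log in this reply easier to reproduce, here is a minimal Python sketch that pulls the WARN/ERR entries out of a cluster log excerpt and translates the error codes discussed in this thread. It assumes the entry layout visible in the first post (`pid.tid::timestamp LEVEL message`, with hard-wrapped continuation lines). The function names `parse_entries`, `error_code` and `failures` are made up for illustration; only the two code names (2 and 5) are standard Win32 constants.

```python
import re

# Win32 error codes that appear in this thread (standard winerror.h names).
WIN32_ERRORS = {
    2: "ERROR_FILE_NOT_FOUND",
    5: "ERROR_ACCESS_DENIED",
}

# Assumed entry layout: <pid>.<tid>::<YYYY/MM/DD-HH:MM:SS.mmm> <LEVEL> <message>
ENTRY_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>INFO|WARN|ERR)\s+(?P<msg>.*)$"
)

# Matches "error=5", "Error=0x00000005", "error 5" and "error code was 5".
CODE_RE = re.compile(r"error(?: code was)?[ =]+(0x[0-9a-f]+|\d+)", re.IGNORECASE)


def parse_entries(text):
    """Join hard-wrapped continuation lines and yield one dict per log entry."""
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = ENTRY_RE.match(line)
        if match:
            if current:
                yield current
            current = match.groupdict()
        elif current:
            current["msg"] += " " + line
    if current:
        yield current


def error_code(msg):
    """Return the first error code found in a message, or None."""
    match = CODE_RE.search(msg)
    if not match:
        return None
    token = match.group(1)
    return int(token, 16) if token.lower().startswith("0x") else int(token)


def failures(text):
    """Return (timestamp, level, code, name) for WARN/ERR entries with a code."""
    rows = []
    for entry in parse_entries(text):
        if entry["level"] == "INFO":
            continue
        code = error_code(entry["msg"])
        if code is not None:
            rows.append((entry["ts"], entry["level"], code,
                         WIN32_ERRORS.get(code, "unknown")))
    return rows


# A few entries copied from the log in the first post.
SAMPLE = r"""
0000098c.00000a64::2008/05/15-15:16:43.912 WARN [LM] LogpReset:: Callback
failed to return a checkpoint, error=5
0000098c.00000a64::2008/05/15-15:16:43.912 INFO [LM] LogClose : Exit
returning success
0000098c.00000a64::2008/05/15-15:16:43.912 ERR [DM]DmpCheckpointTimerCb -
Failed to reset log, error=5
0000098c.00000a64::2008/05/15-15:16:44.005 ERR Cluster service suffered an
unexpected fatal error at line 2324 of source module
d:\nt\base\cluster\service\dm\dmlog.c. The error code was 5.
"""

if __name__ == "__main__":
    for row in failures(SAMPLE):
        print(*row)
```

Run against a full log, this kind of filter narrows hundreds of INFO lines down to the few entries that show where the access-denied error first surfaced (here, the checkpoint file move on the quorum drive).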

  3. Matthias

    Matthias Guest

    I am not the backup administrator in our company, but as further information
    I should note that a full file-system backup was running on both nodes (with HP
    DataProtector); the physical quorum disk was also backed up.
    Start: 17:15

    Is that a possible reason for the error 5?
    Should we exclude the quorum disk from the backup set?
    (Is a system state backup sufficient?)

    br, matthias
     
    Matthias, May 16, 2008
    #3
  4. Yes, if the quorum files were being backed up at the time, that may very
    well be why you got an error 5. You do not need to back up the quorum, and it
    should be excluded from your scheduled backups. There's nothing there you'd
    ever need to recover, since the quorum is only used to maintain a copy
    of the cluster database and any checkpointed registry keys, and you can
    always recreate those files if needed.
     
    Jeff Hughes [MSFT], May 20, 2008
    #4
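
The timing argument in the two posts above can be checked with a short Python sketch. It rests on two assumptions that the thread does not state outright: that the cluster log timestamps are in UTC, and that the node's local time on that day was UTC+2. Both are consistent with the two-hour gap between the cluster log (15:16:43) and the event log (17:16:43). It also assumes the 17:15 backup start refers to the same day, 15 May 2008. The helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: local time on the node was UTC+2 on 15 May 2008.
LOCAL = timezone(timedelta(hours=2))


def cluster_log_to_local(stamp):
    """Convert a cluster log stamp like '2008/05/15-15:16:43.912' (assumed UTC) to local time."""
    utc = datetime.strptime(stamp, "%Y/%m/%d-%H:%M:%S.%f").replace(tzinfo=timezone.utc)
    return utc.astimezone(LOCAL)


def seconds_after(start, event):
    """Seconds elapsed from start to event (negative if the event came first)."""
    return (event - start).total_seconds()


# Backup start as reported in post #3 (date assumed to be the day of the failure).
backup_start = datetime(2008, 5, 15, 17, 15, tzinfo=LOCAL)

# First fatal checkpoint error from the cluster log in post #1.
failure = cluster_log_to_local("2008/05/15-15:16:43.912")

if __name__ == "__main__":
    print(failure.strftime("%H:%M:%S"), seconds_after(backup_start, failure))
```

Under those assumptions the failure falls less than two minutes after the backup started, which fits the explanation that the backup job had a quorum file open when the checkpoint was attempted.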
  5. Hello,
    I got nearly the same messages as described above,
    but my error code is 2.

    Event Type: Error
    Event Source: ClusSvc
    Event Category: Database Mgr
    Event ID: 1000
    Date: 06.06.2008
    Time: 14:34:44
    User: N/A
    Computer: SVREHDWHCLN1
    Description:
    Cluster service suffered an unexpected fatal error at line 2236 of source module d:\nt\base\cluster\service\dm\dmlog.c. The error code was 2.

    For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

    Then I got several of these messages:

    The system failed to flush data to the transaction log. Corruption may occur.

    After that, only this message appears:

    Cluster service is requesting a bus reset for device \Device\ClusDisk0.

    The Cluster Service no longer starts:

    Server specific error code 5086

    The cluster failed over properly and is now running on the other node,
    but the first node is down.

    Any ideas?
    I do not want to evict the node or rebuild the machine.

    Config:

    FSC Blade BX630
    Windows Server 2003 64-bit
    SQL Server 2005 SP2

    IBM SVC SAN, FC-connected

    Thanks for your help
     
    steffen busch, Jun 9, 2008
    #5
  6. Not enough info here to figure out the problem, but it looks like you might
    have lost connectivity to your quorum disk.

    Regards,
    John

    Visit my blog: http://msmvps.com/blogs/jtoner

     
    John Toner [MVP], Jun 13, 2008
    #6
  7. praveen

    praveen Guest

    Hi Jeff,

    It would be very helpful if you could suggest a solution for an issue I am facing with the same error 5.

    I am seeing this error in a Majority Node Set cluster running Exchange 2007.

    Cluster service could not write to a file (C:\DOCUME~1\XXX~1\LOCALS~1\Temp\CLS1348.tmp).

    From the cluster log:
    00000de8.00002fa0::2011/03/17-02:45:19.673 WARN [CP] CppCheckpoint failed to get registry database SYSTEM\CurrentControlSet\Services\MSExchangeIS\ahexclex1 to file C:\DOCUME~1\XXXAHC~1\LOCALS~1\Temp\CLS2D86.tmp error 5

    00000de8.00002fa0::2011/03/17-02:45:19.673 WARN [CP] CppRegNotifyThread CppNotifyCheckpoint due to timer failed, reset the timer.



    So basically error 5 indicates an "Access denied" problem. We have a Majority Node Set, and I have excluded C:\DOCUME~1\XXXAHC~1\LOCALS~1\Temp and c:\Windows\Cluster from antivirus scanning, but the error still persists.

    Please help me understand the possible cause of error 5 in this case.

     
    praveen, Apr 21, 2011
    #7
