We have a two-node Windows 2003 Enterprise R2 x64 cluster set up on our EMC
CX700, connected over a Brocade-based SAN. The servers are HP DL360 G6s with
2 quad-core CPUs and 12GB RAM, with dual QLogic PCI Express cards, running EMC
PowerPath multipathing software. All drivers, firmware, patches, etc. are
current.
This cluster replaced a 2003 x86/32bit cluster we had set up for several
years (on the same storage array). There are about 10 "drives" on the
cluster: some were brought over by moving the LUNs from the old cluster,
and some were new, with the data moved over by backup/restore.
Total storage all together is about 4TB, although the largest LUN is only
about 1.5TB (most are in the 5-800GB range).
The other day I was on the server console copying a single 25GB file (a
Windows Backup "BKF" file) from one "drive" to another. It started copying
and I locked the console and left it. Because I was away from the console, I
didn't see that about 10 minutes into the copy, the system started popping
up "Delayed write failed.." warnings and logging NTFS error 50 and
Application error 26 in the system log.
About an hour after the warnings started (they were still happening), the
Clussvc service started logging "1055" errors in the log, "Cluster File
Share Resource xyz has failed a status check. The error code is 64". The
affected file shares were not on either of the drives I was copying to or from.
Then I started seeing 1069 errors ("Cluster Resource failed"), again for
shares on various drives.
By then, our users had started contacting the help desk to report that they
couldn't access the shares (surprise!). I went back to the console, saw the
"Delayed write" popups, and killed the file copy I was doing. According to the
event log, about 10 seconds after the copy was stopped, we had a few more
1055 and 1069 errors, then a "1201" info message, "Cluster service brought
the resource group online", and everything was working OK.
I've checked the drives, and CHKDSK isn't showing any of them "dirty", so
it doesn't appear to be a hardware problem.
I'm guessing that the big file copy caused an I/O bottleneck on the
server and the shares timed out.
Any suggestions on what caused this and how to keep it from happening
again? Granted, we don't move 20GB+ files around all the time, but I would
think the system should handle it without being overloaded.
I've searched Microsoft, EMC, and the internet in general.
I've found some similar cases, and some things to check, but nothing really
definitive for our installation.
Mike O.