First, what do the logs on both servers say? Your application logs should
indicate why the first node failed to carry the complete load.
One of the more common causes of failure is not configuring memory so that
both instances can run on the same node. The instance that failed over may
not have had enough memory to operate.
Also, if you have file shares on clustered resources (or worse, clustered
SQL Server instances accessing local file shares), then you have a design
that is incompatible with clustering. Again, the error logs should tell you
what is going on.
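
As a rough illustration of the memory point above, here is a minimal sketch
of the arithmetic. The function name, the operating-system reserve, and all
the numbers are hypothetical, not values from this cluster; substitute each
instance's configured max server memory and the node's physical RAM.

```python
def fits_on_one_node(node_ram_mb, instance_max_memory_mb, os_reserve_mb=1024):
    """Check whether every instance could run on a single node after failover.

    node_ram_mb            -- physical RAM on the surviving node (assumed value)
    instance_max_memory_mb -- configured max memory per instance (assumed values)
    os_reserve_mb          -- memory left for the OS and cluster service
                              (hypothetical default, tune for your servers)

    Returns (fits, headroom_mb); headroom is negative when memory is short.
    """
    needed = sum(instance_max_memory_mb) + os_reserve_mb
    headroom = node_ram_mb - needed
    return headroom >= 0, headroom


if __name__ == "__main__":
    # Hypothetical two-instance active/active layout.
    print(fits_on_one_node(8192, [3072, 3072]))
    print(fits_on_one_node(4096, [3072, 3072]))
```

If the check fails, the instance that fails over is competing for memory it
was never promised, which is the situation described above.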

Signature
Geoff N. Hiten
Senior Database Administrator
Microsoft SQL Server MVP
> Our second node in an active/active SQL cluster (Win 2K SP4/ SQL2K SP2)
> failed as a result (it seems) of a bad NIC at the public interface.
[quoted text clipped - 15 lines]
> articles referring to similar issues? I need a post-mortem reason for
> why everything just suddenly died.
dwcscreenwriterextremesupreme@gmail.com - 25 Apr 2006 23:13 GMT
File shares are all on a central SAN, with the quorum drive on its own
spindles and the SQL resources for each instance on their own spindles.
The sequence of events appears to be this:
(from system log @ 22:39)
The browser has forced an election on network
\Device\NetBT_Tcpip_{210B8C37-6D9D-479A-97FA-3F6B28A0D447} because a
master browser was stopped.
(from cluster logs)
00000788.00000794::2006/04/22-22:40:04.211 File Share <Logship2>: Error
checking for share. Error 2114.
00000788.00000794::2006/04/22-22:40:04.211 File Share <Logship2>: Error
checking for share. Error 2114.
0000055c.0000079c::2006/04/22-22:40:04.242 [FM] NotifyCallBackRoutine:
enqueuing event
0000055c.00000648::2006/04/22-22:40:04.258 [CP] CppResourceNotify for
resource Logship2
0000055c.0000064c::2006/04/22-22:40:04.274 [FM]
FmpHandleResourceTransition: Resource Name =
09ccd7f0-6c52-4756-af88-a3b6e4fb6043 old state=2 new state=4
0000055c.0000064c::2006/04/22-22:40:04.289 [GUM] GumSendUpdate: queuing
update type 0 context 8
0000055c.0000064c::2006/04/22-22:40:04.289 [GUM] GumSendUpdate:
Dispatching seq 29499 type 0 context 8 to node 1
0000055c.0000064c::2006/04/22-22:40:04.289 [GUM] GumSendUpdate:
completed update seq 29499 type 0 context 8
0000055c.0000064c::2006/04/22-22:40:04.289 [FM]
FmpPropagateResourceState: resource
09ccd7f0-6c52-4756-af88-a3b6e4fb6043 failed event.
0000055c.0000064c::2006/04/22-22:40:04.289 [FM]
FmpHandleResourceFailure: taking resource
09ccd7f0-6c52-4756-af88-a3b6e4fb6043 and dependents offline
00000788.00000794::2006/04/22-22:40:10.961 Network Name <SQL Network
Name(2P0SQLV2)>: Name 2P0SQLV2<20> is no longer registered with NBT.
00000788.00000794::2006/04/22-22:40:10.976 File Share <LogShip>: Error
checking for share. Error 2114.
00000788.00000794::2006/04/22-22:40:10.976 File Share <LogShip>: Error
checking for share. Error 2114.
0000055c.0000079c::2006/04/22-22:40:10.976 [FM] NotifyCallBackRoutine:
enqueuing event
0000055c.00000648::2006/04/22-22:40:10.976 [CP] CppResourceNotify for
resource LogShip
00000788.00000794::2006/04/22-22:40:11.023 Network Name <SQL Network
Name(2P0SQLV1)>: Name 2P0SQLV1<20> is no longer registered with NBT.
00000788.00000794::2006/04/22-22:40:11.023 Network Name <Cluster Name>:
Name 2P0CLUSTER1<20> is no longer registered with NBT.
00000788.0000041c::2006/04/22-22:40:11.023 File Share <Logship2>:
SmbShareDoTerminate: SmbpShareNotifyWorker Terminated... !!!
00000788.0000041c::2006/04/22-22:40:11.023 File Share <Logship2>: Error
removing share 'Logship2'. Error 2114.
0000055c.0000064c::2006/04/22-22:40:11.039 [CP] CppResourceNotify for
resource Logship2
00000788.00000794::2006/04/22-22:40:11.039 Network Name <SQL Network
Name(2P0SQLV2)>: Name 2P0SQLV2<20> is no longer registered with NBT.
0000055c.0000064c::2006/04/22-22:40:11.039 [FM] RmTerminateResource:
09ccd7f0-6c52-4756-af88-a3b6e4fb6043 is now offline
00000788.00000794::2006/04/22-22:40:11.039 Network Name <SQL Network
Name(2P0SQLV1)>: Name 2P0SQLV1<20> is no longer registered with NBT.
0000055c.0000064c::2006/04/22-22:40:11.039 [FM] RestartResourceTree,
Restart resource 09ccd7f0-6c52-4756-af88-a3b6e4fb6043
00000788.00000794::2006/04/22-22:40:11.039 Network Name <Cluster Name>:
Name 2P0CLUSTER1<20> is no longer registered with NBT.
A few seconds later, the SQL node fails its heartbeat:
00000788.00000794::2006/04/22-22:40:16.351 Network Name <SQL Network
Name(2P0SQLV1)>: Name 2P0SQLV1 failed IsAlive/LooksAlive check, error
5038.
00000788.00000794::2006/04/22-22:40:16.351 Network Name <SQL Network
Name(2P0SQLV1)>: Name 2P0SQLV1<20> is no longer registered with NBT.
00000788.00000794::2006/04/22-22:40:16.351 Network Name <SQL Network
Name(2P0SQLV1)>: Name 2P0SQLV1 failed IsAlive/LooksAlive check, error
5038.
As does the cluster itself:
0000055c.0000079c::2006/04/22-22:40:16.351 [FM] NotifyCallBackRoutine:
enqueuing event
0000055c.00000648::2006/04/22-22:40:16.351 [CP] CppResourceNotify for
resource SQL Network Name(2P0SQLV1)
00000788.00000794::2006/04/22-22:40:16.351 Network Name <Cluster Name>:
Name 2P0CLUSTER1<20> is no longer registered with NBT.
00000788.00000794::2006/04/22-22:40:16.351 Network Name <Cluster Name>:
Name 2P0CLUSTER1 failed IsAlive/LooksAlive check, error 5038.
Again, in SQL everything seemed to be fine. It was Cluster Services
itself that seemed to die.
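
For a post-mortem timeline, a small script can help line up excerpts like
the ones above. This is only a sketch: it assumes each cluster log entry
starts with "process.thread::YYYY/MM/DD-HH:MM:SS.mmm" as in the pasted lines,
and that any other non-blank line is a wrapped continuation of the previous
entry. The function names are made up for illustration.

```python
import re
from datetime import datetime

# Entry header as seen in the excerpts: pid.tid::date-time message
ENTRY = re.compile(
    r"^([0-9a-f]{8})\.([0-9a-f]{8})::"
    r"(\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3}) (.*)$"
)


def parse_cluster_log(text):
    """Return a list of dicts with pid, tid, time and message per entry."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            pid, tid, stamp, message = match.groups()
            entries.append({
                "pid": pid,
                "tid": tid,
                "time": datetime.strptime(stamp, "%Y/%m/%d-%H:%M:%S.%f"),
                "message": message,
            })
        elif entries and line.strip():
            # Assumption: a line without a header is a wrapped continuation.
            entries[-1]["message"] += " " + line.strip()
    return entries


def failures(entries):
    """Keep only entries whose message mentions an error or a failed check."""
    pattern = re.compile(r"\berror\b|failed", re.IGNORECASE)
    return [e for e in entries if pattern.search(e["message"])]


SAMPLE = """\
00000788.00000794::2006/04/22-22:40:04.211 File Share <Logship2>: Error
checking for share. Error 2114.
00000788.00000794::2006/04/22-22:40:16.351 Network Name <Cluster Name>:
Name 2P0CLUSTER1 failed IsAlive/LooksAlive check, error 5038.
"""

if __name__ == "__main__":
    for entry in failures(parse_cluster_log(SAMPLE)):
        print(entry["time"].isoformat(), entry["message"])
```

Feed it the raw cluster log rather than a message like this one, since
commentary lines between entries would be treated as continuations.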
Geoff N. Hiten - 26 Apr 2006 01:31 GMT
Turn off NetBIOS on your private NIC.

dwcscreenwriterextremesupreme@gmail.com - 26 Apr 2006 19:33 GMT
Thank you, we'll try that.
What I'm most interested in, though, is whether you've ever heard of
something like this happening before. This is a server cluster that's
been up and solid for over a year, and I can't just put "mystical fairies
killed the server" in my post-mortem.
Does anyone know *why* something like this might have happened, or have
any KB articles referring to similar problems?