Cluster Failover Failed -- KB help anyone?

dwcscreenwriterextremesupreme@gmail.com - 25 Apr 2006 20:36 GMT
Our second node in an active/active SQL cluster (Win 2K SP4 / SQL2K SP2)
failed, apparently as a result of a bad NIC on the public interface.

The cluster and SQL logs suggest that the failover succeeded, except
that SQL Server Agent suddenly stopped on the first node.

However, 2 hours later, the first node received a boatload of errors
indicating that the file shares were down, the second node was no
longer registered in the NBT, the first node was no longer in the NBT
and the cluster itself was no longer in the NBT. At this point, cluster
services on the node shut down everything within the cluster --
including SQL.

I have heard that this might be a result of an SQL bug regarding
clustering, but can find no corroborating documentation. Has anyone
ever heard of a problem like this, or can you point me to some KB
articles referring to similar issues? I need a post-mortem reason for
why everything just suddenly died.
Geoff N. Hiten - 25 Apr 2006 21:27 GMT
First, what do the logs on both servers say?  You should have application
logs that indicate why the first node failed to carry the complete load.
One of the more common failure causes is not configuring memory to support
both instances on the same node.  The instance that failed over may not have
had enough memory to operate.
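
As a rough illustration of the memory point, here is a minimal sketch of the arithmetic, assuming an even split between instances. The function name, the 4 GB node, and the 1 GB OS reserve are hypothetical figures chosen for the example, not values from this cluster. The result is the kind of number you might set as each instance's 'max server memory' option so that both instances can still fit on one node after a failover.

```python
def max_memory_per_instance_mb(node_ram_mb, os_reserve_mb, instances):
    """Evenly split what is left after reserving memory for the OS.

    All inputs are assumptions for illustration; a real budget should
    also account for other services running on the node.
    """
    if instances < 1:
        raise ValueError("instances must be at least 1")
    usable = node_ram_mb - os_reserve_mb
    if usable <= 0:
        raise ValueError("OS reserve leaves no memory for SQL Server")
    return usable // instances


# Hypothetical node: 4 GB RAM, 1 GB held back for Windows and cluster services.
cap = max_memory_per_instance_mb(4096, 1024, instances=2)
print(cap)  # 1536
```

An even split is only the simplest policy. If one instance is much busier than the other, a weighted split may fit better, as long as the caps still sum to less than the memory available on a single node.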

Also, if you have file shares on clustered resources (or worse, clustered
SQL server instances accessing local file shares) then you have a design
that is incompatible with clustering.  Again, the error logs should tell you
what is going on.

Geoff N. Hiten
Senior Database Administrator
Microsoft SQL Server MVP

dwcscreenwriterextremesupreme@gmail.com - 25 Apr 2006 23:13 GMT
File shares are all on a central SAN, with quorum drive on its own
spindles, sql resources for each instance on their own spindles.

The sequence of events appears to be this:

(from system log @ 22:39)
The browser has forced an election on network
\Device\NetBT_Tcpip_{210B8C37-6D9D-479A-97FA-3F6B28A0D447} because a
master browser was stopped.

(from cluster logs)
00000788.00000794::2006/04/22-22:40:04.211 File Share <Logship2>: Error
checking for share. Error 2114.
00000788.00000794::2006/04/22-22:40:04.211 File Share <Logship2>: Error
checking for share. Error 2114.
0000055c.0000079c::2006/04/22-22:40:04.242 [FM] NotifyCallBackRoutine:
enqueuing event
0000055c.00000648::2006/04/22-22:40:04.258 [CP] CppResourceNotify for
resource Logship2
0000055c.0000064c::2006/04/22-22:40:04.274 [FM]
FmpHandleResourceTransition: Resource Name =
09ccd7f0-6c52-4756-af88-a3b6e4fb6043 old state=2 new state=4
0000055c.0000064c::2006/04/22-22:40:04.289 [GUM] GumSendUpdate: queuing
update    type 0 context 8
0000055c.0000064c::2006/04/22-22:40:04.289 [GUM] GumSendUpdate:
Dispatching seq 29499    type 0 context 8 to node 1
0000055c.0000064c::2006/04/22-22:40:04.289 [GUM] GumSendUpdate:
completed update seq 29499    type 0 context 8
0000055c.0000064c::2006/04/22-22:40:04.289 [FM]
FmpPropagateResourceState: resource
09ccd7f0-6c52-4756-af88-a3b6e4fb6043 failed event.
0000055c.0000064c::2006/04/22-22:40:04.289 [FM]
FmpHandleResourceFailure: taking resource
09ccd7f0-6c52-4756-af88-a3b6e4fb6043 and dependents offline
00000788.00000794::2006/04/22-22:40:10.961 Network Name <SQL Network
Name(2P0SQLV2)>: Name 2P0SQLV2<20> is no longer registered with NBT.
00000788.00000794::2006/04/22-22:40:10.976 File Share <LogShip>: Error
checking for share. Error 2114.
00000788.00000794::2006/04/22-22:40:10.976 File Share <LogShip>: Error
checking for share. Error 2114.
0000055c.0000079c::2006/04/22-22:40:10.976 [FM] NotifyCallBackRoutine:
enqueuing event
0000055c.00000648::2006/04/22-22:40:10.976 [CP] CppResourceNotify for
resource LogShip
00000788.00000794::2006/04/22-22:40:11.023 Network Name <SQL Network
Name(2P0SQLV1)>: Name 2P0SQLV1<20> is no longer registered with NBT.
00000788.00000794::2006/04/22-22:40:11.023 Network Name <Cluster Name>:
Name 2P0CLUSTER1<20> is no longer registered with NBT.
00000788.0000041c::2006/04/22-22:40:11.023 File Share <Logship2>:
SmbShareDoTerminate: SmbpShareNotifyWorker Terminated... !!!
00000788.0000041c::2006/04/22-22:40:11.023 File Share <Logship2>: Error
removing share 'Logship2'. Error 2114.
0000055c.0000064c::2006/04/22-22:40:11.039 [CP] CppResourceNotify for
resource Logship2
00000788.00000794::2006/04/22-22:40:11.039 Network Name <SQL Network
Name(2P0SQLV2)>: Name 2P0SQLV2<20> is no longer registered with NBT.
0000055c.0000064c::2006/04/22-22:40:11.039 [FM] RmTerminateResource:
09ccd7f0-6c52-4756-af88-a3b6e4fb6043 is now offline
00000788.00000794::2006/04/22-22:40:11.039 Network Name <SQL Network
Name(2P0SQLV1)>: Name 2P0SQLV1<20> is no longer registered with NBT.
0000055c.0000064c::2006/04/22-22:40:11.039 [FM] RestartResourceTree,
Restart resource 09ccd7f0-6c52-4756-af88-a3b6e4fb6043
00000788.00000794::2006/04/22-22:40:11.039 Network Name <Cluster Name>:
Name 2P0CLUSTER1<20> is no longer registered with NBT.

A few seconds later, the SQL network name fails its IsAlive/LooksAlive check:

00000788.00000794::2006/04/22-22:40:16.351 Network Name <SQL Network
Name(2P0SQLV1)>: Name 2P0SQLV1 failed IsAlive/LooksAlive check, error
5038.
00000788.00000794::2006/04/22-22:40:16.351 Network Name <SQL Network
Name(2P0SQLV1)>: Name 2P0SQLV1<20> is no longer registered with NBT.
00000788.00000794::2006/04/22-22:40:16.351 Network Name <SQL Network
Name(2P0SQLV1)>: Name 2P0SQLV1 failed IsAlive/LooksAlive check, error
5038.

As does the cluster name itself:

0000055c.0000079c::2006/04/22-22:40:16.351 [FM] NotifyCallBackRoutine:
enqueuing event
0000055c.00000648::2006/04/22-22:40:16.351 [CP] CppResourceNotify for
resource SQL Network Name(2P0SQLV1)
00000788.00000794::2006/04/22-22:40:16.351 Network Name <Cluster Name>:
Name 2P0CLUSTER1<20> is no longer registered with NBT.
00000788.00000794::2006/04/22-22:40:16.351 Network Name <Cluster Name>:
Name 2P0CLUSTER1 failed IsAlive/LooksAlive check, error 5038.

Again, in SQL everything seemed to be fine. It was Cluster Services
itself that seemed to die.
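
For a post-mortem, it can help to condense a cluster log like the one above into a short timeline of failures. The following is a minimal sketch, assuming every record starts with the `pid.tid::YYYY/MM/DD-HH:MM:SS.mmm` prefix seen in these excerpts and that lines without the prefix are wrapped continuations of the previous record. The function names and the keyword list are hypothetical choices for this example, not part of any cluster tooling.

```python
import re
from collections import Counter
from datetime import datetime

# Record prefix as it appears in the excerpts: pid.tid::date-time message
PREFIX = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+(?P<msg>.*)$"
)


def parse_cluster_log(text):
    """Join wrapped lines and return a list of (timestamp, pid, message)."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = PREFIX.match(line)
        if m:
            ts = datetime.strptime(m["ts"], "%Y/%m/%d-%H:%M:%S.%f")
            records.append([ts, m["pid"], m["msg"]])
        elif records:
            # No prefix: assume a wrapped continuation of the previous record.
            records[-1][2] += " " + line
    return [tuple(r) for r in records]


def failure_summary(records, keywords=("Error", "failed", "no longer registered")):
    """Return (first_seen, count, message) for messages matching a keyword."""
    hits = Counter()
    first_seen = {}
    for ts, _pid, msg in records:
        if any(k in msg for k in keywords):
            hits[msg] += 1
            first_seen.setdefault(msg, ts)
    return sorted((first_seen[m], n, m) for m, n in hits.items())


SAMPLE = """\
00000788.00000794::2006/04/22-22:40:04.211 File Share <Logship2>: Error
checking for share. Error 2114.
00000788.00000794::2006/04/22-22:40:04.211 File Share <Logship2>: Error
checking for share. Error 2114.
0000055c.0000079c::2006/04/22-22:40:04.242 [FM] NotifyCallBackRoutine:
enqueuing event
00000788.00000794::2006/04/22-22:40:10.961 Network Name <SQL Network
Name(2P0SQLV2)>: Name 2P0SQLV2<20> is no longer registered with NBT.
"""

records = parse_cluster_log(SAMPLE)
for ts, count, msg in failure_summary(records):
    print(ts.time(), f"x{count}", msg)
```

Collapsing repeated messages and keeping the first time each one appeared makes the order of events easier to read: in the sample, the file share check fails several seconds before the network name drops out of NBT. Any commentary mixed in with the log text would be treated as a continuation line, so feed it only the raw log.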
Geoff N. Hiten - 26 Apr 2006 01:31 GMT
Turn off NetBIOS on your private NIC.

Geoff N. Hiten
Senior Database Administrator
Microsoft SQL Server MVP

dwcscreenwriterextremesupreme@gmail.com - 26 Apr 2006 19:33 GMT
Thank you, we'll try that.

What I'm most interested in, though, is whether you've ever heard of
something like this happening before. This is a server cluster that's
been up and solid for over a year, and I can't just put in my
post-mortem "mystical fairies killed the server."

Does anyone know *why* something like this might have happened, or have
any KB articles referring to similar problems?

 