Hi
Have a look at
http://www.microsoft.com/sql/techinfo/administration/2000/availability.asp
On our clusters, we have failover times of 10-32 seconds. Most of them
average 12 minutes of downtime per year. This is a banking environment, with
procedures in place to ensure this.
Regards
Mike
> Dear All,
>
[quoted text clipped - 19 lines]
>
> Patrice
Inline.
joe.
> Dear All,
>
[quoted text clipped - 7 lines]
> (1) It takes between 1 and 3 min to make a move group of either SQL or
> DTC resources.
It usually takes under 10 seconds for failover at the OS/MSCS level. This has
been tested with various hardware vendors including some that aren't even on
the Failover Clustering HCL. SQL Server failover time is dependent on a
number of things but mostly on how much time it takes to start up and
complete roll forward/back operations. Tweaking the checkpoint interval can
help as can having short transactions. Huge caches in the SAN can impact
recovery time also. You first need to figure out where the bulk of the 1-3
minutes is being spent in order to determine what you might need to do at
each end.
> (2) To make a HW/SW maintenance on the two nodes, we need to perform 6
> moves for a total of 6 min downtime minimum, 18 min downtime maximum.
Not sure what you mean by this. Why would you need 6 moves? Take down one
side and do your maintenance then the other and then back to the original if
you really want to (not necessary/recommended unless you have multiple
instances). Please elaborate.
> (3) Indeed, I have more, but let's start with only those two.
>
> Then, 52 minutes / 6 minutes = 8 maintenances/year maximum, 52 minutes
> / 18 minutes = 2 maintenances/year minimum. In my opinion, it clearly
> shows that 99.99% with MS SQL Cluster is NOT possible, unless I have
> missed something ...
99.99% availability is exceedingly difficult with hardware alone. Processes
and management play massive roles in a highly available system. Getting to
99.9% is fairly easy with just technology. Beyond that you're really looking
at people & processes which may include design changes.
At the end of the day, your business requirements should set your
availability goals. Every point you gain after 99.9 generally results in a
bigger increase in cost & complexity/effort. What do you really need?
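
For reference, the downtime arithmetic in this exchange can be checked with a short Python sketch. The function names are made up for illustration, a 365-day year is assumed, and planned maintenance is treated as the only source of downtime, which is optimistic. With those assumptions, 99.99% leaves about 52.56 minutes per year.

```python
# Sketch of the availability arithmetic discussed above.
# Assumptions (not from the thread): 365-day year, maintenance is the
# only downtime, and all names here are hypothetical.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_budget_minutes(availability_pct: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability target."""
    return MINUTES_PER_YEAR * (100 - availability_pct) / 100


def max_maintenances(availability_pct: float, minutes_per_maintenance: float) -> int:
    """Whole maintenance windows that fit into the yearly downtime budget."""
    return int(downtime_budget_minutes(availability_pct) // minutes_per_maintenance)


def availability_pct(downtime_minutes_per_year: float) -> float:
    """Availability percentage implied by a yearly downtime figure."""
    return 100 * (1 - downtime_minutes_per_year / MINUTES_PER_YEAR)


if __name__ == "__main__":
    print(downtime_budget_minutes(99.99))   # yearly budget at 99.99%
    print(max_maintenances(99.99, 6))       # best case: 6 min per maintenance
    print(max_maintenances(99.99, 18))      # worst case: 18 min per maintenance
    print(availability_pct(12))             # 12 min of downtime per year
```

The same functions show how much the budget grows at 99.9% (roughly ten times larger), which is the cost/benefit trade-off raised above.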
> Thanks in advance for your comments and best regards,
>
> Patrice
Patrice - 25 Feb 2005 15:05 GMT
Hello Joe,
After Mike's 10-32 seconds, you come in with < 10 seconds. We must be
doing something very wrong to end up with ~2 min on an unloaded SQL
cluster.
But I also think that at this stage we should make sure that we are
measuring the failover duration the same way ;-) Last weekend, we
had to install some MS patches, and we performed two complete
failovers. Here are the details of the 2nd failover (the faster one),
which took 2 min 9 sec:
- NODE01 17:25:30 The Cluster Service is attempting to offline the
Resource Group "Cluster Group".
- NODE01 17:25:30 The Cluster Service brought the Resource Group
"Cluster Group" offline.
- NODE02 17:25:55 The Cluster Service is attempting to bring online
the Resource Group "Cluster Group".
- NODE02 17:25:59 The Cluster Service brought the Resource Group
"Cluster Group" online.
- NODE01 17:26:11 The Cluster Service is attempting to offline the
Resource Group "MSDTC".
- NODE01 17:26:12 The Cluster Service brought the Resource Group
"MSDTC" offline.
- NODE01 17:26:32 The Cluster Service is attempting to offline the
Resource Group "SQL01".
- NODE02 17:26:34 The Cluster Service is attempting to bring online
the Resource Group "MSDTC".
- NODE01 17:26:39 The Cluster Service brought the Resource Group
"SQL01" offline.
- NODE02 17:26:47 The Cluster Service brought the Resource Group
"MSDTC" online.
- NODE01 17:27:00 The Cluster Service is attempting to offline the
Resource Group "SQL02".
- NODE02 17:27:00 The Cluster Service is attempting to bring online
the Resource Group "SQL01".
- NODE01 17:27:08 The Cluster Service brought the Resource Group
"SQL02" offline.
- NODE02 17:27:11 The Cluster Service brought the Resource Group
"SQL01" online.
- NODE02 17:27:28 The Cluster Service is attempting to bring online
the Resource Group "SQL02".
- NODE02 17:27:39 The Cluster Service brought the Resource Group
"SQL02" online.
[00:02:09]
Admittedly, the administrator moved the four groups one by one, which
is probably not the most efficient way; any advice on this topic is
welcome! But if we take the four moves independently, we can see that
each of them still takes > 10 sec:
Move of the "Cluster Group" group:
- NODE01 17:25:30 The Cluster Service is attempting to offline the
Resource Group "Cluster Group".
- NODE01 17:25:30 The Cluster Service brought the Resource Group
"Cluster Group" offline.
- NODE02 17:25:55 The Cluster Service is attempting to bring online
the Resource Group "Cluster Group".
- NODE02 17:25:59 The Cluster Service brought the Resource Group
"Cluster Group" online.
[00:00:29]
Move of the "MSDTC" group:
- NODE01 17:26:11 The Cluster Service is attempting to offline the
Resource Group "MSDTC".
- NODE01 17:26:12 The Cluster Service brought the Resource Group
"MSDTC" offline.
- NODE02 17:26:34 The Cluster Service is attempting to bring online
the Resource Group "MSDTC".
- NODE02 17:26:47 The Cluster Service brought the Resource Group
"MSDTC" online.
[00:00:36]
Move of the "SQL01" group:
- NODE01 17:26:32 The Cluster Service is attempting to offline the
Resource Group "SQL01".
- NODE01 17:26:39 The Cluster Service brought the Resource Group
"SQL01" offline.
- NODE02 17:27:00 The Cluster Service is attempting to bring online
the Resource Group "SQL01".
- NODE02 17:27:11 The Cluster Service brought the Resource Group
"SQL01" online.
[00:00:39]
Move of the "SQL02" group:
- NODE01 17:27:00 The Cluster Service is attempting to offline the
Resource Group "SQL02".
- NODE01 17:27:08 The Cluster Service brought the Resource Group
"SQL02" offline.
- NODE02 17:27:28 The Cluster Service is attempting to bring online
the Resource Group "SQL02".
- NODE02 17:27:39 The Cluster Service brought the Resource Group
"SQL02" online.
[00:00:39]
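
To see where the time goes within each move, the timestamps above can be split into three phases: going offline on NODE01, the wait until NODE02 starts bringing the group online, and coming online on NODE02. The following Python sketch does this with the log entries transcribed by hand; the event labels and function names are invented for illustration, and the log itself does not say what happens during the wait.

```python
# Per-group phase breakdown of the failover log quoted above.
# The timestamps come from the post; the labels "offline_start",
# "offline_done", "online_start", "online_done" are hypothetical names.
EVENTS = [
    ("17:25:30", "Cluster Group", "offline_start"),
    ("17:25:30", "Cluster Group", "offline_done"),
    ("17:25:55", "Cluster Group", "online_start"),
    ("17:25:59", "Cluster Group", "online_done"),
    ("17:26:11", "MSDTC", "offline_start"),
    ("17:26:12", "MSDTC", "offline_done"),
    ("17:26:34", "MSDTC", "online_start"),
    ("17:26:47", "MSDTC", "online_done"),
    ("17:26:32", "SQL01", "offline_start"),
    ("17:26:39", "SQL01", "offline_done"),
    ("17:27:00", "SQL01", "online_start"),
    ("17:27:11", "SQL01", "online_done"),
    ("17:27:00", "SQL02", "offline_start"),
    ("17:27:08", "SQL02", "offline_done"),
    ("17:27:28", "SQL02", "online_start"),
    ("17:27:39", "SQL02", "online_done"),
]


def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to seconds since midnight."""
    hours, minutes, secs = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + secs


def phase_breakdown(events):
    """Return, per group, the seconds spent in each phase of the move."""
    times = {}
    for stamp, group, label in events:
        times.setdefault(group, {})[label] = to_seconds(stamp)
    return {
        group: {
            "offline": t["offline_done"] - t["offline_start"],
            "gap": t["online_start"] - t["offline_done"],
            "online": t["online_done"] - t["online_start"],
            "total": t["online_done"] - t["offline_start"],
        }
        for group, t in times.items()
    }


def overall_seconds(events) -> int:
    """Elapsed seconds from the first to the last logged event."""
    stamps = [to_seconds(stamp) for stamp, _, _ in events]
    return max(stamps) - min(stamps)


if __name__ == "__main__":
    for group, phases in phase_breakdown(EVENTS).items():
        print(group, phases)
    print("overall:", overall_seconds(EVENTS), "s")
```

On these numbers, the wait between "brought offline" on NODE01 and "attempting to bring online" on NODE02 is 20 to 25 seconds for every group, which is more than half of each move, so that interval looks like the first place to investigate.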
By the way, this shows why I was talking about 6 moves. There are 4
moves here, which should be followed by 2 final moves in order to
distribute the groups equally between the two servers. We did not
perform the 2 final moves because of our lack of confidence in the
NODE01 server, which has crashed twice since the beginning of the
year :-(
In summary, I am looking for:
(1) Advice on the most efficient way to move all the groups of a
cluster;
(2) Similar Event Log analyses from your clusters.
Finally, I need to emphasize that I have no clue about the SQL "load"
during the failover. I guess it would be very interesting to have a
graph of failover duration versus load :-))
Many thanks in advance and best regards,
Patrice