We've been dealing with this issue for 3 weeks now, and i've gotten tired,
frustrated, and as some of you may have noticed: a little cranky.
We have an application that we wrote. It runs fine on our development 2005
machine, it runs fine on their test 2005 server, but when they installed it
on their live 2005 machine, it runs differently. Our developement server is
not too impressive (4GB, dual core), their test server was cheapo machine
hanging around. Their live server is quad core 8GB monster.
There is one representative (note: representative, for examples sake, a case
in point) query that takes 90k logical reads here, and 90k logical reads on
the test server, but 200k logical reads on the live server - and runs
noticably slower. The execution plans are identical on development and test,
but different on live. The query can be tuned, and then on live it has an
identical execution plan to development and test.
The data is identical (backuped and restord). All statistics are up to date
WITH FULLSCAN.
The question is why is it running poorly on live. In fact the question isn't
even why is it running poorly on live, why is it running differenly on live?
Boss: "Ian, what i don't understand is why is it running differently on
live? It runs fine here on 2005 with the ninety thousand reads. Why is it
running differently on live."
Ian: (very nearly quoting Andrew J. Kelly) "That's just the way it is.
The optimizer in 2005 is deciding to run the query different. This sort of
thing is going to happen when people move to 2005 from 2000. It happened
when people moved from 7 to 2000, and it happened when people moved from 65
to 7. The optimizer in 2005 is choosing to run the query differently, and
there's nothing you can do about it except to tune the query. Most things
will be faster in 2005 and those aren't the ones we see. There will be some
queries that perform worse. And there's no choise to re-write them, and when
they do they'll be faster than they were even on 2000."
Boss: "i understand that. But this isn't an upgrade from 2000. We're
2005 here, they tested it on 2005. Why is it running differently on their
live 2005 server? Of course you could tune queries, but the queries already
run fine on 2005. What's going on that's making it perform poorly on live?"
Ian: "i don't know. Nobody can answer that question. The optimizer is
deciding what to do, and there's little you can do to understand why it
chooses to do something some way. There's no way to understand why the
optimizer chose to do something some way. "
So, SQL Server community, i turn to you tired and exhausted. Is the answer,
"i don't know?" Is the answer that they need to move the database back to a
rinky dink server, which can run it better than the super-expensive live
monster they have, that runs 20 other databases fine? Of course we could
re-tune queries, and in an ideal world we would have retuned them 3 weeks
ago and be done with it. But that's not the question. Customer wants to know
why it runs good on other 2005 machines, but not good on the server they
want it to be installed on.
And my hand-waving arguments about optimizer differences between 2000 and
2005 just isn't gonna work.
note: my past confusion came from knowing that years ago it was on 2000, but
it had since been ported to 2005 - where the queries in question run
identically as they did on 2000.
Flora - 17 Jul 2008 15:57 GMT
Ian,
Understanding why any optimizer picks a certain plan can be one of the most
frustrating (and yet rewarding) experiences. I am not a SQL Server expert.
I've converted from UDB/DB2/Oracle background. But I believe that optimizers
all operate logically and there has to be a difference that you have not
noted.
You mentioned that the statistics were up-to-date, but did not mention
whether they were identical. (yes, it sounds stupid but when something this
confusing happens - it sometimes hlps to go back to square one and make sure
the simple things really do match up and that you have what you think you
have.
Then, is there excessive fragmentation in the production file systems ?
Are the indexes on the production tables fragmented? Are the test and
production systems set for different levels of parallelization?
You may have already verified all this, but I do not think the optimizer is
magic. Something is different and it may be something more basic that you
have not thought to recheck because it seemed to be obvious.
Ian Boyd - 17 Jul 2008 19:51 GMT
> You mentioned that the statistics were up-to-date, but did not mention
> whether they were identical. (yes, it sounds stupid but when something
[quoted text clipped - 3 lines]
> the simple things really do match up and that you have what you think you
> have.
Well, you're right. After reading off key densities, bucket sizes, and
distinct value counts for all the indexes of the tables involved in the
query on both servers, the index on one side seemed to indicate that it was
about 25% empty.
Looking for that index in both execution plans showed that the "live" server
wasn't using that index at all.
Ian: "How could it be that this index is mostly empty here?"
Developer#1: "Oh, you know what that could be. That was probably when we
restored their database here we began testing; and [tester#1] created an
event with 30,000 people.
After we restored their live database again on our development server - the
query runs with the same poor execution plan. It runs 50% faster here, but
the same plan at least. i'll chalk up the doubled execution time to their
busy server.
Now we just have to explain to them how more data made it faster.
Thank you.
And thanks be to Kevin, Andrew, Russell, Roy and Eric - wherever you are.
Linchi Shea - 18 Jul 2008 01:09 GMT
So, now you are seeing exactly the same plan in prod and dev. Are they doing
the same 200k logical reads in both environments?
Also, how long does the slow query typically run? If it runs sufficiently
long, I'd suggest looking at the I/O block size (Disk Bytes/Read) and the
latency (Disk sec/Read) on the drive where the data file(s) resides during
the query execution. Perhaps, there is a significant difference in the I//O
performance in getting these pages into the buffer pool. In addition to that,
if the amount of data that this query touches can fit into the buffer pool,
execute the query several times and eliminate physical reads, and compare the
performance in both environments (if it's possible to do that). This helps to
eliminate the I/O subsystem being a factor.
Linchi
> > You mentioned that the statistics were up-to-date, but did not mention
> > whether they were identical. (yes, it sounds stupid but when something
[quoted text clipped - 28 lines]
>
> And thanks be to Kevin, Andrew, Russell, Roy and Eric - wherever you are.
TheSQLGuru - 17 Jul 2008 16:10 GMT
1) if you REALLY TRULY want to have the best shot at understanding what is
going on and why, you need to hire a performance tuning expert to come in
and evaluate the system. There are simply too many variables and things at
play for us to hunt-and-peck a solution from afar. Therer are a number of
such people on this forum and more than a few consulting companies that
specialize is such work. Note that there is no guarantee that a true reason
will be found.
2) Likewise you can hire Microsoft Consulting Services to come in and figure
it out.
3) IIRC, you were given lots of advice on configurations to evaluate in your
previous threads (chief among them MAXDOP). Did you evaluate every single
suggestion made to you?
4) Are the sql server versions EXACTLY IDENTICAL? Are both same bit level
(32/64) and patch level?
5) What is the memory and server settings on the mondo box and how does the
IO situation differ from test?
6) You certainly have been (more than) a bit cranky. And I assure you that
has affected peoples' desire to help you now and in the future.
Note that 4 and 5 above are questions that do not require an answer. As I
led off with, it is VERY unlikely that we can solve this problem to your
satisfaction without access to both your test and production servers.

Signature
Kevin G. Boles
Indicium Resources, Inc.
SQL Server MVP
kgboles a earthlink dt net
> We've been dealing with this issue for 3 weeks now, and i've gotten tired,
> frustrated, and as some of you may have noticed: a little cranky.
[quoted text clipped - 59 lines]
> but it had since been ported to 2005 - where the queries in question run
> identically as they did on 2000.
Ian Boyd - 17 Jul 2008 19:14 GMT
> 6) And I assure you that has affected peoples' desire to help you now and
> in the future.
k