SQL Server Forum / General / Other SQL Server Topics / June 2007
Strange performance issue with UPDATE FROM
|
|
Thread rating:  |
Richard - 26 Jun 2007 16:31 GMT Hello!
I have this piece of SQL code:
UPDATE a SET Field1 = c.Field1 FROM a INNER JOIN b ON a.GUID1 = b.GUID1 INNER JOIN c ON b.GUID2 = c.GUID2 WHERE c.Type = 1 AND @date BETWEEN b.DateFrom AND b.DateTo
This query takes hours to complete.
Now while trying to find out what's causing the poor performance (it surely looks simple enough!) I've rewritten it to use temp tables:
SELECT a.GUID1, a.Field1, c.Type, b.DateFrom, b.DateTo INTO #temptable FROM a INNER JOIN b ON a.GUID1 = b.GUID1 INNER JOIN c ON b.GUID2 = c.GUID2 WHERE c.Type = 1 AND @date BETWEEN b.DateFrom AND b.DateTo
UPDATE a SET Field1 = subsel.Field1 FROM (SELECT * FROM #temptable) AS subsel WHERE subsel.GUID1 = a.GUID1
Now it completes in 10 seconds.
My question is why? Am I wrong in saying that the two batches above produce same results? Is there something I've missed about the UPDATE FROM syntax? Why would the first query perform THAT poorly?
Table sizes: a: 24k rows b: 268k rows c: 260k rows
GUIDs are of type uniqueidentifier.
Any answers appreciated!
Regards, // Richard
Erland Sommarskog - 26 Jun 2007 22:46 GMT > I have this piece of SQL code: > [quoted text clipped - 27 lines] > produce same results? Is there something I've missed about the UPDATE > FROM syntax? Why would the first query perform THAT poorly? One problem with UPDATE FROM is that you can update the same row several times if your join conditions are not unique. What happens if you run:
UPDATE a SET Field = (SELECT c.Field1 FROM c JOIN b ON c.GUID2 = b.GUID2 WHERE a.GUID1 = b.GUID1 AND c.type = 1 AND @date BETWEEN b.DateFrom AND b.DateTo)
I'm most interested to know if the query succeds at all, or if it fails with an error message. From the table sizes you have indicated, I would expect an error, but I don't know how your tables are related.
As for the performance, investigating the query plan can give some ideas. Without seeing query plans, the table definitions, the indexes etc, it's difficult to say much useful.
Which version of SQL Server are you using?
Is @date a parameter to a stored procedure or a local variable?
 Signature Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
Books Online for SQL Server 2005 at http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx Books Online for SQL Server 2000 at http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
Roy Harvey - 27 Jun 2007 01:15 GMT >One problem with UPDATE FROM is that you can update the same row >several times if your join conditions are not unique. What happens if [quoted text clipped - 7 lines] > AND c.type = 1 > AND @date BETWEEN b.DateFrom AND b.DateTo) BE CAREFUL WITH THIS!!
One thing that could happen from this UPDATE is that it sets Field = NULL for rows that are untouched by the UPDATEs in the original post. That happens if there are rows in the table being UPDATEd that do not have matches in the subquery. The FROM/JOIN prevents that in the original versions. I don't know if any such non-matching rows exist, but it certainly seems possible with the date range and type tests dropping rows from the subquery.
I believe this query would show if the original UPDATEs using FROM result in the same row updated more than once, without possible impact on the data.
SELECT A.PrimaryKey, count(*) FROM A JOIN B ON A.GUID1 = B.GUID1 JOIN C ON B.GUID2 = B.GUID2 WHERE C.type = 1 AND @date BETWEEN B.DateFrom AND B.DateTo) GROUP BY A.PrimaryKey HAVING COUNT(*) > 1
Roy Harvey Beacon Falls, CT
Erland Sommarskog - 27 Jun 2007 08:31 GMT >>One problem with UPDATE FROM is that you can update the same row >>several times if your join conditions are not unique. What happens if [quoted text clipped - 12 lines] > One thing that could happen from this UPDATE is that it sets Field = > NULL for rows that are untouched by the UPDATEs in the original post. Right, Roy. I didn't add a WHERE clause, because it was more intended as a test to see if the query would work at all. But I should have included the warning.
 Signature Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
Books Online for SQL Server 2005 at http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx Books Online for SQL Server 2000 at http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
--CELKO-- - 27 Jun 2007 01:27 GMT I hope you know better than to use GUIDs in an RDBMS except for replication, never to use reserved words like "date" for data element names or vague names like "type" -- the basic ISO-11179 rules, etc. You also seem to confuse fields and columns, but let's skip the signs of poor SQL practices for now.
If you rewrite this in statement into Standard SQL, you will see if the join returns multiple rows instead of a scalar value. That would let us know that the schema has serious design problems. The illegal syntax you used can do multiple updates on each row; talk to an old Sybase programmer about this problem.
UPDATE A SET field1 = (SELECT C.field1 FROM B, C WHERE A.guid1 = B.guid1 AND B.guid2 = C.guid2 AND C.somekind_type = 1 AND @my_date BETWEEN B.start_date AND B.end_date);
Richard - 27 Jun 2007 14:40 GMT > I hope you know better than to use GUIDs in an RDBMS except for > replication, never to use reserved words like "date" for data element > names or vague names like "type" -- the basic ISO-11179 rules, etc. > You also seem to confuse fields and columns, but let's skip the signs > of poor SQL practices for now. Well, yes, I actually do know better. The columns, variables and tables in the query are renamed as I don't want to post production code on the Internet. Also excuse the mixup between fields and columns, I'm not a native English speaker.
One big problem (as i see it, and I'm by no means a SQL expert) is that the db in question uses uniqueidentifier primary keys with clustered indexes on those almost EVERYWHERE, and there is nothing I can do to change that at the moment...Constructs like
FROM z INNER JOIN a ON ..GUID = ..GUID INNER JOIN b ON ..GUID = ..GUID INNER JOIN c ON ..GUID = ..GUID INNER JOIN d ON ..GUID = ..GUID LEFT OUTER JOIN eON ..GUID = ..GUID AND VERSION = ( SELECT MAX(VERSION) FROM f WHERE ..GUID = ..GUID)
make the queries run painfuly slow.
So the question is, is there ANYTHING I can do to optimize this type of queries or is a redesign the only thing that would help?
@Erland: I use MS SQL 2000 server and @date is a local variable =)
Erland Sommarskog - 27 Jun 2007 22:27 GMT > One big problem (as i see it, and I'm by no means a SQL expert) is > that the db in question uses uniqueidentifier primary keys with > clustered indexes on those almost EVERYWHERE, and there is nothing I > can do to change that at the moment...Constructs like Clustered indexes on GUIDs requires a lot of skill in monitoring fragmentation.
With the standard setup with a high fill factor, clustering on GUIDs is bad, because you get page splits and fragmentation galore.
SQL Server MVP Greg Linwood suggested to me that clustering on GUIDs may still be good for INSERT performance, if you create the indexes with a low fill factor, say 50%. Now when you add new rows, there are good chance that there is a hole to plug into. When the table starts to fill up, you need to reindex again. But this strategy requires careful planning, and is nothing for the armchair DBA.
If you are stuck with these clustered indexes, make you sure you set up a reindexing job that runs regularly, and you should probably aim at a lower fill factor. If not 50%, maybe 75-80%. It depends a bit how big the INSERT frequency is. And use DBCC SHOWCONTIG to monitor fragmentation.
> FROM z > INNER JOIN a ON ..GUID = ..GUID [quoted text clipped - 9 lines] > So the question is, is there ANYTHING I can do to optimize this type > of queries or is a redesign the only thing that would help? With the information you have posted, it's impossible to tell. But I would at least give defragmentation with DBCC DBREINDEX a chance first if DBCC SHOWCONTIG show horrendeous numbers.
You could also consider adding covering non-clustered index on tables where only a few columns of many are involved in the query.
 Signature Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
Books Online for SQL Server 2005 at http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx Books Online for SQL Server 2000 at http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
--CELKO-- - 28 Jun 2007 20:51 GMT >> Well, yes, I actually do know better. The columns, variables and tables in the query are renamed as I don't want to post production code on the Internet. << Understood. But look at how many production DBs posted here have such flaws.
>> Also excuse the mix up between fields and columns, I'm not a native English speaker. << And that means that your English is probably better than a native English speaker :) But my objection is not English; it is RDBMS versus File Systems. A big problem I see when I teach SQL is students using SQL as if it were a sequential file system -- no constraints, improper data types. no DRI actions, depending on applications to do what DDL should do, etc.
>> One big problem (as i see it, and I'm by no means a SQL expert) is that the db in question uses uniqueidentifier primary keys with clustered indexes on those almost EVERYWHERE,.. << You are doing very well for an amateur :) Yes, this is a major problem and not just for performance. A uniqueidentifier cannot be a key in a properly designed RDBMS by definition -- it is an attribute of the hardware and not the data model. You cannot verify it with a trusted external source, so you have no data integrity. And it is bitch to write them out without making an error.
The programmers who do this are trying to mimic pointer chains and build a linked list in SQL. They missed the whole idea of RDBMS.
>> So the question is, is there ANYTHING I can do to optimize this type of queries or is a redesign the only thing that would help? << Not much. Clustered indexes are not going to help with the random nature of a uniqueidentifier. Perhaps the best thing you can do is kill the guy that did this to you and prevent him from coding again.
Hugo Kornelis - 27 Jun 2007 18:39 GMT > The illegal >syntax you used can do multiple updates on each row; talk to an old >Sybase programmer about this problem. Hi Joe,
It's not illegal. It's non-standard, but in SQL Server (check the name of this group!!), it's legal and documented. And in many cases much faster than the ANSI-standard equivalent - an ommission I keep kicking the optimizer team for until they finallly get it right, but until that day we developers have to code our way around it.
>UPDATE A SET field1 > = (SELECT C.field1 FROM B, C > WHERE A.guid1 = B.guid1 > AND B.guid2 = C.guid2 AND C.somekind_type = 1 > AND @my_date BETWEEN B.start_date AND B.end_date); Oh boy, I really hope that Richard runs a quick test or two (oh wait, one should be enough with such glaring errors) before deploying this "alternative" in production.
I think that this snippet of code is a good contender for not only the worst code ever posted to usenet, but also for the worst formatting. If I were a book author, I'd now have to ask you for permission to include it as a sample of bad query writing (though the reader probably wouldn't believe it -hah! as if I even *could* come up with something this bad).
 Signature Hugo Kornelis, SQL Server MVP My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis
--CELKO-- - 28 Jun 2007 21:32 GMT >> It's not illegal. It's non-standard, but in SQL Server (check the name of this group!!), it's legal and documented. And in many cases much faster than the ANSI-standard equivalent - an omission I keep kicking the optimizer team for until they finally get it right, but until that day we developers have to code our way around it. << Okay, would accept "proprietary and in total violation of the programming model in ANSI/ISO Standard SQL and RDBMS concepts" instead of "illegal"?
>> Oh boy, I really hope that Richard runs a quick test or two (oh wait, one should be enough with such glaring errors) before deploying this alternative" in production. << I am not happy with the idea that one value is being replicated among multiple tables in the same schema. I have the feeling that we should not be doing this UPDATE at all. But I have no DDL, so the best I can do is translate from dialect to Standard directly.
>> I think that this snippet of code is a good contender for not only the worst code ever posted to Usenet, but also for the worst formatting. << Hey, I am not even close the worst code ever posted to Usenet!! You ought to see the stuff people send me in emails :) I just finished one "quickie Email consult" for an old friend where we removed 70-75% of the lines of code and reduced the time from 98 minutes to 4 minutes for time-to-completion for a complex report using CASE expressions instead of sub-queries. It took three emails and ~15 minutes of my time. If I could re-do the schema design, I could get it down to 1 minute or less.
What I was not expecting in that exercise was that replacing INNER JOIN operators with the WHERE clause syntax improved performance in DB2. I have no idea why, but perhaps their optimizer has a bias in ordering the tables. The infixed JOIN operators are required to behave as if they are executed in left-to-right order, while the old "FROM.. WHERE.." syntax is not so required.
Alex Kuznetsov - 28 Jun 2007 22:12 GMT > Hello! > [quoted text clipped - 41 lines] > Regards, > // Richard Not arguing with other party on correctness/standards etc., when you create a temp table, you get statistics on it. So them optimizer has a better estimate of number of rows to modify and may choose a better plan.
http://sqlserver-tips.blogspot.com/
|
|
|