Re: problem using identity column as primary key
You are accessing this site in a read-only mode. For full access to all member benefits, including message posting, please login or register. Registration is completely free, simple, and takes only a few seconds.
Login |
Free SQLMonster.com registration |
Whole discussion thread
The message you are replying to and its parents are listed in the reverse order with the most recent posts first. This might not be the whole discussion thread. To read all the messages in this thread please click here.
Re: problem using identity column as primary key
| J. M. De Moor | 25 Feb 2004 01:10 |
Petrik
> One problem with natural keys that hasn't come up in this discussion > is the fact that it is very difficult to find a natual key that is > quaranteed not to change. That's why some put ON UPDATE CASCADE in their DDL, right?
Joe De Moor
|
| Petrik Salovaara | 25 Feb 2004 00:00 |
> The simple fact is that using Identity columns as PK DO work. And it > will keep on working. And all the theorizing in the world is not > going to stop it from working. You are right. Not only it works, but it works much better than a natural key. All the problems described by CELKO are easily solved using decent design, integrity check tools and data transformation utilities.
One problem with natural keys that hasn't come up in this discussion is the fact that it is very difficult to find a natual key that is quaranteed not to change. Hey, even person's SSN will change in some situations!
Cheers, Petrik
|
| Bob Boran | 24 Feb 2004 22:09 |
> If I had only perfect users who never re-entered the same data twice > (and got two unique autonumbers on the dups), Unique constraint would have caught this and not allowed the second entry.
> who always typed in key > values without typos, etc. Who lets the end user even know what the Identity value is? Let alone let them type it in somewhere.
> >> Identity is not a datatype, it is just a way of generating a value. > << > > Exactly! And what is the definition of a columns? It has a known > datatype! What? *Bob looks perplexed trying to figure how Joe's reply applies.....*
> Wrong. Let's go back to the Motor Pool. Let's assume that we had used > a VIN instead of an IDENTITY. Let's further assume that a sabateur (not [quoted text clipped - 4 lines] > pass the CHECK(). If started at a referencing table, the DRI would have > blocked him. So, you would trade daily performance, scalability, ease of use, for the potential of a sabateur attacking your DB. Can you say Daily offsite backup? How about competent DBA? Security? Even silly little "dangerous" people like me realize that the real world has it's tradeoffs. Perhaps you have spent to much time in acedemia.
> And Dr. Codd, me, and everyone else in the Relational Database arena > will still say "Wrong!"; and we will also remind you that a row is not a > record. The simple fact is that using Identity columns as PK DO work. And it will keep on working. And all the theorizing in the world is not going to stop it from working.
> --CELKO-- |
| Joe Celko | 24 Feb 2004 21:20 |
>> Careful scripting and testing before doing any data moves would have solved this problem. I can't see how it applies to our discussion. <<
If I had only perfect users who never re-entered the same data twice (and got two unique autonumbers on the dups), who always typed in key values without typos, etc. then I might feel the way you do and believe that IDENTITY is safe.
Any of the more experienced programmer here are probably having flashbacks to the first time they said "But users won't do THAT!"; but the users did.
But here in the **real** world, we need check digits, syntax edits and tools that can recover when **real** people screw up. This is funny because I am usually the guy who gets an irate reply to the effect, "Shut up with all that theory and math crap, just give me the fastest, easier kludge that will make this thing run NOW! NOW!" from the newbies.
While nothing might be idiot-proof, at least make it idiot-retardant. "Against stupidity the gods themselves struggle in vain." - Die Jungfrau von Orleans; Friedrich von Schiller (1759-1805).
>> Identity is not a datatype, it is just a way of generating a value. <<
Exactly! And what is the definition of a columns? It has a known datatype!
>> If an artificially generated value wasn't going to work for you, then any natural value would run into the same problem. <<
Wrong. Let's go back to the Motor Pool. Let's assume that we had used a VIN instead of an IDENTITY. Let's further assume that a sabateur (not just a klutz) went into the tables and tried to trash all the VINs. The first problem is a CHECK() constraint with a regular expression to validate the VIN. Okay, he is a sabateur and wants to keep going. He writes a program to generate fake VINs or to scramble real ones so as to pass the CHECK(). If started at a referencing table, the DRI would have blocked him.
Now he has to go to the referenced table and scrambles the real VINs while adding fakes. The ON UPDATE CASCADE actions push those scrambled VINs down to the referencing tables; the DB is consistent but not correct. The fake VINs sit around the referenced table, unused.
We catch the bad guy, or discover the problem in an ISO-9000 or Sarbanes-Oxley audit. Gasp! What can we do? Well, what we actually did was go out to the Motor Pool and read the VINs off of the vehicles.
Try doing that with a sequential IDENTITY number.
1) There is no CHECK() on IDENTITY, every number is as good as the next. A klutz can do this, no sabateur needed.
2) Since I know that if (n) is in the set of PRIMARY KEYs , I have a huge chance of also having (n-k) as a valid PRIMARY KEY, too. That lets me go to a refrencing table and decrement all the FOREIGN KEYs. Notice the use of uppercase here ("primary key" is not the same as "PRIMARY KEY"; IDENTITY cannot be the first, but it can be the second).
3) Ditto for trashing the referenced table. But there is a better way: INSERT INTO Foobar SELECT a,b,c FROM Foobar;
This is a relational version of a DoS attack. IDENTITY will cheerfully flood the table with redudant data. A klutz who is not sure if he sent the data or not can do this, no sabateur needed.
>> Since this [pointer chains] obviously works, I don't know that any rebuttal is necessary. <<
Then why did we move from IDMS, et al to relational DBMS systems? Maybe because they did not work?
>> I said it once, I will say it again. The ONLY purpose of a PK is to provide a uniqueness to the record [sic]. <<
And Dr. Codd, me, and everyone else in the Relational Database arena will still say "Wrong!"; and we will also remind you that a row is not a record.
--CELKO-- =========================== Please post DDL, so that people do not have to guess what the keys, constraints, Declarative Referential Integrity, datatypes, etc. in your schema are.
|
| Bob Boran | 24 Feb 2004 19:16 |
> I posted a horror story > a few weeks back about a motor pool system. In bulk moving the tables > from one platform to another, the IDENTITY column got renumbered. > Unverifiable keys are not my idea of safe and maintainable. Careful scripting and testing before doing any data moves would have solved this problem. I can't see how it applies to our discussion.
> The insanely expensive manual reconstruction was a good thing in one > way; we found a lot of hanky-panky had been going out. VIN numbers, > commercial auto tags and the generated ids did not always match to the > same vehicle. Proper db maintenance and decent coding are your solutions to this problem. There is nothing here that is intrinsic to Artificial Keys being a problem.
> >> .. A nice clean number that requires very little storage space, << > > Who cares about storage space in the 21-st Century? Now who is being lazy?
> If you will go back to the prior posting, you will see a simple > demonstration that IDENTITY has no integrity as a Foreign Key. I got > that one from the Motor Pool job I mentioned. Identity is not a datatype, it is just a way of generating a value. If an artificially generated value wasn't going to work for you, then any natural value would run into the same problem. If you can't keep your Foreign Key entegrity, then you need to re-examine the design of the database, not the choice of using an Identity column.
> >> .. something that I know is unique without having to use 2-10 columns > looking for a natural key, ..<< > a) It does not serious affect speed or space. Have you never created a multi join query, using composite keys as the foriegn key? Goodness man, it can take up to 10X as long to execute! Not to mention the readability of the code.
How about just doing a join on a textual field? That can take up to 3X as long.
> b) If you want your data model to be correct and consistent, and your > reality has a (n) column natural key, then you damn well have to enforce > them. Unique constraint provides that for me.
> The only consistent is that ALL the tables are linked together with > "homemade pointer chains" that use links of one "magic, all-purpose, > universal" data type, thus returning us to the days of IDMS databases. Since this obviously works, I don't know that any rebuttal is necessary.
I said it once, I will say it again. The ONLY purpose of a PK is to provide a uniqueness to the record.
Finally, I can't see that you have made any points that will convice anyone that natural keys have an advantage over artificial keys. Even MS teaches artificial PK. The industry standard has become artificial PK. Don't you think maybe you got left behind somewhere?
|
| Joe Celko | 24 Feb 2004 18:12 |
>> My experience has taught me that using an artificial PK in every table is the safest, easiest, most maintainable design technique for today's large databases. <<
You are right about one of the three things: auto numbering and mimicking a tape file is the Easiest way to "unique-ify" a row. It requires no thought, no research and no effort. You can skip that year or two of learning the Relatioanl model and SQL; you can mimic what you already know about file systems.
Of course, this is like the lumberjack who bought a chain saw, thought reading the manual was too hard and never put gasoline in it. He could still cut down trees with his new tool, but it took longer.
But it is not safe and it is not maintainable. I posted a horror story a few weeks back about a motor pool system. In bulk moving the tables from one platform to another, the IDENTITY column got renumbered. Unverifiable keys are not my idea of safe and maintainable.
The insanely expensive manual reconstruction was a good thing in one way; we found a lot of hanky-panky had been going out. VIN numbers, commercial auto tags and the generated ids did not always match to the same vehicle.
>> .. A nice clean number that requires very little storage space, << Who cares about storage space in the 21-st Century? You go to jail or kill people with bad data. You do not go to jail or kill people with a cheaper, higher density disk drive and extra RAM.
>> .. something that can be easily used as a FK without causing potential integrity issues when data needs to be update, .. <<
If you will go back to the prior posting, you will see a simple demonstration that IDENTITY has no integrity as a Foreign Key. I got that one from the Motor Pool job I mentioned.
>> .. something that I know is unique without having to use 2-10 columns looking for a natural key, ..<<
1) At the risk of a good straight line, size is not important.
a) It does not serious affect speed or space. This is 2004 and not 1954. Magnetic core has been replaced with silicon chips. Timing for various operations on a typical 1GHz PC in summer 2001:
execute single instruction: 1 nsec fetch word from L1 cache memory: 2 nsec fetch word from main memory: 10 nsec fetch word from consecutive disk location: 200 nsec fetch word from new disk location (seek): 8,000,000 nsec
Anyone got the current numbers and the price of a byte of storage today?
b) If you want your data model to be correct and consistent, and your reality has a (n) column natural key, then you damn well have to enforce them.
c) If you have a natural key, then an exposed physical locator is redundant at best and demonstatably dangerous. Uniqueness is necessary but it is not sufficient for a key.
>> ... and most of all consistency in my design model. << The only consistent is that ALL the tables are linked together with "homemade pointer chains" that use links of one "magic, all-purpose, universal" data type, thus returning us to the days of IDMS databases.
>> Now, does that mean we don't also look for a natural key? << For most of the "ID-iots", yes, it does! Read the DDL that gets posted here.
>> No, a natural key is still identified (if it is available) and a Unique constraint is applied to it to enforce that rule. <<
LOL! If it is not available, then call it's press agent and see when it IS available. How do you tell the entities apart in the reality of the model now?
And thus once more the sole original purpose of all databases, relational or otherwise, is defeated by lazy programmers! We've got redundancy again! But this time, the logical/physical tiers are crossed and all the advantages of Software Engineering since 1970 are lost, too.
From an operational viewpoint, life really stinks. When we had IDMS, IMS, TOTAL and all the other network databases, they came with utility programs to fix broken pointer chains. SQL doesn't.
--CELKO-- =========================== Please post DDL, so that people do not have to guess what the keys, constraints, Declarative Referential Integrity, datatypes, etc. in your schema are.
|
| Bob Boran | 24 Feb 2004 17:14 |
You obviously feel very strong about this issue. Please lean back on this couch for a minute and lets talk about it. Tell me about your days as a young DBA. Any early tramas that may have left you feeling that artificial primary keys are inadequite? Let me turn on this soft soothing music to help calm your troubled spirit.
There, now that your breathing has returned to normal..
I am not going to pretend that I have spent years getting a masters or doctorate in Database Design. I will even admit up-front that I don't even have a CS degree. But what I do have is 10 years of experience designing, implementing and maintaining enterprise databases for my clients. (yes, go ahead and make your negative remark here, I will even leave space) ------------ This space reserved for Celco
----------- Ok, now that you are done.
My experience has taught me that using an artificial PK in every table is the safest, easiest, most maintainable design technique for today's large databases. A nice clean number that requires very little storage space, something that can be easily used as a FK without causing potential integrity issues when data needs to be update, something that I know is unique without having to use 2-10 columns looking for a natural key, and most of all consistency in my design model. Now, does that mean we don't also look for a natural key? No, a natural key is still identified (if it is available) and a Unique constraint is applied to it to enforce that rule.
I will stop there in order to keep my post long enough to say what needed to be said, but short enough to be readable.
|
| Joe Celko | 24 Feb 2004 16:55 |
>> I am thinking of creating an identity column to use it as primary key in my table. <<
**by definition** a key is a subset of the attributes of an entity. It is a LOGICAL construct. The proprietary, non-relational IDENTITY property is a PHYSICAL state of a particular release of a particular product on of a particular machine at of a particular time. It cannot have anything whatsoever to with your data model (you did build a data model, build a data dictionary and research your industry weeks before you started coding, didn't you?)
It is a handle to the row. This means that you are faking a sequential file's positional record number, so you can reference the physical storage location? Sure, I'd do this if I want to lose all the advantages of an abstract data model, SQL set oriented programming, carry extra data and destroy the portability of code!
More and more programmers who have absolutely no database training are being told to design a database. They are using GUIDs, IDENTITY, ROWID and other proprietary auto-numbering "features" in SQL products to imitate either a record number (sequential file system mindset) or OID (OO mindset) since they don't know anything else.
Experienced database designers tend toward intelligent keys they find in industry standard codes, such as UPC, VIN, GTIN, ISBN, etc. They know that they need to verify the data against the reality they are modeling. A trusted external source is a good thing to have.
The IDENTITY column is a holdover from the early programming languages which were very close to the hardware. For example, the fields (not columns; big difference) in a COBOL or FORTRAN program were assumed to be physically located in main storage in the order they were declared in the program. The languages have constructs using that model -- logical and physical implementations are practically one! The data has meaning BECAUSE of the program reading it (i.e. the same bits could be a character in one program and be an integer in another)
The early SQLs were based on existing file systems. The data was kept in physically contiguous disk pages, in physically contiguous rows, made up of physically contiguous columns. In short, just like a deck of punch cards or a magnetic tape. Most programmer still carry that mental model, which is why I keep doing that rant about file vs. table, row vs. record and column vs. field.
But physically contiguous storage is only one way of building a relational database and it is not the best one. The basic idea of a relational database is that user is not supposed to know *how* or *where* things are stored at all, much less write code that depends on the particular physical representation in a particular release of a particular product on particular hardware at a particular time.
One of the biggest errors is the IDENTITY column (actually property, not a column at all) in the Sybase/SQL Server family. People actually program with this "feature" and even use it as the primary key for the table! Now, let's go into painful details as to why this thing is bad.
The first practical consideration is that IDENTITY is proprietary and non-portable, so you know that you will have maintenance problems when you change releases or port your system to other products. Newbies actually think they will never port code! Perhaps they only work for companies that are failing and will be gone. Perhaps their code is such crap nobody else want their application.
But let's look at the logical problems. First try to create a table with two columns and try to make them both IDENTITY. If you cannot declare more than one column to be of a certain data type, then that thing is not a datatype at all, by definition. It is a property which belongs to the PHYSICAL table, not the LOGICAL data in the table.
Next, create a table with one column and make it an IDENTITY. Now try to insert, update and delete different numbers from it. If you cannot insert, update and delete rows from a table, then it is not a table by definition.
Finally create a simple table with one IDENTITY and a few other columns. Use a few statements like
INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1'); INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2'); INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');
To put a few rows into the table and notice that the IDENTITY sequentially numbered them in the order they were presented. If you delete a row, the gap in the sequence is not filled in and the sequence continues from the highest number that has ever been used in that column in that particular table. This is how we did record numbers in pre-allocated sequential files in the 1950's, by the way. A utility program would then "pack" or "compress" the records that were flagged as deleted or unused to move the empty space to the physical end of the physical file.
But now use a statement with a query expression in it, like this:
INSERT INTO Foobar (a, b, c) SELECT x, y, z FROM Floob;
Since a query result is a table, and a table is a set which has no ordering, what should the IDENTITY numbers be? The entire, whole, completed set is presented to Foobar all at once, not a row at a time. There are (n!) ways to number (n) rows, so which one do you pick? The answer has been to use whatever the *physical* order of the result set happened to be. That non-relational phrase "physical order" again!
But it is actually worse than that. If the same query is executed again, but with new statistics or after an index has been dropped or added, the new execution plan could bring the result set back in a different physical order.
Can you explain from a logical model why the same rows in the second query get different IDENTITY numbers? In the relational model, they should be treated the same if all the values of all the attributes are identical.
Using IDENTITY as a primary key is a sign that there is no data model, only an imitation of a sequential file system. Since this "magic, all-purpose, one-size-fits-all" pseudo-identifier exists only as a result of the physical state of a particular piece of hardware at a particular time as read by the current release of a particular database product, how do you verify that an entity has such a number in the reality you are modeling?
You will see newbies who design tables like this:
CREATE Drivers (driver_id IDENTITY (1,1) NOT NULL PRIMARY KEY, ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn), vin CHAR(17) NOT NULL REFERENCES Motorpool(vin));
Now input data and submit the same row a thousand times, a million times. Your data integrity is trashed. The natural key was this:
CREATE Drivers (ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn), vin CHAR(17) NOT NULL REFERENCES Motorpool(vin), PRIMARY KEY (ssn, vin));
To demonstrate, here is a typical idiot newbie schema -- you will them all over the news groups. I call them "idiots" because they always name the IDENTITY property column "id" in EVERY table. They don't understand basic data modeling -- one and only name for an attribute. About half the time they don't use any DRI, but let's show it.
CREATE TABLE MotorPool (id IDENTITY (1,1) NOT NULL PRIMARY KEY, ssn CHAR(9) NOT NULL REFERENCES Personnel(id), vin CHAR(17) NOT NULL REFERENCES Vehicle(id));
CREATE TABLE Personnel (id IDENTITY (1,1) NOT NULL PRIMARY KEY, ssn CHAR(9) NOT NULL UNIQUE, ..);
CREATE TABLE Vehicles (id IDENTITY (1,1) NOT NULL PRIMARY KEY, vin CHAR(17) NOT NULL UNIQUE, .);
Now change the natural key in Personnel:
UPDATE Personnel SET ssn = '666666666' WHERE ssn = '000000000';
Nothing happened in Motorpool, did it? You can do the same thing with a VIN.
Now you are REALLY thinking about relations and keys instead of 1950's sequential record numbering. Adding an IDENTITY column to either of these tables as a candidate key would be dangerously redundant; one query uses the IDENTITY and another uses the real key, and like a man with two watches, you are never sure what time it is.
Finally, an appeal to authority, with a quote from Dr. Codd: "..Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them .."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434.
This means that a surrogate ought to act like an index; created by the user, managed by the system and NEVER seen by a user. That means never used in queries, DRI or anything else that a user does.
Codd also wrote the following:
"There are three difficulties in employing user-controlled keys as permanent surrogates for entities.
(1) The actual values of user-controlled keys are determined by users and must therefore be subject to change by them (e.g. if two companies merge, the two employee databases might be combined with the result that some or all of the serial numbers might be changed.).
(2) Two relations may have user-controlled keys defined on distinct domains (e.g. one uses social security, while the other uses employee serial numbers) and yet the entities denoted are the same.
(3) It may be necessary to carry information about an entity either before it has been assigned a user-controlled key value or after it has ceased to have one (e.g. and applicant for a job and a retiree).
These difficulties have the important consequence that an equi-join on common key values may not yield the same result as a join on common entities. A solution - proposed in part [4] and more fully in [14] - is to introduce entity domains which contain system-assigned surrogates. Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them....." (Codd in ACM TODS, pp 409-410).
References
Codd, E. (1979), Extending the database relational model to capture more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434
>> But the problem is that once I delete some records [sic], the SQL Server would not reuse the IDs of the deleted records [sic]. So there will be holes in the sequence.<<
You need to read a book on SQL and RDBMS. Rows are not records; fields are not columns; tables are not files; there is no sequential access or ordering in an RDBMS, so "first", "next" and "last" are totally meaningless.
You are confusing the parking spaces with the automobiles. Automobiles have VIN numbers, not parking space numbers!!
>> I know many people use Identity columns to generate unique id. << And that is a major reason so many database project produce bad data, fail out right and why I could easily bill between $1000-2000 per day when I did clean up work or SQL training. Better than half of my clean up jobs used IDENTITY, GUIDs or other autonumbering schemes that could not be verified.
>> How do they deal with the holes in the identity columns? << Your mental model is data in a sequential file. You have no concept whasoever of what an RDBMS is. This question is what we used to do with magnetic tape files in the early 1960's.
You marked a record (not a row) as deleted on the tape as you processed it; this was usually a bit or byte flag at the start of each record. We also used record numbers, where the zero or negative values meant that the record was deleted.
Record number were handy because you did relative reads from the current position. For example, an order might consists of an order header record (the term "Header record" is still in use today! Look at some of the postings here) followed by detail records. The header would have a count of the number of details. So I could program my application to jump to the next order header with a little math. I would finally run the tape thru a program that closed up the gaps and re-numbered the records as it sorted them.
Look at the CURSOR reads in Standard SQL today; the whole cursor model is based on mag tape file systems. The reason for this is that all of the X3J Programming Language Standards have a sequential file model in them.
Every DB programmer ought to have to take a course in file systems and have to write a polyphase merge sort once in their education. That teach them why we stopped using file systems and maybe, just maybe, they will quit tryng to imitate 1950's technology in SQL.
--CELKO-- =========================== Please post DDL, so that people do not have to guess what the keys, constraints, Declarative Referential Integrity, datatypes, etc. in your schema are.
|
| Nikhil Patel | 24 Feb 2004 15:55 |
Hi, I am thinking of creating an identity column to use it as primary key in my table. But the problem is that once I delete some records, the SQL Server would not reuse the IDs of the deleted records. So there will be holes in the sequence. I know many people use Identity columns to generate unique id. How do they deal with the holes in the identity columns? Do I need to worry about the size of the identity column or whether or not the column would run out of IDs or any other problem?
thanks... -Nikhil
|
Quick links:
|
|
|