Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
Home
Discussion Groups
DB Engine
SQL ServerMSDESQL Server CE
Services
Analysis (Data Mining)Analysis (OLAP)DTSIntegration ServicesNotification ServicesReporting Services
Programming
CLRConnectivitySQLXML
Other Technologies
ClusteringEnglish QueryFull-Text SearchReplicationService Broker
General
Data WarehousingPerformanceSecuritySetupSQL Server ToolsOther SQL Server Topics
DirectoryUser Groups
Related Topics
MS AccessOther DB ProductsMS Server Products.NET DevelopmentVB DevelopmentJava DevelopmentMore Topics ...

SQL Server Forum / Programming / SQL / February 2004

Tip: Looking for answers? Try searching our database.

problem using identity column as primary key

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
Nikhil Patel - 24 Feb 2004 15:55 GMT
Hi,
   I am thinking of creating an identity column to use it as primary key in
my table. But the problem is that once I delete some records, the SQL Server
would not reuse the IDs of the deleted records. So there will be holes in
the sequence. I know many people use Identity columns to generate unique id.
How do they deal with the holes in the identity columns? Do I need to worry
about the size of the identity column or whether or not the column would run
out of IDs or any other problem?

thanks...
-Nikhil
Jacco Schalkwijk - 24 Feb 2004 15:53 GMT
Hi Nikhil,

If you want something that doesn't leave holes in the sequence, don't use
IDENTITY. To put it more strongly, you shouldn't attach any meaning to the
values in an identity column, it just is an automatically generated value.
It isn't even guaranteed to be unique on it's own (because you can reset it
with a different start value).

You don't have to worry about the number of values in an identity column.
Using an int will give you more than 2 billion rows, and that will last
quite a while.

Signature

Jacco Schalkwijk
SQL Server MVP

> Hi,
>     I am thinking of creating an identity column to use it as primary key in
[quoted text clipped - 7 lines]
> thanks...
> -Nikhil
Bob Boran - 24 Feb 2004 15:53 GMT
Identity is a great way of generating a primary key.  Yes, there is a high
potential of having 'holes' in the sequence (though if that is a serious
problem it can be overcome).  For the most part, having all you primary key
values in a clean sequence without and holes is not an issue, as the only
purpose of the PK is to have a unique value.

I for one have never had an issue of running out of IDs.  considering that
you can use a bigint ( -9223372036854775808 to +9223372036854775807), there
is not much chance of running out.

> Hi,
>     I am thinking of creating an identity column to use it as primary key in
[quoted text clipped - 7 lines]
> thanks...
> -Nikhil
Joe Celko - 24 Feb 2004 16:55 GMT
>> Identity is a great way of generating a primary key ... as the only
purpose of the PK is to have a unique value. <<

You are dangerous.  

1) It cannot generate a key by definition.  It is the PHYSICAL state of
the hardware that holds your database at a particular point in time.  

2)  The purpose of any key is to uniquely identify the entities in the
data model in a way that can be verified in the reality the model
represents.  Uniqueness is necessary but not sufficient.

3) It is not even self-validating since there is no check digit or
syntax.  That makes it lousy as just a unique number, since it has a
high error rate.  

--CELKO--
===========================
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.
Louis Davidson - 24 Feb 2004 16:00 GMT
If you care about what the values will be in your primary key, then you
shouldn't use identity values. When you use an identity value for a key, you
are basically trading extreme ease of programming an artificial key for the
inflexibility of the method.  You will get gaps not only on a delete, but if
there is an error inserting a new row (or rows) into the table.

What makes them (in my opinion :) a good key is simplicity.  I let the
system generate a value, which I do not care what it is (could be
sequential, could be random, could be any unique value (note they are not
called sequences, rather they are called IDENTITY)) but once it has been
generated, it is like the DNA for the row that has been created.  I use it
anytime I am doing a programatic access of a row, like a join, or an update/
delete/ UI access, etc.

On the other hand, the identity value is generally a terrible key for people
to use to access a row.  This is why all rows must have at least one natural
key identified and protected via a key constraint.  If you use identities as
the primary key, then a unique constraint on these values.  The unique
constraints that you create will be your key that users will use.  Clearly
this means a bit more work for you than simply using the natural key as the
primary key, so why do it?

I personally like to have every updateable value have one and only one home.
If the natural key is the primary key, any related tables will contain the
value.  Now I have to cascade operations all around, and this could mean
10 - 20 places or even more for a large model.  With identity keys, I simply
migrate a single part key.

Second, single part keys (which I always have) give me a very standard
looking table that gives me the power to generate code that uses every table
basically the same from a key standpoint (which is a large part of the
work.)  So the more work I mentioned a bit earlier dissapates because I no
longer hand code as much, rather I build code once to do it.

Signature

----------------------------------------------------------------------------
-----------
Louis Davidson (drsql@hotmail.com)
Compass Technology Management

Pro SQL Server 2000 Database Design
http://www.apress.com/book/bookDisplay.html?bID=266

Note: Please reply to the newsgroups only unless you are
interested in consulting services.  All other replies will be ignored :)

> Hi,
>     I am thinking of creating an identity column to use it as primary key in
[quoted text clipped - 7 lines]
> thanks...
> -Nikhil
Joe Celko - 24 Feb 2004 16:55 GMT
>> I am thinking of creating an identity column to use it as primary key
in my table. <<

**by definition** a key is a subset of the attributes of an entity.  It
is a LOGICAL construct.  The proprietary, non-relational IDENTITY
property is a PHYSICAL state of a particular release of a particular
product on of a particular machine at of a particular time.  It cannot
have anything whatsoever to with your data model (you did build a data
model, build a data dictionary and research your industry weeks before
you started coding, didn't you?)

It is a handle to the row.  This means that you are faking a sequential
file's positional record number, so you can reference the physical
storage location?  Sure, I'd do this if I want to lose all the
advantages of an abstract data model, SQL set oriented programming,
carry extra data and destroy the portability of code!

More and more programmers who have absolutely no database training are
being told to design a database.  They are using GUIDs, IDENTITY, ROWID
and other proprietary auto-numbering "features" in SQL products to
imitate either a record number (sequential file system mindset) or OID
(OO mindset) since they don't know anything else.  

Experienced database designers tend toward intelligent keys they find in
industry standard codes, such as UPC, VIN, GTIN, ISBN, etc.  They know
that they need to verify the data against the reality they are modeling.
A trusted external source is a good thing to have.

The IDENTITY column is a holdover from the early programming languages
which were very close to the hardware.  For example, the fields (not
columns; big difference) in a COBOL or FORTRAN program were assumed to
be physically located in main storage in the order they were declared in
the program.  The languages have constructs using that model -- logical
and physical implementations are practically one!  The data has meaning
BECAUSE of the program reading it (i.e. the same bits could be a
character in one program and be an integer in another)

The early SQLs were based on existing file systems.  The data was kept
in physically contiguous disk pages, in physically contiguous rows, made
up of physically contiguous columns.  In short, just like a deck of
punch cards or a magnetic tape.  Most programmer still carry that mental
model, which is why I keep doing that rant about file vs. table, row vs.
record and column vs. field.  

But physically contiguous storage is only one way of building a
relational database and it is not the best one. The basic idea of a
relational database is that user is not supposed to know *how* or
*where* things are stored at all, much less write code that depends on
the particular physical representation in a particular release of a
particular product on particular hardware at a particular time.  

One of the biggest errors is the IDENTITY column (actually property, not
a column at all) in the Sybase/SQL Server family.  People actually
program with this "feature" and even use it as the primary key for the
table!  Now, let's go into painful details as to why this thing is bad.

The first practical consideration is that IDENTITY is proprietary and
non-portable, so you know that you will have maintenance problems when
you change releases or port your system to other products.  Newbies
actually think they will never port code!  Perhaps they only work for
companies that are failing and will be gone.  Perhaps their code is such
crap nobody else want their application.

But let's look at the logical problems.  First try to create a table
with two columns and try to make them both IDENTITY.  If you cannot
declare more than one column to be of a certain data type, then that
thing is not a datatype at all, by definition.  It is a property which
belongs to the PHYSICAL table, not the LOGICAL data in the table.  

Next, create a table with one column and make it an IDENTITY.  Now try
to insert, update and delete different numbers from it.  If you cannot
insert, update and delete rows from a table, then it is not a table by
definition.  

Finally create a simple table with one IDENTITY and a few other columns.
Use a few statements like

INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');

To put a few rows into the table and notice that the IDENTITY
sequentially numbered them in the order they were presented.  If you
delete a row, the gap in the sequence is not filled in and the sequence
continues from the highest number that has ever been used in that column
in that particular table.  This is how we did record numbers in
pre-allocated sequential files in the 1950's, by the way. A utility
program would then "pack" or "compress" the records that were flagged as
deleted or unused to move the empty space to the physical end of the
physical file.  

But now use a statement with a query expression in it, like this:

INSERT INTO Foobar (a, b, c)
SELECT x, y, z
 FROM Floob;

Since a query result is a table, and a table is a set which has no
ordering, what should the IDENTITY numbers be?  The entire, whole,
completed set is presented to Foobar all at once, not a row at a time.
There are (n!) ways to number (n) rows, so which one do you pick?  The
answer has been to use whatever the *physical* order of the result set
happened to be.  That non-relational phrase "physical order" again!

But it is actually worse than that.  If the same query is executed
again, but with new statistics or after an index has been dropped or
added, the new execution plan could bring the result set back in a
different physical order.

Can you explain from a logical model why the same rows in the second
query get different IDENTITY numbers?  In the relational model, they
should be treated the same if all the values of all the attributes are
identical.  

Using IDENTITY as a primary key is a sign that there is no data model,
only an imitation of a sequential file system.  Since this "magic,
all-purpose, one-size-fits-all" pseudo-identifier exists only as a
result of the physical state of a particular piece of hardware at a
particular time as read by the current release of a particular database
product, how do you verify that an entity has such a number in the
reality you are modeling?  

You will see newbies who design tables like this:

CREATE Drivers
(driver_id IDENTITY (1,1) NOT NULL PRIMARY KEY,  
ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn),
vin CHAR(17) NOT NULL REFERENCES Motorpool(vin));

Now input data and submit the same row a thousand times, a million
times.  Your data integrity is trashed.  The natural key was this:

CREATE Drivers
(ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn),
vin CHAR(17) NOT NULL REFERENCES Motorpool(vin),
PRIMARY KEY (ssn, vin));

To demonstrate, here is a typical idiot newbie schema -- you will them
all over the news groups.  I call them "idiots" because they always name
the IDENTITY property column "id" in EVERY table.  They don't understand
basic data modeling -- one and only name for an attribute.  About half
the time they don't use any DRI, but let's show it.

CREATE TABLE MotorPool
(id IDENTITY (1,1) NOT NULL PRIMARY KEY,
ssn CHAR(9) NOT NULL REFERENCES Personnel(id),
vin CHAR(17) NOT NULL REFERENCES Vehicle(id));

CREATE TABLE Personnel
(id IDENTITY (1,1) NOT NULL PRIMARY KEY,
ssn CHAR(9) NOT NULL UNIQUE,
..);

CREATE TABLE Vehicles
(id IDENTITY (1,1) NOT NULL PRIMARY KEY,
vin CHAR(17) NOT NULL UNIQUE,
.);

Now change the natural key in Personnel:

UPDATE Personnel
  SET ssn = '666666666'
WHERE ssn = '000000000';

Nothing happened in Motorpool, did it?  You can do the same thing with a
VIN.  

Now you are REALLY thinking about relations and keys instead of 1950's
sequential record numbering.  Adding an IDENTITY column to either of
these tables as a candidate key would be dangerously redundant; one
query uses the IDENTITY and another uses the real key, and like a man
with two watches, you are never sure what time it is.

Finally, an appeal to authority, with a quote from Dr. Codd: "..Database
users may cause the system to generate or delete a surrogate, but they
have no control over its value, nor is its value ever displayed to them
.."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending
the database relational model to capture more meaning.  ACM Transactions
on Database Systems, 4(4).  pp. 397-434.  

This means that a surrogate ought to act like an index; created by the
user, managed by the system and NEVER seen by a user.  That means never
used in queries, DRI or anything else that a user does.

Codd also wrote the following:

"There are three difficulties in employing user-controlled keys as
permanent surrogates for entities.

(1)  The actual values of user-controlled keys are determined by users
and must therefore be subject to change by them (e.g. if two companies
merge, the two employee databases might be combined with the result that
some or all of the serial numbers might be changed.).

(2)  Two relations may have user-controlled keys defined on distinct
domains (e.g. one uses social security, while the other uses employee
serial numbers) and yet the entities denoted are the same.

(3)  It may be necessary to carry information about an entity either
before it has been assigned a user-controlled key value or after it has
ceased to have one (e.g. and applicant for a job and a retiree).

These difficulties have the important consequence that an equi-join on
common key values may not yield the same result as a join on common
entities.  A solution - proposed in part [4] and more fully in [14] - is
to introduce entity domains which contain system-assigned surrogates.
Database users may cause the system to generate or delete a surrogate,
but they have no control over its value, nor is its value ever displayed
to them....." (Codd in ACM TODS, pp 409-410).

References

Codd, E. (1979), Extending the database relational model to capture more
meaning.  ACM Transactions on Database Systems, 4(4).  pp. 397-434

>> But the problem is that once I delete some records [sic], the SQL
Server would not reuse the IDs of the deleted records [sic]. So there
will be holes in the sequence.<<

You need to read a book on SQL and RDBMS. Rows are not records; fields
are not columns; tables are not files; there is no sequential access or
ordering in an RDBMS, so "first", "next" and "last" are totally
meaningless.

You are confusing the parking spaces with the automobiles.  Automobiles
have VIN numbers, not parking space numbers!!

>> I know many people use Identity columns to generate unique id. <<

And that is a major reason so many database project produce bad data,
fail out right and why I could easily bill between $1000-2000 per day
when I did clean up work or SQL training.  Better than half of my clean
up jobs used IDENTITY, GUIDs or other autonumbering schemes that could
not be verified.  

>> How do they deal with the holes in the identity columns? <<

Your mental model is data in a sequential file.  You have no concept
whasoever of what an RDBMS is.  This question is what we used to do with
magnetic tape files in the early 1960's.  

You marked a record (not a row) as deleted on the tape as you processed
it; this was usually a bit or byte flag at the start of each record.  We
also used record numbers, where the zero or negative values meant that
the record was deleted.  

Record number were handy because you did relative reads from the current
position.  For example, an order might consists of an order header
record (the term "Header record" is still in use today! Look at some of
the postings here) followed by detail records.  The header would have a
count of the number of details.  So I could program my application to
jump to the next order header with a little math.  I would finally run
the tape thru a program that closed up the gaps and re-numbered the
records as it sorted them.  

Look at the CURSOR reads in Standard SQL today; the whole cursor model
is based on mag tape file systems.  The reason for this is that all of
the X3J Programming Language Standards have a sequential file model in
them.  

Every DB programmer ought to have to take a course in file systems and
have to write a polyphase merge sort once in their education.  That
teach them why we stopped using file systems and maybe, just maybe, they
will quit tryng to imitate 1950's technology in SQL.  

--CELKO--
===========================
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.
Bob Boran - 24 Feb 2004 17:14 GMT
You obviously feel very strong about this issue.
Please lean back on this couch for a minute and lets talk about it.
Tell me about your days as a young DBA.  Any early tramas that may have left
you feeling that artificial primary keys are inadequite?  Let me turn on
this soft soothing music to help calm your troubled spirit.

There, now that your breathing has returned to normal..

I am not going to pretend that I have spent years getting a masters or
doctorate in Database Design.  I will even admit up-front that I don't even
have a CS degree.   But what I do have is 10 years of experience designing,
implementing and maintaining enterprise databases for my clients. (yes, go
ahead and make your negative remark here, I will even leave space)
------------
This space reserved for Celco

-----------
Ok, now that you are done.

My experience has taught me that using an artificial PK in every table is
the safest, easiest, most maintainable design technique for today's large
databases.  A nice clean number that requires very little storage space,
something that can be easily used as a FK without causing potential
integrity issues when data needs to be update, something that I know is
unique without having to use 2-10 columns looking for a natural key, and
most of all consistency in my design model.
Now, does that mean we don't also look for a natural key?  No, a natural key
is still identified (if it is available) and a Unique constraint is applied
to it to enforce that rule.

I will stop there in order to keep my post long enough to say what needed to
be said, but short enough to be readable.
Anith Sen - 24 Feb 2004 17:38 GMT
I can't help smiling, Bob :-)

Signature

Anith

Joe Celko - 24 Feb 2004 18:12 GMT
>> My experience has taught me that using an artificial PK in every
table is the safest, easiest, most maintainable design technique for
today's large databases. <<

You are right about one of the three things: auto numbering and
mimicking a tape file is the Easiest way to "unique-ify" a row.  It
requires no thought, no research and no effort.  You can skip that year
or two of learning the Relatioanl model and SQL; you can mimic what you
already know about file systems.  

Of course, this is like the lumberjack who bought a chain saw, thought
reading the manual was too hard and never put gasoline in it.  He could
still cut down trees with his new tool, but it took longer.  

But it is not safe and it is not maintainable.  I posted a horror story
a few weeks back about a motor pool system.  In bulk moving the tables
from one platform to another, the IDENTITY column got renumbered.
Unverifiable keys are not my idea of safe and maintainable.  

The insanely expensive manual reconstruction was a good thing in one
way; we found a lot of hanky-panky had been going out.  VIN numbers,
commercial auto tags and the generated ids did not always match to the
same vehicle.  

>> .. A nice clean number that requires very little storage space, <<

Who cares about storage space in the 21-st Century?  You go to jail or
kill people with bad data.  You do not go to jail or kill people with a
cheaper, higher density disk drive and extra RAM.  

>> .. something that can be easily used as a FK without causing
potential integrity issues when data needs to be update, .. <<

If you will go back to the prior posting, you will see a simple
demonstration that IDENTITY has no integrity as a Foreign Key.  I got
that one from the Motor Pool job I mentioned.  

>> .. something that I know is unique without having to use 2-10 columns
looking for a natural key, ..<<

1) At the risk of a good straight line, size is not important.  

a) It does not serious affect speed or space.  This is 2004 and not
1954.  Magnetic core has been replaced with silicon chips.  Timing for
various operations on a typical 1GHz PC in summer 2001:

execute single instruction: 1 nsec
fetch word from L1 cache memory:  2 nsec  
fetch word from main memory:  10 nsec  
fetch word from consecutive disk location:  200 nsec  
fetch word from new disk location (seek):  8,000,000 nsec

Anyone got the current numbers and the price of a byte of storage today?

b) If you want your data model to be correct and consistent, and your
reality has a (n) column natural key, then you damn well have to enforce
them.  

c) If you have a natural key, then an exposed physical locator is
redundant at best and demonstatably dangerous. Uniqueness is necessary
but it is not sufficient for a key.

>> ... and most of all consistency in my design model. <<

The only consistent is that ALL the tables are linked together with
"homemade pointer chains" that use links of one "magic, all-purpose,
universal" data type, thus returning us to the days of IDMS databases.  

>> Now, does that mean we don't also look for a natural key? <<

For most of the "ID-iots", yes, it does!  Read the DDL that gets posted
here.    

>> No, a natural key is still identified (if it is available) and a
Unique constraint is applied to it to enforce that rule. <<

LOL! If it is not available, then call it's press agent and see when it
IS available.  How do you tell the entities apart in the reality of the
model now?  

And thus once more the sole original purpose of all databases,
relational or otherwise, is defeated by lazy programmers!  We've got
redundancy again! But this time, the logical/physical tiers are crossed
and all the advantages of Software Engineering since 1970 are lost, too.

From an operational viewpoint, life really stinks.  When we had IDMS,
IMS, TOTAL and all the other network databases, they came with utility
programs to fix broken pointer chains.  SQL doesn't.  

--CELKO--
===========================
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.
Bob Boran - 24 Feb 2004 19:16 GMT
> I posted a horror story
> a few weeks back about a motor pool system.  In bulk moving the tables
> from one platform to another, the IDENTITY column got renumbered.
> Unverifiable keys are not my idea of safe and maintainable.

Careful scripting and testing before doing any data moves would have solved
this problem.  I can't see how it applies to our discussion.

> The insanely expensive manual reconstruction was a good thing in one
> way; we found a lot of hanky-panky had been going out.  VIN numbers,
> commercial auto tags and the generated ids did not always match to the
> same vehicle.

Proper db maintenance and decent coding are your solutions to this problem.
There is nothing here that is intrinsic to Artificial Keys being a problem.

> >> .. A nice clean number that requires very little storage space, <<
>
> Who cares about storage space in the 21-st Century?

Now who is being lazy?

> If you will go back to the prior posting, you will see a simple
> demonstration that IDENTITY has no integrity as a Foreign Key.  I got
> that one from the Motor Pool job I mentioned.

Identity is not a datatype, it is just a way of generating a value.  If an
artificially generated value wasn't going to work for you, then any natural
value would run into the same problem.  If you can't keep your Foreign Key
entegrity, then you need to re-examine the design of the database, not the
choice of using an Identity column.

> >> .. something that I know is unique without having to use 2-10 columns
> looking for a natural key, ..<<
> a) It does not serious affect speed or space.

Have you never created a multi join query, using composite keys as the
foriegn key?  Goodness man, it can take up to 10X as long to execute!  Not
to mention the readability of the code.

How about just doing a join on a textual field?  That can take up to 3X as
long.

> b) If you want your data model to be correct and consistent, and your
> reality has a (n) column natural key, then you damn well have to enforce
> them.

Unique constraint provides that for me.

> The only consistent is that ALL the tables are linked together with
> "homemade pointer chains" that use links of one "magic, all-purpose,
> universal" data type, thus returning us to the days of IDMS databases.

Since this obviously works, I don't know that any rebuttal is necessary.

I said it once, I will say it again.  The ONLY purpose of a PK is to provide
a uniqueness to the record.

Finally, I can't see that you have made any points that will convice anyone
that natural keys have an advantage over artificial keys.  Even MS teaches
artificial PK.  The industry standard has become artificial PK.  Don't you
think maybe you got left behind somewhere?
Adam Machanic - 24 Feb 2004 19:43 GMT
I think this sums it up:

http://www.dbdebunk.com/page/page/857309.htm

> > I posted a horror story
> > a few weeks back about a motor pool system.  In bulk moving the tables
[quoted text clipped - 58 lines]
> artificial PK.  The industry standard has become artificial PK.  Don't you
> think maybe you got left behind somewhere?
--CELKO-- - 26 Feb 2004 02:47 GMT
>> http://www.dbdebunk.com/page/page/857309.htm <<

Good old Fabian!  As far as I know he still holds second place for
being thrown off the most CompuServe Forums; first place belongs to a
woman with logorhea who would get off her meds and spend days
complusively insulting people in random forums.
Joe Celko - 24 Feb 2004 21:20 GMT
>> Careful scripting and testing before doing any data moves would have
solved this problem. I can't see how it applies to our discussion. <<

If I had only perfect users who never re-entered the same data twice
(and got two unique autonumbers on the dups), who always typed in key
values without typos, etc. then I might feel the way you do and believe
that IDENTITY is safe.  

Any of the more experienced programmer here are probably having
flashbacks to the first time they said "But users won't do THAT!"; but
the users did.  

But here in the **real** world, we need check digits, syntax edits and
tools that can recover when **real** people screw up.  This is funny
because I am usually the guy who gets an irate reply to the effect,
"Shut up with all that theory and math crap, just give me the fastest,
easier kludge that will make this thing run NOW! NOW!" from the newbies.

While nothing might be idiot-proof, at least make it idiot-retardant.
"Against stupidity the gods themselves struggle in vain." - Die Jungfrau
von Orleans; Friedrich von Schiller (1759-1805).

>> Identity is not a datatype, it is just a way of generating a value.
<<

Exactly!  And what is the definition of a columns?  It has a known
datatype!  

>> If an artificially generated value wasn't going to work for you, then
any natural value would run into the same problem. <<

Wrong.  Let's go back to the Motor Pool.  Let's assume that we had used
a VIN instead of an IDENTITY.  Let's further assume that a sabateur (not
just a klutz) went into the tables and tried to trash all the VINs.  The
first problem is a CHECK() constraint with a regular expression to
validate the VIN.  Okay, he is a sabateur and wants to keep going.  He
writes a program to generate fake VINs or to scramble real ones so as to
pass the CHECK().  If started at a referencing table, the DRI would have
blocked him.  

Now he has to go to the referenced table and scrambles the real VINs
while adding fakes.  The ON UPDATE CASCADE actions push those scrambled
VINs down to the referencing tables; the DB is consistent but not
correct.  The fake VINs sit around the referenced table, unused.  

We catch the bad guy, or discover the problem in an ISO-9000 or
Sarbanes-Oxley audit.  Gasp!  What can we do?  Well, what we actually
did was go out to the Motor Pool and read the VINs off of the vehicles.

Try doing that with a sequential IDENTITY number.  

1) There is no CHECK() on IDENTITY, every number is as good as the next.
A klutz can do this, no sabateur needed.

2) Since I know that if (n) is in the set of PRIMARY KEYs , I have a
huge chance of also having (n-k) as a valid PRIMARY KEY, too.  That lets
me go to a refrencing table and decrement all the FOREIGN KEYs.  Notice
the use of uppercase here ("primary key" is not the same as "PRIMARY
KEY"; IDENTITY cannot be the first, but it can be the second).  

3) Ditto for trashing the referenced table. But there is a better way:
  INSERT INTO Foobar
  SELECT a,b,c FROM Foobar;

This is a relational version of a DoS attack.  IDENTITY will cheerfully
flood the table with redudant data. A klutz who is not sure if he sent
the data or not can do this, no sabateur needed.  

>> Since this [pointer chains] obviously works, I don't know that any
rebuttal is necessary. <<

Then why did we move from IDMS, et al to relational DBMS systems?  Maybe
because they did not work?

>> I said it once, I will say it again. The ONLY purpose of a PK is to
provide a uniqueness to the record [sic]. <<

And Dr. Codd, me, and everyone else in the Relational Database arena
will still say "Wrong!"; and we will also remind you that a row is not a
record.  

--CELKO--
===========================
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.
Adam Machanic - 24 Feb 2004 21:50 GMT
> 3) Ditto for trashing the referenced table. But there is a better way:
>    INSERT INTO Foobar
[quoted text clipped - 7 lines]
> will still say "Wrong!"; and we will also remind you that a row is not a
> record.

   Bob Boran very clearly indicated that his use of the IDENTITY column was
as a surrogate key, and that uniqueness is enforced via other (non-primary
key) constraints.  So this example and all of the rest of your examples fail
to win you this argument.  And your attacking Mr. Boran for his syntactical
error only further reduces your credibility.  Finally, grouping yourself
with Dr. Codd "and everyone else in the Relational Database arena" is
laughable; since when are you, a proponent of the SQL standard (which is
clearly not compliant with the relational model) a member of said
"Relational Database arena"?
Bob Boran - 24 Feb 2004 22:09 GMT
> If I had only perfect users who never re-entered the same data twice
> (and got two unique autonumbers on the dups),

Unique constraint would have caught this and not allowed the second entry.

> who always typed in key
> values without typos, etc.

Who lets the end user even know what the Identity value is?  Let alone let
them type it in somewhere.

> >> Identity is not a datatype, it is just a way of generating a value.
> <<
>
> Exactly!  And what is the definition of a columns?  It has a known
> datatype!

What?  *Bob looks perplexed trying to figure how Joe's reply applies.....*

> Wrong.  Let's go back to the Motor Pool.  Let's assume that we had used
> a VIN instead of an IDENTITY.  Let's further assume that a sabateur (not
[quoted text clipped - 4 lines]
> pass the CHECK().  If started at a referencing table, the DRI would have
> blocked him.

So, you would trade daily performance, scalability, ease of use,  for the
potential of a sabateur attacking your DB.  Can you say Daily offsite
backup?  How about competent DBA?  Security?  Even silly little "dangerous"
people like me realize that the real world has it's tradeoffs.  Perhaps you
have spent to much time in acedemia.

> And Dr. Codd, me, and everyone else in the Relational Database arena
> will still say "Wrong!"; and we will also remind you that a row is not a
> record.

The simple fact is that using Identity columns as PK DO work.  And it will
keep on working.  And all the theorizing in the world is not going to stop
it from working.

> --CELKO--
Petrik Salovaara - 25 Feb 2004 00:00 GMT
> The simple fact is that using Identity columns as PK DO work.  And it
> will keep on working.  And all the theorizing in the world is not
> going to stop it from working.

You are right. Not only it works, but it works much better than a
natural key. All the problems described by CELKO are easily solved
using decent design, integrity check tools and data transformation
utilities.

One problem with natural keys that hasn't come up in this discussion
is the fact that it is very difficult to find a natual key that is
quaranteed not to change. Hey, even person's SSN will change in some
situations!

Cheers,
Petrik
J. M. De Moor - 25 Feb 2004 01:10 GMT
Petrik

> One problem with natural keys that hasn't come up in this discussion
> is the fact that it is very difficult to find a natual key that is
> quaranteed not to change.

That's why some put ON UPDATE CASCADE in their DDL, right?

Joe De Moor
Louis Davidson - 25 Feb 2004 14:58 GMT
It is, but it makes for interesting situations with cascading that goes
large numbers of tables.  My rule is that every piece of data is modifiable
in only one place, and that change will be reflected in only one place in
the system.  I also like the fact that if I change a key structure the
change only effects my database code in one place (and the UI in only one
location if proper OO techniques are used).

I certainly do not think there is anything wrong with using natural keys for
primary keys, but I don't think there is anything wrong with using
artificial values as surrogate keys either, as long as they are backed up by
natural keys.

Signature

----------------------------------------------------------------------------
-----------
Louis Davidson (drsql@hotmail.com)
Compass Technology Management

Pro SQL Server 2000 Database Design
http://www.apress.com/book/bookDisplay.html?bID=266

Note: Please reply to the newsgroups only unless you are
interested in consulting services.  All other replies will be ignored :)

> Petrik
>
[quoted text clipped - 5 lines]
>
> Joe De Moor
--CELKO-- - 26 Feb 2004 02:43 GMT
{{ DevDex is once more not posting replies that I sent from the second
page of their display.  ARRRGH! On to Google!  Ugly, slow but
reliable. }}

>> ... makes for interesting situations with cascading that goes to
large numbers of tables.  <<

Nah!  It makes for interesting situations in a schema WITHOUT
cascading to all the tables that it should :) Hee, hee, hee!

I was at a DAMA show in Texas a few years ago and one of the attendees
had a good horror story.  His company had switched from an 8 digit
part number to a 9 digit one and had been using exposed locators as
the PRIMARY KEY in many of the tables.  The "logic of failure" was
that the natural key was (product_code, warehouse, expiration date),
so you could get some speed by adding an autonumbering column.

He was brought in to do a data warehouse and had to look at all the
tables for the first time.  Some of the tables had been storing part
numbers as CHAR(8) and some as INTEGER; some upgraded by altering the
column and some just added a CHAR(1) column to the side and put the
new digit in it (it was going to be constant for awhile, then would
vary).  Some of them bothered to do the check digit algorithm and some
did not.  Since the new code was a substring of the old code, some
joins were re-written to pull the substring out of one table.  The
bottom line was that he had almost a dozen different ways to represent
a part number in the databases and nobody knew it until they did the
data audit for the DW project.

>> My rule is that every piece of data is modifiable in only one
place, and that change will be reflected in only one place in the
system.  <<

Bingo!  I hope that more products add a CREATE DOMAIN and ALTER DOMAIN
statement in the near future.

>> I don't think there is anything wrong with using artificial values
as surrogate keys either, as long as they are backed up by natural
keys. <<

Unfortunately, a man with many watches is never in synch.
Petrik Salovaara - 26 Feb 2004 02:40 GMT
>> One problem with natural keys that hasn't come up in this discussion
>> is the fact that it is very difficult to find a natual key that is
>> quaranteed not to change.
>
> That's why some put ON UPDATE CASCADE in their DDL, right?

Yes. And by doing so they cause a loss of data that might be
very important in the future, and should be therefore stored
for auditing purposes.

My point being, if a natural key changes, it may be important
to know what it once upon a time was.
Louis Davidson - 25 Feb 2004 14:53 GMT
Amen, with only one caveat:

> Now, does that mean we don't also look for a natural key?  No, a natural key
> is still identified (if it is available) and a Unique constraint is applied
> to it to enforce that rule.

If you don't have a natural key, you have exactly the thing that Joe
suggests, a record number.

Louis

Signature

----------------------------------------------------------------------------
-----------
Louis Davidson (drsql@hotmail.com)
Compass Technology Management

Pro SQL Server 2000 Database Design
http://www.apress.com/book/bookDisplay.html?bID=266

Note: Please reply to the newsgroups only unless you are
interested in consulting services.  All other replies will be ignored :)

> You obviously feel very strong about this issue.
> Please lean back on this couch for a minute and lets talk about it.
[quoted text clipped - 28 lines]
> I will stop there in order to keep my post long enough to say what needed to
> be said, but short enough to be readable.
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2009 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.