SQL Server Forum / DB Engine / SQL Server / March 2008
Varchars/Chars vs Int/BigInt as keys
|
|
Thread rating:  |
ryan putman - 03 Mar 2008 20:21 GMT This conversation came up and I was wondering if anyone had some definitive proof one way or another....
If you have a key column for a table and that column will be used as a reference/foreign key to other tables often and frequently but also could reach values well into the millions at what point (if ever) do you opt for a char/varchar key as opposed to an int/bigint?
If the choice does become char/varchar at some point, are there particular advantages to using char vs. varchar? In the system under question, the id values are currently between 7 and 11 characters long......so no really small values or any too large neither.
Thanks in Advance Ryan Putman
Andrew J. Kelly - 04 Mar 2008 01:31 GMT The decision to use char vs. int should depend on what the type of data is. If the values are always going to be numbers then why make it a char or varchar when an INT is what it really is? INT's will always be more efficient than a 12 character datatype since the INT will only take up 4 bytes and the other 12.
 Signature Andrew J. Kelly SQL MVP Solid Quality Mentors
> This conversation came up and I was wondering if anyone had some > definitive proof one way or another.... [quoted text clipped - 11 lines] > Thanks in Advance > Ryan Putman ryan putman - 04 Mar 2008 13:26 GMT On Mar 3, 8:31 pm, "Andrew J. Kelly" <sqlmvpnooos...@shadhawk.com> wrote:
> The decision to use char vs. int should depend on what the type of data is. > If the values are always going to be numbers then why make it a char or [quoted text clipped - 4 lines] > -- > Andrew J. Kelly SQL MVP hmm...thanks for the reply but apparently I didn't state my question clearly enough.
The column in question is just some key column. The question is totally hypothetical, not analytical. If I have a table that could eventually reach well into the millions in rows, should I use a standard integer (or bigint) identity column as a primary key or should I use a generated var/char column.
The integer should provide faster linking and searching initially but is there a point where have a generated varchar key like A283CD is better than an integer ID of 2,828,828?
Dan Guzman - 04 Mar 2008 13:56 GMT > hmm...thanks for the reply but apparently I didn't state my question > clearly enough. > > The column in question is just some key column. I think Andrew understood your question but you may not have understood his answer. The key datatype is determined by the underlying data. For example, if you have a numeric CustomerID, then an integer type is appropriate. A char (or varchar) is appropriate for alphanumeric CustomerID.
> The integer should provide faster linking and searching initially but > is there a point where have a generated varchar key like A283CD is > better than an integer ID of 2,828,828? Since you mention "generated", I think you are specifically referring to a surrogate key. A surrogate key is an artificial value used as an alternative to the natural key. For a surrogate key data type, integers are most commonly used in SQL Server because values are easily generated using IDENTITY and provide good performance an low storage requirements. Uniqueidentifiers are used in cases where the value needs to be globally unique. Character values are used only when business requirements dictate.
When you use surrogate keys, it is important that you also have a constraint on the natural key to ensure uniqueness and prevent duplicate rows.
 Signature Hope this helps.
Dan Guzman SQL Server MVP http://weblogs.sqlteam.com/dang/
> On Mar 3, 8:31 pm, "Andrew J. Kelly" <sqlmvpnooos...@shadhawk.com> > wrote: [quoted text clipped - 20 lines] > is there a point where have a generated varchar key like A283CD is > better than an integer ID of 2,828,828? ryan putman - 04 Mar 2008 19:51 GMT On Mar 4, 8:56 am, "Dan Guzman" <guzma...@nospam-online.sbcglobal.net> wrote:
> I think Andrew understood your question but you may not have understood his > answer. The key datatype is determined by the underlying data. For > example, if you have a numeric CustomerID, then an integer type is > appropriate. A char (or varchar) is appropriate for alphanumeric > CustomerID. There is no underlying data. As I said, this is mainly a hypothetical question.
A colleague of mine made the point that if I was going to have a table with a very large number of rows (into the millions) that a char-based generated ID value would be more efficient than an Integer value......which I doubted because INTs have always been faster in my experience.
There is no data. Just the question.....is there ever a point where INTs become less efficient because of how large they get versus a char. I think they you guys have basically said that chars are never faster.....which is the answer I was looking for.....everyone just seemed to get hung up on the thought that there was a pre-existing data format already in place or something.
ryan putman - 04 Mar 2008 20:01 GMT To provide an example....
The identity seeded integer style ID would have values like this - 1,2,3,4,5,6,7,8,9,10,11,12,13,14....34,35,36,37......45,46,47 The generated char ID would have values like this - 1,2,3,4,5,6,7,8,9,A,B,C,D,E..........Y,Z,10,11.........19,1A,1B
Obviously, you would eventually be storing a much smaller looking char value compared to the INT value once the number of rows is into the millions but the question is still is that char more efficient or better in any way than the INT?
Andrew J. Kelly - 04 Mar 2008 21:36 GMT A char or varch takes up 1 byte for each character. An INT takes up 4 bytes regardless of the value. So as soon as the CHAR value get to 5 characters it becomes larger than the INT and that would happen well before you hit the millions let alone billions. And the INT is still more efficient than the same size CHAR. In SQL2008 the INT can even take up less space if the values are not near the max.
 Signature Andrew J. Kelly SQL MVP Solid Quality Mentors
> To provide an example.... > [quoted text clipped - 7 lines] > millions but the question is still is that char more efficient or > better in any way than the INT? Tom Cooper - 04 Mar 2008 21:56 GMT > Obviously, you would eventually be storing a much smaller looking char > value compared to the INT value once the number of rows is into the Smaller "looking" char value is right. It only looks smaller. Remember that a varchar takes 2 extra bytes in your table. So the characters for the first 36 rows 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E..........Y,Z each take 3 bytes to store which is slightly better than an int which takes 4 bytes to store. The next 1296 rows (00 thru ZZ) each take 4 bytes The next 46,656 rows 000 thru ZZZ each take 5 bytes The next 1,679,616 rows 0000 thru ZZZZ each take 6 bytes The next 60,466,176 rows 00000 thru ZZZZZ each take 7 bytes The next 2,176,782,336 rows 000000 thru ZZZZZZ each take 8 bytes But each one of those rows could be assigned an int that only took 4 bytes. So, as you can see, using varchar vs using an int will actually cost you storage space once you have more than about 1300 rows.
And searching/comparisions/joining on an int value is faster than a varchar value.
And using an identity attribute on the int to generate each value is much easier and straight forward that the code you would need to find the next varchar value when a row was inserted.
It's a slam dunk. Use an identity column and choose between smallint, int, and bigint depending on how many rows could be added to the table.
Tom
> To provide an example.... > [quoted text clipped - 7 lines] > millions but the question is still is that char more efficient or > better in any way than the INT? ryan putman - 05 Mar 2008 15:27 GMT Thanks, everyone, for your replies.
They were definitely helpful.
Not sure I can convince the people in charge but at least now I know I'm right :)
Tom Cooper - 05 Mar 2008 19:05 GMT If they won't listen to you, suggest they run a test. For example, a simple test would be create a table with an integer primary key with the identity clause and insert some number of rows into it and and see how long it takes and how much space the table takes up when you're done. Then do the same thing with a table using a varchar key with the scheme you outlined and see how long that takes and how much space that table takes up. For example, for the identity column and 100,000 rows, I ran
Create Table TestInt (PK int identity Primary Key) go Declare @Time1 datetime Declare @Time2 datetime Declare @InsertCounter int Set @InsertCounter = 1 Set @Time1 = Current_Timestamp While @InsertCounter <= 100000 Begin Insert TestInt Default Values Set @InsertCounter = @InsertCounter + 1 End Set @Time2 = Current_Timestamp Select DateDiff(ms, @Time1, @Time2) As ElaspedMS Exec sp_spaceused 'TestInt', 'True'
(I know that's not the fastest way to insert 100,000 rows, but since I presume your real system will insert the rows one at a time, I had this test do that also.) This test on my machine took about 50 seconds, and the sp_spaceused said reserved space was 1,352 KB, data 1,288 KB, index 16 KB and unused 48 KB.
With the varchar key, it's more complicated. You need a way to store the next key (I added a table with one row), a way to compute the next key (I have a stored proc, please note this should not be used in production, it's not bullet proof at all, it doesn't for example, handle concurrent requests for the next key). But this is only a test, so I was keeping it simple. So that code was
Create Table TestVC (PK varchar(6) Primary Key) Create Table NextVC (PK int Primary Key, NextVCPK varchar(6)) Insert NextVC (PK, NextVCPK) Values (1, '1') go Create Procedure GetNextKey @NextKey varchar(6) Output As Begin Declare @NewNextKey varchar(6) Declare @CharToProcess tinyint Declare @CharAsciiValue tinyint Declare @Done tinyint Select @NextKey = NextVCPK From NextVC Where PK = 1 Set @NewNextKey = @NextKey Set @CharToProcess = Len(@NewNextKey) /* if value is all Z's, Z or ZZ or ZZZ, etc then next value is a 1 followed by the appropriate number of zeros */ If @NewNextKey In ('Z', 'ZZ', 'ZZZ', 'ZZZZ', 'ZZZZZ') Begin Set @NewNextKey = '1' + Replicate('0', @CharToProcess) End Else Begin Set @Done = 0 While @Done = 0 Begin Set @CharAsciiValue = Ascii(Substring(@NewNextKey, @CharToProcess, 1)) /* Find the next value for this character */ Set @CharAsciiValue = Case /* change 0-8 to the next number, 9 to A, A-Y to Z and Z to 0 */ When @CharAsciiValue Between 48 And 56 Then @CharAsciiValue + 1 When @CharAsciiValue = 57 Then 65 When @CharAsciiValue Between 65 And 89 Then @CharAsciiValue + 1 Else 48 End Set @NewNextKey = Stuff(@NewNextKey, @CharToProcess, 1, Char(@CharAsciiValue)) If @CharAsciiValue <> 48 Set @Done = 1 /* we are done */ Else Set @CharToProcess = @CharToProcess - 1 End End Update NextVC Set NextVCPK = @NewNextKey Where PK = 1 End go Declare @Time1 datetime Declare @Time2 datetime Declare @InsertCounter int Declare @NextKey varchar(6) Set @InsertCounter = 1 Set @Time1 = Current_Timestamp While @InsertCounter <= 100000 Begin Exec GetNextKey @NextKey Output Insert TestVC (PK) Values (@NextKey) Set @InsertCounter = @InsertCounter + 1 End Set @Time2 = Current_Timestamp Select DateDiff(ms, @Time1, @Time2) As ElaspedMS Exec sp_spaceused 'TestVC', 'True'
This took about 120 seconds, and the sp_spaceused said reserved space was 2,568 KB, data 2,504 KB, index 16 KB and unused 48 KB.
So about twice as long to load the varchar keys, twice as much space, and much more complicated coding.
Tom
> Thanks, everyone, for your replies. > > They were definitely helpful. > > Not sure I can convince the people in charge but at least now I know > I'm right :) Andrew J. Kelly - 05 Mar 2008 19:14 GMT And don't forget that every index you add will be that much larger as well. That means more network and disk I/O and less that will fit into cache.
 Signature Andrew J. Kelly SQL MVP Solid Quality Mentors
> If they won't listen to you, suggest they run a test. For example, a > simple test would be create a table with an integer primary key with the [quoted text clipped - 105 lines] >> Not sure I can convince the people in charge but at least now I know >> I'm right :) TheSQLGuru - 04 Mar 2008 14:44 GMT Andrew hit the nail on the head. int WILL be more efficient for your application as stated.
Also, don't waste the extra 4 bytes for a bigint if millions are all you expect. int can have -2.1B to +2.1B entries. so unless you expect to hit 4BILLION + entries go with int identity seeded at the negative limit.
 Signature Kevin G. Boles Indicium Resources, Inc. SQL Server MVP kgboles a earthlink dt net
> On Mar 3, 8:31 pm, "Andrew J. Kelly" <sqlmvpnooos...@shadhawk.com> > wrote: [quoted text clipped - 20 lines] > is there a point where have a generated varchar key like A283CD is > better than an integer ID of 2,828,828? Andrew J. Kelly - 04 Mar 2008 14:49 GMT I think Dan answered the questions but I have one more comment. I still don't get why you feel the need to use a Varchar over an INT just because of the number of rows. An INT will hold plus and minus over 6 billion values with only taking up 4 bytes and will always be more efficient in storage and searching than the equivalent character datatype would be.
 Signature Andrew J. Kelly SQL MVP Solid Quality Mentors
> On Mar 3, 8:31 pm, "Andrew J. Kelly" <sqlmvpnooos...@shadhawk.com> > wrote: [quoted text clipped - 20 lines] > is there a point where have a generated varchar key like A283CD is > better than an integer ID of 2,828,828?
|
|
|