Primary Foreign Key

Correcting an incorrect assumption helped me learn about Query Optimizer shortcuts.

A couple of weeks ago I talked about Best Practices and how important it is to understand why something is a best practice.

Well another aspect to take from that post was the importance of knowing; if you do not know something, then it is important for you to learn it.

That being said something that I did not know, but recently learned, was that there is nothing stopping a Primary Key from also being a Foreign Key.


Assumptions

When you think about it, this lack of knowledge came from incorrect assumptions. You read Primary KEY and Foreign KEY and you think to yourself, well they are both keys aren’t they? Same thing.

That is the trap that I fell into: not knowing, and making invalid assumptions. So let’s hopefully help you by looking at what the differences between them are.

First let’s create our tables:


-- Create our Foo and Bar table.
IF OBJECT_ID(N'dbo.Bar', N'U') IS NOT NULL
DROP TABLE dbo.Bar;
GO
IF OBJECT_ID(N'dbo.Foo', N'U') IS NOT NULL
DROP TABLE dbo.Foo;
GO

CREATE TABLE dbo.Foo
(
 FooID int IDENTITY(1, 1) NOT NULL
 CONSTRAINT [PK_dbo_Foo] PRIMARY KEY CLUSTERED,
 FooValue char(8)
);
GO
CREATE TABLE dbo.Bar
(
 BarID int
 CONSTRAINT [PK_dbo_Bar] PRIMARY KEY CLUSTERED,
 BarValue char(8),
 CONSTRAINT [FK_dbo_Bar_dbo_Foo] FOREIGN KEY (BarID)
 REFERENCES dbo.Foo (FooID)
);
GO

-- Declare our holding table.
DECLARE @FooIDs TABLE (FooID int);
 -- Insert into Foo.
INSERT INTO dbo.Foo (FooValue)
 -- Gather the new ID's from foo.
OUTPUT inserted.FooID INTO @FooIDs (FooID)
SELECT LEFT(NEWID(), 8) FROM sys.all_objects;

-- Insert Foo's ID into the Bar table.
INSERT INTO dbo.Bar (BarID, BarValue)
SELECT FooID, RIGHT(NEWID(), 8) FROM @FooIDs;

-- Select our tables.
SELECT * FROM dbo.Foo;
SELECT * FROM dbo.Bar;
GO
FooAndBarFKPKTables
Foo And Bar selects

Primary Keys (PK)


A Primary key is a column or combination of columns that contain values that uniquely identify each row in the table.

That’s it; it just has to uniquely identify the row.

btw you are going to hear the word “unique” a lot with regard to Primary keys…

Now there are other types of keys that can do the same (Surrogate Keys, Composite Keys, Unique Keys, Alternate Keys, etc) but these are outside the scope of this post.

So if we attempt to insert another record into our Primary Key column/column combo that violates this unique, identifying property, we’re going to have a bad time.

PKViolation

We have to use the IDENTITY_INSERT syntax because I’ve created the tables using IDENTITY and, if we were to insert a record into the identity column without turning IDENTITY_INSERT on first, another error would pop up before the PK violation error that we want.
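A minimal sketch of the kind of insert that triggers this error, assuming the dbo.Foo table created above already holds a row with FooID = 1 (the value 'Dupe' is just a placeholder of mine):

```sql
-- Allow explicit values in the identity column for this session.
SET IDENTITY_INSERT dbo.Foo ON;

-- FooID = 1 is assumed to exist already, so this should fail
-- with a PRIMARY KEY constraint violation.
INSERT INTO dbo.Foo (FooID, FooValue)
VALUES (1, 'Dupe');

-- Switch it back off so later inserts use the identity again.
SET IDENTITY_INSERT dbo.Foo OFF;
```

Only one table per session can have IDENTITY_INSERT switched on at a time, which is why the sketch turns it off again straight away.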

However, if we were to create our table without specifying the Primary Key constraint then the above insert would work and you would have duplicate entries populating your table, silent and deadly.

Foreign Keys (FK)


A Foreign key is a column or combination of columns that is used to establish and enforce a link between the data in two tables to control the data that can be stored in the foreign key table.

That’s it; it just has to establish and enforce a link between data.

If we try to violate this link, SQL Server will throw a different error and not let us.

FKViolation
If it is not in Foo, then it’s not getting in Bar
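A rough sketch of an insert that breaks the link, assuming the dbo.Foo and dbo.Bar tables from earlier. The values -1 and 'NoParent' are my own placeholders; -1 is assumed not to exist in Foo because FooID is an IDENTITY that starts at 1.

```sql
-- There is no FooID of -1 in dbo.Foo, so the foreign key on BarID
-- should reject this row with a FOREIGN KEY constraint conflict.
INSERT INTO dbo.Bar (BarID, BarValue)
VALUES (-1, 'NoParent');
```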

Yet if we were to create our table without specifying our Foreign key, then there would be no real link between our tables. So if our business depends on a record not being in Bar without being in Foo and we don’t have a constraint specified to that extent…

Unfortunately, I’m hard pressed to think of a way you can ensure this.

In fact, I don’t even like the above definition for Foreign keys as it states that two tables are necessary for a Foreign key constraint when only one is needed:

Example:


IF OBJECT_ID(N'dbo.HR', N'U') IS NOT NULL
DROP TABLE dbo.HR;
GO

CREATE TABLE dbo.HR
(
EmployeeID int
CONSTRAINT [PK_dbo_HR] PRIMARY KEY CLUSTERED,
FirstName varchar(20) NOT NULL,
SurName varchar(20) NOT NULL,
ManagerID int NULL
CONSTRAINT [FK_EmployeeID_Manager_ID] FOREIGN KEY
REFERENCES dbo.HR (EmployeeID)
);
GO

-- Check for foreign key
SELECT * FROM sys.foreign_keys WHERE [parent_object_id] = OBJECT_ID('dbo.HR');
GO

-- Check for primary key
SELECT * FROM sys.key_constraints WHERE [parent_object_id] = OBJECT_ID('dbo.HR');
GO

-- Check for everything.
EXEC sp_helpconstraint 'dbo.HR';
GO

SameTableForeignKey
Foreign Key only involving 1 table.

 

Differences

If you check the two definitions for Primary key and Foreign key you’ll see that, even though they are both called keys, they serve two different purposes; namely identifying rows and enforcing links.

And those two purposes are not mutually exclusive!

A column/column combo that identifies a row can also be used to enforce a link back to another table (or itself, as shown above with Foreign keys).

The assumption, that if you were one then you couldn’t be the other, was incorrect. If your business rules call for it, don’t let a column being one type of key stop it from being the other.
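If you want to see it for yourself, here is a sketch of a catalog query that lists any column on dbo.Bar that belongs to both a primary key and a foreign key. It assumes the dbo.Bar table from earlier; the column aliases are my own.

```sql
-- Columns on dbo.Bar that are part of a primary key AND a foreign key.
SELECT c.name  AS column_name,
       kc.name AS primary_key_name,
       fk.name AS foreign_key_name
FROM sys.columns AS c
JOIN sys.index_columns AS ic
    ON ic.object_id = c.object_id
   AND ic.column_id = c.column_id
JOIN sys.key_constraints AS kc
    ON kc.parent_object_id = ic.object_id
   AND kc.unique_index_id = ic.index_id
   AND kc.type = 'PK'
JOIN sys.foreign_key_columns AS fkc
    ON fkc.parent_object_id = c.object_id
   AND fkc.parent_column_id = c.column_id
JOIN sys.foreign_keys AS fk
    ON fk.object_id = fkc.constraint_object_id
WHERE c.object_id = OBJECT_ID(N'dbo.Bar');
```

With the tables as created above, I would expect a single row for BarID showing both PK_dbo_Bar and FK_dbo_Bar_dbo_Foo.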

Let’s not go to C to get to B

You may be thinking that this is a lot of hassle and that’s not an unfair thought.

Why not just not declare any key constraints and let the data fall as they may?

I will admit that it is a fair bit of effort to constantly define and declare the different key constraints when creating tables, especially as Developers are focused on efficiency, but it is worth it!

Now, while the following appears to hold true for any foreign key constraint (I haven’t finished testing yet), I found these while testing the above so I’m going to include them here.

SQL Server loves primary key and foreign key constraints.

A primary key gets a unique index created on it to enforce that it is unique and, since it has an index placed upon it, it can be used to speed up query selection.

A foreign key is special though as it forces a constraint and the query optimiser can use these constraints to take certain shortcuts 🙂

Query Optimizer (QO) Examples

-- Join our tables
SELECT F.* FROM dbo.Foo AS [F] JOIN dbo.Bar AS [B] ON F.FooID = B.BarID;
SELECT B.* FROM dbo.Foo AS [F] JOIN dbo.Bar AS [B] ON F.FooID = B.BarID;

PrimaryForeignKey_InnerJoin

Thanks to our constraint, the QO knows that if something exists in Bar, it must be in Foo and, since the second query does not select or order anything from Foo, it straight up ignores it.

Fewer reads, less IO; in general, all-round better performance.
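A sketch for checking this yourself, assuming the Foo and Bar tables from earlier. As far as I understand it, the shortcut depends on the foreign key being enabled and trusted, so the first query checks those flags before comparing the reads.

```sql
-- A foreign key that is disabled or not trusted cannot be relied on
-- by the optimizer, so check both flags first (0 is what we want).
SELECT name, is_disabled, is_not_trusted
FROM sys.foreign_keys
WHERE parent_object_id = OBJECT_ID(N'dbo.Bar');

-- Compare the logical reads reported per table for the two inner joins.
SET STATISTICS IO ON;
SELECT F.* FROM dbo.Foo AS [F] JOIN dbo.Bar AS [B] ON F.FooID = B.BarID;
SELECT B.* FROM dbo.Foo AS [F] JOIN dbo.Bar AS [B] ON F.FooID = B.BarID;
SET STATISTICS IO OFF;
```

If the join to Foo has been skipped, the Messages tab for the second query should only mention reads against Bar.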

Does this work with other joins though?
As above, where for something to exist in Bar it must exist in Foo, see if you can figure out why the QO decides it is safe to ignore some joins.

-- L.Join
SELECT F.* FROM dbo.Foo AS [F] LEFT JOIN dbo.Bar AS [B] ON F.FooID = B.BarID;
SELECT B.* FROM dbo.Foo AS [F] LEFT JOIN dbo.Bar AS [B] ON F.FooID = B.BarID;
PrimaryForeignKey_LeftJoin
Is the left join to Bar needed here?
-- R.Join.
SELECT F.* FROM dbo.Foo AS [F] RIGHT JOIN dbo.Bar AS [B] ON F.FooID = B.BarID;
SELECT B.* FROM dbo.Foo AS [F] RIGHT JOIN dbo.Bar AS [B] ON F.FooID = B.BarID;

 

PrimaryForeignKey_RightJoin
This is basically the reverse of Left Join
-- F.Join
SELECT F.* FROM dbo.Foo AS [F] FULL OUTER JOIN dbo.Bar AS [B] ON F.FooID = B.BarID;
SELECT B.* FROM dbo.Foo AS [F] FULL OUTER JOIN dbo.Bar AS [B] ON F.FooID = B.BarID;
PrimaryForeignKey_FullJoin
Is this surprising?

The “SET operators” (UNION, UNION ALL, INTERSECT, EXCEPT) act a bit differently.

I’ll let you take a look at them yourself though.
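If you want a starting point, here is a sketch of the sort of queries to try against the same Foo and Bar tables. I am not making any claims about what the plans look like; run them with the actual execution plan switched on and compare.

```sql
-- Keys present in both tables.
SELECT BarID FROM dbo.Bar
INTERSECT
SELECT FooID FROM dbo.Foo;

-- Keys in Bar that are missing from Foo (the foreign key says there are none).
SELECT BarID FROM dbo.Bar
EXCEPT
SELECT FooID FROM dbo.Foo;

-- De-duplicated and non-de-duplicated combinations of both key columns.
SELECT BarID FROM dbo.Bar
UNION
SELECT FooID FROM dbo.Foo;

SELECT BarID FROM dbo.Bar
UNION ALL
SELECT FooID FROM dbo.Foo;
```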

Final Note

There is a lot that I have yet to learn about SQL Server, in fact that is the main reason that I created this blog; so I could read back on these posts sometime in the future and smile at my ignorance.

Hopefully the main aspect that I take from this post though is that it is okay not to know something as long as you have the desire and the initiative to learn.

Oh, and a Primary Key can be a Foreign Key too. 🙂

Take care of your System Databases

Note:

To make it clear to anyone reading this: this post is about SQL Server, even though I start off talking a bit about PostgreSQL.

…I know, weird right?…


Back Story:

I have a PostgreSQL instance on my home laptop that I haven’t used yet.

I intend to start using it soon as a sort of hobby as I feel that there are things to be learned about databases from it. Like comparing features available to PostgreSQL that are not in SQL Server and vice-versa or the different ways the same tasks are accomplished in both platforms.

However SQL Server is the platform used in my work, I still have so much to learn with it (especially with 2016 coming out!!!), and I just find it so damn interesting so I haven’t touched PostgreSQL yet.

All that being said, I have signed up to a few newsletters from PostgreSQL (General, Novice, etc) and they are fascinating.
Unfamiliar words like pglogical and rsync are combined with known words like publisher and subscriber and the community itself is so vast and supportive that it rivals the #SQLFamily (from what I’ve seen and yes, I am biased towards SQL Server 🙂 ).

Problem:

One of those newsletters was regarding a problem a user was having with creating databases.
When he would create a new database it was not empty as he expected but was filled with user tables, logins, etc.

What was going on?


Explanation:

The explanation was pretty much what you would expect, just called by a different name.

He had basically written to his Model database (called template1 in PostgreSQL) sometime ago without realising it.

PostgreSQL has the following syntax with creating databases:

PostgreSQL:

CREATE DATABASE DatabaseName WITH TEMPLATE TemplateName

The new database settings are created from whatever template is specified using the WITH TEMPLATE syntax (defaults to template1 apparently).

This works the same way in SQL Server, with new databases inheriting their settings from the Model system database, but in our case it is implicit. There is no call for WITH TEMPLATE Model.
The following is perfectly valid syntax:

SQL Server:

CREATE DATABASE DatabaseName

The only difference that I can tell at the moment is that PostgreSQL can have multiple different templates while SQL Server has just one; Model.
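To see the SQL Server version of that user’s problem, here is a small sketch for a sandbox instance only. The names LeftInModel and ModelDemo are hypothetical ones that I have made up for the demo.

```sql
-- Leave a table behind in the Model database (sandbox only!).
CREATE TABLE model.dbo.LeftInModel (ID int);
GO

-- Any database created afterwards is built from Model...
CREATE DATABASE ModelDemo;
GO

-- ...so the "empty" new database should already contain LeftInModel.
SELECT name FROM ModelDemo.sys.tables;
GO

-- Clean up after ourselves.
DROP DATABASE ModelDemo;
GO
DROP TABLE model.dbo.LeftInModel;
GO
```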

Is this restriction on database templates a good thing or a limitation? Personally I go with the former but you may feel differently.

Multiple Models?…


Take Aways:

This brought me back to the system databases and there was something that I realised.

A lot of new users, and I was included in this list not too long ago, do not think about the system databases.

I’m not sure I can fault them either, as it’s probably not a priority. There is so much to learn with regard to DDL statements, DML statements, Deadlocking, General T-SQL, etc, that the system databases are just a little folder under Databases that does not get opened.

SystemDatabasesFolder
Figure 1.1

However, and I can’t stress this enough, these are important!

  • Master: logon accounts, sys info, etc.
  • MSDB: backup history, restore history, job history, etc.
  • Tempdb: temp tables, cursors, ordering, etc.
  • Model: new databases, creation of tempdb

And that is just scratching the surface!

Take care of these databases, do not limit yourself to looking after just the user databases.

They are not the only databases that need to be backed up and they are not the only databases that can experience corruption.
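As a rough sketch of what “taking care” could look like, assuming a backup folder of C:\Backups\ (a hypothetical path, swap in your own), you can back up and integrity check the system databases just like user databases. Tempdb is the exception: it cannot be backed up, as it is rebuilt every time the instance starts.

```sql
-- Back up the system databases (the path is an assumption for this sketch).
BACKUP DATABASE master TO DISK = N'C:\Backups\master.bak';
BACKUP DATABASE model  TO DISK = N'C:\Backups\model.bak';
BACKUP DATABASE msdb   TO DISK = N'C:\Backups\msdb.bak';

-- Check them for corruption as well.
DBCC CHECKDB (N'master') WITH NO_INFOMSGS;
DBCC CHECKDB (N'model')  WITH NO_INFOMSGS;
DBCC CHECKDB (N'msdb')   WITH NO_INFOMSGS;
```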


Parting Gift:

I’m hoping that you believe me with this but, unfortunately, the best lessons are learned.

You should have a destructive sandbox SQL Server (NOT PRODUCTION!!!), hopefully a little laptop at home to call your own; basically something that nobody else would mind you destroying.

Choose a system database, any one will do; delete that database, drop it, whatever you want, just make it unavailable and see how far you can get using SQL Server.

Hell isn’t it?…

Now imagine that happened unexpectedly and unwanted on a Monday morning because you weren’t taking care of your system databases.

Take care of your System Databases.

Temporary Tables Naming Constraints

Introduction:

Kenneth Fisher (b | t) recently wrote about Re-Evaluating Best Practices and, reading his post, I couldn’t help but agree with him. Especially with regard to:

Times change, knowledge changes so best practices have to change. Don’t rest on your knowledge or the knowledge of others. Make sure you understand not only what the best practice is but why it’s the best practice. And then question it.

Now I’m not going to bring up the Microsoft PLE of 300 advice as that example has been taken out and waved in front of people’s faces so many times that I feel it’s lost its impact and, as far as I am aware, it’s the only case where the best practice is so widely ineffectual.

However, the statement…

Make sure you understand not only what the best practice is but why it’s the best practice.

… is, for me, the crucial statement in his post and the catalyst for the following post as I’ve fallen for a case where the best practices are not applicable; Naming Constraints.


Naming Constraints:

In this post, we are going to be looking at the best practice of giving logical, descriptive names to constraints in tables.

The following code is going to create a table called dbo.NamingConstraints with a Primary key column, a named constraint column and an unnamed constraint column.

Create dbo.NamingConstraints:

CREATE TABLE dbo.NamingConstraints
(
ID int IDENTITY(1, 1) CONSTRAINT [PK_NamingConstraint_ID] PRIMARY KEY,
NamedConstraint int CONSTRAINT [NamedConstraint_gt_0] CHECK (NamedConstraint > 0),
UnNamedConstraint varchar(50) CHECK (UnNamedConstraint <> 'Forbidden')
);
GO

We can check these constraints with the following two queries, the first for the Primary key, and the second for the CHECK constraints, with the results in Figure 1.1.

Constraint Check:

-- Primary Key:
SELECT name, is_system_named, type_desc, unique_index_id
FROM sys.key_constraints
WHERE [parent_object_id] = OBJECT_ID('dbo.NamingConstraints');
-- Check Constraints:
SELECT name, is_system_named, type_desc, is_disabled, [definition]
FROM sys.check_constraints
WHERE [parent_object_id] = OBJECT_ID('dbo.NamingConstraints');
GO
NamingConstraintsTable_Constraints
Figure 1.1

As Figure 1.1 shows us when we don’t specify a name for a constraint, SQL Server will assign a name to that constraint for us.


Why Naming Constraints is Best Practice.

Constraints are best used to ensure referential and data integrity. Therefore they are commonly seen when someone attempts to insert data that goes against business logic into the database, and error messages are thrown.

When these error messages occur, they are normally passed into error logs from whatever application is integrated with our database. In these cases it is easier to deal with descriptive names than non-descriptive ones.

Taking our two CHECK constraints as examples, if we were to introduce error messages…

Create constraint errors:

-- UnNamed Constraint Violated;
INSERT INTO dbo.NamingConstraints
(NamedConstraint, UnNamedConstraint)
VALUES (1, 'Forbidden');

-- Named Constraint Violated;
INSERT INTO dbo.NamingConstraints
(NamedConstraint, UnNamedConstraint)
VALUES (-1, 'Allowed');

Looking at the first error message reported (Figure 2.1), we know from the error message that something is wrong in the Table dbo.NamingConstraints and the column is UnNamedConstraint but that is it. If this table had multiple constraints, and we weren’t the one to create this table and the constraints, we would have to do some (potentially lengthy) investigation to figure out what is causing the error and then correct it.

NamingConstraintsTable_UnNamedViolation
Figure 2.1

Compare that with the error message for our named constraint (Figure 2.2).

NamingConstraintsTable_NamedViolation
Figure 2.2

As we have a proper, descriptive constraint name here, straight away we can say that the error occurred as we tried to insert a value that was not greater than 0.


When Naming Constraints is not applicable.

TL;DR

Do not name constraints on temporary tables.

Why?

Why as in what would a use case for this be? I use this a lot to step through code with different variables, especially with stored procedures.

Two windows, side by side, running them step by step and comparing the results in each.

I know, fun right?…

Or why as in why should you not name constraints on temporary tables?
Well that’s going to require a bit more detail.

SQL Server requires a unique name on its objects as they must comply with the rules of identifiers.

So if we were troubleshooting a procedure and attempted to pass results into a temporary table…

CREATE TABLE #Temp02
( 
Col1 int IDENTITY(1,1) NOT NULL,
Col2 varchar(256) CONSTRAINT [Col2_neq_Forbidden] CHECK (Col2 <> 'Forbidden')
);

INSERT INTO #Temp02 (Col2)
SELECT name FROM sys.all_objects;

SELECT * FROM #Temp02;
GO

 

… we should have no problem.

TempTableCreationNoErrors
Figure 3.1

But say we try to do that in two separate windows…

SideBySideTempTableCreationErrors
Figure 3.2

… Big, angry error message telling us that it could not create the constraint and that an object already exists in the database.

Now say that we were to not explicitly name the constraints on these tables?

CREATE TABLE #Temp02
( 
Col1 int IDENTITY(1,1) NOT NULL,
Col2 varchar(256) CHECK (Col2 <> 'Forbidden') -- unnamed, so SQL Server generates the name
);

INSERT INTO #Temp02 (Col2)
SELECT name FROM sys.all_objects;

SELECT * FROM #Temp02;
GO
NamingConstraints_NoConstraints
Figure 3.3

No problem! Since we have not explicitly named the constraint, SQL Server does not violate its rule for identifiers and so does not throw an error message!


Caveats

Yes, I know that this could be classed as an extreme edge case but that is not the caveat that I’m talking about here.

If you are in the habit of not skipping over reading the actual SQL code, you may be wondering to yourself:

Well, the temp tables are called the same name and they should follow the rules for identifiers as well, no? Why aren’t they throwing an error?

Well that’s because temp tables are treated a bit differently than other objects.

Consider the following example where we try to find our temp table in TempDB:

SELECT * FROM tempdb.sys.tables WHERE name = '#Temp02';
NamingConstraints_NoResults_EqTempTable
Figure 4.1

Nothing. It doesn’t exist. But we didn’t drop it and we haven’t closed the scope so it can’t have just disappeared!

If we change our select statement to the LIKE operator with an ending %…

 

SELECT * FROM tempdb.sys.tables WHERE name LIKE '#Temp02%';
NamingConstraints_Results_LikeTempTable
Figure 4.2

SQL Server, knowing that temp tables could get created multiple times concurrently (especially if created in Stored Procedures), gets around the rule for identifiers with temp tables by adding a unique suffix onto each temp table that is created.

Therefore, it doesn’t violate the rule, it doesn’t error out and multiple concurrent same-named temp tables can be created.
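A sketch to compare how the two kinds of names are stored, assuming the #Temp02 table with the explicitly named constraint still exists in your session:

```sql
-- Table names in tempdb carry a generated suffix...
SELECT name
FROM tempdb.sys.tables
WHERE name LIKE '#Temp02%';

-- ...but an explicitly named constraint is stored exactly as written,
-- which is why a second session cannot create the same name again.
SELECT name, is_system_named
FROM tempdb.sys.check_constraints
WHERE parent_object_id = OBJECT_ID(N'tempdb..#Temp02');
```

For an unnamed constraint, is_system_named should come back as 1 and the name should be a generated one.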

Why doesn’t this unique suffix happen with constraints as well? Is this on purpose? By design?
Well the only answer I can give is, I don’t know.

But what I do know is that, in these cases with temp tables, don’t name your constraints.

SQL Server is messing with my Identity – solved.

In my last post, I was looking into creating new temporary tables from a SELECT INTO syntax when I ran across an issue that I couldn’t explain.

I realised that a situation like that cannot be allowed to continue on a blog post about SQL Server so I documented what I was experiencing (probably pretty poorly I may add) and said that when I had an answer, I would let you know.

Well I have an answer now and I confess that I wasn’t the one to figure it out.

However, I will try as best I can to explain it, mainly for myself though so I can look back on it.


Summary of the problem:

We have a table with an identity value of 1,000. When we select a subset of that table into a new table the identity value of the new table decreases to match the highest identity value of that subset.

From initial investigations, there is no visible evidence to show how this is achieved, so how is this happening?

Code:

SELECT * INTO #temp FROM dbo.A WHERE x1 = 1;

SELECT * FROM #temp;

Red Herrings:

When running a SELECT INTO query, if you were to enable STATISTICS PROFILE beforehand,

SET STATISTICS PROFILE ON;
SELECT A_ID, x1 INTO #Test FROM dbo.A WHERE x1 = 1;

…you will see an Argument column with the following code:

DEFINE:([Expr1007]=setidentity([LocalTesting].[dbo].[A].[A_ID],(-7),(0),N'#test'))

This is the red herring that I was talking about.

This Argument column threw me, but don’t let it confuse you too. There is no -7 arithmetic going on here.


Solution:

There are actually two phases to a SELECT INTO statement and it is the second one that we are interested in here.

As Paul White (t | b) says:

The second phase performs an insert into the table the first phase created. This insert is done with identity_insert semantics, so the identity values from the source table end up in the destination, unchanged. The highest value actually inserted is set as the last value used. You can use IDENT_CURRENT or sys.identity_columns to see it.

So there is no addition/subtraction going on here.

SQL Server is simply going:

> Are we done inserting? We are? Great, what was that last identity value? 998? Great, that’s your new identity value for this table!
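A quick sketch to check this for yourself, assuming the #temp table created by the SELECT INTO above still exists in your session. The two values should match.

```sql
-- Highest identity value that was actually copied across.
SELECT MAX(A_ID) AS highest_inserted_id
FROM #temp;

-- What SQL Server has recorded as the last value used for the new table.
SELECT name, seed_value, increment_value, last_value
FROM tempdb.sys.identity_columns
WHERE object_id = OBJECT_ID(N'tempdb..#temp');
```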

SQL Server is messing with my Identity

When working with SQL Server, you are going to be using TempDB; that is a certainty.

Using temporary tables, however, isn’t a certainty.
I’m taking it for granted that at one time or another you have used them but it’s not something I can say for certain.

If you aren’t using temporary tables AT ALL, please let me know what you’re using SQL Server for. I’ll either be impressed or shocked! …more than likely shocked…

However, one of the features that I find with temporary tables is that they inherit the identity from whatever table they were created from, and more often than not, this is not the identity that you want them to have.

I mean if we wanted them to have that identity, we would have just used the base table, right?

Now I have ways around this that I’ll blog about later on.

However, while working with these temporary tables and their identities, I found something that I just can’t explain yet and thought I would blog about it so when I do figure it out I can look back at this and call myself all kinds of names for being so stupid.


So first of all, let’s set up our base table:

USE tempdb;
GO

-- Create our base table
CREATE TABLE dbo.A
(A_ID INT IDENTITY(1, 1),
x1 INT,
noise1 int DEFAULT 1,
noise2 char(1) DEFAULT 'S',
noise3 date DEFAULT GETUTCDATE(),
noise4 bit DEFAULT 0);

-- Create random data between the range of [0-3]
INSERT INTO dbo.A(x1)
SELECT s1000.n FROM
( SELECT TOP (10) n = 1 FROM sys.columns) AS s10 -- 10
CROSS JOIN
( SELECT TOP (10) n = 1 FROM sys.columns) AS s100 -- 10 * 10
CROSS JOIN
( SELECT TOP (10) n = ABS(CHECKSUM(NEWID())) % 4 FROM sys.columns) AS s1000; -- 100 * 10

SELECT * FROM dbo.A;

Random-ish results but should be something like this:

NoiseTable

Now, SQL Server has a useful little Database Engine Stored Procedure called sp_help that, along with a plethora of other useful information, can return a result set on identity.

If we query our table, we would get the following results:

EXECUTE sp_help 'dbo.A';

IdentityFromSpHelp
What this is telling us is that:

  • Our Identity column is [A_ID] (identity),
  • Our Identity value starts at 1 (seed), and
  • It goes up by 1 (increment) each time.

little tip: check out Tools -> Options -> Keyboard -> Query Shortcuts… you can just highlight a table and hit “Alt + F1” and you have the same results as running the above query…Efficiency for the win!…

which is great if we wanted to know what we started off with, but what about looking into the future? What about knowing what’s going to get inserted next?
Well for the identity column, we can!

DBCC CHECKIDENT(), passing in your table name.

NOTE: I’ve just given you a potentially dangerous command as it has some optional parameters that can break stuff. Do me a favour and make sure you’re not on Production.

…see the above? How it’s in bold? Do me a favour and double check you’re not on Production? Cheers!…
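If you only want to look and not touch, a safer sketch is to pass NORESEED, which reports the current identity value without changing it. This assumes the dbo.A table from above. As far as I know, leaving the option off lets the command correct the identity value when it is lower than the highest value in the column, so being explicit does no harm.

```sql
-- Report only: NORESEED means the current identity value is not changed.
DBCC CHECKIDENT ('dbo.A', NORESEED);
```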

CheckIdentResultsA

So we’ve looked into the future and we can tell that with sp_help and DBCC CHECKIDENT, our future identity will be:

DBCC CHECKIDENT().current identity value + sp_help.increment
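The same sum can be sketched with the built-in identity functions, assuming the dbo.A table from above already has rows in it and that nobody else is inserting at the same time (the column aliases are my own):

```sql
-- Current identity value plus the increment gives the likely next value.
SELECT IDENT_CURRENT('dbo.A') AS current_identity,
       IDENT_INCR('dbo.A')    AS increment,
       IDENT_CURRENT('dbo.A') + IDENT_INCR('dbo.A') AS likely_next_identity;
```

Treat the result as an estimate rather than a promise; another session can grab that value before you do.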

 


Now say that part of our query is only interested in rows where x1 = 1.
The most basic way to create our temporary table with this information is probably the one that the majority would use:

SELECT * INTO #temp FROM dbo.A WHERE x1 = 1;

SELECT * FROM #temp;

…Nice, simple, easy. 3 things that we want….

SelectFrom#Temp

And if we were to ALT+ F1 #temp we would get the following:

TempIdentity

Same identity!…

So with the same identity, we have the same current identity value, right?

DBCC CHECKIDENT('#temp')

CheckIdentResultsTemp

WRONG!…

Now I like this.
I appreciate this.
I’m thankful for this.

But I’m not sure why exactly this happens, apart from SQL Server being smart enough to know that it’s not going to need the exact same current identity value for this new table.
Using a (slightly deprecated) command

SET STATISTICS PROFILE ON;
SELECT * INTO #temp FROM dbo.A WHERE x1 = 1;

IdentityChangeSQLSERver

..it seems like SQL Server is changing the identity for you…


Now just to make this even more confusing, we run the following which for me shows that the last x1 value is 2.

SELECT * FROM dbo.A ORDER BY A_ID DESC;

LatestAx1

So what I’m expecting is that, if we were to select this into a temporary table, we wouldn’t see that setidentity…-7.

SET STATISTICS PROFILE ON;
SELECT * INTO #temp2 FROM dbo.A WHERE x1 = 2;

setidentShouldntBethere

But it’s there!…

Fine, okay so, just means that our current identity value has been changed!

So we run our DBCC CHECKIDENT and…

DBCC CHECKIDENT ('#temp2')

takeAwayIdentButRemainTheSame

No change.


And that’s my problem!
I don’t have an answer for this at the moment but when I figure this out, I’ll let you know, and if anyone has the answer, feel free to share 🙂

But at least I have a partial answer:

Question: When does subtracting 7 from a value equal no change?
Answer: When SQL Server is involved.

 

Output over cursors

I am not pro-cursor. I am not, however, extremely anti-cursor.

I think that if there is a cursor used for a certain problem it just means we don’t know enough about the problem, the data or the code to think of a better solution.

But I’ve had an experience with cursors and, for my cases, found a better way.

That is what this blog post is going to be about, a case where we were using a cursor until we realised a better solution for it.


Background:

We have a weird table.

It’s our fault though as we’ve made it weird, we haven’t properly normalised it and it’s basically just a massive catch-all for records and for some strange reason, only 1 in 10 records have a ticket number.

So let’s create it!

-- Create the test table.
CREATE TABLE
    [dbo].[ProductActions]
    (
    [product_action_id]    INT IDENTITY(0, 1) PRIMARY KEY,
    [product_id]           INT,
    [allowed_action_id]    INT,
    [ticket_number]        VARCHAR(20),
    [date_actioned]        DATE
    );

-- Populate it.
WITH
    L0    AS(SELECT 1 AS c UNION ALL SELECT 1),
    L1    AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
    L2    AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
    L3    AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
    L4    AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
    L5    AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B)
INSERT dbo.ProductActions
    (product_id, allowed_action_id, date_actioned)
SELECT TOP (10000000)
    product_id = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) % 5,
    allowed_action_id = (ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) + 5) / 5,
    date_actioned = CAST(DATEADD(DAY, (SELECT ABS(CHECKSUM(NEWID())) % 50000), 0) AS DATE)
FROM L5;

UPDATE dbo.ProductActions
SET ticket_number = (CASE
                     WHEN product_action_id % 10 = 0
                     THEN 'TCKT'
                           + REPLACE(QUOTENAME(product_action_id, ''''),'''','')
                     ELSE NULL
                     END);

 

 

…that took around 1 minute for 10,000,000 records, including the create and update statements…

Now say something diabolical happens, a bad delete statement, a disgruntled employee who just won the lottery, a chimpanzee riding on a segway, whatever but for some reason all the records holding our ticket numbers get deleted.
However, thankfully (not sure if that’s the right word) we have Excel sheets of the records that were dropped.

ticketNumbersDeleted

… 9 seconds! 9 seconds for that wealthy employee/chimpanzee to cause havoc…

However, we don’t have everything; we have the product_id and the allowed_action_id.
So the business tells us that the dates don’t matter, it just needs those records back and those ticket numbers!
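If you want to recreate the disaster on your own sandbox, here is a sketch that stands in for both the bad delete and the Excel import. It assumes the dbo.ProductActions table from above; the shape of dbo.DeletedRecordsPaper (just the two columns we still have on paper) is my assumption about what the holding table looked like.

```sql
-- Holding table standing in for the imported Excel sheets (assumed shape).
CREATE TABLE dbo.DeletedRecordsPaper
(
    product_id        int,
    allowed_action_id int
);
GO

-- Simulate the bad delete, keeping only the two columns
-- that the "paper" records would have held.
DELETE FROM dbo.ProductActions
OUTPUT deleted.product_id, deleted.allowed_action_id
INTO dbo.DeletedRecordsPaper (product_id, allowed_action_id)
WHERE ticket_number IS NOT NULL;
GO
```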


What we did:

Cursors. (I know, I’m sorry.)

However, we had problems.
Our table had a lot of traffic so we weren’t guaranteed an unbroken sequential range of identifiers that we could update.

And we couldn’t remove these foreign inserts from our range as we had no way of identifying which were our records and which were foreign records.

So we created a cursor. Not a performance tuned cursor since we were relatively new to SQL at the time but a run-of-the-mill, Google what it is and grab the template off the internet cursor.

Steps:

  1. Import the records into a holding table
  2. Give the table a sequential identifier for ordering
  3. Get the first record
  4. Insert it into the original table
  5. Grab the identity of the insert we just inserted
  6. Update it to have the correct ticket number
  7. Grab the next record and repeat steps 4 – 6
  8. Finish

All together 8 steps (although steps 3 – 7 are done around 1 million times) but how fast is it?

Step 2:

ALTER TABLE
dbo.DeletedRecordsPaper
ADD [ID] INT IDENTITY(0, 1);
GO

…around 34 seconds, surprisingly long…sign of things to come?…

Steps 3 – 7:

DECLARE @prodID INT,
@allowed INT,
@entered INT;
-- Start of Step 3

DECLARE FillMissingRecords
CURSOR FOR
SELECT product_id, allowed_action_id
FROM dbo.DeletedRecordsPaper
ORDER BY ID ASC;

OPEN FillMissingRecords

FETCH NEXT
FROM FillMissingRecords
INTO @prodID,
@allowed

WHILE @@FETCH_STATUS = 0
BEGIN  -- Step 4
INSERT dbo.ProductActions
(product_id, allowed_action_id, date_actioned)
SELECT @prodID,
@allowed,
GETDATE();
-- Step 5
SELECT @entered = SCOPE_IDENTITY();
-- Step 6
UPDATE dbo.ProductActions
SET ticket_number = 'TCKT' + CAST(@entered AS varchar(10))
WHERE product_action_id = @entered;
-- Step 7
FETCH NEXT
FROM FillMissingRecords
INTO @prodID,
@allowed

END

CLOSE FillMissingRecords

DEALLOCATE FillMissingRecords;

How’d it do? Well it worked; 1 million records took 9 minutes and 35 seconds though.

…9 minutes and 35 seconds…9 minutes and 35 seconds?!?…I could have recreated the table 9 and a half times in that time! I knew it was going to be slow, I even went out and got a coffee while I was waiting but C’MON!!!!

So altogether, with the adding of the sequential identifier, it took 10 minutes to run.

Now I can’t remember why exactly but this happened a few times so we had to run this a few times.


What we learned:

1). Eventually monkeys improve with using keyboards, and so too do humans, and 2). it’s SQL Server. Use Set Theory!

Enter OUTPUT

Looking at this issue, the main problem we could see was that we needed to know which records we had inserted so that we could update them with a ticket number.

We thought that we couldn’t figure out what we had inserted without using SCOPE_IDENTITY. We didn’t know about the inserted table, we didn’t know about table variables; essentially, we didn’t know about OUTPUT.
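For anyone in the same boat, here is a minimal, self-contained sketch of the idea. The temp table #OutputDemo and the table variable @new_ids are made-up names for illustration only and are not part of our real schema; the point is simply that OUTPUT hands back every row the INSERT wrote, in one statement.

```sql
-- Hypothetical demo table; not part of the original schema.
CREATE TABLE #OutputDemo
(
    demo_id    int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    demo_value varchar(10) NOT NULL
);

-- Table variable to catch the identities generated by the insert.
DECLARE @new_ids TABLE (demo_id int NOT NULL);

-- "inserted" is the pseudo-table exposing the rows this statement wrote.
INSERT INTO #OutputDemo (demo_value)
OUTPUT inserted.demo_id INTO @new_ids (demo_id)
VALUES ('first'), ('second'), ('third');

-- One row per inserted record, with no SCOPE_IDENTITY() loop needed.
SELECT demo_id FROM @new_ids;

DROP TABLE #OutputDemo;
```

Run as a single batch, the final SELECT should return three identity values, one for each row inserted.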

Steps:

  1. Import the records into a holding table
  2. Declare a holding table for the identities
  3. Insert all the records into the original table and output the values to our holding table
  4. Update all the records inserted to have the correct ticket number
  5. Finish

Altogether 5 steps, already an improvement, but that’s just steps, not performance. Is it an actual performance improvement?

…to be honest, if it was in any way faster than 9 minutes, I’d be happy…

-- Step 2
DECLARE @entered TABLE
(
    entered_id INT
);

-- Step 3
INSERT dbo.ProductActions
    (product_id, allowed_action_id, date_actioned)
OUTPUT inserted.product_action_id
INTO @entered
SELECT product_id,
       allowed_action_id,
       GETDATE()
FROM dbo.DeletedRecordsPaper;

-- Step 4
UPDATE pa
SET ticket_number = 'TCKT' + CAST(e.entered_id AS varchar(10))
FROM dbo.ProductActions AS [pa]
JOIN @entered AS e
    ON pa.product_action_id = e.entered_id;

 

29 SECONDS!!!! 1 million rows in 29 seconds!

So we’ve gone from 1 million rows in 9 minutes 35 seconds (~1,739 rows a second) to 1 million in 29 seconds (~34,483 rows a second).

…those results flustered me so much that I forgot maths so if anybody wants to work out that improvement percentage, please let me know. I want to write that one down!…
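Taking up that invitation, here is a quick sketch of the arithmetic. It uses only the two timings quoted above (9 minutes 35 seconds and 29 seconds) and the round figure of 1 million rows; the variable and column names are my own, purely for illustration.

```sql
DECLARE @rows           decimal(12, 2) = 1000000,
        @cursor_seconds decimal(12, 2) = (9 * 60) + 35,  -- 9 min 35 s = 575 s
        @output_seconds decimal(12, 2) = 29;

SELECT
    CAST(@rows / @cursor_seconds AS decimal(12, 2)) AS cursor_rows_per_second,
    CAST(@rows / @output_seconds AS decimal(12, 2)) AS output_rows_per_second,
    -- How many times quicker the OUTPUT version finished.
    CAST(@cursor_seconds / @output_seconds AS decimal(12, 2)) AS times_faster,
    -- Share of the original run time that was saved.
    CAST(100 * (@cursor_seconds - @output_seconds) / @cursor_seconds AS decimal(12, 2)) AS pct_less_time;
```

On the timings quoted, that works out at roughly 19.8 times faster, or about 95% less elapsed time.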


Conclusion:

The most dangerous phrase in the English language is probably “what does this button do?”
The second most dangerous is “we’ve always done it that way”.
There’s always room for testing improvements (on the right system, i.e. NOT PRODUCTION), and I’ll take OUTPUT over CURSORS any day.

 

 

 

Beautifully SARGable, beautiful Query

This blog post is not what I was intending to write, at least not so early in my list of blog postings, mainly because its main issue is SARG-ability, which I think is a bit more advanced than normal.

…by ‘a bit more advanced’ I mean any conversation involving me and SARG-ability would carry along these lines…
> Do you know what SARG-ability is?
>> Me? Yeah, totally!
> Great! What is it so?
>>…eh….indexes?

Which isn’t totally wrong but since SARG-ability is a main part of this post, it’s best I give you a definition and an example (BONUS: a lovely video!)

So a query is SARGable if “the DBMS engine can take advantage of an index to speed up the execution of the query.”

This SARG-ability is what we are going to be searching for today because we’re going to be searching a table of 15,000,000 rows.

…why 15,000,000? Well I first wanted to populate as many rows as I could but thanks to the joys that are Cartesian products I ended up crashing my SSMS! So 15,000,000 will do us…

 

Now imagine this was a book. We could search every single one of its 15,000,000 pages for the page with the result we want, OR we could go to the back of the book, look up in the index where the result is, and go directly to that page. Which do you think would be faster? Easier?

…I’m really hoping you went with the Index option…

Hence, in our example SARG-ability is good and it’s what we want to achieve today.

So with that, on to the actual post…


One of the most common things to search for in SQL Server is dates.  Nice and simple and everyone has (presumably) done it.

SELECT
*
FROM [<my_schema>].[<my_table>]
WHERE [<date_column>] = '<date>';

Fine, grand, no problem…if your table is small but remember, we’ve got a table with 15,000,000 rows. Checking every single row is just not going to cut it here.

Unfortunately that is what happened in this situation.
A requirement was given to Development to get all the rows where the date was this year, and Development were smart enough to know that there is a SQL Server function for that.

DECLARE @Year INT = 2016;
SELECT [Test_Date] FROM [dbo].[DateTest] WHERE YEAR([Test_Date]) = @Year;

So they sent it on to me to approve and apply to the QA environment, and I did what I have come to learn is what most DBAs do to Developers: I said no.

Why? Here’s why.
First let’s create a test table to run this against…this insert took around 2 minutes on my machine.


-- First, create our table
CREATE TABLE [dbo].[DateTest] ([Date_Test_Id] INT IDENTITY(1, 1), [Test_Date] datetime2(3));

-- Populate it with 15,000,000 random rows
-- from 1st Jan 1900 up to (but not including) 1st Jan 2017.
INSERT INTO [dbo].[DateTest] ([Test_Date])
SELECT
TOP (15000000)
DATEADD(DAY, ABS(CHECKSUM(NEWID())) % 42734, 0)    -- 0 = 1st Jan 1900
FROM [sys].[messages] AS [m1]
CROSS JOIN [sys].[messages] AS [m2];

-- Create an index we can use for our tests
CREATE NONCLUSTERED INDEX [nci_DateTest_Test_Date] ON [dbo].[DateTest] ([Test_Date]);

-- Show the data
SELECT * FROM [DateTest];

…who else knew that there were 42,734 days from the 1st of Jan 1900 to the 1st of Jan 2017? I didn’t!…
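If you would rather not take my word for that number, a small sketch like this should confirm it. The column aliases are made up for illustration.

```sql
-- Days from 1 Jan 1900 up to (not including) 1 Jan 2017.
SELECT DATEDIFF(DAY, '19000101', '20170101') AS days_in_range;   -- 42734

-- The largest offset the modulo can produce is 42733,
-- which lands on the last day of 2016.
SELECT DATEADD(DAY, 42733, '19000101') AS latest_possible_date;  -- 2016-12-31 00:00:00.000
```

So every generated date falls somewhere between 1 Jan 1900 and 31 Dec 2016 inclusive.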

So now that 1). we have our table and 2). we have an index we can use, we can run the developer’s query and be SARGable, right?

DECLARE @Year INT = 2016;
SELECT [Test_Date] FROM [dbo].[DateTest] WHERE YEAR([Test_Date]) = @Year;
GO

NonSargable

Nope! Table scan; it ignores our index and reads all 15M rows (too lazy for all the zeros) for a measly 127,782 rows! It’s not the slowest, taking around 3.960 seconds, but still, we want SARGable!!!

…your row results may differ, random is random after all…
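If you want numbers to go with the execution plan, one option is to turn on the standard SET STATISTICS IO and SET STATISTICS TIME session settings before running the query. This is only a sketch of how I would measure it; the reads and timings you get will depend on your own machine and data.

```sql
-- Report logical reads and CPU/elapsed time in the Messages tab.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

DECLARE @Year int = 2016;

-- Wrapping the column in YEAR() means every row has to be evaluated.
SELECT [Test_Date]
FROM [dbo].[DateTest]
WHERE YEAR([Test_Date]) = @Year;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Running the same wrapper around any rewritten version of the query gives a like-for-like comparison of logical reads.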

So being the kind, gentle, caring soul that I am (as well as being on my fourth coffee) I went back with a SARGable method.

DECLARE @Year INT = 2016;
SELECT [Test_Date] FROM [dbo].[DateTest]
WHERE [Test_Date] < DATEADD(YEAR, (@Year - 1900) + 1, 0)    --    Start of Next Year
AND [Test_Date] >= DATEADD(YEAR, (@Year - 1900), 0);    --    Start of This Year
GO

SARGable_notpretty

Look at that beautiful, beautiful index seek. 127,782 rows in 1.807 seconds, over twice as fast!
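If the DATEADD arithmetic looks opaque, it may help to see what the expressions evaluate to. The trick relies on an integer in the date position being counted in days from 1 Jan 1900, so 0 is 1 Jan 1900 and -1 is the day before. A quick sketch, with column aliases of my own choosing:

```sql
DECLARE @Year int = 2016;

SELECT
    CAST(0 AS datetime)                  AS day_zero,            -- 1900-01-01
    DATEADD(YEAR, @Year - 1900, 0)       AS start_of_this_year,  -- 2016-01-01
    DATEADD(YEAR, (@Year - 1900) + 1, 0) AS start_of_next_year,  -- 2017-01-01
    DATEADD(YEAR, @Year - 1900, -1)      AS end_of_last_year;    -- 2015-12-31
```

Each value comes back as a datetime at midnight, so a boundary built from -1 is the start of 31 Dec 2015, not the last instant of it.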

But as beautiful as that index seek is, it is not what the developer saw.
> That’s an ugly query, it’s going to be horrible to troubleshoot that.
And off he went.

What he came back with was pleasantly surprising and reminds me that one of the best things about SQL Server is that, for all its restrictive syntax and rules, there is no one way to do anything.
What he came back with was the beauty of DATEFROMPARTS.

DECLARE @Year INT = 2016;
SELECT [Test_Date] FROM [dbo].[DateTest]
WHERE [Test_Date] BETWEEN DATEFROMPARTS(@Year, 1, 1) AND DATEFROMPARTS(@Year, 12, 31);
GO

SARGable_pretty

Boom! Index seek of 127,782 rows in 1.807 seconds.

I was happy because it was SARG-able and he was happy as it was easier to troubleshoot.
So he asked if it could be applied to the QA environment, and I did what I have come to learn is what most DBAs do to Developers: I said no.

In case you’re wondering why, consider two things: the data type of the [Test_Date] column and this example.

SELECT * FROM [dbo].[DateTest] WHERE [Test_Date] = '2016-12-31';

378 row(s) affected.

UPDATE TOP (3) [dbo].[DateTest] SET [Test_Date] = '2016-12-31 12:34:56' WHERE [Test_Date] = '2016-12-31';

3 row(s) affected.

rows_Different_pretty_nonpretty

3 rows difference…I’ll let you figure out why.
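For anyone who wants to check their answer, here is one possible middle ground. It is a sketch under my own assumptions, not necessarily what we ended up deploying: keep the readable DATEFROMPARTS calls, but swap BETWEEN for a half-open range so that rows carrying a time of day on 31 Dec are still caught.

```sql
DECLARE @Year int = 2016;

SELECT [Test_Date]
FROM [dbo].[DateTest]
WHERE [Test_Date] >= DATEFROMPARTS(@Year, 1, 1)        -- start of this year, inclusive
  AND [Test_Date] <  DATEFROMPARTS(@Year + 1, 1, 1);   -- start of next year, exclusive
```

Because the column itself is left untouched on the left-hand side of each comparison, this form should still be able to use the index on [Test_Date].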