37
HOW TO IMPLEMENT DATA INTEGRITY IN SQL SERVER Louis Davidson (drsql.org) [email protected]

HOW TO IMPLEMENT DATA INTEGRITY IN SQL SERVER Louis Davidson (drsql.org) [email protected]

Embed Size (px)

Citation preview

HOW TO IMPLEMENT DATA INTEGRITY IN SQL SERVERLouis Davidson (drsql.org)

[email protected]

Why did I choose data integrity as a topic?

• Answer 1• If I obviously lie to you, will you trust me?

• If your data obviously lies to your customer, will they trust it?

• For data to become information, it has to be as trustworthy as reasonably possible.

• Answer 2• If I were the judge and was convicting someone of poor

data integrity, I would sentence them to write/maintain ETL

• I wrote this slide at 12:49am, 8/14/2013 because I had to get up and fix a data integrity issue

First Line of Defense – Testing and Requirements• First, know what your user wants (Requirements)

• Build queries to check the data is within tolerances as you build• Define both illegal values and exceptional values

• Age for DayCare Student: • Legal: 1-8 Illegal: Everything else

• Outside the norm, but perhaps possible: 1, 2, 6, 7, 8

• Save these queries as you go

• Test during all phases of the project• Design

• Development

• Customer testing

• Production

• No matter if you follow any of my following advice and let your tables go naked, these scripts can be used to verify data is within tolerances

Requirements tell you what to test for, now WHERE/HOW to implement?

• Classic client server: “May I save this data, please?”

• Very trustworthy data protectionDB

• Built in tools that most programmers understand

• Flexible to change as users need them to

Middle Tier, Rules Engine

• Friendly for the user• Provide immediate feedback for nominal

rules, limiting bandwidth utilizationUI

No one place can satisfy well enough (But…)

• No interactive protection, limit to 100% true rules

• Extremely limited flexibilityDB

• Suffers under highly concurrent situations• Difficult for Inter-row, Inter-table rules• Difficult to use with tools like SQL, SSIS

Middle Tier, Rules Engine

• Must be recoded for every form/screen• Very limited rule set that can be enforcedUI

Database Layer Responsibilities

• 100% Rules• Always true

• Usually very simple rules

• Failure to meet the prescribed condition would be harmful to the software (and possibly the users of the software)

• Other layers repeat some rules and implement everything else

Database tier layered approach

• Keep it simple

• Enforce integrity via (Our Agenda for the next 1hr)

• Structure - providing correct places to store data

• Keys - protecting uniqueness

• Relationships - foreign keys

• Domains - limiting data points to size/values that are legit

• Conditions - required situations (Customer may have only 1 primary address; No overlapping ranges; etc)

But We Don’t Want Errors from the Data Tier!• A frequent concern of non-data tier programmer

• Even if you put no constraints you are apt to get errors • You are always likely to get deadlocks

• And if your indexing isn’t great, you may get them frequently

• Best to code error handler that handle any error condition regardless

• If the other tiers handle all of the errors, then the database protection should remain silent• Except perhaps during testing/coding

Structure

• Match the user's needs precisely to the design with room for growth

• Getting design to match the user's needs will get you way down the road to integrity

• Normalization will usually get the car fueled up and started• Naming stuff well doesn’t hurt either…

• Getting it right can only be done by understanding the users requirements• I promise, no more requirement talk

If your structure is wrong…Users will find a way• Requirement: Store information about books

• What is wrong with this table?• Lots of books have > 1 Author.

• What are common way users would “solve” the problem?• Any way they think of!

• What’s an another common way someone might fix this?

BookISBN BookTitle BookPublisher Author=========== ------------- --------------- -----------111111111 Normalization Apress Louis222222222 T-SQL Apress Michael333333333 Indexing Microsoft Kim444444444 DB Design Apress Jessica444444444-1 DB Design Apress Louis

, Louis& Louisand Louis

Close, but still quite messy

• Add a repeating group?

• But now how to represent who was the primary author on the book?

BookISBN BookTitle BookPublisher …=========== ------------- --------------- 111111111 Normalization Apress …222222222 T-SQL Apress …333333333 Indexing Microsoft …444444444 Design Apress …

Author1 Author2 Author3----------- ----------- -----------LouisMichaelKimJessica Louis

Now, the structure protects the data…

• And it gives you easy expansion

BookISBN BookTitle BookPublisher =========== ------------- ---------------111111111 Normalization Apress 222222222 T-SQL Apress 333333333 Indexing Microsoft444444444 Design Apress

BookISBN Author=========== =============111111111 Louis222222222 Michael333333333 Kim444444444 Jessica

ContributionType----------------Principal AuthorPrincipal AuthorPrincipal AuthorContributorPrincipal Author444444444 Louis

Keys

• Defending against duplication of data where it oughtn't be duplicated

• Artificial Key (Identity/GUID/Sequence generated value) should NOT be the only key

• When employed, Artificial Key is for tuning, Natural Key is for the user

• Avoid giving users sequentially created values • Well, I am account 0000001, what about account

0000002

Uniqueness Counts

• Requirement: Table of school mascots

• For a row to be truly unique, some manner of constraint needs to be on column(s) that have meaning

• It is a good idea to unit test your structures by putting in data that looks really wrong and see if it stops you, warns you, or something!

MascotId Name=========== -----------1 Smokey112 Smokey4567 Smokey 979796 Smokey

Color-----------Black/BrownBlack/WhiteSmoky Brown

School-----------UTCentral HighLess Central HighSouthwest Middle

~~~~~~~~~~~ ~~~~~~~~~~~

Key Constraints

• Applied to protect data from duplication• May help performance, but should exist even if never used for

a query

• Part of the data structure – applied with ALTER TABLE – unlike indexes, which are generally attached for performance

• NULLs• Primary Key – No NULLs Allowed

• Unique – NULL allows, but treated as a single value

• Table Clustering• Usually makes sense for the primary key to be clustered (not a

hard and fast rule though)

• Key constraints valuable with or without clustering

Demo – Key Constraints (and a wee bit more)

Relationships• Establishes a connection between two tables

• Probably the most trouble to implement from outside of the database• Concurrent users means data can change

• Caching all data is really costly (particularly to keep up to date with multiple caching servers for inserts, updates, and deletes!)

• Using foreign key constraints means these types of queries always return the same value:• SELECT COUNT(*)

FROM InvoiceLineItem

• SELECT COUNT(*) FROM Invoice JOIN InvoiceLineItem ON Invoice.InvoiceId = InvoiceLineItem.InvoiceId

Foreign Key Constraints

• Like CHECK CONSTRAINTs, are part of the table structure

• One table can reference another’s PRIMARY KEY key columns, or even the UNIQUE key columns

• Indexing the child’s reference key can be helpful in many cases

• Usually extremely fast, even on very large tables• As long as key’s underlying indexes maintained

• For integer keys, a B-Tree index can search millions of rows in a few reads

Foreign Key Cascading• Can define cascading operations

• UPDATE CASCADE – Deleting the parent deletes the children

• DELETE SET NULL – Updating the parent key set the child reference key to NULL

• DELETE SET DEFAULT – Deleting the parent row sets the child row to the default

• UPDATE NO ACTION – Fail if any child rows exist – THE DEFAULT

• Or other combinations of DELETE and UPDATE with CASCADE, SET NULL, SET DEFAULT, or NO ACTION

• DELETE CASCADE operations should be limited, to avoid surprises

• Use UPDATE CASCADE where you have updatable primary keys. Changing a primary key with references is messy.

• Multiple or Cyclic cascade paths require INSTEAD OF triggers or procedures to implement

Demo – Foreign Keys

Domains

• Defining the domain of an object or column• Table - Customers? All customers or certain types?

• Column• Integer? Or Whole number between 0 and 10,000,000

• True Unicode Value accepting 64K Characters? Or simple AlphaNumeric?

• Can you accept 2GB of Text (varchar(max))?

• Goal 0% chance of defects• No situational intelligence

• If there can be ANY variation, then the domain includes the variations

• Can't fight users doing dumb stuff

Please don’t do this. Please?

CREATE TABLE object ( objectId uniqueidentifier,

fillMeUp varchar(max)

)

Extreme Bucket Datatypes

• numeric(38,2)• Max value:

999,999,999,999,999,999,999,999,999,999,999,999.99

• Bill gates worth: < $99,999,999,999.99

• US National Debt + All personal Debt: < $99,999,999,999,999.99

• For a nutty value: Distance to nearest galaxy in inches, yes, inches

~74488200000000000000000.00

Extreme Bucket Datatypes - Strings• varchar(8000)

• abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz

• That is just 780 characters!

• Note: If you allow N characters, your apps should minimally test for N (successfully), and N + 1 characters (error)

On the other hand, don’t be over restrictive

Why did they name me 555 95472? Now I

can’t go to school because of the stupid

school database!

http://peanuts.wikia.com/wiki/File:555.jpg

Single Column Domain• What data is EVER "legal" for a column

• Most data integrity issues are due to lack of domain control• Missperllings: TN, TNN, TENN, TENNESEE, TINNESEE

• Bad values: -1 for Age, NULL for required value, Random default value chosen

• Implementation Includes• Intrinsic data type

• Optionality (NULL v NOT NULL)

• Default Value

• Simple predicates • Check constraint

• Domain table

• Forcing the Issue: Trigger

Multiple column

• Where the domain of one column is affected by the domain/value of another

• Examples: • if col1 = 1 then col2 in (1,2,3)

else col 2 in (3,4,5)

• If col1 = 'bob' then col2 is NOT NULL

• Usually implemented with a CHECK constraint

Multiple Column Concerns• Minimize these conditions to only where necessary to avoid

illogical/illegal data• RefusedToGiveBirthDateFlag = 1 AND BirthDate is null

• Questionable: if DiscountPercent > .5, then ApproverUserId is not null.

• Likely Contraindicated - Processing Situations• The user enters Date1 always before Date2

• The ship date must be after the order date

• Avoid domains based on data in other tables because data in other tables can shift, leading to messy situations• discountPercent > .5 and savingUser.needsApproverFlag = 1 then

ApproverUserId is not null

• What happens if you change/delete the user that is referenced in savingUser?

Check Constraints• Applied to complete implementation of 99.9% of simple

domains• May help performance because it gives the optimizer knowledge of

the data

• Part of the data structure – applied with ALTER TABLE

• Simple predicate implementation• If any column allows NULL, the expectation is that NULL is an

acceptable answer unless specifically coded for

• Hence, to fail CHECK condition, the answer must be FALSE (unlike WHERE clauses that succeed only when the result is TRUE)• 1=1 TRUE – Acceptable for WHERE or CHECK

• 1=NULL UNKNOWN – Succeeds for NULL Column CHECK CONSTRAINT ONLY

• 1=2 FALSE – Fails for both

Demo – Domains

Conditions

• Making sure that some condition is met reliably

• Examples• Row Modification Details

• Overlapping Ranges

• Big decisions here• Non-trivial to implement

• Feels natural to do it non-data tier code

• However non-data tier code:• Can be less reliable

• Can be greatly affected by concurrency

Tools

• Triggers• Instead of Trigger to Automatically Maintain Values

• After to validate complex conditions that must be constantly true

• SQL• Optimistic Locking to avoid heavy locking without lost updates

Demo – Protecting against Conditions

Performance Concerns…

• For most everything you will commonly need, it can be based on basic declarative integrity constraints

• By now, there will be some concern about performance

• Performance WILL be impacted, • Done well: almost negligible

• Done poorly: can lots of pain

• The next demo will do a non-scientific, single user job of showing the performance hit is noticeable, but not tremendous…

Demo –Performance

Summary

• Getting the structure correct is a great start towards data integrity

• Make sure column values are always within an acceptable tolerance so software doesn’t break

• Employ all of the tools SQL Server gives you to help ensure data integrity

• Use non-data tier software to ensure errors that return from the data tier are extremely rare

• The key word is: teamwork. You can’t do an adequate job of protecting data in the UI, Business/Object or Data tiers alone

Trust but verify

• Never stop testing the data, even into production

• Be vigilant• Test the structures to make sure constraints not

disabled and are trusted

• Test data that is not constrained in a 100% manner

• Use your slow periods wisely, running tests regularly

• Even 1 bad row that a customer notices means they may no longer trust the data…