The SQL Server optimizer does contain logic to remove redundant joins, but there are restrictions, and the joins have to be provably redundant. To summarize, a join can have four effects:
- It can add extra columns (from the joined table)
- It can add extra rows (the joined table may match a source row more than once)
- It can remove rows (the joined table may not have a match)
- It can introduce
NULL
s (for a RIGHT
or FULL JOIN
)
To successfully remove a redundant join, the query (or view) must account for all four possibilities. When this is done, correctly, the effect can be astonishing. For example:
USE AdventureWorks2012;
GO
CREATE VIEW dbo.ComplexView
AS
SELECT
pc.ProductCategoryID, pc.Name AS CatName,
ps.ProductSubcategoryID, ps.Name AS SubCatName,
p.ProductID, p.Name AS ProductName,
p.Color, p.ListPrice, p.ReorderPoint,
pm.Name AS ModelName, pm.ModifiedDate
FROM Production.ProductCategory AS pc
FULL JOIN Production.ProductSubcategory AS ps ON
ps.ProductCategoryID = pc.ProductCategoryID
FULL JOIN Production.Product AS p ON
p.ProductSubcategoryID = ps.ProductSubcategoryID
FULL JOIN Production.ProductModel AS pm ON
pm.ProductModelID = p.ProductModelID
The optimizer can successfully simplify the following query:
SELECT
c.ProductID,
c.ProductName
FROM dbo.ComplexView AS c
WHERE
c.ProductName LIKE N'G%';
To:
Rob Farley wrote about these ideas in depth in the original MVP Deep Dives book, and there is a recording of him presenting on the topic at SQLBits.
The main restrictions are that foreign key relationships must be based on a single key to contribute to the simplification process, and compilation time for the queries against such a view may become quite long particularly as the number of joins increases. It could be quite a challenge to write a 100-table view that gets all the semantics exactly correct. I would be inclined to find an alternative solution, perhaps using dynamic SQL.
That said, the particular qualities of your denormalized table may mean the view is quite simple to assemble, requiring only enforced FOREIGN KEYs
non-NULL
able referenced columns, and appropriate UNIQUE
constraints to make this solution work as you would hope, without the overhead of 100 physical join operators in the plan.
Example
Using ten tables rather than a hundred:
-- Referenced tables
CREATE TABLE dbo.Ref01 (col01 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref02 (col02 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref03 (col03 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref04 (col04 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref05 (col05 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref06 (col06 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref07 (col07 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref08 (col08 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref09 (col09 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref10 (col10 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
The parent table definition (with page-compression):
CREATE TABLE dbo.Normalized
(
pk integer IDENTITY NOT NULL,
col01 tinyint NOT NULL REFERENCES dbo.Ref01,
col02 tinyint NOT NULL REFERENCES dbo.Ref02,
col03 tinyint NOT NULL REFERENCES dbo.Ref03,
col04 tinyint NOT NULL REFERENCES dbo.Ref04,
col05 tinyint NOT NULL REFERENCES dbo.Ref05,
col06 tinyint NOT NULL REFERENCES dbo.Ref06,
col07 tinyint NOT NULL REFERENCES dbo.Ref07,
col08 tinyint NOT NULL REFERENCES dbo.Ref08,
col09 tinyint NOT NULL REFERENCES dbo.Ref09,
col10 tinyint NOT NULL REFERENCES dbo.Ref10,
CONSTRAINT PK_Normalized
PRIMARY KEY CLUSTERED (pk)
WITH (DATA_COMPRESSION = PAGE)
);
The view:
CREATE VIEW dbo.Denormalized
WITH SCHEMABINDING AS
SELECT
item01 = r01.item,
item02 = r02.item,
item03 = r03.item,
item04 = r04.item,
item05 = r05.item,
item06 = r06.item,
item07 = r07.item,
item08 = r08.item,
item09 = r09.item,
item10 = r10.item
FROM dbo.Normalized AS n
JOIN dbo.Ref01 AS r01 ON r01.col01 = n.col01
JOIN dbo.Ref02 AS r02 ON r02.col02 = n.col02
JOIN dbo.Ref03 AS r03 ON r03.col03 = n.col03
JOIN dbo.Ref04 AS r04 ON r04.col04 = n.col04
JOIN dbo.Ref05 AS r05 ON r05.col05 = n.col05
JOIN dbo.Ref06 AS r06 ON r06.col06 = n.col06
JOIN dbo.Ref07 AS r07 ON r07.col07 = n.col07
JOIN dbo.Ref08 AS r08 ON r08.col08 = n.col08
JOIN dbo.Ref09 AS r09 ON r09.col09 = n.col09
JOIN dbo.Ref10 AS r10 ON r10.col10 = n.col10;
Hack the statistics to make the optimizer think the table is very large:
UPDATE STATISTICS dbo.Normalized WITH ROWCOUNT = 100000000, PAGECOUNT = 5000000;
Example user query:
SELECT
d.item06,
d.item07
FROM dbo.Denormalized AS d
WHERE
d.item08 = 'Banana'
AND d.item01 = 'Green';
Gives us this execution plan:
The scan of the Normalized table looks bad, but both Bloom-filter bitmaps are applied during the scan by the storage engine (so rows that cannot match do not even surface as far as the query processor). This may be enough to give acceptable performance in your case, and certainly better than scanning the original table with its overflowing columns.
If you are able to upgrade to SQL Server 2012 Enterprise at some stage, you have another option: creating a column-store index on the Normalized table:
CREATE NONCLUSTERED COLUMNSTORE INDEX cs
ON dbo.Normalized (col01,col02,col03,col04,col05,col06,col07,col08,col09,col10);
The execution plan is:
That probably looks worse to you, but column storage provides exceptional compression, and the whole execution plan runs in Batch Mode with filters for all the contributing columns. If the server has adequate threads and memory available, this alternative could really fly.
Ultimately, I'm not sure this normalization is the correct approach considering the number of tables and the chances of getting a poor execution plan or requiring excessive compilation time. I would probably correct the schema of the denormalized table first (proper data types and so on), possibly apply data compression...the usual things.
If the data truly belongs in a star-schema, it probably needs more design work than just splitting off repeating data elements into separate tables.