Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
810 views
in Technique[技术] by (71.8m points)

sql server - Partitioning results in a running totals query

I'm looking for a fast way to create cumulative totals in a large SQL Server 2008 data set that partition by a particular column, potentially by using a multiple assignment variable solution. As a very basic example, I'd like to create the "cumulative_total" column below:

user_id | month | total | cumulative_total

1       | 1     | 2.0   | 2.0
1       | 2     | 1.0   | 3.0
1       | 3     | 3.5   | 8.5

2       | 1     | 0.5   | 0.5
2       | 2     | 1.5   | 2.0
2       | 3     | 2.0   | 4.0

We have traditionally done this with correlated subqueries, but over large amounts of data (200,000+ rows and several different categories of running total) this isn't giving us ideal performance.

I recently read about using multiple assignment variables for cumulative summing here:

http://sqlblog.com/blogs/paul_nielsen/archive/2007/12/06/cumulative-totals-screencast.aspx

In the example in that blog the cumulative variable solution looks like this:

UPDATE my_table
SET @CumulativeTotal=cumulative_total=@CumulativeTotal+ISNULL(total, 0)

This solution seems brilliantly fast for summing for a single user in the above example (user 1 or user 2). However, I need to effectively partition by user - give me the cumulative total by user by month.

Does anyone know of a way of extending the multiple assignment variable concept to solve this, or any other ideas other than correlated subqueries or cursors?

Many thanks for any tips.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

If you don't need to STORE the data (which you shouldn't, because you need to update the running totals any time any row is changed, added or deleted), and if you don't trust the quirky update (which you shouldn't, because it isn't guaranteed to work and its behavior could change with a hotfix, service pack, upgrade, or even an underlying index or statistics change), you can try this type of query at runtime. This is a method fellow MVP Hugo Kornelis coined "set-based iteration" (he posted something similar in one of his chapters of SQL Server MVP Deep Dives). Since running totals typically requires a cursor over the entire set, a quirky update over the entire set, or a single non-linear self-join that becomes more and more expensive as the row counts increase, the trick here is to loop through some finite element in the set (in this case, the "rank" of each row in terms of month, for each user - and you process only each rank once for all user/month combinations at that rank, so instead of looping through 200,000 rows, you loop up to 24 times).

DECLARE @t TABLE
(
  [user_id] INT, 
  [month] TINYINT,
  total DECIMAL(10,1), 
  RunningTotal DECIMAL(10,1), 
  Rnk INT
);

INSERT @t SELECT [user_id], [month], total, total, 
  RANK() OVER (PARTITION BY [user_id] ORDER BY [month]) 
  FROM dbo.my_table;

DECLARE @rnk INT = 1, @rc INT = 1;

WHILE @rc > 0
BEGIN
  SET @rnk += 1;

  UPDATE c SET RunningTotal = p.RunningTotal + c.total
    FROM @t AS c INNER JOIN @t AS p
    ON c.[user_id] = p.[user_id]
    AND p.rnk = @rnk - 1
    AND c.rnk = @rnk;

  SET @rc = @@ROWCOUNT;
END

SELECT [user_id], [month], total, RunningTotal
FROM @t
ORDER BY [user_id], rnk;

Results:

user_id  month   total   RunningTotal
-------  -----   -----   ------------
1        1       2.0     2.0
1        2       1.0     3.0
1        3       3.5     6.5 -- I think your calculation is off
2        1       0.5     0.5
2        2       1.5     2.0
2        3       2.0     4.0

Of course you can update the base table from this table variable, but why bother, since those stored values are only good until the next time the table is touched by any DML statement?

UPDATE mt
  SET cumulative_total = t.RunningTotal
  FROM dbo.my_table AS mt
  INNER JOIN @t AS t
  ON mt.[user_id] = t.[user_id]
  AND mt.[month] = t.[month];

Since we're not relying on implicit ordering of any kind, this is 100% supported and deserves a performance comparison relative to the unsupported quirky update. Even if it doesn't beat it but comes close, you should consider using it anyway IMHO.

As for the SQL Server 2012 solution, Matt mentions RANGE but since this method uses an on-disk spool you should also test with ROWS instead of just running with RANGE. Here is a quick example for your case:

SELECT
  [user_id],
  [month],
  total,
  RunningTotal = SUM(total) OVER 
  (
    PARTITION BY [user_id] 
    ORDER BY [month] ROWS UNBOUNDED PRECEDING
  )
FROM dbo.my_table
ORDER BY [user_id], [month];

Compare this with RANGE UNBOUNDED PRECEDING or no ROWSRANGE at all (which will also use the RANGE on-disk spool). The above will have lower overall duration and way less I/O, even though the plan looks slightly more complex (an additional sequence project operator).

I've recently published a blog post outlining some performance differences I observed for a specific running totals scenario:

http://www.sqlperformance.com/2012/07/t-sql-queries/running-totals


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...