Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
121 views
in Technique[技术] by (71.8m points)

python - Postgres - how to return rows with 0 count for missing data?

I have unevenly distributed data(wrt date) for a few years (2003-2008). I want to query data for a given set of start and end date, grouping the data by any of the supported intervals (day, week, month, quarter, year) in PostgreSQL 8.3 (http://www.postgresql.org/docs/8.3/static/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC).

The problem is that some of the queries give results continuous over the required period, as this one:

select to_char(date_trunc('month',date), 'YYYY-MM-DD'),count(distinct post_id) 
from some_table where category_id=1 and entity_id = 77  and entity2_id = 115 
and date <= '2008-12-06' and date >= '2007-12-01' group by 
date_trunc('month',date) order by date_trunc('month',date);
          to_char   | count 
        ------------+-------
         2007-12-01 |    64
         2008-01-01 |    31
         2008-02-01 |    14
         2008-03-01 |    21
         2008-04-01 |    28
         2008-05-01 |    44
         2008-06-01 |   100
         2008-07-01 |    72
         2008-08-01 |    91
         2008-09-01 |    92
         2008-10-01 |    79
         2008-11-01 |    65
        (12 rows)

but some of them miss some intervals because there is no data present, as this one:

select to_char(date_trunc('month',date), 'YYYY-MM-DD'),count(distinct post_id) 
from some_table where category_id=1 and entity_id = 75  and entity2_id = 115 
and date <= '2008-12-06' and date >= '2007-12-01' group by 
date_trunc('month',date) order by date_trunc('month',date);

        to_char   | count 
    ------------+-------

     2007-12-01 |     2
     2008-01-01 |     2
     2008-03-01 |     1
     2008-04-01 |     2
     2008-06-01 |     1
     2008-08-01 |     3
     2008-10-01 |     2
    (7 rows)

where the required resultset is:

  to_char   | count 
------------+-------
 2007-12-01 |     2
 2008-01-01 |     2
 2008-02-01 |     0
 2008-03-01 |     1
 2008-04-01 |     2
 2008-05-01 |     0
 2008-06-01 |     1
 2008-07-01 |     0
 2008-08-01 |     3
 2008-09-01 |     0
 2008-10-01 |     2
 2008-11-01 |     0
(12 rows)

A count of 0 for missing entries.

I have seen earlier discussions on Stack Overflow but they don't solve my problem it seems, since my grouping period is one of (day, week, month, quarter, year) and decided on runtime by the application. So an approach like left join with a calendar table or sequence table will not help I guess.

My current solution to this is to fill in these gaps in Python (in a Turbogears App) using the calendar module.

Is there a better way to do this.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

This question is old. But since fellow users picked it as master for a new duplicate I am adding a proper answer.

Proper solution

SELECT *
FROM  (
   SELECT day::date
   FROM   generate_series(timestamp '2007-12-01'
                        , timestamp '2008-12-01'
                        , interval  '1 month') day
   ) d
LEFT   JOIN (
   SELECT date_trunc('month', date_col)::date AS day
        , count(*) AS some_count
   FROM   tbl
   WHERE  date_col >= date '2007-12-01'
   AND    date_col <= date '2008-12-06'
-- AND    ... more conditions
   GROUP  BY 1
   ) t USING (day)
ORDER  BY day;
  • Use LEFT JOIN, of course.

  • generate_series() can produce a table of timestamps on the fly, and very fast.

  • It's generally faster to aggregate before you join. I recently provided a test case on sqlfiddle.com in this related answer:

  • Cast the timestamp to date (::date) for a basic format. For more use to_char().

  • GROUP BY 1 is syntax shorthand to reference the first output column. Could be GROUP BY day as well, but that might conflict with an existing column of the same name. Or GROUP BY date_trunc('month', date_col)::date but that's too long for my taste.

  • Works with the available interval arguments for date_trunc().

  • count() never produces NULL (0 for no rows), but the LEFT JOIN does.
    To return 0 instead of NULL in the outer SELECT, use COALESCE(some_count, 0) AS some_count. The manual.

  • For a more generic solution or arbitrary time intervals consider this closely related answer:


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...