sql – Getting the deltas of a custom aggregate with date ranges

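I need to find an efficient way to write a query that reports the deltas of an aggregate, along with the start and end date of each value.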

Requirements

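> The source table contains a start date, an end date, a category ID, a sub-category ID, and an indicator of whether the sub-category is active.
> The aggregate is taken over is_active per cat_id: as long as any sub-category has is_active = 1, the result of the function should be 1.
> If the aggregate function returns the same result for consecutive date ranges, those date ranges should be combined to reduce the result set.
> A given category/sub-category combination will never have overlapping dates, but different sub-categories may cross each other's boundaries.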
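
To make the requirements concrete, here is a minimal brute-force sketch in Python (not part of the original question). It expands every row into individual days, takes the maximum of is_active per day, and then merges consecutive days that share a value. The names `CAT_2`, `daily_max` and `pack` are hypothetical, and the rows are the cat_id = 2 rows from the sample data below. It is only meant as a reference for checking a faster query, not as an efficient solution.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)

# (start_dt, end_dt, is_active) for cat_id = 2, copied from the sample data
CAT_2 = [
    (date(2018, 1, 1), date(2018, 1, 15), 1),
    (date(2018, 1, 1), date(2018, 1, 31), 1),
    (date(2018, 1, 15), date(2018, 2, 10), 0),
    (date(2018, 2, 1), date(2018, 2, 14), 0),
    (date(2018, 2, 10), date(2018, 3, 15), 0),
    (date(2018, 2, 15), date(2018, 3, 30), 1),
    (date(2018, 3, 15), date(2018, 3, 30), 1),
    (date(2018, 4, 1), date(2018, 4, 30), 0),
]

def daily_max(rows):
    """Map every covered day to MAX(is_active) over the rows that cover it."""
    state = {}
    for start, end, is_active in rows:
        day = start
        while day <= end:
            state[day] = max(state.get(day, 0), is_active)
            day += ONE_DAY
    return state

def pack(state):
    """Merge adjacent days with the same value into (start, end, value) ranges."""
    ranges = []
    for day in sorted(state):
        value = state[day]
        if ranges and ranges[-1][2] == value and ranges[-1][1] + ONE_DAY == day:
            ranges[-1] = (ranges[-1][0], day, value)   # extend the current range
        else:
            ranges.append((day, day, value))           # value changed or gap in coverage
    return ranges

for start, end, value in pack(daily_max(CAT_2)):
    print(start, end, value)
```

Days that no row covers (2018-03-31 here) never enter `state`, so they split ranges rather than being reported as inactive.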

What I have tried

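I tried creating a CTE that generates all possible ranges for a category and then joining it to the main query, so that sub-categories spanning multiple ranges are split up. I then grouped by range and took MAX(is_active).

While this was a good start (all that was left at that point was to combine consecutive ranges with the same value), the query was very slow. I am not as familiar with Postgres as I am with other flavors of SQL, so I decided my time would be better spent reaching out to someone with more experience for help.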

Source data

+----+------------+------------+--------+------------+-----------+-----------------------------------------------------+
| id | start_dt   | end_dt     | cat_id | sub_cat_id | is_active | comment                                             |
+----+------------+------------+--------+------------+-----------+-----------------------------------------------------+
| 1  | 2018-01-01 | 2018-01-31 | 1      | 1001       | 1         | (null)                                              |
| 2  | 2018-02-01 | 2018-02-14 | 1      | 1001       | 0         | (null)                                              |
| 3  | 2018-02-15 | 2018-02-28 | 1      | 1001       | 0         | cat 1 is_active is unchanged despite new record.    |
| 4  | 2018-03-01 | 2018-03-30 | 1      | 1001       | 1         | (null)                                              |
| 5  | 2018-01-01 | 2018-01-15 | 2      | 2001       | 1         | (null)                                              |
| 6  | 2018-01-01 | 2018-01-31 | 2      | 2002       | 1         | (null)                                              |
| 7  | 2018-01-15 | 2018-02-10 | 2      | 2001       | 0         | cat 2 should still be active until 2002 is inactive |
| 8  | 2018-02-01 | 2018-02-14 | 2      | 2002       | 0         | cat 2 is inactive                                   |
| 9  | 2018-02-10 | 2018-03-15 | 2      | 2001       | 0         | this record will cause trouble                      |
| 10 | 2018-02-15 | 2018-03-30 | 2      | 2002       | 1         | cat 2 should be active again                        |
| 11 | 2018-03-15 | 2018-03-30 | 2      | 2001       | 1         | cat 2 is_active is unchanged despite new record.    |
| 12 | 2018-04-01 | 2018-04-30 | 2      | 2001       | 0         | cat 2 ends in a zero                                |
+----+------------+------------+--------+------------+-----------+-----------------------------------------------------+

Expected result

+------------+------------+--------+-----------+
| start_dt   | end_dt     | cat_id | is_active |
+------------+------------+--------+-----------+
| 2018-01-01 | 2018-01-31 | 1      | 1         |
| 2018-02-01 | 2018-02-28 | 1      | 0         |
| 2018-03-01 | 2018-03-30 | 1      | 1         |
| 2018-01-01 | 2018-01-31 | 2      | 1         |
| 2018-02-01 | 2018-02-14 | 2      | 0         |
| 2018-02-15 | 2018-03-30 | 2      | 1         |
| 2018-04-01 | 2018-04-30 | 2      | 0         |
+------------+------------+--------+-----------+

Here is a select statement that may help you write your own tests.

SELECT id,start_dt::date start_date,end_dt::date end_date,cat_id,sub_cat_id,is_active::int is_active,comment
FROM (VALUES 
    (1, '2018-01-01', '2018-01-31', 1, 1001, '1', null),
    (2, '2018-02-01', '2018-02-14', 1, 1001, '0', null),
    (3, '2018-02-15', '2018-02-28', 1, 1001, '0', 'cat 1 is_active is unchanged despite new record.'),
    (4, '2018-03-01', '2018-03-30', 1, 1001, '1', null),
    (5, '2018-01-01', '2018-01-15', 2, 2001, '1', null),
    (6, '2018-01-01', '2018-01-31', 2, 2002, '1', null),
    (7, '2018-01-15', '2018-02-10', 2, 2001, '0', 'cat 2 should still be active until 2002 is inactive'),
    (8, '2018-02-01', '2018-02-14', 2, 2002, '0', 'cat 2 is inactive'),
    (9, '2018-02-10', '2018-03-15', 2, 2001, '0', 'cat 2 is_active is unchanged despite new record.'),
    (10, '2018-02-15', '2018-03-30', 2, 2002, '1', 'cat 2 should be active again'),
    (11, '2018-03-15', '2018-03-30', 2, 2001, '1', 'cat 2 is_active is unchanged despite new record.'),
    (12, '2018-04-01', '2018-04-30', 2, 2001, '0', 'cat 2 ends in 0.')

) src ( "id","start_dt","end_dt","cat_id","sub_cat_id","is_active","comment" )

Best answer

WITH test AS (
    SELECT id, start_dt::date, end_dt::date, cat_id, sub_cat_id, is_active::int, comment  FROM ( VALUES 
        (1, '2018-01-01', '2018-01-31', 1, 1001, '1', null),
        (2, '2018-02-01', '2018-02-14', 1, 1001, '0', null),
        (3, '2018-02-15', '2018-02-28', 1, 1001, '0', 'cat 1 is_active is unchanged despite new record.'),
        (4, '2018-03-01', '2018-03-30', 1, 1001, '1', null),
        (5, '2018-01-01', '2018-01-15', 2, 2001, '1', null),
        (6, '2018-01-01', '2018-01-31', 2, 2002, '1', null),
        (7, '2018-01-15', '2018-02-10', 2, 2001, '0', 'cat 2 should still be active until 2002 is inactive'),
        (8, '2018-02-01', '2018-02-14', 2, 2002, '0', 'cat 2 is inactive'),
        (9, '2018-02-10', '2018-03-15', 2, 2001, '0', 'cat 2 is_active is unchanged despite new record.'),
        (10, '2018-02-15', '2018-03-30', 2, 2002, '1', 'cat 2 should be active again'),
        (11, '2018-03-15', '2018-03-30', 2, 2001, '1', 'cat 2 is_active is unchanged despite new record.'),
        (12, '2018-04-01', '2018-04-30', 2, 2001, '0', 'cat 2 ends in 0.')
        ) test (id, start_dt, end_dt, cat_id, sub_cat_id, is_active, comment) 
    )
SELECT cat_id, start_date, end_date, active_state
FROM (
    SELECT cat_id, date as start_date, lead(date-1) over w as end_date
        , active_state, prev_active
        , nonactive_state, prev_nonactive
    FROM (
        SELECT cat_id, date 
            , active_state, prev_active
            , nonactive_state
            , lag(nonactive_state, 1, 0) over w as prev_nonactive
        FROM (
            SELECT cat_id, date, active_state, lag(active_state, 1, 0) over w as prev_active
                , (nonactive_state > active_state)::int as nonactive_state
            FROM (
                SELECT DISTINCT ON (cat_id, date)
                    cat_id, date
                    , (CASE WHEN sum(type) over w > 0 THEN 1 ELSE 0 END) as active_state
                    , (CASE WHEN sum(nonactive_type) over w > 0 THEN 1 ELSE 0 END) as nonactive_state
                FROM (
                    SELECT start_dt as date
                        , 1 as type
                        , cat_id
                        , 0 as nonactive_type
                    FROM test
                    WHERE is_active = 1
                  UNION ALL
                    SELECT end_dt + 1 as date
                        , -1 as type
                        , cat_id
                        , 0 as nonactive_type
                    FROM test
                    WHERE is_active = 1
                  UNION ALL
                    SELECT start_dt as date
                        , 0 as type
                        , cat_id
                        , 1 as nonactive_type
                    FROM test
                    WHERE is_active = 0
                  UNION ALL
                    SELECT end_dt + 1 as date
                        , 0 as type
                        , cat_id
                        , -1 as nonactive_type
                    FROM test
                    WHERE is_active = 0
                ) t
                WINDOW w as (partition by cat_id order by date)
                ORDER BY cat_id, date
            ) t2
            WINDOW w as (partition by cat_id order by date)
        ) t3
        WINDOW w as (partition by cat_id order by date)
    ) t4
    WHERE (active_state != prev_active) OR (nonactive_state != prev_nonactive)
    WINDOW w as (partition by cat_id order by date)
    ) t5
WHERE active_state = 1 OR nonactive_state = 1
ORDER BY cat_id, start_date

yields

| cat_id | start_date |   end_date | active_state |
|--------+------------+------------+--------------|
|      1 | 2018-01-01 | 2018-01-31 |            1 |
|      1 | 2018-02-01 | 2018-02-28 |            0 |
|      1 | 2018-03-01 | 2018-03-30 |            1 |
|      2 | 2018-01-01 | 2018-01-31 |            1 |
|      2 | 2018-02-01 | 2018-02-14 |            0 |
|      2 | 2018-02-15 | 2018-03-30 |            1 |
|      2 | 2018-04-01 | 2018-04-30 |            0 |

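This combines the start_dt and end_dt dates into a single date column and
introduces a type column, which is 1 for a start date and -1 for an end date
(in the query, the -1 is placed on end_dt + 1, the first day after the interval).
Taking the running sum of type yields a value that is positive when
the corresponding date lies within a [start_dt, end_dt] interval, and 0
otherwise.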
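
The running-sum idea can be sketched outside SQL as well. The following Python snippet is an illustration only, not the answer's code: `covered_ranges` is a hypothetical helper that turns inclusive (start, end) intervals into +1/-1 events, sorts them by date, and reports the stretches where the running sum is positive. Events that fall on the same date are applied together, which plays the same role as the `DISTINCT ON (cat_id, date)` step in the query.

```python
from datetime import date, timedelta
from itertools import groupby

ONE_DAY = timedelta(days=1)

def covered_ranges(intervals):
    """Pack inclusive (start, end) date intervals using +1/-1 events and a running sum."""
    events = []
    for start, end in intervals:
        events.append((start, 1))            # interval opens on its start date
        events.append((end + ONE_DAY, -1))   # and closes on the day after its end date
    events.sort()

    packed, running, opened = [], 0, None
    for day, same_day in groupby(events, key=lambda event: event[0]):
        before = running
        running += sum(delta for _, delta in same_day)   # net change on this date
        if before == 0 and running > 0:
            opened = day                                  # coverage begins
        elif before > 0 and running == 0:
            packed.append((opened, day - ONE_DAY))        # coverage ended yesterday
    return packed

# The is_active = 1 rows of cat_id = 2 from the sample data
active_rows = [
    (date(2018, 1, 1), date(2018, 1, 15)),
    (date(2018, 1, 1), date(2018, 1, 31)),
    (date(2018, 2, 15), date(2018, 3, 30)),
    (date(2018, 3, 15), date(2018, 3, 30)),
]
print(covered_ranges(active_rows))
```

For these four rows the sketch reports two packed ranges, 2018-01-01 to 2018-01-31 and 2018-02-15 to 2018-03-30, which are the two active rows for cat_id = 2 in the expected result.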

This is one of the ideas presented in Itzik Ben-Gan's Packing Intervals,
but I first learned it from DSM (in the context of programming with Python/Pandas).

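Usually, when intervals are handled with the technique above, the intervals
define when a date is "on", and not being "on" automatically implies "off".
In this problem, however, it appears that
rows with is_active = 1 mean the final active_state is "on", but dates outside those intervals are not necessarily "off". 2018-03-31 is an example of a date that lies outside every
is_active = 1 interval and yet is not "off".
Similarly, rows with is_active = 0 mean the final active_state is "off", provided the date does not also intersect an interval with is_active = 1.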

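To handle these two different kinds of intervals, I applied the technique above (summing the 1 / -1 types) twice: once for the rows with is_active = 1 and once for the rows with is_active = 0.
This gives us a handle on the dates that are definitely in active_state ("on") and the dates that are definitely in nonactive_state ("off").
Since active trumps nonactive, the dates considered nonactive are then adjusted with: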

(nonactive_state > active_state)::int as nonactive_state

(That is, when active_state = 1 and nonactive_state = 1, the assignment above changes nonactive_state to 0.)
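
As a rough cross-check of the whole approach, the two running sums and the "active wins" rule can be sketched in Python. This is an illustration under assumptions, not a translation of the query: `category_states` is a hypothetical function that takes the (start_dt, end_dt, is_active) rows of a single category, keeps one counter for active rows and one for inactive rows, and starts a new output range whenever the combined state changes. Days covered by no row get the state `None` and are left out of the result.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)

def category_states(rows):
    """rows: (start_dt, end_dt, is_active) for one category -> [(start, end, state)]."""
    deltas = {}  # date -> [change in active count, change in inactive count]
    for start, end, is_active in rows:
        idx = 0 if is_active == 1 else 1
        deltas.setdefault(start, [0, 0])[idx] += 1
        deltas.setdefault(end + ONE_DAY, [0, 0])[idx] -= 1

    out, active_n, inactive_n, prev = [], 0, 0, None
    for day in sorted(deltas):
        active_n += deltas[day][0]
        inactive_n += deltas[day][1]
        active = active_n > 0
        nonactive = inactive_n > 0 and not active      # active wins over nonactive
        state = 1 if active else (0 if nonactive else None)
        if state != prev:
            if prev is not None:                        # close the range that just ended
                out[-1] = (out[-1][0], day - ONE_DAY, prev)
            if state is not None:                       # open a new range
                out.append((day, None, state))
            prev = state
    return out

# cat_id = 1 rows from the sample data
cat_1 = [
    (date(2018, 1, 1), date(2018, 1, 31), 1),
    (date(2018, 2, 1), date(2018, 2, 14), 0),
    (date(2018, 2, 15), date(2018, 2, 28), 0),
    (date(2018, 3, 1), date(2018, 3, 30), 1),
]
for row in category_states(cat_1):
    print(row)
```

Because every interval contributes a closing event on end_dt + 1, the last open range is always closed by the time the loop finishes, so no `None` end date remains in the output.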
