Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
131 views
in Technique[技术] by (71.8m points)

google bigquery - Making pairs of rows without repeating values with given allocation window using SQL

I have two different data sources with events (say, backend and frontend events). An event can be reported by only the first, only the second or by both sources. I'm trying to find an approach to combine these two sources into one, where all events will be reported once. Also, I don't want to lose events.

I have no identifiers, which I could use to join these sources. Instead, I have only event type, event datetime and a time window which I could use to join these events. The hard part begins when I have several events from both sides, catched by the same window - say, three events from source 'A' and two events from source 'B'. I don't know, how these events should be combined with each other, but it's not the issue - I want them to be combined pairwise without repetitions, when one event from source 'A' matches only one event from source 'B', and vice versa. And it's desirable (but not obligatory) to combine closest events first.

And I do this in BigQuery, so I can't use recursive queries.

Here is an example (note that I don't know actual true_parent value):

with raw_data as (
    SELECT 1 source_number, 1 dt, '1.1' name, 1 event_type, null true_parent
    union all select 1, 60, '1.2', 1, null
    union all select 1, 69, '1.3', 1, null
    union all select 2, 0, '2.1', 1, '1.1'
    union all select 2, 0, '2.2', 1, null
    union all select 2, 2, '2.3', 1, '1.2'
    union all select 2, 2, '2.4', 1, null
    union all select 2, 69, '2.5', 1, '1.3'
        
    union all select 1, 60, '1.1', 2, null
    union all select 1, 60, '1.2', 2, null
    union all select 1, 69, '1.3', 2, null
    union all select 2, 0, '2.1', 2, '1.1'
    union all select 2, 0, '2.2', 2, '1.2'


    union all select 1, 0, '1.1', 3, null
    union all select 1, 1, '1.2', 3, null
    union all select 1, 2, '1.3', 3, null
    union all select 2, 101, '2.1', 3, '1.3'
    union all select 2, 0, '2.2', 3, '1.1'
    union all select 2, 3, '2.3', 3, '1.2'

    union all select 1, 1, '1.1', 4, null 
    union all select 1, 100, '1.2', 4, null 
    union all select 1, 200, '1.3', 4, null 
    union all select 2, 5, '2.1', 4, '1.1' 
    union all select 2, 15, '2.2', 4, '1.2' 
    union all select 2, 102, '2.3', 4, '1.3' 
)

, windows as (
    select 1 source_number, 20 time_window
    union all select 2, 80
)

, dat as (
    select 
        *
    from raw_data
    left join windows using(source_number)
)

, parent_selection as (
select 
    c.event_type,
    c.name,
    c.source_number,
    c.dt,
    p.name parent,
    c.true_parent 
from dat c
    left join dat p
    on c.event_type = p.event_type
    and c.source_number > p.source_number
    and ABS(c.dt - p.dt) <= c.time_window + p.time_window
)

select distinct
    * 
    except (true_parent) 
    replace(case when true_parent is null then name else parent end as parent) 
from parent_selection 
where true_parent = parent or true_parent is null
order by event_type, parent, name

I used this child-parent abstraction, because it's handy to group by parent in next steps, but I also will appreciate any other abstraction, which could be used to make this pairwise connections.

I just want an algorithm to replace the last part of the query, as I don't know actual true_parent value.

Output:

event_type  name    source_number   dt  parent
1           1.1     1               1   1.1
1           2.1     2               0   1.1
1           1.2     1               60  1.2
1           2.3     2               2   1.2
1           1.3     1               69  1.3
1           2.5     2               69  1.3
1           2.2     2               0   2.2
1           2.4     2               2   2.4
2           1.1     1               60  1.1
2           2.1     2               0   1.1
2           1.2     1               60  1.2
2           2.2     2               0   1.2
2           1.3     1               69  1.3
3           1.1     1               0   1.1
3           2.2     2               0   1.1
3           1.2     1               1   1.2
3           2.3     2               3   1.2
3           1.3     1               2   1.3
3           2.1     2               101 1.3
4           1.1     1               1   1.1
4           2.1     2               5   1.1
4           1.2     1               100 1.2
4           2.2     2               15  1.2
4           1.3     1               200 1.3
4           2.3     2               102 1.3

Explanation:

In event_type 1 1.1 should be combined with 2.1 or 2.2, and 1.2 - with 2.3 or 2.4, and 2.5 - with 2.3 according to closest dt value. I don't care if 1.1 will be combined with 2.1 or 2.2, but if one of them was added to the pair, second one shouldn't.

In event_type 2 1.1 and 1.2 should be combined with 2.1 or 2.2, order doesn't matter. 2.3 wouldn't be combined with any other event.

In event_type 3 2.1 can be combined only with 1.2 or 1.3, but not with 1.1, because 1.1 doesn't fit its time window. So, 2.1 is combined with 1.3 as it's closer then 1.2. Remaining 2.2 and 2.3 can be combined with 1.1 and 1.2, but not with 1.3, because it was already occupied by 2.1.

question from:https://stackoverflow.com/questions/65927420/making-pairs-of-rows-without-repeating-values-with-given-allocation-window-using

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I've finally found a satisfying solution. The idea is to zip join two sources, ordered by datetime - but it still should meet the limitation of time window. So I made this steps to achieve all the goals:

  1. Calculate session numbers for each source
  2. Shift every parent's session if no parents found
  3. Shift all downstream sessions accordingly
  4. Adjust children's sessions to parent's
  5. Group each session on parent

It still fails on event_type = 6, but this can be fixed by time_window tweaking.

with raw_data as (
SELECT 1 source_number, 1 dt, '1.1' name, 1 event_type
union all select 2, 0, '2.1', 1
union all select 2, 1, '2.2', 1 
union all select 2, 69, '2.5', 1 
    
union all select 3, 60, '3.1', 1
union all select 3, 60, '3.2', 1 
union all select 3, 69, '3.3', 1 

union all select 1, 1, '1.1', 2
union all select 1, 100, '1.2', 2
union all select 2, 100, '2.1', 2
union all select 2, 200, '2.2', 2
union all select 2, 202, '2.3', 2

union all select 3, 1, '3.1', 2
union all select 3, 10, '3.2', 2
union all select 3, 100, '3.3', 2
union all select 4, 5, '4.1', 2
union all select 4, 15, '4.2', 2
union all select 4, 200, '4.3', 2


union all select 5, 1, '5.1', 2
union all select 5, 5, '5.2', 2
union all select 5, 15, '5.3', 2
union all select 5, 99, '5.4', 2
union all select 5, 100, '5.5', 2
union all select 5, 101, '5.6', 2

union all select 6, 50, '6.1', 2
union all select 6, 140, '6.2', 2
union all select 6, 200, '6.3', 2
union all select 6, 290, '6.4', 2

union all select 7, 50, '7.1', 2
union all select 7, 200, '7.2', 2
union all select 7, 210, '7.3', 2
union all select 7, 1000, '7.4', 2

union all select 1, 1, '1.1', 6
union all select 2, 55, '2.1', 6
union all select 3, 85, '3.1', 6
union all select 4, 255, '4.1', 6

union all select 1, 1, '1.1', 7
union all select 1, 1000, '1.2', 7
union all select 2, 1001, '2.1', 7
union all select 3, 1020, '3.1', 7
union all select 4, 1030, '4.1', 7

)

, windows as (
    select 1 source_number, 0 time_window
    union all select 2, 60
    union all select 3, 60
    union all select 4, 120
    union all select 5, 120
    union all select 6, 150
    union all select 7, 500
)

, dat as (
    select 
        *
    from raw_data
    left join windows using(source_number)
)

, sessions as (
    select 
        *, 
        row_number() over(partition by event_type, source_number order by dt) session
    from dat
)

, calc_parent_shift as (
    select
        a.*,
        case 
            when count(b.session) = 0 
            then greatest(countif(count(b.session) = 0) over (w_session) - 1, 0) 
            else 0
        end as parent_shift
    from sessions a
        left join sessions b
        on a.event_type = b.event_type
        and a.source_number > b.source_number
        and a.session <= b.session
        and ABS(a.dt - b.dt) <= a.time_window + b.time_window
    group by 1, 2, 3, 4, 5, 6
    window w_session as (
        partition by a.event_type, a.session order by a.dt
    )
)

, shift_parent_session as (
    select 
        * except(session),
        session + max(parent_shift) over (w_shift) as session,
        session as old_session
    from calc_parent_shift
    window w_shift as (
        partition by event_type, source_number
        order by session
    )
)

, shift_child_session as (
    select
        a.* except(session),
        ifnull(array_agg(b.session order by b.source_number, b.session)[offset(0)] - a.old_session, 0) as child_shift,
        a.session + greatest ( 
              max(ifnull(array_agg(b.session order by b.source_number, b.session)[offset(0)] - a.old_session, 0)) over (w) 
            - max(a.parent_shift) over (w)
            , 0
        ) as session
    from shift_parent_session a
        left join shift_parent_session b
        on a.event_type = b.event_type
        and a.source_number > b.source_number
        and a.session <= b.session
        and ABS(a.dt - b.dt) <= a.time_window + b.time_window
    group by 1, 2, 3, 4, 5, 6, 7, a.session
    window w as (
        partition by a.event_type, a.source_number
        order by a.session
    )
)

, session_groups as (
    select 
        a.* except (parent_shift, child_shift, old_session),
        min(b.source_number) parent_source_number
    from shift_child_session a
        left join shift_child_session b
        on a.event_type = b.event_type
        and a.session = b.session
        and ABS(a.dt - b.dt) <= a.time_window + b.time_window
        and a.source_number >= b.source_number
    group by 1, 2, 3, 4, 5, 6
)

, result as (
    select 
        event_type,
        session,
        parent_source_number,
        array_agg(source_number) source_number,
        array_agg(dt) dt,
        array_agg(name) name
    from session_groups
    group by 1, 2, 3
    order by 1, 2, 3
)

select * from result
# where event_type = 6
order by event_type, session

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...