Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
312 views
in Technique[技术] by (71.8m points)

apache spark - SQL get date of "when a status change" (considering same status can appears many times)

I try to solves following action in "Apach Spark Built-in SQL API" first or pyspark-sql if less ressource intensive.

Contexte: I have a set of files that are printed in accordance with a planning (not all files are printed every days) and printed parts (aka printed_id) receives a status (aka printed_status) that can be A, B, C.

These events are recorded in a table like (but with million records):

|  file | printed_id | printed_date | printed_status |
|-------|------------|--------------|----------------|
| file1 |        100 |   2020-07-01 |              A | <<
| file1 |        190 |   2020-07-02 |              A |
| file1 |        302 |   2020-07-06 |              B | <<
| file1 |        343 |   2020-07-09 |              B |
| file1 |        464 |   2020-07-10 |              A | <<
| file1 |        523 |   2020-07-20 |              A |
| file1 |        567 |   2020-07-25 |              A |
| file1 |        578 |   2020-07-28 |              C | <<
| file1 |        670 |   2020-07-29 |              C |
| file1 |        691 |   2020-07-30 |              C |

I would like to keep only rows where a status appears (regardsless of what the value is). It corresponds to rows marked with << in the table above. (my finla goal is then to compute timespan between each status change).

My expected output should be (including the << row):

|  file | printed_status | status_first_occurence_on |
|-------|----------------|---------------------------|
| file1 |              A |                2020-07-01 |
| file1 |              B |                2020-07-06 |
| file1 |              A |                2020-07-10 | <<
| file1 |              C |                2020-07-28 |

And not the output obtained with SELECT file, printed_status, MIN(printed_date) AS status_first_occurence_on FROM myTable GROUP BY file, printed_status:

|  file | printed_status | status_first_occurence_on |
|-------|----------------|---------------------------|
| file1 |              A |                2020-07-01 |
| file1 |              B |                2020-07-06 |
| file1 |              C |                2020-07-28 |

I read the sql-first-date-for-the-first-occurrence-of-a-value, it's close but not exactly applicable in my case.

And as I work on very huge dataset I'm very concerns about computation cost. I'm affraid that applying Row_count() function on tow iterations of the same huge table then to join them will cost a lot of computation ressource. But I could be wrong as I don't know how computation cost wroks.

Thank you for your help and answers.

question from:https://stackoverflow.com/questions/65859377/sql-get-date-of-when-a-status-change-considering-same-status-can-appears-many

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You can use lag to compare the printed status in the current row with the previous row. If they are the same, you can flag it up, and then filter the rows that are not flagged.

select file, printed_id, printed_date, printed_status
from (
    select *,
        printed_status <=> lag(printed_status) 
            over(partition by file order by printed_date) flag
    from mytable
) where not flag;

+-----+----------+------------+--------------+
| file|printed_id|printed_date|printed_status|
+-----+----------+------------+--------------+
|file1|       100|  2020-07-01|             A|
|file1|       302|  2020-07-06|             B|
|file1|       464|  2020-07-10|             A|
|file1|       578|  2020-07-28|             C|
+-----+----------+------------+--------------+

(<=> is a null-safe equal operator, which returns false if one of the compared items is null. That's useful when lag returns null for first row in each partition, because we want the flag to be false rather than null.)


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...