I am trying to solve the following problem with the "Apache Spark Built-in SQL API" first, or with pyspark-sql if that is less resource intensive.
Context: I have a set of files that are printed according to a schedule (not all files are printed every day), and each printed part (aka printed_id) receives a status (aka printed_status) that can be A, B, or C.
These events are recorded in a table like this one (but with millions of records):
| file | printed_id | printed_date | printed_status |
|-------|------------|--------------|----------------|
| file1 | 100 | 2020-07-01 | A | <<
| file1 | 190 | 2020-07-02 | A |
| file1 | 302 | 2020-07-06 | B | <<
| file1 | 343 | 2020-07-09 | B |
| file1 | 464 | 2020-07-10 | A | <<
| file1 | 523 | 2020-07-20 | A |
| file1 | 567 | 2020-07-25 | A |
| file1 | 578 | 2020-07-28 | C | <<
| file1 | 670 | 2020-07-29 | C |
| file1 | 691 | 2020-07-30 | C |
I would like to keep only the rows where a status first appears after a change (regardless of what the value is). These are the rows marked with << in the table above. (My final goal is then to compute the timespan between each status change.)
My expected output would be (including the row marked <<):
| file | printed_status | status_first_occurence_on |
|-------|----------------|---------------------------|
| file1 | A | 2020-07-01 |
| file1 | B | 2020-07-06 |
| file1 | A | 2020-07-10 | <<
| file1 | C | 2020-07-28 |
And not the output obtained with `SELECT file, printed_status, MIN(printed_date) AS status_first_occurence_on FROM myTable GROUP BY file, printed_status`:
| file | printed_status | status_first_occurence_on |
|-------|----------------|---------------------------|
| file1 | A | 2020-07-01 |
| file1 | B | 2020-07-06 |
| file1 | C | 2020-07-28 |
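
To make the difference between the two outputs concrete, here is a minimal sketch of one possible formulation: compare each row's status with the previous row's status for the same file using `LAG()`, and keep the row only when they differ. The query uses standard window-function syntax that Spark SQL also accepts. Python's `sqlite3` is used here purely as a stand-in engine so the example runs without a cluster, and using `printed_id` as a tie-breaker for rows sharing the same date is an assumption, not something stated in the question.

```python
import sqlite3

# sqlite3 is only a stand-in engine (assumption: SQLite >= 3.25 for window functions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myTable (file TEXT, printed_id INTEGER, "
    "printed_date TEXT, printed_status TEXT)"
)
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?, ?)",
    [
        ("file1", 100, "2020-07-01", "A"),
        ("file1", 190, "2020-07-02", "A"),
        ("file1", 302, "2020-07-06", "B"),
        ("file1", 343, "2020-07-09", "B"),
        ("file1", 464, "2020-07-10", "A"),
        ("file1", 523, "2020-07-20", "A"),
        ("file1", 567, "2020-07-25", "A"),
        ("file1", 578, "2020-07-28", "C"),
        ("file1", 670, "2020-07-29", "C"),
        ("file1", 691, "2020-07-30", "C"),
    ],
)

CHANGE_ROWS_SQL = """
SELECT file, printed_status, printed_date AS status_first_occurence_on
FROM (
    SELECT file, printed_status, printed_date,
           -- status of the previous print of the same file;
           -- printed_id as tie-breaker is an assumption
           LAG(printed_status) OVER (
               PARTITION BY file ORDER BY printed_date, printed_id
           ) AS prev_status
    FROM myTable
) t
-- keep the first row of each file and every row whose status differs from the previous one
WHERE prev_status IS NULL OR prev_status <> printed_status
ORDER BY file, status_first_occurence_on
"""

change_rows = conn.execute(CHANGE_ROWS_SQL).fetchall()
for row in change_rows:
    print(row)
```

On the sample table this keeps the second A (2020-07-10), which the `GROUP BY` query collapses into the first one.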
I read the question sql-first-date-for-the-first-occurrence-of-a-value; it is close but not exactly applicable to my case.
Since I work on a very large dataset, I am very concerned about computation cost. I am afraid that applying the Row_count() function on two iterations of the same huge table and then joining them will cost a lot of computation resources. But I could be wrong, as I don't know how computation cost works.
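
Regarding the cost concern, a window-based formulation needs no self-join: the table is read once, and in Spark a window partitioned by `file` typically costs one shuffle by that key plus a sort within each partition. The sketch below illustrates the idea and extends it to the final goal (the timespan between status changes) by applying `LEAD()` to the change rows. It is only a sketch under assumptions: `sqlite3` stands in for Spark so the example is runnable locally, `next_change_on` and `spans` are illustrative names, and `printed_id` as a tie-breaker is an assumption. In Spark SQL the day count could be computed in the query itself with `datediff(next_change_on, status_first_occurence_on)` rather than in Python.

```python
import sqlite3
from datetime import date

# sqlite3 is only a stand-in engine (assumption: SQLite >= 3.25 for window functions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myTable (file TEXT, printed_id INTEGER, "
    "printed_date TEXT, printed_status TEXT)"
)
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?, ?)",
    [
        ("file1", 100, "2020-07-01", "A"), ("file1", 190, "2020-07-02", "A"),
        ("file1", 302, "2020-07-06", "B"), ("file1", 343, "2020-07-09", "B"),
        ("file1", 464, "2020-07-10", "A"), ("file1", 523, "2020-07-20", "A"),
        ("file1", 567, "2020-07-25", "A"), ("file1", 578, "2020-07-28", "C"),
        ("file1", 670, "2020-07-29", "C"), ("file1", 691, "2020-07-30", "C"),
    ],
)

TIMESPAN_SQL = """
WITH flagged AS (
    SELECT file, printed_id, printed_date, printed_status,
           LAG(printed_status) OVER (
               PARTITION BY file ORDER BY printed_date, printed_id
           ) AS prev_status
    FROM myTable
),
changes AS (
    -- rows where the status differs from the previous row of the same file
    SELECT file, printed_id, printed_date, printed_status
    FROM flagged
    WHERE prev_status IS NULL OR prev_status <> printed_status
)
SELECT file, printed_status,
       printed_date AS status_first_occurence_on,
       -- date of the next status change; NULL for the last status of a file
       LEAD(printed_date) OVER (
           PARTITION BY file ORDER BY printed_date, printed_id
       ) AS next_change_on
FROM changes
ORDER BY file, status_first_occurence_on
"""

# Compute the number of days each status lasted (None while the status is still current).
spans = []
for file, status, start, end in conn.execute(TIMESPAN_SQL):
    days = (date.fromisoformat(end) - date.fromisoformat(start)).days if end else None
    spans.append((file, status, start, days))

for span in spans:
    print(span)
```

Both window steps use the same partitioning and ordering, so no join between two copies of the table is involved.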
Thank you for your help and answers.
question from:
https://stackoverflow.com/questions/65859377/sql-get-date-of-when-a-status-change-considering-same-status-can-appears-many