I have a folder in HDFS, let's call it /data/users/
Inside that folder, a new CSV file is added every 10 days. Each new file contains only the currently active users, so, for example:
- file_01Jan2020.csv: contains data for 1000 users who are currently active
- file_10Jan2020.csv: contains data for 950 users who are currently active (the same data as in file_01Jan2020.csv, but with 50 fewer records)
- file_20Jan2020.csv: contains data for 920 users who are currently active (the same data as in file_10Jan2020.csv, but with 30 fewer records)
In reality, these files are much bigger (~8 million records per file, decreasing by maybe 1K every 10 days). Also, a newer file will never contain records that don't exist in the older files; it will just have fewer of them.
I want to create a table in hive using the data in this folder. What I am doing now is:
- Create an external table from the data in the folder /data/users/
- Create an internal table with the same structure
- Write the data from the external table into the internal table, where:
- Duplicates are removed
- If a record doesn't exist in one of the newer files, I mark it as 'deleted' by setting a new 'deleted' column that I defined in the internal table (roughly as in the sketch below)
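For concreteness, here is a rough HiveQL sketch of that workflow. The table names, the column list (user_id, name), the storage formats, the boolean 'deleted' flag, and the use of Hive's INPUT__FILE__NAME virtual column to identify the newest file are all placeholders I picked for illustration, not my actual schema:

```sql
-- 1. External table over the raw CSV files in /data/users/
--    (user_id/name are placeholder columns, not the real schema)
CREATE EXTERNAL TABLE users_ext (
  user_id BIGINT,
  name    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/users/';

-- 2. Internal (managed) table with the same columns plus a 'deleted' flag
CREATE TABLE users_internal (
  user_id BIGINT,
  name    STRING,
  deleted BOOLEAN
)
STORED AS ORC;

-- 3. Deduplicate and flag the users that are missing from the newest file
INSERT OVERWRITE TABLE users_internal
SELECT
  d.user_id,
  d.name,
  -- a user absent from the newest file is flagged as deleted
  CASE WHEN n.user_id IS NULL THEN true ELSE false END AS deleted
FROM (
  -- deduplicate: the same record repeats across files
  SELECT DISTINCT user_id, name
  FROM users_ext
) d
LEFT JOIN (
  -- users present in the newest file, identified via the
  -- INPUT__FILE__NAME virtual column (my assumption for "newest")
  SELECT DISTINCT user_id
  FROM users_ext
  WHERE INPUT__FILE__NAME LIKE '%file_20Jan2020.csv'
) n
ON d.user_id = n.user_id;
```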
I am concerned about the step where I create the external table: since the data is really big, that table will become huge after some time, and I was wondering whether there is a more efficient way of doing this instead of loading all the files in the folder every time.
So my question is: what is the best way to ingest data from an HDFS folder into a Hive table, given that the folder contains lots of files with lots of duplication?
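For illustration only, here is a rough sketch of one variation I could imagine (not something I have in place today): each new snapshot lands in its own dated subfolder so Hive can register it as a partition, and a run then only scans the newest partition instead of every file. The load_date partition column, the subfolder layout, and the date format are assumptions:

```sql
-- Assumed variation: each new file is dropped into a dated subfolder,
-- e.g. /data/users/load_date=20Jan2020/file_20Jan2020.csv, so that each
-- snapshot becomes a Hive partition (this layout is an assumption).
CREATE EXTERNAL TABLE users_ext_part (
  user_id BIGINT,
  name    STRING
)
PARTITIONED BY (load_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/users/';

-- Register a newly arrived snapshot (MSCK REPAIR TABLE would also work)
ALTER TABLE users_ext_part
  ADD IF NOT EXISTS PARTITION (load_date = '20Jan2020');

-- A run now only reads the latest partition instead of every file
SELECT user_id, name
FROM users_ext_part
WHERE load_date = '20Jan2020';
```

Even with that layout, the dedup and 'deleted'-flag logic would still need the older snapshots (or the existing internal table), so this would only address the "scan every file every time" part of the problem.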
question from:
https://stackoverflow.com/questions/65838586/create-table-in-hive-from-data-folder-in-hdfs-remove-duplicated-rows