I have a folder in HDFS, let's call it /data/users/
Inside that folder, a new CSV file is added every 10 days. Each new file contains only the currently active users, so, for example:
- file_01Jan2020.csv: contains data for 1000 users who are currently active
- file_10Jan2020.csv: contains data for 950 users who are currently active (the same data as in file_01Jan2020.csv, but with 50 fewer records)
- file_20Jan2020.csv: contains data for 920 users who are currently active (the same data as in file_10Jan2020.csv, but with 30 fewer records)
In reality, these files are much bigger (~8 million records per file, decreasing by maybe 1K every 10 days). Also, a newer file will never contain records that don't exist in the older files; it will just have fewer of them.
I want to create a table in hive using the data in this folder. What I am doing now is:
- Create an external table from the data in the folder /data/users/
- Create an internal table with the same structure
- Write the data from the external table into the internal table, where:
- Duplicates are removed
- If a record doesn't exist in one of the newer files, I mark it as 'deleted' by setting a new 'deleted' column that I defined in the internal table (roughly as in the sketch below)
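For concreteness, here is a rough HiveQL sketch of that workflow. The table names, the column list (user_id, name), the storage formats, the boolean 'deleted' flag, and the use of Hive's INPUT__FILE__NAME virtual column to identify the newest file are all placeholders I picked for illustration, not my actual schema:

```sql
-- 1. External table over the raw CSV files in /data/users/
--    (user_id/name are placeholder columns, not the real schema)
CREATE EXTERNAL TABLE users_ext (
  user_id BIGINT,
  name    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/users/';

-- 2. Internal (managed) table with the same columns plus a 'deleted' flag
CREATE TABLE users_internal (
  user_id BIGINT,
  name    STRING,
  deleted BOOLEAN
)
STORED AS ORC;

-- 3. Deduplicate and flag the users that are missing from the newest file
INSERT OVERWRITE TABLE users_internal
SELECT
  d.user_id,
  d.name,
  -- a user absent from the newest file is flagged as deleted
  CASE WHEN n.user_id IS NULL THEN true ELSE false END AS deleted
FROM (
  -- deduplicate: the same record repeats across files
  SELECT DISTINCT user_id, name
  FROM users_ext
) d
LEFT JOIN (
  -- users present in the newest file, identified via the
  -- INPUT__FILE__NAME virtual column (my assumption for "newest")
  SELECT DISTINCT user_id
  FROM users_ext
  WHERE INPUT__FILE__NAME LIKE '%file_20Jan2020.csv'
) n
ON d.user_id = n.user_id;
```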
I am concerned about the step where I create the external table: since the data is really big, that table will become huge after some time, and I was wondering whether there is a more efficient way of doing this instead of loading all the files in the folder every time.
So my question is: what is the best way to ingest data from an HDFS folder into a Hive table, given that the folder contains lots of files with lots of duplication?
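For illustration only, here is a rough sketch of one variation I could imagine (not something I have in place today): each new snapshot lands in its own dated subfolder so Hive can register it as a partition, and a run then only scans the newest partition instead of every file. The load_date partition column, the subfolder layout, and the date format are assumptions:

```sql
-- Assumed variation: each new file is dropped into a dated subfolder,
-- e.g. /data/users/load_date=20Jan2020/file_20Jan2020.csv, so that each
-- snapshot becomes a Hive partition (this layout is an assumption).
CREATE EXTERNAL TABLE users_ext_part (
  user_id BIGINT,
  name    STRING
)
PARTITIONED BY (load_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/users/';

-- Register a newly arrived snapshot (MSCK REPAIR TABLE would also work)
ALTER TABLE users_ext_part
  ADD IF NOT EXISTS PARTITION (load_date = '20Jan2020');

-- A run now only reads the latest partition instead of every file
SELECT user_id, name
FROM users_ext_part
WHERE load_date = '20Jan2020';
```

Even with that layout, the dedup and 'deleted'-flag logic would still need the older snapshots (or the existing internal table), so this would only address the "scan every file every time" part of the problem.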
question from:
https://stackoverflow.com/questions/65838586/create-table-in-hive-from-data-folder-in-hdfs-remove-duplicated-rows