Welcome to OGeek Q&A Community for programmers and developers - Open, Learning and Share
Welcome To Ask or Share your Answers For Others


0 votes
277 views
in Technique by (71.8m points)

Create table in hive from data folder in HDFS - remove duplicated rows

I have a folder in HDFS, let's call it /data/users/. Inside that folder, a new CSV file is added every 10 days. Each new file contains only the currently active users, so, for example:

  • file_01Jan2020.csv: contains data for 1000 users who are currently active
  • file_10Jan2020.csv: contains data for 950 users who are currently active (same data as in file_01Jan2020.csv, but with 50 fewer records)
  • file_20Jan2020.csv: contains data for 920 users who are currently active (same data as in file_10Jan2020.csv, but with 30 fewer records)

In reality, these files are much bigger (~8 million records per file, decreasing by maybe 1,000 records every 10 days). Also, a newer file will never contain records that don't exist in the older files; it will simply have fewer records.

I want to create a table in hive using the data in this folder. What I am doing now is:

  • Create External table from the data in the folder /data/users/
  • Create Internal table with the same structure
  • Write the data from the external table into the internal table, where:
    • Duplicates are removed
    • If a record from an older file no longer appears in the newest file, I mark it as deleted by setting a 'deleted' flag in an extra column I defined in the internal table

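For reference, the three steps above might look something like the following HiveQL. This is only a sketch: the schema (user_id, name, signup_date), the table names, and the ORC storage choice are all assumptions, and it relies on the Hive virtual column INPUT__FILE__NAME and on file names that sort chronologically (the example names like file_01Jan2020.csv do not; a date-sortable scheme such as file_20200101.csv would be needed).

```sql
-- External table over the raw CSV folder (schema is an assumption)
CREATE EXTERNAL TABLE users_raw (
  user_id     BIGINT,
  name        STRING,
  signup_date STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/users/';

-- Internal (managed) table with the extra 'deleted' flag, stored as ORC
CREATE TABLE users_clean (
  user_id     BIGINT,
  name        STRING,
  signup_date STRING,
  deleted     BOOLEAN
)
STORED AS ORC;

-- Deduplicate to one row per user_id. A user is 'deleted' when the
-- newest file they appear in (per-user MAX of the virtual column
-- INPUT__FILE__NAME) is not the newest file overall -- i.e. they are
-- missing from the latest snapshot. Assumes file names sort by date.
INSERT OVERWRITE TABLE users_clean
SELECT
  u.user_id,
  MAX(u.name),
  MAX(u.signup_date),
  MAX(u.INPUT__FILE__NAME) <> latest.f AS deleted
FROM users_raw u
CROSS JOIN (SELECT MAX(INPUT__FILE__NAME) AS f FROM users_raw) latest
GROUP BY u.user_id, latest.f;
```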
I am concerned about the step where I create the external table: since the data is really big, that table will become huge over time, and I was wondering whether there is a more efficient way of doing this than loading all the files in the folder every time.

So my question is: what is the best way to ingest data from an HDFS folder into a Hive table, given that the folder contains lots of files with lots of duplicate records?

question from:https://stackoverflow.com/questions/65838586/create-table-in-hive-from-data-folder-in-hdfs-remove-duplicated-rows


1 Reply

0 votes
by (71.8m points)

I'd suggest partitioning the data by date; that way you don't have to go through all the records every time you read the table.
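A minimal sketch of what a date-partitioned layout could look like, assuming a hypothetical schema and table names (queries that filter on the partition column then read only that partition's files, instead of scanning the whole folder):

```sql
-- Managed table partitioned by snapshot date (names are illustrative)
CREATE TABLE users_by_snapshot (
  user_id BIGINT,
  name    STRING
)
PARTITIONED BY (snapshot_date STRING)
STORED AS ORC;

-- Load each 10-day file into its own partition
INSERT INTO TABLE users_by_snapshot PARTITION (snapshot_date = '2020-01-20')
SELECT user_id, name
FROM users_raw_20jan2020;  -- hypothetical staging table for one file

-- Partition pruning: only the 2020-01-20 partition is scanned
SELECT * FROM users_by_snapshot WHERE snapshot_date = '2020-01-20';
```

Since the newest file is always a subset of the older ones, reading just the latest partition also gives you the current active users for free, without any deduplication.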

