I am on Python 3. I have a dataframe, nameinternat_country, with 13,527 companies (and 5 columns). I'd like to merge this company list with Contact_info.txt, a 40+ GB file (190M+ companies and 29 columns), on the variable name_country.
My desired output would be a dataframe (I call it mergeall in the code below) containing the 13,527-company list from nameinternat_country (my left df) plus the 29 columns merged in from the matched cases in Contact_info.txt. So mergeall would have 13,527 rows and 34 columns (5 columns from nameinternat_country + 29 original columns from Contact_info.txt), with unmatched cases showing missing values.
The problem is that Contact_info.txt is 40+ GB, so I cannot load it as a dataframe for memory reasons. I therefore need to read it in chunks and do the merge chunk by chunk. Here is the code I have (note: I subset Contact_info.txt to its first 5,000 rows just to make my trials more efficient):
import pandas as pd

mergeall = pd.DataFrame()  # empty df to collect the chunk-by-chunk merges below
ChunkSize = 1000  # number of rows per chunk
for chunk in pd.read_csv('ORBIS financial/Contact info.txt', sep="\t", nrows=5000, chunksize=ChunkSize):
    # create the merging column in the txt chunks: "NAME_INTERNAT, Country"
    chunk["name_country"] = chunk["NAME_INTERNAT"] + ", " + chunk["Country"]
    mergeall = pd.concat([mergeall, nameinternat_country.merge(chunk, how='left', on='name_country')])
The mergeall dataframe comes out as 67,635 rows x 34 columns instead of the expected 13,527 rows. That is exactly 13,527 × 5: one full copy of my company list for each of the 5 chunks. Is the error in the pd.concat call?
Thanks a lot.
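For reference, this is the kind of pattern I suspect I need instead: collect only the matching rows from each chunk first, then do a single left merge at the end. This is only a sketch reusing the names from my code above; I have not verified it against the full file:
matches = []  # matched Contact_info rows, accumulated chunk by chunk
for chunk in pd.read_csv('ORBIS financial/Contact info.txt', sep="\t",
                         nrows=5000, chunksize=ChunkSize):
    chunk["name_country"] = chunk["NAME_INTERNAT"] + ", " + chunk["Country"]
    # keep only the chunk rows whose key appears in my company list
    matches.append(chunk[chunk["name_country"].isin(nameinternat_country["name_country"])])

# one left merge at the end: 13,527 rows, unmatched companies get NaN
mergeall = nameinternat_country.merge(pd.concat(matches), how='left', on='name_country')
(This assumes the name_country keys are unique within Contact_info.txt; duplicate keys on the right side would still add rows.)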
question from:
https://stackoverflow.com/questions/65893159/loop-merge-df-and-txt-and-append-matched-output