You could try something like this.
First, create your chunker.
chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'], encoding='utf-8', chunksize=10000000)
Now create a set of ids:
ids = set()
Now iterate over the chunks:
for chunk in chunker:
    # drop_duplicates returns a new DataFrame, so assign the result back
    chunk = chunk.drop_duplicates(['Author ID'])
Next, still within the body of the loop, also drop the rows whose id is already in the set of known ids:
    chunk = chunk[~chunk['Author ID'].isin(ids)]
Finally, still within the body of the loop, add the new ids to the set:
    ids.update(chunk['Author ID'].values)
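Put together, the steps above might look roughly like the sketch below. The function name unique_authors, the generator style, and the tiny in-memory sample are illustrative assumptions, not part of the original question; in practice you would pass AUTHORS_PATH and keep the large chunk size.

```python
import io

import pandas as pd


def unique_authors(source, chunksize=10_000_000):
    """Yield chunks in which every 'Author ID' appears only once overall.

    `unique_authors` is a hypothetical helper name used for illustration.
    """
    chunker = pd.read_table(source, names=['Author ID', 'Author name'],
                            encoding='utf-8', chunksize=chunksize)
    ids = set()  # ids seen in earlier chunks
    for chunk in chunker:
        # remove duplicates inside the current chunk
        chunk = chunk.drop_duplicates(['Author ID'])
        # remove rows whose id was already seen in an earlier chunk
        chunk = chunk[~chunk['Author ID'].isin(ids)]
        # remember the ids that survived
        ids.update(chunk['Author ID'].values)
        yield chunk


# Tiny tab-separated sample standing in for the real file (assumed data).
sample = io.StringIO("1\tAnn\n2\tBob\n1\tAnn\n3\tCid\n2\tBob\n")
result = pd.concat(unique_authors(sample, chunksize=2), ignore_index=True)
```

Yielding each cleaned chunk lets you write it out (or append it to a file) right away, so only one chunk plus the set of ids has to be in memory at a time.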
If ids is too large to fit into main memory, you might need to use a disk-based database.
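One possible way to do that is with the sqlite3 module from the standard library. This is only a sketch: the class DiskIdSet, its method add_new, and the table name seen are made up for illustration, and it assumes the author ids are integers.

```python
import sqlite3


class DiskIdSet:
    """Hypothetical helper: a set of integer ids kept in SQLite, not in RAM."""

    def __init__(self, path=':memory:'):
        # ':memory:' is only for trying this out; pass a file path
        # to actually keep the ids on disk.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS seen (id INTEGER PRIMARY KEY)')

    def add_new(self, candidates):
        """Store the candidates and return those that were not seen before."""
        new = []
        for value in candidates:
            cur = self.conn.execute(
                'INSERT OR IGNORE INTO seen (id) VALUES (?)', (int(value),))
            # rowcount is 1 when the row was inserted, 0 when it was ignored
            if cur.rowcount == 1:
                new.append(int(value))
        self.conn.commit()
        return new
```

Inside the loop you could then replace the isin/update pair with something like new_ids = seen.add_new(chunk['Author ID']) followed by chunk = chunk[chunk['Author ID'].isin(new_ids)], where seen is a DiskIdSet created before the loop. Inserting row by row is slow for very large inputs, so treat this as a starting point rather than a tuned solution.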