Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.9k views
in Technique[技术] by (71.8m points)

pandas - Removing *NEARLY* Duplicate Observations - Python

I am attempting to remove some observations in a pandas DataFrame where the similarities are ALMOST 100% but not quite. See frame below:

enter image description here

Notice how "John", "Mary", and "Wesley" have nearly identical observations, but have one column being different. The real data set has 15 columns, and 215,000+ observations. In all of the cases I could visually verify, the similarities were likewise: out of 15 columns, the other observation would match up to 14 columns, every time. For the purpose of the project I have decided to remove the repeated observations (and store them into another DataFrame just in case my boss asks to see them).

I have evidently thought of remove_duplicates(keep='something'), but that would not work since the observations are not ENTIRELY similar. Has anyone ever encounter such an issue? Any idea on a remedy?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

This can be formulated as a pair-wise Hamming distance calculation between all records, separating out subsequent pairs below some threshold. Fortunately, numpy/scipy/sklearn already has done the heavy lifting. I've included two functions that produce identical output - one that is fully vectorized (but consumes O(N^2) memory) and another that consumes O(N) memory but is only vectorized along a single dimension. At your scale, you almost certainly don't want the fully vectorized version - it will likely give an OOM error. In both cases, the basic algorithm is as follows:

  • encode each feature value as an integer value (thanks sklearn!)
  • for all row pairs, calculate the Hamming distance (sum of different values)
  • if two rows are found at threshold or below Hamming distance, discard the latter until no rows remain below that threshold

code:

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
from scipy.spatial.distance import pdist, squareform
import numpy as np


def dedupe_fully_vectorized(df, threshold=1):
    """
    fully vectorized memory hog version - best not to use for n > 10k
    """
    # convert field data to integers
    enc = OrdinalEncoder()
    X = enc.fit_transform(df.to_numpy())

    # calc the (unnormalized) hamming distance for all row pairs
    d = pdist(X, metric="hamming") * df.shape[1]
    s = squareform(d)

    # s contains all pairs (j,k) and (k,j); exclude all pairs j < k as "duplicates"
    s[np.triu_indices_from(s)] = -1
    dupe_pair_matrix = (0 <= s) * (s <= threshold)

    df_dupes = df[np.any(dupe_pair_matrix, axis=1)]
    df_deduped = df.drop(df_dupes.index).sort_index()
    return (df_deduped, df_dupes)


def dedupe_partially_vectorized(df, threshold=1):
    """
    - Iterate through each row starting from the last; examine all previous rows for duplicates.  
    - If found, it is appended to a list of duplicate indices.
    """
    # convert field data to integers
    enc = OrdinalEncoder()
    X = enc.fit_transform(df.to_numpy())

    """
    - loop through each row, starting from last
    - for each `row`, calculate hamming distance to all previous rows
    - if any such distance is `threshold` or less, mark `idx` as duplicate
    - loop ends at 2nd row (1st is by definition not a duplicate)
    """
    dupe_idx = []          
    for j in range(len(X) - 1):
        idx = len(X) - j - 1
        row = X[idx]
        prev_rows = X[0:idx]
        dists = np.sum(row != prev_rows, axis=1)
        if min(dists) <= threshold:
            dupe_idx.append(idx)
        dupe_idx = sorted(dupe_idx)
    df_dupes = df.iloc[dupe_idx]
    df_deduped = df.drop(dupe_idx)
    return (df_deduped, df_dupes)

Now lets test things out. Sanity check first:

df = pd.DataFrame(
    [
        ["john", "doe", "m", 23],
        ["john", "dupe", "m", 23],
        ["jane", "doe", "f", 29],
        ["jane", "dole", "f", 28],
        ["jon", "dupe", "m", 23],
        ["tom", "donald", "m", 12],
        ["john", "dupe", "m", 65],
    ],
    columns=["first", "last", "s", "age"],
)


(df_deduped_fv, df_dupes_fv) = dedupe_fully_vectorized(df)
(df_deduped, df_dupes) = dedupe_partially_vectorized(df)

df_deduped_fv == df_deduped # True

# df_deduped
#   first    last  s  age
# 0  john     doe  m   23
# 2  jane     doe  f   29
# 3  jane    dole  f   28
# 5   tom  donald  m   12

# df_dupes
#   first  last  s  age
# 1  john  dupe  m   23
# 4   jon  dupe  m   23
# 6  john  dupe  m   65

I've tested this on dataframes up to ~40k rows (as below) and it seems to work (the two methods give identical results), but may take several seconds. I've haven't tried it at your scale, but it may be slow:

arr = np.array("abcdefgh")
df = pd.DataFrame(np.random.choice(arr, (40000, 15))
# (df_deduped, df_dupes) = dedupe_partially_vectorized(df)

If you can avoid doing all pairwise comparisons, such as grouping by name, that will improve performance significantly.

fun aside/issues with approach

You may notice you can get interesting "Hamming chains" (I don't know if this is a term) where very different records are connected by a chain of one-edit difference records:

df_bad_news = pd.DataFrame(
    [
        ["john", "doe", "m", 88],
        ["jon", "doe", "m", 88],
        ["jan", "doe", "m", 88],
        ["jane", "doe", "m", 88],
        ["jane", "doe", "m", 12],
    ],
    columns=["first", "last", "s", "age"],
)


(df_deduped, df_dupes) = dedupe(df)

# df_deduped
#   first last  s  age
# 0  john  doe  m   88

# df_dupes
#   first last  s  age
# 1   jon  doe  m   88
# 2   jan  doe  m   88
# 3  jane  doe  m   88
# 4  jane  doe  m   12

Performance will be greatly improved if there is a field you can groupby on (it was mentioned in the comments that name is expected to be identical). Here the pairwise calculation is n^2 in memory. It is possible to trade some time efficiency for memory efficiency as needed.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...