python - PySpark: match the values of a DataFrame column against another DataFrame column

Question

Posted Oct 17, 2021 in Technique by 深蓝 (71.8m points)

In pandas, I can use the isin() method on a DataFrame column to match its values against the values of another column.

For example: suppose we have one DataFrame:

import pandas as pd

df_A = pd.DataFrame({'col1': ['A', 'B', 'C', 'B', 'C', 'D'], 
                     'col2': [1, 2, 3, 4, 5, 6]})
df_A

    col1  col2
0    A     1
1    B     2
2    C     3
3    B     4
4    C     5
5    D     6

and another DataFrame:

df_B = pd.DataFrame({'col1': ['C', 'E', 'D', 'C', 'F', 'G', 'H'], 
                     'col2': [10, 20, 30, 40, 50, 60, 70]})
df_B

    col1  col2
0    C    10
1    E    20
2    D    30
3    C    40
4    F    50
5    G    60
6    H    70

I can use .isin() to keep only the rows of df_B whose col1 value also appears in col1 of df_A.

E.g.:

df_B[df_B['col1'].isin(df_A['col1'])]

yields:

    col1  col2
0    C    10
2    D    30
3    C    40

What is the equivalent operation on a PySpark DataFrame? I tried the following:

df_A = pd.DataFrame({'col1': ['A', 'B', 'C', 'B', 'C', 'D'], 
                     'col2': [1, 2, 3, 4, 5, 6]})
df_A = sqlContext.createDataFrame(df_A)

df_B = pd.DataFrame({'col1': ['C', 'E', 'D', 'C', 'F', 'G', 'H'], 
                     'col2': [10, 20, 30, 40, 50, 60, 70]})
df_B = sqlContext.createDataFrame(df_B)


df_B[df_B['col1'].isin(df_A['col1'])]

The .isin() call above gives me this error message:

u'resolved attribute(s) col1#9007 missing from 
col1#9012,col2#9013L in operator !Filter col1#9012 IN 
(col1#9007);;
!Filter col1#9012 IN (col1#9007)
+- 
LogicalRDD [col1#9012, col2#9013L]
'

1 Reply

深蓝 · Answer 1 · 2021-10-17T00:34:56+0000

replied Oct 17, 2021 by 深蓝 (71.8m points)

This kind of operation is called a left semi join in Spark. It keeps the rows of df_B whose col1 value appears in df_A and returns only the columns of df_B:

df_B.join(df_A, ['col1'], 'leftsemi')
