You should consider using the built-in functions of the pyspark.sql module instead of writing a UDF; there are several regexp-based functions among them.

Let's start with a more complete sample DataFrame:
df = sc.parallelize([["a", "b", "foo is tasty"],
                     ["12", "34", "blah blahhh"],
                     ["yeh", "0", "bar of yums"],
                     ["haha", "1", "foobar none"],
                     ["hehe", "2", "something bar else"]]) \
    .toDF(["col1", "col2", "col_with_text"])
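If you are on Spark 2.x or later, you can build the same DataFrame through the SparkSession entry point instead (this sketch assumes a session bound to the usual name spark):

# Equivalent construction via the SparkSession entry point
# (assumes a session named `spark` is in scope).
df = spark.createDataFrame(
    [["a", "b", "foo is tasty"],
     ["12", "34", "blah blahhh"],
     ["yeh", "0", "bar of yums"],
     ["haha", "1", "foobar none"],
     ["hehe", "2", "something bar else"]],
    ["col1", "col2", "col_with_text"])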
If you want to filter rows based on whether they contain one of the words in words_list, you can use rlike:
import pyspark.sql.functions as psf
words_list = ['foo','bar']
df.filter(psf.col('col_with_text').rlike(r'(^|\s)(' + '|'.join(words_list) + r')(\s|$)')).show()
+----+----+------------------+
|col1|col2| col_with_text|
+----+----+------------------+
| a| b| foo is tasty|
| yeh| 0| bar of yums|
|hehe| 2|something bar else|
+----+----+------------------+
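Note that rlike interprets the joined words as a regular expression, so if words_list may contain regex metacharacters (., *, parentheses, etc.), it is safer to escape each word first. A small sketch using Python's standard re.escape:

import re

# Escape each word so metacharacters match literally before
# building the same alternation pattern as above.
safe_pattern = r'(^|\s)(' + '|'.join(re.escape(w) for w in words_list) + r')(\s|$)'
df.filter(psf.col('col_with_text').rlike(safe_pattern)).show()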
If you want to extract the strings matching the regular expression, you can use regexp_extract:
df.withColumn(
    'extracted_word',
    psf.regexp_extract('col_with_text', r'(?<=^|\s)(' + '|'.join(words_list) + r')(?=\s|$)', 0)) \
  .show()
+----+----+------------------+--------------+
|col1|col2| col_with_text|extracted_word|
+----+----+------------------+--------------+
| a| b| foo is tasty| foo|
| 12| 34| blah blahhh| |
| yeh| 0| bar of yums| bar|
|haha| 1| foobar none| |
|hehe|   2|something bar else|           bar|
+----+----+------------------+--------------+
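regexp_extract only returns the first match per row. If a row can contain several of the words and you want all of them, Spark 3.1+ also ships a regexp_extract_all SQL function that returns an array; the sketch below calls it through expr (the dedicated Python wrapper only appeared in later releases) and uses \b word boundaries as a simplifying assumption:

# Collect every whole-word occurrence of the words into an array column.
# Requires Spark 3.1+ for regexp_extract_all; the doubled backslashes
# survive SQL string-literal parsing as \b word boundaries.
df.withColumn(
    'extracted_words',
    psf.expr(r"regexp_extract_all(col_with_text, '\\b(foo|bar)\\b', 1)")) \
  .show()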