In short:
regexp_extract(col('Notes'), '(.)(by)(s+)(w+)', 4))
This expression extracts employee name from any position where it is after by then space(s) in text column(col('Notes')
)
In Detail:
Create a sample dataframe
data = [('2345', 'Checked by John'),
('2398', 'Verified by Stacy'),
('2328', 'Verified by Srinivas than some random text'),
('3983', 'Double Checked on 2/23/17 by Marsha')]
df = sc.parallelize(data).toDF(['ID', 'Notes'])
df.show()
+----+--------------------+
| ID| Notes|
+----+--------------------+
|2345| Checked by John|
|2398| Verified by Stacy|
|2328|Verified by Srini...|
|3983|Double Checked on...|
+----+--------------------+
Do the needed imports
from pyspark.sql.functions import regexp_extract, col
On df
extract Employee
name from column using regexp_extract(column_name, regex, group_number)
.
Here regex('(.)(by)(s+)(w+)'
) means
- (.) - Any character (except newline)
- (by) - Word by in the text
- (s+) - One or many spaces
- (w+) - Alphanumeric or underscore chars of length one
and group_number is 4 because group (w+)
is in 4th position in expression
result = df.withColumn('Employee', regexp_extract(col('Notes'), '(.)(by)(s+)(w+)', 4))
result.show()
+----+--------------------+--------+
| ID| Notes|Employee|
+----+--------------------+--------+
|2345| Checked by John| John|
|2398| Verified by Stacy| Stacy|
|2328|Verified by Srini...|Srinivas|
|3983|Double Checked on...| Marsha|
+----+--------------------+--------+
Databricks notebook
Note:
regexp_extract(col('Notes'), '.bys+(w+)', 1))
seems much cleaner version and check the Regex in use here
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…