Thursday, April 22, 2021

How to regexp_extract if a matching pattern resides anywhere in the string - pyspark

I am trying to get a better understanding of regexp_extract in pyspark, so I ran the check below.

Below is my dataframe:

    data = [
        ('2345', 'Checked|by John|for kamal'),
        ('2398', 'Checked|by John|for kamal '),
        ('2328', 'Verified|by Srinivas|for kamal than some random text'),
        ('3983', 'Verified|for Stacy|by John'),
    ]
    df = sc.parallelize(data).toDF(['ID', 'Notes'])
    df.show(truncate=False)  # truncate=False so the full Notes text is printed

    +----+----------------------------------------------------+
    |ID  |Notes                                               |
    +----+----------------------------------------------------+
    |2345|Checked|by John|for kamal                           |
    |2398|Checked|by John|for kamal                           |
    |2328|Verified|by Srinivas|for kamal than some random text|
    |3983|Verified|for Stacy|by John                          |
    +----+----------------------------------------------------+

Here I am trying to identify whether each ID was checked or verified by John.

With the help of SO members I worked out how to use regexp_extract and arrived at the solution below:

    from pyspark.sql.functions import col, regexp_extract

    # Capture group 1 (Checked or Verified) only when '|by John' follows it directly
    result = df.withColumn('Employee', regexp_extract(col('Notes'), '(Checked|Verified)(\\|by John)', 1))
    result.show(truncate=False)

    +----+----------------------------------------------------+--------+
    |ID  |Notes                                               |Employee|
    +----+----------------------------------------------------+--------+
    |2345|Checked|by John|for kamal                           |Checked |
    |2398|Checked|by John|for kamal                           |Checked |
    |2328|Verified|by Srinivas|for kamal than some random text|        |
    |3983|Verified|for Stacy|by John                          |        |
    +----+----------------------------------------------------+--------+

For a few IDs this gives the correct result, but for the last ID (3983) it did not print Verified. Could someone please tell me whether the regular expression needs to be changed?

My feeling is that '(Checked|Verified)(\|by John)' matches only when the two values are adjacent. I tried * and $, but it still didn't print Verified for ID 3983.
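To make the suspicion above concrete, here is a minimal sketch that uses Python's built-in re module instead of Spark, so it runs without a cluster. The helper first_group and the ANYWHERE pattern are hypothetical names introduced only for illustration. The sketch assumes that Spark's Java regex engine treats these simple patterns the same way as Python's re, and it returns an empty string when nothing matches, as regexp_extract does.

```python
import re

# Sample rows copied from the dataframe in the question.
rows = [
    ("2345", "Checked|by John|for kamal"),
    ("2398", "Checked|by John|for kamal "),
    ("2328", "Verified|by Srinivas|for kamal than some random text"),
    ("3983", "Verified|for Stacy|by John"),
]

# Pattern from the question: the status must be followed directly by '|by John'.
ADJACENT = r"(Checked|Verified)(\|by John)"

# Hypothetical alternative (an assumption, not a confirmed fix): '.*' allows
# any text between the status and '|by John'.
ANYWHERE = r"(Checked|Verified).*(\|by John)"


def first_group(pattern, text):
    """Return capture group 1 of the first match, or '' when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else ""


adjacent = [first_group(ADJACENT, note) for _, note in rows]
anywhere = [first_group(ANYWHERE, note) for _, note in rows]

# Print ID, result of the original pattern, result of the relaxed pattern.
for (row_id, _), a, b in zip(rows, adjacent, anywhere):
    print(row_id, repr(a), repr(b))
```

In this sketch the original pattern leaves ID 3983 empty because '|for Stacy' sits between 'Verified' and '|by John', while the relaxed pattern returns Verified for it and still leaves ID 2328 empty. If the assumption about the regex engines holds, the same relaxed pattern string could be tried as the second argument of regexp_extract.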

https://stackoverflow.com/questions/67223763/how-to-regexp-extract-if-a-matching-pattern-resides-anywhere-in-the-string-pys April 23, 2021 at 11:50AM
