python - How to filter out data using regex patterns to collect statistics on the results

Question

Welcome To Ask or Share your Answers For Others

python - How to filter out data using regex patterns to collect statistics on the results

posted Jan 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - How to filter out data using regex patterns to collect statistics on the results

I am trying to collect stats on my data that has two columns- subject and _time. I have a list of regex patterns that I would like to filter the data to derive a summary dataframe showing the pattern that has a match or more, the timestamp of the first instance matched, the timestamp of the last instance matched, and then its count (number of times that pattern matched with the subjects).

Here is a list of my regex patterns:

['^Payment Issue', 'Invoice\s#(([0-9])\w+)\sfrom\s([a-zA-Z]\w+)', '(Cares+message)']

Here is the data that is collected as a dictionary and turned into a DataFrame:

    subject           type          _time
0  Payment Issue      email   2020-12-09 19:45:48.013140
1  Payment Issue      email    2020-13-09 19:45:48.013140
2  Invoice\          email   2020-12-09 19:45:48.013140
3  Hello!             email    2020-14-11 19:45:48.013140

I want to create a new DataFrame that has the following columns- pattern, first_seen, last_seen, count. "regex_pattern" is the pattern that matched with the subject, "first_seen" would be the value of the earliest _time, "last_seen" would be the latest _time, and count would be the number of times the pattern matched with subjects.

This is what I want the output to look like:

      pattern                                           first_seen            last_seen                           count
0  ^Payment Issue                                    2020-12-09 19:45:48.013140   2020-13-09 19:45:48.013140         2
1  Invoice\s#(([0-9])\w+)\sfrom\s([a-zA-Z]\w+)  2020-12-09 19:45:48.013140   2020-12-09 19:45:48.013140         1    
2  (Cares+message)                                             NA                                NA                 0

Here is my python so far:

# Collect a list of regex patterns to query against the data
patterns = ["^Payment Issue", "Invoice\s#(([0-9])\w+)\sfrom\s([a-zA-Z]\w+)", "(Cares+message)"]


# Collect the table that holds the dictionary of emails sent
email_result_dict = {'subject': ['Payment Issue ', 'Payment Issue ', 'Invoice\ ', 'Hello!'],
                     'type': ['email', 'email', 'email', 'email'],
                     '_time': [' 2020-12-09 19:45:48.013140', '2020-13-09 19:45:48.013140',
                               '2020-12-09 19:45:48.013140', '2020-14-09 19:45:48.013140']}


# Put data results in to a DataFrame
df2 = pd.DataFrame.from_dict(email_result_dict)


# List of subject values from the data
value_list = []

# Gather all subjects
snd = [result['subject'] for result in email_result_dict]
value_list.extend(snd)

df = pd.DataFrame()

for j in patterns:
    p = re.compile(j, re.IGNORECASE)
    for i in value_list:
        matches = p.findall(i)
        if len(matches) != 0:
            df = df.append({'regex': j, 'matched': matches}, ignore_index=True)

# Create the output DataFrame
new_dataFrame = df.groupby('regex')['matched'].count().reset_index()
new_dataFrame.columns = ['regex', 'count']

This script runs but is missing the first_seen and last_seen data. Any ideas or suggestions as to how to get my expected output in a dataframe? I have tried everything and keep getting stuck on getting the first_seen and last_seen values for each match that the count is greater than 0.

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

Categories

python - How to filter out data using regex patterns to collect statistics on the results

python - How to filter out data using regex patterns to collect statistics on the results

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags