I think this is one of the Coursera text mining assignments. You can solve it with regular expressions and `str.extract`. Assuming the data is in dates.txt, i.e.
import pandas as pd

doc = []
with open('dates.txt') as file:
    for line in file:
        doc.append(line)
df = pd.Series(doc)

def date_sorter():
    # Get the dates written with a month name.
    # expand=False makes extract return a Series instead of a one-column DataFrame.
    one = df.str.extract(r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})', expand=False)
    # Get the dates written entirely in numbers
    two = df.str.extract(r'((?:\d{1,2})(?:(?:/|-)\d{1,2})(?:(?:/|-)\d{2,4}))', expand=False)
    # Get the dates that have no day, i.e. only month and year, or only a year
    three = df.str.extract(r'((?:\d{1,2}(?:-|/))?\d{4})', expand=False)
    # Fill the NaNs in one from two, then from three, and convert to datetime.
    # The replace calls fix month names that are misspelled in the text file.
    # (On pandas 2.x, mixed date styles may need format='mixed' in to_datetime.)
    dates = pd.to_datetime(one.fillna(two).fillna(three).replace('Decemeber', 'December', regex=True).replace('Janaury', 'January', regex=True))
    return pd.Series(dates.sort_values())

date_sorter()
Output:
9 1971-04-10
84 1971-05-18
2 1971-07-08
53 1971-07-11
28 1971-09-12
474 1972-01-01
153 1972-01-13
13 1972-01-26
129 1972-05-06
98 1972-05-13
111 1972-06-10
225 1972-06-15
31 1972-07-20
171 1972-10-04
191 1972-11-30
486 1973-01-01
335 1973-02-01
415 1973-02-01
36 1973-02-14
405 1973-03-01
323 1973-03-01
422 1973-04-01
375 1973-06-01
380 1973-07-01
345 1973-10-01
57 1973-12-01
481 1974-01-01
436 1974-02-01
104 1974-02-24
299 1974-03-01
If you want to return only the index, use return pd.Series(dates.sort_values().index) instead.
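A small sketch of what that index-only return looks like, using three made-up dates whose index values stand in for line numbers (the data here is hypothetical):

```python
import pandas as pd

# Hypothetical parsed dates; the index plays the role of the line number in the file.
dates = pd.to_datetime(
    pd.Series(['1972-01-13', '1971-04-10', '1971-07-08'], index=[153, 9, 2])
)

sorted_dates = dates.sort_values()     # chronological order, original index kept
order = pd.Series(sorted_dates.index)  # original line numbers, re-indexed 0..n-1

print(order.tolist())
```

Here `order` holds the original positions in chronological order, so `order[0]` is the line containing the earliest date.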
Breakdown of the first regex:
# ?: marks a non-capturing group
((?:\d{,2}\s)? # The optional day: up to two digits followed by a whitespace character. `?` refers to the preceding token or group, so the whole group occurs once or not at all.
(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* # A month abbreviation followed by lowercase letters (`[a-z]`) occurring any number of times (`*`), so both Jan and January match.
(?:-|\.|\s|,) # One separator: hyphen, period, whitespace, or comma
\s? # Optional whitespace (`?` here applies only to `\s`, i.e. the preceding token)
\d{,2}[a-z]* # Up to two digits followed by any number of letters (`*`), e.g. 1st, 13th, 22nd. May match nothing when the date has no day.
(?:-|,|\s)? # A hyphen, comma, or whitespace that may occur once or not at all because of the `?` at the end
\s? # Optional whitespace, at most one character (`?` here applies only to `\s`)
\d{2,4}) # The year: two to four digits
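A quick way to check the breakdown is to run the pattern with the standard `re` module. This is only a sketch: the sample strings are made up, and `MONTH_NAME_DATE` is a name I chose for the first pattern with its escapes written out.

```python
import re

# The first pattern, split across lines for readability.
MONTH_NAME_DATE = re.compile(
    r'((?:\d{,2}\s)?'
    r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*'
    r'(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})'
)

# Hypothetical sample lines (not taken from dates.txt).
samples = [
    "Seen on 20 Mar 2009 for follow-up",
    "Admitted March 20, 2009",
    "Since Mar. 20th, 2009",
    "Onset Feb 2009",
]

# The optional leading group can match zero digits plus one whitespace character,
# so a match may start with a space; strip() removes it.
matches = [MONTH_NAME_DATE.search(s).group(1).strip() for s in samples]

for sample, match in zip(samples, matches):
    print(f"{sample!r} -> {match!r}")
```

Each sample yields just the date portion: `20 Mar 2009`, `March 20, 2009`, `Mar. 20th, 2009`, and `Feb 2009`.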
Hope it helps.