I'm new to Python and using 2.7.3. This is a class assignment for a mandatory programming class in a legal assistant degree.
I'd like to read a UTF-8 text file of draft court statements and clean it up as follows.
Read input text file line by line
In each line,
(1) capitalize first letters of sentence (including first character of the line)
(2) make certain that all commas, periods, and semi-colons are followed by a space character
Write output text file line by line
This is what I have so far, based on reading other Stack Overflow posts. It's not working well. Please help. Thank you.
    import codecs
    import sys
    import os
    import re

    reload(sys)
    sys.setdefaultencoding('utf8')

    with codecs.open('test.txt', 'r', encoding='utf8') as file:
        filedata = file.read().replace('\n', '\n')

    re.sub(r'(?<=[.,;])(?=[^\s])', r' ', filedata)
    rtn = re.split('([.!?] *)', filedata)
    filedata = ''.join([i.capitalize() for i in rtn])
    filedata = filedata[0].upper() + filedata[1:]

    with codecs.open('output.txt', 'w') as file:
        file.write(filedata)
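Two details of the attempt above may account for part of the unexpected output: `re.sub` returns a new string rather than modifying its argument, and `str.capitalize()` lower-cases everything after the first character. A minimal sketch of both behaviours follows; the sample string and variable names are made up for illustration.

```python
import re

text = 'ask her,"what can I do?"'

# re.sub returns a new string; calling it without assigning the
# result leaves the original text untouched.
re.sub(r',(?=\S)', ', ', text)
unchanged = text
fixed = re.sub(r',(?=\S)', ', ', text)

# str.capitalize() upper-cases the first character and lower-cases
# the rest, so an existing capital "I" is lost.
capitalized = 'what can I do?'.capitalize()

# If the chunk starts with a quotation mark, no letter is upper-cased,
# and later capitals are still lower-cased.
quoted = '"What can I do?'.capitalize()
```

Here `fixed` contains the inserted space while `unchanged` does not, and both `capitalized` and `quoted` end up with a lower-case "i".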
Example text file
instead of arguing,ask her, "what can i do?"forgo postponing the problem. instead, talk to her.that single gesture will promote peace.
Desired output:
Instead of arguing, ask her, "What can I do?" Forgo postponing the problem. Instead, talk to her. That single gesture will promote peace.
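One possible way to get from the example input to the desired output is sketched below. It is an illustration under several assumptions rather than a definitive solution: a space is inserted after `.`, `,` or `;` only when a letter follows directly (so decimals such as 3.50 are left alone), a closing double quote after `.`, `!` or `?` also gets a space, a standalone "i" is treated as the pronoun, and a letter after an opening double quote is capitalized. Abbreviations such as "i.e." would be handled incorrectly. The names `clean_line` and `clean_file` are hypothetical, and `io.open` is used because it accepts an `encoding` argument on both Python 2.7 and Python 3.

```python
import io
import re

# A "letter" here is a word character that is neither a digit nor "_".
LETTER = r'[^\W\d_]'

SPACE_AFTER_PUNCT = re.compile(r'([.,;])(?=' + LETTER + r')', re.UNICODE)
SPACE_AFTER_QUOTE = re.compile(r'([.!?]")(?=' + LETTER + r')', re.UNICODE)
LONE_I = re.compile(r'\bi\b', re.UNICODE)
# Start of line, after sentence-ending punctuation plus whitespace,
# or right after an opening double quote that follows whitespace.
SENTENCE_START = re.compile(
    r'(^\s*["\']?|[.!?]["\']?\s+["\']?|\s")(' + LETTER + r')', re.UNICODE)


def clean_line(line):
    """Return one line with spacing and capitalization cleaned up."""
    line = SPACE_AFTER_PUNCT.sub(r'\1 ', line)
    line = SPACE_AFTER_QUOTE.sub(r'\1 ', line)
    line = LONE_I.sub('I', line)
    return SENTENCE_START.sub(
        lambda m: m.group(1) + m.group(2).upper(), line)


def clean_file(src_path, dst_path):
    """Read src_path line by line and write cleaned lines to dst_path."""
    with io.open(src_path, 'r', encoding='utf-8') as src, \
         io.open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            dst.write(clean_line(line))
```

With the file names from the question, this could be called as `clean_file('test.txt', 'output.txt')`. Because each line is processed on its own, a sentence that continues across a line break would have its continuation capitalized as well.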
question from:
https://stackoverflow.com/questions/65870003/regex-and-python-clean-up-utf8-text-file