Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


Regex and Python - Clean Up UTF8 Text File

New to Python, using 2.7.3. This is a class assignment for a mandatory programming class in a legal assistant degree.

I'd like to read a UTF-8 text file of draft court statements and clean it up as follows.

Read input text file line by line

In each line, (1) capitalize the first letter of each sentence (including the first character of the line) and (2) make certain that all commas, periods, and semicolons are followed by a space character

Write output text file line by line

This is what I have so far, based on reading other Stack Overflow posts. It's not working well. Please help. Thank you.


import codecs
import sys
import os
import re

reload(sys)
sys.setdefaultencoding('utf8')


with codecs.open('test.txt', 'r', encoding='utf8') as file:
    filedata = file.read().replace('\r\n', '\n')

re.sub(r'(?<=[.,;])(?=[^\s])', r' ', filedata)

rtn = re.split('([.!?] *)', filedata)
filedata = ''.join([i.capitalize() for i in rtn])
filedata = filedata[0].upper() + filedata[1:] 


with codecs.open('output.txt', 'w') as file:
    file.write(filedata)

Example text file

instead of arguing,ask her, "what can i do?"forgo postponing the problem. instead, talk to her.that single gesture will promote peace.

Desired output:

Instead of arguing, ask her, "What can I do?"  Forgo postponing the problem.  Instead, talk to her.  That single gesture will promote peace.
Question from: https://stackoverflow.com/questions/65870003/regex-and-python-clean-up-utf8-text-file

1 Reply

You can try this:

import re

filedata = 'instead of arguing,ask her, "what can i do?"forgo postponing the problem. instead, talk to her.that single gesture will promote peace.'

print(filedata)

# add a space: match , . ; " when not followed by whitespace (\s)
filedata = re.sub(r'([,.;"])((?!\s).)', r'\1 \2', filedata)

# clean the space inside the quotation: " What can i do?"
filedata = re.sub(r'"\s([^"]+")', r'"\1', filedata)

# uppercase the first letter of the text, after a dot, and after a quote
filedata = re.sub(r'(^.|\.\s\w|"\s?\w)', lambda m: m.group(1).upper(), filedata)

print(filedata)
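
The snippet above runs on a hard-coded string. To apply the same substitutions to the file from the question, line by line, a minimal sketch might look like the following. The helper names clean_line and clean_file are made up for illustration, and the last substitution (a standalone "i" becomes "I") is an added assumption based on the desired output, not part of the snippet above. It would also turn "i.e." into "I.e.", so drop it if that matters. io.open is used because it accepts an encoding argument on both Python 2.7 and Python 3.

```python
import io
import re


def clean_line(line):
    """Apply the three substitutions from the snippet above to one line (hypothetical helper)."""
    # add a space after , . ; " when the next character is not whitespace
    line = re.sub(r'([,.;"])((?!\s).)', r'\1 \2', line)
    # remove the space that the previous step put after an opening quote
    line = re.sub(r'"\s([^"]+")', r'"\1', line)
    # uppercase the first character of the line, and the first letter
    # after ". " or after a quote
    line = re.sub(r'(^.|\.\s\w|"\s?\w)', lambda m: m.group(1).upper(), line)
    # assumption, not in the snippet above: uppercase a standalone "i"
    line = re.sub(r'\bi\b', 'I', line)
    return line


def clean_file(src_path, dst_path):
    """Read src_path as UTF-8 line by line and write the cleaned lines to dst_path."""
    with io.open(src_path, 'r', encoding='utf8') as src, \
         io.open(dst_path, 'w', encoding='utf8') as dst:
        for line in src:
            dst.write(clean_line(line))
```

Calling clean_file('test.txt', 'output.txt') would then process the file names used in the question. For the example line, clean_line gives:

Instead of arguing, ask her, "What can I do?" Forgo postponing the problem. Instead, talk to her. That single gesture will promote peace.

This leaves one space between sentences, whereas the desired output in the question shows two. Two things in the original attempt are also worth checking: re.sub returns a new string, so its result has to be assigned back to filedata, and str.capitalize() lowercases everything after the first character, which would undo any capitals already in the text. With both files opened with an explicit encoding, the reload(sys) / sys.setdefaultencoding('utf8') lines should not be needed.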
