Using the snippet below, I've attempted to extract the text data from this PDF file.
import pyPdf
def get_text(path):
# Load PDF into pyPDF
pdf = pyPdf.PdfFileReader(file(path, "rb"))
# Iterate pages
content = ""
for i in range(0, pdf.getNumPages()):
content += pdf.getPage(i).extractText() + "
" # Extract text from page and add to content
# Collapse whitespace
content = " ".join(content.replace(u"xa0", " ").strip().split())
return content
The output I obtain, however,is devoid of whitespace between most of the words. This makes it difficult to perform natural language processing on the text (my ultimate goal, here).
Also, the 'fi' in the word 'finger' is consistently interpreted as something else. This is rather problematic since this paper is about spontaneous finger movements...
Does anybody know why this might be happening? I don't even know where to start!
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…