Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

c# - How to parse text from MS Word document to string

I am trying to find a way to parse a Word document's text into a string in my project. I have more than 600 Word (.doc) files, and for each one I need to get the text content (with newlines and tabs, if possible) and assign it to a string.

I've been reading stuff about the Open XML SDK but it looks quite complicated for something that looks so simple.



1 Reply


The Open XML SDK only handles the Word 2007 and newer formats (.docx), so it cannot read legacy .doc files, and it is not trivial to use.
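For comparison, here is a minimal sketch of what the Open XML route might look like for files that are already in the newer .docx format. The class name DocxText and the method Read are hypothetical names chosen for illustration. It assumes the DocumentFormat.OpenXml package is referenced. It does not help with .doc files.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

static class DocxText
{
    // Hypothetical helper: returns the body text of a .docx file,
    // one line per paragraph.
    public static string Read(string docxPath)
    {
        // Second argument false = open the package read-only.
        using (WordprocessingDocument doc = WordprocessingDocument.Open(docxPath, false))
        {
            Body body = doc.MainDocumentPart.Document.Body;

            // InnerText concatenates the text runs of each paragraph.
            // Tab elements carry no text, so tabs are not preserved here.
            return string.Join(Environment.NewLine,
                body.Descendants<Paragraph>().Select(p => p.InnerText));
        }
    }
}
```

Headers, footers, footnotes and text boxes live outside this simple walk (or may be visited twice when paragraphs are nested), which is part of why the SDK is not trivial for anything beyond plain body text.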

If performance is not an issue you could use Word Automation and have Word do this for you. It will look something like this:

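using System.Runtime.InteropServices;
using Microsoft.Office.Interop.Word;

// Start Word and open the document read-only, since we only need its text.
var app = new Application();
var doc = app.Documents.Open(documentLocation, ReadOnly: true);

// Range() with no arguments covers the main document body.
// Word ends each paragraph with '\r'; tabs come through as '\t'.
string rangeText = doc.Range().Text;

// Close without saving, then quit Word so no WINWORD.EXE process is left running.
doc.Close(SaveChanges: false);
app.Quit();

Marshal.ReleaseComObject(doc);
Marshal.ReleaseComObject(app);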

Take a look at http://www.codeproject.com/Articles/18703/Word-2007-Automation or http://www.codeproject.com/Articles/21247/Word-Automation for more complete examples and instructions. Note that this may become a bit more tricky if your documents are more complex (footnotes, text boxes, tables...).
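Since the question involves more than 600 files, the automation snippet above can be extended into a loop that starts Word once and reuses it for every document. This is only a sketch under a few assumptions: WordTextExtractor, NormalizeLineBreaks and ExtractAll are hypothetical names, the project references Microsoft.Office.Interop.Word, and Word is installed on the machine.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.InteropServices;
using Word = Microsoft.Office.Interop.Word;

static class WordTextExtractor
{
    // Word ends paragraphs with '\r' and manual line breaks with '\v'.
    // Map both to the platform newline; tabs ('\t') are left unchanged.
    public static string NormalizeLineBreaks(string wordText)
    {
        return wordText.Replace("\r\n", "\n")
                       .Replace('\r', '\n')
                       .Replace('\v', '\n')
                       .Replace("\n", Environment.NewLine);
    }

    // Hypothetical helper: maps each file path in a folder to its extracted text.
    public static Dictionary<string, string> ExtractAll(string folder)
    {
        var result = new Dictionary<string, string>();
        var app = new Word.Application();   // started once, reused for all files
        try
        {
            // Word needs absolute paths. Note that on .NET Framework the
            // "*.doc" pattern may also match .docx files.
            foreach (string path in Directory.GetFiles(Path.GetFullPath(folder), "*.doc"))
            {
                Word.Document doc = app.Documents.Open(path, ReadOnly: true);
                try
                {
                    result[path] = NormalizeLineBreaks(doc.Range().Text);
                }
                finally
                {
                    doc.Close(SaveChanges: false);
                    Marshal.ReleaseComObject(doc);
                }
            }
        }
        finally
        {
            // Quit even if a document fails, so Word does not stay running.
            app.Quit();
            Marshal.ReleaseComObject(app);
        }
        return result;
    }
}
```

Calling WordTextExtractor.ExtractAll on the folder that holds the documents would then give one string per file. Reusing a single Application instance avoids paying Word's startup cost 600 times, and the try/finally blocks keep a failing document from leaving a Word process behind.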

Another option is to have Word save the document as plain text and then read the text file. Take a look at this - http://msdn.microsoft.com/en-us/library/microsoft.office.tools.word.document.saveas(v=vs.80).aspx
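A rough sketch of that save-as-text approach through the interop assembly might look like the following. WordToTextFile and ConvertAndRead are hypothetical names. It assumes an already running Word.Application is passed in, and that the interop version provides SaveAs2 (Word 2010 and later). Older versions expose SaveAs instead.

```csharp
using System.IO;
using System.Runtime.InteropServices;
using System.Text;
using Word = Microsoft.Office.Interop.Word;

static class WordToTextFile
{
    // Hypothetical helper: saves docPath as a .txt file next to it
    // and returns the contents of that text file.
    public static string ConvertAndRead(Word.Application app, string docPath)
    {
        string txtPath = Path.ChangeExtension(docPath, ".txt");

        Word.Document doc = app.Documents.Open(docPath, ReadOnly: true);
        try
        {
            // wdFormatUnicodeText writes UTF-16, which avoids losing
            // characters that the ANSI code page cannot represent.
            doc.SaveAs2(txtPath, Word.WdSaveFormat.wdFormatUnicodeText);
        }
        finally
        {
            doc.Close(SaveChanges: false);
            Marshal.ReleaseComObject(doc);
        }

        return File.ReadAllText(txtPath, Encoding.Unicode);
    }
}
```

This leaves a .txt copy beside each document, which can be handy for inspecting what Word actually exported.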

