Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
355 views
in Technique[技术] by (71.8m points)

c# - Parse email content from quoted reply

I'm trying to figure out how to parse out the text of an email from any quoted reply text that it might include. I've noticed that usually email clients will put an "On such and such date so and so wrote" or prefix the lines with an angle bracket. Unfortunately, not everyone does this. Does anyone have any idea on how to programmatically detect reply text? I am using C# to write this parser.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I did a lot more searching on this and here's what I've found. There are basically two situations under which you are doing this: when you have the entire thread and when you don't. I'll break it up into those two categories:

When you have the thread:

If you have the entire series of emails, you can achieve a very high level of assurance that what you are removing is actually quoted text. There are two ways to do this. One, you could use the message's Message-ID, In-Reply-To ID, and Thread-Index to determine the individual message, it's parent, and the thread it belongs to. For more information on this, see RFC822, RFC2822, this interesting article on threading, or this article on threading. Once you have re-assembled the thread, you can then remove the external text (such as To, From, CC, etc... lines) and you're done.

If the messages you are working with do not have the headers, you can also use similarity matching to determine what parts of an email are the reply text. In this case you're stuck with doing similarity matching to determine the text that is repeated. In this case you might want to look into a Levenshtein Distance algorithm such as this one on Code Project or this one.

No matter what, if you're interested in the threading process, check out this great PDF on reassembling email threads.

When you don't have the thread:

If you are stuck with only one message from the thread, you're doing to have to try to guess what the quote is. In that case, here are the different quotation methods I have seen:

  1. a line (as seen in outlook).
  2. Angle Brackets
  3. "---Original Message---"
  4. "On such-and-such day, so-and-so wrote:"

Remove the text from there down and you're done. The downside to any of these is that they all assume that the sender put their reply on top of the quoted text and did not interleave it (as was the old style on the internet). If that happens, good luck. I hope this helps some of you out there!


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...