There are four steps that you will have to take in order to isolate the plain text part of your email body:
1. Get the MIME boundary string
We can use a regular expression to search your headers (let's assume they're in a separate variable, $headers):
$matches = array();
preg_match('#Content-Type: multipart/[^;]+;\s*boundary="([^"]+)"#i', $headers, $matches);
list(, $boundary) = $matches;
The regular expression searches for the Content-Type header that contains the boundary string and captures the boundary into the first capture group. We then copy that capture group into the variable $boundary.
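As a quick check of this step, here is a minimal sketch that runs the same pattern against a made-up header block. The header values and the boundary string abc123 are invented for illustration, and the pattern assumes the boundary is quoted, as in the expression above.

```php
<?php
// Hypothetical sample headers; the boundary value is made up for this example.
$headers = "From: alice@example.com\r\n"
         . "Subject: Test\r\n"
         . "Content-Type: multipart/alternative; boundary=\"abc123\"\r\n";

$matches = array();
$boundary = '';
if (preg_match('#Content-Type: multipart/[^;]+;\s*boundary="([^"]+)"#i', $headers, $matches)) {
    // Capture group 1 holds the text between the quotes.
    $boundary = $matches[1];
}

echo $boundary, "\n"; // prints "abc123"
```

Checking the return value of preg_match before reading the capture group avoids an undefined-index warning when the message is not multipart.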
2. Split the email body into segments
Once we have the boundary, we can split the body into its various parts (in your message body, the boundary will be prefaced by -- each time it appears). According to the MIME spec, everything before the first boundary should be ignored.
$email_segments = explode('--' . $boundary, $message);
array_shift($email_segments); // drop everything before the first boundary
This will leave us with an array containing all the segments, with everything before the first boundary ignored.
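To see what the split produces, the sketch below applies it to a small two-part message. The message contents and the boundary abc123 are made up for illustration.

```php
<?php
// Hypothetical two-part message body; boundary and contents are made up.
$boundary = 'abc123';
$message = "This is a multi-part message in MIME format.\r\n"
         . "--abc123\r\n"
         . "Content-Type: text/plain\r\n\r\n"
         . "Hello in plain text\r\n"
         . "--abc123\r\n"
         . "Content-Type: text/html\r\n\r\n"
         . "<p>Hello in HTML</p>\r\n"
         . "--abc123--\r\n";

$email_segments = explode('--' . $boundary, $message);
array_shift($email_segments); // drop the preamble before the first boundary

// The closing delimiter "--abc123--" leaves one trailing piece ("--\r\n").
echo count($email_segments), "\n"; // prints 3
```

The last element is only the remainder of the closing delimiter, so it never contains a Content-Type header and is skipped naturally by the search in the next step.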
3. Determine which segment is plain text
The segment that is plain text will have a Content-Type header with the MIME type text/plain. We can now search the segments for the first one with that header:
foreach ($email_segments as $segment)
{
    if (stristr($segment, "Content-Type: text/plain") !== false)
    {
        // We found the segment we're looking for!
    }
}
Since what we're looking for is a constant, we can use stristr (which finds the first instance of a substring in a string, case-insensitively) instead of a regular expression. If the Content-Type header is found, we've got our segment.
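The search can also be wrapped in a small function. This is only a sketch: find_plain_text_segment is a hypothetical name, and the segments are made up to show that the match is case-insensitive.

```php
<?php
// Hypothetical helper: returns the first segment containing a text/plain
// Content-Type header, or null if none is found.
function find_plain_text_segment(array $segments)
{
    foreach ($segments as $segment) {
        if (stristr($segment, "Content-Type: text/plain") !== false) {
            return $segment;
        }
    }
    return null;
}

// Made-up segments for illustration.
$email_segments = array(
    "\r\nContent-Type: text/html\r\n\r\n<p>Hi</p>\r\n",
    "\r\ncontent-type: TEXT/PLAIN; charset=utf-8\r\n\r\nHi\r\n",
);

// Matches the second segment despite the different letter case.
$segment = find_plain_text_segment($email_segments);
```

Returning null when nothing matches lets the caller distinguish "no plain text part" from an empty one.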
4. Remove any headers from the segment
Now we need to remove any headers from the segment we found, as we only want the actual message content. There are four MIME headers that can appear here: Content-Type as we saw before, Content-ID, Content-Disposition and Content-Transfer-Encoding. Headers are terminated by \r\n, so we can use that to determine the end of the headers:
$text = preg_replace('/Content-(Type|ID|Disposition|Transfer-Encoding):.*?\r\n/is', "", $segment);
The s modifier at the end of the regular expression makes the dot match newlines as well. .*? will collect as few characters as possible (i.e. everything up to \r\n); the ? is a lazy modifier on .*.
After this point, $text will contain your email message content.
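As an illustration of the header removal, the sketch below runs the replacement on a made-up segment. It assumes each header sits on a single line ending in \r\n.

```php
<?php
// Made-up segment with two MIME headers followed by the message text.
$segment = "\r\n"
         . "Content-Type: text/plain; charset=utf-8\r\n"
         . "Content-Transfer-Encoding: 7bit\r\n"
         . "\r\n"
         . "Hello in plain text\r\n";

// Remove each listed header line up to and including its CRLF.
$text = preg_replace('/Content-(Type|ID|Disposition|Transfer-Encoding):.*?\r\n/is', "", $segment);

// trim() drops the blank lines left around the content.
$text = trim($text);

echo $text, "\n"; // prints "Hello in plain text"
```

Two limitations are worth keeping in mind: a header folded across several lines would only lose its first line, and because the pattern is not anchored to the header block, a body line containing one of these header names would be stripped too.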
So to put it all together with your code:
<?php
// read from stdin
$fd = fopen("php://stdin", "r");
$email = "";
while (!feof($fd))
{
    $email .= fread($fd, 1024);
}
fclose($fd);

// find the MIME boundary string
$matches = array();
preg_match('#Content-Type: multipart/[^;]+;\s*boundary="([^"]+)"#i', $email, $matches);
$boundary = isset($matches[1]) ? $matches[1] : "";

$text = "";
if (!empty($boundary)) // did we find a boundary?
{
    $email_segments = explode('--' . $boundary, $email);
    foreach ($email_segments as $segment)
    {
        if (stristr($segment, "Content-Type: text/plain") !== false)
        {
            // strip the MIME headers, leaving only the message content
            $text = trim(preg_replace('/Content-(Type|ID|Disposition|Transfer-Encoding):.*?\r\n/is', "", $segment));
            break;
        }
    }
}

// At this point, $text will either contain your plain text body,
// or be an empty string if a plain text body couldn't be found.
$savefile = "savehere.txt";
$sf = fopen($savefile, 'a') or die("can't open file");
fwrite($sf, $text);
fclose($sf);
?>