I have one email with html format and need to download it and need to make one csv semicolon field separator result to a new file.
Example of the email recieved:
Content-Type: text/html; charset=UTF-8
<b>Thu Jul 11 2019</b><hr><table style=3D"border=
: 1px solid #dddddd;border-collapse: collapse;text-align: left;"><tr><th st= yle=3D"padding: 8px;background-color: #cce6ff">Name</th><th styl=
e=3D"padding: 8px;background-color: #cce6ff">CI</th><th style=3D"padding: 8=
px;background-color: #cce6ff">DH</th><th style=3D"padding: 8px;backgro=
und-color: #cce6ff">FG</th><th style=3D"padding: 8px;background-color: #c=
ce6ff">Mon</th><th style=3D"padding: 8px;background-color: #cce6ff">DATE=
(UTC)</th></tr><tr><th style=3D"padding: 8px;">Arael Amarel</th><th style=
=3D"padding: 8px;">30549214</th><th style=3D"padding: 8px;">099981496</th><=
th style=3D"padding: 8px;">43</th><th style=3D"padding: 8px;">-</th><th sty=
le=3D"padding: 8px;">2019-07-11T10:06:34.311Z</th></tr><tr><th style=3D"pad=
ding: 8px;background-color: #dddddd">MATIN TARDEI</th><th style=3D"padding=
: 8px;background-color: #dddddd">45159820</th><th style=3D"padding: 8px;bac=
kground-color: #dddddd">094432451</th><th style=3D"padding:
8px;background-=
color: #dddddd">32</th><th style=3D"padding: 8px;background-color: #dddddd"=
-</th><th style=3D"padding: 8px;background-color: #dddddd">2019-07-
11T10:2=
8:41.198Z</th></tr>
Needed csv output:
Name;CI;DH;FG;Mon;DATE (UTC)
Arael Amarel;30549214;099981496;43;-;2019-07-11T10:06:34.311Z
MATIN TARDEI;45159820;094432451;32;-;2019-07-11T10:28:41.198Z
If i open this mail on Client there make the table all ok, but I think it′s there a problem of format with procmail if I put in .html file this content (saved by procmail) of procmail and open it it′s make impossible to process the content if I look this content all the end of line are marked with a "=" wich means a lot of problems, furtermore they are some serveral problems in the aligment of the table and other stuff which make it a nightmare to process the content to extact.
I had maked one procmailrc with the filter to convert the html format to plain
procmailrc file:
MAILDIR=/new/mail/htmlconvert
:0
* ^Content-Type: text/html.*;
{
:0c
$MAILDIR/converted/
:0fwb
| `which html2text`
:0fwh
| `which formail` -i "Content-Type: text/plain; charset=UTF-8"
}
This is a try number 1, didn't work the converter uses I tough html2text converter if I use html2text directly from the file originated de result is:
html2text
===============================================================================
1px solid #dddddd;border-collapse: collapse;text-align: left;">
px;background-color: #cce6ff">NAME
px;background-color: #cce6ff">CI
= px;background-color: #cce6ff">DH
px;backgro= und-color: #cce6ff">FG
px;background-color: #c= ce6ff">Mon
px;background-color: #cce6ff">DATE= (UTC)
px;">Arael Amarel
px;">30549214
px;">099981496
<= th style=3D"padding: 8px;">43
px;">-
px;">2019-07-11T10:06:34.311Z
px;background-color: #dddddd">MATIN TARDEI
8px;background-color: #dddddd">45159820
px;bac= kground-color: #dddddd">094432451
px;background-= color: #dddddd">32
px;background-color: #dddddd"= >-
px;background-color: #dddddd">2019-07-11T10:2= 8:41.198Z
px;">
Already tryied lynx -dump -force-html to the file and the result is′t nothing good to reach the format csv output.
html2text -nobs (file)
Name;CI;DH;FG;Mon;DATE (UTC)
Arael Amarel;30549214;099981496;43;-;2019-07-11T10:06:34.311Z
MATIN TARDEI;45159820;094432451;32;-;2019-07-11T10:28:41.198Z
Update:
I have applied the solution of tripleee to the procmailrc, however the format of the mail is still the same of the original source, the qprint
didn't change the format with this change. However have tried to make it directly to the file and works fine.
The actual solution:
qprint -d -n <1563019338.1197_0.localhost.localdomain |
html2text -style pretty |
awk '/^-------------------------------------------------------------------------------/{p=1}p'
The - line is the separator of the body of the mail and the before content, this shows out:
-------------------------------------------------------------------------------
NAME CI CD FG HJ DATE (UTC)
Yaiaa Fereeira 52104575 097325303 20 - 2019-07-12T10:46:24.716Z
Gabtiel Aosta Sclavi 42445135 098322361 42 - 2019-07-12T11:07:36.110Z
Need now to make this content to the csv out, I thought it will be more easy to the first part but want to automate it to the procmail to do it with the mail download.
The result of procmail changing the procmailrc is the mail with the body still having the "=" as line end, but in the header have:
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=UTF-8
Update
The email result source with qprint in the procrc
Return-Path:
Delivered-To:
Return-path:
Envelope-to:
Delivery-date: Sat, 13 Jul 2019 08:03:48 -0300
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=UTF-8
Date: Sat, 13 Jul 2019 11:03:02 +0000 (UTC)
From:
Mime-Version: 1.0
To:
Message-ID:
Subject:Fri Jul 12 2019
X-Spam-Flag: NO
<b>Fri Jul 12 2019</b><hr><table style=3D"border=
: 1px solid #dddddd;border-collapse: collapse;text-align: left;"><tr><th st=
yle=3D"padding: 8px;background-color: #cce6ff">NAME</th><th styl=
e=3D"padding: 8px;background-color: #cce6ff">CI</th><th style=3D"padding: 8=
px;background-color: #cce6ff">CD</th><th style=3D"padding: 8px;backgro=
und-color: #cce6ff">FG</th><th style=3D"padding: 8px;background-color: #c=
ce6ff">HJ</th><th style=3D"padding: 8px;background-color: #cce6ff">DATE=
(UTC)</th></tr><tr><th style=3D"padding: 8px;">Yaiaa Fereeira</th><th st=
yle=3D"padding: 8px;">52104575</th><th style=3D"padding: 8px;">097325303</t=
h><th style=3D"padding: 8px;">20</th><th style=3D"padding: 8px;">-</th><th =
style=3D"padding: 8px;">2019-07-12T10:46:24.716Z</th></tr>
I have the log in the stdin because procmail can`t write logfile as you can see in this log detail:
1 message for [email protected] at aaa.com (25330 octets).
reading message [email protected]@aaa.com:1 of 1 (25330 octets)........................procmail: Error while writing to "/info/in/log"
procmail: [20191] Mon Jul 15 08:55:34 2019
procmail: Assigning "FORMAIL=/usr/bin/formail"
procmail: Assigning "QPRINT=/usr/local/bin/qprint"
procmail: Match on "^Content-Type: text/html;"
procmail: Assigning "LASTFOLDER=converted/new/1563191734.20191_0.localhost.localdomain"
Subject: Sun Jul 14 2019
Folder: converted/new/1563191734.20191_0.localhost.localdomain 24985
procmail: Executing " qprint -d -n | html2text -nobs "
procmail: Executing " formail -I "Content-Type: text/html; charset=UTF-8"
procmail: Skipped "Mail"
procmail: Skipped "/"
From [email protected] Mon Jul 15 08:55:34 2019
Subject: Sun Jul 14 2019
Folder: **Bounced** 24985
fetchmail: MDA returned nonzero status 73
not flushed
See Question&Answers more detail:
os