Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Wrong encoding of email attachment

I have a Python 2.7 script running on Windows. It logs in to Gmail, checks for new e-mails, and downloads their attachments:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

file_types = ["pdf", "doc", "docx"] # download attachments with these extensions

login = "login"
passw = "password"

imap_server = "imap.gmail.com"
foldr = "INBOX" # assumed mailbox name; the original post never defines foldr
smtp_server = "smtp.gmail.com"
smtp_port = 587

from smtplib import SMTP
from email.parser import HeaderParser
from email.MIMEText import MIMEText
import sys
import imaplib
import getpass
import email
import datetime
import os
import time

if __name__ == "__main__":
    try:
        while True:
            session = imaplib.IMAP4_SSL(imap_server)
            try:
                rv, data = session.login(login, passw)
                print "Logged in: ", rv
            except imaplib.IMAP4.error:
                print "Login failed!"
                sys.exit(1)

            rv, mailboxes = session.list()
            rv, data = session.select(foldr)
            rv, data = session.search(None, "(UNSEEN)")
            for num in data[ 0 ].split():
                rv, data = session.fetch(num, "(RFC822)")
                for rpart in data:
                    if isinstance(rpart, tuple):
                        msg = email.message_from_string(rpart[ 1 ])
                        to = email.utils.parseaddr(msg[ "From" ])[ 1 ]
                text = data[ 0 ][ 1 ]
                msg = email.message_from_string(text)
                got = []
                for part in msg.walk():
                    if part.get_content_maintype() == 'multipart':
                        continue
                    if part.get('Content-Disposition') is None:
                        continue
                    filename = part.get_filename()
                    print "file: ", filename
                    print "Extension: ", filename.split(".")[ -1 ]
                    if filename.split(".")[ -1 ] not in file_types:
                        continue
                    data = part.get_payload(decode = True)
                    if not data:
                        continue
                    date = datetime.datetime.now().strftime("%Y-%m-%d")
                    if not os.path.isdir("CONTENT"):
                        os.mkdir("CONTENT")
                    if not os.path.isdir("CONTENT/" + date):
                        os.mkdir("CONTENT/" + date)
                    ftime = datetime.datetime.now().strftime("%H-%M-%S")
                    new_file = "CONTENT/" + date + "/" + ftime + "_" + filename
                    f = open(new_file, 'wb')
                    print "Got new file %s from %s" % (new_file, to)
                    got.append(filename.encode("utf-8"))
                    f.write(data)
                    f.close()
            session.close()
            session.logout()
            time.sleep(60)
    except:
        print "TARFUN!"

The problem is that the last print outputs garbage, for example:
=?UTF-8?B?0YfQsNGB0YLRjCAxINGC0LXQutGB0YIg0LzQtdGC0L7QtNC40YfQutC4LmRv?=
so later checks don't work. On Linux it works just fine. So far I have tried decoding and encoding the filename as UTF-8, but that changed nothing. Thanks in advance.

1 Reply

If you read the spec that defines the filename field, RFC 2183, section 2.3, it says:

Current [RFC 2045] grammar restricts parameter values (and hence Content-Disposition filenames) to US-ASCII. We recognize the great desirability of allowing arbitrary character sets in filenames, but it is beyond the scope of this document to define the necessary mechanisms. We expect that the basic [RFC 1521] 'value' specification will someday be amended to allow use of non-US-ASCII characters, at which time the same mechanism should be used in the Content-Disposition filename parameter.

There are proposed RFCs to handle this. In particular, it's been suggested that filenames be handled as encoded-words, as defined by RFC 5987, RFC 2047, and RFC 2231. In brief this means either RFC 2047 format:

"=?" charset "?" encoding "?" encoded-text "?="

… or RFC 2231 format:

"=?" charset ["*" language] "?" encoded-text "?="

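To make the grammar concrete, here is a minimal sketch that splits an encoded-word into its fields with a regular expression. The names `ENCODED_WORD` and `split_encoded_word` are made up for illustration, and the pattern assumes the RFC 2047 layout with an optional `*language` suffix on the charset; it is not a complete validator.

```python
import re

# Assumed layout: =?charset[*language]?encoding?encoded-text?=
# where encoding is B (Base64) or Q (quoted-printable-like).
ENCODED_WORD = re.compile(
    r"=\?(?P<charset>[^?*]+)(?:\*(?P<language>[^?]+))?"
    r"\?(?P<encoding>[BbQq])\?(?P<text>[^?]*)\?="
)


def split_encoded_word(word):
    """Return a dict of the encoded-word's fields, or None if it does not match."""
    match = ENCODED_WORD.match(word)
    if match is None:
        return None
    return match.groupdict()


fields = split_encoded_word("=?UTF-8?B?0Yc=?=")
```

A plain ASCII filename such as `report.pdf` does not match the pattern, so the helper returns `None` and the caller can use the name unchanged.
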
Some mail agents already use this functionality; others don't know what to do with it. Message.get_filename() in the Python 2.x email package is among the latter: it hands the encoded-word back undecoded. (It's possible that the later version in Python 3.x does better, or that this may change in the future, but that won't help you if you want to stick with 2.x.) So, if you want the real filename, you have to decode it yourself.

In your example, the filename is in RFC 2047 format: the charset is UTF-8 (which is usable directly as a Python encoding name), the encoding is B, which means Base64, and the encoded text is 0YfQsNGB0YLRjCAxINGC0LXQutGB0YIg0LzQtdGC0L7QtNC40YfQutC4LmRv. So you have to Base64-decode that, then UTF-8-decode the result, and you get u'часть 1 текст методички.do'.
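
The two decoding steps can be sketched by hand as follows. The variable names are illustrative only, and the snippet avoids print statements so that it should run unchanged on Python 2.7 and Python 3.

```python
import base64

# The encoded-text field taken from the filename in the question.
encoded_text = "0YfQsNGB0YLRjCAxINGC0LXQutGB0YIg0LzQtdGC0L7QtNC40YfQutC4LmRv"

# Step 1: encoding "B" means Base64, which yields raw bytes.
raw_bytes = base64.b64decode(encoded_text)

# Step 2: interpret those bytes using the declared charset, UTF-8.
filename = raw_bytes.decode("utf-8")
```

After this, `filename` is a unicode string, so an extension check such as `filename.split(".")[-1]` operates on the real name instead of on the encoded-word.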

If you want to do this more generally, you're going to have to write code which tries to interpret each filename in RFC 2231 format if possible, in RFC 2047 format otherwise, and does the appropriate decoding steps. This code isn't trivial enough to write in a StackOverflow answer, but the basic idea is pretty simple, as demonstrated above, so you should be able to write it yourself. You may also want to search PyPI for existing implementations.
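
As a starting point for the RFC 2047 half of that, the standard library does ship `email.header.decode_header`, which parses encoded-words; `get_filename()` simply does not call it for you. Below is a minimal sketch built on it. `decode_filename` is a hypothetical helper name, the fallback to ASCII for parts without a charset is an assumption, and RFC 2231 parameter handling is not covered.

```python
from email.header import decode_header


def decode_filename(raw_name):
    """Decode RFC 2047 encoded-words in a filename into one unicode string.

    Hypothetical helper: an unknown charset name would raise LookupError,
    which a real script may want to catch.
    """
    parts = []
    for chunk, charset in decode_header(raw_name):
        # Chunks arrive as bytes for encoded parts; plain ASCII input may
        # already be text on Python 3.
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii", "replace")
        parts.append(chunk)
    return u"".join(parts)


name = decode_filename(
    "=?UTF-8?B?0YfQsNGB0YLRjCAxINGC0LXQutGB0YIg0LzQtdGC0L7QtNC40YfQutC4LmRv?="
)
```

In the script from the question, this would be applied to the value returned by `part.get_filename()` (after checking that it is not `None`) before the extension test.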

