Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
324 views
in Technique[技术] by (71.8m points)

Read a file in buffer from FTP python

I am trying to read a file from an FTP server. The file is a .gz file. I would like to know if I can perform actions on this file while the socket is open. I tried to follow what was mentioned in two StackOverflow questions on reading files without writing to disk and reading files from FTP without downloading but was not successful.

I know how to extract data/work on the downloaded file but I'm not sure if I can do it on the fly. Is there a way to connect to the site, get data in a buffer, possibly do some data extraction and exit?

When trying StringIO I got the error:

>>> from ftplib import FTP
>>> from StringIO import StringIO
>>> ftp = FTP('ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz')

Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
ftp = FTP('ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz')
File "C:Python27libftplib.py", line 117, in __init__
self.connect(host)
File "C:Python27libftplib.py", line 132, in connect
self.sock = socket.create_connection((self.host, self.port), self.timeout)
File "C:Python27libsocket.py", line 553, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
gaierror: [Errno 11004] getaddrinfo failed

I just need to know how can I get data into some variable and loop on it until the file from FTP is read.

I appreciate your time and help. Thanks!

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Make sure to login to the ftp server first. After this, use retrbinary which pulls the file in binary mode. It uses a callback on each chunk of the file. You can use this to load it into a string.

from ftplib import FTP
ftp = FTP('ftp.ncbi.nlm.nih.gov')
ftp.login() # Username: anonymous password: anonymous@

# Setup a cheap way to catch the data (could use StringIO too)
data = []
def handle_binary(more_data):
    data.append(more_data)

resp = ftp.retrbinary("RETR pub/pmc/PMC-ids.csv.gz", callback=handle_binary)
data = "".join(data)

Bonus points: how about we decompress the string while we're at it?

Easy mode, using data string above

import gzip
import StringIO
zippy = gzip.GzipFile(fileobj=StringIO.StringIO(data))
uncompressed_data = zippy.read()

Little bit better, full solution:

from ftplib import FTP
import gzip
import StringIO

ftp = FTP('ftp.ncbi.nlm.nih.gov')
ftp.login() # Username: anonymous password: anonymous@

sio = StringIO.StringIO()
def handle_binary(more_data):
    sio.write(more_data)

resp = ftp.retrbinary("RETR pub/pmc/PMC-ids.csv.gz", callback=handle_binary)
sio.seek(0) # Go back to the start
zippy = gzip.GzipFile(fileobj=sio)

uncompressed = zippy.read()

In reality, it would be much better to decompress on the fly but I don't see a way to do that with the built in libraries (at least not easily).


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...