Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

javascript - Is there any way to download a CSV file from a “website button click” using Python?

I want to automate the download of a CSV file "Projects.csv" from this website:

https://www.vcsprojectdatabase.org/#/projects/st_/c_/ss_0/so_/di_/np_

The CSV can be downloaded manually by clicking the CSV icon, but I'm not sure how I can automate this download in Python and store the CSV file locally on my drive.

So far I have tried inspecting the button element in the Chrome developer console to find the correct URL in the Network tab, like so:

https://www.vcsprojectdatabase.org/services/publicViewServices/fetchProjectsExport

But I'm not sure if this URL should include the name of file at the end like this:

https://www.vcsprojectdatabase.org/services/publicViewServices/fetchProjectsExport/Projects.csv

This is what I have tried but it just writes a blank file:

import requests

url = 'https://www.vcsprojectdatabase.org/services/publicViewServices/fetchProjectsExport/Projects.csv'

r = requests.get(url)
with open('a.csv', 'wb') as f:
    f.write(r.content) 

How do I get the CSV file to properly download and save?


1 Reply

First of all, you should understand that the HTTP protocol is based on requests. Whatever the JavaScript on the page does, the end result is an HTTP request that makes the server respond with the file content. You need to "reverse" the web page: find out how the proper request is built and reproduce it as closely as possible.
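
To make the idea concrete, here is a minimal sketch of what a form submission boils down to. The URL and field names are hypothetical placeholders, not taken from the site, and nothing is sent over the network; the request is only built so it can be inspected.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and form fields, used only for illustration.
url = "https://example.com/export"
form_fields = {"format": "csv", "query": ""}

# A browser form submit encodes the fields as key=value pairs joined by "&".
body = urlencode(form_fields).encode("ascii")

# Build (but do not send) the request; supplying a body makes it a POST.
request = Request(
    url,
    data=body,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

print(request.get_method())          # POST
print(request.full_url)              # https://example.com/export
print(request.data.decode("ascii"))  # format=csv&query=
```

Once you know the method, the URL and the body, any HTTP client can repeat the request.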

So, let's try to do this step by step:

  1. Right-click the element that triggers the download and choose "Inspect element".
  2. In the source code you can see the name of the JavaScript function this element executes.
  3. Type the name of the function into the console without parentheses and click the button that appears next to the console's return value. (This button opens the JavaScript function in the source code.)
  4. In the source code we see that the function calls submit on the HTML element with the id frmDownload. So go back to the "Inspector" tab and type this id into the search box.
  5. Now we find that this element is an HTML form. The form sends a POST request to the URL https://www.vcsprojectdatabase.org/services/publicViewServices/fetchProjectsExport with the following data:

    searchTerm=
    country=
    sectoral_scope=0
    recentProjects=
    sort=projectId
    dir=DESC
    formatType=csv
    

    This information is enough to try reproducing the request in Python.
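
Steps 4 and 5 can also be done in code if you have the page's HTML at hand. The sketch below uses only the standard library. The markup in SAMPLE_HTML is an assumption, reconstructed from the form id, URL and field names listed above; the real page may be structured differently, for example with the fields filled in by JavaScript.

```python
from html.parser import HTMLParser

# Assumed markup: only the id, action URL and field names come from the steps above.
SAMPLE_HTML = """
<form id="frmDownload" method="post"
      action="https://www.vcsprojectdatabase.org/services/publicViewServices/fetchProjectsExport">
  <input type="hidden" name="searchTerm" value="">
  <input type="hidden" name="country" value="">
  <input type="hidden" name="sectoral_scope" value="0">
  <input type="hidden" name="recentProjects" value="">
  <input type="hidden" name="sort" value="projectId">
  <input type="hidden" name="dir" value="DESC">
  <input type="hidden" name="formatType" value="csv">
</form>
"""


class FormExtractor(HTMLParser):
    """Collect the method, action and input fields of one form, selected by id."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.in_form = False
        self.method = None
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") == self.form_id:
            self.in_form = True
            self.method = (attrs.get("method") or "get").upper()
            self.action = attrs.get("action")
        elif tag == "input" and self.in_form and attrs.get("name"):
            # A missing or empty value attribute is submitted as an empty string.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


extractor = FormExtractor("frmDownload")
extractor.feed(SAMPLE_HTML)
print(extractor.method, extractor.action)
print(extractor.fields)
```

The resulting fields dictionary has the same shape as the data dictionary used in the script that follows.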

Let's write a small script that builds and sends the same request and saves the result to a .csv file:

import requests

data = {
    "searchTerm": "",
    "country": "",
    "sectoral_scope": "0",
    "recentProjects": "",
    "sort": "projectId",
    "dir": "DESC",
    "formatType": "csv"
}

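# Send the same POST request the form sends; the dict is form-encoded automatically.
response = requests.post("https://www.vcsprojectdatabase.org/services/publicViewServices/fetchProjectsExport", data=data)
response.raise_for_status()  # stop on an HTTP error instead of saving an error page

# Write the raw response bytes to disk.
with open("res.csv", "wb") as f:
    f.write(response.content)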

Run it and it works: res.csv contains the expected result.
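
Since the original attempt produced a blank file, it can help to check the downloaded bytes before trusting them. The helper below is a hypothetical sketch using the standard csv module. The sample data and its column names are made up, and the UTF-8 encoding is an assumption about what the server returns.

```python
import csv
import io


def summarize_csv(content, encoding="utf-8"):
    """Return (header_row, data_row_count) for CSV bytes; raise if the body is empty."""
    if not content.strip():
        raise ValueError("empty response body; the request was probably not what the server expects")
    text = content.decode(encoding)
    rows = list(csv.reader(io.StringIO(text, newline="")))
    return rows[0], len(rows) - 1


# Made-up sample standing in for the response content.
sample = b"Project ID,Name\r\n1,Alpha\r\n2,Beta\r\n"
header, row_count = summarize_csv(sample)
print(header, row_count)
```

In the script above you would pass the response content to summarize_csv before writing it to res.csv.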

BUT THAT'S NOT ALL. Things are usually not this easy. To make our request look the same as the one the browser sends, we should look at the request headers. To capture the browser's HTTP request, open the "Network" tab.

Now press the download button on the web page to download the CSV file. The POST request now appears in the requests table. Click it and look at the "Request headers" section of the "Headers" tab.

There is a Cookie header, which for requests like this one usually does not matter and can be omitted. If the request fails without it, look through the earlier requests, find the one whose server response carries a Set-Cookie header, and repeat that request first.
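
If cookies do turn out to matter, a requests.Session can replay them for you. The sketch below assumes that simply loading the site's start page is what triggers the Set-Cookie response. That is a guess, not something observed on this site, so substitute whichever earlier request you actually find in the Network tab.

```python
import requests

# Assumption: the start page is the request that sets the cookie.
PAGE_URL = "https://www.vcsprojectdatabase.org/"
EXPORT_URL = "https://www.vcsprojectdatabase.org/services/publicViewServices/fetchProjectsExport"


def download_with_cookies(form_data, timeout=30):
    """Visit the page first so the session stores its cookies, then post the form."""
    with requests.Session() as session:
        # Any Set-Cookie headers in this response are kept in session.cookies.
        session.get(PAGE_URL, timeout=timeout)
        # The stored cookies are sent back automatically with this request.
        response = session.post(EXPORT_URL, data=form_data, timeout=timeout)
        response.raise_for_status()
        return response.content
```

Calling download_with_cookies(data) with the same data dictionary as before returns the bytes to write to res.csv.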

Let's improve our script by copying the important headers from the browser. Host, Content-Length and Connection are left out because the Python requests module adds them automatically; DNT and Upgrade-Insecure-Requests are not needed at all.

import requests

data = {
    "searchTerm": "",
    "country": "",
    "sectoral_scope": "0",
    "recentProjects": "",
    "sort": "projectId",
    "dir": "DESC",
    "formatType": "csv"
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    # "br" (Brotli) is dropped from the browser's value: requests can only decode it
    # when the optional brotli or brotlicffi package is installed.
    "Accept-Encoding": "gzip, deflate",
    "Referer": "https://www.vcsprojectdatabase.org/",
    "Content-Type": "application/x-www-form-urlencoded"
}

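# Same POST request as before, now with browser-like headers.
response = requests.post("https://www.vcsprojectdatabase.org/services/publicViewServices/fetchProjectsExport", data=data,
                         headers=headers)
response.raise_for_status()  # stop on an HTTP error instead of saving an error page

# Write the raw response bytes to disk.
with open("res.csv", "wb") as f:
    f.write(response.content)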
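
The header selection described above can also be automated. The helper below is a hypothetical sketch, not part of requests: it takes header text copied from the browser's "Request headers" section and drops the headers the answer says to leave out. Skipping Cookie as well is an assumption based on the remark that it usually does not matter. The sample text is illustrative.

```python
# Headers that requests adds itself, or that the server does not need.
SKIPPED = {
    "host", "content-length", "connection",
    "dnt", "upgrade-insecure-requests", "cookie",
}


def parse_copied_headers(raw):
    """Turn 'Name: value' lines into a dict, leaving out the SKIPPED headers."""
    headers = {}
    for line in raw.strip().splitlines():
        # Split on the first colon only; values such as URLs contain colons too.
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() in SKIPPED:
            continue
        headers[name.strip()] = value.strip()
    return headers


# Illustrative sample of text copied from the browser.
raw_headers = """
Host: www.vcsprojectdatabase.org
Accept-Language: en-US,en;q=0.5
Referer: https://www.vcsprojectdatabase.org/
Content-Type: application/x-www-form-urlencoded
Content-Length: 88
DNT: 1
Connection: keep-alive
Upgrade-Insecure-Requests: 1
"""

copied = parse_copied_headers(raw_headers)
print(copied)
```

The resulting dictionary can be passed as the headers argument of requests.post.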

P.S. Don't forget to ask the website owner for permission.

