web scraping google news with python

Question

Welcome To Ask or Share your Answers For Others

web scraping google news with python

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

web scraping google news with python

I am creating a web scraper for different news outlets, for Nytimes and the Guardian it was easy since they have their own API.

Now, I want to scrape results from this newspaper GulfTimes.com. They do not provide an advanced search in their website, so I resorted to Google news. However, Google news Api has been deprecated. What i want is to retrieve the number of results from an advanced search like keyword = "Egypt" and begin_date="10/02/2011" and end_date="10/05/2011".

This is feasible in the Google News UI just by putting the source as "Gulf Times" and the corresponding query and date and simply counting manually the number of results but when I try to do this using python, I get a 403 error which is understandable.

Any idea on how I would do this? Or is there another service besides Google news that would allow me to do this? Keeping in mind that I would issue almost 500 requests at once.

import json
import urllib2
import cookielib
import re
from bs4 import BeautifulSoup


def run():
   Query = "Egypt"
   Month = "3"
   FromDay = "2"
   ToDay = "4"
   Year = "13"
   url='https://www.google.com/search?pz=1&cf=all&ned=us&hl=en&tbm=nws&gl=us&as_q='+Query+'&as_occt=any&as_drrb=b&as_mindate='+Month+'%2F'+FromDay+'%2F'+Year+'&as_maxdate='+Month+'%2F'+ToDay+'%2F'+Year+'&tbs=cdr%3A1%2Ccd_min%3A3%2F1%2F13%2Ccd_max%3A3%2F2%2F13&as_nsrc=Gulf%20Times&authuser=0'
   cj = cookielib.CookieJar()
   opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
   request = urllib2.Request(url)   
   response = opener.open(request)
   htmlFile = BeautifulSoup(response)
   print htmlFile


run()

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:46:36+0000

You can use awesome requests library:

import requests

URL = 'https://www.google.com/search?pz=1&cf=all&ned=us&hl=en&tbm=nws&gl=us&as_q={query}&as_occt=any&as_drrb=b&as_mindate={month}%2F%{from_day}%2F{year}&as_maxdate={month}%2F{to_day}%2F{year}&tbs=cdr%3A1%2Ccd_min%3A3%2F1%2F13%2Ccd_max%3A3%2F2%2F13&as_nsrc=Gulf%20Times&authuser=0'


def run(**params):
    response = requests.get(URL.format(**params))
    print response.content, response.status_code


run(query="Egypt", month=3, from_day=2, to_day=2, year=13)

And you'll get status_code=200.

And, btw, take a look at scrapy project. Nothing makes web-scraping more simple than this tool.

Categories

web scraping google news with python

web scraping google news with python

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags