I have a Flask app that takes a URL from the user, crawls that website, and returns the links found on it. Previously, I had an issue where the crawler would only run once and wouldn't run again after that. I solved that by using CrawlerRunner instead of CrawlerProcess.
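For context, here is a minimal sketch of the difference (MySpider and example.com are just placeholders, not my real spider): CrawlerProcess starts and stops the Twisted reactor itself, and the reactor cannot be restarted, while CrawlerRunner only returns a Deferred and leaves the reactor to the caller, so crawls can be scheduled again.

from scrapy.spiders import Spider
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor

class MySpider(Spider):
    name = "my_spider"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        yield {"url": response.url}

runner = CrawlerRunner()
d = runner.crawl(MySpider)           # schedules a crawl and returns a Deferred
d.addBoth(lambda _: reactor.stop())  # stop the reactor once the crawl finishes
reactor.run()                        # CrawlerProcess does this start/stop internally,
                                     # so a second crawl in the same process fails

This is what my code looks like: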
from flask import Flask, render_template, request, redirect, url_for, session, make_response
from flask_executor import Executor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor
from urllib.parse import urlparse
from uuid import uuid4
import urllib3, requests, urllib.parse, sys
app = Flask(__name__)
executor = Executor(app)
http = urllib3.PoolManager()
runner = CrawlerRunner()
list = set([])
list_validate = set([])
list_final = set([])
@app.route('/', methods=["POST", "GET"])
def index():
    if request.method == "POST":
        url_input = request.form["usr_input"]

        # Modifying URL
        if 'https://' in url_input and url_input[-1] == '/':
            url = str(url_input)
        elif 'https://' in url_input and url_input[-1] != '/':
            url = str(url_input) + '/'
        elif 'https://' not in url_input and url_input[-1] != '/':
            url = 'https://' + str(url_input) + '/'
        elif 'https://' not in url_input and url_input[-1] == '/':
            url = 'https://' + str(url_input)

        # Validating URL
        try:
            response = requests.get(url)
            error = http.request("GET", url)
            if error.status == 200:
                parse = urlparse(url).netloc.split('.')
                base_url = parse[-2] + '.' + parse[-1]
                start_url = [str(url)]
                allowed_url = [str(base_url)]

                # Crawling links
                class Crawler(CrawlSpider):
                    name = "crawler"
                    start_urls = start_url
                    allowed_domains = allowed_url
                    rules = [Rule(LinkExtractor(), callback='parse_links', follow=True)]

                    def parse_links(self, response):
                        base_url = url
                        href = response.xpath('//a/@href').getall()
                        list.add(urllib.parse.quote(response.url, safe=':/'))
                        for link in href:
                            if base_url not in link:
                                list.add(urllib.parse.quote(response.urljoin(link), safe=':/'))
                        for link in list:
                            if base_url in link:
                                list_validate.add(link)

                def start_spider():
                    d = runner.crawl(Crawler)

                    def start(d):
                        for link in list_validate:
                            error = http.request("GET", link)
                            if error.status == 200:
                                list_final.add(link)
                        original_stdout = sys.stdout
                        with open('templates/file.txt', 'w') as f:
                            sys.stdout = f
                            for link in list_final:
                                print(link)
                        sys.stdout = original_stdout

                    d.addCallback(start)

                def run():
                    reactor.run(0)

                unique_id = uuid4().__str__()
                executor.submit_stored(unique_id, start_spider)
                executor.submit(run)
                return redirect(url_for('crawling', id=unique_id))

            elif error.status != 200:
                return render_template('index.html')

        except requests.ConnectionError as exception:
            return render_template('index.html')

    else:
        return render_template('index.html')

@app.route('/crawling-<string:id>')
def crawling(id):
    if not executor.futures.done(id):
        return render_template('start-crawl.html', refresh=True)
    else:
        executor.futures.pop(id)
        return render_template('finish-crawl.html')
I also have this code in start-crawl.html to refresh the page every 5 seconds:
{% if refresh %}
<meta http-equiv="refresh" content="5">
{% endif %}
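That refresh, together with the /crawling-<string:id> route above, is what polls the background job. As a standalone illustration of the pattern I'm relying on (slow_job is just a placeholder for the crawl/validate work), executor.futures.done(key) only reports whether the submitted callable has returned:

import time
from flask import Flask
from flask_executor import Executor

app = Flask(__name__)
executor = Executor(app)

def slow_job():
    time.sleep(5)  # stands in for the crawling/validating work
    return "finished"

with app.test_request_context():
    executor.submit_stored("job-1", slow_job)
    print(executor.futures.done("job-1"))  # False while slow_job is still running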
The problem is that it renders start-crawl.html only while it's crawling, not while it's validating. So basically, what is happening is it takes the URL and crawls it while rendering start-crawl.html, then it goes to finish-crawl.html while it's still validating.
I believe the issue could be in start_spider(), in the line d.addCallback(start). I think that because that line might be executing in the background, which I don't want. What I believe is happening is that in start_spider(), d = runner.crawl(Crawler) gets executed, and then d.addCallback(start) happens in the background, which is why it takes me to finish-crawl.html while it's validating. I want the entire function to be executed in the background, not just that part. That is why I have executor.submit_stored(unique_id, start_spider).
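To show what I mean about the callback firing later, here is a toy sketch outside the app (deferLater just stands in for runner.crawl(Crawler)): attaching a callback to a Deferred returns immediately, so start_spider() returns, and the stored future already looks "done", before the validation callback ever runs.

from twisted.internet import task, reactor

def validate(_):
    print("2. validation work runs here, later")

def start_spider():
    d = task.deferLater(reactor, 2, lambda: None)  # stands in for runner.crawl(Crawler)
    d.addCallback(validate)
    print("1. start_spider() has already returned; the future looks done")

start_spider()
reactor.callLater(4, reactor.stop)
reactor.run()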
I want this code to take a URL, then crawl and validate it while rendering start-crawl.html. Then, when it finishes, I want it to render finish-crawl.html.
Anyway, if that isn't the issue, does anyone know what it is and how to fix it? Please ignore the complexity of this code and anything that isn't a "programming convention". Thanks in advance to everyone.