The way I've done it is create fake responses, this way you can test the parse function offline. But you get the real situation by using real HTML.
A problem with this approach is that your local HTML file may not reflect the latest state online. So if the HTML changes online you may have a big bug, but your test cases will still pass. So it may not be the best way to test this way.
My current workflow is, whenever there is an error I will sent an email to admin, with the url. Then for that specific error I create a html file with the content which is causing the error. Then I create a unittest for it.
This is the code I use to create sample Scrapy http responses for testing from an local html file:
# scrapyproject/tests/responses/__init__.py
import os
from scrapy.http import Response, Request
def fake_response_from_file(file_name, url=None):
"""
Create a Scrapy fake HTTP response from a HTML file
@param file_name: The relative filename from the responses directory,
but absolute paths are also accepted.
@param url: The URL of the response.
returns: A scrapy HTTP response which can be used for unittesting.
"""
if not url:
url = 'http://www.example.com'
request = Request(url=url)
if not file_name[0] == '/':
responses_dir = os.path.dirname(os.path.realpath(__file__))
file_path = os.path.join(responses_dir, file_name)
else:
file_path = file_name
file_content = open(file_path, 'r').read()
response = Response(url=url,
request=request,
body=file_content)
response.encoding = 'utf-8'
return response
The sample html file is located in scrapyproject/tests/responses/osdir/sample.html
Then the testcase could look as follows:
The test case location is scrapyproject/tests/test_osdir.py
import unittest
from scrapyproject.spiders import osdir_spider
from responses import fake_response_from_file
class OsdirSpiderTest(unittest.TestCase):
def setUp(self):
self.spider = osdir_spider.DirectorySpider()
def _test_item_results(self, results, expected_length):
count = 0
permalinks = set()
for item in results:
self.assertIsNotNone(item['content'])
self.assertIsNotNone(item['title'])
self.assertEqual(count, expected_length)
def test_parse(self):
results = self.spider.parse(fake_response_from_file('osdir/sample.html'))
self._test_item_results(results, 10)
That's basically how I test my parsing methods, but its not only for parsing methods. If it gets more complex I suggest looking at Mox
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…