CONCLUSION - 07-25-2021
After looking at this problem in more detail, I believe that it is NOT technically possible to use Python Requests to scrape the website and table in your question, which means that your question cannot be solved in the manner that you would prefer.
Why?
The website employs anti-scraping mechanisms. The GBK values are only one part of these mechanisms. The table that you're trying to scrape has 1504 pages. A new unique GBK value is created each time you navigate from one page to the next, so there are 1503 unique GBK values (one for each of pages 2 through 1504).
The site also uses a unique session management cookie for each page.
# page 1 cookie
JSESSIONID=0AC56294FE6857A236F0E68A9106E1AE.7; neCYtZEjo8GmS=5El51n08q7nzOG_bzzqhGWyfW_Lx9tCv1uZA6QjBcUq0rH0d1XYIvTKzN3MfNn2cZasqfZoM8Yo5NTpuq9gM.IG; neCYtZEjo8GmT=53HTPEbke3aQqqqm_6QLwIaUKu0tMygss.En464jhvNz1mMzbOatzmLLtv9x_xiCP6JaO_JzcbvHqtsnQYydBa6B_YjSg6sFm7cVBBOhB35_.TZuwDsbOnDinJkNwMs3AaMPtM83dP9YnogFKHpNJo5.RHMTKT6_XNPr0mxebR6stRrQ7LFfACcWqHHhbc.j6gZfZzxsgwnPE3RGP6aT9nYuMJbvK2EGrdAv0O12G03KTk_BMk.xLeEwrQq5VjyH1tB7t4wQ.jQ1geshvbDPCs8_VHCkd2.6uIag5Md.lngzeDshhSjMrmBjyy0HTqAXQ3; acw_tc=276aedd816272186939626726e424a5dd554d4b095225e2cac90fc6d2da583
# page 2 cookie
JSESSIONID=651AD12FD349FFB1842E08CA578EA37D.7; neCYtZEjo8GmS=5El51n08q7nzOG_bzzqhGWyfW_Lx9tCv1uZA6QjBcUq0rH0d1XYIvTKzN3MfNn2cZasqfZoM8Yo5NTpuq9gM.IG; neCYtZEjo8GmT=53HTPeKke3e7qqqm_6Q_YEqK9dBPNnJQF00YvHDMLHlJeb.4rrpTsgfwZxU0S5OXIAB2aduoOTmj7RuKIL.LUXRaRqfh5ZByuTFX3LxK1Ia3sr3V45c.PPx6Eas5EF5EkQztquzrX78QIbjrJUcQoKoOKcqgX5UuRIN0gCyGDyI6FFj.JbPhwYf65Hcx9BzDQnrlGAPHM3WGvmKf7OJnLY1SGIuxtdyVUE359Ll2lr0QJxUq1Dacqz_WsFa_ZantBbP7MklHX6J21wmDnyo6s4xCeeTYwsGq.kGUbE74Dx.QjQBCM_SiLKccTog8_EdBDg; acw_tc=276aedd816272186939626726e424a5dd554d4b095225e2cac90fc6d2da583
# page 3 cookie
JSESSIONID=2121D74E0EFCEC3BE104DAA2791481B6.7; neCYtZEjo8GmS=5El51n08q7nzOG_bzzqhGWyfW_Lx9tCv1uZA6QjBcUq0rH0d1XYIvTKzN3MfNn2cZasqfZoM8Yo5NTpuq9gM.IG; neCYtZEjo8GmT=53HTPeKke3e7qqqm_6wBfEGBZsTF9_uGtgepzPXNOzFh0RNtGcE1Cf4hEQNppVywcI5mk3SlLkzvNll6ovr4XmfL2Ujy3AFZR5leVY2H3_584At3GmIwmnsEjOx5v5e_lMon3AbX9t2W8UiLoK.9SBX0vgNRfkqdpyPjWKk3Zs8gQG0k3_6UwxGTvEwWkaWL8vquJgCGlvLEFTjNvd07eHiR482UfpLPFP6yAkx8Wi9pM79cL.26KE3U2L79hgBKLHyOdNyj3VKOkDsaXefNdPXd.YqT4kevShGxzMM2PuzqnuuQnW.GQ5mr9Rx8VxUjEa; acw_tc=276aedd816272186939626726e424a5dd554d4b095225e2cac90fc6d2da583
So you would need to acquire both the unique GBK value and the unique session management cookie for each page from 2 through 1504.
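To see which parts of the cookie actually change from page to page, the header strings can be parsed and compared. This is only a minimal sketch using plain string handling; the helper names parse_cookie_header and changed_cookie_names are hypothetical, and the short values in the usage lines are placeholders standing in for the full cookie strings listed above.

```python
def parse_cookie_header(header: str) -> dict:
    """Split a raw 'name=value; name=value' cookie string into a dict."""
    pairs = (item.split("=", 1) for item in header.split(";") if "=" in item)
    return {name.strip(): value.strip() for name, value in pairs}


def changed_cookie_names(first: str, second: str) -> list:
    """Return the sorted cookie names whose values differ between two headers."""
    a = parse_cookie_header(first)
    b = parse_cookie_header(second)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


# Placeholder values (NOT the real cookies) that mimic the pattern seen above.
page_1 = "JSESSIONID=AAA.7; neCYtZEjo8GmS=same; neCYtZEjo8GmT=one; acw_tc=fixed"
page_2 = "JSESSIONID=BBB.7; neCYtZEjo8GmS=same; neCYtZEjo8GmT=two; acw_tc=fixed"
print(changed_cookie_names(page_1, page_2))
```

In the three cookies listed above, JSESSIONID and neCYtZEjo8GmT differ on every page, while neCYtZEjo8GmS and acw_tc stay the same.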
I also noted that the site introduces some type of latency. The first page can take some time to load fully. If you attempt to navigate to another page before this load is complete, you will get the message "请勿频繁操作!" ("Please do not perform operations so frequently!").
It's worth noting that some pages took up to 2 minutes to load. When they didn't load, the message above was displayed.
As I mentioned previously, you should consider scraping this site with selenium, which might bypass the anti-scraping mechanisms.
UPDATE POST (SELENIUM) - 07-27-2021
I have attempted to scrape your target website with selenium. The chromedriver continually fails to connect to the site, even when these switches are used:
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option("useAutomationExtension", False)
chrome_options.add_experimental_option("excludeSwitches", ['enable-automation'])
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
I also tried using undetected-chromedriver, but that also failed to connect properly.
Even when I set a high timeout period with either driver.set_page_load_timeout() or driver.implicitly_wait(), the session still fails with chromedriver.
I was able to access the website when I used selenium with the geckodriver.
There are still timeout issues, but adding a WebDriverWait coupled with an expected_conditions check seems to overcome some of them.
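A rough sketch of the geckodriver approach with WebDriverWait and expected_conditions might look like the following. It is not the exact code I ran: the function names, the url argument, and the locator_id argument are assumptions (the real element ID is hidden behind the obfuscated JavaScript), and the 120 second default simply mirrors the load times mentioned above.

```python
THROTTLE_MESSAGE = "请勿频繁操作!"


def is_throttled(page_source: str) -> bool:
    """Return True if the page contains the site's rate-limit message."""
    return THROTTLE_MESSAGE in page_source


def load_page(url: str, locator_id: str, timeout: int = 120) -> str:
    """Open url in Firefox, wait for an element to appear, return the HTML.

    url and locator_id are hypothetical; supply the real values yourself.
    """
    # Imported here so is_throttled() can be used without selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # needs geckodriver available
    try:
        driver.get(url)
        try:
            WebDriverWait(driver, timeout).until(
                EC.presence_of_element_located((By.ID, locator_id))
            )
        except TimeoutException:
            pass  # fall through and inspect whatever was rendered
        source = driver.page_source
    finally:
        driver.quit()

    if is_throttled(source):
        raise RuntimeError("rate-limit message shown; wait before retrying")
    return source
```

Checking the returned HTML for the throttle message makes it easier to tell a slow page apart from a blocked one.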
Scraping this website will be a long and arduous process, because of the anti-scraping mechanisms being used.
ORIGINAL POST - 07-24-2021
First, let me say that your question doesn't have enough detail for me to make a solid recommendation for solving your issue.
I looked into your problem and found that your target website uses some JavaScript to dynamically create the GBK value that is used in each POST request.
6SQk6G2z:GBK-5lkb7acLMDDxywZsCHoJagJlT50f1gw4.jaVgaBpBcGZDs1T_pcR_OPFgvOm_6oM8PfyL3L6xDPxFqgIqgwbVAEw8y4jd0P5yTWo3dx1cNLnCOYTa4mVr7azAXa9YiDEhOz7M1Qsw6BJIOSq0QVp.Ng.NWri7ByAK6dwme99ZEOnjraxZex1xLVGakyVVCoOEhFGfphV8D1GDFKLt1dG.4_XuCPDIoLNGmy4Dzd92SxlNWCQ707A8tvqP7jQq2wyRBV0M3y0moSs8I03rIXeYNKE3AkMmI8Xp4M6GZd0seJqGvGrN7vA8lJbiBfmEgtcSvPZF0hrfkVRvQGq9uHRx9JOLtdkujsYHk6TW7rYBVsQ
This GBK value is used when navigating between pages 1 through 1504. I noted that the value changes for each page.
import difflib
# page 2
a = "6SQk6G2z:GBK-5lkb7acLMDDxywZsCHoJagJlT50f1gw4.jaVgaBpBcGZDs1T_pcR_OPFgvOm_6oM8PfyL3L6xDPxFqgIqgwbVAEw8y4jd0P5yTWo3dx1cNLnCOYTa4mVr7azAXa9YiDEhOz7M1Qsw6BJIOSq0QVp.Ng.NWri7ByAK6dwme99ZEOnjraxZex1xLVGakyVVCoOEhFGfphV8D1GDFKLt1dG.4_XuCPDIoLNGmy4Dzd92SxlNWCQ707A8tvqP7jQq2wyRBV0M3y0moSs8I03rIXeYNKE3AkMmI8Xp4M6GZd0seJqGvGrN7vA8lJbiBfmEgtcSvPZF0hrfkVRvQGq9uHRx9JOLtdkujsYHk6TW7rYBVsQ"
# page 1504
b = "6SQk6G2z:GBK-59tY9cXfYPiYfpgB1rj16jFZNwQuke.NUV5ZljqD6daOH4pxgaFcRE7bERjrvfoY4OTl5PAWUo70VNRIqnYOi_TQCSWzvrcCgfTtEFl_ZdMHRVLhosJLSFwHiPdVn4cXZ7VnF5xahstqJHD6EBfd71iZT8HQBmx1dssd7RWA2Gdv8lGhJbS0ZeaxIVkfK5qaO.lxHVvG_9cq4weBdHeUQlGlIWhxKFYePkTr9Jp0eN2yDTZljeX0XWWOxIjEkdj89FOqaNDB2slUE.54oC96baGe7lttoz_2AoTbjHSTjfDh.eSyT6vA6.5dP5X.4XsFVYSnYKIznIdkjTURmm3kjvGM_iQoYT3V5gAKs1c6r6cE"
s = difflib.SequenceMatcher(None, a, b, autojunk=False)
for tag, i1, i2, j1, j2 in s.get_opcodes():
    if tag != 'equal':
        print('{:7} a[{}:{}] --> b[{}:{}] {!r:>8} --> {!r}'.format(
            tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2]))
# output
insert a[14:14] --> b[14:49] '' --> '9tY9cXfYPiYfpgB1rj16jFZNwQuke.NUV5Z'
replace a[15:18] --> b[50:55] 'kb7' --> 'jqD6d'
replace a[19:24] --> b[56:60] 'cLMDD' --> 'OH4p'
delete a[25:49] --> b[61:61] 'ywZsCHoJagJlT50f1gw4.jaV' --> ''
insert a[51:51] --> b[63:158] '' --> 'FcRE7bERjrvfoY4OTl5PAWUo70VNRIqnYOi_TQCSWzvrcCgfTtEFl_ZdMHRVLhosJLSFwHiPdVn4cXZ7VnF5xahstqJHD6E'
insert a[52:52] --> b[159:243] '' --> 'fd71iZT8HQBmx1dssd7RWA2Gdv8lGhJbS0ZeaxIVkfK5qaO.lxHVvG_9cq4weBdHeUQlGlIWhxKFYePkTr9J'
insert a[53:53] --> b[244:276] '' --> '0eN2yDTZljeX0XWWOxIjEkdj89FOqaND'
replace a[54:55] --> b[277:291] 'c' --> '2slUE.54oC96ba'
replace a[56:57] --> b[292:311] 'Z' --> 'e7lttoz_2AoTbjHSTjf'
insert a[58:58] --> b[312:369] '' --> 'h.eSyT6vA6.5dP5X.4XsFVYSnYKIznIdkjTURmm3kjvGM_iQoYT3V5gAK'
delete a[60:63] --> b[371:371] 'T_p' --> ''
delete a[64:74] --> b[372:372] 'R_OPFgvOm_' --> ''
replace a[75:84] --> b[373:374] 'oM8PfyL3L' --> 'r'
replace a[85:99] --> b[375:376] 'xDPxFqgIqgwbVA' --> 'c'
delete a[100:377] --> b[377:377] 'w8y4jd0P5yTWo3dx1cNLnCOYTa4mVr7azAXa9YiDEhOz7M1Qsw6BJIOSq0QVp.Ng.NWri7ByAK6dwme99ZEOnjraxZex1xLVGakyVVCoOEhFGfphV8D1GDFKLt1dG.4_XuCPDIoLNGmy4Dzd92SxlNWCQ707A8tvqP7jQq2wyRBV0M3y0moSs8I03rIXeYNKE3AkMmI8Xp4M6GZd0seJqGvGrN7vA8lJbiBfmEgtcSvPZF0hrfkVRvQGq9uHRx9JOLtdkujsYHk6TW7rYBVsQ' --> ''
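The first opcode above starts at index 14, which suggests that only a short leading prefix is shared between the two values. A small sketch to confirm this, using just the first 24 characters of the page 2 and page 1504 values shown above (the variable names are my own):

```python
import os

# First 24 characters of the page 2 and page 1504 values from above.
page_2_head = "6SQk6G2z:GBK-5lkb7acLMDD"
page_1504_head = "6SQk6G2z:GBK-59tY9cXfYPi"

# commonprefix works character by character on any list of strings.
shared = os.path.commonprefix([page_2_head, page_1504_head])
print(shared, len(shared))
```

Everything after that prefix differs, so there is no obvious stable portion of the value that could be reused across pages.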
The GBK value is created with this call in the HTML of the page.
javascript:commitForECMA(callbackC,"content.jsp?tableId=27&tableName=TABLE27&tableView=杩涘彛鍖荤枟鍣ㄦ浜у搧锛堟敞鍐?&Id=60456",null)
This is the JavaScript that is called.
function commitForECMA($_17, $_12, $_19) {
    request = createXMLHttp();
    request.onreadystatechange = $_17;
    if ($_19 == null) {
        _$b6(request, _$JI('ZM6r2MG'), _$JI("Op0YV"), $_12);
        request.setRequestHeader(_$JI("RACeXwDYXwcTV8Ur2"), _$JI("9wDYgwceLwDT7iCYX3Ce9FKyvHKwPFa"));
    } else {
        var $_16 = "";
        var $_11 = $_19.elements;
        var $_14 = $_11.length;
        for (var $_4 = 0; $_4 < $_14; $_4++) {
            var $_6 = _$kH($_11, $_4);
            if ($_6.type != _$JI("aQ6YPMK20") && _$kH($_6, _$JI('Cwbm7wKV')) != "") {
                if ($_16.length > 0) {
                    $_16 += "&" + $_6.name + "=" + _$kH($_6, _$JI('swbm7wKV'));
                } else {
                    $_16 += $_6.name + "=" + _$kH($_6, _$JI('8wbm7wKV'));
                }
                $_16 += _$JI("xx2J03Up2Hsl");
            }
        }
        _$b6(request, _$JI('iM6r2MG'), _$JI("IVlesYq"), $_12);
        $_16 = encodeURI($_16);
        $_16 = encodeURI($_16);
        request.setRequestHeader(_$JI("53CmOFDVz3CeXwoxBMq"), _$JI("wMbZz3CmOFDV"));
        request.setRequestHeader(_$JI("ZACeXwDYXwcTV8Ur2"), _$JI("F3UraMD2O3UpNMCgB8cT6w6QzRbenM1TTQbS2MbJBRDY9"));
    }
    request.send($_16);
    if ($_19 != null) {
        $_19.reset();
    }
}
truncated....
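One detail worth noting in the form branch of commitForECMA is that the request body is passed through encodeURI twice. If you ever tried to reproduce such a body outside the browser, a rough Python approximation could look like this. It is a sketch only: the function names are hypothetical, the field name in the usage line is made up, and the safe-character list is my reading of which characters encodeURI leaves unescaped.

```python
from urllib.parse import quote

# Characters that JavaScript's encodeURI leaves unescaped, in addition to
# the letters, digits and "_.-~" that quote() never escapes.
ENCODE_URI_SAFE = ";/?:@&=+$,#!*'()"


def encode_uri(text: str) -> str:
    """Approximate JavaScript encodeURI() using UTF-8 percent-encoding."""
    return quote(text, safe=ENCODE_URI_SAFE)


def double_encode_uri(text: str) -> str:
    """Apply encode_uri twice, as the form branch of commitForECMA does."""
    return encode_uri(encode_uri(text))


# Hypothetical form field; the "%" from the first pass becomes "%25".
print(double_encode_uri("name=名"))
```

The second pass escapes the percent signs produced by the first, so non-ASCII field values arrive at the server double-encoded.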
function createXMLHttp() {
    if (window.XMLHttpRequest) {
        return new XMLHttpRequest();
    } else if (window.ActiveXObject) {
        var $_17 = [_$JI("5sYJ3sVanh2fJslf0woqXJ1ga"), _$JI("osYJ3sVanh2fJslf0woqXJcga"), _$JI("ZsYJ3sVanh2fJslf0woqXWnga"), _$JI("fsYJ3sVanh2fJslf0woq"), _$JI("3sK2OQbeuMCR0h2fJslf0woq")];
        for (var $_16 = 0; $_16 < $_17.length; $_16++) {
            try {
                return new ActiveXObject(_$kH($_17, $_16));
            } catch ($_19) {}
        }
        throw new Error("您的浏览器不支持访问此网页");
    }
}
truncated....
function callback() {
    if (request.readyState == 1) {
        _$_J(document.getElementById(_$JI("x3CeXwDYXwq")), '=', _$JI('3FKyXRUxEYlTW'), _$JI("EHDxnHOaB3vE5HDxnHOSNMKQGQ6xOHK2z3Kw2Qne7MCm9FKyvhbwNROg"));
    }
    if (request.readyState == 4) {
        if (request.status == 200) {
            oldContent.length = 0;
            oldContent[0] = request.responseText;
            _$_J(document.getElementById(_$JI("H3CeXwDYXwq")), '=', _$JI('OFKyXRUxEYlTW'), request.responseText);
            request = null;
        } else {
            _$_J(document.getElementById(_$JI("w3CeXwDYXwq")), '=', _$JI('eFKyXRUxEYlTW'), "<br><br><br><span style=font-size:x-large;color:#215add>请勿频繁操作!</span>");