Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


selenium - Duplicates removal in urls list

I would appreciate assistance with the following. I'm scraping laptop product info from Amazon. I have obtained the URLs and put them into a list using the following code:

urls_list=[]
import numpy as np
from collections import OrderedDict 
pages = np.arange(0,4)
for page in pages:
    time.sleep(3)
    page=driver.get('https://www.amazon.com/s?k=laptops&page='+str(page)+'&qid=1611094287&ref=sr_pg_')
    base_url='https://www.amazon.com/'
    sleep(randint(2,3))
    soup=BeautifulSoup(driver.page_source, 'html.parser')
    las=soup.find_all('h2',class_="a-size-mini a-spacing-none a-color-base s-line-clamp-2")
    base_url='https://www.amazon.com/'
for s in las:
    last= s.find_all('a',href=True)
    for links in last:
        #print(links['href'])
        was=links['href']
        was=was.replace('https:/','')
        base=(base_url+was)
        urls_list.append(base)
        print(urls_list)

It returns a list of URLs that contains a lot of duplicates. I would like help getting the URLs I need without the duplicates. Calling len() on the list returns 357 URLs instead of 66.

    'https://www.amazon.com//gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_next_aps_sr_pg4_1?ie=UTF8&adId=A00982583SKT9G1E8WEMD&url=%2FHigh-Performance-Ultra-Thin-Entertainment-pre-Installed-Professional%2Fdp%2FB079GQK6BC%2Fref%3Dsr_1_49_sspa%3Fdchild%3D1%26keywords%3Dlaptops%26qid%3D1611094287%26sr%3D8-49-spons%26psc%3D1&qualifier=1611274529&id=2949216330232123&widgetName=sp_atf_next']
['https://www.amazon.com//gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_next_aps_sr_pg4_1?ie=UTF8&adId=A00982583SKT9G1E8WEMD&url=%2FHigh-Performance-Ultra-Thin-Entertainment-pre-Installed-Professional%2Fdp%2FB079GQK6BC%2Fref%3Dsr_1_49_sspa%3Fdchild%3D1%26keywords%3Dlaptops%26qid%3D1611094287%26sr%3D8-49-spons%26psc%3D1&qualifier=1611274529&id=2949216330232123&widgetName=sp_atf_next', 'https://www.amazon.com//gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_next_aps_sr_pg4_1?ie=UTF8&adId=A01913892JJI6KO2E97JD&url=%2FStationaryLab-Aluminum-Organizer-Adjustable-Compatible%2Fdp%2FB08P38JJV8%2Fref%3Dsr_1_50_sspa%3Fdchild%3D1%26keywords%3Dlaptops%26qid%3D1611094287%26sr%3D8-50-spons%26psc%3D1&qualifier=1611274529&id=2949216330232123&widgetName=sp_atf_next']
['https://www.amazon.com//gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_next_aps_sr_pg4_1?ie=UTF8&adId=A00982583SKT9G1E8WEMD&url=%2FHigh-Performance-Ultra-Thin-Entertainment-pre-Installed-Professional%2Fdp%2FB079GQK6BC%2Fref%3Dsr_1_49_sspa%3Fdchild%3D1%26keywords%3Dlaptops%26qid%3D1611094287%26sr%3D8-49-spons%26psc%3D1&qualifier=1611274529&id=2949216330232123&widgetName=sp_atf_next', 'https://www.amazon.com//gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_next_aps_sr_pg4_1?ie=UTF8&adId=A01913892JJI6KO2E97JD&url=%2FStationaryLab-Aluminum-Organizer-Adjustable-Compatible%2Fdp%2FB08P38JJV8%2Fref%3Dsr_1_50_sspa%3Fdchild%3D1%26keywords%3Dlaptops%26qid%3D1611094287%26sr%3D8-50-spons%26psc%3D1&qualifier=1611274529&id=2949216330232123&widgetName=sp_atf_next', 'https://www.amazon.com//HP-Touchscreen-Dual-Core-Processor-14-ds0110nr/dp/B07RX2XV4N/ref=sr_1_51?dchild=1&keywords=laptops&qid=1611094287&sr=8-51']
['https://www.amazon.com//gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_next_aps_sr_pg4_1?ie=UTF8&adId=A00982583SKT9G1E8WEMD&url=%2FHigh-Performance-Ultra-Thin-Entertainment-pre-Installed-Professional%2Fdp%2FB079GQK6BC%2Fref%3Dsr_1_49_sspa%3Fdchild%3D1%26keywords%3Dlaptops%26qid%3D1611094287%26sr%3D8-49-spons%26psc%3D1&qualifier=1611274529&id=2949216330232123&widgetName=sp_atf_next', 'https://www.amazon.com//gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_next_aps_sr_pg4_1?ie=UTF8&adId=A01913892JJI6KO2E97JD&url=%2FStationaryLab-Aluminum-Organizer-Adjustable-Compatible%2Fdp%2FB08P38JJV8%2Fref%3Dsr_1_50_sspa%3Fdchild%3D1%26keywords%3Dlaptops%26qid%3D1611094287%26sr%3D8-50-spons%26psc%3D1&qualifier=1611274529&id=2949216330232123&widgetName=sp_atf_next', 'https://www.amazon.com//HP-Touchscreen-Dual-Core-Processor-14-ds0110nr/dp/B07RX2XV4N/ref=sr_1_51?dchild=1&keywords=laptops&qid=1611094287&sr=8-51', 'https://www.amazon.com//Traditional-Computers-Supports-Expansion-Personal/dp/B08HRVCN8K/ref=sr_1_52?dchild=1&keywords=laptops&qid=1611094287&sr=8-52']
['https://www.amazon.com//gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_next_aps_sr_pg4_1?ie=UTF8&adId=A00982583SKT9G1E8WEMD&url=%2FHigh-Performance-Ultra-Thin-Entertainment-pre-Installed-Professional%2Fdp%2FB079GQK6BC%2Fref%3Dsr_1_49_sspa%3Fdchild%3D1%26keywords%3Dlaptops%26qid%3D1611094287%26sr%3D8-49-spons%26psc%3D1&qualifier=1611274529&id=2949216330232123&widgetName=sp_atf_next', 'https://www.amazon.com//gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_next_aps_sr_pg4_1?ie=UTF8&adId=A01913892JJI6KO2E97JD&url=%2FStationaryLab-Aluminum-Organizer-Adjustable-Compatible%2Fdp%2FB08P38JJV8%2Fref%3Dsr_1_50_sspa%3Fdchild%3D1%26keywords%3Dlaptops%26qid%3D1611094287%26sr%3D8-50-spons%26psc%3D1&qualifier=1611274529&id=2949216330232123&widgetName=sp_atf_next', 'https://www.amazon.com//HP-Touchscreen-Dual-Core-Processor-14-ds0110nr/dp/B07RX2XV4N/ref=sr_1_51?dchild=1&keywords=laptops&qid=1611094287&sr=8-51', 'https://www.amazon.com//Traditional-Computers-Supports-Expansion-Personal/dp/B08HRVCN8K/ref=sr_1_52?dchild=1&keywords=laptops&qid=1611094287&sr=8-52', 'https://www.amazon.com//HP-Dual-Core-Processor-Bluetooth-Microsoft/dp/B08KLMKLZR/ref=sr_1_53?dchild=1&keywords=laptops&qid=1611094287&sr=8-53']
['https://www.amazon.com//gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_next_aps_sr_pg4_1?ie=UTF8&adId=A00982583SKT9G1E8WEMD&url=%2FHigh-Performance-Ultra-Thin-Entertainment-pre-Installed-Professional%2Fdp%2FB079GQK6BC%2Fref%3Dsr_1_49_sspa%3Fdchild%3D1%26keywords%3Dlaptops%26qid%3D1611094287%26sr%3D8-49-spons%26psc%3D1&qualifier=1611274529&id=2949216330232123&widgetName=sp_atf_next', 'https://www.amazon.com//gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_next_aps_sr_pg4_1?ie=UTF8&adId=A01913892JJI6KO2E97JD&url=%2FStationaryLab-Aluminum-Organizer-Adjustable-Compatible%2Fdp%2FB08P38JJV8%2Fref%3Dsr_1_50_sspa%3Fdchild%3D1%26keywords%3Dlaptops%26qid%3D1611094287%26sr%3D8-50-spons%26psc%3D1&qualifier=1611274529&id=2949216330232123&widgetName=sp_atf_next', 'https://www.amazon.com//HP-Touchscreen-Dual-Core-Processor-14-ds0110nr/dp/B07RX2XV4N/ref=sr_1_51?dchild=1&keywords=laptops&qid=1611094287&sr=8-51', 'https://www.amazon.com//Traditional-Computers-Supports-Expansion-Personal/dp/B08HRVCN8K/ref=sr_1_52?dchild=1&keywords=laptops&qid=1611094287&sr=8-52', 'https://www.amazon.com//HP-Dual-Core-Processor-Bluetooth-Microsoft/dp/B08KLMKLZR/ref=sr_1_53?dchild=1&keywords=laptops&qid=1611094287&sr=8-53', 'https://www.amazon.com//HP-Stream-Laptop-Intel-Renewed/dp/B08LDLRNGF/ref=sr_1_54?dchild=1&keywords=laptops&qid=1611094287&sr=8-54']
['https://www.amazon.com//gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_next_aps_sr_pg4_1?ie=UTF8&adId=A00982583SKT9G1E8WEMD&url=%2FHigh-Performance-Ultra-Thin-Entertainment-pre-Installed-Professional%2Fdp%2FB079GQK6BC%2Fref%3Dsr_1_49_sspa%3Fdchild%3D1%26keywords%3Dlaptops%26qid%3D1611094287%26sr%3D8-49-spons%26psc%3D1&qualifier=1611274529&id=2949216330232123&widgetName=sp_atf_next', 'https://www.amazon.com//gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_next_aps_sr_pg4_1?ie=UTF8&adId=A01913892JJI6KO2E97JD&url=%2FStationaryLab-Aluminum-Organizer-Adjustable-Compatible%2Fdp%2FB08P38JJV8%2Fref%3Dsr_1_50_sspa%3Fdchild%3D1%26keywords%3Dlaptops%26qid%3D1611094287%26sr%3D8-50-spons%26psc%3D1&qualifier=1611274529&id=2949216330232123&widgetName=sp_atf_next', 'https://www.amazon.com//HP-Touchscreen-Dual-Core-Processor-14-ds0110nr/dp/B07RX2XV4N/ref=sr_1_51?dchild=1&keywords=laptops&qid=1611094287&sr=8-51', 'https://www.amazon.com//Traditional-Computers-Supports-Expansion-Personal/dp/B08HRVCN8K/ref=sr_1_52?dchild=1&keywords=laptops&qid=1611094287&sr=8-52', 'https://www.amazon.com//HP-Dual-Core-Processor-Bluetooth-Microsoft/dp/B08KLMKLZR/ref=sr_1_53?dchild=1&keywords=laptops&qid=1611094287&sr=8-53', 'https://www.amazon.com//HP-Stream-Laptop-Intel-Renewed/dp/B08LDLRNGF/ref=sr_1_54?dchild=1&keywords=laptops&qid=1611094287&sr=8-54', 'https://www.amazon.com//X3-Air-13-3-inch-180%C2%B0Rotation-Dual-band/dp/B087M2CQG6/ref=sr_1_55?dchild=1&keywords=laptops&qid=1611094287&sr=8-55']
['https://www.amazon.com//gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_next_aps_sr_pg4_1?ie=UTF8&adId=A00982583SKT9G1E8WEMD&url=%2FHigh-Performance-Ultra-Thin-Entertainment-pre-Installed-Professional%2Fdp%2FB079GQK6BC%2Fref%3Dsr_1_49_sspa%3Fdchild%3D1%26keywords%3Dlaptops%26qid%3D1611094287%26sr%3D8-49-spons%26psc%3D1&qualifier=1611274529&id=2949216330232123&widgetName=sp_atf_next', 'https://www.amazon.com//gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_next_aps_sr_pg4_1?ie=UTF8&adId=A01913892JJI6KO2E97JD&url=%2FStationaryLab-Aluminum-Organizer-Adjustable-Compatible%2Fdp%2FB08P38JJV8%2Fref%3Dsr_1_50_sspa%3Fdchild%3D1%26keywords%3Dlaptops%26qid%3D1611094287%26sr%3D8-50-spons%26psc%3D1&qualifier=1611274529&id=2949216330232123&widgetName=sp_atf_next', 'https://www.amazon.com//HP-Touchscreen-Dual-Core-Processor-14-ds0110nr/dp/B07RX2XV4N/ref=sr_1_51?dchild=1&keywords=laptops&qid=1611094287&sr=8-51', 'https://www.amazon.com//Traditional-Computers-Supports-Expansion-Personal/dp/B08HRVCN8K/ref=sr_1_52?dchild=1&keywords=laptops&qid=1611094287&sr=8-52', 'https://www.amazon.com//HP-Dual-Core-Processor-Bluetooth-Microsoft/dp/B08KLMKLZR/ref=sr_1_53?dchild=1&keywords=laptops&qid=1611094287&sr=8-53', 'https://www.amazon.com//HP-Stream-Laptop-Intel-Renewed/dp/B08LDLRNGF/ref=sr_1_54?dchild=1&keywords=laptops&qid=1611094287&sr=8-54', 'https://www.amazon.com//X3-Air-13-3-inch-180%C2%B0Rotation-Dual-band/dp/B087M2CQG6/ref=sr_1_55?dchild=1&keywords=laptops&qid=1611094287&sr=8-55', 'https://www.amazon.com//HP-i5-10210U-i7-8665U-Keyboard-Graphics/dp/B08QNGWYLJ/ref=sr_1_56?dchild=1&keywords=laptops&qid=1611094287&sr=8-56']
['https://www.amazon.com//gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_next_aps_sr_pg4_1?ie=UTF8&adId=A00982583SKT9G1E8WEMD&url=%2FHigh-Performance-Ultra-Thin-Entertainment-pre-Installed-Professional%2Fdp%2FB079GQK6BC%2Fref%3Dsr_1_49_sspa%3Fdchild%3D1%26keywords%3Dlaptops%26qid%3D1611094287%26sr%3D8-49-spons%26psc%3D1&qualifier=1611274529&id=2949216330232123&widgetName=sp_atf_next', 'https://www.amazon.com//gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_next_aps_sr_pg4_1?ie=UTF8&adId=A01913892JJI6KO2E97JD&url=%2FStationaryLab-Aluminum-Organizer-Adjustable-Compatible%2Fdp%2FB08P38JJV8%2Fref%3Dsr_1_50_sspa%3Fdchild%3D1%26keywords%3Dlaptops%26qid%3D1611094287%26sr%3D8-50-spons%26psc%3D1&qualifier=1611274529&id=2949216330232123&widgetName=sp_atf_next', 'https://www.amazon.com//HP-Touchscreen-Dual-Core-Processor-14-ds0110nr/dp/B07RX2XV4N/ref=sr_1_51?dchild=1&keywords=laptops&qid=1611094287&sr=8-51', 'https://www.amazon.com//Traditional-Computers-Supports-Expansion-Personal/dp/B08HRVCN8K/ref=sr_1_52?dchild=1&keywords=laptops&qid=1611094287&sr=8-52', 'https://www.amazon.com//HP-Dual-Core-Processor-Bluetooth-Microsoft/dp/B08KLMKLZR/ref=sr_1_53?dchild=1&keywords=laptops&qid=1611094287&sr=8-53', 'https://www.amazon.com//HP-Stream-Laptop-Intel-Renewed/dp/B08LDLRNGF/ref=sr_1_54?dchild=1&keywords=laptop


1 Reply


Remove duplicates from list

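To get rid of duplicates in a list, you can convert it into a dict, which gives you unique keys, and then convert it back into a list. Since Python 3.7 a dict keeps insertion order, so the first occurrence of each URL stays in place: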

urls_list = list(dict.fromkeys(urls_list))
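
As a small illustration of the one-liner above, the sketch below runs it on a made-up list. The sample URLs are placeholders, not real scraper output; the point is only that the first occurrence of each entry survives and the order is preserved.

```python
# Hypothetical sample data; the real list would come from the scraper.
urls_list = [
    "https://www.amazon.com/a/dp/B000000001",
    "https://www.amazon.com/b/dp/B000000002",
    "https://www.amazon.com/a/dp/B000000001",  # duplicate of the first entry
    "https://www.amazon.com/c/dp/B000000003",
    "https://www.amazon.com/b/dp/B000000002",  # duplicate of the second entry
]

# dict keys are unique and keep insertion order, so the first occurrence wins.
unique_urls = list(dict.fromkeys(urls_list))

print(len(urls_list), len(unique_urls))
for url in unique_urls:
    print(url)
```

A plain set(urls_list) would also drop the duplicates, but it makes no promise about the order of the result.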

Loops

I'm not sure how you got that number of URLs from your example. As posted, the second loop is not indented under the first one, so it runs only once, after the page loop has finished, and would give you only 22 URLs.
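
The effect of the indentation can be reproduced without a browser. In this sketch, fake_pages is a made-up stand-in for the search result pages: the first version mirrors the structure of the question's code, where the link loop sits after the page loop, and the second version nests it.

```python
# Hypothetical stand-in for the scraped pages: page number -> links found on it.
fake_pages = {
    0: ["/a", "/b"],
    1: ["/c", "/d"],
    2: ["/e", "/f"],
}

# Structure of the question's code: the second loop is not nested,
# so it only sees the value left over from the final iteration.
collected_flat = []
for page in fake_pages:
    links = fake_pages[page]
for link in links:
    collected_flat.append(link)

# Nested version: links are collected on every page.
collected_nested = []
for page in fake_pages:
    links = fake_pages[page]
    for link in links:
        collected_nested.append(link)

print(collected_flat)    # links from the last page only
print(collected_nested)  # links from all pages
```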

Take a look at my example. It should work, and it also avoids adding duplicates, for example those coming from the sponsored links.

Example

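from bs4 import BeautifulSoup
from selenium import webdriver
import numpy as np
import time

# Adjust the path to wherever chromedriver is installed on your machine.
# Selenium 4 deprecates executable_path in favor of a Service object.
driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
pages = np.arange(0,4)
base_url='https://www.amazon.com/'

urls_list=[]

for page in pages:
    time.sleep(3)
    # driver.get() returns None, so there is nothing to assign
    driver.get('https://www.amazon.com/s?k=laptops&page='+str(page)+'&qid=1611094287&ref=sr_pg_')
    time.sleep(2)
    soup=BeautifulSoup(driver.page_source, 'html.parser')

    # Nested inside the page loop, so the links of every page are collected
    for a in soup.select('h2 a'):
        final_url = base_url+a['href'].replace('https:/','')
        # Append only URLs that are not in the list yet
        if final_url not in urls_list:
            urls_list.append(final_url)

print(urls_list)
driver.close()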
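
The membership test in the example only catches URLs that are character-for-character identical. In the output shown in the question, sponsored results are wrapped in a picassoRedirect link whose url parameter holds the real product path, so the same product could still show up under different URLs. One possible refinement, sketched here as an assumption and not part of the original answer, is to reduce every link to its /dp/<ASIN> part before comparing. The helper name canonical_product_url, the regular expression, and the shortened sample hrefs are all illustrative.

```python
import re
from urllib.parse import urlparse, parse_qs

BASE_URL = "https://www.amazon.com"

def canonical_product_url(href):
    """Return BASE_URL/dp/<id> for a product link, or None if no /dp/ part is found."""
    parsed = urlparse(href)
    path = parsed.path
    # Sponsored links carry the real product path in the 'url' query parameter;
    # parse_qs already percent-decodes the value.
    if "picassoRedirect" in path:
        target = parse_qs(parsed.query).get("url", [""])[0]
        path = urlparse(target).path
    match = re.search(r"/dp/([A-Z0-9]{10})", path)
    return f"{BASE_URL}/dp/{match.group(1)}" if match else None

# Shortened hrefs modelled on the output in the question.
hrefs = [
    "/gp/slredirect/picassoRedirect.html/ref=x?ie=UTF8&url=%2FSome-Laptop%2Fdp%2FB079GQK6BC%2Fref%3Dsr_1_49_sspa%3Fdchild%3D1&id=1",
    "/HP-Touchscreen-Dual-Core-Processor-14-ds0110nr/dp/B07RX2XV4N/ref=sr_1_51?dchild=1&keywords=laptops",
    "/Some-Laptop/dp/B079GQK6BC/ref=sr_1_60?dchild=1",
]

seen = set()      # set membership checks stay fast as the list grows
urls_list = []    # keeps the order in which products were first seen
for href in hrefs:
    url = canonical_product_url(href)
    if url and url not in seen:
        seen.add(url)
        urls_list.append(url)

print(urls_list)
```

Whether collapsing sponsored and organic hits for the same product is desirable depends on what the scraped data is used for; if the sponsored entries matter, keep the original URL alongside the canonical one.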
