I want to scrape data from this page (and pages similar to it): https://cereals.ahdb.org.uk/market-data-centre/historical-data/feed-ingredients.aspx
This page uses Power BI. Unfortunately, finding a way to scrape Power BI is hard, because everyone wants to scrape using/into Power BI, not from it. The closest answer was this question. Yet unrelated.
Firstly, I used Apache tika, and soon I realized the table data is been loading after loading the page. I need the rendered version of the page.
Therefore, I used Selenium. I wanted to Select All
at the begining (sending Ctrl+A
key combination), but it doesn't work. Maybe it is restricted by the page events (I also tried to remove all the events using developer tools, yet still Ctrl+A
doesn't work.
I also tried to read the HTML contents, but Power BI puts div
elements on the screen using position:absolute
and distinguishing the location of a div
in the table (both row and column) is an effortful activity.
Since Power BI uses JSON, I tried to read data from there. However it is so complicated I couldn't find out the rules. It seems it puts keywords somewhere and uses their indices in the table.
Note: I realized that all of the data is not loaded and even shown at the same time. A div
of class scroll-bar-part-bar
is responsible to act as a scroll bar, and moving that loads/shows other parts of the data.
The code I used to read data is as follows. As mentioned, the order of the produced data differs from what is rendered on the browser:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
options = webdriver.ChromeOptions()
options.binary_location = "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe"
driver = webdriver.Chrome(options=options, executable_path="C:/Drivers/chromedriver.exe")
driver.get("https://app.powerbi.com/view?r=eyJrIjoiYjVjM2MyNjItZDE1Mi00OWI1LWE5YWYtODY4M2FhYjU4ZDU1IiwidCI6ImExMmNlNTRiLTNkM2QtNDM0Ni05NWVmLWZmMTNjYTVkZDQ3ZCJ9")
parent = driver.find_element_by_xpath('//*[@id="pvExplorationHost"]/div/div/div/div[2]/div/div[2]/div[2]/visual-container[4]/div/div[3]/visual/div')
children = parent.find_elements_by_xpath('.//*')
values = [child.get_attribute('title') for child in children]
I appreciate solutions for any of the above problems. The most interesting for me though, is the convention of storing Power BI data in JSON format.
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…