Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
192 views
in Technique[技术] by (71.8m points)

python - Get element with a randomized class name

It looks like the <class id> for <img class> on Instagram's web page is changing every day. Right now it is FFVAD and tomorrow it will be something else. For example (I made it shorter, links are long):

<img class="FFVAD" alt="Tag your best friend" decoding="auto" style="" sizes="293px" src="https://scontent-lax3-2.cdninstagram.com/vp/0436c00a3ac9428b2b8c977b45abd022/5BAB3EBC/t51.2885-15/s640x640/sh0.08/e35/33110483_592294374461447_8669459880035221504_n.jpg">

By saying that, I need to fix the script and hardcode the Class ID in order to be able scrape the web-page.

var = driver.find_elements_by_class_name('FFVAD')

Somebody told me that I could use img.get_attribute('class') to find the class ID and store it for later. But I still don't understand how this can be achieved, so selenium or soup could grab the Class ID from the html tag and store or parse it later.

All I got now is this. It's little dirty, and not right, but the idea is there.

import requests
import selenium.webdriver as webdriver

url = ('https://www.instagram.com/kitties')
driver = webdriver.Firefox()
driver.get(url)
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    imgs_dedupe = driver.find_elements_by_class_name('FFVAD')

    for img in imgs_dedupe:
        posts = img.get_attribute('class')
        print posts

    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_delay)
    new_height = driver.execute_script("return document.body.scrollHeight")

    if new_height == last_height:
        break
    last_height = new_height

When I run it, I get this output, and because there are 3 images on the page, I get 3x Class ID

python tag_print.py 
FFVAD
FFVAD
FFVAD
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You're currently searching for the element by a hardcoded class name.

If the class name is randomized, you cannot hardcode it any longer. You have to either:

  • Search the element by some other characteristics (e.g. element hierarchy, some other attributes, etc; XPath can do that)

    In [10]: driver.find_elements_by_xpath('//article//img')
    Out[10]:
    [<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="1ab4eeb4-10c4-4da4-996c-ee6744445dcc", element="55c48964-8cd0-4472-b35b-214a5a9bfbf7")>,
     <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="1ab4eeb4-10c4-4da4-996c-ee6744445dcc", element="b7f7c8a4-e343-49ca-b416-49f72e67ae07")>,
     <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="1ab4eeb4-10c4-4da4-996c-ee6744445dcc", element="728f6148-6a03-4c9a-9933-36859d65eb51")>]
    
    • You can also search by the element's visual characteristics: size, visibility, position. This cannot be done solely by XPath though, you'll have to get all <img> tags and inspect each one with JS by hand.
      (See an example below because it's long.)
  • Learn this class name somehow from other page logic (it must be present somewhere else if the page's logic itself can find and use it, and that logic must be found by something else, etc etc)

    In this case, the class name is a part of a local variable in the renderImage function, so it's only salvageable via DOM by exploring its AST. The function itself is buried somewhere inside webpack machinery (it seems to pack all resources into a few global objects with one-letter names). Alternatively, you can read all included JS files as raw data and look for the definition of renderImage in them. So, in this case, it's disproportionally hard, though theoretically possible still.


Example of getting elements by visual characteristics

On any page whatsoever, this would find 3 images of the same size, located side by side (this is the way they are at https://www.instagram.com/kitties).

Since HTMLElements can't be passed to Python directly (at least, I couldn't find any way to), we need to pass some unique IDs instead to locate them by, like unique XPath's.

(The JS code could probably be more elegant, I don't have much experience with the language)

In [22]: script = """
  //https://stackoverflow.com/questions/2661818/javascript-get-xpath-of-a-node/43688599#43688599
  function getXPathForElement(element) {
      const idx = (sib, name) => sib 
          ? idx(sib.previousElementSibling, name||sib.localName) + (sib.localName == name)
          : 1;
      const segs = elm => !elm || elm.nodeType !== 1 
          ? ['']
          : elm.id && document.querySelector(`#${elm.id}`) === elm
              ? [`id("${elm.id}")`]
              : [...segs(elm.parentNode), `${elm.localName.toLowerCase()}[${idx(elm)}]`];
      return segs(element).join('/');
  }

  //https://plainjs.com/javascript/styles/get-the-position-of-an-element-relative-to-the-document-24/
  function offsetTop(el){
    return window.pageYOffset + el.getBoundingClientRect().top;
  }

  var expected_images=3;
  var found_groups=new Map();
  for (e of document.getElementsByTagName('img')) {
    let group_id = e.offsetWidth + "x" + e.offsetHeight;
    if (!(found_groups.has(group_id))) found_groups.set(group_id,[]);
    found_groups.get(group_id).push(e);
  }
  for ([k,v] of found_groups) {
    if (v.length != expected_images) {found_groups.delete(k);continue;}
    var offset_top = offsetTop(v[0]);
    for (e of v){
      let _c_oft = offsetTop(e);
      if (_c_oft !== offset_top){
        found_groups.delete(k);
        break;
      }
    }
  }
  if (found_groups.size != 1) {
    console.log(found_groups);
    throw 'Unexpected pattern of images after filtering';
  }

  var found_group = found_groups.values().next().value;


  result=[]
  for (e of found_group) {
    result.push(getXPathForElement(e));
  }
  return result;
"""

In [23]: d.execute_script(script)
Out[23]:
[u'id("react-root")/section[1]/main[1]/div[1]/article[1]/div[1]/div[1]/div[1]/div[1]/a[1]/div[1]/div[1]/img[1]',
 u'id("react-root")/section[1]/main[1]/div[1]/article[1]/div[1]/div[1]/div[1]/div[2]/a[1]/div[1]/div[1]/img[1]',
 u'id("react-root")/section[1]/main[1]/div[1]/article[1]/div[1]/div[1]/div[1]/div[3]/a[1]/div[1]/div[1]/img[1]']

In [27]: [d.find_element_by_xpath(xp) for xp in _]
Out[27]:
[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="1ab4eeb4-10c4-4da4-996c-ee6744445dcc", element="55c48964-8cd0-4472-b35b-214a5a9bfbf7")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="1ab4eeb4-10c4-4da4-996c-ee6744445dcc", element="b7f7c8a4-e343-49ca-b416-49f72e67ae07")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="1ab4eeb4-10c4-4da4-996c-ee6744445dcc", element="728f6148-6a03-4c9a-9933-36859d65eb51")>]

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...