Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


javascript - Identify & Extract the title/description of an Image (Data Scraping Pinterest)

How can Javascript/jQuery be used to identify the description or title corresponding to an image on a webpage with multiple images and descriptions?

The page title can be extracted very easily, but it may not correspond to the image, especially if there are many images present on the page:

var title = document.title;

I believe this has been done successfully by Pinterest's Pin-it bookmarklet. I'm guessing it uses an algorithm that finds the nearest h1, h2, or h3, or the image's alt attribute, and then falls back to document.title if it fails to identify the image's description on the page.

Any ideas greatly appreciated!

EDIT

This is for data scraping other websites



1 Reply


The OP has provided a great question to expand on. I recently created a jsFiddle for another SO Answer to data scrape URL, Title, and Thumbnail from the new Yahoo! Screen Video Player webpages.

I have just re-written that jsFiddle so it's Pinterest specific and have made direct use of Metatag Object Numbers (more on that later) which makes this jsFiddle very different from that one.

The overall process involves using Yahoo! Query Language (YQL) along with the jQuery .ajax() function to get the desired scraped data, which is usually available in the metatag section of the webpage's source.
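As a rough sketch of that request, the snippet below only builds the URL that would be handed to .ajax(). The endpoint is the public YQL address as it existed at the time (the service has since been retired). The helper name buildYqlUrl and the //meta XPath are assumptions for illustration, not necessarily the values used in the jsFiddle.

```javascript
// Hypothetical helper: builds a YQL request URL asking for every <meta> tag of a page.
// The endpoint is the historical public YQL address; the XPath is an assumption.
function buildYqlUrl(pageUrl) {
  var yql = 'select * from html where url="' + pageUrl + '" and xpath="//meta"';
  return 'https://query.yahooapis.com/v1/public/yql?q=' +
    encodeURIComponent(yql) + '&format=json';
}

var requestUrl = buildYqlUrl('http://pinterest.com/pin/40250990391375228/');

// With jQuery loaded, the request itself would look roughly like this:
// $.ajax({ url: requestUrl, dataType: 'jsonp', success: function (data) { /* use data */ } });
```

Encoding the whole query with encodeURIComponent keeps the quotes and the nested URL from breaking the request string.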


First, let me explain a few things.

The Pinterest Link that I will use will be a direct link to a pinned item. This means that webpage will contain the primary pinned item along with many other smaller pinned items, unlike the homepage which contains a multitude of only pinned items.

That Pinterest Link has for its Webpage Title the pinned item's Title along with a few words that make up the pinned item's Description. This most likely is not desired, and just the pinned item's Title is all that's needed.

Viewing the HTML Source Page for the Pinterest Link shows us the metatags that are currently used. Here are most of them:

<meta property="fb:app_id" content="274266067164"/>

<meta property="og:site_name" content="Pinterest"/>
<meta property="og:type" content="pinterestapp:pin"/>
<meta property="og:url" content="http://pinterest.com/pin/40250990391375228/"/>
<meta property="og:title" content="FUNNY!!"/>
<meta property="og:description" content="Someone please do this."/>
<meta property="og:image" content="http://media-cache0.pinterest.com/upload/62980094758941134_yXgT124O_c.jpg"/>
<meta property="og:see_also" content="http://9gag.com/gag/2934786" />

<meta property="pinterestapp:pinboard" content="http://pinterest.com/amjo32/funny/"/>
<meta property="pinterestapp:pinner" content="http://pinterest.com/amjo32/"/>
<meta property="pinterestapp:source" content="http://9gag.com/gag/2934786"/>
<meta property="pinterestapp:likes" content="21"/>
<meta property="pinterestapp:repins" content="30"/>
<meta property="pinterestapp:comments" content="0"/>
<meta property="pinterestapp:actions" content="51"/>

<meta name="twitter:card" content="photo">
<meta name="twitter:url" content="http://pinterest.com/pin/40250990391375228/">
<meta name="twitter:site" content="@pinterest">

<meta name="google-site-verification" content="NvDayNupl7R0MDceeuRcs7xUf9yqUsxg6WGjEeRdAnc" />
<meta name="application-name" content="Pinterest" />
<meta name="msapplication-TileColor" content="#ffffff" />

As you can see, those metatags contain the og:title and og:image data we are after, which makes the og metatags a direct target for the data scraping process.

To be sure, the og:image content link above is for the full-size version of the image, ending in _c.jpg. The Thumbnail versions end in _b.jpg. Essentially, you have two unique image sizes per pinned item.
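If the thumbnail is wanted instead of the full-size image, the suffix swap can be sketched as below. The helper name toThumbnailUrl is hypothetical, and the sketch assumes the _c.jpg / _b.jpg naming described above holds for the image in question.

```javascript
// Hypothetical helper: swaps the full-size suffix (_c.jpg) for the thumbnail suffix (_b.jpg).
// A URL that does not end in _c.jpg is returned unchanged.
function toThumbnailUrl(imageUrl) {
  return imageUrl.replace(/_c\.jpg$/, '_b.jpg');
}

var fullSize = 'http://media-cache0.pinterest.com/upload/62980094758941134_yXgT124O_c.jpg';
var thumbnail = toThumbnailUrl(fullSize);
```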

Since the data scraping process does not return these og property names, only Metatag Object Numbers, we need to analyze the returned content associated with each Metatag Object Number.

Looking at the above metatag source, it's clear that the image will always be located at some place starting with http://media-. Those 13 characters are unique among all metatags, and therefore when that's matched, that entire URL is the image location.
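A minimal sketch of that prefix match follows. It assumes the scraped result has already been reduced to an array of metatag objects that each carry a content string. The names findImageUrl and sampleMetas are made up for the example.

```javascript
// Assumption: `metas` is the array of metatag objects from the scrape,
// each with a `content` string.
function findImageUrl(metas) {
  var prefix = 'http://media-'; // the 13-character marker described above
  for (var i = 0; i < metas.length; i++) {
    var content = metas[i] && metas[i].content;
    if (typeof content === 'string' && content.indexOf(prefix) === 0) {
      return content; // the whole content value is the image location
    }
  }
  return null; // nothing matched the expected URL template
}

// Sample data taken from the metatags listed above.
var sampleMetas = [
  { content: 'Pinterest' },
  { content: 'http://pinterest.com/pin/40250990391375228/' },
  { content: 'http://media-cache0.pinterest.com/upload/62980094758941134_yXgT124O_c.jpg' }
];
var imageUrl = findImageUrl(sampleMetas);
```

Returning null when nothing matches gives the caller a clear signal that the URL template may have changed.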

Of course, should Pinterest use more than one URL Template for their images, then things will need to be adjusted accordingly.

Looking at og:title, you immediately realize that there is no unique string of characters in the content portion to indicate that this tag is the image's title. Therefore, assuming all metatags follow a template that will not change for some time, we will rely on Metatag Object Number 7 to provide the Pinterest Pinned Item's Image Title. To be clear, this number 7 is based on the .ajax() and YQL results from this script's process, not on the source HTML structure as seen above.

Again, if Pinterest changes their template for the head section, then adjustments may be required.
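The index-based lookup can be sketched as follows, again assuming the scrape yields an array of metatag objects with a content string. The helper name getTitleByIndex and the placeholder data are hypothetical. The index is passed in as a parameter so it is easy to adjust.

```javascript
// Assumption: `metas` is the array returned by the scrape and the title sits at
// Metatag Object Number 7, as described above.
function getTitleByIndex(metas, titleIndex) {
  var entry = metas[titleIndex];
  return entry && typeof entry.content === 'string' ? entry.content : null;
}

// Placeholder data: only position 7 matters for this illustration.
var metas = [];
for (var i = 0; i < 25; i++) {
  metas.push({ content: 'placeholder ' + i });
}
metas[7] = { content: 'FUNNY!!' };

var title = getTitleByIndex(metas, 7);
```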

What follows now is a live step-by-step tutorial I wrote, based on the data scraping techniques and script seen in this online article.


jsFiddle Pinterest Data Scraping DEMO



Tip:
Although not demonstrated, you also have at your disposal a numeric value for the total number of Metatags found. It can be checked against a predetermined value for what the page should contain, to detect that the head section has changed. For example, the current metatag count is 25 items. If the returned value is not equal to this on any other Pinterest Pinned Item webpage, you know a different head section is in use, which may affect the script, since it expects exactly 25 metatags and calls two of them directly by their Metatag Object Numbers.
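That check could look something like the sketch below. The constant and function names are invented for the example, and 25 is simply the count quoted above.

```javascript
// Assumption: 25 is the metatag count observed for a pinned-item page, per the tip above.
var EXPECTED_META_COUNT = 25;

// Returns true when the scraped metatag list has the size the script was written against.
function headLooksUnchanged(metas) {
  return Array.isArray(metas) && metas.length === EXPECTED_META_COUNT;
}

var scraped = new Array(24).fill({ content: '' }); // pretend one metatag went missing
if (!headLooksUnchanged(scraped)) {
  console.log('Metatag count changed; index-based lookups may be wrong.');
}
```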


Something extra:
If you're curious about how to retrieve the current Pinterest Pinned ITEMS as seen on the homepage, first understand how this jsFiddle DEMO works. Then you'll need to make your own jsFiddle version for testing, using the Pinterest Homepage URL and changing the XPATH in the .ajax() call to data scrape only the relevant divs in the body section. To learn more about XPATH basics, click HERE. Then you can understand: XPATH for Select Divs in Body on YQL Playground.

For example, the body section contains a maximum total of 50 pins in this format:

 "href": "/pin/15833036160340477/"

Those href fragments will serve as a starting point for recreating the URLs. Important note: Some pins may be repins, which means you may get fewer than 50 pins returned.
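Recreating the URLs from those fragments might be sketched like this. The helper name buildPinUrls is hypothetical, the base address is taken from the og:url metatag shown earlier, and skipping repeated fragments is an added precaution rather than something the jsFiddle is known to do.

```javascript
// Assumption: `hrefs` are the "/pin/<id>/" fragments pulled from the scraped body,
// and pinned-item pages live under http://pinterest.com as in the og:url metatag.
function buildPinUrls(hrefs) {
  var base = 'http://pinterest.com';
  var seen = {};
  var urls = [];
  hrefs.forEach(function (href) {
    // Keep only well-formed pin fragments and skip any repeated ones.
    if (/^\/pin\/\d+\/$/.test(href) && !seen[href]) {
      seen[href] = true;
      urls.push(base + href);
    }
  });
  return urls;
}

var pinUrls = buildPinUrls([
  '/pin/15833036160340477/',
  '/about/',
  '/pin/15833036160340477/'
]);
```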

For those that read this far, here it is:

Something Extra jsFiddle DEMO.

Here is an improved XPATH for Select Divs in Body on YQL Playground, but do understand how the longer one above works.


Also see my other Pinterest SO Answers for:

Custom Pinterest button for custom URL (Text-Link, Image, or Both)

How can I duplicate Pinterest website's modal effect?

