Let's flip your question upside down, and start with the theory. Methodology might be a better word for it.
You want to get at something specific in a structured page. To do that, you either need a way to zap right to the element (which can be done if it's labeled in a unique way that we can access), OR you need to navigate the structure more-or-less manually. You already know how to look at the source of a page, so you're familiar with this step. Here's a screenshot of Firefox Inspector, highlighting the element we're interested in.
We can see the hierarchy of elements that lead to the table: html, body, div, div, div.ticker, table.ticker_data. We can also see the source:
<table class="ticker_data">
Neat! It's labeled! Unfortunately, that class info gets dropped when we process the HTML in our script. Bummer. If it was id="ticker_data"
instead, we could use the getElementByVal() utility from this answer to reach it, and give ourselves some immunity from future restructuring of the page. Put a pin in that - we'll come back to it.
It can help to visualize this in the debugger. Here's a utility script for that - run it in debug mode, and you'll have your HTML document laid out to explore:
/**
* Debug-run this in the editor to be able to explore the structure of web pages.
*
* Set target to the page you're interested in.
*/
function pageExplorer() {
var target = "http://www.bloomberg.com/markets/companies/country/hong-kong/";
var pageTxt = UrlFetchApp.fetch(target).getContentText();
var pageDoc = Xml.parse(pageTxt,true);
debugger; // Pause in debugger - explore pageDoc
}
This is what our page looks like in the debugger:
You might be wondering what the numbered elements are, since you don't see them in the source. When there are multiples of an element type at the same level in an XML document, the parser presents them as an array, numbered 0..n
. Thus, when we see 0
under a div
in the debugger, that's telling us that there are multiple <div>
tags in the HTML source at that level, and we can access them as an array, for example .div[0]
.
Ok, theory behind us, let's go ahead and see how we can access the table by brute-force.
Knowing the hierarchy, including the div arrays shown in the debugger, we could do this, ala Phil's previous answer. I'll do some weird indenting to illustrate the document structure:
...
var target = "http://www.bloomberg.com/markets/companies/country/hong-kong/";
var pageTxt = UrlFetchApp.fetch(target).getContentText();
var pageDoc = Xml.parse(pageTxt,true);
var table = pageDoc.getElement()
.getElement("body")
.getElements("div")[0] // 0-th div under body, shown in debugger
.getElements("div")[5] // 5-th div under there
.getElement("div") // another div
.getElement("table"); // finally, our table
As a much more compact alternative to all those .getElement()
calls, we can navigate using dot notation.
var table = pageDoc.getElement().body.div[0].div[5].div.table;
And that's that.
Let's go back to that pinned idea. In the debugger, we can see that there are various attributes attached to elements. In particular, there's an "id" on that div[5] that contains the div that contains the table. Remember, in the source we saw "class" attributes, but note that they don't make it this far.
Still, the fact that a kindly programmer put this "id" in place means we can do this, with getDivById()
from that earlier question:
var contentDiv = getDivById( pageDoc.getElement().body, 'content' );
var table = contentDiv.div.table;
If they move things around, we might still be able to find that table, without changing our code.
You already know what to do once you have the table element, so we're done here!