Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


html - how to get at this data

I am looking to scrape the three items that are highlighted and bordered from the html sample below. I've also highlighted a few markers that look useful.

How would you do this?


A Solution

Ok so this wasn't a great question and I'm actually surprised it didn't get down-voted more! Oh well, here are some bread crumbs for someone else.

Three of the four items of info I want are the inner text of a span element with a known id (e.g., $0.83 for "yfs_l10_gm150220c00036500"), so the following helper function seems to be a decent and direct shot:

''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
' GetSpanTextForId
'
' Returns the inner text from a span element known by the passed id
'
' param doc:     the source HTMLDocument
' param spanId:  the id attribute of the span element to read
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Function GetSpanTextForId(ByRef doc As HTMLDocument, ByVal spanId As String) As Double
'   Error Handling
    On Error GoTo ErrHandler
    Dim sRoutine        As String
    sRoutine = cModule & ".GetSpanTextForId"
     
    CheckArgNotNothing doc, "doc"
    CheckArgNotBadString spanId, "spanId"
'   Procedure
    Dim oSpan As HTMLSpanElement
    Set oSpan = doc.getElementById(spanId)
    Check Not oSpan Is Nothing, "Could not find span with id: " & Bracket(spanId)
    GetSpanTextForId = oSpan.innerText
    
    Exit Function

ErrHandler:
    Select Case DspErrMsg(sRoutine)
         Case Is = vbAbort:  Stop: Resume    'Debug mode - Trace
         Case Is = vbRetry:  Resume          'Try again
         Case Is = vbIgnore:                 'End routine
     End Select


End Function

1 Reply


I would probably use an XML parser to get the text content first (or this: xmlString.replace(/<[^>]+>/g, "") to replace all tags with empty strings), then use the following regexes to extract the information you need:

/-OPR\s+(\d+\.\d+)/
/Bid:\s+(\d+\.\d+)/
/Ask:\s+(\d+\.\d+)/
/Open Interest:\s+(\d+,\d+)/

This process can easily be done in Node.js or in any other language that supports regex.
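As a rough sketch of the two steps described above (strip the tags, then apply the regexes), something like the following should run in Node.js. The function name `extractQuote`, the result keys, and the `sample` string are made up for illustration; the sample is a trimmed-down stand-in for the real page, not the actual markup.

```javascript
// Hypothetical helper: strips tags, then applies the four patterns listed above.
function extractQuote(html) {
  // Replace every tag with an empty string, leaving only the text content.
  var text = html.replace(/<[^>]+>/g, "");

  var patterns = {
    last: /-OPR\s+(\d+\.\d+)/,
    bid: /Bid:\s+(\d+\.\d+)/,
    ask: /Ask:\s+(\d+\.\d+)/,
    openInterest: /Open Interest:\s+(\d+,\d+)/
  };

  var result = {};
  Object.keys(patterns).forEach(function (key) {
    var m = text.match(patterns[key]);
    // match() returns null when the pattern is absent, so guard before indexing.
    result[key] = m ? m[1] : null;
  });
  return result;
}

// Assumed sample: a shortened fragment shaped like the page in the question.
var sample =
  '<span class="rtq_exch"><span class="rtq_dash">-</span>OPR</span>\n' +
  '<span class="time_rtq_ticker"><span>0.83</span></span>\n' +
  '<table><tr><th>Bid:</th> <td><span>0.76</span></td></tr>\n' +
  '<tr><th>Ask:</th> <td><span>0.90</span></td></tr>\n' +
  '<tr><th>Open Interest:</th> <td>11,579</td></tr></table>';

var quote = extractQuote(sample);
console.log(quote);
```

Values come back as strings (or `null` when a label is not found), so the caller decides how to handle a page whose layout has changed.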


live demo:

  • Waits 1 second, then removes tags.
  • Waits another second, then finds all patterns and creates a table.

var wait = true; // Set to false to execute instantly.

var elem = document.getElementById("parsingStuff");
var str = elem.textContent;

var keywords = ["-OPR", "Bid:", "Ask:", "Open Interest:"];
var output = {};
var timeout = 0;

if (wait) timeout = 1000;

setTimeout(function() { // Removing tags.
  elem.innerHTML = elem.textContent;
}, timeout);

if (wait) timeout = 2000;

setTimeout(function() { // Looking for patterns.
  for (var i = 0; i < keywords.length; i++) {
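    // Backslashes must be doubled inside a string literal passed to RegExp.
    var m = str.match(new RegExp(keywords[i] + "\\s+(\\d+[.,]\\d+)"));
    output[keywords[i]] = m ? m[1] : "(not found)"; // match() is null when nothing matches.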
  }

  // Creating basic table of found data.
  elem.innerHTML = "";
  var table = document.createElement("table");
  for (var k in output) {
    var tr = document.createElement("tr");
    var th = document.createElement("th");
    var td = document.createElement("td");

    th.style.border = "1px solid gray";
    td.style.border = "1px solid gray";

    th.textContent = k;
    td.textContent = output[k];

    tr.appendChild(th);
    tr.appendChild(td);

    table.appendChild(tr);
  }
  elem.appendChild(table);
}, timeout);
<div id="parsingStuff">
  <div class="yfi_rt_quote_summary" id="yfi_rt_quote_summary">
    <div class="hd">
      <div class="title">
        <h2>GM Feb 2015 36.500 call (GM150220C00036500)</h2>
        <span class="rtq_exch">
        <span class="rtq_dash">-</span>OPR
        </span>
        <span class="wl_sign"></span>
      </div>
    </div>
    <div class="yfi_rt_quote_summary_rt_top sigfig_promo_1">
      <div>
        <span class="time_rtq_ticker">

        <span id="yfs_110_gm150220c00036500">0.83</span>
        </span>
      </div>
    </div>
  </div>
  <div class="yui-u first yfi-start-content">
    <div class="yfi_quote_summary">
      <div id="yfi_quote_summary_data" class="rtq_table">
        <table id="table1">
          <tr>
            <th scope="row" width="48%">Bid:</th>
            <td class="yfnc_tabledata1">

              <span id="yfs_b00_gm150220c00036500">0.76</span>
            </td>
          </tr>
          <tr>
            <th scope="row" width="48%">Ask:</th>
            <td class="yfnc_tabledata1">

              <span id="yfs_a00_gm150220c00036500">0.90</span>
            </td>
          </tr>
        </table>
        <table id="table2">
          <tr>
            <th scope="row" width="48%">Open Interest:</th>

            <td class="yfnc_tabledata1">11,579</td>
          </tr>
        </table>
      </div>
    </div>
  </div>
</div>
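
The values pulled out by the regexes are still strings ("0.83", "11,579"). If numbers are needed, a small conversion step could follow. This is only a sketch: `toNumber` is a hypothetical helper, and it assumes the comma is a thousands separator (as in "11,579"), which would be wrong for locales that use a comma as the decimal mark.

```javascript
// Hypothetical helper: converts a captured string such as "11,579" to a number.
// Assumption: commas are thousands separators, not decimal marks.
function toNumber(raw) {
  if (raw === null || raw === undefined) return null;
  var cleaned = String(raw).replace(/,/g, "").trim();
  if (cleaned === "") return null; // Number("") would otherwise give 0.
  var n = Number(cleaned);
  return isNaN(n) ? null : n;
}

var openInterest = toNumber("11,579");
var lastPrice = toNumber("0.83");
console.log(openInterest, lastPrice);
```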
