Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
213 views
in Technique[技术] by (71.8m points)

screen scraping - Reading HTML page using Libreoffice Basic

I'm new to LibreOffice Basic. I'm trying to write a macro in LibreOffice Calc that will read the name of a noble House of Westeros from a cell (e.g. Stark), and output the Words of that House by looking it up on the relevant page on A Wiki of Ice and Fire. It should work like this:

enter image description here

enter image description here

Here is the pseudocode:

Read HouseName from column A
Open HtmlFile at "http://www.awoiaf.westeros.org/index.php/House_" & HouseName
Iterate through HtmlFile to find line which begins "<table class="infobox infobox-body"" // Finds the info box for the page.
Read Each Row in the table until Row begins Words
Read the contents of the next <td> tag, and return this as a string.

My problem is with the second line, I don't know how to read a HTML file. How should I do this in LibreOffice Basic?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

There are two mainly issues with this. 1. Performance Your UDF will need get the HTTP resource in every cell, in which it is stored. 2. HTML Unfortunately there is no HTML parser in OpenOffice or LibreOffice. There is only a XML parser. Thats why we cannot parse HTML directly with the UDF.

This will work, but slow and not very universal:

Public Function FETCHHOUSE(sHouse as String) as String

   sURL = "http://awoiaf.westeros.org/index.php/House_" & sHouse

   oSimpleFileAccess = createUNOService ("com.sun.star.ucb.SimpleFileAccess")
   oInpDataStream = createUNOService ("com.sun.star.io.TextInputStream")
   on error goto falseHouseName
   oInpDataStream.setInputStream(oSimpleFileAccess.openFileRead(sUrl))
   on error goto 0
   dim delimiters() as long
   sContent = oInpDataStream.readString(delimiters(), false)

   lStartPos = instr(1, sContent, "<table class=" & chr(34) & "infobox infobox-body" )
   if lStartPos = 0 then
     FETCHHOUSE = "no infobox on page"
     exit function
   end if   
   lEndPos = instr(lStartPos, sContent, "</table>")
   sTable = mid(sContent, lStartPos, lEndPos-lStartPos + 8)

   lStartPos = instr(1, sTable, "Words" )
   if lStartPos = 0 then
     FETCHHOUSE = "no Words on page"
     exit function
   end if        
   lEndPos = instr(lStartPos, sTable, "</tr>")
   sRow = mid(sTable, lStartPos, lEndPos-lStartPos + 5)

   oTextSearch = CreateUnoService("com.sun.star.util.TextSearch")
   oOptions = CreateUnoStruct("com.sun.star.util.SearchOptions")
   oOptions.algorithmType = com.sun.star.util.SearchAlgorithms.REGEXP
   oOptions.searchString = "<td[^<]*>"
   oTextSearch.setOptions(oOptions)
   oFound = oTextSearch.searchForward(sRow, 0, Len(sRow))
   If  oFound.subRegExpressions = 0 then 
     FETCHHOUSE = "Words header but no Words content on page"
     exit function   
   end if
   lStartPos = oFound.endOffset(0) + 1
   lEndPos = instr(lStartPos, sRow, "</td>")
   sWords = mid(sRow, lStartPos, lEndPos-lStartPos)

   FETCHHOUSE = sWords
   exit function

   falseHouseName:
   FETCHHOUSE = "House name does not exist"

End Function

The better way would be, if you could get the needed informations from a Web API that would offered from the Wiki. You know the people behind the Wiki? If so, then you could place this there as a suggestion.

Greetings

Axel


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...