Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

vbscript - Extract text between HTML tags

I have many HTML files from which I need to extract text. If it's all on one line, I can do that quite easily, but if the tag wraps around or spans multiple lines, I can't figure out how to do this. Here's what I mean:

<section id="MySection">
Some text here
another line here <br>
last line of text.
</section>

I'm not concerned about the <br> tag, unless it helps with wrapping the text. The area that I want always begins with "MySection" and ends with </section>. What I'd like to end up with is something like this:

Some text here  another line here  last line of text.

I'd prefer something like a vbscript or command line option (sed?) but I'm not sure where to begin. Any help?

1 Reply

Normally you'd use the Internet Explorer COM object for this:

root = "C:\base\dir"

Set fso = CreateObject("Scripting.FileSystemObject")
Set ie  = CreateObject("InternetExplorer.Application")

For Each f In fso.GetFolder(root).Files
  ie.Navigate "file:///" & f.Path
  While ie.Busy : WScript.Sleep 100 : Wend

  text = ie.document.getElementById("MySection").innerText

  WScript.Echo Replace(text, vbNewLine, "")
Next

' Close the hidden Internet Explorer instance when done.
ie.Quit

However, the <section> tag is not supported prior to IE 9, and even in IE 9 the COM object doesn't seem to handle it correctly, as getElementById("MySection") only returns the opening tag:

>>> wsh.echo ie.document.getelementbyid("MySection").outerhtml
<SECTION id=MySection>

You could use a regular expression instead, though:

root = "C:\base\dir"

Set fso = CreateObject("Scripting.FileSystemObject")

Set re1 = New RegExp
re1.Pattern = "<section id=""MySection"">([\s\S]*?)</section>"
re1.Global  = False
re1.IgnoreCase = True

Set re2 = New RegExp
re2.Pattern = "(<br>|\s)+"
re2.Global  = True
re2.IgnoreCase = True

For Each f In fso.GetFolder(root).Files
  html = fso.OpenTextFile(f.Path).ReadAll

  Set m = re1.Execute(html)
  If m.Count > 0 Then
    ' m(0) is the first match; SubMatches(0) is the captured section body.
    text = Trim(re2.Replace(m(0).SubMatches(0), " "))
    WScript.Echo text
  End If
Next
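
To try the two expressions without touching any files, here is a minimal sketch that runs them against the sample markup from the question. The function name ExtractSectionText and the variable names are my own, not part of the answer above, and it assumes the opening tag is written exactly as <section id="MySection"> with double quotes and no other attributes.

```vbscript
Option Explicit

' Hypothetical helper (name chosen for this sketch): returns the text inside
' <section id="MySection">...</section> on a single line, or "" if the
' section is not found.
Function ExtractSectionText(html)
  Dim reSection, reGap, matches

  ' [\s\S]*? lazily matches any characters, including line breaks.
  Set reSection = New RegExp
  reSection.Pattern    = "<section id=""MySection"">([\s\S]*?)</section>"
  reSection.Global     = False
  reSection.IgnoreCase = True

  ' Any run of <br> tags and whitespace becomes a single space.
  Set reGap = New RegExp
  reGap.Pattern    = "(<br>|\s)+"
  reGap.Global     = True
  reGap.IgnoreCase = True

  ExtractSectionText = ""
  Set matches = reSection.Execute(html)
  If matches.Count > 0 Then
    ExtractSectionText = Trim(reGap.Replace(matches(0).SubMatches(0), " "))
  End If
End Function

' Sample markup copied from the question.
Dim sample
sample = "<section id=""MySection"">" & vbCrLf & _
         "Some text here" & vbCrLf & _
         "another line here <br>" & vbCrLf & _
         "last line of text." & vbCrLf & _
         "</section>"

WScript.Echo ExtractSectionText(sample)
```

Saved as, say, extract.vbs and run with cscript //nologo extract.vbs, this should print the three lines joined with single spaces. If your files write the tag differently (single quotes, extra attributes), the first pattern would need to be loosened accordingly.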
