Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


Parsing out content from HTML using regex?

How can I use a regex to match everything except the data inside a div with a specific style? For example:

<div style="float:left;padding-left:10px; padding-right:10px">
    <img src="../Style/BreadCrumbs/Divider.png">
</div>
<div style="float:left; padding-top:5px;">
    Data to keep
</div>
<div style="float:left;padding-left:10px; padding-right:10px">
    <img src="../Style/BreadCrumbs/Divider.png">
</div>

I want the regex to match everything except the data. The best approach I can see is to remove the HTML markup and then combine the files with VB (I already have the VB code).

I'm using regex because I need to extract the data from several hundred files.


1 Reply


Matching everything except the data is probably not a good way to do this; it is easier to match the data div itself and keep only that. If:

  • you have access to grep
  • your version of grep supports Perl-compatible regular expressions (PCRE)
  • this style of div only wraps your data, not other elements
  • the 'data' div does not contain other divs

Then you can use:

(?s)<div style="float:left; padding-top:5px;">.*?</div>

The important parts of this are:

  • (?s) which activates DOTALL, which means that . will match newlines
  • .*? which matches the contents of the div reluctantly, which means it'll stop at the first </div> it finds.
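The effect of those two parts can be checked outside grep. The following is a minimal sketch in Python, whose `re` module accepts the same `(?s)` flag and `.*?` quantifier; the variable names are only illustrative, and the sample markup is the one from the question.

```python
import re

# Sample markup from the question.
html = '''<div style="float:left;padding-left:10px; padding-right:10px">
    <img src="../Style/BreadCrumbs/Divider.png">
</div>
<div style="float:left; padding-top:5px;">
    Data to keep
</div>
<div style="float:left;padding-left:10px; padding-right:10px">
    <img src="../Style/BreadCrumbs/Divider.png">
</div>
'''

# The pattern from the answer: DOTALL plus a reluctant quantifier.
reluctant = re.compile(r'(?s)<div style="float:left; padding-top:5px;">.*?</div>')
# Greedy variant for comparison: it runs on to the last </div> in the input.
greedy = re.compile(r'(?s)<div style="float:left; padding-top:5px;">.*</div>')
# Without (?s), "." cannot cross the newline after the opening tag.
no_dotall = re.compile(r'<div style="float:left; padding-top:5px;">.*?</div>')

reluctant_match = reluctant.search(html).group(0)
greedy_match = greedy.search(html).group(0)
no_dotall_match = no_dotall.search(html)  # None: nothing matches

print(reluctant_match)
```

The reluctant match ends at the data div's own closing tag, the greedy one also swallows the divider div that follows, and the pattern without `(?s)` finds nothing at all.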

To use this, you'll need to activate a few grep options:

grep -Pzo "$PATTERN" file

These options do the following:

  • -P enables PCRE
  • -z makes NUL the line terminator instead of newline, so grep will treat the entire file as a single line
  • -o prints only the matching parts

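After this you'll need to strip off the div tags. sed is a good tool for this. The -E option (extended regex) is needed so that ? is treated as "optional" rather than as a literal question mark:

sed -E 's|</?div[^>]*>||g'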

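If you put all of your files in one directory you can do the joining at the same time. With several input files, -h stops grep from prefixing each match with its file name, and tr turns the NUL that -z writes after each match back into a newline:

grep -Pzoh "$PATTERN" *.html | tr '\0' '\n' | sed -E 's|</?div[^>]*>||g' > out.html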
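If no PCRE-enabled grep is available, the same extract-and-join step can be sketched in Python. This is only a sketch under assumptions: the names `DATA_DIV`, `extract_data` and `combine` are made up for illustration, the files are assumed to be UTF-8, and the same preconditions as above still apply (the data div carries exactly this style attribute and contains no nested divs). A capture group takes the place of the separate sed step, so the div tags never reach the output.

```python
import re
from pathlib import Path

# Same pattern as above, with a capture group around the div's contents.
DATA_DIV = re.compile(r'(?s)<div style="float:left; padding-top:5px;">(.*?)</div>')


def extract_data(html):
    """Return the whitespace-trimmed contents of every data div in one document."""
    return [content.strip() for content in DATA_DIV.findall(html)]


def combine(src_dir, out_file):
    """Write the data from every *.html file in src_dir to out_file, one item per line.

    Returns the number of items written.
    """
    items = []
    for path in sorted(src_dir.glob('*.html')):
        # Skip the output file in case it lives in the same directory.
        if path.resolve() == out_file.resolve():
            continue
        text = path.read_text(encoding='utf-8', errors='replace')
        items.extend(extract_data(text))
    out_file.write_text(''.join(item + '\n' for item in items), encoding='utf-8')
    return len(items)
```

A call such as `combine(Path('pages'), Path('pages/out.html'))` (both paths hypothetical) would process every HTML file in `pages` in name order and can be rerun safely, since the output file is skipped on input.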
