Parsing out content from HTML using regex?

posted Jan 31, 2022 in Technique[技术] by 深蓝 (71.8m points)

How can I use a regex to match everything except the data inside a div with a specific style? For example:

<div style="float:left;padding-left:10px; padding-right:10px">
    <img src="../Style/BreadCrumbs/Divider.png">
</div>
<div style="float:left; padding-top:5px;">
    Data to keep
</div>
<div style="float:left;padding-left:10px; padding-right:10px">
    <img src="../Style/BreadCrumbs/Divider.png">
</div>

I want the regex to match everything except the data. The best approach I can see is to remove the HTML markup and then combine the files with VB (I already have the VB code).

I'm using regex because I need to extract the data from several hundred files.

1 Reply

深蓝 · Answer 1 · 2022-01-31T07:16:06+0000

Your suggested method is probably not a good way to do this. If:

you have access to grep
your version of grep supports perl-compatible regex (PCRE)
this style of div only wraps your data, not other elements
the 'data' div does not contain other divs

Then you can use:

(?s)<div style="float:left; padding-top:5px;">.*?</div>

The important parts of this are:

(?s) which activates DOTALL, which means that . will match newlines
.*? which matches the contents of the div reluctantly, which means it'll stop at the first </div> it finds.
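
To see why the reluctant quantifier matters, here is a minimal sketch comparing .* with .*? on a single line. The sample string and the variable names are made up for illustration, and it assumes GNU grep built with PCRE support (-P selects PCRE, -o prints only the matched text).

```shell
# Hypothetical one-line sample containing two divs.
sample='<div>a</div><div>b</div>'

# Greedy: .* runs to the last </div>, so both divs come back as one match.
greedy=$(printf '%s\n' "$sample" | grep -Po '<div>.*</div>')

# Reluctant: .*? stops at the first </div>, so each div is its own match.
lazy=$(printf '%s\n' "$sample" | grep -Po '<div>.*?</div>')

printf 'greedy:\n%s\n\nreluctant:\n%s\n' "$greedy" "$lazy"
```

The greedy form returns one match spanning both divs. The reluctant form returns two matches, one per div, which is the behaviour the pattern above relies on.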

To use this, you'll need to activate a few grep options:

grep -Pzo "$PATTERN" file

For these:

-P activates the PCRE
-z makes grep use NUL instead of newline as the line terminator, so grep will treat the entire file as a single line
-o prints only the matching parts
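
As a rough end-to-end check, the sketch below runs the pattern against the markup from the question. The scratch directory, the file name sample.html, and the trailing tr step are assumptions added for illustration. Because -z also makes grep terminate each output record with NUL, tr converts those bytes back to newlines.

```shell
# Scratch directory and file name are illustrative assumptions.
workdir=$(mktemp -d)
cat > "$workdir/sample.html" <<'EOF'
<div style="float:left;padding-left:10px; padding-right:10px">
    <img src="../Style/BreadCrumbs/Divider.png">
</div>
<div style="float:left; padding-top:5px;">
    Data to keep
</div>
<div style="float:left;padding-left:10px; padding-right:10px">
    <img src="../Style/BreadCrumbs/Divider.png">
</div>
EOF

# Single quotes keep the shell from touching the regex.
PATTERN='(?s)<div style="float:left; padding-top:5px;">.*?</div>'

# -z makes grep emit a NUL after each match; tr turns it into a newline.
matched=$(grep -Pzo "$PATTERN" "$workdir/sample.html" | tr '\0' '\n')
printf '%s\n' "$matched"

rm -rf "$workdir"
```

Only the middle div is printed, still wrapped in its opening and closing tags.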

After this you'll need to strip off the divs. sed is a good tool for this.

sed -E 's|</?div[^>]*>||g'
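
A small sketch of the stripping step on its own, using a hypothetical block of the kind grep would return. One detail worth noting: in sed's default (basic) regex syntax a bare ? is a literal question mark, so this sketch passes -E to get extended syntax, where ? means "optional".

```shell
# Hypothetical input: one matched div as grep would return it.
block='<div style="float:left; padding-top:5px;">
    Data to keep
</div>'

# -E enables extended regex so that /? matches an optional slash,
# covering both the opening <div ...> and the closing </div>.
stripped=$(printf '%s\n' "$block" | sed -E 's|</?div[^>]*>||g')
printf '%s\n' "$stripped"
```

The tags are removed but the lines they sat on remain as empty lines. If that matters, they can be dropped afterwards, for example with grep -v '^$'.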

If you put all of your files in one directory you can do the joining at the same time:

grep -Pzoh "$PATTERN" *.html | tr '\0' '\n' | sed -E 's|</?div[^>]*>||g' > out.html
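
The joining step can be tried out on throwaway files first. In this sketch the directory, the file names page1.html and page2.html, and their contents are all invented for illustration. It assumes GNU grep and GNU sed, and adds -h so grep does not prefix each match with its file name when several files are given.

```shell
# Build two hypothetical input files in a scratch directory.
workdir=$(mktemp -d)
for n in 1 2; do
  printf '<div style="float:left; padding-top:5px;">\n    Data %s\n</div>\n<div style="other">x</div>\n' \
    "$n" > "$workdir/page$n.html"
done

PATTERN='(?s)<div style="float:left; padding-top:5px;">.*?</div>'

# Extract the data divs from every file, turn the NUL separators into
# newlines, then strip the div tags themselves.
combined=$(grep -Pzoh "$PATTERN" "$workdir"/*.html \
  | tr '\0' '\n' \
  | sed -E 's|</?div[^>]*>||g')
printf '%s\n' "$combined"

rm -rf "$workdir"
```

The output contains the data from both files in glob order, with no tags and no file-name prefixes. The unrelated div with style="other" is left out.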
