regex - Using regular expressions to parse HTML: why not?

Question

Welcome To Ask or Share your Answers For Others

regex - Using regular expressions to parse HTML: why not?

posted Oct 16, 2021 in Technique[技术] by 深蓝 (71.8m points)

regex - Using regular expressions to parse HTML: why not?

It seems like every question on stackoverflow where the asker is using regex to grab some information from HTML will inevitably have an "answer" that says not to use regex to parse HTML.

Why not? I'm aware that there are quote-unquote "real" HTML parsers out there like Beautiful Soup, and I'm sure they're powerful and useful, but if you're just doing something simple, quick, or dirty, then why bother using something so complicated when a few regex statements will work just fine?

Moreover, is there just something fundamental that I don't understand about regex that makes them a bad choice for parsing in general?

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-16T05:43:55+0000

Entire HTML parsing is not possible with regular expressions, since it depends on matching the opening and the closing tag which is not possible with regexps.

Regular expressions can only match regular languages but HTML is a context-free language and not a regular language (As @StefanPochmann pointed out, regular languages are also context-free, so context-free doesn't necessarily mean not regular). The only thing you can do with regexps on HTML is heuristics but that will not work on every condition. It should be possible to present a HTML file that will be matched wrongly by any regular expression.

Categories

regex - Using regular expressions to parse HTML: why not?

regex - Using regular expressions to parse HTML: why not?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags