As its name suggests, strip_tags should remove all HTML tags. The only way to prove it is by analyzing the source code. The following analysis applies to a strip_tags('...') call, without a second argument for whitelisted tags.
First of all, some theory about HTML tags: a tag starts with a < followed by non-whitespace characters. If this string starts with a ?, it should not be parsed. If this string starts with a !--, it is considered a comment and the following text should not be parsed either. A comment is terminated with a -->; inside such a comment, characters like < and > are allowed. Attributes can occur in tags, and their values may optionally be surrounded by a quote character (' or "). If such a quote exists, it must be closed; otherwise, when a > is encountered, the tag is not closed.
The code <a href="example>xxx</a><a href="second">text</a>
is interpreted in Firefox as:
<a href="http://example.com%3Exxx%3C/a%3E%3Ca%20href=" second"="">text</a>
The PHP function strip_tags is referenced in line 4036 of ext/standard/string.c. That function calls the internal function php_strip_tags_ex.
Two buffers exist, one for the output, the other for "inside HTML tags". A counter named depth holds the number of open angle brackets (<). The variable in_q contains the quote character (' or ") if any, and 0 otherwise. The last character is stored in the variable lc.
The function has five states, three of which are mentioned in the description above the function. Based on this information and the function body, the following states can be derived:
- State 0 is the output state (not in any tag)
- State 1 means we are inside a normal HTML tag (the tag buffer contains <)
- State 2 means we are inside a PHP tag
- State 3: we came from the output state and encountered the < and ! characters (the tag buffer contains <!)
- State 4: inside an HTML comment
We just need to be careful that no tag can be inserted, that is, a < followed by a non-whitespace character. Line 4326 checks a case with the < character, which is described below:
- If inside quotes (e.g. <a href="inside quotes">), the < character is ignored (removed from the output).
- If the next character is a whitespace character, < is added to the output buffer.
- If outside an HTML tag, the state becomes 1 ("inside HTML tag") and the last character lc is set to <.
- Otherwise, if inside an HTML tag, the counter named depth is incremented and the character is ignored.
If > is met while the tag is open (state == 1), in_q becomes 0 ("not in a quote") and state becomes 0 ("not in a tag"). The tag buffer is discarded.
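The handling of <, > and quote characters described above can be sketched as a small model. This is a simplified illustration, not the actual php_strip_tags_ex code: the function name strip_tags_model is hypothetical, only states 0 and 1 are modelled (PHP tags, comments and the whitelist are left out), and decrementing depth on a later > is an assumption about how the counter is balanced.

```php
<?php
// Simplified model of the '<', '>' and quote handling described above.
// Only states 0 and 1 are modelled; PHP tags, comments and the tag
// whitelist are deliberately left out.
function strip_tags_model(string $input): string
{
    $out   = '';  // output buffer
    $state = 0;   // 0 = outside a tag, 1 = inside an HTML tag
    $depth = 0;   // extra '<' characters seen inside a tag
    $inQ   = '';  // active quote character, '' when not inside quotes
    $len   = strlen($input);

    for ($i = 0; $i < $len; $i++) {
        $c = $input[$i];

        if ($c === '<') {
            if ($inQ !== '') {
                continue; // inside quotes: ignored
            }
            $nextIsSpace = $i + 1 < $len && ctype_space($input[$i + 1]);
            if (!$nextIsSpace) {
                if ($state === 0) {
                    $state = 1;  // a tag opens
                } else {
                    $depth++;    // nested '<' inside a tag
                }
                continue;
            }
            // '<' followed by whitespace falls through as a regular character
        } elseif ($c === '>') {
            if ($depth > 0) {    // assumption: balances an earlier nested '<'
                $depth--;
                continue;
            }
            if ($inQ !== '') {
                continue;        // inside quotes: the tag stays open
            }
            if ($state === 1) {
                $state = 0;      // tag closed, its content is discarded
                continue;
            }
            // outside a tag, '>' falls through as a regular character
        } elseif (($c === '"' || $c === "'") && $state === 1) {
            if ($inQ === '') {
                $inQ = $c;       // quote opens
            } elseif ($inQ === $c) {
                $inQ = '';       // matching quote closes
            }
            continue;
        }

        if ($state === 0) {
            $out .= $c;          // only text outside tags reaches the output
        }
    }

    return $out;
}
```

Under this model, the input >< a>> comes back unchanged, <a href="in tag">outside tag</a> is reduced to outside tag, and a tag with an unclosed quote swallows everything after it.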
Attribute checks (for characters like ' and ") are done on the tag buffer, which is discarded. So the conclusion is:
strip_tags without a tag whitelist is safe for inclusion outside tags; no tag will be allowed.
By "outside tags", I mean not in tags as in <a href="in tag">outside tag</a>. Text may contain < and > though, as in >< a>>. The result is not valid HTML, however: <, > and & still need to be escaped, especially the &. That can be done with htmlspecialchars().
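A minimal sketch of combining the two calls is shown here. The helper name text_for_html is hypothetical, and the ENT_QUOTES flag and UTF-8 encoding are assumptions that should be adapted to the page in question.

```php
<?php
// text_for_html() is a hypothetical helper name, not part of PHP.
function text_for_html(string $input): string
{
    // Remove the tags first, then escape whatever <, > and & remain.
    $text = strip_tags($input);

    // ENT_QUOTES and UTF-8 are assumptions; adjust them to the page encoding.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

echo text_for_html('<b>Tom & Jerry</b>'), "\n"; // Tom &amp; Jerry
```

The order matters: escaping first would turn every < into &lt;, leaving strip_tags with nothing to remove.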
The description for strip_tags without a whitelist argument would be:
Makes sure that no HTML tag exists in the returned string.