Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


Regex disallow characters from anywhere in string that has capture groups

For the Accept HTTP header, I have created the following regex to validate the header's value:

^(([^/]+[/][^/;,]+)(;q[ ]*=[ ]*[0-9][.][0-9])?([,][ ]*)?)+$

This works for the various examples of valid header values, such as the single value text/html, application/xhtml+xml, application/xml;q=0.9, image/webp, */*;q=0.8. However, Accept is a CORS-safelisted request header, which adds a restriction. The following characters must not appear anywhere in the value: ():<>?@[]{}, 0x00-0x1f (except 0x09), 0x7f.

So I tried to just disallow the : by putting [^:]* at the start or end of the regex string, with no effect. Is this syntax right, and if so where do I need to put it in order for it to apply to the entire string?

If the regex string is just ^[^:]*$, it does disallow : anywhere in the string, so I'm not sure whether the header regex fails to do so because of its capture groups. I don't have much experience with regex. I will be using the regex in Python 3.9.

Question from: https://stackoverflow.com/questions/65859433/regex-disallow-characters-from-anywhere-in-string-that-has-capture-groups

1 Reply


Based on RFC 7231 (https://tools.ietf.org/html/rfc7231#section-5.3.2), each media type also accepts optional key=value parameters besides the q=0.7 weight. Here is a more comprehensive test case that covers this:

text/*, text/plain, text/plain;format=flowed, text/html;level=1, text/html;level=2;q=0.4, application/xhtml+xml, application/xml;q=0.9, image/webp, */*;q=0.1

This regex tests for RFC compliance of the Accept value:

/^(?!.*[():<>?@[\]{}\x00-\x08\x0a-\x1f\x7f])\s*[^\/]+\/[^,;]+(;\s*[^=]+(?<!q)=[^,;]+)*(;\s*q=[01](\.\d{1,3})?)?(,\s*[^\/]+\/[^,;]+(;\s*[^=]+(?<!q)=[^,;]+)*(;\s*q=[01](\.\d{1,3})?)?)*\s*$/
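Since the question targets Python 3.9, here is a minimal sketch of how this pattern might be used with the re module. It assumes the pattern's escapes are \s, \d, \. and \xNN, and it drops the /.../ delimiters, which Python does not use. The names MEDIA_RANGE, ACCEPT_RE and is_valid_accept are made up for illustration.

```python
import re

# One media range: type/subtype, optional key=value parameters, optional q weight.
# Assumption: escapes are \s, \d, \. as in the pattern above; "/" needs no escape in Python.
MEDIA_RANGE = r"[^/]+/[^,;]+(;\s*[^=]+(?<!q)=[^,;]+)*(;\s*q=[01](\.\d{1,3})?)?"

ACCEPT_RE = re.compile(
    # Negative lookahead: reject the whole string if a forbidden character occurs anywhere.
    r"^(?!.*[():<>?@\[\]{}\x00-\x08\x0a-\x1f\x7f])"
    # First media range, then zero or more comma-separated ones.
    r"\s*" + MEDIA_RANGE + r"(,\s*" + MEDIA_RANGE + r")*\s*$"
)


def is_valid_accept(value: str) -> bool:
    """Return True if value matches the Accept pattern sketched above."""
    return ACCEPT_RE.match(value) is not None


if __name__ == "__main__":
    header = (
        "text/*, text/plain, text/plain;format=flowed, text/html;level=1, "
        "text/html;level=2;q=0.4, application/xhtml+xml, "
        "application/xml;q=0.9, image/webp, */*;q=0.1"
    )
    print(is_valid_accept(header))        # True
    print(is_valid_accept("text/html:"))  # False, ":" is a forbidden character
```

Building the pattern from a shared MEDIA_RANGE string is only a readability choice; the compiled regex is equivalent to writing the media-range part out twice.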

The same pattern split over multiple lines for readability (not a valid regex in this form):

/^(?!.*[():<>?@[\]{}\x00-\x08\x0a-\x1f\x7f])
  \s*[^\/]+\/[^,;]+(;\s*[^=]+(?<!q)=[^,;]+)*(;\s*q=[01](\.\d{1,3})?)?
(,\s*[^\/]+\/[^,;]+(;\s*[^=]+(?<!q)=[^,;]+)*(;\s*q=[01](\.\d{1,3})?)?)*
\s*$/
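In Python, a multi-line layout like this can be made into a working pattern with the re.VERBOSE flag, which ignores whitespace and allows # comments outside character classes. The following is a sketch under the same assumptions about the escapes (\s, \d, \., \xNN); the name ACCEPT_VERBOSE is hypothetical.

```python
import re

# Verbose-mode version: whitespace and "#" comments in the pattern are ignored,
# so the layout can mirror the multi-line listing above.
ACCEPT_VERBOSE = re.compile(
    r"""
    ^(?!.*[():<>?@\[\]{}\x00-\x08\x0a-\x1f\x7f])  # no forbidden character anywhere
    \s*[^/]+/[^,;]+                                # first media range, e.g. text/plain
      (;\s*[^=]+(?<!q)=[^,;]+)*                    # optional key=value parameters
      (;\s*q=[01](\.\d{1,3})?)?                    # optional q weight
    (,\s*[^/]+/[^,;]+                              # further media ranges after a comma
      (;\s*[^=]+(?<!q)=[^,;]+)*
      (;\s*q=[01](\.\d{1,3})?)?
    )*
    \s*$
    """,
    re.VERBOSE,
)

if __name__ == "__main__":
    print(bool(ACCEPT_VERBOSE.match("text/html;level=2;q=0.4, */*;q=0.1")))  # True
    print(bool(ACCEPT_VERBOSE.match("text/html, image/webp{")))              # False
```

This works here because the pattern contains no literal spaces or # characters that verbose mode would otherwise swallow.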

Explanation:

  • ^ - anchor to start of string
  • (?!.*[():<>?@[\]{}\x00-\x08\x0a-\x1f\x7f]) - negative lookahead that tests for any of the invalid characters: greedy .* match up to [...] character class
  • \s* - scan over optional whitespace
  • [^\/]+\/[^,;]+ - scan over 1+ chars, a /, and anything up to , or ;, e.g. a media type, such as text/plain
  • ( ... )* - scan over optional key=value pattern, zero to multiple times:
    • ;\s* - scan over ; separator, and optional whitespace
    • [^=]+(?<!q) - scan over anything up to =, as long as it does not end in q (negative lookbehind), so q=... is left for the next group
    • = - scan over =
    • [^,;]+ - scan over anything up to , or ;
  • ( ... )? - scan over optional q=1 or q=0.001 pattern:
    • ;\s* - scan over ; separator, and optional whitespace
    • q=[01] - scan over q=0 or q=1
    • (\.\d{1,3})? - followed by optional . and 1 to 3 digits (based on RFC)
  • the first media type is covered up to this point
  • ( ... )* - scan over optional (zero to multiple) additional media types:
    • ,\s* - scan over , separator, and optional whitespace
    • followed by same pattern as above to: scan for media type, scan for optional key=value patterns separated by ;, scan for optional q=... pattern
  • \s* - scan over optional whitespace
  • $ - anchor string at the end
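As for why putting [^:]* at the start or end of the original regex had no effect: [^:]* consumes characters and is allowed to match zero of them, so the engine simply backtracks and lets the rest of the pattern (where [^/]+ happily matches a colon) cover the string. A lookahead is a zero-width assertion checked from the anchor, so it constrains the whole string regardless of the capture groups. A small sketch using the regex from the question; the names BASE, prefixed and guarded are illustrative only.

```python
import re

# Body of the regex from the question, without its ^ and $ anchors.
BASE = r"(([^/]+[/][^/;,]+)(;q[ ]*=[ ]*[0-9][.][0-9])?([,][ ]*)?)+"

# Attempt from the question: [^:]* in front. It may match the empty string,
# so the remainder of the pattern is still free to match a ":".
prefixed = re.compile(r"^[^:]*" + BASE + r"$")

# Negative lookahead: fails up front if ":" occurs anywhere in the string.
guarded = re.compile(r"^(?!.*:)" + BASE + r"$")

bad = "text/html, image/webp:"
good = "text/html, image/webp"

if __name__ == "__main__":
    print(bool(prefixed.match(bad)))   # True, the colon slips through
    print(bool(guarded.match(bad)))    # False
    print(bool(guarded.match(good)))   # True
```

The character class inside the lookahead can then be widened from : to the full forbidden set, as the answer's pattern does.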
