Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


parsing - How can we write a regular expression (regex) to identify quantities with units, such as "54.20 grams"?

Below are some example test inputs.
Test inputs are ASCII-encoded strings.

TEST CASE INPUTS

arrhar = Array(100)
arrhar[1] = "Low Carb Orzo Low Carb Rice, High Protein, Great Low Carb Bread Company, Low Carb Pasta Rice, 7 g per pack"
arrhar[2] = "Helios Certified Organic Greek Orzo Pasta, 500gr"
arrhar[3] = "Barilla Orzo Pasta 15.73 oz."
arrhar[4] = "Pasta Granoro Il Primo Orzo 6 ounces per bag"
arrhar[5] = "Authentic Italian Orzo -- 6 OUNCE per bag"
arrhar[6] = "ORZO PASA 4 U! 1 BAGGY IZ 4.39-GRM"
arrhar.trim() 
# `trim()` removes all elements of the array which have memory allocated, but no value assigned.    

TEST CASE OUTPUTS

out[1] = "7 g"    
out[2] = "500gr"     
out[3] = "15.73 oz"      
out[4] = "6 ounces"    
out[5] = "6 OUNCE"       
out[6] = "4.39-GRM"

English Description of Regular Expression

Suppose that we represent a string-matching pattern as a numbered list, ordered left to right.
Item (1) is the left-most part of the string.
The last item is the right-most part of the string.

  1. Numeric Quantity
    1. Zero or more Latin numerals
    2. zero or one decimal points or commas
    3. Zero or more Latin numerals
  2. Optional Delimiter
    1. Zero or more of any character except chars from the classes [A-Z], [a-z], and \d
  3. Unit
    1. Grams
      1. Any case-insensitive subsequence of "GRAMS"
        a. "g"
        b. "GRMS"
        c. "gs"
        d. "Gms"
        e. et cetera...
    2. Ounces
      1. Z-ounces ... any case-insensitive substring of OUNCEZ
      2. S-ounces ... any case-insensitive substring of OUNCES

Regex Pieces

Appropriate regular expressions for the left part (integer part) of a numeric quantity might be:

  • \d*
  • \d{0,}
  • [0-9]{0,}
  • [0123456789]*
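
As a rough check that these four spellings behave the same, here is a small Python sketch. The names `DIGIT_PATTERNS` and `leading_digits` are made up for this illustration, and the shorthand is assumed to be the backslash-escaped digit class `\d`.

```python
import re

# Four spellings of "zero or more ASCII digits", as listed above.
DIGIT_PATTERNS = [r"\d*", r"\d{0,}", r"[0-9]{0,}", r"[0123456789]*"]


def leading_digits(text):
    """Return what each pattern matches at the start of text."""
    # re.ASCII restricts \d to 0-9, which suits the ASCII-only test inputs.
    return [re.match(p, text, re.ASCII).group() for p in DIGIT_PATTERNS]


for sample in ["500gr", "15.73 oz.", "OUNCE"]:
    print(repr(sample), leading_digits(sample))
```

Because `*` allows zero digits, all four patterns also succeed with an empty match on text that has no leading number, such as "OUNCE".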

A regex for zero or one decimal points is [.,]?

A decimal number is \d*[.,]\d

There might or might not be a delimiter between the number and the unit specification.

  • 56.1gr
  • 56.1 gr
  • 56.1-grams

A suitable regexp for the delimiter might be [^a-zA-Z0-9]*
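
To see what the delimiter class picks up in the three examples, here is a minimal Python sketch. The surrounding number and letter-run parts of `PROBE` are stand-ins chosen for this illustration, not part of the question, and `split_delimiter` is a hypothetical helper.

```python
import re

# Only the middle group, [^a-zA-Z0-9]*, is the delimiter pattern under test.
# The number part and the trailing letter run are assumptions that make the
# delimiter observable in isolation.
PROBE = re.compile(r"\d+(?:[.,]\d+)?([^a-zA-Z0-9]*)([a-zA-Z]+)")


def split_delimiter(text):
    """Return (delimiter, unit letters) for the first quantity-like span."""
    m = PROBE.search(text)
    return (m.group(1), m.group(2)) if m else None


for sample in ["56.1gr", "56.1 gr", "56.1-grams"]:
    print(repr(sample), split_delimiter(sample))
```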

Suppose that we write a regex for the number and delimiter, but not the units (e.g. "ounces"). We might have:

\d*[.,]?\d[^a-zA-Z0-9]*?

I hope that the above would match "4.91...." or "4.91 "

A regex for subsequences of "GRAMS" might be: [Gg]?[Rr]?[Aa]?[Mm]?[Ss]?

A regex which captures something like "4.1-grm" is shown below:

\d*[.,]?\d[^a-zA-Z0-9]*?[Gg]?[Rr]?[Aa]?[Mm]?[Ss]?

How can we match both grams and ounces?

question from:https://stackoverflow.com/questions/65647077/how-can-we-write-a-regular-expression-regex-to-identify-quantities-with-units


1 Reply


Using a ? to make all the parts optional in [Gg]?[Rr]?[Aa]?[Mm]?[Ss]? could possibly also match RM or an empty string.
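
A short Python sketch of that point: with every letter optional, the unit pattern accepts any in-order subset of the letters, including none at all. `ALL_OPTIONAL` and `full_match` are names invented for this illustration.

```python
import re

# The unit pattern from the question: every letter is optional.
ALL_OPTIONAL = re.compile(r"[Gg]?[Rr]?[Aa]?[Mm]?[Ss]?")


def full_match(text):
    """True if the whole of text is accepted by the all-optional pattern."""
    return ALL_OPTIONAL.fullmatch(text) is not None


for candidate in ["grams", "GRM", "RM", "", "mgr"]:
    print(repr(candidate), full_match(candidate))
```

Only "mgr" is rejected, because the letters are out of order. "RM" and the empty string both pass, so a bare number with no unit would still count as a match.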

You might use a case-insensitive match with an alternation | that lists the possible alternatives, making them a bit more specific.

\b\d+(?:[.,]\d+)?\s*(?:gr?|oz|ounces?|-grm|grams?)\b
  • \b A word boundary
  • \d+ Match 1+ digits
  • (?:[.,]\d+)? Optionally match either . or , and 1+ digits
  • \s* Match 0+ whitespace chars
  • (?:gr?|oz|ounces?|-grm|grams?) Match one of the alternatives
  • \b A word boundary
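
One way to check this pattern against the six test strings from the question is sketched below in Python. It assumes the expression is read with its backslash escapes in place (`\b`, `\d`, `\s`) and compiled case-insensitively. `QUANTITY`, `TEST_INPUTS`, and `first_quantity` are names made up for this illustration.

```python
import re

# Alternation-based pattern, compiled case-insensitively.
QUANTITY = re.compile(
    r"\b\d+(?:[.,]\d+)?\s*(?:gr?|oz|ounces?|-grm|grams?)\b",
    re.IGNORECASE,
)

# The six test inputs from the question.
TEST_INPUTS = [
    "Low Carb Orzo Low Carb Rice, High Protein, Great Low Carb Bread Company, "
    "Low Carb Pasta Rice, 7 g per pack",
    "Helios Certified Organic Greek Orzo Pasta, 500gr",
    "Barilla Orzo Pasta 15.73 oz.",
    "Pasta Granoro Il Primo Orzo 6 ounces per bag",
    "Authentic Italian Orzo -- 6 OUNCE per bag",
    "ORZO PASA 4 U! 1 BAGGY IZ 4.39-GRM",
]


def first_quantity(text):
    """Return the first quantity-with-unit found in text, or None."""
    m = QUANTITY.search(text)
    return m.group() if m else None


for line in TEST_INPUTS:
    print(first_quantity(line))
```

The closing word boundary is what lets "54.20 grams" match in full even though `gr?` is listed before `grams?`. After "g" or "gr" the next character is still a letter, so the engine backtracks and moves on to the later alternative. In the last input, "4 U!" and "1 BAGGY" are skipped because no unit follows the number.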


Another option, for example, is to nest non-capturing groups so that selected parts are optional, but only in a certain order:

\b\d+(?:[.,]\d+)?\s*-?(?:g(?:r(?:a?ms?)?)?|oz|ounces?)\b
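
To see which unit spellings the nested groups accept, here is a small Python sketch. It assumes the pattern is read with its backslash escapes and a word boundary at each end, and that it is compiled case-insensitively. `NESTED` and `is_quantity` are hypothetical names.

```python
import re

# Nested optional groups: "g", then optionally "r", then optionally a?ms?.
NESTED = re.compile(
    r"\b\d+(?:[.,]\d+)?\s*-?(?:g(?:r(?:a?ms?)?)?|oz|ounces?)\b",
    re.IGNORECASE,
)


def is_quantity(text):
    """True if the whole of text is a number followed by an accepted unit."""
    return NESTED.fullmatch(text) is not None


units = ["g", "gr", "grm", "grms", "gram", "grams",
         "oz", "ounce", "ounces", "gms", "gs", "rm"]
for unit in units:
    print(unit, is_quantity("5 " + unit))
```

Each inner group can only appear if the one enclosing it matched, so "gms", "gs", and "rm" are rejected while "g", "gr", "grm", "grms", "gram", and "grams" are accepted. The optional `-?` replaces the separate `-grm` alternative, so "4.39-GRM" is still covered.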


