We have a "street_number" field that has been filled in freely over the years, and we now want to format it. Using regular expressions, we'd like to extract the real "street_number" and the "street_number_suffix".
Ex: for `17 b`, "street_number" would be `17` and "street_number_suffix" would be `b`.
As there are about a dozen different patterns, I'm having trouble tuning the regular expression correctly. I'm considering using 2 different regexes, one to extract the "street_number" and another to extract the "street_number_suffix".
Here's an exhaustive set of patterns we'd like to format and the expected output:
# Extract street_number using PCRE
input     street_number     street_number_suffix
19-21     19                null
2 G       2                 G
A         null              A
1 bis     1                 bis
3 C       3                 C
N°10      10                null
17 b      17                b
76 B      76                B
7 ter     7                 ter
9/11      9                 null
21.3      21                3
42        42                null
I know I could use an expression that matches any digits up to a hyphen using `\d+(?=-)`. It could be extended to match up to a hyphen OR a slash using `\d+(?=-|/)`, though once I add `\s` to this pattern, `21` from `19-21` will match as well. Adding conditions may not be that simple, which is why I'm asking for your help.
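The behaviour described above can be reproduced with a small sketch. It uses Python's `re` module rather than PCRE, and it assumes the values are tested as one multi-line string (one value per line, as in a regex tester), so every value is followed by a newline. The `sample` string and variable names are made up for illustration.

```python
import re

# Assumption: one raw value per line, so each value is followed by "\n".
sample = "19-21\n9/11\n17 b\n"

# Digits followed by a hyphen.
only_hyphen = re.findall(r"\d+(?=-)", sample)

# Digits followed by a hyphen or a slash.
hyphen_or_slash = re.findall(r"\d+(?=-|/)", sample)

# Adding \s to the lookahead also accepts digits followed by a space or a
# newline, so the second half of a range ("21", "11") now matches too.
with_whitespace = re.findall(r"\d+(?=-|/|\s)", sample)

print(only_hyphen)
print(hyphen_or_slash)
print(with_whitespace)
```

The lookahead only constrains what comes after the digits, so it cannot tell the first number of `19-21` from the second one; that needs a condition on what comes before the digits.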
Could anyone give me a helping hand with this? If it can help, here's a draft: https://regex101.com/r/jGK5Sa/4
Edit: at the time of this edit, here's the closest regex I could find:
`(?:(N°|(?<!-|/|\.|[a-z]|.{1})))\d+`
Though the full match for `N°10` isn't `10` but `N°10` (and our ETL doesn't support capturing groups, so I can't use `/......(\d+)/`).
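One possible direction, given the constraint that only the full match is usable, is to move the conditions into lookbehinds so that nothing but the wanted text is consumed. The sketch below is only an illustration under assumptions: `NUMBER_RE`, `SUFFIX_RE`, `full_match` and `split_street_number` are hypothetical names, the patterns are not taken from the draft above, and they are checked with Python's `re` against the twelve listed inputs only. Both patterns use fixed-width lookbehinds, so they should in principle carry over to PCRE, but that has not been verified against the ETL.

```python
import re

# Hypothetical patterns (assumptions, not the question's draft).
# Number: digits not preceded by a digit, "/", "." or "-".
NUMBER_RE = re.compile(r"(?<![\d/.-])\d+")
# Suffix: letters after "digit + whitespace", or a letters-only value,
# or digits after "digit + dot".
SUFFIX_RE = re.compile(r"(?<=\d\s)[A-Za-z]+|^[A-Za-z]+$|(?<=\d\.)\d+")

def full_match(pattern, text):
    """Return the full match (group 0) of the first hit, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None

def split_street_number(raw):
    """Return (street_number, street_number_suffix) for one raw value."""
    return full_match(NUMBER_RE, raw), full_match(SUFFIX_RE, raw)

# Expected output copied from the table in the question (null -> None).
EXPECTED = {
    "19-21": ("19", None),
    "2 G": ("2", "G"),
    "A": (None, "A"),
    "1 bis": ("1", "bis"),
    "3 C": ("3", "C"),
    "N°10": ("10", None),
    "17 b": ("17", "b"),
    "76 B": ("76", "B"),
    "7 ter": ("7", "ter"),
    "9/11": ("9", None),
    "21.3": ("21", "3"),
    "42": ("42", None),
}

# Collect every input whose result differs from the expected pair.
mismatches = {
    raw: split_street_number(raw)
    for raw, want in EXPECTED.items()
    if split_street_number(raw) != want
}
print(mismatches)
```

Because the lookbehinds do not consume characters, the full match for `N°10` is `10` without any capturing group. Inputs outside the listed set (for example leading or trailing spaces) are not covered by this sketch.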
question from:
https://stackoverflow.com/questions/65904349/match-street-number-from-different-formats-without-suffixes