Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Regex to match company names from copyright statements under several conditions

I'm on a tight schedule and need a Python regex that matches company names in many different copyright statements, for instance:

Copyright © 2019 Apple Inc. All rights reserved.
© 2019 Quid, Inc. All Rights Reserved.
© 2009 Database Designs
© 2019 Rediker Software, All Rights Reserved
©2019 EVOSUS, INC. ALL RIGHTS RESERVED
© 2019 Walmart. All Rights Reserved.
© Copyright 2003-2019 Exxon Mobil Corporation. All Rights Reserved.
Copyright © 1978-2019 Berkshire Hathaway Inc.
© 2019 McKesson Corporation
© 2019 UnitedHealth Group. All rights reserved.
© Copyright 1999 - 2019 CVS Health
Copyright 2019 General Motors. All Rights Reserved.
© 2019 Ford Motor Company
©2019 AT&T Intellectual Property. All rights reserved.
© 2019 GENERAL ELECTRIC
Copyright ©2019 AmerisourceBergen Corporation. All Rights Reserved.
© 2019 Verizon
© 2019 Fannie Mae
Copyright © 2018 Jonas Construction Software Inc. All rights reserved.
All Comments © Copyright 2017 Kroger | The Kroger Co. All Rights Reserved
© 2019 Express Scripts Holding Company. All Rights Reserved. 1 Express Way, St. Louis, MO 63121
© 2019 JPMorgan Chase & Co.
Copyright © 1995 - 2018 Boeing. All Rights Reserved.
© 2019 Bank of America Corporation. All rights reserved.
© 1999 - 2019 Wells Fargo. All rights reserved. NMLSR ID 399801
©2019 Cardinal Health. All rights reserved.

What I know of regex is only very basic stuff and at the moment not enough to come up with a good solution fast.

As far as I can tell, at least for these examples, the requirements for capturing the company name correctly are the following:

If there's a '©' or 'Copyright' in the sentence:
    After '©' or 'Copyright', look for a year, e.g. '2019', or a year range, e.g. '1995 - 2018' or '2003-2019' (the optional spaces around the dash need to be matched as well):
        If there's a dot somewhere after this year/year range, capture the text up to the dot. E.g. in 'Copyright © 1978-2019 Berkshire Hathaway Inc.' capture 'Berkshire Hathaway Inc'
        If there's no dot but there's the phrase 'All rights reserved', capture from the year/year range up to that phrase, and also ignore any non-alphanumeric characters that precede it, such as spaces and commas. E.g. from '© 2019 Rediker Software, All Rights Reserved' capture 'Rediker Software'
        If there's neither a dot nor the phrase 'All rights reserved', capture from the year/year range to the end. E.g. from '© 2019 Verizon' capture 'Verizon'
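The three rules above can also be written out step by step, which may make them easier to check before folding everything into one pattern. This is only a rough sketch: the names PREFIX, RESERVED and company_name are made up for illustration, the rules are applied in the order listed, and four-digit years (\d{4}) are an assumption.

```python
import re

# Marker (© and/or the word Copyright, in either order) plus a year or year range.
# Assumption: years always have four digits.
PREFIX = re.compile(
    r"(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d{4}(?:\s*-\s*\d{4})?\s*",
    re.IGNORECASE,
)
# "All rights reserved" together with any non-word characters right before it.
RESERVED = re.compile(r"\W*All\s+rights\s+reserved", re.IGNORECASE)


def company_name(line):
    """Return the company name found in one copyright line, or None."""
    prefix = PREFIX.search(line)
    if prefix is None:
        return None  # no marker followed by a year
    rest = line[prefix.end():]

    dot = rest.find(".")
    if dot != -1:
        return rest[:dot]  # rule 1: text up to the first dot

    reserved = RESERVED.search(rest)
    if reserved is not None:
        return rest[:reserved.start()]  # rule 2: text before "All rights reserved"

    return rest.rstrip()  # rule 3: everything up to the end of the line


for line in [
    "Copyright © 1978-2019 Berkshire Hathaway Inc.",
    "© 2019 Rediker Software, All Rights Reserved",
    "© 2019 Verizon",
]:
    print(company_name(line))
```

The function handles one statement per call, so a multi-line text would need to be split into lines first.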

Any advice on a good regex for this?


1 Reply


You may consider a regex like

(?i)(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d+(?:\s*-\s*\d+)?\s*(.*?(?=\W*All\s+rights\s+reserved)|[^.]*(?=\.)|.*)

The pattern has to be matched case-insensitively: either keep the inline (?i) modifier or, in Python, drop it and pass re.I instead.

Details

  • (?:©(?:\s*Copyright)?|Copyright(?:\s*©)?) - either
    • ©(?:\s*Copyright)? - a © char followed by an optional substring of 0+ whitespaces and then Copyright
    • | - or
    • Copyright(?:\s*©)? - Copyright followed by an optional substring of 0+ whitespaces and a © char
  • \s* - 0+ whitespaces
  • \d+ - 1+ digits (use \d{4} if the years always contain 4 digits)
  • (?:\s*-\s*\d+)? - an optional sequence of a - enclosed with 0+ whitespaces and then 1+ digits (use \d{4} if the years always contain 4 digits)
  • \s* - 0+ whitespaces
  • (.*?(?=\W*All\s+rights\s+reserved)|[^.]*(?=\.)|.*) - Capturing group 1: any of the alternatives:
    • .*?(?=\W*All\s+rights\s+reserved) - any 0+ chars other than line break chars, as few as possible, up to the 0+ non-word chars followed by the All rights reserved string
    • [^.]*(?=\.) - any 0+ chars other than ., as many as possible, up to but not including a . (for multi-line input use [^.\n] so the match cannot run on into the next line)
    • .* - the rest of the line
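The breakdown above can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. The following is a sketch of the same regex in that form; the name COPYRIGHT_RX is just for illustration, and it uses [^.\n] instead of [^.] on the assumption that the input may contain several lines.

```python
import re

COPYRIGHT_RX = re.compile(
    r"""
    (?:
        ©(?:\s*Copyright)?      # a © sign, optionally followed by Copyright
      | Copyright(?:\s*©)?      # or Copyright, optionally followed by a © sign
    )
    \s*\d+                      # the year (or the first year of a range)
    (?:\s*-\s*\d+)?             # optional dash and second year
    \s*
    (                           # group 1: the company name
        .*?(?=\W*All\s+rights\s+reserved)   # up to All rights reserved
      | [^.\n]*(?=\.)           # or up to the first dot on the same line
      | .*                      # or the rest of the line
    )
    """,
    re.IGNORECASE | re.VERBOSE,
)

match = COPYRIGHT_RX.search(
    "© Copyright 2003-2019 Exxon Mobil Corporation. All Rights Reserved."
)
print(match.group(1))  # Exxon Mobil Corporation
```

The pattern contains no literal spaces or # characters, so nothing needs escaping when switching to verbose mode.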

Python demo:

import re

# Sample copyright statements, one per line.
s = """Copyright © 2019 Apple Inc. All rights reserved.
© 2019 Quid, Inc. All Rights Reserved.
© 2009 Database Designs
© 2019 Rediker Software, All Rights Reserved
©2019 EVOSUS, INC. ALL RIGHTS RESERVED
© 2019 Walmart. All Rights Reserved.
© Copyright 2003-2019 Exxon Mobil Corporation. All Rights Reserved.
Copyright © 1978-2019 Berkshire Hathaway Inc.
© 2019 McKesson Corporation
© 2019 UnitedHealth Group. All rights reserved.
© Copyright 1999 - 2019 CVS Health
Copyright 2019 General Motors. All Rights Reserved.
© 2019 Ford Motor Company
©2019 AT&T Intellectual Property. All rights reserved.
© 2019 GENERAL ELECTRIC
Copyright ©2019 AmerisourceBergen Corporation. All Rights Reserved.
© 2019 Verizon
© 2019 Fannie Mae
Copyright © 2018 Jonas Construction Software Inc. All rights reserved.
All Comments © Copyright 2017 Kroger | The Kroger Co. All Rights Reserved
© 2019 Express Scripts Holding Company. All Rights Reserved. 1 Express Way, St. Louis, MO 63121
© 2019 JPMorgan Chase & Co.
Copyright © 1995 - 2018 Boeing. All Rights Reserved.
© 2019 Bank of America Corporation. All rights reserved.
© 1999 - 2019 Wells Fargo. All rights reserved. NMLSR ID 399801
©2019 Cardinal Health. All rights reserved.
© 2019 Quid, Inc All Rights Reserved."""

# [^.\n] (rather than [^.]) keeps the "up to the first dot" alternative on one line.
rx = r"(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d+(?:\s*-\s*\d+)?\s*(.*?(?=\W*All\s+rights\s+reserved)|[^.\n]*(?=\.)|.*)"

# With a single capturing group, findall returns the group 1 strings only.
for m in re.findall(rx, s, re.I):
    print(m)

Output:

Apple Inc
Quid, Inc
Database Designs 
Rediker Software
EVOSUS, INC
Walmart
Exxon Mobil Corporation
Berkshire Hathaway Inc
McKesson Corporation
UnitedHealth Group
CVS Health
General Motors
Ford Motor Company
AT&T Intellectual Property
GENERAL ELECTRIC
AmerisourceBergen Corporation
Verizon
Fannie Mae
Jonas Construction Software Inc
Kroger | The Kroger Co
Express Scripts Holding Company
JPMorgan Chase & Co
Boeing
Bank of America Corporation
Wells Fargo
Cardinal Health
Quid, Inc
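One detail matters when the statements are matched inside a single multi-line string, as above: a negated class such as [^.] also matches line breaks, so the "up to the first dot" alternative can run on into the next line unless \n is excluded as well. A small sketch of the difference, with hypothetical helper names (HEAD, build, loose, strict):

```python
import re

# Shared first part: marker, year or year range, optional whitespace.
HEAD = r"(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d+(?:\s*-\s*\d+)?\s*"


def build(dot_class):
    """Compile the full pattern with the given class for the dot alternative."""
    tail = r"(.*?(?=\W*All\s+rights\s+reserved)|" + dot_class + r"*(?=\.)|.*)"
    return re.compile(HEAD + tail, re.IGNORECASE)


loose = build(r"[^.]")     # may cross line breaks
strict = build(r"[^.\n]")  # stays on the current line

text = "© 2019 Verizon\n© 2019 Walmart. All Rights Reserved."

print(loose.findall(text))   # ['Verizon\n© 2019 Walmart']
print(strict.findall(text))  # ['Verizon', 'Walmart']
```

When each statement is matched on its own (for example after splitting the text with splitlines()), both variants behave the same.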
