I'm on a tight schedule to come up with a Python regex that matches company names in many different copyright statements, for instance:
Copyright © 2019 Apple Inc. All rights reserved.
© 2019 Quid, Inc. All Rights Reserved.
© 2009 Database Designs
© 2019 Rediker Software, All Rights Reserved
©2019 EVOSUS, INC. ALL RIGHTS RESERVED
© 2019 Walmart. All Rights Reserved.
© Copyright 2003-2019 Exxon Mobil Corporation. All Rights Reserved.
Copyright © 1978-2019 Berkshire Hathaway Inc.
© 2019 McKesson Corporation
© 2019 UnitedHealth Group. All rights reserved.
© Copyright 1999 - 2019 CVS Health
Copyright 2019 General Motors. All Rights Reserved.
© 2019 Ford Motor Company
©2019 AT&T Intellectual Property. All rights reserved.
© 2019 GENERAL ELECTRIC
Copyright ©2019 AmerisourceBergen Corporation. All Rights Reserved.
© 2019 Verizon
© 2019 Fannie Mae
Copyright © 2018 Jonas Construction Software Inc. All rights reserved.
All Comments © Copyright 2017 Kroger | The Kroger Co. All Rights Reserved
© 2019 Express Scripts Holding Company. All Rights Reserved. 1 Express Way, St. Louis, MO 63121
© 2019 JPMorgan Chase & Co.
Copyright © 1995 - 2018 Boeing. All Rights Reserved.
© 2019 Bank of America Corporation. All rights reserved.
© 1999 - 2019 Wells Fargo. All rights reserved. NMLSR ID 399801
©2019 Cardinal Health. All rights reserved.
My knowledge of regex is very basic, and at the moment it isn't enough to come up with a good solution quickly.
As far as I can tell, at least for these examples, the requirements for correctly capturing the company name are the following:
If there's a '©' or 'Copyright' in the sentence:
After '©' or 'Copyright', look for a year, e.g. '2019', or a year range, e.g. '1995 - 2018' or '2003-2019' (the optional spaces around the hyphen need to be matched as well):
If there's a dot somewhere after this year/year range, capture the text up to the dot. E.g. in 'Copyright © 1978-2019 Berkshire Hathaway Inc.' capture 'Berkshire Hathaway Inc'
If there's no dot but there's the phrase 'All rights reserved', capture from the year/year range up to that phrase, ignoring any non-alphanumeric characters that precede it, such as spaces and commas. E.g. from '© 2019 Rediker Software, All Rights Reserved' capture 'Rediker Software'
If there's neither a dot nor the phrase 'All rights reserved', capture from the year/year range to the end. E.g. from '© 2019 Verizon' capture 'Verizon'
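To make the rules above concrete, here is a rough sketch of how they might translate into a single pattern. It is only a starting point under some assumptions: the names PATTERN and extract_company are made up for illustration, the marker accepts a literal '?' in case the copyright symbol arrives mis-encoded, and only a plain hyphen is handled in year ranges.

```python
import re

# Hypothetical sketch of the rules listed above, not a vetted solution.
PATTERN = re.compile(
    r"""
    (?:©|\?|Copyright)          # marker; the literal ? is an assumption for mis-encoded input
    \s*
    \d{4}(?:\s*-\s*\d{4})?      # a year, or a year range with optional spaces
    \s+
    (?P<name>.+?)               # company name, matched lazily
    (?=
        \.                              # rule 1: stop at the first dot
      | \W*all\s+rights\s+reserved      # rule 2: stop before the phrase and any leading punctuation
      | \s*$                            # rule 3: otherwise run to the end
    )
    """,
    re.IGNORECASE | re.VERBOSE,
)


def extract_company(statement):
    """Return the captured company name, or None if no marker and year are found."""
    match = PATTERN.search(statement)
    return match.group("name").strip() if match else None


print(extract_company("© 2019 Rediker Software, All Rights Reserved"))
```

Because the name is matched lazily, it ends at whichever of the three stop conditions occurs first. One consequence of the dot rule as stated is that a name like 'Quid, Inc.' comes back as 'Quid, Inc', and any name with a dot in the middle would be cut short.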
Any advice on a good regex for this?