Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
149 views
in Technique[技术] by (71.8m points)

database - Google Sheets - Matching Company Names

I have 2 databases, both have names of companies, but in different formats. I have been able to do exact matching using vlookup. I want to extract companies that were written differently, but they are actually the same company and extract the data.

Below is a small part of the databases I have

Database 1

Column A
1-800-Flowers.com Inc
Abbott Laboratories (Abbott)
21st Century Fox America Inc (formerly News America Inc)

Column B
1234(data I need to grab)
4567
8910

Database 2

Column C                                             
1-800 CONTACTS INC                                 
1-800-FLOWERS.COM                                   
ABBOTT LABORATORIES                                 
TWENTY-FIRST CENTURY FOX INC                        

Column D
ABCD(DataI can ignore as the company doesn't exist in database 1)
EFGH (Data I need as it matches from Database 1)
IJK
LMNO

As you can see from the above databases, Database 1 matches Database 2's in similar words like 21st Century Fox America Inc vs Twenty-first Century Fox Inc

In my database 1, I have about 4000+ values, while in database 2, I have 10,000 values. Is there a code to compare similar words between both databases and extract the data I need from columns B and D?

I have tried query, but it doesn't work the way I wanted it to. This is my shareable link.

Currently, What I have done is to extract the words which are similar using REGEXTRACT to find a match between the strings like Century Fox in 21st Century Fox and Twenty-First Century Fox and attempted to match both data sets using query. However my query result comes up with NA when I write it like this

=query(E:E,"Select E where E contains '"&L2&"'",0 )

L2 being the cell that contains the string Century Fox

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

L2:

=ARRAYFORMULA(INDEX($E$2:$E$68,MATCH(MAX(ARRAY_CONSTRAIN(MMULT(LEN(IFERROR(VLOOKUP(SPLIT($E$2:$E$68," "),transpose(SPLIT(A2," ")),1,0))),ROW(A$1:A$7)^0),ROW(E68),7)),ARRAY_CONSTRAIN(MMULT(LEN(IFERROR(VLOOKUP(SPLIT($E$2:$E$68," "),transpose(SPLIT(A2," ")),1,0))),ROW(A$1:A$7)^0),ROW(E68),7),0)))

M2:

=ARRAYFORMULA(INDEX($E$2:$F$68,MATCH(MAX(ARRAY_CONSTRAIN(MMULT(LEN(IFERROR(VLOOKUP(SPLIT($E$2:$E$68," "),transpose(SPLIT(A2," ")),1,0))),ROW(A$1:A$7)^0),ROW(E68),7)),ARRAY_CONSTRAIN(MMULT(LEN(IFERROR(VLOOKUP(SPLIT($E$2:$E$68," "),transpose(SPLIT(A2," ")),1,0))),ROW(A$1:A$7)^0),ROW(E68),7),0),2))

N2:

=ARRAYFORMULA(TEXT(MAX(ARRAY_CONSTRAIN(MMULT(LEN(IFERROR(VLOOKUP(SPLIT($E$2:$E$68," "),transpose(SPLIT(A2," ")),1,0))),ROW(A$1:A$7)^0),ROW(E68),7))/LEN(A2),"0%"))

Drag fill down.

Notes:

  • Formula is resource intensive. Apps Script might be a better choice.

  • For the given sample, This formula works with a reasonable degree of precision.

  • 7 is the maximum number of words per cell found in all of Column E( or Column C of database 2). This is hardcoded in the above formula. This should be found using a helper column. Z2:COUNTA (SPLIT(A2," ")) Drag fill down. And AA2: =MAX(Z2:Z)

  • N column gives the degree of confidence in the VLOOKUP produced result. Preferably, Anything below 45% should be rechecked manually.

  • How it works: All of E column (db2) is split by words and each of the word is looked upon in each entry of A column(db1). If a group of words are matched for multiple entries in E column, then the maximum of the length of matched words is taken and given as the possible match. A letter approach instead of a word approach may give better precision, but seems unnecessary in the given sample.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...