Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
433 views
in Technique[技术] by (71.8m points)

javascript - Is there a way to measure string similarity in Google BigQuery

I'm wondering if anyone knows of a way to measure string similarity in BigQuery.

Seems like would be a neat function to have.

My case is i need to compare the similarity of two urls as want to be fairly sure they refer to the same article.

I can find examples using javascript so maybe a UDF is the way to go but i've not used UDF's at all (or javascript for that matter :) )

Just wondering if there may be a way using existing regex functions or if anyone might be able to get me started with porting the javascript example into a UDF.

Any help much appreciated, thanks

EDIT: Adding some example code

So if i have a UDF defined as:

// distance function

function levenshteinDistance (row, emit) {

  //if (row.inputA.length <= 0 ) {var myresult = row.inputB.length};
  if (typeof row.inputA === 'undefined') {var myresult = 1};
  if (typeof row.inputB === 'undefined') {var myresult = 1};
  //if (row.inputB.length <= 0 ) {var myresult = row.inputA.length};

    var myresult = Math.min(
        levenshteinDistance(row.inputA.substr(1), row.inputB) + 1,
        levenshteinDistance(row.inputB.substr(1), row.inputA) + 1,
        levenshteinDistance(row.inputA.substr(1), row.inputB.substr(1)) + (row.inputA[0] !== row.inputB[0] ? 1 : 0)
    ) + 1;

  emit({outputA: myresult})

}

bigquery.defineFunction(
  'levenshteinDistance',                           // Name of the function exported to SQL
  ['inputA', 'inputB'],                    // Names of input columns
  [{'name': 'outputA', 'type': 'integer'}],  // Output schema
  levenshteinDistance                       // Reference to JavaScript UDF
);

// make a test function to test individual parts

function test(row, emit) {
  if (row.inputA.length <= 0) { var x = row.inputB.length} else { var x = row.inputA.length};
  emit({outputA: x});
}

bigquery.defineFunction(
  'test',                           // Name of the function exported to SQL
  ['inputA', 'inputB'],                    // Names of input columns
  [{'name': 'outputA', 'type': 'integer'}],  // Output schema
  test                       // Reference to JavaScript UDF
);

Any i try test with a query such as:

SELECT outputA FROM (levenshteinDistance(SELECT "abc" AS inputA, "abd" AS inputB))

I get error:

Error: TypeError: Cannot read property 'substr' of undefined at line 11, columns 38-39 Error Location: User-defined function

It seems like maybe row.inputA is not a string perhaps or for some reason string functions not able to work on it. Not sure if this is a type issue or something funny about what utils the UDF is able to use by default.

Again any help much appreciated, thanks.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)
Waitting for answers

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...