Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
557 views
in Technique[技术] by (71.8m points)

language agnostic - Efficient Algorithm for String Concatenation with Overlap

We need to combine 3 columns in a database by concatenation. However, the 3 columns may contain overlapping parts and the parts should not be duplicated. For example,

  "a" + "b" + "c" => "abc"
  "abcde" + "defgh" + "ghlmn" => "abcdefghlmn"
  "abcdede" + "dedefgh" + "" => "abcdedefgh"
  "abcde" + "d" + "ghlmn" => "abcdedghlmn"
  "abcdef" + "" + "defghl" => "abcdefghl"

Our current algorithm is pretty slow because it uses brute-force to identify the overlapping part between 2 strings. Does any one know an efficient algorithm to do this?

Say we have 2 strings A and B. The algorithm needs to find the longest common substring S so that A ends with S and B starts with S.

Our current brute-force implementation in Java is attached for reference,

public static String concat(String s1, String s2) {
    if (s1 == null)
        return s2;
    if (s2 == null)
        return s1;
    int len = Math.min(s1.length(), s2.length());

    // Find the index for the end of overlapping part
    int index = -1;
    for (int i = len; i > 0; i--) {
        String substring = s2.substring(0, i);
        if (s1.endsWith(substring)) {
            index = i;
            break;
        }
    }
    StringBuilder sb = new StringBuilder(s1);
    if (index < 0) 
        sb.append(s2);
    else if (index <= s2.length())
        sb.append(s2.substring(index));
    return sb.toString();
}
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Most of the other answers have focused on constant-factor optimizations, but it's also possible to do asymptotically better. Look at your algorithm: it's O(N^2). This seems like a problem that can be solved much faster than that!

Consider Knuth Morris Pratt. It keeps track of the maximum amount of substring we have matched so far throughout. That means it knows how much of S1 has been matched at the end of S2, and that's the value we're looking for! Just modify the algorithm to continue instead of returning when it matches the substring early on, and have it return the amount matched instead of 0 at the end.

That gives you an O(n) algorithm. Nice!

    int OverlappedStringLength(string s1, string s2) {
        //Trim s1 so it isn't longer than s2
        if (s1.Length > s2.Length) s1 = s1.Substring(s1.Length - s2.Length);

        int[] T = ComputeBackTrackTable(s2); //O(n)

        int m = 0;
        int i = 0;
        while (m + i < s1.Length) {
            if (s2[i] == s1[m + i]) {
                i += 1;
                //<-- removed the return case here, because |s1| <= |s2|
            } else {
                m += i - T[i];
                if (i > 0) i = T[i];
            }
        }

        return i; //<-- changed the return here to return characters matched
    }

    int[] ComputeBackTrackTable(string s) {
        var T = new int[s.Length];
        int cnd = 0;
        T[0] = -1;
        T[1] = 0;
        int pos = 2;
        while (pos < s.Length) {
            if (s[pos - 1] == s[cnd]) {
                T[pos] = cnd + 1;
                pos += 1;
                cnd += 1;
            } else if (cnd > 0) {
                cnd = T[cnd];
            } else {
                T[pos] = 0;
                pos += 1;
            }
        }

        return T;
    }

OverlappedStringLength("abcdef", "defghl") returns 3


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...