Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
2.7k views
in Technique[技术] by (71.8m points)

java - How to optimize code without using regex?

My test assignment when hiring was to optimize this code:

someString.replaceAll(""", "'").replaceAll("
", "").replace("[]", "{}");

To solve this problem, I wrote this method:

public String outputValue(String input) {
   char[] inputArray = input.toCharArray();
   StringBuilder builder = new StringBuilder();
   for (int i = 0; i < inputArray.length; i++) {
      if (i != inputArray.length - 1 && inputArray[i] == '[' && inputArray[i + 1] == ']' ) {
         builder.append("{}");
         i++;
      }
      else if (inputArray[i] == '"')
         builder.append("'");
      else if (inputArray[i] == '
' )
         continue;
      else builder.append(inputArray[i]);
   }
   return  builder.toString();
}

I had many options for solving this problem, but this seemed the most optimal. I tested the speed using the Instant class, and on any line size it showed me high results. The firm told me that in their measurements there is degradation on large lines, that this is due to an overhead for using StringBuilder and unconditional allocation of a new line at the end.

So i didn't get a job.

What solution could be even more optimized? Even using a char array, I cannot do without string allocation String.valueOf().


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

StringBuilder builder = new StringBuilder();

StringBuilder has no idea how large the output is going to be, so you do get amortized growth cost here: SB isn't voodoo magic, it needs to guess at sizes. It'll guess 'eh, I'll give you 10 chars worth'. If you then add the 11th char, SB has to make a new char array (a larger one; I think it grows by a factor x2, might be x1.5), so it makes a 20-char array, copies over the first 10 chars, and tosses out the old array. This is not too bad (by definition this growth penalty is applied to an ever exponentially decreasing number of append operations), but...

look at the input question: The output string is as large as the input string, OR smaller (every in the input results in the output being 1 smaller). So, you can tell StringBuilder beforehand: Hey, just start out with this much capacity (the length of the input). Now there is no growth penalty anymore.

that this is due to an overhead for using StringBuilder

This is misleading or oversimplified; you may not have quite heard that correctly. StringBuilder is just an object, with a char array. That is overhead, yeah, but,

someString.replaceAll(""", "'").replaceAll(" ", "").replace("[]", "{}");

that makes 2 pattern objects and (potentially) 3 strings. strings also are just an object with a big char array inside, so that's potentially more overhead and thus does not explain in the slightest why your take can be a lot slower.

However, there is another optimization you missed:

The replace/replaceAll methods don't make any string, AT ALL, if there is no need. Try it: "Hello!".replace("Foo", "Bar") returns the same reference (a == b equal).

Also, .toCharArray does require yet another full clone, also pricey. Just use charAt.

So, let's make an actual fast one:

public String outputValue(String input) {
   StringBuilder builder = null;
   int len = input.length();
   for (int i = 0; i < len; i++) {
      char c = input.charAt(i);
      if (c != '[' && c != '
' && c != '"') {
        if (builder != null) builder.append(c);
        continue;
      }

      if (builder == null) {
        // Don't even make the builder until we determine we need it!
        builder = new StringBuilder(len);
        builder.append(input, 0, i);
      }
      if (i != leg - 1 && c == '[' && input.charAt(i + 1) == ']' ) {
         builder.append("{}");
         i++;
      } else if (c == '"') {
         builder.append(''');
      } else if (c != '
') {
         builder.append(c);
      }
   }
   return  builder == null ? input : builder.toString();
}

I tested the speed using the Instant class, and on any line size it showed me high results.

Not how you test performance. Instant represents system clock; it has only at best microsecond granularity (and usually not even that), and can and will change if the net time daemon changes the time on you. It can also be under smear (when a clock needs adjusting, because so much software out there assumes time can never run backwards, instead of just moving the time back, a smearing NTP will simply run time 'slower' - ticking up 750 milliseconds every second for example, until time's 'fixed').

Even nanoTime is not great, though - benchmarking is way more complicated than this. Use JMH (Java Microbenchmark Harness).

If you want to see some examples where your code does way worse than the original .replaceAll/replaceAll/replace chain, make a bunch of strings that are large and which do not contain any of these symbols (no newlines, no quotes, no brackets). You'll see a very big difference.


But, interview questions aren't neccessarily about code optimization. For example, I find these just as plausible:

  1. Change almost nothing; replace is reasonably fast. However, .replaceAll involves a regex. If this line gets executed a ton, making an explicit pattern object would be more efficient and read better (though that is in the eye of the beholder). More generally, the patterns here don't use any regex features, so maybe the interviewer just wanted you to replace replaceAll with replace and that is all. To be clear, my code above will be faster if you have a large input with newlines and quotes and brackets in it, and will not be slower if there aren't. But, not every line of code needs to be optimized to the maximum possible extent.

  2. Before digging into the code at all, the exercise tested your willingness to ask questions and look for the bigger picture first. Maybe they wanted you to ask: How often is this run? Is there a performance framework in place? What kinds of strings are usually thrown at this method? Can we perhaps look elsewhere and fix the strings 'at the source' instead of converting them every time? When I do an interview, I tend to craft technical questions that seem complete but which have some missing information once you dig down enough - if the interviewee doesn't ask, I dock them some points. It's not an instant fail - I'll tell em and see if they bother to ask and debate first before diving into programming for the next question.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...