Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
364 views
in Technique[技术] by (71.8m points)

java - 为什么处理排序数组要比处理未排序数组快?(Why is processing a sorted array faster than processing an unsorted array?)

Here is a piece of C++ code that shows some very peculiar behavior.

(这是一段C ++代码,显示了一些非常特殊的行为。)

For some strange reason, sorting the data miraculously makes the code almost six times faster:

(出于某些奇怪的原因,奇迹般地对数据进行排序使代码快了将近六倍:)

#include <algorithm>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}
  • Without std::sort(data, data + arraySize);

    (没有std::sort(data, data + arraySize);)

    , the code runs in 11.54 seconds.

    (,代码将在11.54秒内运行。)

  • With the sorted data, the code runs in 1.93 seconds.

    (使用排序的数据,代码将在1.93秒内运行。)


Initially, I thought this might be just a language or compiler anomaly, so I tried Java:

(最初,我认为这可能只是语言或编译器异常,所以我尝试了Java:)

import java.util.Arrays;
import java.util.Random;

public class Main
{
    public static void main(String[] args)
    {
        // Generate data
        int arraySize = 32768;
        int data[] = new int[arraySize];

        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c)
            data[c] = rnd.nextInt() % 256;

        // !!! With this, the next loop runs faster
        Arrays.sort(data);

        // Test
        long start = System.nanoTime();
        long sum = 0;

        for (int i = 0; i < 100000; ++i)
        {
            // Primary loop
            for (int c = 0; c < arraySize; ++c)
            {
                if (data[c] >= 128)
                    sum += data[c];
            }
        }

        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}

With a similar but less extreme result.

(具有相似但不太极端的结果。)


My first thought was that sorting brings the data into the cache, but then I thought how silly that was because the array was just generated.

(我首先想到的是排序将数据带入缓存,但是后来我想到这样做是多么愚蠢,因为刚刚生成了数组。)

  • What is going on?

    (到底是怎么回事?)

  • Why is processing a sorted array faster than processing an unsorted array?

    (为什么处理排序数组要比处理未排序数组快?)

The code is summing up some independent terms, so the order should not matter.

(该代码总结了一些独立的术语,因此顺序无关紧要。)

  ask by GManNickG translate from so

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You are a victim of branch prediction fail.

(您是分支预测失败的受害者。)


What is Branch Prediction? (什么是分支预测?)

Consider a railroad junction:

(考虑一个铁路枢纽:)

该图显示了铁路枢纽 Image by Mecanismo, via Wikimedia Commons.

(Mecanismo的图片 ,通过Wikimedia Commons。)

Used under the CC-By-SA 3.0 license.

(CC-By-SA 3.0许可下使用。)

Now for the sake of argument, suppose this is back in the 1800s - before long distance or radio communication.

(现在,为了论证,假设这是在1800年代-在进行长距离或无线电通信之前。)

You are the operator of a junction and you hear a train coming.

(您是路口的操作员,并且听到火车驶入。)

You have no idea which way it is supposed to go.

(您不知道应该走哪条路。)

You stop the train to ask the driver which direction they want.

(您停下火车,询问驾驶员他们想要哪个方向。)

And then you set the switch appropriately.

(然后您适当地设置开关。)

Trains are heavy and have a lot of inertia.

(火车很重,惯性很大。)

So they take forever to start up and slow down.

(因此,它们要花很多时间才能启动和减速。)

Is there a better way?

(有没有更好的办法?)

You guess which direction the train will go!

(您猜火车将朝哪个方向行驶!)

  • If you guessed right, it continues on.

    (如果您猜对了,它将继续进行。)

  • If you guessed wrong, the captain will stop, back up, and yell at you to flip the switch.

    (如果您猜错了,机长会停下来,后退并大喊大叫,以拨动开关。)

    Then it can restart down the other path.

    (然后,它可以沿着其他路径重新启动。)

If you guess right every time , the train will never have to stop.

(如果您每次都猜对了 ,火车将永远不会停止。)
If you guess wrong too often , the train will spend a lot of time stopping, backing up, and restarting.

(如果您经常猜错 ,火车将花费大量时间停止,备份和重新启动。)


Consider an if-statement: At the processor level, it is a branch instruction:

(考虑一个if语句:在处理器级别,它是一条分支指令:)

包含if语句的已编译代码的屏幕截图

You are a processor and you see a branch.

(您是处理器,并且看到一个分支。)

You have no idea which way it will go.

(您不知道它将走哪条路。)

What do you do?

(你是做什么?)

You halt execution and wait until the previous instructions are complete.

(您停止执行,并等待之前的说明完成。)

Then you continue down the correct path.

(然后,您沿着正确的路径继续。)

Modern processors are complicated and have long pipelines.

(现代处理器很复杂,而且流程很长。)

So they take forever to "warm up" and "slow down".

(因此,他们需要永远进行“热身”和“减速”。)

Is there a better way?

(有没有更好的办法?)

You guess which direction the branch will go!

(您猜分支将朝哪个方向前进!)

  • If you guessed right, you continue executing.

    (如果您猜对了,则继续执行。)

  • If you guessed wrong, you need to flush the pipeline and roll back to the branch.

    (如果您猜错了,则需要刷新管道并回滚到分支。)

    Then you can restart down the other path.

    (然后,您可以沿着其他路径重新启动。)

If you guess right every time , the execution will never have to stop.

(如果您每次都猜对了 ,执行将永远不会停止。)
If you guess wrong too often , you spend a lot of time stalling, rolling back, and restarting.

(如果您经常猜错 ,那么您将花费大量时间来拖延,回滚和重新启动。)


This is branch prediction.

(这是分支预测。)

I admit it's not the best analogy since the train could just signal the direction with a flag.

(我承认这不是最好的类比,因为火车可以只用一个标志来指示方向。)

But in computers, the processor doesn't know which direction a branch will go until the last moment.

(但是在计算机中,处理器直到最后一刻才知道分支的方向。)

So how would you strategically guess to minimize the number of times that the train must back up and go down the other path?

(那么,您如何从战略上猜测如何将火车必须倒退和走另一条路的次数降至最低?)

You look at the past history!

(您看看过去的历史!)

If the train goes left 99% of the time, then you guess left.

(如果火车有99%的时间向左行驶,那么您就猜到了。)

If it alternates, then you alternate your guesses.

(如果它交替出现,那么您将交替猜测。)

If it goes one way every three times, you guess the same...

(如果它每三回??去一次,您会猜到相同...)

In other words, you try to identify a pattern and follow it.

(换句话说,您尝试识别一个模式并遵循它。)

This is more or less how branch predictors work.

(这或多或少是分支预测变量的工作方式。)

Most applications have well-behaved branches.

(大多数应用程序具有行为良好的分支。)

So modern branch predictors will typically achieve >90% hit rates.

(因此,现代分支预测器通常将达到90%以上的命中率。)

But when faced with unpredictable branches with no recognizable patterns, branch predictors are virtually useless.

(但是,当面对没有可识别模式的不可预测分支时,分支预测变量实际上是无用的。)

Further reading: "Branch predictor" article on Wikipedia .

(进一步阅读: Wikipedia上的“分支预测器”文章 。)


As hinted from above, the culprit is this if-statement: (从上面暗示,罪魁祸首是这个if陈述:)

if (data[c] >= 128)
    sum += data[c];

Notice that the data is evenly distributed between 0 and 255. When the data is sorted, roughly the first half of the iterations will not enter the if-statement.

(请注意,数据在0到255之间均匀分布。对数据进行排序时,大约前一半的迭代将不会进入if语句。)

After that, they will all enter the if-statement.

(之后,他们都会进入if语句。)

This is very friendly to the branch predictor since the branch consecutively goes the same direction many times.

(这对分支预测器非常友好,因为分支连续多次朝同一方向前进。)

Even a simple saturating counter will correctly predict the branch except for the few iterations after it switches direction.

(即使是简单的饱和计数器也可以正确预测分支,除了在切换方向后进行几次迭代外。)

Quick visualization:

(快速可视化:)

T = branch taken
N = branch not taken

data[] = 0, 1, 2, 3, 4, ... 126, 127, 128, 129, 130, ... 250, 251, 252, ...
branch = N  N  N  N  N  ...   N    N    T    T    T  ...   T    T    T  ...

       = NNNNNNNNNNNN ... NNNNNNNTTTTTTTTT ... TTTTTTTTTT  (easy to predict)

However, when the data is completely random, the branch predictor is rendered useless, because it can't predict random data.

(但是,当数据完全随机时,分支预测器将变得无用,因为它无法预测随机数据。)

Thus there will probably be around 50% misprediction (no better than random guessing).

(因此,可能会有大约50%的错误预测(没有比随机猜测好)。)

data[] = 226, 185, 125, 158, 198, 144, 217, 79, 202, 118,  14, 150, 177, 182, 133, ...
branch =   T,   T,   N,   T,   T,   T,   T,  N,   T,   N,   N,   T,   T,   T,   N  ...

       = TTNTTTTNTNNTTTN ...   (completely random - hard to predict)

So what can be done?

(那该怎么办呢?)

If the compiler isn't able to optimize the branch into a conditional move, you can try some hacks if you are willing to sacrifice readability for performance.

(如果编译器无法将分支优化为有条件的移动,那么如果您愿意牺牲可读性来提高性能,则可以尝试一些破解。)

Replace:

(更换:)

if (data[c] >= 128)
    sum += data[c];

with:

(与:)

int t = (data[c] - 128) >> 31;
sum += ~t & data[c];

This eliminates the branch and replaces it with some bitwise operations.

(这消除了分支,并用一些按位运算将其替换。)

(Note that this hack is not strictly equivalent to the original if-statement. But in this case, it's valid for all the input values of data[] .)

((请注意,这种破解并不严格等同于原始的if语句。但是在这种情况下,它对data[]所有输入值均有效。))

Benchmarks: Core i7 920 @ 3.5 GHz

(基准:Core i7 920 @ 3.5 GHz)

C++ - Visual Studio 2010 - x64 Release

(C ++-Visual Studio 2010-x64版本)

//  Branch - Random
seconds = 11.777

//  Branch - Sorted
seconds = 2.352

//  Branchless - Random
seconds = 2.564

//  Branchless - Sorted
seconds = 2.587

Java - NetBeans 7.1.1 JDK 7 - x64

(Java-NetBeans 7.1.1 JDK 7-x64)

//  Branch - Random
seconds = 10.93293813

//  Branch - Sorted
seconds = 5.643797077

//  Branchless - Random
seconds = 3.113581453

//  Branchless - Sorted
seconds = 3.186068823

Observations:

(观察结果:)

  • With the Branch: There is a huge difference between the sorted and unsorted data.

    (使用分支:排序和未排序的数据之间存在巨大差异。)

  • With the Hack: There is no difference between sorted and unsorted data.

    (使用Hack:排序和未排序的数据之间没有区别。)

  • In the C++ case, the hack is actually a tad slower than with the branch when the data is sorted.

    (在C ++情况下,对数据进行排序时,hack实际上比分支慢一点。)

A general rule of thumb is to avoid data-dependent branching in critical loops (such as in this example).

(一般的经验法则是避免在关键循环中避免依赖于数据的分支(例如在此示例中)。)


Update:

(更新:)

  • GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move.

    (在x64上带有-O3-ftree-vectorize GCC 4.6.1能够生成条件移动。)

    So there is no difference between the sorted and unsorted data - both are fast.

    (因此,已排序和未排序的数据之间没有区别-两者都很快速。)

  • VC++ 2010 is unable to generate conditional moves for this branch even under /Ox .

    (即使在/Ox下,VC ++ 2010也无法为此分支生成条件移动。)

  • Intel C++ Compiler (ICC) 11 does something miraculous.

    (英特尔C ++编译器 (ICC)11发挥了神奇的作用。)

    It <a href="https://stackoom.com/link/aHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvTG9vcF9pbnRlcmNoYW5nZQ==" rel="nofollow noop

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...