A question from Stack Overflow, "Why is processing a sorted array faster than processing an unsorted array?", has drawn plenty of discussion in the original thread. Here we first reproduce the result, and then explain why.

We have the following two pieces of code. They look almost identical, and the logic really is the same: both count how many elements of an array are below THRESHOLD. The only difference is that one counts over an unsorted array and the other over a sorted one. If the two arrays hold exactly the same data (same size, same values) and differ only in ordering, how big do you think the performance gap between the two functions will be?

    public int countUnsortedArr() {
        int cnt = 0;
        for (int i = 0; i < MAX_LENGTH; i++) {
            if (arr[i] < THRESHOLD) {
                cnt++;
            }
        }
        return cnt;
    }

    public int countSortedArr() {
        int cnt = 0;
        for (int i = 0; i < MAX_LENGTH; i++) {
            if (arrSorted[i] < THRESHOLD) {
                cnt++;
            }
        }
        return cnt;
    }

Intuitively, the logic of the two pieces of code is exactly the same, and since the data sources are identical the results are identical too, so there should be no performance difference. But is that really true? We tested it with JMH, the micro-benchmark harness from OpenJDK.

Test environment

MacBook Pro (13-inch, 2017)
CPU: 2.5 GHz Intel Core i7
JMH version: 1.21
VM version: JDK 11, Java HotSpot(TM) 64-Bit Server VM, 11+28
Test method: one warmup round, then three measurement rounds per function, 10 s each

The results are as follows. Score is the average time in microseconds for one execution of the function; the smaller the value, the better the performance.

Benchmark                              Mode  Cnt      Score      Error  Units
BranchPredictionTest.countSortedArr    avgt    3   5212.052 ± 7848.382  us/op
BranchPredictionTest.countUnsortedArr  avgt    3  31854.238 ± 5393.947  us/op

Surprised? Counting over the sorted array is much faster, a performance gap of roughly six times, and after many repeated tests the gap is very stable. Doesn't that defy logic? Most of us write in high-level languages that hide a great deal of low-level detail, and the detail that matters here is how the CPU optimizes branch instructions: the CPU branch prediction of the title. To be clear up front, the purpose of this article is not to explain branch prediction exhaustively, but to show how much it can affect performance; if you want to dig deeper into CPU branch prediction, see the references listed at the end.
[Figure: the CPU instruction pipeline]

To talk about branch prediction, we first have to mention the instruction pipeline of modern CPUs. As shown in the figure above, to improve instruction throughput a modern CPU splits the processing of a single instruction into multiple stages, roughly fetch, decode, execute, and write-back. An instruction can start before the previous one has completely finished, just like an assembly line in a factory, which greatly increases the throughput of instruction execution. Real CPUs have more than four stages; Intel and Arm processors generally have more than ten, which makes the throughput advantage of pipelining even more pronounced.
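
A textbook idealization shows where the gain comes from (my own aside; real pipelines are messier): with k stages, N instructions finish in k + (N - 1) cycles instead of the k × N cycles of fully serial execution, so for large N the throughput approaches k times higher.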

But reality falls short of the ideal: not every instruction can begin before the previous one finishes, because an instruction may depend on the result of the instruction before it. To cope with such hazards, CPU designers have introduced many optimizations, such as hazard detection, out-of-order execution, and speculative execution. The __branch prediction__ we are discussing today is one kind of speculation.
[Figure: branch prediction]

The idea behind branch prediction is simple. Since the data the branch depends on has not been computed yet, the CPU guesses an outcome and starts executing down that path ahead of time. Of course it does not guess at random: a modern CPU predicts based on the outcomes of the same branch in the past, rather like reasoning that since it rained the day before yesterday and again yesterday, it will probably rain today (see the references at the end for the actual schemes). The idea is simple, but it works surprisingly well: according to Wikipedia, modern branch predictors reach accuracies above 90%.
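
To make "look at previous outcomes" concrete, here is a minimal sketch of a classic two-bit saturating-counter predictor. This is an illustration of the idea only, not the exact scheme any particular CPU uses:

public class TwoBitPredictor {
    // States 0..1 predict "not taken", states 2..3 predict "taken".
    // A single misprediction is not enough to flip a strongly biased prediction.
    private int state = 2; // start weakly "taken"

    public boolean predict() {
        return state >= 2;
    }

    public void update(boolean taken) {
        if (taken) {
            if (state < 3) state++;
        } else {
            if (state > 0) state--;
        }
    }

    public static void main(String[] args) {
        TwoBitPredictor p = new TwoBitPredictor();
        boolean[] history = {true, true, true, false, true, true}; // mostly taken
        int misses = 0;
        for (boolean actual : history) {
            if (p.predict() != actual) misses++;
            p.update(actual);
        }
        System.out.println("mispredictions: " + misses); // 1: only the lone "false"
    }
}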

Since the accuracy is not 100%, predictions sometimes fail. When the CPU discovers a misprediction, it discards all the results of the instructions executed after the prediction and restarts execution from the mispredicted branch, so many instructions ran for nothing. The cost of a failed prediction is high: where a normally executed instruction takes on the order of 10-20 cycles through the pipeline, a misprediction can add another 30-40 cycles. Back to our test code above: the data I prepared is ten million numbers drawn from [0, 10,000,000), and we count how many are below 5,000,000. On the unsorted array the branch outcome is essentially random, so prediction fails about 50% of the time; on the sorted array, out of ten million iterations the prediction fails only once. Branch misprediction is the cause of the performance gap.
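
A rough sanity check on those numbers (my own back-of-envelope; the per-miss penalty varies by microarchitecture): assuming roughly 15 wasted cycles per misprediction, 10,000,000 iterations × 0.5 miss rate × 15 cycles is about 7.5 × 10^7 extra cycles, which at 2.5 GHz comes to roughly 30 ms, the same order of magnitude as the measured gap of about 27 ms between countUnsortedArr (≈31.9 ms) and countSortedArr (≈5.2 ms).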

Performance optimization

Now that we know the cause, how do we optimize? Since the sorted array is fastest, can we simply sort the array first and then count? No. Don't forget that the cost of the extra sort is far greater than the loss caused by branch mispredictions. And since we cannot raise the prediction hit rate on random data, we can instead eliminate the logic that produces branches in the first place. How?

Optimization 1

For counting logic this simple, we can use a bit trick directly. It looks cryptic, but the idea is plain: arr[i] - THRESHOLD is negative exactly when arr[i] < THRESHOLD, so shifting its sign bit down turns the comparison into a 0 or 1 that we add to cnt. Without the if there is no jmp instruction in the generated code, so there is nothing for the branch predictor to get wrong. The code is as follows:

    @Benchmark
    public int count1() {
        int cnt = 0;
        for (int i = 0; i < MAX_LENGTH; i++) {
            // (arr[i] - THRESHOLD) is negative exactly when arr[i] < THRESHOLD;
            // the unsigned shift turns the sign bit into 0 or 1.
            cnt += (arr[i] - THRESHOLD) >>> 31;
        }
        return cnt;
    }
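
As a quick sanity check that the branchless expression really computes the same thing as the if version, here is a small standalone snippet (added for illustration, not part of the original benchmark):

public class SignBitCheck {
    public static void main(String[] args) {
        int threshold = 5;
        for (int x = 0; x <= 10; x++) {
            int branchy = (x < threshold) ? 1 : 0;
            // the sign bit of (x - threshold) is 1 exactly when x < threshold
            int branchless = (x - threshold) >>> 31;
            if (branchy != branchless) throw new AssertionError("mismatch at x = " + x);
        }
        System.out.println("branchless count matches the if version for 0..10");
    }
}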

Optimization 2

Where there is an optimization 1 there must be an optimization 2. In fact, the ?: ternary operator can also recover the performance.

    @Benchmark
    public int count2() {
        int cnt = 0;
        for (int i = 0; i < MAX_LENGTH; i++) {
            cnt += arr[i] < THRESHOLD ? 1 : 0;
        }
        return cnt;
    }

Putting the four approaches side by side, the bit trick is without question the fastest, and the ?: ternary version is nearly as fast, on par with counting the sorted array. We can conclude that the ternary operator also avoids the penalty of branch misprediction.

Benchmark                              Mode  Cnt      Score       Error  Units
BranchPredictionTest.count1            avgt    3   3807.000 ± 16265.107  us/op
BranchPredictionTest.count2            avgt    3   4706.082 ± 19757.705  us/op
BranchPredictionTest.countSortedArr    avgt    3   4458.783 ±   107.975  us/op
BranchPredictionTest.countUnsortedArr  avgt    3  30719.090 ±  4517.611  us/op

Why is the ?: ternary operator so fast?

There is still a less-than comparison inside the ?: expression, so why is there no branch jump? I puzzled over this for quite a while. Eventually I wrote the if version and the ?: version of the logic in C, compiled both to assembly, and found the difference.

C code
int fun1(int x)
{
    int cnt = 0;
    if (x < 0)
    {
        cnt += 1;
    }
    return cnt;
}

int fun2(int x)
{
    int cnt = 0;
    cnt += x < 0 ? 1 : 0;   /* same condition as fun1, expressed with ?: */
    return cnt;
}

Assembly code
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 10, 14
	.globl	_fun1                   ## -- Begin function fun1
	.p2align	4, 0x90
_fun1:                                  ## @fun1
	.cfi_startproc
## %bb.0:
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	movl	%edi, -4(%rbp)
	movl	$0, -8(%rbp)
	cmpl	$0, -4(%rbp)
	jge	LBB0_2       
## %bb.1:
	movl	-8(%rbp), %eax
	addl	$1, %eax
	movl	%eax, -8(%rbp)
LBB0_2:
	movl	-8(%rbp), %eax
	popq	%rbp
	retq
	.cfi_endproc
                                        ## -- End function
	.globl	_fun2                   ## -- Begin function fun2
	.p2align	4, 0x90
	
	
_fun2:                                  ## @fun2
	.cfi_startproc
## %bb.0:
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	xorl	%eax, %eax
	movl	%edi, -4(%rbp)
	movl	$0, -8(%rbp)
	movl	-4(%rbp), %edi
	cmpl	$0, %edi
	movl	$1, %edi
	cmovll	%edi, %eax
	addl	-8(%rbp), %eax
	movl	%eax, -8(%rbp)
	movl	-8(%rbp), %eax
	popq	%rbp
	retq
	.cfi_endproc
                                        ## -- End function

.subsections_via_symbols

As you can see from the assembly, fun2 contains no jge instruction (jump if greater or equal). jge is a typical conditional jump: depending on the result of the preceding cmpl, it transfers control to another block of code, and as mentioned above, the CPU only performs branch prediction at jump instructions. The ?: version is compiled quite differently: there is still a cmpl, but it is followed by a cmovll (conditional move if less) instruction, which writes a value into the %eax register according to the comparison result. Every subsequent instruction executes in sequence, there is no jump at all, and therefore no penalty from a failed branch prediction.
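
The same effect can occur in the Java versions: although javac compiles a ternary into conditional branches at the bytecode level, HotSpot's JIT can turn such a pattern into a conditional move. If you want to verify what the JIT actually emits, one way (my suggestion, not from the original post) is to run the benchmark with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly, which requires the hsdis disassembler library to be installed.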

Contrary to what I had imagined, jump instructions are not produced by the size comparison itself; they are produced by logical forks such as if and for. A conditional jump merely consumes the result of the comparison.

Complete benchmark code:

import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)
@Fork(3)
@Warmup(iterations = 1)
@Measurement(iterations = 3)
public class BranchPredictionTest {
    private static final Random random = new Random();
    private static final int MAX_LENGTH = 10_000_000;
    private static final int THRESHOLD = MAX_LENGTH >> 1;
    private static int[] arr;
    private static int[] arrSorted;

    @Setup
    public void init() {
        arr = new int[MAX_LENGTH];
        for (int i = 0; i < MAX_LENGTH; i++) {
            arr[i] = random.nextInt(MAX_LENGTH);
        }
        arrSorted = Arrays.copyOf(arr, arr.length);
        Arrays.sort(arrSorted);
    }

    // Each benchmark returns cnt so the JIT cannot dead-code-eliminate the loop.

    @Benchmark
    public int countUnsortedArr() {
        int cnt = 0;
        for (int i = 0; i < MAX_LENGTH; i++) {
            if (arr[i] < THRESHOLD) {
                cnt++;
            }
        }
        return cnt;
    }

    @Benchmark
    public int countSortedArr() {
        int cnt = 0;
        for (int i = 0; i < MAX_LENGTH; i++) {
            if (arrSorted[i] < THRESHOLD) {
                cnt++;
            }
        }
        return cnt;
    }

    @Benchmark
    public int count1() {
        int cnt = 0;
        for (int i = 0; i < MAX_LENGTH; i++) {
            // Sign-bit trick: no comparison, no jump, nothing to mispredict.
            cnt += (arr[i] - THRESHOLD) >>> 31;
        }
        return cnt;
    }

    @Benchmark
    public int count2() {
        int cnt = 0;
        for (int i = 0; i < MAX_LENGTH; i++) {
            cnt += arr[i] < THRESHOLD ? 1 : 0;
        }
        return cnt;
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(BranchPredictionTest.class.getSimpleName())
                .forks(1)
                .build();
        new Runner(opt).run();
    }
}
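
For completeness (my assumption about the build, since the original post does not include it): to compile and run this class you need the standard JMH artifacts jmh-core and jmh-generator-annprocess (group org.openjdk.jmh) on the classpath, with annotation processing enabled so JMH can generate the benchmark harness.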

Conclusion

CPU branch prediction exists to keep the pipeline busy and avoid stalls, and at bottom it relies on the principle of locality: because program behavior is local, the technique is a net performance win in the vast majority of cases (otherwise it would not have survived to today), so most of the time we do not need to think about it at all; just write your code with confidence. Please don't go rewriting every if into ?: because of this post: the damage to readability would far outweigh the performance gain. Remember that I deliberately constructed extreme data to expose the difference; thanks to locality, in most real programs the predictions are simply correct.

Finally, I also found that if every less-than comparison in the benchmark above is changed to greater-than, all the performance differences disappear; the measured results are below. I can't explain it; it looks as though branch prediction no longer matters. If any expert understands why, please enlighten me.

Benchmark                              Mode  Cnt     Score      Error  Units
BranchPredictionTest.count1            avgt    3  3354.059 ± 3995.160  us/op
BranchPredictionTest.count2            avgt    3  4047.069 ± 2285.700  us/op
BranchPredictionTest.countSortedArr    avgt    3  5732.614 ± 6491.716  us/op
BranchPredictionTest.countUnsortedArr  avgt    3  5251.890 ±   64.813  us/op

Reference material


  1. Wikipedia: Instruction pipelining

  2. Wikipedia: Branch predictor

  3. CPU branch prediction

  4. Principle of locality

Original link: https://blog.csdn.net/xindoo/article/details/101762198