cs: App Chapter 5 Program Optimization - Reading Notes

5 Program Optimization

Before optimizing performanceSystem

  • How is the program compiled and executed?

  • How is the modern processor and storage system operate?

  • How to measure the performance of the program performance and position the performance bottleneck

  • Improve program performance under the premise of maintaining code integrity

The program is complicated, but the process should be simple, redundant variables, expressions, errors, and the like may greatly limit the performance release of the program, but the processing of these problems is often compared to algorithm complexity. The lower thing is easy, even through the optimization of the compiler, some programmers’ bad habits are changed in the bottom layer!

When you start compilation or computer composition principle, you may feel that the program code and compilation are at least one or one, how will it be too poor, and later understand the program optimization, I know that the original compiler is We have done so much work, allocate from register, code selection and sorting scheduling, unreachable code, to illegal code clearing, etc.

But the compiler is also a limitable. As a creator, it is difficult to predict all the issues that appear during the running process, and it is more difficult to compile the compiler, only through existing static code, do Static analysis, exclude some grammar, semantic errors, runtime errors can hardly troubleshoot, unless you do a lot of preset conditions to judge, and often local code analysis is much smaller than the overall analysis price, so new version of GCC The analysis is carried out in a separate file.

Therefore, when the compiler of the code is in the face of unsurely optimized risks, it can only follow the original meaning of the code, and cannot do excessive extensive optimization results in its change, so the compiler must be conservative.

Common optimization means

Do not consider the specific processor and compiler, just to C language

Code movement

Reduce the frequency of calculation, which always generates the expression of the same result, and these expressions are called multiple times in a loop or many places.

Simplification of complex operations

Replace the cost of high-cost operations with a minimum cost, such as using a shift operation, but generally replacing those 2 times.

16 * x > x << 4

Or use the addition to replace multiple multiplications.

Hereni+=nint ni = n*iReduce multiplication operation, in the Intel Nehalem processor, the multiplication of integers requires 3 CPU cycles.

Share public sub-expression

This makes 4 multiplication operations into a form of multiplication and four additions.

One of the obstacles to the optimization: function call

A function of string to lowercase as an example

void lower1(char *s){    size_t i;    for (i = 0; i < strlen(s); i++) {        if (s[i] >= 'A' && s[i <= 'Z']) {            s[i] -= ('A' - 'a');        }        }}

The bottleneck of this function is called once when it is determined whether I exceeds S.strlenThis step is very high, so the length of each query string is needed, and the complexity is\(O (N)\)So the final complexity of this function came.\(O (N^2)\)Make the results unacceptable.

Therefore, for this time, it is called, but the result of the call is the same function. We should take it out and store it in a variable, so that each loop only needs to be compared once, not need Execute a function.

So we can write the following function.

void lower2(char *s){    size_t i;    size_t len = strlen(s)    for (i = 0; i < len; i++) {        if (s[i] >= 'A' && s[i <= 'Z']) {            s[i] -= ('A' - 'a');        }        }}

So only need to call oncestrlenIt is our efficiency to improve.

Memory call correlation optimization

for (int i = 0; i < n; i++) {    b[i] = 0;    for (int j = 0; j < n; j++)        b[i] += a[i * n + j];}

In this code, repeat calls many timesb[i]In fact, this variable only acts as a stored role, but the program will access memory each time, therefore causing a large overhead.

The resulting assembly code is as follows

# sum_rows1 inner loop.L4:    movsd (%rsi,%rax,8), %xmm0 # FP load    addsd (%rdi), %xmm0 # FP add    movsd %xmm0, (%rsi,%rax,8) # FP store    addq $8, %rdi    cmpq %rcx, %rdi    jne .L4

In the third line here, it is read from memory.b[i]The value, and this is in a loop, the number of executions will be very large, resulting in a decrease in efficiency.

So we can use a temporary variable to store the results added, and then assign the result after the completion is completed.b[i].

Use of memory alias

  • Two different memory reference points to the same memory

  • It is easy to happen in C language

    • Because the address operation is allowed in C

    • Direct access storage structure

  • C often use local variables

    • The local variable will accumulate in the loop, and the compiler does not check the alias usage of the memory.

Excessive measurement processor

The superbatch processor can perform multiple instructions within a clock cycle, which are all acquired from the same consecutive instruction stream, usually dynamically scheduled.

The advantage is that there is no need for a specific code insertion, and the over-scale processor can use the instructions of most program code to perform code in parallel.

Modern processors can often perform multiple instructions at the same time. This technique is generally implemented by the pipeline, and when a command is related to the operation results of the previous instruction, you need to wait for the last instruction operation to complete, before you can continue This forms a sequential dependency.

Sequential dependency reduces the operational efficiency of the processor.

We can crack this sequential dependence through a recombinant cycle development.

void unroll2a_combine(vec_ptr v, data_t *dest){    long length = vec_length(v);    long limit = length-1;    data_t *d = get_vec_start(v);    data_t x = IDENT;    long i;    /* Combine 2 elements at a time */    for (i = 0; i < limit; i+=2) {        x = (x OP d[i]) OP d[i+1];    }    /* Finish any remaining elements */    for (; i < length; i++) {        x = x OP d[i];    }    *dest = x;}

Reorganization

We replace the operation assignment statement of one of the loops to this statement:

x = x OP (d[i] OP d[i+1]);

So the original data dependent becomes non-linear

Using the separate accumulator

for (i = 0; i < limit; i+=2) {    x0 = x0 OP d[i];    x1 = x1 OP d[i+1];}

Equivalent to the summary of the odd and even number of operations in two variables, in the last merge, but this requires the operationHave a exchange law!

*dest = x0 OP x1;

This is dependent on

Write high performance code

  • Using a good compiler

  • Compilation Level Optimization

    • Algorithm using a lower time complexity / spatial complexity

    • Write compiler friendly code

      • Note that the function call, repeat the call to generate the same result of the same result in the loop

      • Note that the memory reference, the use of pointers in the C language

    • Pay attention to the most in-room cycle because most of the work is done here.

  • Machine level optimization

    • Pay attention to the command level of the code parallel to avoid the order-dependent

    • Avoid unpredictable branches should make branch code as possible to predict

    • Use code cache