An efficient way to implement matrix multiplication in C

  • 2020-04-01 21:39:16
  • OfStack

You know how to compute matrix multiplication. In general, we all implement it with the following code:


for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j) {
        double sum = 0;
        for (int k = 0; k < n; ++k)        /* walks B down a column */
            sum += A[i][k] * B[k][j];
        C[i][j] += sum;
    }

But with caching in mind, there's a better way to do it:


for (int i = 0; i < n; ++i)
    for (int k = 0; k < n; ++k) {
        double r = A[i][k];
        for (int j = 0; j < n; ++j)        /* walks B and C along rows */
            C[i][j] += r * B[k][j];
    }


A closer look reveals that the two implementations are semantically equivalent, yet the latter is actually more efficient. The harness below can check both claims.
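
Here is a self-contained harness for doing that (a hedged sketch of my own, not part of the original article): it fills two N x N matrices with random values, runs both loop orders, times each, and compares the results. Since both orders add the partial products in the same k sequence, the outputs should agree to within floating-point rounding.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 512

static double A[N][N], B[N][N], C1[N][N], C2[N][N];

int main(void)
{
    clock_t t;

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            A[i][j] = (double)rand() / RAND_MAX;
            B[i][j] = (double)rand() / RAND_MAX;
        }

    /* ijk order: dot product of row i of A and column j of B */
    t = clock();
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            double sum = 0;
            for (int k = 0; k < N; ++k)
                sum += A[i][k] * B[k][j];
            C1[i][j] += sum;
        }
    printf("ijk: %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    /* ikj order: row i of C accumulates A[i][k] times row k of B */
    t = clock();
    for (int i = 0; i < N; ++i)
        for (int k = 0; k < N; ++k) {
            double r = A[i][k];
            for (int j = 0; j < N; ++j)
                C2[i][j] += r * B[k][j];
        }
    printf("ikj: %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    /* The two orders add the same products in the same k sequence,
     * so the results should agree to within rounding. */
    double maxdiff = 0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            double d = fabs(C1[i][j] - C2[i][j]);
            if (d > maxdiff) maxdiff = d;
        }
    printf("max difference: %g\n", maxdiff);
    return 0;
}

Compiled with gcc -O2, the ikj version typically finishes in a fraction of the ijk version's time, and the reported difference is at the level of rounding noise.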

So why is that?

That's because the CPU does not go straight to memory when it reads data: it first checks whether the data is already in the cache and, if it is, reads it from there. Reading from the cache is much faster than reading from memory.

When the data is not in the cache, the CPU loads a whole block containing that data into the cache. If the program has good spatial locality, the accesses that follow the first cache miss can be served directly from the cache. Besides spatial locality (programs tend to reference data adjacent to data they have just used), there is also temporal locality (programs tend to reference data they have referenced recently).
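
To make spatial locality concrete, here is a small demonstration (my own sketch, not from the original article): it sums the same matrix twice, once row by row and once column by column. The row-wise pass touches memory sequentially and gets several elements out of each cache line; the column-wise pass jumps a full row's worth of bytes per access and misses far more often.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 2048

static double M[N][N];              /* 32 MB, row-major */

int main(void)
{
    double sum = 0;
    clock_t t;

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            M[i][j] = (double)rand() / RAND_MAX;

    /* Row-wise: consecutive accesses are adjacent in memory. */
    t = clock();
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            sum += M[i][j];
    printf("row-wise:    %.3f s (sum = %g)\n",
           (double)(clock() - t) / CLOCKS_PER_SEC, sum);

    /* Column-wise: each access jumps N * sizeof(double) bytes. */
    sum = 0;
    t = clock();
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            sum += M[i][j];
    printf("column-wise: %.3f s (sum = %g)\n",
           (double)(clock() - t) / CLOCKS_PER_SEC, sum);
    return 0;
}

Both passes do exactly the same arithmetic; only the access pattern differs, and on a typical desktop the column-wise pass is several times slower.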

Now let's go back to matrix multiplication (considering just the inner loop).

The former has good spatial locality with respect to matrix A: if a cache line holds four elements, each iteration incurs on average only 0.25 misses on A. It has no such locality for B, however, because B is accessed by column, so every access to B misses. The total is therefore 1.25 misses per iteration.

The latter has good locality for both matrix C and matrix B, each costing only 0.25 misses per iteration, so the total is 0.5. The latter does perform one extra memory store per iteration (the write to C[i][j]), but even so it runs faster than the former.
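
Those per-iteration figures can be reproduced with a toy model (again my own sketch, with loud assumptions: a cache line holds 4 doubles, each matrix gets reuse only out of the line it touched most recently, i.e. a streaming approximation of a cache that is small relative to the matrices, and the accesses that fall outside the inner loop are ignored, as in the analysis above):

#include <stdio.h>

#define N    64     /* matrix dimension (divisible by LINE) */
#define LINE 4      /* doubles per cache line (assumption) */

static long misses;
static long last_line[3];           /* last line touched in A, B, C */

/* Count a miss whenever an array touches a different line than the
 * one it touched last -- a streaming model, not a real cache. */
static void touch(int array, int row, int col)
{
    long line = ((long)row * N + col) / LINE;   /* row-major layout */
    if (line != last_line[array]) {
        ++misses;
        last_line[array] = line;
    }
}

int main(void)
{
    /* ijk order: A streams along a row, B is walked down a column.
     * The store to C happens outside the inner loop and is ignored. */
    misses = 0;
    last_line[0] = last_line[1] = last_line[2] = -1;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k) {
                touch(0, i, k);                 /* A[i][k] */
                touch(1, k, j);                 /* B[k][j] */
            }
    printf("ijk: %.2f misses per inner iteration\n",
           (double)misses / N / N / N);

    /* ikj order: B and C both stream along rows; the single load of
     * A[i][k] per inner loop happens outside it and is ignored. */
    misses = 0;
    last_line[0] = last_line[1] = last_line[2] = -1;
    for (int i = 0; i < N; ++i)
        for (int k = 0; k < N; ++k)
            for (int j = 0; j < N; ++j) {
                touch(1, k, j);                 /* B[k][j] */
                touch(2, i, j);                 /* C[i][j] */
            }
    printf("ikj: %.2f misses per inner iteration\n",
           (double)misses / N / N / N);
    return 0;
}

It prints 1.25 misses per inner iteration for the ijk order and 0.50 for the ikj order, matching the analysis.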

All in all, if you want your program to run fast, exploit locality: let the cache hold your data and cut down on memory accesses. Remember that the CPU can reach the L1 cache in about 3 clock cycles and the L2 cache in about 10, while a trip to main memory costs hundreds of cycles. It's not hard to see which one wins.
