Matrix multiplication is a common subproblem in larger computations, from solving systems of linear equations to training deep neural networks, so it is important to perform it as efficiently as the hardware platform allows. In this series of blog posts, I measure the performance of the simplest three-loop implementation of matrix multiplication and compare it with that of libraries used in industry. I also discuss the bottlenecks and how knowledge of the CPU architecture lets us approach the efficiency of industrial-grade high-performance math libraries.
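For reference, here is a minimal sketch of what a "simplest three-loop implementation" could look like in C. It assumes square n x n matrices of doubles stored in row-major order; the function name `matmul_naive` and the storage layout are my assumptions for illustration, not necessarily what the posts in the series use.

```c
#include <assert.h>
#include <stddef.h>

/* Naive three-loop matrix multiplication: c = a * b.
 * Assumption: all three matrices are n x n, row-major, non-overlapping.
 * matmul_naive is an illustrative name, not a library function. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t i = 0; i < n; i++) {          /* row of a and c */
        for (size_t j = 0; j < n; j++) {      /* column of b and c */
            double sum = 0.0;
            for (size_t k = 0; k < n; k++) {
                /* b is read down a column here, so consecutive values
                 * of k touch addresses that are n doubles apart. */
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

The inner loop reads `a` sequentially but strides through `b`. That access pattern is one plausible place to look when asking why a loop nest like this falls short of a tuned library.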
The Linux kernel has different types of locks for synchronization; spinlocks, semaphores, and futexes are some examples. These locks behave differently, and using them improperly can degrade performance. For example, spinlocks have the lowest locking and unlocking times, but they waste CPU cycles while waiting. Spinlocks are meant for cases where the waiting time is known to be short. If a kernel thread waits on a spinlock for a considerable amount of time, CPU time is wasted and system performance is severely affected.
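To make the "wasted CPU cycles" point concrete, here is a small user-space sketch of a spinlock built on C11 atomics. It is not the kernel's spinlock implementation. The names `demo_spinlock`, `demo_spin_lock`, `demo_spin_trylock`, and `demo_spin_unlock` are hypothetical, and returning the number of failed attempts is an addition of mine, made only so the busy-waiting becomes visible.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space spinlock; initialize with { ATOMIC_FLAG_INIT }. */
struct demo_spinlock {
    atomic_flag locked;
};

/* Spin until the flag is acquired. Returns how many attempts failed,
 * i.e. how many loop iterations did no useful work. */
unsigned long demo_spin_lock(struct demo_spinlock *l)
{
    unsigned long wasted = 0;

    assert(l != NULL);
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire)) {
        wasted++;   /* the CPU stays busy while the thread only waits */
    }
    return wasted;
}

/* Single attempt: returns true if the lock was taken, false if it was held. */
bool demo_spin_trylock(struct demo_spinlock *l)
{
    assert(l != NULL);
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

/* Release the lock so that a spinning thread can proceed. */
void demo_spin_unlock(struct demo_spinlock *l)
{
    assert(l != NULL);
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

Locking and unlocking each cost a single atomic operation when the lock is free, which is why the uncontended path is cheap. Under contention, every extra iteration of the `while` loop is CPU time spent doing nothing but waiting.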
eBPF is an in-kernel virtual machine in Linux. It can execute user-supplied code in a sandboxed environment inside the kernel.
It can be used in different ways to achieve different goals. The most useful and interesting application of eBPF is dynamic tracing.
eBPF
Today, most applications are multithreaded, and parallelism improves their performance. But there can be scalability bottlenecks inside the kernel, and there are studies that expose such bottlenecks in the Linux kernel. Even if an application is written in a perfectly scalable fashion, bottlenecks inside the kernel can sometimes prevent it from scaling to a large number of processors. There are two main reasons for this problem