High-Performance Matrix Multiplication, Part 1: What's Wrong in Your Code

July 10, 2020

Matrix multiplication is a common subproblem in larger tasks, from solving systems of linear equations to training deep neural networks, so it is important to perform it as efficiently as the hardware platform allows. In this series of blog posts, I measure the performance of the simplest three-loop implementation of matrix multiplication and compare it with that of libraries used in industry. I also discuss the bottlenecks and how knowledge of the CPU architecture lets us approach the efficiency of industrial-grade high-performance math libraries.
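For reference, the "simplest three-loop implementation" mentioned above can be sketched roughly as follows. This is only an illustration in Python; the function name `matmul_naive` and the list-of-lists matrix layout are assumptions made for this sketch, not the code measured in the post.

```python
def matmul_naive(a, b):
    """Multiply matrices a (n x k) and b (k x m), given as lists of rows.

    Hypothetical helper for illustration: a plain triple loop with no
    blocking, vectorization, or other hardware-aware optimization.
    """
    n, k = len(a), len(b)
    m = len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")

    # Result matrix, initialized to zero.
    c = [[0.0] * m for _ in range(n)]

    for i in range(n):          # rows of a
        for j in range(m):      # columns of b
            for p in range(k):  # shared inner dimension
                c[i][j] += a[i][p] * b[p][j]
    return c


if __name__ == "__main__":
    print(matmul_naive([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The innermost loop walks down a column of `b`, so consecutive iterations touch elements from different rows; access patterns like this are the kind of thing a hardware-aware implementation would likely revisit.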

Building useful tools with eBPF: Part 2, Tracing the Locks in Linux Kernel

June 20, 2019

The Linux kernel has different types of locks for synchronization; spinlocks, semaphores, and futexes are some examples. These locks behave differently, and using them improperly can degrade performance. For example, spinlocks have the lowest locking and unlocking times, but they waste CPU cycles while waiting. Spinlocks are therefore used when the waiting time is known to be small. If a kernel thread waits on a spinlock for a considerable amount of time, CPU time is wasted and system performance is severely affected.
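The busy-waiting behavior described above can be illustrated with a toy user-space spinlock. This is a minimal sketch and not how the kernel implements spinlocks: the `SpinLock` class, the `run_workers` helper, and the use of `threading.Lock` as a test-and-set flag are all assumptions made for the example.

```python
import threading


class SpinLock:
    """Toy spinlock (hypothetical): waiters poll in a loop instead of sleeping."""

    def __init__(self):
        # Used only as an atomic test-and-set flag, never acquired blocking.
        self._flag = threading.Lock()
        # Approximate count of failed polls; updated without synchronization.
        self.spins = 0

    def acquire(self):
        # Each failed non-blocking attempt is a wasted trip around the loop:
        # the waiting thread keeps consuming CPU time until the lock is free.
        while not self._flag.acquire(blocking=False):
            self.spins += 1

    def release(self):
        self._flag.release()


def run_workers(n_threads=4, n_iters=1000):
    """Increment a shared counter under the spinlock from several threads."""
    lock = SpinLock()
    counter = [0]

    def worker():
        for _ in range(n_iters):
            lock.acquire()
            counter[0] += 1  # short critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0], lock.spins


if __name__ == "__main__":
    total, spins = run_workers()
    print("counter:", total, "approximate wasted polls:", spins)
```

With a short critical section the number of wasted polls tends to stay small; lengthening the work done while holding the lock makes the waiters spin longer, which is the cost the paragraph above warns about.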

Building useful tools with eBPF: Part 1, Setting up bcc

May 20, 2019

eBPF is an in-kernel virtual machine in Linux. It can execute user-supplied code in a sandboxed environment inside the kernel, and it can be used in different ways to achieve different goals. The most useful and interesting application of eBPF is dynamic tracing. eBPF

kLockStat: An eBPF Tool To Monitor Linux Kernel Lock Contentions

April 30, 2019

Today, most applications are multithreaded. Parallelism improves application performance, but there can be scalability bottlenecks inside the kernel. There are studies that expose such bottlenecks in the Linux kernel. Even if an application is written in a perfectly scalable fashion, bottlenecks inside the kernel can sometimes prevent it from scaling to a large number of processors. There are two main reasons for this problem

Prathyush's Blog

OS/Architecture/Compilers
