Profiling Distributed Applications with Linux perf: A Practical Guide

Optimizing a mature application can feel like hunting for a needle in a haystack. The best way to identify real performance bottlenecks is to let data, not intuition, drive your decisions. In practice, the most time‑consuming code is often the least obvious.

When I suspect a piece of code is slow, I check the evidence before I modify it.

Profiling turns that intuition into measurable facts. While there are dozens of profilers available, choosing the right one for a multi‑threaded or distributed system can be challenging. Instruction‑counting profilers like Callgrind can give a distorted picture of such systems, especially under heavy mutex contention, because they count executed instructions rather than wall‑clock time, so time spent blocked on a lock barely registers.

Statistical profilers, on the other hand, sample the running program at low overhead, making them ideal for complex, concurrent workloads. Linux’s perf is a premier example of this approach.

Below is a step‑by‑step workflow that demonstrates how to use perf together with Brendan Gregg’s FlameGraph to pinpoint hot spots in a distributed application. The example targets the c/hello_dynamic sample from RTI Connext 5.3.0.

1. Install perf
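On Ubuntu, install the perf packages. The versioned package must match your running kernel (check with uname -r), so adjust the version below if yours differs: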

sudo apt-get install linux-tools-common linux-tools-3.13.0-107-generic
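
If you would rather not hard-code the kernel version, the package names can be derived from the running kernel. This is a minimal sketch that assumes the Ubuntu naming scheme used above (linux-tools-<kernel release>); perf_pkgs is a hypothetical helper name, not part of any tool.

```shell
# Build the package list from the running kernel release rather than
# typing a version by hand. perf_pkgs is a hypothetical helper name.
perf_pkgs() {
  echo "linux-tools-common linux-tools-$(uname -r)"
}

# Show what would be installed.
echo "Packages to install: $(perf_pkgs)"

# To install for real (needs root), uncomment the next line:
# sudo apt-get install $(perf_pkgs)
```

The install line is left commented out so you can review the package names first.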

2. Clone FlameGraph
Clone the repository to a convenient location (e.g., your home directory):

git clone https://github.com/brendangregg/FlameGraph

3. Build the example
Navigate to rti_workspace/examples/c and compile with debug symbols:

export DEBUG=1
make -f makefile_Hello_x64Linux3gcc4.8.2

Replace the makefile name with the one that matches your platform if necessary.

4. Run the publisher while profiling
Start a subscriber in the background, then launch perf:

objs/x64Linux3gcc4.8.2/Hello sub &
sudo perf record -g objs/x64Linux3gcc4.8.2/Hello pub

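After a short test period, press Ctrl+C to stop the publisher. perf writes its samples to perf.data in the current directory. The subscriber keeps running in the background, so stop it (for example with kill %1) when you are done.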

5. Convert the data for FlameGraph
Translate the perf output into a folded stack file:

perf script -f | ~/FlameGraph/stackcollapse-perf.pl > out.perf-folded

Then generate the visual flame graph:

~/FlameGraph/flamegraph.pl out.perf-folded > perf.svg
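
The folded file is plain text: one line per unique stack, with frames separated by semicolons from root to leaf, followed by a sample count. That makes it easy to query before rendering anything. The sketch below uses made-up stacks and counts (not real output from the Hello example) and a hypothetical helper, top_leaves, to list the innermost functions by sample count.

```shell
# Made-up folded-stack lines in the "frame;frame;... count" shape.
# Function names and counts are illustrative only.
folded=$(mktemp)
cat > "$folded" <<'EOF'
Hello;main;publish;write_sample 30
Hello;main;publish;serialize 10
Hello;main;sleep 60
EOF

# Sum samples by leaf (innermost) frame and print the hottest first.
top_leaves() {
  awk '{
    count = $NF                        # last field is the sample count
    stack = $0
    sub(/ [0-9]+$/, "", stack)         # drop the count, keep the stack
    n = split(stack, frames, ";")      # frames run from root to leaf
    total[frames[n]] += count
  }
  END { for (f in total) print total[f], f }' "$1" | sort -rn
}

top_leaves "$folded"
```

On a real profile you would pass out.perf-folded instead of the temporary file.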

6. Inspect the results
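Open perf.svg in a web browser. Each bar is a function, and its width is proportional to the number of samples in which that function was on the stack, so wider bars mean more CPU time; the horizontal axis is not a timeline. The vertical axis shows the call stack, with callers below their callees. Clicking a bar zooms in on that stack. Running the publisher without a subscriber removes the right‑hand portion of the graph, confirming that DDS only emits data when subscribers exist.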
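
A bar's width reflects the share of samples in which that function appears somewhere on the stack. As a rough cross-check outside the browser, the same share can be computed from a folded file. This is a sketch with made-up data; frame_share is a hypothetical helper, and the function names are illustrative, not taken from the Hello example.

```shell
# Made-up folded stacks; names and counts are illustrative only.
folded=$(mktemp)
cat > "$folded" <<'EOF'
Hello;main;publish;write_sample 30
Hello;main;publish;serialize 10
Hello;main;sleep 60
EOF

# Print the percentage of all samples whose stack contains a given frame.
# $1 = folded file, $2 = frame name to look for.
frame_share() {
  awk -v frame="$2" '{
    count = $NF
    stack = $0
    sub(/ [0-9]+$/, "", stack)         # drop the trailing sample count
    total += count
    n = split(stack, frames, ";")
    for (i = 1; i <= n; i++)
      if (frames[i] == frame) { hit += count; break }
  }
  END { if (total > 0) printf "%d\n", 100 * hit / total }' "$1"
}

frame_share "$folded" publish
```

Comparing this number for the same function across two runs (for example, with and without a subscriber) gives a quick numeric view of what the two flame graphs show visually.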

perf offers many more options: you can adjust the sampling frequency, exclude kernel code, or focus on specific CPUs. If you have tips or have discovered complementary tools that simplify profiling, share them in the comments.
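
As a starting point, here is how those three variations might look on the publisher command from step 4. The helper perf_cmd is hypothetical and only prints each command so it can be reviewed before running; the flags themselves (-F for sampling frequency, the :u event modifier for user-space only, -C for a CPU list) are standard perf record options, but check perf record --help on your version.

```shell
# Publisher command from step 4; adjust the path for your platform.
APP="objs/x64Linux3gcc4.8.2/Hello pub"

# Hypothetical helper: print the perf command instead of running it.
# $1 = extra perf record options; -g keeps call stacks for FlameGraph.
perf_cmd() {
  echo "sudo perf record $1 -g -- $APP"
}

perf_cmd "-F 99"        # sample at 99 Hz instead of the default rate
perf_cmd "-e cycles:u"  # sample user-space cycles only (skip kernel code)
perf_cmd "-C 0"         # keep only samples taken on CPU 0
```

Copy the printed line you want into the terminal to run it; the rest of the workflow (perf script, stackcollapse-perf.pl, flamegraph.pl) stays the same.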

Happy profiling!
