Okay maybe they don't qualify as actual memory bugs, but they were annoying and had memory as a common theme. One of them by itself doesn't merit a blog post so I bundled them together.
Two?
Yeah. Here's the list:
- Migrating a Rust application from CentOS 7 to Ubuntu 22.04 significantly increased memory consumption
- Rust Leptos app had steadily increasing memory consumption on the backend and would OOM every so often
Migrating a Rust application from CentOS 7 to Ubuntu 22.04 significantly increased memory consumption
A while back I was testing a Rust application I had written, one that principally consumed data from a high-throughput Kafka topic (~300 TiB/day uncompressed) in production, and I decided to try using Ubuntu as the base for the Docker containers I was deploying to Kubernetes. My pods were OOM-churning over and over, and it was pretty obvious the base image change was the problem: I redeployed them on CentOS 7 and there were no issues at all. I decided to leave the issue alone since I was under the gun for a deadline and needed to move on to things actually blocking rolling the service out.
While I was on parental leave, my coworkers were forced to migrate off CentOS 7, fair enough. They did have to significantly increase the memory requests for the application pods, though. Previously, every single pod sat within a ~300 MiB utilization band, ranging from 2.6 to 2.9 GiB. On Ubuntu they were using 3.3 to 5.3 GiB depending on throughput and partition balance; at best the spread between the highest and lowest pod was 900 MiB, and it was often higher.
This had bothered me for quite a while, so I decided to lock in and figure it out once and for all.
Facts gathered:
- Main components of the app were `actix-web` (not significant), the Rust `rdkafka` crate (uses `librdkafka`), `libzstd` (spiky, volatile heap allocation) for decompressing the Kafka messages, and then `diesel_async` for loading and synchronizing state between the application instances via PostgreSQL.
- I was opting to statically link `librdkafka` via `rdkafka` as well as `libzstd`, and both dependencies were vendored in the Rust crate. This is precisely what I wanted, and it means the `libzstd` version installed in the CentOS 7 and Ubuntu containers isn't what was getting linked in.
- I was using `tikv_jemallocator` for better memory utilization stability, less memory consumed, and better application efficiency.
- The Rust application wasn't swinging around much in memory consumption. It catches up on the state of the data it tracks from Kafka by loading the data from PostgreSQL, so it loads up and then settles into basically a flat line. Very low variance.
- `libzstd` was swinging/spiking really hard, presumably because the default configuration is trying to be a "good citizen" and not hold onto memory any longer than necessary to finish decompressing the incoming data. These application instances are ingesting and processing ~100-200 MiB/second w/ 10 cores and (after the fix) 4 GiB of RAM. Most of the RAM utilization was the internal memory data structure for the state being updated and synchronized. `libzstd` was bouncing between "almost nothing" and ~200 MiB.
In the course of kicking these facts around with a coworker, a realization struck me: `tikv_jemallocator` probably isn't linking `malloc` and `free` by default and is instead using a prefix. I investigated the crate documentation and yep, it prefixes by default because a number of platforms don't tolerate unprefixed symbols well.
So to test this hypothesis, I disabled jemalloc and deployed a version that used libc `malloc` only. That used less RAM, but it still wasn't what I had before or what I wanted: the average across all instances was ~3.2 GiB instead of ~3.6 GiB, versus ~2.7 GiB previously, and the variance was still higher.
So then I used the crate feature in `tikv_jemallocator` to disable prefixing and boom, back to the way it was before. Digging into the 9 years of changes between CentOS 7's libc `malloc` and Ubuntu 22.04's libc `malloc`, it seems like they've been trying to make libc `malloc` eat into `tcmalloc` and `jemalloc`'s wheelhouse, and that introduced additional overhead and volatility for my use-case. That, and I was running two mallocs side-by-side unnecessarily.
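If you want to do the same thing, the setup looks roughly like this. I believe the relevant cargo feature is `unprefixed_malloc_on_supported_platforms` (double-check the `tikv_jemallocator` docs for your version); with it enabled, jemalloc exports plain `malloc`/`free`, so statically linked C dependencies like `librdkafka` and `libzstd` allocate through it too:

```rust
// Cargo.toml (sketch):
// [dependencies]
// tikv-jemallocator = { version = "0.5", features = ["unprefixed_malloc_on_supported_platforms"] }

use tikv_jemallocator::Jemalloc;

// With the unprefixed feature, jemalloc also takes over the C-level
// malloc/free symbols, so C libraries stop going through glibc malloc
// and the process runs on a single allocator.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;
```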
OK, done and dusted. Separately, I probably should look into setting `libzstd` tuning parameters to calibrate for the level of throughput we're dealing with. There's an even bigger application than mine that could benefit from this anyhow.
Some context for work I did to improve Kafka consumer throughput with `librdkafka`:
- An updated version of Magnus' "Add fetch.queue.backoff.ms to the consumer" patch (#2879) was later merged under a separate PR, but it's largely what made it into trunk. Fixing this issue led to a huge throughput improvement for my application. A rough sketch of setting that property follows.
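For reference, here's roughly what setting that property looks like from the Rust `rdkafka` crate. This is a minimal sketch with placeholder broker/group/topic names, and it assumes the bundled `librdkafka` is new enough to support `fetch.queue.backoff.ms`:

```rust
use rdkafka::config::ClientConfig;
use rdkafka::consumer::{BaseConsumer, Consumer};

fn build_consumer() -> BaseConsumer {
    let consumer: BaseConsumer = ClientConfig::new()
        .set("bootstrap.servers", "kafka:9092") // placeholder broker
        .set("group.id", "example-group")       // placeholder group
        // How long the fetcher backs off for a partition once the local
        // fetch queue thresholds are exceeded; tuning this down was the
        // point of the patch for high-throughput consumers.
        .set("fetch.queue.backoff.ms", "10")
        .create()
        .expect("consumer creation failed");

    consumer
        .subscribe(&["example-topic"]) // placeholder topic
        .expect("subscribe failed");

    consumer
}
```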
Rust Leptos app had steadily increasing memory consumption on the backend and would OOM every so often
Yeah, this one is really simple: don't use `leptos-query` unless someone takes the project on and fixes it up. There was no way I could find to disable it in `ssr` mode, and it wasn't freeing/deallocating anything properly. Heaptrack seemed to think it might've actually been leaking, but `heaptrack` has given me false positives before, so who knows.
GitHub issue for context: https://github.com/gaucho-labs/leptos-query/issues/36
This impacted ShotCreator which I am working on with a few others.