Kafka Lag Was Really PostgreSQL Write Pressure

Kafka lag showed up first.

That made Kafka the obvious suspect.

The consumer group was behind, and the lag was draining too slowly. More messages were waiting in the topic than the consumer could work through.

But the broker was not the bottleneck.

The consumer was slow because every message had to become a PostgreSQL write before the offset could move forward.

This benchmark was built to separate Kafka symptoms from downstream database cost. We produced 50,000 messages per run, changed payload size and indexing, and measured how long the consumer needed to drain the lag.

The Setup

The architecture was intentionally small so the bottleneck would be visible.

Producer -> Kafka (1 partition) -> Consumer -> PostgreSQL

The consumer design was synchronous. A message was polled, written to PostgreSQL, and only then could the consumer continue making useful progress.

Poll -> Insert -> Commit offset

That design is common because it is simple and reliable. It also means Kafka throughput is tightly coupled to database write latency.

The consumer code made that coupling explicit:

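@KafkaListener(topics = "lag-demo-topic")
public void consume(String message) {
    // Time the database write for this message (nanoTime is monotonic).
    long start = System.nanoTime();

    EventEntity entity = new EventEntity();
    entity.setPayload(message);

    // Blocking insert: the listener cannot return, and the offset cannot
    // advance, until PostgreSQL has accepted the row.
    repository.save(entity);

    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    log.info("Insert took: {} ms", elapsedMs);
}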

The Initial Assumption

When consumer lag grows, teams often look at Kafka first.

That is reasonable. Kafka owns the lag metric. Kafka exposes the consumer group. Kafka is the thing on the dashboard blinking red.

But lag is not a root cause. Lag only says that consumption is slower than production.

Producer Rate > Consumer Processing Capacity

The real question is why the consumer processing capacity dropped.

What We Controlled

To keep the test focused, these variables stayed constant:

Same Kafka cluster
Same hardware
Same consumer group
Same topic shape
Same number of messages: 50,000 per run
Same synchronous consumer design

Only two variables changed:

Payload size: 1 KB, 20 KB, 100 KB
PostgreSQL indexing: minimal indexes versus an additional payload index

Lag was measured with the consumer group command:

kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group lag-demo-group

Throughput was calculated from how long the system needed to drain all 50,000 messages.

Throughput = Total Messages / Drain Time
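As a quick check of that formula, here is a minimal sketch that applies it to one run: 50,000 messages drained in 125 seconds. The class and method names are hypothetical and exist only for illustration.

```java
/** Hypothetical helper: throughput derived from a drain measurement. */
final class DrainThroughput {

    /** Messages per second for a run that drained totalMessages in drainSeconds. */
    static double messagesPerSecond(long totalMessages, double drainSeconds) {
        if (drainSeconds <= 0) {
            throw new IllegalArgumentException("drainSeconds must be positive");
        }
        return totalMessages / drainSeconds;
    }

    public static void main(String[] args) {
        // 50,000 messages drained in 125 seconds.
        System.out.println(messagesPerSecond(50_000, 125)); // prints 400.0
    }
}
```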

The Results

Payload	Indexing	Drain Time	Throughput	Change vs. 1 KB Baseline
1 KB	Minimal	125 s	400 msg/sec	Baseline
20 KB	Minimal	152 s	329 msg/sec	18% lower
100 KB	Minimal	288 s	173 msg/sec	57% lower
100 KB	Extra index	336 s	149 msg/sec	63% lower (14% below the 100 KB minimal run)

The pattern was clear. Kafka did not suddenly become worse. The consumer became slower as the PostgreSQL write became more expensive.

At 1 KB, the system drained 50,000 messages in 125 seconds. At 100 KB, the same message count needed 288 seconds. With the extra index, drain time increased again to 336 seconds.

The lag curve was really a database write-cost curve.
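One way to read the results as a write-cost curve is to convert drain time into average time per message. Because the consumer handles one message at a time, drain time divided by message count approximates the full per-message cost (poll, insert, and offset handling combined). The sketch below uses the drain times from the results table; the class and method names are hypothetical.

```java
/** Hypothetical helper: average per-message cost of a synchronous consumer. */
final class PerMessageCost {

    /** Average milliseconds spent per message over a whole drain. */
    static double millisPerMessage(double drainSeconds, long totalMessages) {
        if (totalMessages <= 0) {
            throw new IllegalArgumentException("totalMessages must be positive");
        }
        return drainSeconds * 1000.0 / totalMessages;
    }

    public static void main(String[] args) {
        long total = 50_000;
        String[] scenarios = {
            "1 KB, minimal", "20 KB, minimal", "100 KB, minimal", "100 KB, extra index"
        };
        double[] drainSeconds = {125, 152, 288, 336};

        for (int i = 0; i < scenarios.length; i++) {
            System.out.printf("%-22s %.2f ms/message%n",
                    scenarios[i], millisPerMessage(drainSeconds[i], total));
        }
    }
}
```

On these numbers, the average cost rises from 2.5 ms per message at 1 KB to about 5.8 ms at 100 KB, and the extra index adds roughly 1 ms on top of that.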

Why Payload Size Changed Lag

A larger Kafka message is not only a larger Kafka message.

In this pipeline, a larger message became a larger PostgreSQL row payload. That changed the cost profile after the consumer received the message.

More bytes copied through the JVM
More memory allocation during processing
Larger database write payload
More WAL generated by PostgreSQL
More disk and buffer pressure
Longer transaction work per message

The consumer was not CPU-bound on parsing alone. It was waiting on the downstream write path.
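A rough calculation shows how much the byte volume changed between runs. The sketch below estimates the payload bytes per second that the consumer pushed toward PostgreSQL in each minimal-index run. It assumes 1 KB means 1,024 bytes and counts payload only; actual WAL and disk volume will differ because of row overhead, TOAST compression, and index writes, so treat the output as an order-of-magnitude guide. The names are hypothetical.

```java
/** Hypothetical helper: payload bytes written per second during a drain. */
final class PayloadWriteRate {

    /** Payload bytes per second, ignoring row overhead, WAL, and compression. */
    static double bytesPerSecond(long messages, long payloadBytes, double drainSeconds) {
        if (drainSeconds <= 0) {
            throw new IllegalArgumentException("drainSeconds must be positive");
        }
        return (double) messages * payloadBytes / drainSeconds;
    }

    public static void main(String[] args) {
        long messages = 50_000;
        long[] payloadKb = {1, 20, 100};           // payload sizes from the benchmark
        double[] drainSeconds = {125, 152, 288};   // minimal-index drain times

        for (int i = 0; i < payloadKb.length; i++) {
            // Assumption: 1 KB = 1,024 bytes.
            double rate = bytesPerSecond(messages, payloadKb[i] * 1024, drainSeconds[i]);
            System.out.printf("%d KB payload: %.2f MiB/s%n",
                    payloadKb[i], rate / (1024.0 * 1024.0));
        }
    }
}
```

Under those assumptions, message throughput fell by a little more than half between 1 KB and 100 KB, while payload bytes per second rose by more than 40 times. That suggests fixed per-message overhead dominates at small payloads and byte volume matters more as payloads grow.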

Why Indexes Made It Slower

The extra index made the 100 KB scenario worse because every insert had more work to do.

The table shape looked like this:

CREATE TABLE events (
    id BIGSERIAL PRIMARY KEY,
    payload VARCHAR(200000),
    field1 VARCHAR(100),
    field2 VARCHAR(100),
    field3 VARCHAR(100),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_payload ON events(payload);

Indexes are excellent when they help reads. On a write-heavy ingest table, every secondary index is also write amplification.

For each inserted event, PostgreSQL had to write the row and maintain the index structure. With large payloads, that maintenance cost became visible in the drain time.

The Producer Was Not The Limiting Side

The producer could push the test messages quickly enough to create lag.

The load endpoint generated a fixed payload and sent messages into Kafka in a loop:

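@PostMapping("/load")
public String generate(@RequestParam int count) {
    // Build one fixed payload up front so every message in the run has the same size.
    String payload = generateJsonPayload(100);

    // Send the same payload count times; the loop does not wait on the consumer.
    for (int i = 0; i < count; i++) {
        producer.send(payload);
    }

    return "Sent " + count + " messages";
}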

That part is important. Kafka accepted the work. The backlog existed because the consumer could not finish the downstream insert path fast enough.

The First Wrong Fix

The tempting fix is to increase Kafka consumer concurrency immediately.

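Sometimes that is correct. But it is not free, and in this setup it would not have helped on its own: with a single partition, only one consumer in the group receives messages, so extra consumers would also require more partitions.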

If the database is already the bottleneck, more consumers can turn one slow writer into multiple competing writers. That may improve throughput for a while, or it may increase lock pressure, WAL pressure, connection pool pressure, and disk I/O.

Before adding consumers, the better question is whether PostgreSQL can absorb more concurrent writes.

What The Lag Was Telling Us

Kafka lag was the symptom. PostgreSQL write latency was the constraint.

The consumer had a simple contract: do not move forward until the message is saved.

That means each offset depended on the database accepting the write. As payload size increased, offset progress slowed. As index maintenance increased, offset progress slowed again.

This is why looking only at Kafka can mislead the investigation. The broker can be healthy while the consumer group is still falling behind.
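The contract can be modeled in a few lines of plain Java. This is a simplified sketch, not Kafka client code: the list stands in for a partition, the hypothetical Store interface stands in for the PostgreSQL write path, and the returned value is the offset that would be safe to commit.

```java
import java.util.List;

/** Simplified model of a consumer whose offset is gated by a downstream write. */
final class GatedOffsetDemo {

    /** Hypothetical stand-in for the database write path. */
    interface Store {
        void save(String message) throws Exception;
    }

    /**
     * Processes messages in order and advances the offset only after a
     * successful save. Returns the next offset to process.
     */
    static int consume(List<String> partition, int startOffset, Store store) {
        int offset = startOffset;
        while (offset < partition.size()) {
            try {
                store.save(partition.get(offset));
            } catch (Exception e) {
                // Write failed: stop here so this message is retried later.
                break;
            }
            offset++;
        }
        return offset;
    }

    public static void main(String[] args) {
        List<String> partition = List.of("a", "b", "c");
        Store failsOnC = message -> {
            if (message.equals("c")) {
                throw new IllegalStateException("database unavailable");
            }
        };
        System.out.println(consume(partition, 0, failsOnC)); // prints 2
    }
}
```

In this model the offset can never get ahead of the store, so anything that slows or blocks save also slows or blocks offset progress.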

What I Would Monitor In Production

For this class of incident, I would put Kafka and PostgreSQL metrics on the same dashboard.

Consumer lag by partition
Lag drain rate, not only current lag
Consumer processing time per message
Database insert latency P95/P99
PostgreSQL WAL volume
Index write overhead
Connection pool active and waiting counts
Consumer error and retry count

The most useful chart is often not lag alone. It is lag next to database write latency.

If both rise together, Kafka is probably reporting a downstream bottleneck rather than causing it.
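Drain rate can be derived from two lag readings taken a known interval apart, for example from the LAG column of the consumer group command shown earlier. The sketch below is a minimal illustration with hypothetical names and made-up sample values. If the producer is still running, the result is the net rate (consumption minus production).

```java
/** Hypothetical helper: lag drain rate and estimated recovery time. */
final class LagDrainRate {

    /** Messages per second by which lag shrank; negative means lag is growing. */
    static double drainRate(long lagBefore, long lagAfter, double secondsBetween) {
        if (secondsBetween <= 0) {
            throw new IllegalArgumentException("secondsBetween must be positive");
        }
        return (lagBefore - lagAfter) / secondsBetween;
    }

    /** Estimated seconds until lag reaches zero; infinite if lag is not shrinking. */
    static double secondsToDrain(long currentLag, double drainRate) {
        if (currentLag <= 0) {
            return 0.0;
        }
        if (drainRate <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return currentLag / drainRate;
    }

    public static void main(String[] args) {
        // Illustrative samples only: lag of 50,000, then 40,000 one minute later.
        double rate = drainRate(50_000, 40_000, 60);
        System.out.printf("drain rate: %.1f msg/sec%n", rate);
        System.out.printf("estimated recovery: %.0f s%n", secondsToDrain(40_000, rate));
    }
}
```

With those sample values, lag is draining at about 167 messages per second and the remaining 40,000 messages would need roughly 240 seconds. A shrinking drain rate alongside rising insert latency points at the database rather than the broker.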

Production Recommendations

Keep write-heavy ingest tables lean. Add secondary indexes only when the read path justifies the write cost.
Measure drain time during load tests. Lag count without drain rate does not tell you recovery capacity.
Batch inserts when correctness and failure handling allow it.
Separate hot ingest storage from query-optimized storage when the workloads conflict.
Track payload size as a first-class performance variable.
Scale consumers only after confirming database write capacity.
Treat offset commits as part of the durability design, not only Kafka configuration.
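To make the batching recommendation concrete, here is a minimal sketch of a buffer that groups messages and hands them to a writer in one call. It is not the benchmark's code: InsertBatcher and the writer callback are hypothetical, and the callback is a stand-in for whatever batched write the application uses (for example, a single multi-row insert). The key assumption is that offsets are committed only after flush returns, so a failed write leaves the messages pending for retry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical buffer that groups messages into batched writes. */
final class InsertBatcher {
    private final int batchSize;
    private final Consumer<List<String>> batchWriter; // stand-in for a batched insert
    private final List<String> pending = new ArrayList<>();

    InsertBatcher(int batchSize, Consumer<List<String>> batchWriter) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.batchWriter = batchWriter;
    }

    /** Buffers a message; returns true if this call flushed a full batch. */
    boolean add(String message) {
        pending.add(message);
        if (pending.size() >= batchSize) {
            flush();
            return true;
        }
        return false;
    }

    /** Writes all pending messages in one call, then clears the buffer. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        // If the writer throws, pending is left untouched so the batch can be retried.
        batchWriter.accept(List.copyOf(pending));
        pending.clear();
    }

    public static void main(String[] args) {
        List<Integer> writtenBatchSizes = new ArrayList<>();
        InsertBatcher batcher = new InsertBatcher(3, batch -> writtenBatchSizes.add(batch.size()));

        for (int i = 0; i < 7; i++) {
            batcher.add("event-" + i);
        }
        batcher.flush(); // write the final partial batch

        System.out.println(writtenBatchSizes); // prints [3, 3, 1]
    }
}
```

The trade-off is the one the recommendation hints at: fewer round trips per message, in exchange for redelivery of a whole batch when a write fails before the offset is committed.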

Engineering Lessons

Kafka lag is a symptom. The root cause may live in the consumer, database, network, or external dependency.
Synchronous consumers couple offset progress to downstream latency.
Payload size affects more than broker throughput; it affects database write cost and WAL growth.
Indexes on ingest tables can reduce consumer throughput even when Kafka is healthy.
Drain time is often a better incident metric than lag at one point in time.

Source Code And Benchmark Project

This benchmark is reproducible. The project includes the Spring Boot producer and consumer, Kafka setup, PostgreSQL schema, and lag result files.

https://github.com/nithidol/kafka-pe

Conclusion

Kafka was the place where the symptom appeared, but PostgreSQL was where the time was spent.

As payload size grew from 1 KB to 100 KB, consumer throughput dropped from 400 msg/sec to 173 msg/sec. Adding an index dropped it again to 149 msg/sec.

The broker remained stable. The synchronous consumer was gated by database write cost.

When lag grows, do not stop at the Kafka dashboard. Follow the message all the way through the consumer. Measure the work that must complete before the offset can move.

That is usually where the real bottleneck is waiting.
