Trino + Iceberg failing to write metadata.json solved
What began as a neat data processing pipeline recently hit a roadblock.
Query 20250409_124602_00001_tip63 failed: Failed to commit the transaction during insert: Failed to write json to file:
s3a://warehouse/medical/ergometer-76e41009d6bc4f419bd0583b32d3bdaa/metadata/00397-9f9715cd-94e9-47ba-a351-6781f75d6427.metadata.json
TL;DR: Run ALTER TABLE <table> EXECUTE expire_snapshots(retention_threshold => '7d');
The problem appears
A while back I set up a data processing environment using Trino, Nessie, MinIO, Argo Workflows, and Argo Events. Argo Events monitors a bunch of MinIO buckets for new files and triggers a workflow in Argo Workflows for each one. These workflows interface with MinIO and Trino to get the data from the uploaded files into an Apache Iceberg table, which is also stored in MinIO.
The data processing ran fine for a while, but eventually it could no longer process mutations. The error appeared to indicate issues with either MinIO (bucket size restrictions, quotas, ACLs) or Trino (configuration problems, incomplete Iceberg support). However, I could not reproduce any such write issues in MinIO, nor could I find issues on Trino’s GitHub that clearly aligned with the problems I saw.
Initial troubleshooting
To keep the development team working on the data pipeline, I put a 'temporary' workaround in place that maintained the existing setup: create a new table, copy the data into it, then remove the original table and rename the new one in its place.
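In Trino terms, the workaround boiled down to something like this (a sketch; the _new suffix and the use of CREATE TABLE AS are my illustration of the steps described above):

-- Copy all data into a fresh table.
CREATE TABLE ergometer_new AS SELECT * FROM ergometer;

-- Drop the broken original, then move the copy into its place.
DROP TABLE ergometer;
ALTER TABLE ergometer_new RENAME TO ergometer;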
Because this set-up was only running in a non-production environment, the workaround was enough until there was an actual solution to the error above. Weeks turned into months, with the same workaround applied whenever the error appeared. Each troubleshooting attempt focused on MinIO or Trino configuration, reinforcing my assumptions. Trino was updated to the latest version each time, but that changed nothing. Eventually, I gave up.
Pivot to Spark?
After months of applying the same workaround, I had given up on the architecture. Spending hours on this issue every time it appeared, hoping to finally find a solution, was not a feasible approach for production, and that moment was slowly coming into sight. A reliable solution was necessary, not a recurring fix. So I decided to abandon Trino/Iceberg completely in favor of Apache Spark using the Kubeflow Spark Operator. All scripts and workflows had to be reworked.
During the installation of the Spark Operator and the creation of SparkApplication objects, I ran into the usual hurdles of setting things up on Kubernetes (arranging netpols and service configuration for webhooks, configuring a service account with the correct permissions; nothing special). Then, during one of my coffee breaks, I came across a neat blog post on Hacker News called “Best Programmers”.
Back to the beginning
The blog describes programmers, but I’ll twist that a bit so it applies to other subjects as well. From the blog:
An expert goes in (after reading the reference!) and sits down to write a config for the tool of which they understand every single line and can explain it to a colleague. That leaves no room for doubt! … Everyone gets stuck at times. The best know how to get unstuck. They simplify problems until they become digestible. That’s a hard skill to learn and requires a ton of experience.
This made me wonder, did I really try to understand the issue? Did I understand the tools I was using? Did I try to make the issue smaller? It was quite the wake-up call for me, so I paused my work on Spark and dove back into the initial set-up. The issue had just occurred again, so I had something to work with as well.
I made a short list of the tools in the data processing framework - even those only slightly related - and set out to solve the issue once and for all. In short, and in no particular order: Kubernetes, MinIO, Kyverno, Nessie, Cilium, Argo Workflows, Argo CD, Argo Events, Apache Iceberg, Trino. After prioritizing the list a bit, I was about to look into Trino, Nessie, and MinIO first - almost making the same mistake again.
This pattern of continuously looking into familiar components (Trino and MinIO) while neglecting the parts I understood less well was what had been holding me back. This is where Apache Iceberg came into view: I had treated it like a glorified Parquet file so far, even though I had heard great things about it (snapshots, partitioning, time travel, and the like).
The real culprit
Apache Iceberg is well documented and has a section on recommended maintenance. This is where I struck gold, with the very first thing I read:
Each write to an Iceberg table creates a new snapshot, or version, of a table. Snapshots can be used for time-travel queries, or the table can be rolled back to any valid snapshot.
Snapshots accumulate until they are expired by the expireSnapshots operation. Regularly expiring snapshots is recommended to delete data files that are no longer needed, and to keep the size of table metadata small.
This hinted at a possible solution for my issue of not being able to write the metadata. I took some time to read more of the docs and understand how the different parts of an Iceberg table function. This is also where I started to appreciate Iceberg’s complexity a lot more. It was time to test this new knowledge and see what would happen.
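Trino makes this part of Iceberg easy to poke at. As a quick sketch (the $snapshots metadata table and FOR VERSION AS OF are features of Trino’s Iceberg connector; the snapshot ID below is made up):

-- List the snapshots that have accumulated for the table.
SELECT snapshot_id, committed_at, operation
FROM "ergometer$snapshots"
ORDER BY committed_at DESC;

-- Time travel: read the table as it was at a given snapshot.
SELECT count(*) FROM ergometer FOR VERSION AS OF 1234567890123456789;

The failing file in the original error was metadata version 00397, so this table had already been through hundreds of commits without a single snapshot ever being expired.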
Testing the solution
I was already familiar with how Trino can call Apache Iceberg’s operations/procedures/functions; I had used the add_files procedure before, in some earlier trials of the pipeline.
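For reference, that looked roughly like this (the location is a made-up example path; add_files registers existing files into the table without rewriting them):

-- Register already-uploaded Parquet files into the Iceberg table.
ALTER TABLE ergometer EXECUTE add_files(
    location => 's3a://warehouse/medical/incoming/',
    format => 'PARQUET');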
So there I went, trying an insert first to confirm the issue was still present. Then I called the expire_snapshots procedure (as it is called in Trino):
ALTER TABLE ergometer EXECUTE expire_snapshots(retention_threshold => '7d');
---
ALTER TABLE EXECUTE
rows
------
(0 rows)
Query 20250410_132550_00014_y2wfr, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0.13 [0 rows, 0B] [0 rows/s, 0B/s]
And an insert:
INSERT INTO ergometer (...) VALUES (...);
---
INSERT: 1 row
Query 20250410_132708_00015_y2wfr, FINISHED, 3 nodes
Splits: 22 total, 22 done (100.00%)
0.72 [0 rows, 0B] [0 rows/s, 0B/s]
Great! There are some loose ends I still need to figure out, like proactively running the required maintenance tasks and setting up monitoring.
Lessons learned & next steps
This journey refocused me on looking beyond the tools I was familiar with and truly trying to understand the entire stack. I had treated Iceberg as a simple file format, when it is actually a complex system that requires maintenance.
For production, I’m implementing (some level of) automated maintenance: scheduling expire_snapshots with Argo Workflows and investigating options for monitoring.
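As a sketch of what such a scheduled job could run against Trino (remove_orphan_files and optimize are further maintenance procedures of Trino’s Iceberg connector; the retention thresholds are just a starting point):

-- Expire snapshots older than seven days.
ALTER TABLE ergometer EXECUTE expire_snapshots(retention_threshold => '7d');

-- Remove files that no snapshot references anymore.
ALTER TABLE ergometer EXECUTE remove_orphan_files(retention_threshold => '7d');

-- Compact the small files left behind by frequent small inserts.
ALTER TABLE ergometer EXECUTE optimize;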
Most importantly, I’ve learned to question my assumptions when troubleshooting. Let’s see if that sticks.