Recently, I had a chance to work on a very large data table. By very large, I mean 1.5 billion rows so far, and it is growing by about 300,000 rows every single minute.
The table takes in a constant stream of data from IoT Edge devices, and I periodically run a data consolidation script (let’s call it the consolidator) to aggregate the stream into per-minute data points.
At first, the consolidator runs reasonably fast for the table’s size: it takes about a minute to perform a SELECT/GROUP BY/ORDER BY query. One might be surprised that one minute counts as fast given the number of rows in the table. The reason is that I do use an index to keep track of the data. Let me take you with me through my journey into the data wonderland.
For readers who just want the TL;DR:
- Try indexing the columns that are used in conditional clauses.
- Think about the nature of your data. If it has intrinsic order, using BRIN might help.
- If you are using GROUP BY DATE_TRUNC or a similar aggregate expression, you might as well index the expression itself.
- Every time the rows are inserted, the indices are also updated.
- SQL is declarative, so the end result of GROUP BY is order-independent. An index is not, however, and you may need to lead the GROUP BY with the indexed column.
For anyone who works with a big data table, indexing is the first go-to solution. An index essentially rearranges your sequential data into a data structure that is search-friendly in terms of time complexity. Postgres automatically creates indices for unique fields, including primary key fields. The default index type is the B-tree, a balanced tree stored in 8 kB pages, which keeps the depth at just three or four levels even when the table holds billions of records!
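As a minimal sketch, using the table and column names that appear later in this article (the index name is illustrative):

```sql
-- A plain B-tree index on the column used in the WHERE/BETWEEN filter.
CREATE INDEX data_table_ts_idx ON data_table (data_timestamp);
```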
I found this to be a great resource explaining indexing, if you have some spare time.
Since there is no free lunch, we sacrifice space in the time-for-space trade-off. In my case, each index takes about 11 GB of disk space!
Luckily, the data is time-series. This opens up another indexing option: the Block Range Index (aka BRIN). It requires minimal space, just over 4 kB in my case.
This index type is based on the assumption that a column’s values are well correlated with the physical order of the table rows. In my specific case, the order of the data is well correlated: each measurement’s timestamp follows its insertion order. And what kind of order is NOT well correlated, you might ask? An artificially created one. For example, user IDs 0001, 0002, 0003, … tell you nothing besides their order; they don’t naturally correlate with the other columns’ values.
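Creating a BRIN index is a one-liner (again, the index name is illustrative):

```sql
-- BRIN stores only a min/max summary per block range of the table,
-- which is why its footprint is tiny compared to a B-tree.
CREATE INDEX data_table_ts_brin_idx
    ON data_table USING brin (data_timestamp);
```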
So, all I need is to go ahead, implement the BRIN index, and go enjoy life in the lavender field? No, not so fast…
My SQL query looks like this (data1 is aggregated per minute; AVG here stands in for the actual aggregate):

SELECT AVG(data1), DATE_TRUNC('minute', data_timestamp) AS minute_ts, data2
FROM data_table
WHERE data_timestamp
BETWEEN DATE_TRUNC('minute', '2022-09-17 07:15+7'::timestamptz)
AND DATE_TRUNC('minute', '2022-09-17 07:30+7'::timestamptz)
GROUP BY data2, DATE_TRUNC('minute', data_timestamp);
And this query took about 14 minutes. Not ideal, but compare it to this:

SELECT AVG(data1), DATE_TRUNC('minute', data_timestamp) AS minute_ts, data2
FROM data_table
WHERE data_timestamp
BETWEEN DATE_TRUNC('minute', '2022-09-17 07:15+7'::timestamptz)
AND DATE_TRUNC('minute', '2022-09-17 07:30+7'::timestamptz)
GROUP BY DATE_TRUNC('minute', data_timestamp), data2;
This one ran for more than 12 hours before I realized that, perhaps, I should just give up waiting.
So, what’s the difference here? What makes the same query (from a declarative perspective) take so much longer with a different order of GROUP BY columns?
- Aggregation is expensive.
- You need to be careful about the index when you want to optimize an aggregation query.
[Query plan image: the wrong index]
[Query plan image: an index on the DATE_TRUNC expression]
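What worked for me, sketched below, is an expression index whose leading columns match the GROUP BY. One caveat I ran into: the single-argument DATE_TRUNC on a timestamptz column is only STABLE, not IMMUTABLE, so it cannot be indexed directly. The sketch assumes data_timestamp is a plain timestamp; on PostgreSQL 14+ you could instead index the timezone-aware form DATE_TRUNC('minute', data_timestamp, 'UTC'), which is immutable.

```sql
-- Index the GROUP BY expression itself, in the same column order
-- (index name illustrative; assumes "timestamp without time zone").
CREATE INDEX data_table_minute_data2_idx
    ON data_table (DATE_TRUNC('minute', data_timestamp), data2);
```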
Remember that the table I am trying to optimize takes in a constant stream of data from IoT devices? That implies the DB engine will, most of the time, be performing INSERT operations.
This is taken from the official PostgreSQL documentation:
> Once an index is created, no further intervention is required: the system will update the index when the table is modified, and it will use the index in queries when it thinks doing so would be more efficient than a sequential table scan.
The keyword here is “update”! The index has to be maintained every time new data is inserted. That is not too bad for a BRIN index. However, to keep the query planner making educated decisions, its statistics also have to stay fresh: either autovacuum runs the ANALYZE command for you often enough, or you have to run it regularly yourself.
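For example, using the table from this article:

```sql
-- Refresh the planner's statistics for this table; autovacuum normally
-- handles this, but after heavy bulk inserts a manual run can help.
ANALYZE data_table;
```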
In conclusion, if you are working on big data with PostgreSQL, you need to consider the nature of your application seriously. Choosing the right index can speed things up, and a BRIN index can keep the index size in check. However, you should double-check with the query planner that it does what you expect.
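One way to do that sanity check, sketched with the table and columns used above:

```sql
-- Ask the planner how it actually executes the aggregation; look for an
-- Index Scan or Bitmap Heap Scan on your index rather than a Seq Scan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DATE_TRUNC('minute', data_timestamp) AS minute_ts, data2
FROM data_table
WHERE data_timestamp BETWEEN '2022-09-17 07:15+7'::timestamptz
                         AND '2022-09-17 07:30+7'::timestamptz
GROUP BY data2, DATE_TRUNC('minute', data_timestamp);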
In the next article, I will cover how I managed to break the large index (and table) into partitions using pg_partman.