TLDR: I run some tests on the Vacuum command. Thought you might find it interesting.
What is Vacuum in Postgres?
When rows are updated or deleted in PostgreSQL, the old versions of those rows are not immediately removed from the database files. Instead, they are marked as obsolete, and PostgreSQL reserves that space for future updates. Over time, this can lead to bloat, where the database files contain many obsolete rows, wasting disk space and causing performance issues.
The VACUUM command scans through the database tables and removes the obsolete rows, reclaiming the disk space they occupied. It also updates the statistics that the PostgreSQL query planner uses to determine the most efficient execution plan for queries, ensuring optimal performance.
Regular vacuuming is crucial for maintaining a healthy PostgreSQL database, especially in environments with frequent updates and deletes. Without regular vacuuming, the database files can become bloated, leading to slower queries, increased disk usage, and potential performance bottlenecks.
The VACUUM
command: https://www.postgresql.org/docs/current/sql-vacuum.html
Current table size
SELECT
pg_size_pretty(pg_total_relation_size('postgres_air.boarding_pass')),
pg_stat_user_tables.n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'boarding_pass';
As you can see, the table is 3143MB with 0 dead rows.
Let's update 100,000 rows. It should take under 10 seconds
UPDATE postgres_air.boarding_pass
SET update_ts = update_ts + INTERVAL '1 second'
WHERE pass_id IN (
SELECT pass_id
FROM postgres_air.boarding_pass
LIMIT 100000);
The table is about 13MB larger. The number of dead rows is 100K.
I run it a few more times and saw the table grow around 13MB every time and the number of dead rows was increased by 100K.
3143MB -> 3156MB -> 3179MB -> 3183MB -> 3198MB -> 3211MB
Let's add the ESTIMATED number of rows too
SELECT
pg_size_pretty(pg_total_relation_size('postgres_air.boarding_pass')),
pg_stat_user_tables.n_dead_tup,
pg_stat_user_tables.n_live_tup AS estimated_rows
FROM
pg_stat_user_tables
JOIN pg_class ON pg_stat_user_tables.relname = pg_class.relname
WHERE
pg_class.relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'postgres_air')
AND pg_class.relname = 'boarding_pass';
updating 100K rows
As you can see: 100K more dead rows. The table is 14MB higher. But the estimated number of LIVE rows is the same, which makes perfect sense.
Vacuum
After 5 runs the the table grow from 3143MB to 3211MB. But it is all dead rows. We only updated the DB, replacing one timestamp with another. So in theory, the size of the table should stay the same.
Practically a VACUUM should run.
Auto-vacuum
PostgreSQL has an auto-vacuum feature that automatically removes dead tuples and updates statistics, alleviating developers from manually running VACUUM. However, users should periodically review and tune the auto-vacuum configuration parameters to ensure they align with their database's workload and usage patterns for optimal performance.
To disable auto-vacuum use the command
ALTER TABLE postgres_air.boarding_pass SET (
autovacuum_enabled = false
);
I tried updating 5M rows. Usually the cloud vendors apply auto-vacuum which, as the name suggests, kick in and send the number back to 0. After disabling it the table shows:
Vacuum Setting
To see the vacuum and auto vacuum configuration run
SELECT name, setting, short_desc
FROM pg_settings
WHERE name LIKE '%vacuum%'
ORDER BY name;
Show metadata
We need a better query to show more vacuum related data:
SELECT
pg_size_pretty(pg_total_relation_size('postgres_air.boarding_pass')),
pg_stat_user_tables.n_dead_tup,
pg_stat_user_tables.n_live_tup AS estimated_rows,
pg_stat_user_tables.n_ins_since_vacuum, -- Estimated number of rows inserted since this table was last vacuumed
pg_stat_user_tables.last_vacuum, -- Last time at which this table was manually vacuumed (not counting VACUUM FULL)
now() - pg_stat_user_tables.last_vacuum AS time_since_last_vacuum,
pg_stat_user_tables.last_autovacuum, -- Last time at which this table was vacuumed by the autovacuum daemon.
now() - pg_stat_user_tables.last_autovacuum AS time_since_last_autovacuum,
pg_stat_user_tables.vacuum_count, -- Number of times this table has been manually vacuumed (not counting VACUUM FULL)
pg_stat_user_tables.autovacuum_count
FROM
pg_stat_user_tables
JOIN pg_class ON pg_stat_user_tables.relname = pg_class.relname
WHERE
pg_class.relnamespace = (SELECT oid FROM pg_namespace WHERE nspname = 'postgres_air')
AND pg_class.relname = 'boarding_pass';
Run the VACUUM command
Notice this is not a FULL vacuum. The command run for about 30 seconds.
VACUUM postgres_air.boarding_pass;
While there are no dead rows, the size of the table didn't change!
Let's run FULL VACUUM
from the documentation: Selects “full” vacuum, which can reclaim more space, but takes much longer and exclusively locks the table. This method also requires extra disk space, since it writes a new copy of the table and doesn't release the old copy until the operation is complete. Usually this should only be used when a significant amount of space needs to be reclaimed from within the table.
VACUUM FULL postgres_air.boarding_pass;
It took about 60 seconds.
The table is now back to 3143MB.
No dead rows
The FULL vacuum didn't count. I suspect this is because internally this is a new table.
The re_pack Extension
pg_repackis a PostgreSQL extension which lets you remove bloat from tables and indexes, and optionally restore the physical order of clustered indexes. Unlike CLUSTERand VACUUM FULLit works online, without holding an exclusive lock on the processed tables during processing. pg_repack is efficient to boot, with performance comparable to using CLUSTER directly.
It is recommended to create this extension and use it, instead of VACUUM FULL.