GIN Index Pending Lists (Fast Updates)

A GIN - generalised inverted - index is a useful alternative index access method in PostgreSQL. It stores a list of row IDs for each key, rather than (key, rowid) pairs like the btree (default) access method. This makes it useful for indexing composite values (e.g. arrays, hstore), where the same value is likely to appear in many rows, and a single row many have many values in the indexed item. As well as this, the btree_gin extension implements operator classes with equivalent behaviour to btree indexes for most default PostgreSQL data types. This can be useful when indexing a column has that has few distinct values relative to the number of rows in the table, as the index is much more compact: keys are stored once, rather than once per indexed tuple.

However, a problem with GIN is that updating the index can be slow: a single row may contain multiple values in the indexed item, requiring multiple insertions into the index when a row is inserted or updated. For this reason, by default GIN accumulates new tuples to insert into the index in a pending list. Tuples are added to this list until the index is vacuumed or it reaches gin_pending_list_limit, when the tuples are added in bulk to the index itself. The best situation is to have these updates happen in the background with an autovacuum or autoanalyze. If they happen in the foreground due to getting too large, it will considerably slow down the command that triggered the update.

The problem with delaying index updates is that all items in the pending list have to be scanned every time the index is searched.

Example

We'll use an index on a value with a relatively small number of distinct integer values to illustrate the size advantage of a GIN index, using the btree_gin extension.

SET ROLE postgres;
CREATE EXTENSION IF NOT EXISTS btree_gin;
RESET ROLE;

SET
CREATE EXTENSION
RESET

DROP TABLE IF EXISTS testtable;

DROP TABLE

CREATE TABLE testtable AS SELECT generate_series(1, 10) x, y FROM (SELECT generate_series(1, 10000)) y(y);

SELECT 100000

CREATE INDEX testtable_x_idx_btree ON testtable (x);

CREATE INDEX

CREATE INDEX testtable_x_idx_gin ON testtable USING gin (x);

CREATE INDEX

The GIN index is about ten times smaller than the btree index (the size of the GIN index was reduced with the posting list compression added in 9.4 - see backend/access/gin/ginpostinglist.c):

\di+ testtable_x_idx_*

List of relations
Schema	Name	Type	Owner	Table	Size	Description
public	testtable_x_idx_btree	index	pete	testtable	2208 kB
public	testtable_x_idx_gin	index	pete	testtable	256 kB

Determining the current size of the pending list

We can use the pageinspect extension to find out how many items are in the GIN pending list. This requires superuser access:

SET ROLE postgres;
CREATE EXTENSION IF NOT EXISTS pageinspect;
RESET ROLE;

SET
CREATE EXTENSION
RESET

The pageinspect extension has a gin_metapage function in 9.5+:

SET ROLE postgres;
SELECT
     pending_head,
     pending_tail,
     n_pending_pages,
     pg_size_pretty(n_pending_pages * 8192) AS size,
     n_pending_tuples
FROM gin_metapage_info(get_raw_page('testtable_x_idx_gin', 0));
RESET ROLE;

SET
pending_head	pending_tail	n_pending_pages	size	n_pending_tuples
4294967295	4294967295	0	0 bytes	0
RESET

In earlier versions, we have to decode the metapage manually (little-endian only - works on x86):

SET ROLE postgres;
WITH ints AS (
    SELECT
        substring(v, 25, 4) AS low,
        substring(v, 29, 4) AS high
    FROM get_raw_page('testtable_x_idx_gin', 0) v),
decoded AS (
    SELECT 
        get_byte(low, 0)::bigint + (get_byte(low, 1) << 8) + (get_byte(low, 2) << 16) + (get_byte(low, 3) << 24) low,
        get_byte(high, 0)::bigint + (get_byte(high, 1) << 8) + (get_byte(high, 2) << 16) + (get_byte(high, 3) << 24) high
    FROM ints
),
decoded_pages AS (
    SELECT low, high, CASE WHEN low = -1 THEN 0 ELSE high - low + 1 END AS pages
    FROM decoded
)
SELECT low, high, pages AS pages, pg_size_pretty((high - low + 1) * 8192) AS size
FROM decoded_pages;
RESET ROLE;

SET
low	high	pages	size
-1	-1	0	8192 bytes
RESET

Turning Off Fast Updates

The FASTUPDATE storage option can be used to turn off the fast update feature. Inserts are slower, but lookups will generally be faster, as the pending list does not have to be scanned. This can get very slow for large pending lists.

ALTER INDEX testtable_x_idx_gin SET (fastupdate = false);

ALTER INDEX

You can see the current values of any index options with the \d+ command in psql:

\d+ testtable_x_idx_gin

Index "public.testtable_x_idx_gin"
Column	Type	Definition	Storage
x	integer	x	plain
gin, for table "public.testtable"
Options: fastupdate=false