This article aims to illustrate the timing of several events that occur during transaction commit and their consequences for real-life applications.
The behavior will be shown using a few hands-on experiments.
For these experiments we need 3 Postgres clusters: one master, one synchronous replica, and one asynchronous replica. In the following they are called master, sync, and async. The synchronous replica is configured with a replay delay of 5 seconds. All databases and the psql client run on the same machine.
The setup is shown in detail below. If you have access to a Debian or Ubuntu system with PGDG Postgres, you should be able to follow along.
Copy the following script to a file called inc.sh. It defines a couple of bash functions. To activate them in a running bash, run . inc.sh. These functions work with a file called ports.sh; it is assumed that you are allowed to write that file in the current directory. It is best to start in an empty working directory.
V=16
ASPG='sudo -iu postgres'

# ctl <cluster> <action>: start/stop/restart a cluster, waiting for completion
ctl () { $ASPG pg_ctlcluster $V $1 $2 -- -w; }

# create <cluster>: create a new cluster and record its port in ports.sh
create () {
    local -n __=$1
    $ASPG pg_createcluster $V $1
    __=$($ASPG pg_conftool -s $V $1 show port)
    printf >>ports.sh '%s=%q\n' $1 $__
}

# drop <cluster>: stop and remove a cluster and its ports.sh entry
drop () {
    ctl $1 stop
    $ASPG pg_dropcluster $V $1
    sed -i "/^$1=/d" ports.sh
}

# copy <primary> <standby>: clone a cluster into a new standby
copy () {
    . ports.sh
    local -n __=$1
    create $2
    $ASPG rm -rf $V/$2
    $ASPG /usr/lib/postgresql/$V/bin/pg_basebackup -p $__ -R -D $V/$2
    ctl $2 start
    # add an application_name to primary_conninfo so the standby can be
    # identified in pg_stat_replication
    psql $2 <<PSQL
SELECT format('ALTER SYSTEM SET primary_conninfo=%L', 'application_name=$2 ' || setting)
  FROM pg_settings
 WHERE name='primary_conninfo'\gexec
SELECT pg_reload_conf();
PSQL
}

# psql <cluster> [options...]: connect to the given cluster
psql () {
    . ports.sh
    local -n __=$1; shift
    $ASPG psql -Xp $__ "$@"
}
With this setup in place, we can now create the databases:
. inc.sh
create master; ctl master start
psql master <<'PSQL'
ALTER SYSTEM SET wal_level = logical;
ALTER SYSTEM SET track_commit_timestamp = on;
-- pull in dblink and create our test table
CREATE EXTENSION dblink;
CREATE TABLE t (i int);
PSQL
ctl master restart
copy master sync
psql sync <<'PSQL'
ALTER SYSTEM SET recovery_min_apply_delay = '5s';
SELECT pg_reload_conf();
PSQL
psql master <<'PSQL'
ALTER SYSTEM SET synchronous_standby_names = 'sync';
SELECT pg_reload_conf();
PSQL
copy master async
To verify the setup, connect to the master (psql master) and check the pg_stat_replication view:
postgres=# SELECT application_name, state, sync_state FROM pg_stat_replication;
application_name | state | sync_state
------------------+-----------+------------
sync | streaming | sync
async | streaming | async
(2 rows)
First, let's show that the delay works. The following command passes the sync replica's port to psql as the variable rp, inserts a row on the master, and then repeatedly counts the rows in t both locally and, via dblink, on the sync replica. We expect to see a row count of 1 on the master immediately after the insert. On the replica, however, the row should become visible only after 5 seconds.
psql master -v rp=$sync <<'PSQL'
INSERT INTO t VALUES (1);
SELECT *
FROM (SELECT count(*) AS master_count FROM t) AS local
, (SELECT *
FROM dblink(
'port='||:'rp',
'SELECT count(*) FROM t'
) AS remote(replica_count BIGINT)
)\watch i=2 c=4
PSQL
The output matches our expectations.
INSERT 0 1
Sat 11 May 2024 08:02:05 PM CEST (every 2s)
master_count | replica_count
--------------+---------------
1 | 0
(1 row)
...
Sat 11 May 2024 08:02:09 PM CEST (every 2s)
master_count | replica_count
--------------+---------------
1 | 0
(1 row)
Sat 11 May 2024 08:02:11 PM CEST (every 2s)
master_count | replica_count
--------------+---------------
1 | 1
(1 row)
We have set up a synchronous replica with an artificial replay delay. Shouldn't we expect the INSERT statement to take 5 seconds?
This behavior is controlled by the synchronous_commit parameter. By default it is on, but it has more possible values than just on and off. Let's set it to remote_apply and repeat the operation:
SET synchronous_commit = remote_apply;
SELECT set_config('my.before_insert', now()::TEXT, false);
INSERT INTO t VALUES (1);
SELECT now() - current_setting('my.before_insert')::TIMESTAMPTZ AS elapsed;
This time the transaction took 5 seconds:
elapsed
-----------------
00:00:05.004428
(1 row)
Note that the synchronous_commit parameter can be set for each session separately. The setting in the Postgres configuration is just the default for the cluster.
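For example, a single session, or even a single transaction, can choose a different level. Here is a minimal sketch based on our setup; the inserted values are arbitrary:
psql master <<'PSQL'
-- this session alone stops waiting for the sync replica
SET synchronous_commit = off;
INSERT INTO t VALUES (1);  -- returns immediately

-- or override the cluster default for a single transaction
BEGIN;
SET LOCAL synchronous_commit = remote_apply;
INSERT INTO t VALUES (1);
COMMIT;  -- this COMMIT waits the full 5 seconds
PSQL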
Let's dig a bit deeper. Now we want to find out which part of the transaction is actually delayed. So, we start a transaction, insert a row and commit. Right before the commit we call now() [1] and clock_timestamp() [2], and we also fetch the transaction id. The difference between these two timestamps tells us how long it took the transaction to get to this point. We store all this information in psql variables using \gset so that we can use them later, after COMMIT.
psql master -q <<'PSQL'
SET synchronous_commit = remote_apply;
BEGIN;
INSERT INTO t VALUES (1);
SELECT now() AS tx_start
, clock_timestamp() AS before_commit
, pg_current_xact_id() AS txid
\gset
COMMIT;
SELECT now() - :'tx_start'::TIMESTAMPTZ AS tx_start_age
, now() - :'before_commit'::TIMESTAMPTZ AS before_commit_age
, now() - pg_xact_commit_timestamp(:'txid') AS commit_age
PSQL
Here is the output:
tx_start_age | before_commit_age | commit_age
-----------------+-------------------+-----------------
00:00:05.005023 | 00:00:05.003971 | 00:00:05.003774
(1 row)
So, the COMMIT command is the part that is actually delayed. This matches the values we got from now() and clock_timestamp() within the transaction.
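As a quick aside, the difference between the two functions [1][2] is easy to see (a minimal illustration, independent of replication):
psql master <<'PSQL'
BEGIN;
SELECT pg_sleep(1);
-- now() is frozen at transaction start, clock_timestamp() keeps running
SELECT clock_timestamp() - now() AS drift;
COMMIT;
PSQL
The reported drift will be slightly more than the 1 second spent in pg_sleep().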
Now, the most interesting piece here is the commit timestamp of the transaction. It can be read only after the transaction has committed, of course. It is also 5 seconds old. Does that tell us the transaction actually became visible 5 seconds before COMMIT returned?
To answer that, let's insert another row and use another session to see when it becomes visible on the 3 databases. Both sessions are started in the background. At the end we use bash's wait to wait for both. The first session simply runs the INSERT command; this time we insert the value 2. The other session counts the number of such rows in the table. It does that 140 times with an interval of 0.05 seconds in between.
In addition, strace is used to trace the backend process performing the insert. We do that to find the exact timestamp when the COMMIT record is written to the WAL.
The 2 psql processes write their output to writer.out and reader.out. The strace output arrives in /tmp/writer.trace. Note that this only works if the database cluster runs on the same machine as psql and if psql is run as the same user that runs the database.
psql master -qAtF $'\t' <<'PSQL' >writer.out &
-- format strace command line to trace our backend process
SELECT format('\! strace -tt -p %s >/tmp/writer.trace 2>&1 &', pg_backend_pid()) AS cmd\gset
-- run strace
:cmd
-- give it a bit of time to connect to the backend
SELECT pg_sleep(.1);
-- now run the transaction
SET synchronous_commit = remote_apply;
BEGIN;
INSERT INTO t VALUES (2);
SELECT pg_current_xact_id() AS xid\gset
SELECT 'before commit: ' || clock_timestamp();
COMMIT;
SELECT 'after commit: ' || clock_timestamp();
SELECT format(E'xid: %s; commit timestamp %s\n', :'xid', pg_xact_commit_timestamp(:'xid'));
PSQL
psql master -AtF $'\t' -v sync=$sync -v async=$async >reader.out <<'PSQL' &
WITH local(cnt) AS (
SELECT count(*) FROM t WHERE i=2
)
, sync(cnt) AS (
SELECT *
FROM dblink(
'port='||:'sync',
'SELECT count(*) FROM t WHERE i=2'
) AS t(count BIGINT)
)
, async(cnt) AS (
SELECT *
FROM dblink(
'port='||:'async',
'SELECT count(*) FROM t WHERE i=2'
) AS t(count BIGINT)
)
SELECT now(), local.cnt, sync.cnt, async.cnt
FROM local, sync, async
\watch i=.05 c=140
PSQL
wait
First, let's condense the reader output to the important pieces. This is the command:
uniq -f2 reader.out | column -t -s $'\t' -N time,local,sync_repl,async_repl
And this is the result:
time local sync_repl async_repl
2024-05-14 21:12:40.397404+02 0 0 0
2024-05-14 21:12:40.547516+02 0 0 1
2024-05-14 21:12:45.497518+02 0 1 1
2024-05-14 21:12:45.547532+02 1 1 1
This is writer.out:
before commit: 2024-05-14 21:12:40.502987+02
after commit: 2024-05-14 21:12:45.509074+02
xid: 784; commit timestamp 2024-05-14 21:12:40.503772+02
And this is the interesting part in the strace output:
21:12:40.503865 openat(AT_FDCWD, "pg_wal/000000010000000000000004", O_RDWR|O_CLOEXEC) = 8
21:12:40.503999 pwrite64(8, "\23\321\5\0\1\0\0\0\0\0\5\4\0\0\0\0=\2\0\0\0\0\0\0\0\20\0{4\0\0\1"..., 8192, 327680) = 8192
21:12:40.504146 fdatasync(8) = 0
Now we can reconstruct the timeline:
- 21:12:40.502987: the writer is about to send COMMIT
- 21:12:40.503772: the commit timestamp is recorded
- 21:12:40.503865 to 21:12:40.504146: the COMMIT record is written to the WAL and flushed [3]
- 21:12:40.547516: the reader sees the row on the async replica
- 21:12:45.497518: the reader sees the row on the sync replica
- 21:12:45.509074: the COMMIT command finishes
- 21:12:45.547532: the reader sees the row on the master
As expected, on the sync replica, the row became visible after 5 seconds.
However, on the async replica it became visible almost immediately.
Interestingly, despite the commit timestamp being recorded very close to the beginning of COMMIT
, the row did not become visible immediately on the master. In fact, it became visible there last.
How does that work?
Replication in PG is based on the WAL. The Write-Ahead Log is an essential part of the durability guarantee a database provides. It is an ordered record of all changes in the database. Every change a transaction performs, including the final commit, is added to the WAL.
When the database crashes due to power loss or similar, it can recover up to the last committed transaction by replaying the WAL. During normal crash recovery, this procedure ends when there is no more WAL left to replay. Replication works by extending crash recovery indefinitely: the master keeps generating WAL, which is shipped to the replica and replayed there.
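This can be watched directly: pg_stat_replication on the master shows how far each standby has gotten with replaying the WAL. Right after a write, the sync replica's replay_lsn should trail the master's current position for about 5 seconds (a small sketch; the column selection is our own choice):
psql master <<'PSQL'
INSERT INTO t VALUES (1);
SELECT application_name,
       pg_current_wal_lsn() AS master_lsn,
       replay_lsn
  FROM pg_stat_replication;
PSQL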
So:
- the COMMIT record is written to the WAL at the very beginning of the COMMIT command;
- from that moment on it can be shipped and replayed, so each replica makes the transaction visible on its own schedule;
- on the master the transaction becomes visible only at the end of COMMIT processing, after any synchronous standbys have confirmed.
For the examples above we chose to combine remote_apply with a significant replay delay in order to demonstrate:
- that a transaction can become visible on a replica before it becomes visible on the master;
- that the order in which a transaction becomes visible on the master and its replicas is not defined.
Similar race conditions can also occur without remote_apply or even without any replica. This often comes as a surprise to developers. Let's now look at an example that's closer to real life.
Now, why does it matter whether a transaction appears in the WAL at the beginning or at the end of COMMIT?
Imagine we use the wal2json logical decoding plugin to capture data changes in the database involving our table t. Every time a transaction commits, we want to know how many rows have been inserted by that transaction.
The output is a stream of JSON objects. This output is filtered with jq and translated into a stream of SQL queries like:
SELECT :'xact_id', pg_xact_status(:'xact_id'), count(*)
FROM t
WHERE xmin=:'xact_id';
This stream is then piped back to psql, connected to the same database.
Inserting one row into t looks like this when it's converted to JSON:
{
"action": "B",
"xid": 780,
"timestamp": "2024-05-18 20:03:23.794045+02"
}
{
"action": "I",
"xid": 780,
"timestamp": "2024-05-18 20:03:23.794045+02",
"schema": "public",
"table": "t",
"columns": [
{
"name": "i",
"type": "integer",
"value": 4
}
]
}
{
"action": "C",
"xid": 780,
"timestamp": "2024-05-18 20:03:23.794045+02"
}
There are 3 JSON objects. The first represents the BEGIN operation, the second is the actual insert, and the last one the COMMIT.
We use jq's select() function to ignore all non-commit operations. Only COMMIT operations are transformed into SQL statements.
In order to make this work, we first need to create a logical replication slot. The name w2j is used later when pg_recvlogical is invoked. The wal2json parameter specifies which output plugin to use.
SELECT * FROM pg_create_logical_replication_slot('w2j', 'wal2json');
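To verify, the new slot should now show up in pg_replication_slots (and can later be removed with pg_drop_replication_slot('w2j')):
psql master <<'PSQL'
SELECT slot_name, plugin, slot_type FROM pg_replication_slots;
PSQL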
Next, we run this in a separate terminal session:
$ASPG /usr/lib/postgresql/$V/bin/pg_recvlogical \
-d 'port='$master' dbname=postgres' \
-S w2j \
-f - \
-F 0 \
--start \
-o include-xids=1 \
-o include-timestamp=1 \
-o add-tables=public.t \
-o format-version=2 |
jq -r --unbuffered \
'select(.action == "C") | .xid | tostring |
"SELECT "+.+", pg_xact_status($$"+.+"$$), count(*) FROM t WHERE xmin="+.+";"' |
psql master -AtF $'\t'
As explained above, pg_recvlogical connects to the WAL stream on the master database. jq then converts COMMIT records, using the xid contained in them, into SQL queries which are fed back to psql.
If we now generate a stream of transactions, each inserting one row into t, many people would expect to see a stream of rows where count is 1.
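Such a stream is easy to produce, for instance with this little generator (our own sketch; the inserted value is arbitrary, and each INSERT runs as its own transaction):
for i in $(seq 1 10); do
    psql master -c 'INSERT INTO t VALUES (5)'
done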
Here is what happened on my system:
- With synchronous_commit = off or synchronous_commit = local I could see the expected behavior.
- With synchronous_commit = on and, of course, with synchronous_commit = remote_apply I got a constant count of 0. However, if I repeated the same query later for any of the transactions where I saw 0, I got the expected 1.
- With synchronous replication disabled (synchronous_standby_names = ''), I saw the expected 1s. This is equivalent to synchronous_commit = local.
Now, it would be foolish to assume this race condition can only happen when synchronous replication is used. You cannot rely on a transaction that is being committed becoming visible to other transactions at the moment its COMMIT record is written to the WAL.
The example above also shows a way to resolve the issue. We can use pg_xact_status(xid8) to inquire about the status of a transaction. If it is still in progress, then we need to wait a bit and repeat the query. I don't think there is a function that allows waiting for another transaction to finish.
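If polling is acceptable, that retry loop can be wrapped in a small helper. The following is only a sketch; the function name wait_for_xact and the 10 ms polling interval are our own choices:
psql master <<'PSQL'
CREATE FUNCTION wait_for_xact(x xid8) RETURNS text
LANGUAGE plpgsql AS $$
DECLARE
    s text;
BEGIN
    LOOP
        s := pg_xact_status(x);
        -- NULL means the xid is too far in the past; stop in that case, too
        EXIT WHEN s IS DISTINCT FROM 'in progress';
        PERFORM pg_sleep(.01);
    END LOOP;
    RETURN s;
END
$$;
PSQL
The jq pipeline above could then call wait_for_xact() right before counting instead of repeating the query by hand.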
An application must not assume any order in which transactions become visible on the master and its replicas.
[1] now() returns the start time of the transaction.
[2] clock_timestamp() returns the current clock time.
[3] Subject to commit_delay.