Home:ALL Converter>pg_create_logical_replication_slot hanging indefinitely due to old walsender process

pg_create_logical_replication_slot hanging indefinitely due to old walsender process

Ask Time:2019-01-17T23:24:44         Author:JosMac

Json Formatter

I am testing logical replication between 2 PostgreSQL 11 databases for use on our production (I was able to set it thanks to this answer - PostgreSQL logical replication - create subscription hangs) and it worked well.

Now I am testing scripts and procedure which would set it automatically on production databases but I am facing strange problem with logical replication slots.

I had to restart logical replica due to some changes in setting requiring restart - which of course could happen on replicas also in the future. But logical replication slot on master did not disconnect and it is still active for certain PID.

I dropped subscription on master (I am still only testing) and tried to repeat the whole process with new logical replication slot but I am facing strange situation.

I cannot create new logical replication slot with the new name. Process running on the old logical replication slot is still active and showing wait_event_type=Lock and wait_event=transaction.

When I try to use pg_create_logical_replication_slot to create new logical replication slot I get similar situation. New slot is created - I see it in pg_catalog but it is marked as active for the PID of the session which issued this command and command hangs indefinitely. When I check processes I can see this command active with same waiting values Lock/transaction.

I tried to activate parameter "lock_timeout" in postgresql.conf and reload configuration but it did not help.

Killing that old hanging process will most likely bring down the whole postgres because it is "walsender" process. It is visible in processes list still with IP of replica with status "idle wating".

I tried to find some parameter(s) which could help me to force postgres to stop this walsender. But settings wal_keep_segments or wal_sender_timeout did not change anything. I even tried to stop replica for longer time - no effect.

Is there some way to do something with this situation without restarting the whole postgres? Like forcing timeout for walsender or lock for transaction etc...

Because if something like this happens on production I would not be able to use restart or any other "brute force". Thanks...

UPDATE: "Walsender" process "died out" after some time but log does not show anything about it so I do not know when exactly it happened. I can only guess it depends on tcp_keepalives_* parameters. Default on Debian 9 is 2 hours to keep idle process. So I tried to set these parameters in postgresql.conf and will see in following tests.

Author:JosMac,eproduced under the CC 4.0 BY-SA copyright license with a link to the original source and this disclaimer.
Link to original article:https://stackoverflow.com/questions/54239082/pg-create-logical-replication-slot-hanging-indefinitely-due-to-old-walsender-pro
yy