Troubleshoot PostgreSQL: ‘An existing connection was forcibly closed by the remote host’

Published Oct 21 2019 03:38 PM 23.5K Views
Microsoft

Application logs may show intermittent connection errors such as “An I/O error occurred while sending to the backend or other error messages that indicate a timeout, write failure or a pipe error. Very shortly after the PostgreSQL logs show an error like “could not receive data from client: An existing connection was forcibly closed by the remote host”. These two events are interrelated and are frequently associated with client-side connection handling issues.

 

These types of errors can be tricky to troubleshoot as it involves different components. In this blog post, I will discuss what can lead to this error and best practices to avoid it.

 

Stale connections

Let’s say a client application tries to execute a query using a previously opened connection object retrieved from a connection pool. When the application attempts to use the connection object, the connection has gone “stale” and the client application throws an exception from attempting to send query data over the “open” connection.  The client-side logic catches the exception and the TCP connection is closed. The PostgreSQL backend detects that the client-side connection was closed and reports the dropped connection as Windows error WSAECONNRESET (10054) Connection reset by peer: An existing connection was forcibly closed by the remote host. The ‘remote host’ here is the client, which is separate from the Postgres server. 

 

Curious to know more?
The only way to understand exactly what is causing the issue is to capture network trace at the client-side at the time when the issue occurrs. Network Traces can be captured by different applications such as NetMon, WireShark, Fiddler, etc.

If you are using a Linux client and the tools mentioned above don't work, you can manually generate network dump as in the following command:

tshark -i any -n -b filesize:204800 -w `date +%y%m%d-%H:%M:%S`.pcap  -b files:1000

or

tcpdump -i any -w `date +%y%m%d-%H:%M:%S`.pcap -G 300  -W 1000

Note: If you enable network capture, monitor your disk space and purge as needed to ensure you don’t run out of space.

 

 

Solution

  • Use Pgbouncer:
    pgBouncer is a connection pooler that sits between your application and the database. When it needs a new connection to the database it grabs one, and then continues to re-use it. After a certain period of time, it releases that connection. What this means is when your application grabs a connection to the database and doesn’t use it, it’s not actually passed on and consumed as an idle connection. From: Not all Postgres connection pooling is equal.

  • Implement retry logic to handle transient errors
    Implement retry logic as a best practice for designing and developing applications in the cloud as transient errors could occur. In this case, a re-attempt of the client query can find an active connection to use. Read more here: Handling transient connectivity errors

 

  • Minimize idle connections’ impact on the database
    Managing connections is a topic comes up often in conversations with PostgreSQL users. The connections in Postgres aren’t free. Each connection, whether idle or active, consumes a certain overhead of memory (about 10MB per connection). Idle is something that grabs a connection from your application and holds it. Application connection poolers often also consume one or more idle connections. For more information, see: Connection handling best practices with PostgreSQL.

    configure statement_timeout and idle_in_transaction_session_timeout properly, see
    Tracking and Managing Your Postgres Connections

  • Send keep-alive signals from your application in your connection to avoid idle sessions
    This trick can make sure pooled connections will not idle out in scenarios when the connection is not active for a certain time.

  • Check application resource utilization
    High resource pressure (high CPU, high IOPS, high context switching) at the client-side might cause slowness. This slowness can cause connections to be open too long and eventually drop them.

 

 

 

Richard Bartel
Senior Software Engineer - Azure Database for PostgreSQL

Bashar Hussein
Embedded Escalation Engineer - Azure OSS DB

1 Comment
Occasional Visitor

I start getting below errors, when the number of concurrent connections to azure postgresql increased to around 100, though SSL is currently disabled in azure postgres. It works fine when the number of concurrent connections are below 60, please can you suggest what could be the cause?

 

org.postgresql.util.PSQLException: SSL error: Remote host closed connection during handshake
at org.postgresql.ssl.MakeSSL.convert(MakeSSL.java:42)
at org.postgresql.core.v3.ConnectionFactoryImpl.enableSSL(ConnectionFactoryImpl.java:435)
at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:94)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:192)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:195)
at org.postgresql.Driver.makeConnection(Driver.java:454)
at org.postgresql.Driver.connect(Driver.java:256)
at java.sql.DriverManager.getConnection(DriverManager.java:675)
at java.sql.DriverManager.getConnection(DriverManager.java:258)
at tcl_datalake_project_dataquality.tjob_dq_fullload_child_splitfile_perf_0_1.TJob_DQ_fullload_child_splitfile_perf.tDBInput_1Process(TJob_DQ_fullload_child_splitfile_perf.java:6134)
at tcl_datalake_project_dataquality.tjob_dq_fullload_child_splitfile_perf_0_1.TJob_DQ_fullload_child_splitfile_perf.tJava_3Process(TJob_DQ_fullload_child_splitfile_perf.java:21074)
at tcl_datalake_project_dataquality.tjob_dq_fullload_child_splitfile_perf_0_1.TJob_DQ_fullload_child_splitfile_perf.tDBInput_4Process(TJob_DQ_fullload_child_splitfile_perf.java:21778)
at tcl_datalake_project_dataquality.tjob_dq_fullload_child_splitfile_perf_0_1.TJob_DQ_fullload_child_splitfile_perf$8.run(TJob_DQ_fullload_child_splitfile_perf.java:24995)
Caused by: javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake  

%3CLINGO-SUB%20id%3D%22lingo-sub-925164%22%20slang%3D%22en-US%22%3ETroubleshoot%20PostgreSQL%3A%20%E2%80%98An%20existing%20connection%20was%20forcibly%20closed%20by%20the%20remote%20host%E2%80%99%3C%2FLINGO-SUB%3E%3CLINGO-BODY%20id%3D%22lingo-body-925164%22%20slang%3D%22en-US%22%3E%3CP%3EApplication%20logs%20may%20show%20intermittent%20connection%20errors%20such%20as%20%E2%80%9C%3CSTRONG%3E%3CEM%3EAn%20I%2FO%20error%20occurred%20while%20sending%20to%20the%20backend%3C%2FEM%3E%3C%2FSTRONG%3E%3CEM%3E%E2%80%9D%3C%2FEM%3E%20or%20other%20error%20messages%20that%20indicate%20a%20timeout%2C%20write%20failure%20or%20a%20pipe%20error.%20Very%20shortly%20after%20the%20PostgreSQL%20logs%20show%20an%20error%20like%20%E2%80%9C%3CSTRONG%3E%3CEM%3Ecould%20not%20receive%20data%20from%20client%3A%20An%20existing%20connection%20was%20forcibly%20closed%20by%20the%20remote%20host%3C%2FEM%3E%3C%2FSTRONG%3E%3CEM%3E%E2%80%9D.%26nbsp%3B%3C%2FEM%3EThese%20two%20events%20are%20interrelated%20and%20are%20frequently%20associated%20with%20client-side%20connection%20handling%20issues.%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3EThese%20types%20of%20errors%20can%20be%20tricky%20to%20troubleshoot%20as%20it%20involves%20different%20components.%20In%20this%20blog%20post%2C%20I%20will%20discuss%20what%20can%20lead%20to%20this%20error%20and%20best%20practices%20to%20avoid%20it.%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3E%3CFONT%20size%3D%224%22%3E%3CU%3E%3CSTRONG%3EStale%20connections%3C%2FSTRONG%3E%3C%2FU%3E%3C%2FFONT%3E%3C%2FP%3E%0A%3CP%3ELet%E2%80%99s%20say%20a%20client%20application%20tries%20to%20execute%20a%20query%20using%20a%20previously%20opened%20connection%20object%20retrieved%20from%20a%20connection%20pool.%20When%20the%20application%20attempts%20to%20use%20the%20connection%20object%2C%20the%20connection%20has%20gone%20%E2%80%9Cstale%E2%80%9D%20and%20the%20client%20application%20throws%20an%20exception%20from%20attempting%20to%20send%20query%20data%20over%20the%20%E2%80%9Copen%E2%80%9D%20connection.%20%26nbsp%3BThe%20client-side%20logic%20catches%20the%20exception%20and%20the%20TCP%20connection%20is%20closed.%26nbsp%3BThe%20PostgreSQL%20backend%20detects%20that%20the%20client-side%20connection%20was%20closed%20and%20reports%20the%20dropped%20connection%20as%20Windows%20error%20WSAECONNRESET%20(10054)%20Connection%20reset%20by%20peer%3A%20%3CSTRONG%3E%3CEM%3EAn%20existing%20connection%20was%20forcibly%20closed%20by%20the%20remote%20host.%3C%2FEM%3E%3C%2FSTRONG%3E%26nbsp%3BThe%20%E2%80%98remote%20host%E2%80%99%20here%20is%20the%20client%2C%20which%20is%20separate%20from%20the%20Postgres%20server.%26nbsp%3B%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3E%3CFONT%20size%3D%224%22%3E%3CU%3E%3CSTRONG%3ECurious%20to%20know%20more%3F%3CBR%20%2F%3E%3C%2FSTRONG%3E%3C%2FU%3E%3C%2FFONT%3EThe%20only%20way%20to%20understand%20exactly%20what%20is%20causing%20the%20issue%20is%20to%20capture%20network%20trace%20at%20the%20client-side%20at%20the%20time%20when%20the%20issue%20occurrs.%20Network%20Traces%20can%20be%20captured%20by%20different%20applications%20such%20as%20%3CA%20href%3D%22https%3A%2F%2Fwww.microsoft.com%2Fen-US%2Fdownload%2Fdetails.aspx%3Fid%3D4865%22%20target%3D%22_blank%22%20rel%3D%22noopener%20noreferrer%22%3ENetMon%3C%2FA%3E%2C%20%3CA%20href%3D%22https%3A%2F%2Fwww.wireshark.org%2F%22%20target%3D%22_blank%22%20rel%3D%22noopener%20nofollow%20noreferrer%22%3EWireShark%3C%2FA%3E%2C%20%3CA%20href%3D%22https%3A%2F%2Fwww.telerik.com%2Fdownload%2Ffiddler%22%20target%3D%22_blank%22%20rel%3D%22noopener%20nofollow%20noreferrer%22%3EFiddler%3C%2FA%3E%2C%20etc.%20%3CBR%20%2F%3E%3CBR%20%2F%3E%3C%2FP%3E%0A%3CP%3EIf%20you%20are%20using%20a%20Linux%20client%20and%20the%20tools%20mentioned%20above%20don't%20work%2C%20you%20can%20manually%20generate%20network%20dump%20as%20in%20the%20following%20command%3A%20%3CBR%20%2F%3E%3CBR%20%2F%3E%3C%2FP%3E%0A%3CP%3E%3CFONT%20size%3D%222%22%3E%3CSTRONG%3E%3CEM%3Etshark%20-i%20any%20-n%20-b%20filesize%3A204800%20-w%20%60date%20%2B%25y%25m%25d-%25H%3A%25M%3A%25S%60.pcap%26nbsp%3B%20-b%20files%3A1000%20%3C%2FEM%3E%3C%2FSTRONG%3E%3C%2FFONT%3E%3C%2FP%3E%0A%3CP%3E%3CFONT%20size%3D%222%22%3E%3CSTRONG%3Eor%3C%2FSTRONG%3E%3C%2FFONT%3E%3C%2FP%3E%0A%3CP%3E%3CSTRONG%3E%3CFONT%20size%3D%222%22%3E%3CEM%3Etcpdump%20-i%20any%20-w%20%60date%20%2B%25y%25m%25d-%25H%3A%25M%3A%25S%60.pcap%20-G%20300%26nbsp%3B%20-W%201000%3C%2FEM%3E%3C%2FFONT%3E%3CEM%3E%3CBR%20%2F%3E%3CBR%20%2F%3E%3C%2FEM%3E%3C%2FSTRONG%3E%3C%2FP%3E%0A%3CP%3ENote%3A%20If%20you%20enable%20network%20capture%2C%20monitor%20your%20disk%20space%20and%20purge%20as%20needed%20to%20ensure%20you%20don%E2%80%99t%20run%20out%20of%20space.%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3E%3CFONT%20size%3D%224%22%3E%3CU%3E%3CSTRONG%3ESolution%3C%2FSTRONG%3E%3C%2FU%3E%3C%2FFONT%3E%3C%2FP%3E%0A%3CUL%3E%0A%3CLI%3E%3CSTRONG%3EUse%20Pgbouncer%3A%20%3CBR%20%2F%3E%3C%2FSTRONG%3E%3CA%20href%3D%22https%3A%2F%2Ftechcommunity.microsoft.com%2Ft5%2FAzure-Database-for-PostgreSQL%2FSteps-to-install-and-setup-PgBouncer-connection-pooling-proxy%2Fba-p%2F730555%22%20target%3D%22_blank%22%3EpgBouncer%3C%2FA%3E%20is%20a%20connection%20pooler%20that%20sits%20between%20your%20application%20and%20the%20database.%20When%20it%20needs%20a%20new%20connection%20to%20the%20database%20it%20grabs%20one%2C%20and%20then%20continues%20to%20re-use%20it.%20After%20a%20certain%20period%20of%20time%2C%20it%20releases%20that%20connection.%20What%20this%20means%20is%20when%20your%20application%20grabs%20a%20connection%20to%20the%20database%20and%20doesn%E2%80%99t%20use%20it%2C%20it%E2%80%99s%20not%20actually%20passed%20on%20and%20consumed%20as%20an%20idle%20connection.%20From%3A%20%3CA%20href%3D%22https%3A%2F%2Ftechcommunity.microsoft.com%2Ft5%2FAzure-Database-for-PostgreSQL%2FNot-all-Postgres-connection-pooling-is-equal%2Fba-p%2F825717%22%20target%3D%22_blank%22%3ENot%20all%20Postgres%20connection%20pooling%20is%20equal%3C%2FA%3E%3CSPAN%3E.%3C%2FSPAN%3E%3CBR%20%2F%3E%3CBR%20%2F%3E%3C%2FLI%3E%0A%3CLI%3E%3CSTRONG%3EImplement%20retry%20logic%20to%20handle%20transient%20errors%3C%2FSTRONG%3E%3CBR%20%2F%3EImplement%20retry%20logic%20as%20a%20best%20practice%20for%20designing%20and%20developing%20applications%20in%20the%20cloud%20as%20transient%20errors%20could%20occur.%20In%20this%20case%2C%20a%20re-attempt%20of%20the%20client%20query%20can%20find%20an%20active%20connection%20to%20use.%20Read%20more%20here%3A%20%3CA%20href%3D%22https%3A%2F%2Fdocs.microsoft.com%2Fen-us%2Fazure%2Fpostgresql%2Fconcepts-connectivity-architecture%22%20target%3D%22_blank%22%20rel%3D%22noopener%20noreferrer%22%3EHandling%20transient%20connectivity%20errors%3C%2FA%3E%3C%2FLI%3E%0A%3C%2FUL%3E%0A%3CP%3E%3CSTRONG%3E%26nbsp%3B%3C%2FSTRONG%3E%3C%2FP%3E%0A%3CUL%3E%0A%3CLI%3E%3CSTRONG%3EMinimize%20idle%20connections%E2%80%99%20impact%20on%20the%20database%3C%2FSTRONG%3E%3CBR%20%2F%3EManaging%20connections%20is%20a%20topic%20comes%20up%20often%20in%20conversations%20with%20PostgreSQL%20users.%20The%20connections%20in%20Postgres%20aren%E2%80%99t%20free.%20Each%20connection%2C%20whether%20idle%20or%20active%2C%20consumes%20a%20certain%20overhead%20of%20memory%20(about%2010MB%20per%20connection).%20Idle%20is%20something%20that%20grabs%20a%20connection%20from%20your%20application%20and%20holds%20it.%20Application%20connection%20poolers%20often%20also%20consume%20one%20or%20more%20idle%20connections.%20For%20more%20information%2C%20see%3A%20%3CA%20href%3D%22https%3A%2F%2Ftechcommunity.microsoft.com%2Ft5%2FAzure-Database-for-PostgreSQL%2FConnection-handling-best-practice-with-PostgreSQL%2Fba-p%2F790883%22%20target%3D%22_blank%22%3EConnection%20handling%20best%20practices%20with%20PostgreSQL%3C%2FA%3E%3CSPAN%3E.%3C%2FSPAN%3E%20%3CBR%20%2F%3E%3CBR%20%2F%3Econfigure%20%3CSTRONG%3E%3CEM%3Estatement_timeout%3C%2FEM%3E%3C%2FSTRONG%3E%20and%20%3CSTRONG%3E%3CEM%3Eidle_in_transaction_session_timeout%20%3C%2FEM%3E%3C%2FSTRONG%3Eproperly%3CSTRONG%3E%3CEM%3E%2C%20%3C%2FEM%3E%3C%2FSTRONG%3Esee%3CSTRONG%3E%3CEM%3E%20%3CBR%20%2F%3E%3C%2FEM%3E%3C%2FSTRONG%3E%3CA%20href%3D%22http%3A%2F%2Fwww.craigkerstiens.com%2F2017%2F09%2F18%2Fpostgres-connection-management%2F%22%20target%3D%22_blank%22%20rel%3D%22noopener%20nofollow%20noreferrer%22%3ETracking%20and%20Managing%20Your%20Postgres%20Connections%3C%2FA%3E%3CBR%20%2F%3E%3CBR%20%2F%3E%3C%2FLI%3E%0A%3CLI%3E%3CSTRONG%3ESend%20keep-alive%20signals%20from%20your%20application%20in%20your%20connection%20to%20avoid%20idle%20sessions%3CBR%20%2F%3E%3C%2FSTRONG%3EThis%20trick%20can%20make%20sure%20pooled%20connections%20will%20not%20idle%20out%20in%20scenarios%20when%20the%20connection%20is%20not%20active%20for%20a%20certain%20time.%3CBR%20%2F%3E%3CBR%20%2F%3E%3C%2FLI%3E%0A%3CLI%3E%3CSTRONG%3ECheck%20application%20resource%20utilization%3CBR%20%2F%3E%3C%2FSTRONG%3EHigh%20resource%20pressure%20(high%20CPU%2C%20high%20IOPS%2C%20high%20context%20switching)%20at%20the%20client-side%20might%20cause%20slowness.%20This%20slowness%20can%20cause%20connections%20to%20be%20open%20too%20long%20and%20eventually%20drop%20them.%3C%2FLI%3E%0A%3C%2FUL%3E%0A%3CP%3E%3CSTRONG%3E%26nbsp%3B%3C%2FSTRONG%3E%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3E%3CSTRONG%3ERichard%20Bartel%3C%2FSTRONG%3E%3CBR%20%2F%3ESenior%20Software%20Engineer%20-%20Azure%20Database%20for%20PostgreSQL%3CBR%20%2F%3E%3CBR%20%2F%3E%3CSTRONG%3EBashar%20Hussein%3CBR%20%2F%3E%3C%2FSTRONG%3EEmbedded%20Escalation%20Engineer%20-%20Azure%20OSS%20DB%3C%2FP%3E%3C%2FLINGO-BODY%3E%3CLINGO-TEASER%20id%3D%22lingo-teaser-925164%22%20slang%3D%22en-US%22%3E%3CP%3E%3CSPAN%20class%3D%22lia-inline-image-display-wrapper%20lia-image-align-inline%22%20style%3D%22width%3A%20600px%3B%22%3E%3CIMG%20src%3D%22https%3A%2F%2Fgxcuf89792.i.lithium.com%2Ft5%2Fimage%2Fserverpage%2Fimage-id%2F139230i3DFD7E4F47C1CAE4%2Fimage-size%2Flarge%3Fv%3D1.0%26amp%3Bpx%3D999%22%20alt%3D%22pg.png%22%20title%3D%22pg.png%22%20%2F%3E%3C%2FSPAN%3E%3C%2FP%3E%3C%2FLINGO-TEASER%3E%3CLINGO-SUB%20id%3D%22lingo-sub-1413121%22%20slang%3D%22en-US%22%3ERe%3A%20Troubleshoot%20PostgreSQL%3A%20%E2%80%98An%20existing%20connection%20was%20forcibly%20closed%20by%20the%20remote%20host%E2%80%99%3C%2FLINGO-SUB%3E%3CLINGO-BODY%20id%3D%22lingo-body-1413121%22%20slang%3D%22en-US%22%3E%3CP%3EI%20start%20getting%20below%20errors%2C%20when%20the%20number%20of%20concurrent%20connections%20to%20azure%20postgresql%20increased%20to%20around%20100%2C%20though%20SSL%20is%20currently%20disabled%20in%20azure%20postgres.%20It%20works%20fine%20when%20the%20number%20of%20concurrent%20connections%20are%20below%2060%2C%20please%20can%20you%20suggest%20what%20could%20be%20the%20cause%3F%3C%2FP%3E%3CP%3E%26nbsp%3B%3C%2FP%3E%3CP%3Eorg.postgresql.util.PSQLException%3A%20SSL%20error%3A%20Remote%20host%20closed%20connection%20during%20handshake%3CBR%20%2F%3Eat%20org.postgresql.ssl.MakeSSL.convert(MakeSSL.java%3A42)%3CBR%20%2F%3Eat%20org.postgresql.core.v3.ConnectionFactoryImpl.enableSSL(ConnectionFactoryImpl.java%3A435)%3CBR%20%2F%3Eat%20org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java%3A94)%3CBR%20%2F%3Eat%20org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java%3A192)%3CBR%20%2F%3Eat%20org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java%3A49)%3CBR%20%2F%3Eat%20org.postgresql.jdbc.PgConnection.%3CINIT%3E(PgConnection.java%3A195)%3CBR%20%2F%3Eat%20org.postgresql.Driver.makeConnection(Driver.java%3A454)%3CBR%20%2F%3Eat%20org.postgresql.Driver.connect(Driver.java%3A256)%3CBR%20%2F%3Eat%20java.sql.DriverManager.getConnection(DriverManager.java%3A675)%3CBR%20%2F%3Eat%20java.sql.DriverManager.getConnection(DriverManager.java%3A258)%3CBR%20%2F%3Eat%20tcl_datalake_project_dataquality.tjob_dq_fullload_child_splitfile_perf_0_1.TJob_DQ_fullload_child_splitfile_perf.tDBInput_1Process(TJob_DQ_fullload_child_splitfile_perf.java%3A6134)%3CBR%20%2F%3Eat%20tcl_datalake_project_dataquality.tjob_dq_fullload_child_splitfile_perf_0_1.TJob_DQ_fullload_child_splitfile_perf.tJava_3Process(TJob_DQ_fullload_child_splitfile_perf.java%3A21074)%3CBR%20%2F%3Eat%20tcl_datalake_project_dataquality.tjob_dq_fullload_child_splitfile_perf_0_1.TJob_DQ_fullload_child_splitfile_perf.tDBInput_4Process(TJob_DQ_fullload_child_splitfile_perf.java%3A21778)%3CBR%20%2F%3Eat%20tcl_datalake_project_dataquality.tjob_dq_fullload_child_splitfile_perf_0_1.TJob_DQ_fullload_child_splitfile_perf%248.run(TJob_DQ_fullload_child_splitfile_perf.java%3A24995)%3CBR%20%2F%3ECaused%20by%3A%20javax.net.ssl.SSLHandshakeException%3A%20Remote%20host%20closed%20connection%20during%20handshake%26nbsp%3B%26nbsp%3B%3C%2FINIT%3E%3C%2FP%3E%3C%2FLINGO-BODY%3E
Version history
Last update:
‎Oct 25 2019 06:26 PM
Updated by: