Chasing a "Connection Refused" Through Three Wrong Theories

For weeks, the same two lines kept showing up in our FreeSWITCH logs:

[ERR] mod_event_socket.c:486 Socket Error: Connection refused
[ERR] mod_event_socket.c:490 Socket Error!

Every time they appeared, a call had just dropped. Not a clean hangup but a mid-IVR cut, the kind where a caller is halfway through a prompt and the line goes quiet. Our outbound ESL handler, a Go service that FreeSWITCH dials into for call control, was supposed to be answering those sockets. Most of the time it did. Some of the time it didn't, which was rude to the callers.

We had been working around this for a long time — restart scripts, watchdogs, retries on the FreeSWITCH side. None of it actually fixed anything; it just smoothed over the symptom enough that we could ship. Today I sat down to actually fix it. This is the trail.

First Wrong Theory: "The Go Server's ESL Handshake Is Broken"

The first thing I did was the first thing I shouldn't have done: I wrote a Python script to talk to the Go server directly, expecting it to behave like an ESL server I'd seen before.

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("127.0.0.1", 9090))
banner = s.recv(4096)
print(banner)
if b"Content-Type: auth/request" in banner:
    print("Proper ESL handshake")
else:
    print("NOT proper ESL handshake")

The script connected, read a few bytes, and printed NOT proper ESL handshake. The first bytes coming off the socket were connect\r\n\r\n, not Content-Type: auth/request. To my eyes at the time, that looked like a smoking gun.

It wasn't. It was me forgetting which side of the protocol we were on.

There are two ESL modes. In inbound ESL, the application opens a TCP connection to FreeSWITCH on port 8021, FreeSWITCH greets you with Content-Type: auth/request, you send your password, and you start subscribing to events. In outbound ESL, FreeSWITCH is the client. The dialplan tells it to socket out to your application, your application is the server, and the first thing it sends across the new connection is connect\r\n\r\n — that's the cue for FreeSWITCH to reply with the channel data for the call that triggered the connection.

Our setup is outbound. The Go server sending connect\r\n\r\n first is correct. My Python test was written for the wrong mode and was lying to me.
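
A probe that matches the outbound role has to play FreeSWITCH's part: connect, then wait for the server to speak first. Here is a minimal sketch of what the test should have looked like. The function names and the defaults are illustrative, not part of our tooling, and the check only looks at the leading bytes rather than implementing ESL.

```python
import socket


def classify_banner(banner: bytes) -> str:
    """Guess which ESL role the peer is playing from its first bytes."""
    if banner.startswith(b"connect"):
        # Outbound mode: our application is the server and speaks first.
        return "outbound-server"
    if b"Content-Type: auth/request" in banner:
        # Inbound mode: FreeSWITCH greets the client and asks for auth.
        return "inbound-freeswitch"
    return "unknown"


def probe(host: str = "127.0.0.1", port: int = 9090, timeout: float = 2.0) -> str:
    """Connect the way FreeSWITCH would and classify whatever arrives first."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        return classify_banner(s.recv(4096))
```

Run against our Go server, probe() should report "outbound-server", which is the healthy answer for this setup.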

I won't pretend I caught this quickly. I spent an embarrassing number of hours reviewing the eslgo source, comparing it to other Go ESL libraries, and quietly drafting an issue I never filed. The energy of being one tweak away from a fix is hard to step back from. Eventually I deleted the Python script and accepted that the handshake was fine.

Second Wrong Theory: "Dialplan Misconfiguration"

Connection refused from mod_event_socket almost always means exactly what it says: there was nothing listening at the host and port the dialplan tried to reach. The usual suspects are a port typo, a wrong container hostname, an IPv6 vs IPv4 mismatch, or FreeSWITCH and the Go service living in different network namespaces. This is the boring, well-trodden cause.

So we walked through it:

ss -tlnp | grep 9090
# LISTEN 0 4096 *:9090 *:* users:(("our-binary",pid=...,fd=8))

The Go server was bound to the wildcard address on port 9090 (all interfaces, dual-stack, which ss prints as *:9090). The dialplan's socket action pointed at the same host and port. FreeSWITCH and the Go service were running on the same host, with no Kubernetes and no overlay network in between. A plain python -c "import socket; socket.create_connection(('127.0.0.1', 9090))" from the same machine succeeded immediately.

If Connection refused had been the constant state of the world, this is where we'd be looking. But it wasn't constant — most calls went through fine, and the error showed up unpredictably. A real misconfiguration would fail every call, not 1-in-N. Ruled out.
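
One way to put a number on "1-in-N" is to probe the port repeatedly and count refusals. This is a sketch rather than something we ran during the investigation. The function name and defaults are made up for illustration, and each attempt opens a real TCP connection that the handler will see.

```python
import socket
import time


def count_refusals(host: str, port: int, attempts: int = 100, interval: float = 0.0) -> int:
    """Try to connect `attempts` times and count how many are refused."""
    refused = 0
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=1.0):
                pass  # connected; close again immediately
        except ConnectionRefusedError:
            refused += 1
        time.sleep(interval)
    return refused
```

A static misconfiguration would give refused == attempts. A listener that comes and goes gives something in between, and only if the probe happens to run across one of the gaps.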

Third Wrong Theory: "Kernel Backlog Overflow"

The next theory was prettier: maybe the Go server's accept loop was momentarily slow, the kernel's SYN/accept backlog filled up, and incoming SYNs were getting dropped or reset. That's the kind of thing that would happen unpredictably under load.

sysctl net.ipv4.tcp_abort_on_overflow   # 0  (default — drops, not RST)
sysctl net.core.somaxconn               # 4096
nstat -az | grep -i listen
# TcpExtListenDrops              0
# TcpExtListenOverflows          0

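Zero overflows since the last counter reset, with a 4096-deep accept queue. Even if we were briefly slow, we weren't overflowing. And with tcp_abort_on_overflow at 0, an overflowing queue silently drops the connection attempt, so FreeSWITCH would have seen a timeout rather than a refusal. Ruled out.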
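
The same counters are exposed in /proc/net/netstat, which is handy if you want a watchdog to check them without shelling out. The parser below is a sketch that assumes the usual layout of that file: pairs of lines, a header of names followed by a line of values. The function names are mine.

```python
def parse_netstat(text: str) -> dict:
    """Parse /proc/net/netstat-style text into {"TcpExtListenDrops": 0, ...}."""
    counters = {}
    lines = text.strip().splitlines()
    # Lines come in pairs: "Prefix: name name ..." then "Prefix: value value ...".
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        _, nums = values.split(":", 1)
        for name, num in zip(names.split(), nums.split()):
            counters[prefix + name] = int(num)
    return counters


def listen_queue_pressure(text: str) -> tuple:
    """Return (overflows, drops) for listening sockets, defaulting to 0."""
    c = parse_netstat(text)
    return c.get("TcpExtListenOverflows", 0), c.get("TcpExtListenDrops", 0)
```

On a live host you would pass in the contents of /proc/net/netstat. A result of (0, 0) matches what nstat showed us.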

At this point I had eliminated the three theories that sounded most like the error message. The error message was lying about what kind of problem it was.

The Real Clue — Log Rotation Cadence

The thing that broke the case open had nothing to do with networking. It was the log directory.

Our application's log files were rotating every 15 to 30 minutes. We use size-based rotation, so under heavy load that's not impossible — but the cadence felt off, and rotations were happening even when call volume was low. The other thing that can rotate a log on a short cadence is a process restart: some setups start a fresh file every time the binary boots.

ps -o pid,etime,cmd -p $(pgrep -f our-binary)
#   PID     ELAPSED CMD
# 31482       00:23 /opt/our-app/bin/our-binary --config ...

23 seconds. That was the moment everything clicked. The process I had been treating as "the Go server" was a brand new one. Whatever had been listening on 9090 a minute ago was already gone. systemd was restarting it on a tight loop, and "Connection refused" was what FreeSWITCH saw whenever a call landed during the gap between the old process dying and the new one binding the port.

The Go server wasn't broken. The Go server was dying.
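
If you want a health check to catch this automatically, the ELAPSED column is easy to turn into seconds. This is a small sketch that assumes ps's [[DD-]hh:]mm:ss format. The 300-second threshold is an arbitrary illustration, not a value we tuned.

```python
def etime_to_seconds(etime: str) -> int:
    """Convert ps's ELAPSED format, [[DD-]hh:]mm:ss, to seconds."""
    days = 0
    etime = etime.strip()
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad the missing hours field
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def looks_freshly_restarted(etime: str, min_uptime_s: int = 300) -> bool:
    """True when the process is younger than the (assumed) threshold."""
    return etime_to_seconds(etime) < min_uptime_s
```

For the output above, etime_to_seconds("00:23") is 23, which is far too young for a service that is supposed to run for days.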

Hidden in Plain Sight — The systemd Service File

The next obvious move was to look at journalctl for the crash:

journalctl -u our-service --since "1 hour ago"

Plenty of Started ..., plenty of Main process exited, code=exited, status=2/INVALIDARGUMENT, restart counter sitting at 67. But no stack trace. No panic. Just exit codes.

Go panics go to stderr. Our service unit had this line:

[Service]
StandardError=append:/var/log/w-ivr-fs-engine/error.log

That redirect peels stderr off of systemd's journal and writes it directly to a file. journalctl will never see a Go panic from this service. The stack trace had been in the error log the whole time, hiding behind a config line nobody had looked at in months.

grep -A 20 "^panic:" /var/log/w-ivr-fs-engine/error.log | tail -40

There it was.

The Bug — `panic: send on closed channel`

panic: send on closed channel

goroutine 299 [running]:
github.com/percipia/eslgo.(*Conn).receiveLoop(0x3908e4064180)
        .../eslgo@v1.5.0/connection.go:299 +0x347
created by github.com/percipia/eslgo.newConnection in goroutine 1
        .../eslgo@v1.5.0/connection.go:92 +0x5ed

The crash is in eslgo, the FreeSWITCH ESL client library we use. The shape of the bug is a classic Go channel race:

1. receiveLoop sits in a for loop reading messages off the TCP connection.
2. When FreeSWITCH closes its side, doMessage returns EOF.
3. The EOF branch tries to be polite about it: it notifies any registered listener on a TypeDisconnect channel by sending into it.
4. That send is not protected by the response-channel mutex.
5. Concurrently, the connection's close() path acquires the write lock on responseChannels, walks the map, closes every channel, and deletes the entries.
6. If receiveLoop reads the TypeDisconnect channel pointer out of the map before close() closes it, and then performs the send after close() has closed it, the send hits a closed channel and Go panics. Process dies.

Single panic, single process exit. No recovery — the panic is in a goroutine the library spawned, and there's no recover() in the path.
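
The ordering is easier to see when replayed deterministically. The following is a toy model in Python, not eslgo code: Chan imitates only the one Go channel rule that matters here (sending after close is fatal), and replay() runs the two possible orderings by hand instead of with real goroutines. All names are invented for the illustration.

```python
class ClosedChannelError(Exception):
    """Stands in for Go's 'send on closed channel' panic."""


class Chan:
    """A tiny model of a Go channel: sending after close is an error."""

    def __init__(self):
        self.closed = False
        self.items = []

    def send(self, item):
        if self.closed:
            raise ClosedChannelError("send on closed channel")
        self.items.append(item)

    def close(self):
        self.closed = True


def replay(send_before_close: bool) -> str:
    response_channels = {"disconnect": Chan()}
    ch = response_channels["disconnect"]  # receive loop: unlocked lookup

    def close_all():
        # Close path: close every channel, then delete the entries.
        for key in list(response_channels):
            response_channels[key].close()
            del response_channels[key]

    try:
        if send_before_close:
            ch.send("disconnect notice")  # benign ordering
            close_all()
        else:
            close_all()
            ch.send("disconnect notice")  # stale reference, already closed
    except ClosedChannelError as exc:
        return str(exc)
    return "ok"
```

replay(True) returns "ok" and replay(False) returns "send on closed channel". In the real process the second outcome is not a return value but the end of the process.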

Why Our Dialplan Amplified It

Our extension for the IVR entry point looks like this:

<action application="socket" data="${outbound_esl_host}:${outbound_esl_port} async full"/>
<action application="hangup" data="NORMAL_CLEARING"/>

socket immediately followed by hangup means FreeSWITCH closes its side of the TCP connection moments after the handler engages. Every call drives the EOF path in receiveLoop. At our steady-state call volume, the race window is exercised thousands of times an hour. That's plenty.

This explained the unpredictability too: the race only fires when the close and the EOF-send interleave in the wrong order. Most calls hit a benign ordering. A few don't, and one of those is enough to take the process down.

Finding the Bug Upstream

Before sharpening a fix, I checked the eslgo issue tracker. Sitting at the top: issue #46, filed five days before we started this investigation, with the same stack trace and a clear root-cause writeup. PR #47 proposes the fix. Credit to naufalandika for both — we'd have eventually gotten there ourselves, but they got there first and saved us a meaningful amount of time.

We also looked at older eslgo versions to see if downgrading was a quick out. It isn't — the same pattern lives in the v1 branch all the way back. There's no "safe version" to pin to.

The Fix

The fix mirrors the lock pattern that doMessage() already uses on the same map. Wrap the disconnect send in a read lock, look the channel up safely, and use a select with a default so a slow listener can't block the receive loop:

c.responseChanMutex.RLock()
if disconnectCh, ok := c.responseChannels[TypeDisconnect]; ok {
    select {
    case disconnectCh <- &RawResponse{ /* ... */ }:
    default:
    }
}
c.responseChanMutex.RUnlock()

Symmetric with the rest of the file, no new primitives, no behavior change in the happy path.
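
To show why the pattern closes the window, here is the same idea modeled in Python. It is an analogy under stated assumptions, not a port of the patch: a plain threading.Lock stands in for Go's RWMutex (the standard library has no read-write lock), a bounded queue stands in for the channel, and put_nowait plays the role of select with a default branch. The class and method names are hypothetical.

```python
import queue
import threading


class Conn:
    """Python model of the guarded notify; not eslgo's actual code."""

    def __init__(self):
        self._lock = threading.Lock()
        self._listeners = {"disconnect": queue.Queue(maxsize=1)}

    def notify_disconnect(self, msg) -> bool:
        """Look up and send under the lock; never block on a slow listener."""
        with self._lock:
            ch = self._listeners.get("disconnect")
            if ch is None:
                return False  # close() already removed it: nothing to notify
            try:
                ch.put_nowait(msg)  # non-blocking send
                return True
            except queue.Full:
                return False  # listener is not keeping up; drop the notice

    def close(self):
        """Remove every listener while holding the same lock."""
        with self._lock:
            self._listeners.clear()
```

Because the lookup and the send happen while the lock is held, close() can run entirely before them or entirely after them, but never in between.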

Deploying the Workaround

We're not waiting for the upstream merge to land before stabilising production. The go.mod replace directive is the right tool here:

# Clone and patch a local copy of eslgo at the version we depend on
git clone https://github.com/percipia/eslgo.git ~/eslgo-patched
cd ~/eslgo-patched && git checkout v1.5.0
# Apply the change from PR #47

# Wire the local copy into our project
cd ~/our-project
go mod edit -replace github.com/percipia/eslgo=/home/app/eslgo-patched
go mod tidy
go build -o bin/our-binary ./cmd/server

# Restart the service cleanly
sudo systemctl reset-failed our-service
sudo systemctl restart our-service

The honest tradeoff: a replace pointing at a local path works on this host and nowhere else. A fresh clone of our repo on a new machine won't build until that directory exists. We're holding here until PR #47 merges upstream, at which point we'll bump the dependency and drop the replace.

What "Working" Looked Like

After the restart:

ps -o etime= -p $(pgrep -f our-binary) kept climbing past 23 seconds, then past minutes, then past hours.
journalctl -u our-service showed one Started line and no further Main process exited.
The mod_event_socket.c:486 Socket Error: Connection refused lines stopped showing up in the FreeSWITCH log.
The error log still shows Connection closed, stopping receive loop constantly — that's the patched EOF path firing for every completed call, now cleanly.

The last point is worth pausing on. The log line that looked like an error is actually the patch working as intended. The bug wasn't that the EOF path existed; it was that the EOF path raced.

Lessons

"Process is still listening" doesn't mean "process is healthy." Check the systemd restart counter and ps -o etime before you trust uptime.
StandardError=append:... in a service file hides Go panics from journalctl. Either drop the redirect or grep the file directly when a Go service starts misbehaving.
Inbound and outbound ESL are not symmetric protocols. A test script written for one mode will misdiagnose the other. Match your probe to the role your service plays.
When a bug feels new, check the upstream issue tracker first. Issue #46 was five days old when we hit it. That's a cheap search.
A replace directive in go.mod is a fine emergency patch path. Just leave a TODO so you remember to remove it when upstream lands.

We're still on the local patch for now, watching PR #47 for the merge. If you've seen the same panic: send on closed channel from eslgo — or any of the symptom set above — upvote the issue and add your stack trace. The more reports, the sooner it lands.
