Skip to content

[Repo Assist] fix: eliminate unsafe lock upgrade in SessionConnectionPool.Get#2311

Draft
github-actions[bot] wants to merge 1 commit intomainfrom
repo-assist/fix-connection-pool-lock-upgrade-e53fbcbfba4c0b87
Draft

[Repo Assist] fix: eliminate unsafe lock upgrade in SessionConnectionPool.Get#2311
github-actions[bot] wants to merge 1 commit intomainfrom
repo-assist/fix-connection-pool-lock-upgrade-e53fbcbfba4c0b87

Conversation

@github-actions
Copy link
Contributor

🤖 This PR was created by Repo Assist, an automated AI assistant.

Summary

Fixes an unsafe lock upgrade in SessionConnectionPool.Get that could return a resurrected closed connection under concurrent load.


Root Cause

The previous implementation held an RLock, released it, upgraded to a WLock to update metadata, then re-acquired RLock:

p.mu.RLock()
defer p.mu.RUnlock()
// ... nil check and state check ...
p.mu.RUnlock()   // drop read lock
p.mu.Lock()      // grab write lock — window opens here
metadata.LastUsedAt = time.Now()
metadata.RequestCount++
metadata.State = ConnectionStateActive  // ← overwrites cleanup's Closed state!
p.mu.Unlock()
p.mu.RLock()     // re-acquire (matches defer)

In the window between RUnlock and Lock, cleanupIdleConnections (or Delete) could:

  1. Acquire the write lock
  2. Mark metadata.State = ConnectionStateClosed
  3. Remove the key from the map

Then Get would re-acquire the write lock and overwrite State back to ConnectionStateActive, effectively resurrecting a closed connection and returning a stale *mcp.Connection to the caller.


Fix

Acquire a write lock from the start. Since Get always mutates metadata (LastUsedAt, RequestCount, State), a write lock is correct and simpler. The state check runs under the write lock, so no concurrent mutation can occur between the check and the update.

p.mu.Lock()
defer p.mu.Unlock()
// ... check, update, return — all atomic

This also reduces lock overhead from 4 operations (RUnlock + Lock + Unlock + RLock) to 2 (Lock + deferred Unlock).


New Test

TestConnectionPoolGetConcurrentDelete exercises concurrent Get + Delete/Set goroutines. Running with -race would have caught the data race with the previous implementation.


Test Status

Infrastructure note: the Go module proxy is blocked in this agent environment (network firewall). Module cache is empty so go test cannot compile. The fix has been reviewed for correctness; CI will run the full test suite including the new concurrency test.

Code review notes:

  • No behaviour change beyond fixing the race
  • Existing TestConnectionPoolConcurrency (1000 concurrent Gets, verifies RequestCount == 1000) also validates the fix

Generated by Repo Assist ·

To install this agentic workflow, run

gh aw add githubnext/agentics/workflows/repo-assist.md@851905c06e905bf362a9f6cc54f912e3df747d55

Warning

⚠️ Firewall blocked 2 domains

The following domains were blocked by the firewall during workflow execution:

  • proxy.golang.org
  • releaseassets.githubusercontent.com

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "proxy.golang.org"
    - "releaseassets.githubusercontent.com"

See Network Configuration for more information.

The previous Get() implementation held an RLock, released it, upgraded to
a WLock to update metadata, then downgraded back to RLock. This pattern
had two problems:

1. Between the RUnlock and Lock, cleanupIdleConnections (or Delete) could
   remove the connection and mark metadata.State = ConnectionStateClosed.
   Get then overwrote that back to ConnectionStateActive, effectively
   resurrecting a closed connection and returning a stale *mcp.Connection.

2. The four lock operations (RUnlock + Lock + Unlock + RLock) added
   unnecessary overhead on the critical path.

Fix: acquire a WLock from the start. Get always mutates metadata, so a
write lock is correct and simpler. The state is re-checked under the
write lock, ensuring only live connections are returned.

Also adds TestConnectionPoolGetConcurrentDelete, which exercises the race
window between concurrent Get and Delete goroutines. Running with -race
would have caught the previous data race.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants