Worker reconnect fails due to exception while attaching #4678

Open
YngveNPettersen opened this issue Mar 25, 2019 · 18 comments · May be fixed by #7198
@YngveNPettersen
Contributor

YngveNPettersen commented Mar 25, 2019

We have had a few instances of workers losing the network connection and then not being properly attached when they try to reconnect.

As far as I can tell, the connection is re-established, the master accepts it, but then some kind of exception happens while it is handling the registration of the worker.

Afterwards, the connection seems to be OK from the worker's side (the master does not disconnect, and apparently responds to keep-alives), but from the master's side, the connection is not registered as active, and it is not associated with any builders.

Fixing this situation requires either a manual stop/restart of the worker instance on each affected worker, or a stop/restart of the master instance.

Environment: Buildbot 2.0.1, TLS connection between worker and master.

The exception reported is as follows; there is no information about what the assert is about.

2019-03-01 11:24:05+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] worker 'arbeider' attaching from IPv4Address(type='TCP', host='1.2.3.4', port=51630)
2019-03-01 11:24:05+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] Got duplication connection from 'arbeider' starting arbitration procedure
2019-03-01 11:24:15+0000 [-] Connected worker 'arbeider' ping timed out after 10 seconds
2019-03-01 11:24:15+0000 [-] Old connection for 'arbeider' was lost, accepting new
2019-03-01 11:24:15+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] Got workerinfo from 'arbeider'
2019-03-01 11:24:15+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] worker arbeider cannot attach
        Traceback (most recent call last):
          File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
            _inlineCallbacks(None, g, status)
          File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "sandbox/lib/python3.6/site-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "sandbox/lib/python3.6/site-packages/buildbot/worker/base.py", line 638, in attached
            log.err(e, "worker %s cannot attach" % (self.name,))
        --- <exception caught here> ---
          File "sandbox/lib/python3.6/site-packages/buildbot/worker/base.py", line 636, in attached
            yield AbstractWorker.attached(self, bot)
        builtins.AssertionError:
@p12tic p12tic added the bug label Apr 25, 2019
@p12tic
Member

p12tic commented Apr 25, 2019

I think I have looked into similar issues on the latent worker side, but didn't realize that regular workers would be affected by network issues too. The problem here is that the master thinks the worker is still attached, so when a new connection is attempted it concludes that a duplicate worker has connected. At the very least we should drop the old worker connection after a timeout.

@YngveNPettersen
Contributor Author

Possibly related: we have seen a number of cases where jobs on some of our Windows testers are broken because the master detects an unexpected loss of connection to the worker. The problem appears to be that the worker still thinks it has a connection, and therefore does not reconnect or, as it is supposed to, reboot (these workers are supposed to run one test, then reboot on a graceful shutdown). We did see similar disconnects in the old 0.8.4-based Buildbot, but in those cases the reboot did happen. Occasionally, when these disconnects occur during a Git update step, they can break the Git checkout because an index.lock file remains afterwards (and this may cause worse problems later).

@YngveNPettersen
Contributor Author

Just had it happen again, and what I see is that the worker rebooted for some reason, and reported in the log that it had reconnected to the master.

About 15 minutes later the master reported losing the connection in "a non-clean fashion" and quit the job. Replies to keepalives are still "Everything is OK".

So it seems to be a similar type of incident.

@YngveNPettersen
Contributor Author

This issue continues to happen for us, and in some cases fixing it requires pausing all workers, waiting until they are idle, and then restarting the manager process.

Most recent cases were this week, involving several separate incidents typically caused by a short network disconnect between worker and manager.

@YngveNPettersen
Contributor Author

I have been having another look at this (it keeps happening fairly frequently) while working on the 2.10 upgrade, and my best guess is that the exception comes from the assert self.conn is None line at the beginning of AbstractWorker.attached().

This suggests to me that the Worker object is reused without properly resetting self.conn.

I see that detached() does reset that variable, but disconnect() doesn't. Should it also reset self.conn?

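For context, here is a condensed paraphrase of the shape I believe is involved (not the actual Buildbot source, just an illustration of where the assert and the reset appear to live):

    # Condensed paraphrase of the suspected code path in buildbot/worker/base.py.
    # Not the real source; it only illustrates where the assert and the reset live.
    class AbstractWorkerSketch:
        def __init__(self):
            self.conn = None

        def attached(self, conn):
            # A reconnect that arrives while self.conn still points at the old,
            # dead connection trips this assertion, matching the traceback above.
            assert self.conn is None
            self.conn = conn

        def detached(self):
            # detached() clears the reference ...
            self.conn = None

        def disconnect(self):
            # ... but if disconnect() only asks the connection to close and never
            # clears self.conn, a fast reconnect can still see the stale value.
            if self.conn is not None:
                self.conn.loseConnection()
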
@YngveNPettersen
Contributor Author

For reference, we are still seeing this issue occasionally when the network connection to workers gets broken, with the same exception reported in the log. The most recent case was this week.

We are now using Buildbot v3.1.1.

@YngveNPettersen
Contributor Author

We just had another case like this on our (now) 3.3.0 cluster.

This case is odd.

According to the log, one of the workers started reconnecting (probably due to a network hiccup causing the connection to be lost).

The manager determined that it had lost the connection to the other end, accepted the new connection, and attached the worker to the associated builders.

However, the connection icon for the worker is red/disconnected, and no email has been sent about the worker connection being lost.

Even further oddness: Despite being marked as "offline" in the GUI, the worker is accepting tasks and successfully completing them.

This seems to indicate that the internal connected state for the worker and the GUI state for the worker are not in sync for some reason. (Reloading the GUI pages does not fix the state.)

2021-09-12 05:30:20+0000 [Broker (TLSMemoryBIOProtocol),68,1.2.3.4] worker 'foo' attaching from IPv4Address(type='TCP', host='1.2.3.4', port=64838)
2021-09-12 05:30:20+0000 [Broker (TLSMemoryBIOProtocol),68,1.2.3.4] Got duplication connection from 'foo' starting arbitration procedure
2021-09-12 05:30:20+0000 [Broker (TLSMemoryBIOProtocol),22,1.2.3.4] Got error while trying to ping connected worker foo:[
Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side
was lost in a non-clean fashion: Connection lost.
        ]
2021-09-12 05:30:20+0000 [Broker (TLSMemoryBIOProtocol),22,1.2.3.4] Old connection for 'foo' was lost, accepting new
2021-09-12 05:30:20+0000 [Broker (TLSMemoryBIOProtocol),22,1.2.3.4] Worker.detached(foo)
2021-09-12 05:30:20+0000 [Broker (TLSMemoryBIOProtocol),22,1.2.3.4] releaseLocks(<Worker 'foo'>): []
2021-09-12 05:30:20+0000 [Broker (TLSMemoryBIOProtocol),68,1.2.3.4] Got workerinfo from 'foo'
2021-09-12 05:30:20+0000 [-] Worker foo detached from Bar1
2021-09-12 05:30:20+0000 [-] Worker foo detached from Bar2
2021-09-12 05:30:20+0000 [-] Worker foo detached from Bar3
2021-09-12 05:30:20+0000 [-] bot attached
2021-09-12 05:30:21+0000 [Broker (TLSMemoryBIOProtocol),68,1.2.3.4] Worker foo attached to Bar1
2021-09-12 05:30:21+0000 [Broker (TLSMemoryBIOProtocol),68,1.2.3.4] Worker foo attached to Bar2
2021-09-12 05:30:21+0000 [Broker (TLSMemoryBIOProtocol),68,1.2.3.4] Worker foo attached to Bar3

@YngveNPettersen
Contributor Author

I just started looking at an update to Buildbot 3.6, and had a look at the code suspected of triggering the assert that is causing this problem.

I see that the attached() assert in master/buildbot/worker/base.py was changed to provide more information in commit e37a01c, and afterwards I realized there are a couple of questions that should be asked:

  • What if self.conn == conn ?
  • If they are different, what happens to conn after the assert? Is the connection released, or left dangling?
  • Is self.conn the newer connection, or is conn? If possible, the assert message should indicate the connection times.

I have no idea which of these is the best description of what happens, or whether there is a better one.

One possible theory, if the conns are different and self.conn is older, is that there is a race condition between removing the old attachments and adding the new ones. This might be handled with some kind of locking, or with a task that then fires off the reattach task.

The other part of the problem, that the connection stays alive despite not being attached to any builder, might be due to missing cleanup during the exception handling. One possible way to handle/work around this would be to regularly verify that each worker connection is attached to all builders it is assigned to, e.g. during the ping; a sketch of such a check is below.

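A hedged sketch of what such a ping-time check could look like; the attribute names (worker.conn, conn.builder_names) are assumptions made for illustration, not existing Buildbot API:

    # Hypothetical consistency check that could run alongside the periodic ping.
    # worker.conn and conn.builder_names are assumed attributes for illustration.
    def verify_worker_attachments(worker, expected_builder_names):
        """Drop a connection that answers keepalives but is attached to nothing."""
        conn = worker.conn
        if conn is None:
            return False  # no live connection, nothing to check
        attached = set(getattr(conn, "builder_names", []))
        if expected_builder_names and not attached & set(expected_builder_names):
            # Live but orphaned connection: close it so the worker reconnects cleanly.
            conn.loseConnection()
            return True
        return False
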
@YngveNPettersen
Contributor Author

YngveNPettersen commented May 14, 2023

We have just had yet another incident which left a worker in the "quantum superposition" of being connected, yet disconnected, after an ISP network failure between the manager and some of our workers.

To resolve the situation I will have to stop the manager and start it again, which means I will have to pause all workers and wait several hours until the current tasks have completed.

What I see from the logs below is that the manager:

  • Detects a duplicated connection
  • Starts a ping test of the old connection
  • The ping test fails and the new connection is accepted
  • However, when the attached call is made, the call is refused, throwing an exception, because the old connection "is still active and connected".

That suggests to me that the ping test fails to disconnect the old connection and detach it from the builders it was attached to, or at least fails to mark it as no longer active.

The log from the manager:

2023-05-14 18:47:27+0000 [Broker (TLSMemoryBIOProtocol),41,1.2.3.4] worker 'arbeider' attaching from IPv4Address(type='TCP', host='1.2.3.4', port=51144)
2023-05-14 18:47:27+0000 [Broker (TLSMemoryBIOProtocol),41,1.2.3.4] Got duplication connection from 'mac-m1-builder' starting arbitration procedure
2023-05-14 18:47:37+0000 [-] Connected worker 'arbeider' ping timed out after 10 seconds
2023-05-14 18:47:37+0000 [-] Old connection for 'arbeider' was lost, accepting new
2023-05-14 18:47:37+0000 [Broker (TLSMemoryBIOProtocol),41,1.2.3.4] Got workerinfo from 'arbeider'
2023-05-14 18:47:37+0000 [Broker (TLSMemoryBIOProtocol),41,1.2.3.4] worker arbeider cannot attach
        Traceback (most recent call last):
          File "site-packages/twisted/internet/defer.py", line 1856, in _cancellableInlineCallbacks
            _inlineCallbacks(None, gen, status, _copy_context())
          File "site-packages/twisted/internet/defer.py", line 1692, in _inlineCallbacks
            result = context.run(
          File "site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "site-packages/buildbot/worker/base.py", line 722, in attached
            log.err(e, f"worker {self.name} cannot attach")
        --- <exception caught here> ---
          File "site-packages/buildbot/worker/base.py", line 720, in attached
            yield super().attached(conn)
        builtins.AssertionError: arbeider: 1.2.3.4:51144 connecting, but we are already connected to: 1.2.3.4:51110

2023-05-14 19:03:52+0000 [Broker (TLSMemoryBIOProtocol),23,1.2.3.4] Worker.detached(arbeider)
2023-05-14 19:03:52+0000 [Broker (TLSMemoryBIOProtocol),23,1.2.3.4] releaseLocks(<Worker 'arbeider'>): []
2023-05-14 19:03:52+0000 [-] Worker arbeider detached from Builder

The log from the worker ("arbeider"); as can be seen, it took several minutes before the worker was able to reconnect to the manager:

2023-05-14 18:43:03+0000 [-] sending app-level keepalive
2023-05-14 18:43:32+0000 [Broker (TLSMemoryBIOProtocol),client] connection already shut down when attempting keepalive
2023-05-14 18:43:32+0000 [-] Scheduling retry 1 to connect <twisted.internet.endpoints._WrapperEndpoint object at 0x104fcda00> in 2.384103564723184 seconds.
2023-05-14 18:43:32+0000 [-] Stopping factory <buildbot_worker.pb.BotFactory object at 0x104fad9d0>
2023-05-14 18:43:35+0000 [-] Starting factory <buildbot_worker.pb.BotFactory object at 0x104fad9d0>
2023-05-14 18:44:05+0000 [-] Scheduling retry 2 to connect <twisted.internet.endpoints._WrapperEndpoint object at 0x104fcda00> in 2.4599213354228233 seconds.
2023-05-14 18:44:05+0000 [-] Stopping factory <buildbot_worker.pb.BotFactory object at 0x104fad9d0>
2023-05-14 18:44:08+0000 [-] Starting factory <buildbot_worker.pb.BotFactory object at 0x104fad9d0>
2023-05-14 18:44:38+0000 [-] Scheduling retry 3 to connect <twisted.internet.endpoints._WrapperEndpoint object at 0x104fcda00> in 3.584637161707642 seconds.
2023-05-14 18:44:38+0000 [-] Stopping factory <buildbot_worker.pb.BotFactory object at 0x104fad9d0>
2023-05-14 18:44:41+0000 [-] Starting factory <buildbot_worker.pb.BotFactory object at 0x104fad9d0>
2023-05-14 18:45:11+0000 [-] Scheduling retry 4 to connect <twisted.internet.endpoints._WrapperEndpoint object at 0x104fcda00> in 5.987102588983977 seconds.
2023-05-14 18:45:11+0000 [-] Stopping factory <buildbot_worker.pb.BotFactory object at 0x104fad9d0>
2023-05-14 18:45:18+0000 [-] Starting factory <buildbot_worker.pb.BotFactory object at 0x104fad9d0>
2023-05-14 18:45:48+0000 [-] Scheduling retry 5 to connect <twisted.internet.endpoints._WrapperEndpoint object at 0x104fcda00> in 8.549573063096226 seconds.
2023-05-14 18:45:48+0000 [-] Stopping factory <buildbot_worker.pb.BotFactory object at 0x104fad9d0>
2023-05-14 18:45:56+0000 [-] Starting factory <buildbot_worker.pb.BotFactory object at 0x104fad9d0>
2023-05-14 18:46:26+0000 [-] Scheduling retry 6 to connect <twisted.internet.endpoints._WrapperEndpoint object at 0x104fcda00> in 11.657774117225621 seconds.
2023-05-14 18:46:26+0000 [-] Stopping factory <buildbot_worker.pb.BotFactory object at 0x104fad9d0>
2023-05-14 18:46:38+0000 [-] Starting factory <buildbot_worker.pb.BotFactory object at 0x104fad9d0>
2023-05-14 18:47:08+0000 [-] Scheduling retry 7 to connect <twisted.internet.endpoints._WrapperEndpoint object at 0x104fcda00> in 18.08507222735015 seconds.
2023-05-14 18:47:08+0000 [-] Stopping factory <buildbot_worker.pb.BotFactory object at 0x104fad9d0>
2023-05-14 18:47:27+0000 [-] Starting factory <buildbot_worker.pb.BotFactory object at 0x104fad9d0>
2023-05-14 18:47:37+0000 [Broker (TLSMemoryBIOProtocol),client] message from master: attached
2023-05-14 18:47:37+0000 [Broker (TLSMemoryBIOProtocol),client] Connected to buildmaster; worker is ready
2023-05-14 18:47:37+0000 [Broker (TLSMemoryBIOProtocol),client] sending application-level keepalives every 600 seconds
2023-05-14 18:57:37+0000 [-] sending app-level keepalive
2023-05-14 18:57:37+0000 [Broker (TLSMemoryBIOProtocol),client] Master replied to keepalive, everything's fine
2023-05-14 19:07:37+0000 [-] sending app-level keepalive
2023-05-14 19:07:37+0000 [Broker (TLSMemoryBIOProtocol),client] Master replied to keepalive, everything's fine
2023-05-14 19:17:37+0000 [-] sending app-level keepalive
2023-05-14 19:17:37+0000 [Broker (TLSMemoryBIOProtocol),client] Master replied to keepalive, everything's fine

@YngveNPettersen
Contributor Author

Investigating a bit, I suspect that WorkerManager.newConnection's call to old_conn.loseConnection() does not result in a protocols.base.Connection.notifyDisconnected() call, at least before the call to attached is made.

I am guessing that there are further references to old_conn elsewhere that are not removed until the network error for the lost socket arrives.

Further, I guess that the attach failure does not result in the WorkerManager closing and removing the new connection.

@YngveNPettersen
Contributor Author

I did a test forcing notifyDisconnected() from a hot/monkey-patched WorkerManager.newConnection, which may have worked, but it did trigger exceptions in yield conn.waitShutdown() because conn is None. (The actual test was a bit of a hack: I set up a worker with two processes in parallel and overrode the RuntimeError to act like the timeout, causing the two processes to repeatedly kick each other out.)

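For reference, a minimal sketch of the kind of hotpatch I mean: wrap WorkerManager.newConnection so the previously registered connection is explicitly told it is gone before arbitration runs. The self.connections lookup and the exact newConnection signature are assumptions here, and whether notifyDisconnected() is safe to call this early is exactly the open question:

    # Sketch of the monkey-patch idea; attribute names and signature are assumed.
    from buildbot.worker.manager import WorkerManager

    _original_newConnection = WorkerManager.newConnection

    def patched_newConnection(self, conn, workerName):
        # If a connection is already registered for this worker, make sure it is
        # explicitly marked as disconnected so it detaches from its builders.
        old_conn = self.connections.get(workerName)  # assumed lookup
        if old_conn is not None and old_conn is not conn:
            old_conn.notifyDisconnected()
        return _original_newConnection(self, conn, workerName)

    WorkerManager.newConnection = patched_newConnection
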
My guess is, as above, that loseConnection does not properly detach the connection from the builders it is attached to, and also does not properly shut down the connection.

I also noticed that the remove callback in WorkerManager.newConnection fails to make two checks: 1) that the worker has an active connection, and 2) that the connection being removed is the same one the call is being made for. It may be that the invariant holds under normal circumstances (rather than the hacky ones I used), but it might be an idea to verify just the same.

BTW, it is now over 4 years since I first reported this problem.

@cmouse
Contributor

cmouse commented Oct 30, 2023

I am having this issue with 3.9.2 as well.

@cmouse
Contributor

cmouse commented Oct 30, 2023

@p12tic @tardyp any chance this issue could be looked into? Is there anything further we could provide to assist with this?

cmouse added a commit to cmouse/buildbot that referenced this issue Nov 2, 2023
@YngveNPettersen
Contributor Author

I have been trying to make a test case to reproduce this issue, but the standard null-protocol workers did not reproduce it, and I was unable to create a working PB-based test that severs the connection (SeverWorkerConnectionMixin) and then immediately reconnects; the null version always passed.

I finally was able to reproduce it by asserting in AbstractWorker.detached (which removes the connection) and tracing back what triggered the call, and that turned out to be notifyDisconnected().

notifyDisconnected() is called by loseConnection in null connections, but not in PB connections. Thus null connections are fully removed by WorkerManager.newConnection(), but PB connections are not, leaving the old one still associated with the worker when the new connection is being attached and triggering the "but we are already connected to" assert.

My guess is that pb.Connection.loseConnection will need to add a call to notifyDisconnected(). I am not sure where exactly such a call should be added; my guess is somewhere after the transport's loseConnection call.

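As a standalone illustration of the difference (paraphrased stand-ins, not the actual Buildbot classes): the null-style connection notifies its disconnect subscribers directly from loseConnection(), while the PB-style connection only tears down the transport, so nothing tells WorkerManager that the old connection is gone:

    # Paraphrased stand-ins for the two behaviours; not the real Buildbot classes.
    class FakeTransport:
        def loseConnection(self):
            print("transport asked to close")

    class NullLikeConnection:
        def __init__(self):
            self._disconnect_subs = []

        def notifyOnDisconnect(self, cb):
            self._disconnect_subs.append(cb)

        def notifyDisconnected(self):
            for cb in self._disconnect_subs:
                cb()

        def loseConnection(self):
            self.notifyDisconnected()  # null: subscribers are told immediately

    class PbLikeConnection(NullLikeConnection):
        def __init__(self):
            super().__init__()
            self.transport = FakeTransport()

        def loseConnection(self):
            self.transport.loseConnection()  # pb: only the socket is asked to close
            # Proposed fix: also call self.notifyDisconnected() here, after the
            # transport-level loseConnection, to mirror the null behaviour.
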
I am presently planning to deploy the hotpatch I created earlier this year, which adds notifyDisconnected() in newConnection, but I was never quite sure whether it was safe to deploy. Now, seeing that the null protocol implements that solution, I think it is safe, although I don't think that function is the one that should carry the fix; it will just be a workaround that implements what appears to be necessary.

It might also be an idea to look into having a backup check that verifies that connected workers are actually attached to at least one builder, and close connections that aren't attached, e.g. during the keepalive test.

A separate backup step might be to have an action in the web GUI for each worker that disconnects all registered connections for it. This could be one of the actions that might be added by my suggestion in #5833.

BTW, I recently observed this issue with a worker that was located in the same rack as the manager. The worker had an unstable connection, apparently caused by a bad network cable, and in one specific case it was connected but "offline".

@YngveNPettersen
Contributor Author

Also, I suspect that exceptions like the "but we are already connected to" exception should be updated so that the connection is properly cleaned up and closed. Part of the problem in this report is that, when the newer connection was detected as conflicting with an existing connection and the exception was raised, the newer connection was not closed, leaving it hanging and kept alive by the connection manager, which prevented the worker from trying again.

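A minimal sketch of that cleanup, assuming a hypothetical wrapper around the attach call site (the names below are illustrative, not the actual Buildbot call site):

    # Hedged sketch: if attaching the new connection fails, close it instead of
    # leaving it registered and answering keepalives. Names are illustrative.
    from twisted.internet import defer
    from twisted.python import log

    @defer.inlineCallbacks
    def attach_or_drop(worker, conn):
        try:
            yield worker.attached(conn)
        except Exception as e:
            log.err(e, f"worker {worker.name} cannot attach; dropping connection")
            # Without this, the connection stays alive but unattached, which is
            # exactly the "connected but offline" state described in this issue.
            conn.loseConnection()
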
YngveNPettersen added a commit to YngveNPettersen/buildbot that referenced this issue Mar 2, 2024
…from a worker when there is already a known (but actually disconnected) connection to the same worker.

Additionally, match pb.loseConnection with null.loseConnection() by calling self.notifyDisconnected() to make sure the connection is shut down properly.

This function is called in both places to make sure the old connection is actually shut down.

This would otherwise cause a case of the worker "knowing" it was "connected", while the manager equally "knew" that the worker was "offline" because it was not attached to any builders, all the while the low-level connection maintenance functionality kept sending and receiving keep-alives, thus keeping the unattached connection open. (buildbot#4678)
YngveNPettersen added a commit to YngveNPettersen/buildbot that referenced this issue Mar 4, 2024
… with a worker (buildbot#4678)

This is just-in-case logic for when the connection is somehow detached from all Builders but is still left active.
YngveNPettersen added a commit to YngveNPettersen/buildbot that referenced this issue Mar 4, 2024
…ociated with a worker (buildbot#4678)

This is just-in-case logic for when the connection is somehow detached from all Builders but is still left active.
YngveNPettersen added a commit to YngveNPettersen/buildbot that referenced this issue Apr 6, 2024
…ociated with a worker (buildbot#4678)

This is just-in-case logic for when the connection is somehow detached from all Builders but is still left active.
@YngveNPettersen
Contributor Author

I have been doing more thinking about possible causes after the HotPatch didn't actually work as intended, and have recently been reviewing the log entries from one of the events.

I think the original case (without HotPatch) was probably like this:

  • Connection from worker to manager gets broken.
  • Worker quickly detects the breakage, and opens a new connection
  • The newConnection code finds that there is already an open connection to the worker, and pings the old connection
  • When the ping times out, it starts the process of starting to use the new connection, and starts attaching Builders
  • However, the old connection is still attached to the current Builders, so the operation fails, and the function gets terminated early by an Exception
  • After a while, when Twisted gets around to noticing the connection is broken, it closes the old connection and detaches its Builders
  • Meanwhile, the new connection did not get closed when newConnection failed and keepalives continue, so the worker thinks everything is OK, while the manager knows there is no worker connected.

My HotPatch (and the PR) solves the detachment problem by calling notifyDisconnected() for the old connection. This causes the old connection to no longer be attached to the builders.

Unfortunately, this did not work properly. What happened after the Builders were attached to the new connection is as follows:

  • Some minutes later, perhaps 10+, the Builders were detached, and the manager again registered a missing worker, while as far as the Worker was concerned, there was a live connection to the manager.

What I now suspect happened is:

  • When Twisted noticed the old broken connection, it triggered the disconnected logic
  • This triggered a call to notifyDisconnected() again
  • and that caused the detachment from the Builders, triggered by the old connection rather than by a disconnection of the newer connection

The reason this happens is that notifyDisconnected isn't a fire-once callback subscription, but a multiple-delivery callback subscription.

Add in that the callbacks do not verify that the calling connection is still the active connection (some of the callbacks do not even have that information), and they can trample roughshod over the newer connection's attachment and other state.

The minimal way to solve this issue is to delete the subscriptions in the connection after they have been delivered, and replace the old subscription point with a new one. My more recent updates to the PR do this, and also add a test case for the fire-twice scenario (which fails if the notifyDisconnected update isn't there).

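A minimal sketch of the fire-once behaviour (attribute names are illustrative; the real code uses Buildbot's subscription utility rather than a plain list):

    # Demonstrates the fire-once fix: after delivering the notifications, the
    # subscriber list is replaced, so a second notifyDisconnected() (e.g. when
    # Twisted later notices the dead socket) does not re-run the callbacks.
    class ConnectionSketch:
        def __init__(self):
            self._disconnect_subs = []

        def notifyOnDisconnect(self, callback):
            self._disconnect_subs.append(callback)

        def notifyDisconnected(self):
            subs, self._disconnect_subs = self._disconnect_subs, []  # fire once
            for callback in subs:
                callback()

    calls = []
    conn = ConnectionSketch()
    conn.notifyOnDisconnect(lambda: calls.append("detach builders"))
    conn.notifyDisconnected()
    conn.notifyDisconnected()  # late duplicate delivery: no effect
    assert calls == ["detach builders"]
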
I have not yet tested this in my HotPatch; I am waiting for the next incident to prove that the disconnect function was called twice.

There might be further changes that could strengthen the protection against this issue:

  • The keepalive system could verify that the connection is still attached to all Builders it should be connected to, and close it if necessary.
  • There could be a separate action for the worker in the GUI to forcibly disconnect the worker connection.

A couple of things I am not sure about are how sensitive Twisted is to connections being closed due to network errors, and whether there are issues with forcibly closing a socket maintained by Twisted.

@YngveNPettersen
Contributor Author

Hmmm, a related possibility I just started wondering about: could the apparent problem of Twisted not detecting (or acting on) connection errors in a timely fashion be related to what I observed in my comment #5832 (comment)?

YngveNPettersen added a commit to YngveNPettersen/buildbot that referenced this issue Apr 13, 2024
…from a worker when there is already a known (but actually disconnected) connection to the same worker.

This would otherwise cause a case of the worker "knowing" it was "connected", while the manager equally "knew" that the worker was "offline" because it was not attached to any builders, all the while the low-level connection maintenance functionality kept sending and receiving keep-alives, thus keeping the unattached connection open. (buildbot#4678)
YngveNPettersen added a commit to YngveNPettersen/buildbot that referenced this issue Apr 13, 2024
… with a worker (buildbot#4678)

This is just-in-case logic for when the connection is somehow detached from all Builders but is still left active.
YngveNPettersen added a commit to YngveNPettersen/buildbot that referenced this issue Apr 13, 2024
once, by replacing the current subscription object with a new one
after delivering the notifications. A second call could mess up the
newer connection's state. (buildbot#4678)
@YngveNPettersen
Contributor Author

We just had to do some network maintenance that triggered the online-but-offline scenario, and the log message for a duplicate call got triggered for three workers, so I think the double-call scenario is proven. The next step will be to hotpatch the conn object.
