I currently have an issue where the Deadline Worker panel isn’t showing all our Workers. Yesterday we added 68 new machines. Before we added them, we had 474 and they were all showing up in Deadline. Today the Worker panel is only showing 200, and they are the Workers that don’t have any pools/groups. The Workers that do have pools and groups assigned are still working, but I can’t see them in that panel. I can see those machines from a job’s tasks in the Task panel. I can also see all the Workers under Tools > Manage Pools/Groups, and both dialogs display all 542 machines.
This has happened before, and the fix we’ve previously used was to restart the Deadline Launcher on the Deadline server to restart the services, with: sudo systemctl restart deadline10launcher
We’ve tried this 4 times. Normally this fixes the issue, but today it only brought back 40-something Workers, and if we then restart Deadline Monitor we drop back down to 200 machines.
We are currently running on
Deadline Client Version: 10.2.0.10 Release (3b87216c7)
Repository Version: 10.2.010 (3b87216c7)
If it’s not the same error as before, this is usually caused by something in the Worker’s database representation being missing, so the Monitor only gets the first page of Worker results, hence only seeing 200.
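To illustrate that symptom, here is a minimal sketch (not Deadline code; the page size of 200 is inferred from the counts in this thread, not a documented constant) of how a paged query whose later pages are never fetched would surface exactly 200 of 542 Workers:

```python
# Hypothetical sketch: mimic a paged Worker query that stops after page one.
def first_page(items, page_size=200):
    # Return only the first page of results; later pages are never fetched.
    return items[:page_size]

# 474 old machines + 68 new machines = 542 total Workers.
workers = ["worker-%03d" % i for i in range(542)]
visible = first_page(workers)
print(len(visible))  # 200
```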
We’re not entirely sure what the cause might be here, and it seems there can be multiple causes. In the past we believed there were missing entries in the SlaveSettings collection for the corresponding SlaveInfo collection entries. You can compare them, but another test would be to run the following script, which just loads and re-saves sanitized data:
# File is named update_workers.py
from datetime import datetime

from Deadline.Scripting import RepositoryUtils

def __main__(**kwargs):
    # Load the settings for every Worker in the repository.
    workers = RepositoryUtils.GetSlaveSettingsList(True)
    print("Found %d workers!" % len(workers))
    for worker in workers:
        print("Updating Worker %s at %s" % (worker.SlaveName, datetime.now()))
        # Re-saving writes a sanitized copy of the settings back to the repository.
        RepositoryUtils.SaveSlaveSettings(worker)
This can be run connected to an RCS or directly to the DB. Use this command:
It is indeed not the same “ArgumentException” as before. No error messages are being reported this time. We’ve looked through all the logs and found nothing.
We didn’t run the suggested update_workers.py code, but I’ve been repeatedly monitoring the contents of the SlaveInfo and SlaveSettings collections.
The document counts are vastly different. We have way more SlaveInfo documents than SlaveSettings documents. I haven’t yet checked whether all the SlaveSettings entries exist in SlaveInfo. If we run into this issue again, I’ll take a look at that.
The mismatch between SlaveInfo and SlaveSettings would be contributing to this; those should be 1:1 with each other. SlaveInfo holds the state of the Worker and the machine (time spent rendering, CPU usage, hostname, HDD space, etc.), while SlaveSettings describes settings associated with the Worker (pools, groups, limits, etc.).
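One way to check that 1:1 expectation is to diff the Worker names coming from the two collections. A minimal sketch in plain Python, with made-up sample names standing in for the results of real queries against SlaveInfo and SlaveSettings:

```python
# Hypothetical sketch: the sample names below stand in for name lists
# pulled from the SlaveInfo and SlaveSettings collections.
info_names = {"rn-001", "rn-002", "rn-003", "rn-004"}  # names seen in SlaveInfo
settings_names = {"rn-001", "rn-003"}                  # names seen in SlaveSettings

# Workers that have state but no settings document (the suspected bad case).
missing_settings = sorted(info_names - settings_names)
# Settings documents with no matching state document (the reverse mismatch).
orphan_settings = sorted(settings_names - info_names)

print("Workers missing a SlaveSettings entry:", missing_settings)
print("Orphaned SlaveSettings entries:", orphan_settings)
```

With real data, any non-empty `missing_settings` list would identify exactly which Workers to re-save or repair.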
Interesting! I wonder if touching the collections directly in the database did the equivalent of that script I mentioned above. That hasn’t been our experience, but results are results.
Local Workers can use either RCS or direct connections; they should be equivalent, barring issues like the one you’re seeing now, of course.
We haven’t ever figured out where this bad data is coming from in the other cases, but we do think it has something to do with using the Standalone API or the Web Service to update the Worker’s settings. Is that something you folks do?