
Not all workers are showing up in the worker panel

Hey,

I currently have an issue where the Deadline Worker panel isn’t showing all our Workers. Yesterday we added 68 new machines. Before we added them, we had 474 that were all showing up in Deadline. Today the Worker panel is only showing 200, and they are the Workers that don’t have any pools/groups. The Workers that do have pools and groups assigned are still working, but I can’t see them in that panel. I can see those machines from a job’s tasks in the task panel. I can also see all the Workers from Tools > Manage Pools/Groups, and both dialogs display all 542 machines.

This has happened before, and the fix we’ve used previously was to restart the Deadline Launcher on the Deadline server to restart the services. The command we run is:
sudo systemctl restart deadline10launcher

We’ve tried this 4 times. Normally it fixes the issue, but today it only brought back 40-something Workers, and if we then restart the Deadline Monitor we drop back down to 200 machines.

We are currently running on
Deadline Client Version: 10.2.0.10 Release (3b87216c7)
Repository Version: 10.2.010 (3b87216c7)

Thanks,
Ryan

I’m working with Ryan on this issue and figured out that using a direct connection (instead of using the RCS) resolves it.

Something is off with our RCS. I still don’t know what it might be.

After accessing the repository directly once, the RCS started to behave and report all our Workers again. It must be that direct access fixed some data…

What is the recommended setup for local workers & monitors, RCS or direct?

It’s not the “ArgumentException: The given SlaveInfo and SlaveSettings objects have an empty SlaveName property” error happening again, is it?

If not, this is usually caused by something missing from the Worker’s database representation, so you only get the first page of Worker results, hence only seeing 200.

We’re not entirely sure what the cause might be here, and it seems there can be multiple causes. In the past we believed it was missing entries in the SlaveSettings collection for the corresponding SlaveInfo collection entries. You can compare them, but another test would be to run the following script, which just loads and re-saves the sanitized data:

# File is named update_workers.py
from Deadline.Scripting import RepositoryUtils
from datetime import datetime

def __main__(*args, **kwargs):
    # Load every Worker's settings entry from the repository.
    workers = RepositoryUtils.GetSlaveSettingsList(True)
    worker_count = len(workers)

    print("Found " + str(worker_count) + " workers!")

    # Re-save each entry so the sanitized data is written back to the database.
    for worker in workers:
        print("Update Worker " + worker.SlaveName + " at " + str(datetime.now()))
        RepositoryUtils.SaveSlaveSettings(worker)

This can be run connected to an RCS or directly to the DB. Use this command:

./deadlinecommand executescriptnogui ~/Downloads/update_workers.py

If that doesn’t seem to work, you can try to follow the older steps:

Get both collections from the database by running the commands below:

./mongo --port=27100 deadline10db --quiet --eval "cursor = db.SlaveInfo.find(); printjson(cursor.toArray())" > WorkerInfo.txt 
./mongo --port=27100 deadline10db --quiet --eval "cursor = db.SlaveSettings.find(); printjson(cursor.toArray())" > WorkerSettings.txt

Where the core parts are the following:

db.SlaveInfo.find().pretty()
db.SlaveSettings.find().pretty()

Then compare “Name” across both collections.

  1. Open Notepad++, search for “Name” (with quotes), and hit “Find All in Current Document”.
  2. Copy the results into an Excel sheet.
  3. Run a VLOOKUP on the SlaveInfo Name entries to compare them with the SlaveSettings Name entries. Do you see #N/A results? Those are the missing entries. (Alternatively, see the script sketch after this list for a programmatic comparison.)
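If the spreadsheet comparison is too fiddly, here is a minimal sketch that does the same check directly against the database with pymongo. The port (27100) and database name (deadline10db) are taken from the mongo commands above; the file name compare_worker_collections.py is just an example, and you’ll need to adjust the host if the Deadline database isn’t local. This is an illustrative sketch, not an official Thinkbox tool.

# File is named compare_worker_collections.py (illustrative sketch)
from pymongo import MongoClient

# Assumes the Deadline MongoDB is reachable on localhost:27100, as in the
# mongo commands above, and that the database is named deadline10db.
client = MongoClient("localhost", 27100)
db = client["deadline10db"]

# Collect the Worker names stored in each collection, skipping documents
# that have no Name field at all.
info_names = {doc["Name"] for doc in db.SlaveInfo.find({"Name": {"$exists": True}}, {"Name": 1})}
settings_names = {doc["Name"] for doc in db.SlaveSettings.find({"Name": {"$exists": True}}, {"Name": 1})}

print("SlaveInfo documents:     " + str(len(info_names)))
print("SlaveSettings documents: " + str(len(settings_names)))

# Names present in one collection but missing from the other.
print("Missing from SlaveSettings: " + str(sorted(info_names - settings_names)))
print("Missing from SlaveInfo:     " + str(sorted(settings_names - info_names)))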

It is indeed not the same “ArgumentException” as before. No error messages are being reported this time. We’ve looked in all the logs; nothing.

We didn’t run the suggested update_workers.py script, but I have checked the contents of the SlaveInfo and SlaveSettings collections many times.

The document counts are vastly different. We have way more SlaveInfo documents than SlaveSettings documents. I haven’t yet checked whether all the SlaveSettings entries exist in SlaveInfo. If we run into this issue again, I’ll take a look at that.

The mismatch between SlaveInfo and SlaveSettings would be contributing to this; those collections should be 1:1 with each other. SlaveInfo holds the state of the Worker and the machine (time spent rendering, CPU usage, hostname, HDD space, etc.), while SlaveSettings describes the settings associated with the Worker (pools, groups, limits, etc.).
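For what it’s worth, that 1:1 check can also be done through the Deadline Scripting API rather than the database, using the same executescriptnogui approach as the earlier script. This is just a sketch, assuming RepositoryUtils.GetSlaveInfos and GetSlaveSettingsList both return the full lists over your connection; the file name check_worker_collections.py is an example.

# File is named check_worker_collections.py (sketch of a 1:1 check via the Scripting API)
from Deadline.Scripting import RepositoryUtils

def __main__(*args, **kwargs):
    # SlaveInfo entries hold machine state; SlaveSettings hold pools, groups, etc.
    info_names = {info.SlaveName for info in RepositoryUtils.GetSlaveInfos(True)}
    settings_names = {s.SlaveName for s in RepositoryUtils.GetSlaveSettingsList(True)}

    print("SlaveInfo entries:     " + str(len(info_names)))
    print("SlaveSettings entries: " + str(len(settings_names)))

    # Names that appear in one collection but not the other.
    print("Missing SlaveSettings: " + str(sorted(info_names - settings_names)))
    print("Missing SlaveInfo:     " + str(sorted(settings_names - info_names)))

Run it the same way as before, e.g. ./deadlinecommand executescriptnogui check_worker_collections.py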

Interesting! I wonder if querying the collections directly in the database did the equivalent of the script I mentioned above. That hasn’t been our experience, but results are results.

Local Workers can use either the RCS or a direct connection; they should be equivalent, barring issues like the one you’re seeing now, of course.

We haven’t ever figured out where this bad data is coming from in the other cases, but we do suspect it has something to do with using the standalone API or the web service to update the Worker’s settings. Is that something you folks do?
