Web Service worker goes into Stalled status

Donat_Van_Bellinghen · September 17, 2024, 7:05pm

Hi,

We have configured a new installation of Deadline 10.3.1.3.
We installed MongoDB, the Remote Connection Server and the Web Service, each on its own virtual machine (Rocky Linux 9.4).

The whole installation is running fine, I tested submitting a whole range of jobs and everything works.

The only thing that looks wrong is that, in the Deadline Monitor, the machine hosting the Web Service goes to ‘stalled’ status after a few hours. The Web Service is still functional (I use it in scripts to connect to the repo), but that Worker appears as stalled. Restarting the deadline10launcher service on that machine fixes the issue, but after a few hours it goes back to ‘stalled’ status.

Looking at the logs of the deadline10launcher service on the web service machine I only see these lines repeated over and over :

Launcher Scheduling - GeneralNotice: "1 group(s) were found, but none were considered (0 disabled, the rest had no Workers on this host)"
Counters: http-process-calls-current=0 http-process-calls-count=7 http-process-slow-calls-count=0 http-process-last-slow=none http-process-histo: ApxMean=586.8 P99=4096 BoundsMs=0,1.4,2,2.8,4,5.7,8,11.3,16,22.6,32,45.3,64,90.5,128,181,256,362,512,724.1,1024,1448.2,2048,2896.3 Counts=0,0,0,0,0,0,0,0,0,2,1,1,0,0,1,0,1,0,0,0,0,0,0,1 assign-tasks-calls-current=0 assign-tasks-calls-count=0

What could cause this ‘stalled’ status on the worker holding the web service ?

I’m also wondering if this web service machine should be listed as a worker, as no jobs will ever be executed on it. Would it be better to remove it from the workers, and how should I remove it ?

Regards

anthonygelatka · September 17, 2024, 7:12pm

Which version of Mongo is installed on 9.4? I think it installs Mongo 6/7/8, as 5 won’t run on 9.4. I found issues with 9.4 where it looked like the DB was failing every so often, which was confusing as it looked like it was all up and running. Check the logs for issues there.

Donat_Van_Bellinghen · September 17, 2024, 9:05pm

We installed Mongo 5.0.22 on Rocky 9.4.

Like I said, our new Deadline install seems to work but is not yet in production. I have submitted a dozen jobs successfully.
One issue we recently observed is that the MongoDB logs grew a lot (into the tens of GB).

anthonygelatka · September 17, 2024, 9:33pm

There are a few posts about similar issues; I’d check the DB logs in case there are issues there. I ended up using 8.10 and 5.0.

Justin_B · September 18, 2024, 3:15pm

MongoDB logs growing really quickly is always a sign of something going wrong - when commands fail or take too long MongoDB will write about it.

Take a look in those logs and let us know what’s in there, and whether it aligns with what’s in this help article.

To keep the Worker from starting automatically, set LaunchSlaveOnStartup to false in the client configuration file.
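
As a rough illustration of that change, here is a minimal Python sketch that flips the key in ini-style text. Only the key name LaunchSlaveOnStartup comes from this thread; the helper name `set_ini_key`, the `[Deadline]` section header and the sample content are assumptions for the example. The client configuration file is usually deadline.ini, and its location depends on the install, so check your own system before pointing anything like this at a real file.

```python
import re


def set_ini_key(text: str, key: str, value: str) -> str:
    """Return ini-style text with `key=value` set.

    Replaces the first existing `key=...` line, or appends a new line at the
    end if the key is missing (which assumes a single-section file).
    """
    # [ \t]* instead of \s* so the match never spills across line breaks.
    pattern = re.compile(rf"^[ \t]*{re.escape(key)}[ \t]*=.*$", re.MULTILINE)
    new_line = f"{key}={value}"
    if pattern.search(text):
        # A callable replacement avoids backslash interpretation in `value`.
        return pattern.sub(lambda _m: new_line, text, count=1)
    separator = "" if text.endswith("\n") or not text else "\n"
    return f"{text}{separator}{new_line}\n"


# Illustrative content only, not copied from a real deadline.ini.
sample_config = "[Deadline]\nLaunchSlaveOnStartup=true\n"
print(set_ini_key(sample_config, "LaunchSlaveOnStartup", "false"))
```

After changing the real file, the launcher service would need a restart for the setting to be picked up.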

Donat_Van_Bellinghen · September 18, 2024, 3:23pm

Thanks @Justin_B and @anthonygelatka, I will check this with my IT department and report back.

Donat_Van_Bellinghen · September 19, 2024, 8:09am

Hi @Justin_B
I looked into the mongo logs and read the article but I don’t find messages such as those from the article in our mongo logs.
Here’s a small slice of our logs. 148 events in a timespan of 6 seconds.
mongo_log_chunk.zip (7.0 KB)

Justin_B · September 19, 2024, 1:53pm

Interesting - looks like the database is unable to keep up with demand, and is writing a lot of lines about slow queries.

Do you know if this is with a Worker trying to render as well? If MongoDB is unable to make use of the CPU, it’ll start to take a long time to return queries, which I think is what’s triggering your ‘slow query’ lines.

There are also lines with ‘D1’ in them, which are debug log lines. You may want to drop the verbosity in your config.conf file to save on disk space.
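
To get a quick sense of how much of the log is debug output versus slow queries, something like the following Python sketch could help. It assumes the log is in MongoDB's structured JSON format (one JSON object per line, with the severity in the "s" field and the text in "msg"), which is what MongoDB 4.4 and later write. The function name `summarize_mongo_log` and the sample lines are made up for illustration and are not taken from the attached log.

```python
import json
from collections import Counter


def summarize_mongo_log(lines):
    """Count log entries per severity and count 'Slow query' messages."""
    severities = Counter()
    slow_queries = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a structured log line
        if not isinstance(entry, dict):
            continue
        severities[entry.get("s", "?")] += 1  # e.g. "I", "W", "E", "D1".."D5"
        if entry.get("msg") == "Slow query":
            slow_queries += 1
    return severities, slow_queries


# Invented sample lines in the structured log shape, for demonstration only.
sample_lines = [
    '{"t":{"$date":"2024-09-19T08:00:00.000+00:00"},"s":"I","c":"COMMAND",'
    '"id":51803,"ctx":"conn1","msg":"Slow query","attr":{"durationMillis":152}}',
    '{"t":{"$date":"2024-09-19T08:00:00.100+00:00"},"s":"D1","c":"NETWORK",'
    '"id":1,"ctx":"conn1","msg":"example debug line"}',
    '{"t":{"$date":"2024-09-19T08:00:00.200+00:00"},"s":"I","c":"NETWORK",'
    '"id":2,"ctx":"listener","msg":"example info line"}',
    "not a json line",
]

severity_counts, slow_count = summarize_mongo_log(sample_lines)
print(dict(severity_counts), slow_count)
```

To run it against a real log, pass an open file instead of the sample list, for example `summarize_mongo_log(open(path))` with your own path. If the D1 to D5 counts dominate, the verbosity setting in the MongoDB configuration (`systemLog.verbosity`, where 0 is the default) is the likely place to turn it down.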

Donat_Van_Bellinghen · September 19, 2024, 3:02pm

FYI, this MongoDB (v5.0.22) is used by a new Deadline install that is not yet in production.

I have deployed the Deadline client on 7 machines: 4 workstations and two VMs (Rocky 9.4), one with the Remote Connection Server, another one with the Web Service. Those two VMs are in stalled status (I have not yet disabled LaunchSlaveOnStartup).

I have launched 15 jobs on this new deadline install.

All this to say that I don’t understand why the database is so busy.
During the time span covered by the log, no jobs were rendering at all.

Justin_B · September 23, 2024, 2:01pm

It’s not just jobs that create traffic on the database. It could be that the debug flags set in the configuration are all that’s going on here.

Let us know when you have a chance to turn that down, and how it behaves afterwards.