
Slaves that won't delete

Hi -
I’m changing a few things in how we manage our jobs and deleting some extra slaves I had created on some machines. They disappear for the moment, but come back the next time the Launcher starts on the machine. I found the previous discussion about this (forums.thinkboxsoftware.com/vie … ave#p70626) and, after quitting out of everything Deadline-related on the machine, I deleted the slave’s .ini file, but that didn’t solve it: as soon as I restart the Launcher, both slaves come back up.

I opened the Monitor console while doing the delete and saw a database error stack trace (screenshot attached):

I see the same thing on two different machines (one of them being the machine the slave that won’t delete is running on). Both can submit jobs fine and show no other errors in the Monitor, so I don’t think it’s the Mongo credentials. I also ran Help > View Database Statistics and that seems to work fine, too. I’m happy to check the credentials, but honestly I can’t remember where to do that. Any advice?

It’s happening on 5 of the 7 machines I tried to delete slaves from. It feels like a bug, but is there anything else I can check, try, or delete somewhere to work around it?

Thanks,
Matt

Open up the folder

C:\ProgramData\Thinkbox\Deadline10\slaves

on the slave machine. You can delete the slave (or slaves) from there.
That will kill them for sure!! :slight_smile:
Not sure where that folder lives on Linux or Mac if you’re using those; this is the Windows path, of course.
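
If you have a bunch of machines to clean up, a quick script along these lines should do it. This is just a sketch: the base path is the Windows one above, the instance names are only examples (swap in your own), and stop the Launcher and all Slave processes first.

# Rough sketch: delete the data folders for extra slave instances on a
# Windows render node. The base path is the one mentioned above; the
# instance names below are just examples, so swap in your own. Make sure
# the Launcher and all Slave processes are stopped before running this.
import shutil
from pathlib import Path

SLAVES_DIR = Path(r"C:\ProgramData\Thinkbox\Deadline10\slaves")
EXTRA_SLAVES = ["render01-b", "render01-c"]  # example extra instance names

for name in EXTRA_SLAVES:
    folder = SLAVES_DIR / name
    if folder.is_dir():
        shutil.rmtree(folder)
        print(f"Removed slave data folder: {folder}")
    else:
        print(f"Nothing to remove at: {folder}")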

That did the trick, thanks very much!

That database stack trace is new to me. Which version of Deadline are you on? The error from the database definitely seems like a bug, so I really should investigate with the dev team on this one.

Also, have you tried deleting other Slaves as a test? I’m wondering if there is some database corruption, but there’s nothing in that error message that points to that. Just trying to figure out how I can reproduce on my end.

Deadline version is 10.0.14.1.

I was able to delete some of the slaves I was trying to clear out without any problem, but most of them gave me this error.

I don’t know if anything about the slaves gets written into the repo file system, but in case it matters, I did have problems with some file permissions after the last update. I ran the update on a Windows machine direct-attached to our StorNext SAN where the Deadline repo lives. That file system is already set up to inherit open permissions, but I think I had the “open up permissions” option enabled in the installer, and it somehow made things owned by an unidentified user, such that even with admin rights I couldn’t access the report log folders for jobs run after the upgrade. I’ve since set everything back to a known user with inherited permissions and haven’t seen any permissions problems on the Windows machines, but maybe I missed something. The slaves have all existed for months; none were created after the upgrade.
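
If it helps, something like this is the kind of check I could run against the repo mount to look for anything the current account can’t read. It’s only a sketch, and the report path is just a guess at where things live; point it at wherever the repository actually sits.

# Rough sketch: walk the repository's job report tree and list anything the
# current account can't read. The REPORTS_ROOT path is only a guess at our
# layout; point it at wherever the repository's report folders actually live.
import os
from pathlib import Path

REPORTS_ROOT = Path(r"\\san\DeadlineRepository10\reports\jobs")  # placeholder path

unreadable = []

def on_error(err):
    # Directories we can't even list count as unreadable too.
    unreadable.append(Path(err.filename))

for dirpath, dirnames, filenames in os.walk(REPORTS_ROOT, onerror=on_error):
    for name in filenames:
        path = Path(dirpath) / name
        try:
            with open(path, "rb"):
                pass
        except PermissionError:
            unreadable.append(path)

print(f"{len(unreadable)} unreadable path(s)")
for path in unreadable:
    print(path)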

I also tried the delete from a Mac laptop with the repo volume mounted over SMB via a Samba server on a Linux machine and got the same error. I assume that goes through the POSIX permissions, but if there’s anything in the repo this relies on, let me know and I can confirm that it’s accessible.

I can still repro on another slave, so if there’s anything I can enable or add to get more debugging info, or anything I can dump from the database that might be useful to you, let me know and I’ll send it over. In fact, I could probably TeamViewer someone in, too, if you use that.

Thanks,
Matt
