AWS Thinkbox Discussion Forums

gazillion python errors

So… a very, very strange error that's basically making our switch today a bit of a nightmare.

A percentage of machines being moved to Deadline are unable to import python libraries until we restart them a couple of times. I doubt this is a Deadline bug as such, but the same scripts run fine in our previous render environment, so I suspect it is somehow connected.

The errors look like this:

C:\Users\ScanlineVFX\AppData\Local\Thinkbox\Deadline6\slave\LAPRO1323\plugins\525f108ac5f29c166cd8e365\JobPreLoad.py
Python Error: ImportError : No module named scanline (Python.Runtime.PythonException)

If I open a python session from the command line and try to import that module, it works just fine.

It's really strange that it takes a couple of restarts before it works. Do you have to restart the actual machine, or just the slave?

The actual machine. Usually the first restart fixes it… I'm dealing with 400+ machines and growing, so it's hard to keep track. I need to clear the errors on the jobs, resume them, etc., so it's getting a bit chaotic. It may be that the initial restart fixes them.

It seems that in some cases at least, a restart of the slave application also fixes the issue (so no reboot necessary).

It could be an environment issue if restarting the slave fixes it (although in that case, you will probably want to restart the launcher as well if it's still running, since the slave inherits the launcher's environment). Once the Deadline apps have been restarted, they'll have the current environment and should be able to find the scanline module.

Theoretical question…

Let's say the slave has been running for 2 days. Then at one point it can't reach one of the central servers where the python modules live. If it later tries to import that same module, would it just say "oh, I tried yesterday and it failed", or would it try the import again?

If it's within the same python session, I wonder if it's just caching bad module states…

Hmm, I would expect that once the module is loaded, it's in memory, and it doesn't matter if the path becomes inaccessible later on. As a quick test, I put a custom module in a local folder, imported it, changed the folder name, and was still able to use the module (importing it again did nothing, but of course a reload() threw an error).
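Here's roughly what that test looked like, as a runnable sketch (the folder and module names are throwaways I made up):

    import os
    import sys

    # create a throwaway module in a local folder
    os.mkdir('modtest')
    with open(os.path.join('modtest', 'mymod.py'), 'w') as f:
        f.write('VALUE = 42\n')

    sys.path.insert(0, os.path.abspath('modtest'))
    import mymod
    print(mymod.VALUE)   # 42; the module object now lives in sys.modules

    # make the original path inaccessible after the fact
    os.rename('modtest', 'modtest_gone')

    import mymod         # no-op: satisfied straight from sys.modules
    print(mymod.VALUE)   # still 42, even though the source path is gone

    reload(mymod)        # ImportError (use importlib.reload on python 3)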

What happens if the module is inaccessible when you first try to load it, but becomes accessible later? Does it use the previously cached "oh, it's not there" state, or does it retry?

Wow, it looks like it still uses the cached “oh it’s not there” state. I tested this from a python shell, so this is standard python behavior. I’ve confirmed it works this way in Deadline too. Restarting the slave is enough to fix it on my end though.
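For what it's worth, my best guess at the mechanism (I haven't verified this against the interpreter source) is python 2's sys.path_importer_cache: when a sys.path entry is unreachable, a NullImporter gets cached for it, and that cache entry is never retried for the life of the process. If that's right, clearing the cache should recover without a restart, something like:

    import sys

    try:
        import scanline              # fails while the share is unreachable
    except ImportError:
        # once the share is reachable again, dropping the cached importers
        # forces the next import attempt to actually hit the disk:
        sys.path_importer_cache.clear()
        import scanline              # should now succeed, no restart needed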

This is just another reason why we eventually want to sandbox the running of scripts so that every script runs inside a clean environment…

vote +1 :slight_smile:

Do you guys have an ETA for that, btw? It seems like this is a recurring problem… if we have any kind of network connectivity issue on the machine, it gets into a state where libraries appear 'corrupted', and the machine is unable to render even after the network is reestablished.

We’re probably going to be looking at this for Deadline 7. We can’t do it during the 6.x cycle because this will require some significant changes for the render plugin and event plugin systems.

What workaround do you suggest in the interim? We currently see 1000+ errors like this per day.

Should we restart the slave every hour or so?

Is the central server that your python modules sit on accessed from a mapped drive? If so, you could probably set up the drive mapping in Deadline to map the drives before every job:
thinkboxsoftware.com/deadlin … ped_Drives

Also, once the module is imported, you shouldn’t hit these errors anymore. So that would imply that the slaves that report these errors are unable to access your python modules when they start up. Do your mapped drives auto-mount on login? Is it possible that the slave is starting before the drives finish mounting?

We had another thought here. What if your python scripts simply checked that the module file exists before importing it? Then, if it doesn’t, you skip the import and just throw an error.
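For example, something like this (the share path and package layout here are hypothetical, and it assumes the share is already on sys.path):

    import os

    # hypothetical location of your central python modules
    SCANLINE_ROOT = r'\\server\pipeline\python'

    if os.path.isfile(os.path.join(SCANLINE_ROOT, 'scanline', '__init__.py')):
        import scanline
    else:
        # fail loudly instead of poisoning the import machinery's caches
        raise RuntimeError('module share unreachable: %s' % SCANLINE_ROOT)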

They are UNC paths; we are not using mapped drives. Our central python site-customize system works across all operating systems, python versions, and several host applications, so it would be tricky to special-case it for mapped drives.

It would be prohibitively hard to wrap every import, as we have hundreds / thousands of interconnected python scripts calling each other randomly :-\

Is it an option to just do this check from within the scripts you are running through Deadline?

Also, just to confirm, these errors shouldn’t be cropping up once the slave has rendered successful jobs right, since those modules are already loaded? Or does a good slave at some point become a bad slave and needs to be restarted?

We are importing central libraries that import other libraries, which in turn import other libraries; it's not simple to map out what uses what.

It usually happens when a machine gets rebooted, which tells me that it might not have authenticated with the server yet.

It also happens (albeit more rarely) once the machine has already been rendering. I suspect it might be that the machine has only rendered certain types of jobs so far, and hasn't yet had to import some library that a later job needs.

This is a rather hackish workaround, but I'll throw it out there anyway… Rather than having the Launcher start automatically, use a startup script that repeatedly attempts to scan the server folders where the python modules are stored. Once the script can access the folders, it starts the Launcher. This should guarantee that the python modules are accessible before any part of your pipeline executed via the Slave tries to access them. The startup script could also have a timeout that alerts an administrator if it still cannot reach the server folder after a given period.
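A rough sketch of what that startup script might look like (the paths, poll interval, and timeout are all placeholders, and the alert would be whatever mechanism you already use):

    import os
    import subprocess
    import time

    MODULE_SHARE = r'\\server\pipeline\python'   # hypothetical UNC path
    LAUNCHER = r'C:\Program Files\Thinkbox\Deadline6\bin\deadlinelauncher.exe'
    TIMEOUT = 600                                # seconds before alerting

    start = time.time()
    while not os.path.isdir(MODULE_SHARE):
        if time.time() - start > TIMEOUT:
            # swap this for an email/monitoring hook to your admins
            raise SystemExit('module share unreachable after %ds' % TIMEOUT)
        time.sleep(5)

    # only start the Launcher once the modules are actually visible
    subprocess.Popen([LAUNCHER])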

Another option would be to synchronize all the python modules locally on the slaves, but that would be a significant departure from your current pipeline.
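If you did go that route on Windows, even a simple robocopy mirror at machine startup might be enough (paths here are placeholders):

    import subprocess

    # /MIR mirrors the tree (copies new files, deletes stale local ones);
    # robocopy exit codes below 8 all indicate success
    ret = subprocess.call([
        'robocopy', r'\\server\pipeline\python', r'C:\pipeline\python',
        '/MIR', '/R:3', '/W:5',
    ])
    if ret >= 8:
        raise RuntimeError('module sync failed (robocopy exit code %d)' % ret)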

Thanks for the suggestions James!

While it is hackish, it might be the only way of doing this right now. We already have a central ‘machine manager’ mechanism that does regular health checks, so it might be trivial to build this into that.

As for synchronizing the scripts, that was raised internally before as well (actually, about a year ago, to reduce server load), but in general I'm "one of those people" who hates localization. It introduces a nightmare for maintenance and troubleshooting (say, a dll version of a python lib is held open by a crashed process, and the new version never gets synced to the machine… you get random crashes and you don't know why, etc.). But it might come to that, we will see. I hope not :slight_smile:
