The job has been sent to DL10 to render with AWS instances. When I connect to the slave log I see CheckPathMapping: Swapped “\xxxxxxxx”, which is usually fine, but it never stops remapping.
I can see the same files being remapped and cycling round.
I’ve given it at least 20 mins to finish, but it doesn’t seem to want to stop. All the slaves are doing the same.
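To illustrate the kind of cycling I’m seeing: it behaves roughly as if two mapping rules keep undoing each other on every pass. A made-up sketch (the rules and paths are invented, and this is not Deadline’s actual CheckPathMapping code):

```python
# Made-up illustration of a remap loop; rules and paths are not real and this
# is not Deadline's actual CheckPathMapping code.
rules = [
    (r"\\oldserver\projects", r"\\gateway\projects"),
    (r"\\gateway\projects", r"\\oldserver\projects"),  # conflicting rule that undoes the first
]

def remap_once(path, rules):
    """Apply the first rule whose source prefix matches and report whether a swap happened."""
    for src, dst in rules:
        if path.lower().startswith(src.lower()):
            return dst + path[len(src):], True
    return path, False

path = r"\\oldserver\projects\shot010\scene.max"
for i in range(10):  # cap the number of passes so the demo terminates
    path, swapped = remap_once(path, rules)
    print(f"pass {i}: {path}")
    if not swapped:
        break
# With conflicting rules like these the path toggles between the two servers on
# every pass, which shows up as an endless stream of "Swapped ..." log lines.
```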
That spin used to happen when the Slave was waiting for files to copy over. I believe we have since fixed that.
We do have an outstanding problem where the database can think it has a file when it really doesn’t (this can happen when files are deleted unexpectedly). The workaround at the moment is to stop the infrastructure and restart it as that will destroy the Gateway where the asset cache database lives. Currently that will mean you have to re-sync your assets into S3.
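If you do end up having to re-sync, the basic idea is just walking the local asset folder and re-uploading each file. A minimal boto3 sketch with placeholder bucket name and paths (the AWS Portal asset tooling normally handles this for you; this only shows what “re-sync” amounts to):

```python
# Minimal re-upload sketch using boto3; the bucket name and local path are placeholders.
import os
import boto3

s3 = boto3.client("s3")
bucket = "my-awsportal-asset-bucket"   # placeholder
local_root = r"D:\Projects\Assets"     # placeholder

for dirpath, _dirnames, filenames in os.walk(local_root):
    for name in filenames:
        local_path = os.path.join(dirpath, name)
        # Build an S3 key relative to the root, using forward slashes.
        key = os.path.relpath(local_path, local_root).replace(os.sep, "/")
        s3.upload_file(local_path, bucket, key)
        print(f"uploaded {local_path} -> s3://{bucket}/{key}")
```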
Restarting the Gateway in AWS seems to have made it sort itself out.
I had to delete and recreate all the awsportal users and policies just to check everything was working OK, but now when I submit a new job I keep getting errors about the bucket not existing any more (I have created a new bucket, not actually deleted any).
Any idea what it’s trying to do? This was a fresh job submission to Deadline.
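For reference, a head_bucket check like the sketch below (bucket name is a placeholder) is one way to tell a genuinely missing bucket (404) apart from a permissions problem (403), in case that’s relevant:

```python
# Quick sanity check (boto3) that the credentials on the submitting machine can
# actually see the bucket; the bucket name is a placeholder.
import boto3
from botocore.exceptions import ClientError

def bucket_reachable(name):
    try:
        boto3.client("s3").head_bucket(Bucket=name)
        return True
    except ClientError as err:
        # A 404 means the bucket really doesn't exist; a 403 means it exists but
        # these credentials can't access it, which can look like "not existing".
        print(err.response["Error"]["Code"], err.response["Error"]["Message"])
        return False

print(bucket_reachable("my-awsportal-asset-bucket"))  # placeholder name
```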
Re-creating the users seems like a bit of overkill, but that reboot of the Gateway fixing things is interesting… it means we have an in-memory issue rather than an on-disk one.
The errors about things not existing are in fact not about the bucket but about the “region” in the path mapping code. We use regions to conceptualize groupings of machines that are distant from each other; they usually have their own file servers, so they need custom drive and path mappings.
Modifying the asset server settings should reset things, but their getting out of sync is unexpected on my side. I’ll have to ask those smarter than me!
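To give a rough picture of what I mean by region-specific mappings, here’s a toy sketch (the region names, servers and rules are all made up; this is not the actual implementation):

```python
# Toy illustration of per-region path mapping; region names, servers and rules
# are invented and this is not Deadline's actual implementation.
REGION_MAPPINGS = {
    "on-premise": [(r"\\farmserver\projects", r"P:\projects")],
    "aws":        [(r"\\farmserver\projects", r"\\gateway\assets\projects")],
}

def map_path(path, region):
    rules = REGION_MAPPINGS.get(region)
    if rules is None:
        # This is the failure mode described above: the *region* is missing,
        # which surfaces as a "doesn't exist" style error even though the
        # bucket itself is fine.
        raise KeyError(f"No path mapping rules for region '{region}'")
    for src, dst in rules:
        if path.lower().startswith(src.lower()):
            return dst + path[len(src):]
    return path

print(map_path(r"\\farmserver\projects\shot010\scene.max", "aws"))
```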
I found some errors in the CloudWatch logs that said the file uploads were getting a permission denied error (the access key ID supposedly didn’t exist any more, even though it did), which is why I recreated everything.
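In hindsight, rather than recreating everything, an STS identity check with those exact credentials would have told me whether the access key really was gone. A minimal sketch (the key values are placeholders):

```python
# Check whether an access key pair is still valid via STS; the key values are
# placeholders and would come from the AWS Portal / asset server settings.
import boto3
from botocore.exceptions import ClientError

sts = boto3.client(
    "sts",
    aws_access_key_id="AKIA...PLACEHOLDER",
    aws_secret_access_key="PLACEHOLDER-SECRET",
)
try:
    identity = sts.get_caller_identity()
    print("key is valid for", identity["Arn"])
except ClientError as err:
    # An "InvalidClientTokenId" style error here would mean the access key ID
    # really no longer exists; permission-denied on S3 uploads with a valid key
    # points at the bucket or IAM policy instead.
    print("key check failed:", err.response["Error"]["Code"])
```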
Attached are some logs from the weekend render.
I did notice that if the slaves went offline or rebooted, they wouldn’t be able to find the maxfile to render from anymore, so I had to chunk out the submission to stop any errors overnight.
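By “chunk out” I just mean splitting the frame range into smaller submissions, roughly along these lines (the range and chunk size are only examples):

```python
# Simple frame-range chunking helper; the frame range and chunk size are examples.
def chunk_frames(start, end, chunk_size):
    """Yield (first, last) frame pairs covering start..end inclusive."""
    frame = start
    while frame <= end:
        yield frame, min(frame + chunk_size - 1, end)
        frame += chunk_size

for first, last in chunk_frames(1, 100, 25):
    print(f"submit job for frames {first}-{last}")
# -> 1-25, 26-50, 51-75, 76-100
```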
Now that the render is out of the way, I’m going to reinstall 10.0.20.2, create a new S3 bucket and see if it gives me the same errors.