We had a horrible server outage recently which messed up MongoDB. We restored a previous version from a backup and it was working fine for a week or two, but now I am having issues looking at logs. I get the following when I click on entries in ‘View job Reports’ through Deadline Monitor.
Error occurred while writing report log: Could not find a part of the path '\\dsmail\DeadlineRepository6\reports\jobs\f8\4\54dcb4ef807f15080c3a3f84.bz2'. (System.IO.DirectoryNotFoundException)
I get this for every single entry. I tried doing a repair after stopping the service (mongod --repair --dbpath "blah blah") and it seemed to work OK, but I still cannot see logs. I am really out of my element with this and am not sure what to do next.
Can I ask if your repository directory was also affected? Can you verify how far up that path things break? I am curious if it’s just looking for a file that doesn’t exist, or if there is a deeper issue. Thanks.
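One way to check how far up the path things break is a small script along these lines. This is only a sketch: the function name `deepest_existing` is made up for illustration, and it assumes it is run from a machine that can reach the repository share.

```python
from pathlib import Path


def deepest_existing(path):
    """Walk up from `path` until an existing folder is found.

    Returns (deepest existing ancestor, list of missing parts below it).
    """
    p = Path(path)
    missing = []
    while not p.exists():
        if p.parent == p:
            # Reached the root and even that is unreachable.
            break
        missing.append(p.name)
        p = p.parent
    return p, list(reversed(missing))


# Example call, using the path from the error message above:
# deepest_existing(r"\\dsmail\DeadlineRepository6\reports\jobs\f8\4\54dcb4ef807f15080c3a3f84.bz2")
```

If only the .bz2 file comes back as missing, it is just a file that does not exist. If whole folders such as reports or jobs are missing, the repository itself was likely affected.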
I am not certain. Essentially Deadline was running on a VM. The VM did a hard shut down for reasons I will not go into while Deadline was running.
I assumed it just messed up the Mongo Database because it was in the middle of doing ‘stuff’. Is there a way I can verify the integrity of the repository? Also please keep in mind this was 1-2 weeks ago. I don’t know if that is a factor or not.
I am open to any solution. If I can somehow save the data and do a reinstall of Deadline Client I am perfectly happy to do that too. I am not sure how to even diagnose the issue.
You can re-install the client at any time since it pulls its status from the Repository.
The Repo and Database however are different animals. To make sure the Repository is back in shape, just run the installer again and say “connect to existing Repository”. As long as any customized Python scripts are within the ‘custom’ folder, that’s totally safe. If you haven’t done any custom Python programming, you’re also safe.
For the database, you can just stop the DB and copy the files under ‘application/data’. Putting them back later will get you back to where you were before re-installing the database if that’s something you want to try.
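If it helps, the copy step could be scripted roughly like this. It is a minimal sketch, not an official Deadline tool: `backup_data_dir` is a hypothetical helper, both paths are placeholders you would fill in yourself, and it assumes the database has already been stopped before it runs.

```python
import shutil
import time
from pathlib import Path


def backup_data_dir(data_dir, backup_root):
    """Copy the (stopped) database's data folder to a timestamped backup."""
    src = Path(data_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"data folder not found: {src}")
    # Timestamp the copy so repeated backups do not overwrite each other.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"{src.name}-{stamp}"
    shutil.copytree(src, dest)
    return dest


# Example call with placeholder paths (adjust to your install):
# backup_data_dir(r"<database install>\application\data", r"D:\backups")
```

Copying the folder back over the original, again with the database stopped, is the restore.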
I reinstalled the repository and then copied all of our custom plugins back. It worked and I got my logs back for maybe an hour or two. Now it is back to how it was before.
There is a slaves folder. It contains a folder for one slave out of our 40 machines. I take it this should contain all the slaves? What should I do to get them to repopulate?
No, that is actually an indication of a larger issue. We have seen this a fair bit since 6.0 was released last year. Most commonly it happens when there is a rogue 5.x slave on the farm. What you can do is delete the /slaves folder in your repository and then wait 5 minutes or so. If a new slaves folder appears, open it and note the name of each directory. Those are the hostnames of the machines that still have 5.x installed and are connecting to this newer repository. Uninstall 5.x on those machines, then reinstall your repository, pointing it to the existing database, and you will be good to go.
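For the "note the name of each directory" step, something like the following would print the hostnames after the wait. It is just a sketch: `list_slave_folders` is a hypothetical helper, it assumes the folder is named slaves directly under the repository root as described above, and it only reads the folder (the delete is still done by hand).

```python
from pathlib import Path


def list_slave_folders(repo_root):
    """Return the sorted names of the directories under <repo_root>/slaves."""
    slaves = Path(repo_root) / "slaves"
    if not slaves.is_dir():
        # The folder has not been recreated, so nothing is writing to it.
        return []
    return sorted(d.name for d in slaves.iterdir() if d.is_dir())


# Example call, using the repository share from the error message:
# for host in list_slave_folders(r"\\dsmail\DeadlineRepository6"):
#     print(host)
```

An empty list after the wait would suggest no 5.x machine is writing to the repository.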
I found the stray machine, got rid of Deadline 5 slave and did a reinstall. The farm has been working perfectly for a few days now, so I am declaring victory!
Thanks so much for your help, it is greatly appreciated!