On Sunday night our server filled up, and for some reason this crashed all of the Deadline clients. Now that we have freed up space, the nodes, including Pulse, are still crashing. I’m not exactly sure how long they run before they crash… It looks like each one takes one task and then dumps. The repository lives on a separate server, so there were no space issues there.
For Pulse, I tried deleting its entry from the repository and letting it recreate itself, but the same issue still arises. I have not tried deleting all the nodes from the repository yet.
I’m running 5.1.0.46398 on Windows 7 with a Linux Repository…
Can you send us a Pulse log after it crashes? You can find the logs locally on the Pulse machine at %programdata%\thinkbox\deadline\logs. We can take a look to see if there is anything that explains the crash.
Our Pulse is running on an XP machine. I looked in C:\Documents and Settings\render\Local Settings\Application Data\Thinkbox\Deadline but I don’t see any logs. But I did find logs in C:\Documents and Settings\render\Local Settings\Temp.
Thanks for the logs! It looks like Pulse crashes when trying to purge the temp directory of the repository. Can you go to \your\repository\temp and manually delete everything in there? Also, maybe do the same for \your\repository\trash. After doing that, launch Pulse again and see if things improve.
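If there are too many files in that folder to clear comfortably by hand, a small script can do it. Here is a minimal Python sketch, not anything that ships with Deadline: the function name `purge_folder` is made up for illustration, and you would pass in your own repository temp (or trash) path. It only touches regular files directly inside the folder and leaves subfolders alone.

```python
import os

def purge_folder(folder, dry_run=True):
    """Delete regular files directly inside `folder`.

    Hypothetical helper, not part of Deadline.
    Returns (count, failures): count is the number of files deleted
    (or that would be deleted in a dry run), failures is a list of
    (path, error message) pairs for files that could not be removed.
    """
    # Collect the paths first so we are not deleting while iterating.
    with os.scandir(folder) as entries:
        paths = [e.path for e in entries if e.is_file(follow_symlinks=False)]

    count, failures = 0, []
    for path in paths:
        if dry_run:
            count += 1          # only report what would be removed
            continue
        try:
            os.remove(path)
            count += 1
        except OSError as exc:  # e.g. permission denied, file locked
            failures.append((path, str(exc)))
    return count, failures
```

Run it once with the default `dry_run=True` to see how many files it would remove, then again with `dry_run=False`. Anything that comes back in the failures list would be worth mentioning here, since it may point at what is blocking the cleanup.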
So the problem came back. I checked the Temp folder in the Repository and there were 86,740 files. I’m deleting them all now hoping that it resolves the issue again. Does it sound like something is not cleaning up properly?
Hmm, I wonder if something is preventing that node from deleting those files. The slave is supposed to delete that file right after creating it. If you open that folder from the node, can you manually delete the files?
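To test this from the node without clicking through Explorer, something like the following Python sketch could be run on that machine under the same user account as the slave. It is only an approximation of the create-then-delete pattern described above, not Deadline's actual code, and the function name and the `deleteCheck_` file prefix are made up for illustration.

```python
import os
import uuid

def can_create_and_delete(folder):
    """Create a small file in `folder`, then delete it straight away.

    Hypothetical check, not part of Deadline.
    Returns (ok, message) so the failing step is easy to report.
    """
    # Unique name so we never collide with a real file in the folder.
    path = os.path.join(folder, "deleteCheck_" + uuid.uuid4().hex + ".tmp")
    try:
        with open(path, "w") as handle:
            handle.write("test")
    except OSError as exc:
        return False, "create failed: %s" % exc
    try:
        os.remove(path)
    except OSError as exc:
        return False, "delete failed: %s" % exc
    if os.path.exists(path):
        return False, "file still present after delete"
    return True, "ok"
```

If this reports a delete failure, the error text should say whether it is a permissions problem or something else, such as the file being locked.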
We’re also experiencing this issue. Pulse will come on for anywhere between 10 seconds and 20 minutes and then bomb out. Some of the slaves have also been crashing out completely and have to be restarted manually (they leave a Windows debug window up that seems to prevent control of the slave from Deadline). The one slave that left the window open showed its last action as deleting objects from the temp folder.
I’ve just deleted 50,000 timeCheck files from the temp folder, all of which came from the slave on my machine. I have full admin permissions, so it seems strange that it wouldn’t be able to delete these files…?
Out of curiosity, which OS are you running on the Repository, Pulse, and on your machine? Assuming you’re on Windows, are you running the Slave on your machine as a Service (and/or as a different user)?
Also, do you remember what the time spread on the files was (i.e., were the 50k generated over a week or two, or was it more like several months/years)? I know you deleted them all, but if we knew they were all generated in quick succession from the same Slave, it might help us pin this down.
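If the files pile up again, it would help to capture that spread before deleting them. A rough Python sketch for doing so is below. It assumes the files can be recognised by "timeCheck" appearing in the file name (going by how they were described above) and that the modification time is a fair stand-in for when each one was written. The function name is made up for illustration.

```python
import os
from collections import Counter
from datetime import datetime

def time_spread(folder, name_contains="timeCheck"):
    """Count matching files per modification day.

    Hypothetical helper, not part of Deadline.
    Returns (per_day, oldest, newest): per_day maps "YYYY-MM-DD" to a
    file count; oldest and newest are datetimes (None if nothing matched).
    """
    per_day = Counter()
    oldest = newest = None
    with os.scandir(folder) as entries:
        for entry in entries:
            if not entry.is_file() or name_contains not in entry.name:
                continue
            modified = datetime.fromtimestamp(entry.stat().st_mtime)
            per_day[modified.date().isoformat()] += 1
            if oldest is None or modified < oldest:
                oldest = modified
            if newest is None or modified > newest:
                newest = modified
    return per_day, oldest, newest
```

A handful of days with thousands of files each would suggest a burst, while a steady trickle over months would point to cleanup never having worked on that machine.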
Repository is on Windows 2008 Storage Server.
Pulse is on Windows 2003 Server (SP 2).
Slave is on Windows 7, not running as a service, logged in as me with full admin rights.
The time spread was at the very least a few days, but I couldn’t tell you exactly how long. I’ve always had weird issues with my machine being picked up as stalled even though I can see it’s still rendering.
Pulse has now been up for over 4 hours, so I’m hoping this was the problem.
Interesting. If this problem comes up again, and there’s again a flood of ‘timeCheck’ files in the temp folder, I’d recommend trying to move Pulse to a machine running Windows Server 2008 or later (your current Repo machine would work). I’ve been poking around, and apparently the .NET function we use to delete files behaves differently on older Windows OSes (the reports I found specifically named XP, though I imagine it might apply to Server 2003 as well), and that might cause issues with Pulse. I’m not totally sure that’s the issue, but it’s enough to warrant a shot if this comes up again.