Pulse crashes while an event uses an API connection to update job details in another repository

Hi,

We have some issues with Pulse in our studios. We are using Deadline in two countries and transfer jobs between the two repositories. We are also using an event to update the render status of a “remote job” (a transferred job, which is represented by an empty dummy job in the local repository).
It was working fine for a while, but a week ago Pulse started crashing (freezing without responding to anything) whenever our local team submitted a job to our local repository while a “remote job” was running. When this happens, Pulse has to be restarted to get our scripts working again.
We are also using the Python API for some submissions to the local repository (adjusted to our pipeline needs).
The problem doesn’t seem to appear if the local submissions go through the command line tool rather than the Python API.

If we disable the remote update event, everything runs fine again without any crashes.

Here is part of the code we are using to update the representation job in the other country:

from System.Diagnostics import *
from System.IO import *
from System import TimeSpan

from Deadline.Events import *
from Deadline.Scripting import *
from FranticX.Utils import *

import api.Deadline.DeadlineConnect as Connect

#[...]

class RemoteUpdateEvent (DeadlineEventListener):

    def __init__( self ):
        self.OnJobFinishedCallback   += self.OnJobFinished
        #[...]

    def Cleanup( self ):
        del self.OnJobFinishedCallback
        #[...]

    def OnJobFinished( self, Job ):
        self.UpdateRepresenterJob(Job, "completed")

    # Not implemented by Deadline
    #def OnTaskFinished( self, Job):
    #    self.UpdateRepresenterJob(Job, "tasks")
            
    def UpdateRepresenterJob( self, Job, operation ):
        repoAddress = Job.GetJobPluginInfoKeyValue( "RepoAddress" )
        repoPort      = Job.GetJobPluginInfoKeyValue( "RepoPort" )
        extJobId      = Job.GetJobPluginInfoKeyValue( "RenderJobID" )
        
        # Stop the event if the job is NOT a remote job
        # (a remote job must carry all three plugin info values)
        if not all([repoAddress, repoPort, extJobId]):
            return

        repository = Connect.DeadlineCon(repoAddress, repoPort)

        if   operation == "completed":
            repository.Jobs.CompleteJob(extJobId)
    
        #[...]

Any ideas why this happens and how we could fix it?

Thanks,

Michael

Hey Michael,

Do you have this same Event Plugin present in both repositories? Are the jobs in each repo ‘pointing’ to each other (i.e., a two-way mapping as opposed to one-way)? If so, it might just be that the Pulses are getting into an infinite loop of completing the other repository’s jobs: calling repository.Jobs.CompleteJob triggers the OnJobFinished event on the remote repo, which would run through this event and try to complete the job in the originating repo, and the chain could keep going until one of the Pulses is forcibly stopped.

I’m not sure that’s the case, though, because Deadline should be checking if the job is already completed before triggering events and stuff. Which version of Deadline are you running (also, is it the same version in both repos)? There might’ve been a regression in regard to that check at some point, which may have caused this bug to surface.
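
If the two-way scenario does turn out to be what is happening, a guard along these lines might break the chain. This is only a minimal sketch of the idea, not Deadline code: `forward_completion`, `get_remote_status`, `complete_remote_job`, the `COMPLETED` value and the `FakeRepo` stand-in are all hypothetical names made up for illustration, and how the remote status is actually fetched depends on your API version.

```python
# Hypothetical guard: forward a completion only if the remote job is not
# already completed. None of these names come from the Deadline API.

COMPLETED = "completed"  # placeholder status value used by this sketch


def forward_completion(ext_job_id, get_remote_status, complete_remote_job):
    """Complete the remote job at most once; return True if a call was made."""
    if get_remote_status(ext_job_id) == COMPLETED:
        # Already completed: skip the call so no further event is triggered.
        return False
    complete_remote_job(ext_job_id)
    return True


class FakeRepo(object):
    """In-memory stand-in for a remote repository, for illustration only."""

    def __init__(self):
        self.status = {}
        self.calls = 0

    def get_status(self, job_id):
        return self.status.get(job_id, "active")

    def complete(self, job_id):
        self.calls += 1
        self.status[job_id] = COMPLETED


repo = FakeRepo()
first = forward_completion("job-1", repo.get_status, repo.complete)
second = forward_completion("job-1", repo.get_status, repo.complete)
```

The second call is skipped because the stand-in already reports the job as completed, so a pair of event plugins using a check like this would stop after one round trip.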

Cheers,
Jon

Hey Jon,

Only the actual render job triggers events and updates the “dummy” job. Also, only the Pulse of the repository holding the real render job is crashing. We have removed the feature for now so that we can keep working without further issues.
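
If the feature is brought back later, one option might be to run the remote call in a worker thread with a timeout, so that an unresponsive remote repository cannot hold up the event handler. This assumes the freeze comes from a blocking API call, which has not been confirmed. The helper `call_with_timeout` and the two demo functions are hypothetical, written only for this sketch with the standard library.

```python
import threading
import time


def call_with_timeout(func, timeout_seconds, *args):
    """Run func(*args) in a daemon thread.

    Returns (finished, result, error). If the call does not finish within
    timeout_seconds, finished is False and the thread is left running in
    the background; it is NOT cancelled.
    """
    outcome = {}

    def worker():
        try:
            outcome["result"] = func(*args)
        except Exception as exc:  # keep the error for the caller to log
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.daemon = True  # do not keep the process alive for a hung call
    thread.start()
    thread.join(timeout_seconds)
    if thread.is_alive():
        return False, None, None
    return True, outcome.get("result"), outcome.get("error")


# Demo functions standing in for a fast and a hanging remote call.
def fast_call(job_id):
    return "completed " + job_id


def slow_call(job_id):
    time.sleep(0.5)
    return "completed " + job_id


finished_fast, result_fast, error_fast = call_with_timeout(fast_call, 1.0, "job-1")
finished_slow, result_slow, error_slow = call_with_timeout(slow_call, 0.1, "job-1")
```

In the event above, the idea would be to pass `repository.Jobs.CompleteJob` and `extJobId` to such a helper instead of calling it directly, and to log a warning when `finished` comes back as `False`. Whether threads behave well inside Pulse's Python environment is something we would still have to test.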

At the moment we are still running the latest version of Deadline 6, but we will upgrade in the next few weeks.
Were any big changes made in version 7 that influence the behavior of Pulse and its stability?

Cheers,
Michael

Pulse is wildly unstable in 7.0 once job counts get over a few thousand, but some of the issues have supposedly been stabilized in 7.1. We’re waiting to wrap our current show before upgrading, so I don’t know for sure how it will behave.

Hey Nathan, do you know of other companies having stability problems? You folks are doing some interesting things with the database objects, but we have clients using Pulse for pretty regular operations on Linux, and they aren’t having problems to my knowledge.

We definitely want to pool resources if we can reproduce a problem.

Speaking of reproducing, Michael, do you know if there’s a good way for us to reproduce your problem here locally? Is that submission script portable enough for us to have a copy?

I don’t, but I also don’t know (firsthand) of any other medium-to-large studios running Deadline on Linux, and even if there are, they likely aren’t utilizing Pulse as heavily as we are to submit jobs, perform API operations, etc.

I think I’ve just discovered that the Deadline executables are actually 32 bit on Linux, which is really a shame, and may explain some of the instability on our end.

One client was running something every 15 minutes on hundreds of nodes. I believe their Pulse was running on Linux… Will discuss with them.

Keeping this thread on-topic (sorry for steering it off), reproducing steps would be good.