Hi everyone,
I have a rather healthy DL setup with only 2 workers for now. It has been working wonderfully for a long time (6+ months).
It's not my own setup; I built it for a friend's tiny studio.
The repo is on a cloud server, while the 2 workers are physical machines in his office.
Lately I noticed a massive slowdown in queue processing. Looking into the numbers, I see that tasks take an extremely long time to start, mainly the first frames that go to each machine. The following frames are fine.
Here you can see 6 tasks. The first two went to different machines and each took a long time to start. The following frames started quickly.
Looking into the task report, I see that the time is lost between these steps.
What could it be? There are heavy AUX files (500 MB to 1 GB) attached to the job that need to be copied from the repo to the workers. This is my first suspicion. I can't test the speeds yet; I will probably get full access to the environment next week.
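Once I have access, my plan is to time a plain file copy from the repo share to a worker's local disk, to see whether the transfer alone explains the delay. A rough sketch of what I have in mind is below. The function name `timed_copy` and the paths in the usage comment are my own placeholders, not anything from DL itself, and this only measures a raw copy, not whatever DL does internally.

```python
import shutil
import time
from pathlib import Path


def timed_copy(src: Path, dst: Path) -> tuple[float, float]:
    """Copy src to dst and return (elapsed seconds, throughput in MB/s)."""
    size_mb = src.stat().st_size / (1024 * 1024)
    start = time.perf_counter()
    shutil.copyfile(src, dst)
    elapsed = time.perf_counter() - start
    # Guard against a zero-length timing on very small or cached files.
    rate = size_mb / elapsed if elapsed > 0 else float("inf")
    return elapsed, rate


# Hypothetical usage on a worker (both paths are placeholders):
# elapsed, rate = timed_copy(Path(r"\\repo-server\share\aux_file.bin"),
#                            Path(r"C:\temp\aux_file.bin"))
# print(f"{elapsed:.1f} s, {rate:.1f} MB/s")
```

If a 1 GB file takes about as long to copy as the gap I see in the task report, the AUX transfer is the likely culprit; if the copy is much faster than that, the time is going somewhere else.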
In the meantime, I'm trying to find out: does anyone know what happens between these steps in task processing?
Thanks for any tips :)