
[Linux] Job ignored, regardless of pool/group settings

I know there is an ongoing discussion of the “job not being picked up” issue in another thread, but I figured it would be cleaner for me to start my own.

I’m seeing the same issue where a job is in the queue, but is ignored by the available slaves.

Initially, I submitted the job to the ‘none’ group and pool. The slave I was testing with had no groups/pools assigned, and it ignored the job. The JSON dump of the job at that point is:

{ "Arch" : false, "Aux" : [ ], "Bad" : [ ], "CompletedChunks" : 0, "Date" : ISODate("2012-10-15T21:27:08.800Z"), "DateComp" : ISODate("0001-01-01T00:00:00Z"), "DateStart" : ISODate("0001-01-01T00:00:00Z"), "Errs" : 0, "FailedChunks" : 0, "IsSub" : true, "LastWriteTime" : ISODate("2012-10-15T21:27:09.172Z"), "Mach" : "ws-vm02", "OutDir" : [ "/home/ruschn/Videos/LocalHeroAlexaTest-Day-LogC-tif" ], "OutFile" : [ "LocalHeroAlexaTest-Day-LogC.%04d.tif" ], "PendingChunks" : 0, "Plug" : "FFmpeg", "PlugInfo" : { "InputFile0" : "/home/ruschn/Videos/LocalHeroAlexaTest-Day-LogC.mov", "OutputFile" : "/home/ruschn/Videos/LocalHeroAlexaTest-Day-LogC-tif/LocalHeroAlexaTest-Day-LogC.%04d.tif", "OutputArgs" : "-vcodec tiff", "UseSameInputArgs" : "False" }, "Props" : { "Name" : "mov to tiff test 1", "User" : "ruschn", "Cmmt" : "", "CmmtTag" : "", "Dept" : "", "Frames" : "0", "Chunk" : 1, "Tasks" : 1, "Grp" : "none", "Pool" : "none", "Pri" : 50, "Conc" : 1, "ConcLimt" : true, "AuxSync" : false, "Int" : false, "Seq" : false, "Reload" : false, "NoEvnt" : false, "OnComp" : 2, "AutoTime" : false, "TimeScrpt" : false, "MinTime" : 0, "MaxTime" : 0, "Timeout" : 1, "Dep" : [ ], "DepFrame" : false, "DepComp" : true, "DepDel" : false, "DepFail" : false, "DepPer" : -1, "NoBad" : false, "JobFailOvr" : false, "JobFailErr" : 0, "TskFailOvr" : false, "TskFailErr" : 0, "SndWarn" : true, "NotOvr" : false, "SndEmail" : false, "NotEmail" : [ ], "NotUser" : [ "ruschn" ], "NotNote" : "", "Limits" : [ ], "ListedSlaves" : [ ], "White" : false, "MachLmt" : 0, "MachLmtProg" : -1, "PrJobScrp" : "", "PoJobScrp" : "", "PrTskScrp" : "", "PoTskScrp" : "", "Schd" : 0, "SchdDays" : 1, "SchdDate" : ISODate("0001-01-01T00:00:00Z"), "SchdDateRan" : ISODate("0001-01-01T00:00:00Z"), "Ex0" : "", "Ex1" : "", "Ex2" : "", "Ex3" : "", "Ex4" : "", "Ex5" : "", "Ex6" : "", "Ex7" : "", "Ex8" : "", "Ex9" : "", "ExDic" : { } }, "QueuedChunks" : 1, "RenderingChunks" : 0, "Stat" : 1, "SuspendedChunks" : 0, "Tile" : false, "TileFrame" : 0, "TileX" : 0, "TileY" : 0, "_id" : "507c7faca2cb531474547f8b" }
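
For reference, this is roughly how I have been pulling that document back out of Mongo to double-check the pool/group fields. Everything here (host, database name, collection name) is just what my install appears to use, so treat them as assumptions and adjust for your setup:

    from pymongo import MongoClient

    # Assumptions about my local setup -- the host, database name, and
    # collection name may well differ on your install.
    client = MongoClient("mongodb://your-db-host:27017/")
    db = client["deadlinedb"]
    job = db["Jobs"].find_one({"_id": "507c7faca2cb531474547f8b"})

    props = job["Props"]
    print("Pool:", props["Pool"], "Group:", props["Grp"])
    print("Whitelist on:", props["White"], "Listed slaves:", props["ListedSlaves"])
    print("Status:", job["Stat"], "Queued chunks:", job["QueuedChunks"])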

Thinking it may have been an anomaly with the ‘none’ group/pool, I added a group and a pool, both called ‘all’, put the slave in them, and assigned the job to both. However, the slave is still ignoring the job. The JSON for that slave is:

{ "Arch" : "x86_64", "BadJobs" : 0, "CPU" : 2, "Disk" : NumberLong("2749100032"), "DiskStr" : "2.56 GB ", "Grps" : "all", "Host" : "ws-vm02", "IP" : "100.100.200.8", "JobGrp" : "", "JobId" : "", "JobName" : "", "JobPlug" : "", "JobPool" : "", "JobPri" : -1, "JobUser" : "", "LastWriteTime" : ISODate("2012-10-15T21:56:16.306Z"), "Lic" : "@ws-vm01", "LicEx" : 108, "LicFree" : false, "LicPerm" : false, "Limits" : [ ], "MAC" : "08:00:27:93:F2:E6", "Msg" : "2012/10/15 12:34:15 Slave started", "Name" : "ws-vm02", "OS" : "Linux", "OnTskComp" : "Continue Running", "Pools" : "all", "Port" : 35193, "ProcSpd" : NumberLong(2792), "Procs" : 4, "Pulse" : true, "RAM" : NumberLong(2100809728), "RAMFree" : NumberLong(1455345664), "RndTime" : 0, "Stat" : 2, "StatDate" : ISODate("2012-10-15T19:34:16.908Z"), "TskComp" : 0, "TskFail" : 0, "TskId" : "", "TskName" : "", "TskProg" : "", "TskStat" : "", "Up" : 8520.1162109375, "User" : "ruschn", "Ver" : "v6.0.0.48694 R", "Vid" : "InnoTek Systemberatung GmbH VirtualBox Graphics Adapter", "_id" : "ws-vm02" }
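
As far as I understand the pool/group rules, the slave should qualify for this job in both configurations. Here is a rough sketch of the check as I understand it; it is my own approximation of the intended logic, not Deadline’s actual scheduler code:

    def slave_can_take_job(slave, job):
        # My approximation of the pool/group gate only -- limits, whitelists,
        # and machine limits are deliberately ignored here.
        props = job["Props"]
        # Assumption: the slave document stores pools/groups as a
        # comma-separated string (the dump above only shows a single value).
        slave_pools = [p for p in slave["Pools"].split(",") if p]
        slave_groups = [g for g in slave["Grps"].split(",") if g]

        # A job in the 'none' pool/group should be fair game for any slave;
        # otherwise the slave must be a member of the job's pool and group.
        pool_ok = props["Pool"] == "none" or props["Pool"] in slave_pools
        group_ok = props["Grp"] == "none" or props["Grp"] in slave_groups
        return pool_ok and group_ok

With the documents above this comes out True both before and after I added the ‘all’ pool and group, so as far as I can tell the pool/group settings alone shouldn’t be disqualifying the slave.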

Man, we are having absolutely no luck reproducing this problem, but at least three users have reported it now! We’ve looked through your job’s JSON, and Alcium’s here:
viewtopic.php?f=156&t=8278&p=34833#p34827

Nothing stands out. We’ve tried combinations of pools and groups, machine limits > 0, white lists and black lists, and they seem to be behaving properly.

I think the next thing to do is enable Slave Verbose logging. This can be done from the Application Logging section of the Repository Options. After enabling it, restart a slave that should pick up the job. After it has made a few attempts to search for a job, grab the slave log for the current session and post it. You can find the log by selecting Help -> Explore Log Folder from the Slave application. I’m really hoping there is something here that explains what’s going on.
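
If it’s quicker than scrolling through the whole file by hand, a small script along these lines will pull out just the job-related lines from the newest slave log. The folder path and file name pattern below are placeholders, so point them at whatever Help -> Explore Log Folder opens and whatever the logs in there are actually called:

    import glob
    import os

    # Placeholder -- use the folder that Help -> Explore Log Folder opens.
    LOG_DIR = "/path/to/slave/logs"

    # The file name pattern is a guess; adjust it to match the files you
    # actually see in that folder.
    logs = sorted(glob.glob(os.path.join(LOG_DIR, "deadlineslave*.log")),
                  key=os.path.getmtime)

    # Print only the lines from the newest log that mention jobs, which should
    # include whatever the slave reports while it scans the queue.
    with open(logs[-1]) as log_file:
        for line in log_file:
            if "job" in line.lower():
                print(line.rstrip())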

Also, can you confirm that the machine you’re submitting from and the slave are running beta 2? The version number is 6.0.0.48694.

Thanks!

  • Ryan

Fresh setup for both Repository and DB
Here is the JSON for another pair of jobs.
For one of them I have added every possible slave to the whitelist; the other is untouched.

I can arrange remote access to one of our computers if you want to interact with the system directly.
In the meantime I’ll try enabling the verbose log.

Here we go with a log (shortened for display purposes, but it just keeps repeating the same bits).

Remote access would be great!

The slave isn’t complaining about Limits (which was the bug in beta 1), so I think remote access would be the way to go. You can send the info to me directly. Just click on my user name and send me an email.

Thanks!

  • Ryan

The Pulse log might also be of interest:
deadlinepulse(SRV-DEADLINE)-2012-10-16-0002.log (132 KB)

Same result for me after enabling verbose logging. The behavior is the same whether the job’s machine limit is set to 0 or something else.

We’ve finally tracked this down to a bug that only affected the RELEASE builds, which is why we could never reproduce this in our debuggers. We’re hoping to get a new build uploaded tomorrow, but if not tomorrow, definitely before the end of the week.

Quality, well done Ryan.

Given the issues I’m having with this and Mongo, I am going to remove it all and start again for the next beta 3 release.

Mark
