Database corruption v2

Alcium · January 23, 2013, 10:32am

OK, I think I spotted something.

The jobs whose documents start with the _id attribute are always the ones locking Deadline Pulse into a loop (this also affects the deadlinecommand CleanupJobs command).

The only other difference is that these jobs also don't have a "LastWriteTime" attribute, which is why it crashes, I suppose.

Hope this helps

If your db master has a command that would let me quickly remove these stuck jobs from MongoDB, I'd really appreciate it.

Thanks

rrussell · January 23, 2013, 2:08pm

That should definitely be helpful. We’ll see if we can reproduce here. At the very least, a missing last write time shouldn’t result in a crash, so we’ll have to figure out why this is the case.

Thanks!

Ryan

Alcium · January 23, 2013, 6:28pm

More information:
This happens mainly on very short jobs, and only with auto-delete enabled.

Example with 2 very similar jobs:

The one which is locking Pulse into an infinite loop:

{ “_id” : “510025ae04a30340c02f9485”, “Props” : { “Name” : “convert infiniMap : acab15_boispoutre__dsp-col_4k_R.exr”, “User” : “render”, “Cmmt” : “”, “CmmtTag” : “”, “Dept” : “”, “Frames” : “0”, “Chunk” : 1, “Tasks” : 1, “Grp” : “none”, “Pool” : “small_tasks”, “Pri” : 45, “Conc” : 1, “ConcLimt” : true, “AuxSync” : false, “Int” : false, “Seq” : false, “Reload” : false, “NoEvnt” : false, “OnComp” : 1, “AutoTime” : false, “TimeScrpt” : false, “MinTime” : 0, “MaxTime” : 600, “Timeout” : 1, “Dep” : [ ], “DepFrame” : false, “DepComp” : true, “DepDel” : false, “DepFail” : false, “DepPer” : -1, “NoBad” : false, “JobFailOvr” : false, “JobFailErr” : 0, “TskFailOvr” : false, “TskFailErr” : 0, “SndWarn” : true, “NotOvr” : false, “SndEmail” : false, “NotEmail” : [ ], “NotUser” : [ “render” ], “NotNote” : “”, “Limits” : [ ], “ListedSlaves” : [ ], “White” : false, “MachLmt” : 0, “MachLmtProg” : -1, “PrJobScrp” : “”, “PoJobScrp” : “”, “PrTskScrp” : “”, “PoTskScrp” : “”, “Schd” : 0, “SchdDays” : 1, “SchdDate” : ISODate(“0001-01-01T00:00:00Z”), “SchdDateRan” : ISODate(“0001-01-01T00:00:00Z”), “PlugInfo” : { “Version” : “2.7”, “Arguments” : ““M:/Film/_LIBRARY/Prop/acaB15/Textures/MR/acab15_boispoutre__dsp-col_4k_R.exr” “M:/Film/_LIBRARY/Prop/acaB15/Textures/INFMR/acab15_boispoutre__dsp-col_4k_R.exr””, “ScriptFile” : “Z:\WG_Code\mdc\pipeline\scripts\infiniMap\convertInfiniMap.py” }, “Ex0” : “”, “Ex1” : “”, “Ex2” : “”, “Ex3” : “”, “Ex4” : “”, “Ex5” : “”, “Ex6” : “”, “Ex7” : “”, “Ex8” : “”, “Ex9” : “”, “ExDic” : { “Version” : “2.7”, “Arguments” : ““M:/Film/_LIBRARY/Prop/acaB15/Textures/MR/acab15_boispoutre__dsp-col_4k_R.exr” “M:/Film/_LIBRARY/Prop/acaB15/Textures/INFMR/acab15_boispoutre__dsp-col_4k_R.exr””, “ScriptFile” : “Z:\WG_Code\mdc\pipeline\scripts\infiniMap\convertInfiniMap.py” } }, “IsSub” : false, “Mach” : “RENDERHP024”, “Date” : ISODate(“2013-01-23T18:02:22.341Z”), “DateStart” : ISODate(“0001-01-01T00:00:00Z”), “DateComp” : ISODate(“0001-01-01T00:00:00Z”), “Plug” : “Python”, 
“OutDir” : [ ], “OutFile” : [ ], “Tile” : false, “TileFrame” : 0, “TileX” : 0, “TileY” : 0, “Stat” : 1, “Aux” : [ ], “Bad” : [ ], “CompletedChunks” : 0, “QueuedChunks” : 1, “SuspendedChunks” : 0, “RenderingChunks” : 0, “FailedChunks” : 0, “PendingChunks” : 0, “Errs” : 0 }

The one which is fine and will auto-delete soon:

{ “Aux” : [ ], “Bad” : [ ], “CompletedChunks” : 0, “Date” : ISODate(“2013-01-23T18:02:26.210Z”), “DateComp” : ISODate(“0001-01-01T00:00:00Z”), “DateStart” : ISODate(“0001-01-01T00:00:00Z”), “Errs” : 0, “FailedChunks” : 0, “IsSub” : true, “LastWriteTime” : ISODate(“2013-01-23T18:02:26.491Z”), “Mach” : “WALK005”, “OutDir” : [ ], “OutFile” : [ ], “PendingChunks” : 0, “Plug” : “Python”, “Props” : { “Name” : “convert infiniMap : acab15_boispoutre__msk-1-col_4k_B.tga”, “User” : “render”, “Cmmt” : “”, “CmmtTag” : “”, “Dept” : “”, “Frames” : “0”, “Chunk” : 1, “Tasks” : 1, “Grp” : “none”, “Pool” : “small_tasks”, “Pri” : 45, “Conc” : 1, “ConcLimt” : true, “AuxSync” : false, “Int” : false, “Seq” : false, “Reload” : false, “NoEvnt” : false, “OnComp” : 1, “AutoTime” : false, “TimeScrpt” : false, “MinTime” : 0, “MaxTime” : 600, “Timeout” : 1, “Dep” : [ ], “DepFrame” : false, “DepComp” : true, “DepDel” : false, “DepFail” : false, “DepPer” : -1, “NoBad” : false, “JobFailOvr” : false, “JobFailErr” : 0, “TskFailOvr” : false, “TskFailErr” : 0, “SndWarn” : true, “NotOvr” : false, “SndEmail” : false, “NotEmail” : [ ], “NotUser” : [ “render” ], “NotNote” : “”, “Limits” : [ ], “ListedSlaves” : [ ], “White” : false, “MachLmt” : 0, “MachLmtProg” : -1, “PrJobScrp” : “”, “PoJobScrp” : “”, “PrTskScrp” : “”, “PoTskScrp” : “”, “Schd” : 0, “SchdDays” : 1, “SchdDate” : ISODate(“0001-01-01T00:00:00Z”), “SchdDateRan” : ISODate(“0001-01-01T00:00:00Z”), “PlugInfo” : { “Version” : “2.7”, “Arguments” : ““M:/Film/_LIBRARY/Prop/acaB15/Textures/MR/acab15_boispoutre__msk-1-col_4k_B.tga” “M:/Film/_LIBRARY/Prop/acaB15/Textures/INFMR/acab15_boispoutre__msk-1-col_4k_B.exr””, “ScriptFile” : “Z:\WG_Code\mdc\pipeline\scripts\infiniMap\convertInfiniMap.py” }, “Ex0” : “”, “Ex1” : “”, “Ex2” : “”, “Ex3” : “”, “Ex4” : “”, “Ex5” : “”, “Ex6” : “”, “Ex7” : “”, “Ex8” : “”, “Ex9” : “”, “ExDic” : { “Version” : “2.7”, “Arguments” : ““M:/Film/_LIBRARY/Prop/acaB15/Textures/MR/acab15_boispoutre__msk-1-col_4k_B.tga” 
“M:/Film/_LIBRARY/Prop/acaB15/Textures/INFMR/acab15_boispoutre__msk-1-col_4k_B.exr””, “ScriptFile” : “Z:\WG_Code\mdc\pipeline\scripts\infiniMap\convertInfiniMap.py” } }, “QueuedChunks” : 1, “RenderingChunks” : 0, “Stat” : 1, “SuspendedChunks” : 0, “Tile” : false, “TileFrame” : 0, “TileX” : 0, “TileY” : 0, “_id” : “510025b2ea14c30c64a38fe1” }
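
A minimal sketch of the comparison above, assuming the two job documents have been loaded as Python dicts. The helper name `key_differences` and the trimmed stand-in documents are hypothetical; only the `_id` values and the `LastWriteTime` value are copied from the dumps above.

```python
def key_differences(doc_a, doc_b):
    """Return (keys only in doc_a, keys only in doc_b) as sorted lists."""
    keys_a, keys_b = set(doc_a), set(doc_b)
    return sorted(keys_a - keys_b), sorted(keys_b - keys_a)


# Trimmed stand-ins for the two documents above (most fields omitted).
stuck_job = {
    "_id": "510025ae04a30340c02f9485",
    "Props": {},
    "Stat": 1,
}
healthy_job = {
    "_id": "510025b2ea14c30c64a38fe1",
    "Props": {},
    "Stat": 1,
    "LastWriteTime": "2013-01-23T18:02:26.491Z",
}

only_in_stuck, only_in_healthy = key_differences(stuck_job, healthy_job)
print("only in stuck job:", only_in_stuck)
print("only in healthy job:", only_in_healthy)
```

This only compares top-level keys. It says nothing about the order in which the attributes are stored, which is the other difference visible in the dumps.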

rrussell · January 23, 2013, 6:35pm

Can you post the Pulse log when it’s in an infinite loop? We’re starting to dig into this issue now, and the more information we have, the better!

Thanks!

Ryan

Alcium · January 23, 2013, 6:39pm

Pulse repeating

Sorry, Pulse verbose logging wasn't enabled.

I think I can reproduce it easily, but these jobs are urgent, so we disabled auto-delete for now. I'll try to test it with verbose mode as soon as possible.

rrussell · January 23, 2013, 7:04pm

Thanks! In beta 10, we will be making a change that makes it impossible for a job to ever exist without a LastWriteTime property (unless of course someone manually removes that property directly from the database). So any new jobs submitted after upgrading to beta 10 will always have this property set, even if it’s just the default value.

In theory, this should fix the problem you are seeing with Pulse.
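
To illustrate the idea (this is not Deadline's actual code, just a sketch under stated assumptions): a job document that lacks the property gets a default value, while an existing value is left alone. The helper name `ensure_last_write_time` is hypothetical, and the default of 0001-01-01 is an assumption borrowed from the other unset date fields in the dumps above.

```python
from datetime import datetime

# Assumed default: 0001-01-01T00:00:00, the same value the dumps above
# show for unset dates such as DateStart and DateComp.
DEFAULT_DATE = datetime.min


def ensure_last_write_time(job):
    """Give the job dict a LastWriteTime if it has none; keep any existing value."""
    job.setdefault("LastWriteTime", DEFAULT_DATE)
    return job


job_without = ensure_last_write_time({"_id": "510025ae04a30340c02f9485"})
job_with = ensure_last_write_time(
    {"_id": "510025b2ea14c30c64a38fe1",
     "LastWriteTime": datetime(2013, 1, 23, 18, 2, 26)}
)
print(job_without["LastWriteTime"], job_with["LastWriteTime"])
```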

Cheers,

Ryan

Alcium · January 24, 2013, 8:14am

Thanks

I can confirm this happens with auto-delete jobs only.

More than 12,000 jobs ran last night without any trouble.

At least we're testing the new database-backed Deadline heavily, and it is far more responsive than the previous approach.
Congrats, guys!

Alcium · January 24, 2013, 9:24am

The more information we have, the better the fix, so here we go.
An auto-delete job came in from a forgotten manual submit script.

I caught the Pulse log with verbose logging enabled. Luckily it did not stay in an infinite loop, but there's definitely something weird happening.

See the attached log.

I also tried db.Jobs.find( { "LastWriteTime" : { $exists : false } } ) on the MongoDB database, but it did not return any results.
database-log.txt (10.2 KB)
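
For anyone who wants to run the same check on exported job documents outside the mongo shell, here is a minimal sketch assuming the documents are available as a list of Python dicts. The helper name `missing_field` and the sample documents are hypothetical. It mirrors the `$exists : false` semantics for a top-level field: a document matches only if the key is absent, so a key holding a null value still counts as existing.

```python
def missing_field(docs, field):
    """Return the documents that do not contain `field` at the top level."""
    return [doc for doc in docs if field not in doc]


# Hypothetical sample data; only the field name comes from the thread.
jobs = [
    {"_id": "job-a", "LastWriteTime": "2013-01-23T18:02:26.491Z"},
    {"_id": "job-b"},                         # field absent: matches
    {"_id": "job-c", "LastWriteTime": None},  # present but null: no match
]

stuck = missing_field(jobs, "LastWriteTime")
print([doc["_id"] for doc in stuck])
```

An empty result only means that no document lacked the field at the moment of the query.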

rrussell · January 24, 2013, 4:09pm

Good to know. Just an FYI that beta 10 should be released later today, and will include the fix that ensures jobs created by beta 10 or later will always have that property defined.

That’s awesome! We’re really glad to hear it’s working well for you and that you notice the improvements!

im_thatoneguy · January 24, 2013, 7:02pm

Alcium, I would be intrigued to know how much disk space and RAM your MongoDB is using. I think you've probably created a stress test at the outer bounds of what we'll ever see in our queue, so a little real-world insight would be great if you could share!

Alcium · January 25, 2013, 1:22pm

I did not notice any unusual resource usage; this was well within the server's capabilities.

We have an HP ProLiant DL360 G6 with 1x Xeon E5520 @ 2.27GHz CPU / 6 GB RAM.
The Deadline Repository is on an SSD (which was the best solution for sharing the dozens of files in pre-6.0 versions).

Nothing really spiked on the Deadline system. I think the db was about 1 GB.

This works really well as long as Pulse is not going crazy with weird corrupted jobs.