Found machines that are hanging, with no CPU usage at all.
In the log, these are the last messages:
0: INFO: Compiling scene geometry
0: INFO: Preparing ray server…
0: INFO: Building static raycast accelerator…
0: INFO: Building motion blur raycast accelerator…
0: INFO: Preparing direct light manager…
0: INFO: Preparing global light manager…
---- 2013/11/07 13:43 ----
An error occurred while checking task status: An unexpected error occurred while interacting with the database (deadline.scanlinevfxla.com:27017):
Timeout waiting for a MongoConnection. (FranticX.Database.DocumentException)
---- 2013/11/07 13:46 ----
Connecting to slave log: LAPRO0059
Listener Thread - 172.18.1.159 has connected
The render was around 35% done, btw, so it was already mid-progress.
It could have been some kind of deadlock; the slave would not shut down. I had to kill it via the task manager, and as soon as I did that, Max would start rendering again (at least, its CPU usage went up from zero).
Since going to beta 10, we have had the mongo DB ‘hang’ for about 10-20 seconds twice. No machine can connect to it during that time.
It shows up in the logs like this:
0: INFO: Prepass 2 of 2…: done [00:01:25.0]
---- 2013/11/07 13:42 ----
Error occurred while updating network settings: An unexpected error occurred while interacting with the database (deadline.scanlinevfxla.com:27017):
Timeout waiting for a MongoConnection. (FranticX.Database.DocumentException)
---- 2013/11/07 13:43 ----
An error occurred while checking task status: An unexpected error occurred while interacting with the database (deadline.scanlinevfxla.com:27017):
Timeout waiting for a MongoConnection. (FranticX.Database.DocumentException)
---- 2013/11/07 14:44 ----
Error occurred while updating network settings: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, blocked by a firewall, or experiencing network issues.
Full error: Unable to connect to server deadline.scanlinevfxla.com:27017: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 172.18.2.209:27017. (FranticX.Database.DatabaseConnectionException)
Scheduler Thread - An error occurred while reporting task progress: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, blocked by a firewall, or experiencing network issues.
Full error: Unable to connect to server deadline.scanlinevfxla.com:27017: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 172.18.2.209:27017. (FranticX.Database.DatabaseConnectionException)
---- 2013/11/07 14:58 ----
Scheduler Thread - Task “40_1080-1080” could not be found because task has been modified:
current status = Rendering, new status = Queued
Sometimes the machine hangs on its current frame, and even though connectivity to mongo comes back, it never continues rendering.
More worrying, though, is that mongo ‘disappears’ for that long… I ssh’d into the server while it was happening last time, and the mongo process was not using any CPU.
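Next time it wedges, it might be worth asking the server what it thinks it is doing. A minimal sketch for the mongo shell (assuming a shell to deadline.scanlinevfxla.com still opens while it is hung; if even that blocks, that at least tells us the listener is stuck too):

// Lists in-progress operations, flagging those waiting on the global lock.
db.currentOp().inprog.forEach(function (op) {
    if (op.active || op.waitingForLock) {
        print(op.opid + "  " + op.op + "  " + (op.ns || "") +
              "  secs_running=" + (op.secs_running || 0) +
              "  waitingForLock=" + !!op.waitingForLock);
    }
});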
The database keeps hanging… it’s been hanging for 4 minutes now!
It’s still a running process, at 5% CPU usage, but its logs haven’t had anything new added for a while.
When it came back to life, I got this in its logs (you can see the 4-minute gap in there):
Thu Nov 7 16:03:32.648 [conn146732744] update deadline6db.LimitGroups query: { _id: "527c288afaa7dc2808a711d9", StubCount: { $lt: 500 }, Stubs.Holder: { $ne: "lapro1367-secondary" } } update: { $push: { Stubs: { Holder: "lapro1367-secondary", Time: new Date(1383869010686) } }, $inc: { StubCount: 1 }, $set: { LastWriteTime: new Date(1383869010686) } } nscanned:1 nupdated:1 keyUpdates:1 locks(micros) w:170 1999ms
Thu Nov 7 16:03:32.648 [conn146737740] update deadline6db.LimitGroups query: { _id: "527c29b74c05aa0ddcfbdbec", StubCount: { $lt: 500 }, Stubs.Holder: { $ne: "lapro1405-secondary" } } update: { $push: { Stubs: { Holder: "lapro1405-secondary", Time: new Date(1383869010654) } }, $inc: { StubCount: 1 }, $set: { LastWriteTime: new Date(1383869010654) } } nscanned:1 nupdated:1 keyUpdates:1 locks(micros) w:134 1998ms
Thu Nov 7 16:03:32.648 [conn146737363] command deadline6db.$cmd command: { findAndModify: "JobTasks", query: { _id: "527c29948faf8405f41dacb3_2" }, sort: {}, update: { $set: { Stat: 5, Slave: "LAPRO1442-secondary", WtgStrt: false, Comp: new Date(1383869010654) } } } ntoreturn:1 keyUpdates:0 locks(micros) w:1998931 reslen:360 1999ms
Thu Nov 7 16:07:10.412 [conn146917999] getMore: cursorid not found deadline6db.SlaveInfo 2471325587580410478
Thu Nov 7 16:07:10.445 [conn146888250] SocketException handling request, closing client connection: 9001 socket exception [2] server [172.18.4.114:59861]
Thu Nov 7 16:07:10.719 [conn146901395] info DFM::findAll(): extent 1:cef000 was empty, skipping ahead. ns:deadline6db.PermissionGroups
Thu Nov 7 16:07:10.793 [conn146960267] command deadline6db.$cmd command: { $eval: return new Date();, nolock: true } ntoreturn:1 keyUpdates:0 reslen:53 318ms
Thu Nov 7 16:07:10.932 [conn146888019] SocketException handling request, closing client connection: 9001 socket exception [2] server [172.18.4.114:59860]
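For what it’s worth, the findAndModify on JobTasks in that excerpt held the write lock for roughly two seconds (w:1998931 micros), and the LimitGroups updates each took ~2s overall, apparently mostly waiting behind it. To see whether slow operations like these pile up around the hangs, one option (just a sketch; this assumes the 2.4 profiler is acceptable overhead on the production server) is to enable profiling for anything over a second:

// Run in the mongo shell on the Deadline database.
use deadline6db
db.setProfilingLevel(1, 1000)   // level 1 = record only ops slower than 1000 ms
// ...after the next hang, inspect what was recorded:
db.system.profile.find().sort({ ts: -1 }).limit(10).pretty()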
Everything hangs while this is happening (the whole farm, basically).
I tried to get DB stats from it while it was hanging, but it would hang on that too, and only gave them to me once it “came back to life” again.
Mongo Database Statistics
name: deadline6db
collections: 23
objects: 6805433
average object size: 408.0 Bytes
data size: 2.6 GB
storage size: 5.1 GB
extents: 187
indexes: 42
index size: 662.2 MB
file size: 7.9 GB
BalancerInfo
namespace: deadline6db.BalancerInfo
objects: 0
average object size: error: Element 'avgObjSize' not found. (System.Collections.Generic.KeyNotFoundException)
DeadlineSettings
namespace: deadline6db.DeadlineSettings
objects: 7
average object size: 354.6 KB
data size: 2.4 MB
storage size: 13.0 MB
extents: 3
last extent size: 12.8 MB
padding factor: 1
system flags: HasIdIndex
user flags: None
indexes: 1
total index size: 8.0 KB
index sizes:
_id_: 7.984 KB
DeletedDocuments
namespace: deadline6db.DeletedDocuments
objects: 197539
average object size: 1.4 KB
data size: 264.4 MB
storage size: 320.3 MB
extents: 14
last extent size: 88.3 MB
padding factor: 1
system flags: HasIdIndex
user flags: None
indexes: 5
total index size: 45.3 MB
index sizes:
_id_: 10.012 MB
DeletedTime_1: 5.341 MB
DocID_1: 10.830 MB
OrigCollection_1: 8.405 MB
OrigCollection_1_DeletedTime_1: 10.698 MB
DeletedJobReportEntries
namespace: deadline6db.DeletedJobReportEntries
objects: 0
average object size: error: Element 'avgObjSize' not found. (System.Collections.Generic.KeyNotFoundException)
DeletedJobs
namespace: deadline6db.DeletedJobs
objects: 8
average object size: 31.2 KB
data size: 249.9 KB
storage size: 44.3 MB
extents: 7
last extent size: 14.3 MB
padding factor: 1
system flags: HasIdIndex
user flags: None
indexes: 2
total index size: 16.0 KB
index sizes:
_id_: 7.984 KB
LastWriteTime_1: 7.984 KB
JobReportEntries
namespace: deadline6db.JobReportEntries
objects: 676945
average object size: 360.0 Bytes
data size: 232.4 MB
storage size: 600.4 MB
extents: 16
last extent size: 160.9 MB
padding factor: 1
system flags: HasIdIndex
user flags: None
indexes: 2
total index size: 94.9 MB
index sizes:
_id_: 50.183 MB
Job_1: 44.686 MB
JobReports
namespace: deadline6db.JobReports
objects: 6440
average object size: 14.0 KB
data size: 88.3 MB
storage size: 440.5 MB
extents: 14
last extent size: 121.4 MB
padding factor: 1.03800000003159
system flags: HasIdIndex
user flags: None
indexes: 1
total index size: 535.0 KB
index sizes:
_id_: 534.953 KB
Jobs
namespace: deadline6db.Jobs
objects: 14275
average object size: 9.7 KB
data size: 135.6 MB
storage size: 188.6 MB
extents: 12
last extent size: 53.2 MB
padding factor: 1.00000000000072
system flags: HasIdIndex
user flags: None
indexes: 6
total index size: 4.5 MB
index sizes:
_id_: 1.037 MB
LastWriteTime_1: 670.688 KB
Props.Grp_1: 590.844 KB
Props.Pool_1: 558.906 KB
Stat_1_Props.Pool_1_Props.Grp_1: 902.234 KB
Stat_1_Props.SecPool_1_Props.Grp_1: 806.422 KB
JobStatistics
namespace: deadline6db.JobStatistics
objects: 77906
average object size: 1.4 KB
data size: 105.0 MB
storage size: 118.2 MB
extents: 11
last extent size: 35.9 MB
padding factor: 1
system flags: HasIdIndex
user flags: None
indexes: 2
total index size: 6.6 MB
index sizes:
_id_: 4.405 MB
EntryTime_1: 2.207 MB
JobTasks
namespace: deadline6db.JobTasks
objects: 2011338
average object size: 337.0 Bytes
data size: 646.4 MB
storage size: 900.6 MB
extents: 16
last extent size: 241.4 MB
padding factor: 1.16500000048194
system flags: HasIdIndex
user flags: None
indexes: 1
total index size: 156.8 MB
index sizes:
_id_: 156.771 MB
LimitGroups
namespace: deadline6db.LimitGroups
objects: 14307
average object size: 1.2 KB
data size: 16.2 MB
storage size: 82.3 MB
extents: 10
last extent size: 26.6 MB
padding factor: 1.0000000000763
system flags: HasIdIndex
user flags: None
indexes: 2
total index size: 1.8 MB
index sizes:
_id_: 1.162 MB
LastWriteTime_1: 678.672 KB
PermissionGroups
namespace: deadline6db.PermissionGroups
objects: 5
average object size: 13.4 KB
data size: 66.9 KB
storage size: 140.0 KB
extents: 2
last extent size: 132.0 KB
padding factor: 1
system flags: HasIdIndex
user flags: None
indexes: 1
total index size: 8.0 KB
index sizes:
_id_: 7.984 KB
PulseInfo
namespace: deadline6db.PulseInfo
objects: 2
average object size: 484.0 Bytes
data size: 968.0 Bytes
storage size: 8.0 KB
extents: 1
last extent size: 8.0 KB
padding factor: 1
system flags: HasIdIndex
user flags: None
indexes: 2
total index size: 16.0 KB
index sizes:
_id_: 7.984 KB
LastWriteTime_1: 7.984 KB
RepositoryStatistics
namespace: deadline6db.RepositoryStatistics
objects: 12742
average object size: 291.0 Bytes
data size: 3.5 MB
storage size: 10.7 MB
extents: 6
last extent size: 8.0 MB
padding factor: 1
system flags: HasIdIndex
user flags: None
indexes: 2
total index size: 966.1 KB
index sizes:
_id_: 598.828 KB
EntryTime_1: 367.281 KB
ScriptInfos
namespace: deadline6db.ScriptInfos
objects: 89
average object size: 239.0 Bytes
data size: 20.8 KB
storage size: 40.0 KB
extents: 2
last extent size: 32.0 KB
padding factor: 1.00000000000001
system flags: HasIdIndex
user flags: None
indexes: 3
total index size: 24.0 KB
index sizes:
_id_: 7.984 KB
ScriptType_1_ScriptName_1: 7.984 KB
EntryTime_1: 7.984 KB
SlaveInfo
namespace: deadline6db.SlaveInfo
objects: 1307
average object size: 2.7 KB
data size: 3.4 MB
storage size: 20.0 MB
extents: 6
last extent size: 15.0 MB
padding factor: 1.00000000000159
system flags: HasIdIndex
user flags: None
indexes: 2
total index size: 135.7 KB
index sizes:
_id_: 71.859 KB
LastWriteTime_1: 63.875 KB
SlaveReports
namespace: deadline6db.SlaveReports
objects: 1307
average object size: 446.9 KB
data size: 570.4 MB
storage size: 1.0 GB
extents: 18
last extent size: 271.5 MB
padding factor: 1.00000000000717
system flags: HasIdIndex
user flags: None
indexes: 1
total index size: 71.9 KB
index sizes:
_id_: 71.859 KB
SlaveSettings
namespace: deadline6db.SlaveSettings
objects: 1307
average object size: 586.0 Bytes
data size: 748.3 KB
storage size: 2.7 MB
extents: 5
last extent size: 2.0 MB
padding factor: 1.0000000000003
system flags: HasIdIndex
user flags: None
indexes: 2
total index size: 127.8 KB
index sizes:
_id_: 71.859 KB
LastWriteTime_1: 55.891 KB
SlaveStatistics
namespace: deadline6db.SlaveStatistics
objects: 3789533
average object size: 161.0 Bytes
data size: 580.7 MB
storage size: 1.1 GB
extents: 18
last extent size: 293.3 MB
padding factor: 1
system flags: HasIdIndex
user flags: None
indexes: 2
total index size: 350.5 MB
index sizes:
_id_: 238.681 MB
EntryTime_1: 111.789 MB
system.indexes
namespace: deadline6db.system.indexes
objects: 42
average object size: 93.0 Bytes
data size: 3.8 KB
storage size: 20.0 KB
extents: 2
last extent size: 16.0 KB
padding factor: 1
system flags: None
user flags: None
indexes: 0
total index size: 0.0 Bytes
index sizes:
Tasks
namespace: deadline6db.Tasks
objects: 143
average object size: 2.4 KB
data size: 346.7 KB
storage size: 948.0 KB
extents: 3
last extent size: 908.0 KB
padding factor: 1.08200000000004
system flags: HasIdIndex
user flags: None
indexes: 1
total index size: 8.0 KB
index sizes:
_id_: 7.984 KB
UserInfo
namespace: deadline6db.UserInfo
objects: 192
average object size: 162.0 Bytes
data size: 30.3 KB
storage size: 168.0 KB
extents: 3
last extent size: 128.0 KB
padding factor: 1.00000000000001
system flags: HasIdIndex
user flags: None
indexes: 1
total index size: 8.0 KB
index sizes:
_id_: 7.984 KB
Server Status
host: deadline.scanlinevfxla.com
version: 2.4.1
process: mongod
pid: 21548
uptime: 748424
uptimeMillis: 748424457
uptimeEstimate: 698879
localTime: 2013-11-08T00:07:46.878Z
asserts
regular: 0
warning: 0
msg: 0
user: 0
rollovers: 0
backgroundFlushing
flushes: 12469
total_ms: 27676356
average_ms: 2219.6131205389365
last_ms: 1501
last_finished: 2013-11-08T00:06:56.231Z
connections
current: 1972
available: 18028
totalCreated: 146965271
cursors
totalOpen: 12
clientCursors_size: 12
timedOut: 3329
dur
commits: 1
journaledMB: 0.008192
writeToDataFilesMB: 0.009743
compression: 0.82914979757085017
commitsInWriteLock: 0
earlyCommits: 0
timeMs
dt: 29833
prepLogBuffer: 0
writeToJournal: 940
writeToDataFiles: 28790
remapPrivateView: 0
extra_info
note: fields vary by platform
heap_usage_bytes: 1280955304
page_faults: 15328
globalLock
totalTime: 748424459000
lockTime: 17585321729
currentQueue
total: 860
readers: 860
writers: 0
activeClients
total: 1
readers: 0
writers: 1
indexCounters
accesses: 4541723855
hits: 4541642440
misses: 0
resets: 0
missRatio: 0
locks
.
timeLockedMicros
R: 627490879
W: 17585321729
timeAcquiringMicros
R: 989119559
W: 146559613
admin
timeLockedMicros
timeAcquiringMicros
local
timeLockedMicros
r: 2101391
w: 0
timeAcquiringMicros
r: 1410968
w: 0
deadline6db
timeLockedMicros
r: 883354504854
w: 192844603742
timeAcquiringMicros
r: 33418756278154
w: 53062434768427
network
bytesIn: 464669805404
bytesOut: 77144160251676
numRequests: 3610725033
opcounters
insert: 190115
query: 56098659
update: 10320581
delete: 1172149
getmore: 1483441
command: 24803808
opcountersRepl
insert: 0
query: 0
update: 0
delete: 0
getmore: 0
command: 0
recordStats
accessesNotInMemory: 1354
pageFaultExceptionsThrown: 198
deadline6db
accessesNotInMemory: 1354
pageFaultExceptionsThrown: 198
local
accessesNotInMemory: 0
pageFaultExceptionsThrown: 0
writeBacksQueued: false
mem
bits: 64
resident: 5997
virtual: 503055
supported: true
mapped: 6174
mappedWithJournal: 12348
metrics
document
deleted: 1534615
inserted: 58849569
returned: 94079381414
updated: 307044713
getLastError
wtime
num: 453157619
totalMillis: 209
wtimeouts: 0
operation
fastmod: 22517205
idhack: 1555317093
scanAndOrder: 0
queryExecutor
scanned: 13726548827
record
moves: 1542139
repl
apply
batches
num: 0
totalMillis: 0
ops: 0
buffer
count: 0
maxSizeBytes: 268435456
sizeBytes: 0
network
bytes: 0
getmores
num: 0
totalMillis: 0
ops: 0
readersCreated: 0
oplog
insert
num: 0
totalMillis: 0
insertBytes: 0
preload
docs
num: 0
totalMillis: 0
indexes
num: 0
totalMillis: 0
ttl
deletedDocuments: 0
passes: 12443
ok: 1
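The numbers that stand out in that dump: globalLock → currentQueue shows 860 readers queued behind a single active writer, and for deadline6db the cumulative timeAcquiringMicros (r: ~33.4e12, w: ~53.1e12) dwarfs timeLockedMicros, i.e. connections spend far more time waiting for the lock than holding it. A small sketch for watching that queue live from the shell (assuming plain serverStatus polling is enough to catch it):

// Polls the global lock queue every 5 seconds.
while (true) {
    var s = db.serverStatus();
    printjson(s.globalLock.currentQueue);   // e.g. { total: 860, readers: 860, writers: 0 }
    print("lifetime write-lock ratio: " + (s.globalLock.lockTime / s.globalLock.totalTime).toFixed(4));
    sleep(5000);
}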
This seems to be affecting the majority of our farm now.
The logs don’t always show a connectivity problem to the mongo DB, but a huge percentage of the farm is just sitting there… The slaves say they are rendering, but 3dsmax is at 0 percent CPU usage and nothing is happening. The last lines of the logs seem to reflect some activity:
0: INFO: Prepass 2 of 2… [00:01:25.1] [00:01:28.7 est]
0: INFO: Prepass 2 of 2… [00:01:25.2] [00:01:28.8 est]
---- 2013/11/07 21:06 ----
0: INFO: Prepass 2 of 2… [00:01:25.4] [00:01:29.0 est]
0: INFO: Prepass 2 of 2… [00:01:25.5] [00:01:29.1 est]
0: INFO: Prepass 2 of 2… [00:01:25.7] [00:01:29.2 est]
0: INFO: Prepass 2 of 2… [00:01:25.8] [00:01:29.3 est]
0: INFO: Prepass 2 of 2… [00:01:26.0] [00:01:29.4 est]
0: INFO: Prepass 2 of 2… [00:01:26.1] [00:01:29.6 est]
0: INFO: Prepass 2 of 2… [00:01:26.3] [00:01:29.6 est]
0: INFO: Prepass 2 of 2… [00:01:26.4] [00:01:29.6 est]
0: INFO: Prepass 2 of 2… [00:01:26.6] [00:01:29.6 est]
0: INFO: Prepass 2 of 2… [00:01:26.7] [00:01:26.8 est]
0: INFO: Prepass 2 of 2… [00:01:26.9] [00:01:26.9 est]
0: INFO: Prepass 2 of 2… [00:01:27.0] [00:01:27.1 est]
0: INFO: Prepass 2 of 2… [00:01:27.2] [00:01:27.2 est]
0: INFO: Prepass 2 of 2… [00:01:27.3] [00:01:27.4 est]
0: INFO: Prepass 2 of 2… [00:01:27.5] [00:01:27.5 est]
0: INFO: Prepass 2 of 2… [00:01:27.6] [00:01:27.6 est]
0: INFO: Prepass 2 of 2… [00:01:27.8] [00:01:27.8 est]
0: INFO: Prepass 2 of 2…: done [00:01:27.8]
---- 2013/11/07 22:13 ----
Connecting to slave log: LAPRO1436
Listener Thread - fe80::458:24f1:53ed:ff77%14 has connected
Listener Thread - Received message: StreamLog
Listener Thread - Responded with: Success
In some cases, we do see connectivity issues:
0: INFO: Preparing direct light manager…
0: INFO: Preparing global light manager…
---- 2013/11/07 20:19 ----
Scheduler Thread - An error occurred while reporting task progress: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, blocked by a firewall, or experiencing network issues.
Full error: Unable to connect to server deadline.scanlinevfxla.com:27017: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 172.18.2.209:27017. (FranticX.Database.DatabaseConnectionException)
0: STDOUT: Cannot read image file “s:\a\inferno2\projects\vulcan\scenes\SHR_shr_rsrc\images\textures\vehTriremeA\v0001_rui_Initial\LOD\trireme_ship_col.1075_Lod-03_rlin.exr”. Unexpected end of file.
Slave - An error occurred while updating the slave’s info: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, blocked by a firewall, or experiencing network issues.
Full error: Unable to connect to server deadline.scanlinevfxla.com:27017: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 172.18.2.209:27017. (FranticX.Database.DatabaseConnectionException)
---- 2013/11/07 20:21 ----
0: STDOUT: Cannot read image file “s:\a\inferno2\projects\vulcan\scenes\SHR_shr_rsrc\images\textures\vehTriremeA\v0001_rui_Initial\LOD\trireme_ship_col.1074_Lod-03_rlin.exr”. Unexpected end of file.
---- 2013/11/07 20:22 ----
0: STDOUT: Cannot read image file “s:\a\inferno2\projects\vulcan\scenes\SHR_shr_rsrc\images\textures\vehTriremeA\v0001_rui_Initial\LOD\trireme_ship_col.1073_Lod-03_rlin.exr”. Unexpected end of file.
---- 2013/11/07 21:49 ----
0: STDOUT: Cannot read image file “s:\a\inferno2\projects\vulcan\scenes\SHR_shr_rsrc\images\textures\vehTriremeA\v0001_rui_Initial\LOD\trireme_ship_col.1066_Lod-03_rlin.exr”. Unexpected end of file.
0: STDOUT: Cannot read image file “s:\a\inferno2\projects\vulcan\scenes\SHR_shr_rsrc\images\textures\vehTriremeA\v0001_rui_Initial\LOD\trireme_ship_col.1067_Lod-03_rlin.exr”. Unexpected end of file.
---- 2013/11/07 22:16 ----
Connecting to slave log: LAPRO1409
Listener Thread - fe80::2c47:2783:53ed:ff92%13 has connected
Listener Thread - Received message: StreamLog
Listener Thread - Responded with: Success
This started with beta 10. We will now try to roll back to beta 9.
Both beta 9 and beta 10 clients hang like this. Maybe some change in the database is causing this?
We have Viet from IT back on site trying to figure out if something went wrong with the Deadline server… the mongo logs don’t indicate anything.
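One more thing that might help narrow it down: a hypothetical watchdog loop run from a mongo shell on any workstation (the 2-second threshold and 5-second interval are arbitrary choices), so the next hang gets a precise timestamp to match against the Deadline and mongod logs:

// Pings the server on an interval and logs any stall.
while (true) {
    var t0 = new Date();
    try {
        db.adminCommand({ ping: 1 });
        var ms = new Date() - t0;
        if (ms > 2000) print(new Date() + "  ping took " + ms + " ms");
    } catch (e) {
        print(new Date() + "  ping failed: " + e);
    }
    sleep(5000);
}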