AWS Thinkbox Discussion Forums

extremely slow status change (pending scan)

Our pending job scan takes about 20 minutes (!) per cycle. The majority of that time is spent not on the actual dependency testing, but on changing the task statuses…
For example:

2013-12-16 00:29:38: Executing dependency script: /mnt/s2/exchange/software/managed/pythonScripts/site-packages/scl/deadline/scriptDependency.py
2013-12-16 00:29:38: Running Scanline handleJobDependencies checker v0.42 for JobID: 52aeb133c3f6ebe230ad647d TaskIDs: [u'0', u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8', u'9', u'10', u'11', u'12', u'13', u'14', u'15', u'16', u'17', u'18', u'19', u'20', u'21', u'22', u'23', u'24', u'25', u'26', u'27', u'28', u'29', u'30', u'31', u'32', u'33', u'34', u'35', u'36', u'37', u'38', u'39', u'40', u'41', u'42', u'43', u'44', u'45', u'46', u'47', u'48', u'49', u'50', u'51', u'52', u'53', u'54', u'55', u'56', u'57', u'58', u'59', u'60', u'61', u'62', u'63', u'64', u'65', u'66', u'67', u'68', u'69', u'70', u'71', u'72', u'73', u'74', u'75', u'76', u'77', u'78', u'79', u'80', u'81', u'82', u'83', u'84', u'85', u'86', u'87', u'88', u'89', u'90', u'91', u'92', u'93', u'94', u'95', u'96', u'97', u'98', u'99', u'100', u'101', u'102', u'103', u'104', u'105', u'106', u'107', u'108', u'109', u'110', u'111', u'112', u'113', u'114', u'115', u'116', u'117', u'118', u'119', u'120', u'121', u'122', u'123', u'124', u'125', u'126', u'127', u'128', u'129', u'130', u'131', u'132', u'133', u'134', u'135', u'136', u'137', u'138', u'139', u'140', u'141', u'142', u'143', u'144', u'145', u'146', u'147', u'148', u'149', u'150', u'151', u'152', u'153', u'154', u'155', u'156', u'157', u'158', u'159', u'160', u'161']
2013-12-16 00:29:38: Dependency to check: [Flowline] /mnt/flspace/CL/vulcan/WW_108_0150/cache/flowline/Spray/v0304_str_Solved/Spray
2013-12-16 00:29:38: globbing: set([])
2013-12-16 00:29:38: cachePathStates: set([952, 953, 954, 955, 956, 957, 958, 959, 960, 961, 962, 963, 964, 965, 966, 967, 968, 969, 970, 971, 972, 973, 974, 975, 976, 977, 978, 979, 980, 981])
2013-12-16 00:29:38: globbing done. Checking against task list…
2013-12-16 00:29:38: Queueing: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30']
2013-12-16 00:29:38: Dependency script returned 31 tasks that can start: /mnt/s2/exchange/software/managed/pythonScripts/site-packages/scl/deadline/scriptDependency.py
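For reference, the checker's core logic amounts to globbing the upstream cache directory and mapping the cached frame numbers back to task IDs. A minimal sketch of that check (the function name, the filename pattern, and the `frame_offset` mapping are illustrative assumptions here, not our production scriptDependency.py):

```python
import glob
import os
import re


def tasks_that_can_start(cache_path_pattern, task_ids, frame_offset=0):
    """Simplified sketch of a script-dependency check: glob the upstream
    cache directory, collect the frame numbers that exist on disk, and
    return the task IDs whose corresponding frames are already cached.
    """
    # Frame numbers present on disk, e.g. "Spray_0952.bin" -> 952
    done_frames = set()
    for path in glob.glob(cache_path_pattern):
        match = re.search(r"(\d+)\D*$", os.path.basename(path))
        if match:
            done_frames.add(int(match.group(1)))
    # A task can start once its corresponding frame is cached
    # (frame_offset maps task 0 -> first cached frame, e.g. 952 above).
    return [tid for tid in task_ids if int(tid) + frame_offset in done_frames]
```

As the log shows, this part is cheap: one glob plus a set-membership test per task, which is why the check itself finishes in under a second.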

Note that the whole test took less than a second. Then there is a 2 minute delay before it actually releases the tasks:

2013-12-16 00:31:07: Pending Job Scan - Released pending tasks (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30) for job "[VULCAN] WW_108_0150_v0304_str_Solved_cache_flowline_Mist_2 " because the frames they depends on have finished and/or their required assets are available.

This is a single job, and since we have hundreds, it really adds up. People end up trying to manually release their jobs, but that also takes ages.

The Deadline Mongo machine is now extremely powerful, we are on beta 13, and Pulse is running on a different machine. Something else is up…

Another example:

2013-12-16 00:27:46:  Executing dependency script: /mnt/s2/exchange/software/managed/pythonScripts/site-packages/scl/deadline/scriptDependency.py
2013-12-16 00:27:46:  Running Scanline handleJobDependencies checker v0.42 for JobID: 52aeb16201f73e0cac147a8d TaskIDs: [u'7', u'8', u'9', u'10', u'11', u'12', u'13', u'14', u'15', u'16', u'17', u'18', u'19', u'20', u'21', u'22', u'23', u'24', u'25', u'26', u'27', u'28', u'29', u'30', u'31', u'32', u'33', u'34', u'35', u'36', u'37', u'38', u'39', u'40', u'41', u'42', u'43', u'44', u'45', u'46', u'47', u'48', u'49', u'50', u'51', u'52', u'53', u'54', u'55', u'56', u'57', u'58', u'59', u'60', u'61', u'62', u'63', u'64', u'65', u'66', u'67', u'68', u'69', u'70', u'71', u'72', u'73', u'74', u'75', u'76', u'77', u'78', u'79', u'80', u'81', u'82', u'83', u'84', u'85', u'86', u'87', u'88', u'89', u'90', u'91', u'92', u'93', u'94', u'95', u'96', u'97', u'98', u'99', u'100', u'101', u'102', u'103', u'109', u'110', u'111', u'112', u'113', u'114', u'115', u'116', u'117', u'118', u'119', u'120', u'121', u'122', u'123', u'124', u'125', u'126', u'127', u'128', u'129', u'130', u'131', u'132', u'133', u'134', u'135', u'136', u'137', u'138', u'139', u'140', u'141', u'142', u'143', u'144', u'145', u'146', u'147', u'148', u'149', u'150', u'151', u'152', u'153', u'154', u'155', u'156', u'157', u'158', u'159', u'160', u'161', u'162', u'163', u'164', u'165', u'166', u'167', u'168', u'169', u'170', u'171', u'172', u'173', u'174', u'175', u'176', u'177', u'178', u'179', u'180', u'181', u'182', u'183', u'184', u'185', u'186', u'187', u'188', u'189', u'190', u'191', u'192', u'193', u'194', u'195', u'196', u'197', u'198', u'199', u'200', u'201', u'202', u'203', u'204', u'205', u'206', u'207', u'208', u'209']
2013-12-16 00:27:46:    Dependency to check: [Flowline] /mnt/flspace/CL/vulcan/MR_102_0250/cache/flowline/Dirt03/v0045_mmu_killAllWhite_03/Dirt
2013-12-16 00:27:46:  globbing: set([])
2013-12-16 00:27:46:  cachePathStates: set([1116, 1117, 1118, 1119, 1120, 1121, 1122, 1123, 1124, 1125, 1126, 1127, 1128, 1129, 1130, 1131, 1132, 1133, 1134, 1135, 1136, 1137, 1138, 1139, 1140, 1141, 1142, 1143, 1144, 1145, 1146, 1147, 1148, 1149, 1150, 1151, 1152, 1153, 1154, 1155, 1156, 1157, 1158, 1159, 1160, 1161, 1162, 1163, 1164, 1165, 1166, 1167, 1168, 1169, 1170, 1171, 1172, 1173, 1174, 1175, 1176, 1177, 1178, 1179, 1180, 1181, 1182, 1183, 1184, 1185, 1186, 1187, 1188, 1189, 1190, 1191, 1192, 1193, 1194, 1195, 1196, 1197, 1198, 1199, 1200, 1201, 1202, 1203, 1204, 1205, 1206, 1207, 1208, 1209, 1210, 1211, 1212, 1213, 1214, 1215, 1216, 1217, 1218, 1219, 1220, 1221, 1222, 1223, 1224, 1225, 1226, 1227, 1228, 1229, 1230, 1231, 1232, 1233, 1234, 1235, 1236, 1237, 1238, 1239, 1240, 1241, 1242, 1243, 1244, 1245, 1246, 1247, 1248, 1249, 1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258, 1259, 1260, 1261, 1262, 1263, 1264, 1265, 1266, 1267, 1268, 1269, 1270, 1271, 1272, 1273, 1274, 1275, 1276, 1277, 1278, 1279, 1280, 1281, 1282, 1283, 1284, 1285, 1286, 1287, 1288, 1289, 1290, 1291, 1292, 1293, 1294, 1295, 1296, 1297, 1298, 1299, 1300, 1301, 1302, 1303, 1304, 1305, 1306, 1307, 1308, 1309, 1310, 1311, 1312, 1313, 1314, 1315, 1316, 1317, 1318, 1319, 1320, 1321, 1322, 1323, 1324])
2013-12-16 00:27:46:    globbing done. Checking against task list..
2013-12-16 00:27:46:  Queueing: ['7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '130', '131', '132', '133', '134', '135', '136', '137', '138', '139', '140', '141', '142', '143', '144', '145', '146', '147', '148', '149', '150', '151', '152', '153', '154', '155', '156', '157', '158', '159', '160', '161', '162', '163', '164', '165', '166', '167', '168', '169', '170', '171', '172', '173', '174', '175', '176', '177', '178', '179', '180', '181', '182', '183', '184', '185', '186', '187', '188', '189', '190', '191', '192', '193', '194', '195', '196', '197', '198', '199', '200', '201', '202', '203', '204', '205', '206', '207', '208', '209']
2013-12-16 00:27:46:  Dependency script returned 198 tasks that can start: /mnt/s2/exchange/software/managed/pythonScripts/site-packages/scl/deadline/scriptDependency.py

Then it manages to release the tasks 2 mins later:

2013-12-16 00:29:36:      Pending Job Scan - Released pending tasks (7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209) for job "[VULCAN] MR_102_0250_v0053_mmu_wholeSection2_03_cache_flowline_DirtTrail03_2 " because the frames they depends on have finished and/or their required assets are available.
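To quantify the gap, you can simply diff the timestamps of the "Dependency script returned" line and the "Released pending tasks" line. Using the two timestamps from this example:

```python
from datetime import datetime

# Timestamps copied from the two log lines above.
FMT = "%Y-%m-%d %H:%M:%S"
checked = datetime.strptime("2013-12-16 00:27:46", FMT)
released = datetime.strptime("2013-12-16 00:29:36", FMT)

delay = released - checked
print(delay.total_seconds())  # 110.0 seconds between check and release
```

Nearly two minutes spent just writing the status change for a single job, versus under a second for the dependency check itself.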

This is probably due to the general write slowness we discussed during our call. We are continuing to look into ways to alleviate it.

Cheers,
Ryan
