Redshift standalone with corrupted .rs files on slaves

Jonathan_de_Blok · August 31, 2018, 9:09am

I’m setting up the whole AWS Portal system and I’m having issues with Redshift standalone. All the EC2 stuff is working fine; I can start the Spot Fleets etc.

But I’m getting errors while rendering Redshift standalone files:

2018-08-31 08:38:09: 0: STDOUT: Loading: /mnt/Data/CCloutput_assets0e07af1e79b51dbb794d5f065b4c793e/test_0001.rs
2018-08-31 08:38:09: 0: STDOUT: Failed to load proxy ‘/mnt/Data/CCloutput_assets0e07af1e79b51dbb794d5f065b4c793e/test_0001.rs’. Invalid descriptor ‘SGMTMRKR’ (Expected ‘REDSHIFT’).
2018-08-31 08:38:09: 0: STDOUT: Failed to load /mnt/Data/CCloutput_assets0e07af1e79b51dbb794d5f065b4c793e/test_0001.rs. Aborting

The .rs files are placed in the right bucket by the asset server, and if I manually dig around and rename one of the files to .rs it renders fine locally using the Redshift command line. So the assets in the bucket seem to be fine.

But the render slaves get a mangled(?) version, or at least something with a wrong descriptor, as shown in the log above.

Any clues?
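
Since the error names the descriptor Redshift expected, one quick way to tell a good copy from a mangled one is to look at the first bytes of the file. This is only a minimal sketch: it assumes the descriptor is simply the first 8 bytes of the .rs file, which the log suggests but does not confirm, and `read_descriptor` / `has_redshift_descriptor` are hypothetical helper names.

```python
EXPECTED = b"REDSHIFT"  # descriptor the error message says Redshift expects


def read_descriptor(path, length=len(EXPECTED)):
    """Return the first `length` bytes of the file (assumed to hold the descriptor)."""
    with open(path, "rb") as fh:
        return fh.read(length)


def has_redshift_descriptor(path):
    """True if the file starts with the expected descriptor."""
    return read_descriptor(path) == EXPECTED
```

Running this against the local file, the copy in the bucket, and the copy under /mnt/Data on the render slave should show at which hop the header changes.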

eamsler · August 31, 2018, 1:51pm

That is some excellent diagnosing there! Agreed it must be corrupted due to the SGMTMRKR in the error.

If the file is coming back from S3 and rendering fine, it does seem like the cache on the render node has a bad copy… We store it on a local volume for faster access, but re-saving it should cause a re-sync up to S3 and back to the local volume.

Has this happened more than once? We’re downloading it to the volume using Amazon libraries, then providing it to Redshift through some magic. Is there a good way for us to reliably reproduce this one? We don’t have data corruption issues often (or they’re not reported).
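
To narrow down whether the cached copy on the render node really differs from the original, comparing checksums is probably the simplest test. A rough sketch using only the Python standard library; the function names are made up for illustration, and this is not how AWS Portal itself verifies transfers:

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large scene files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files have identical contents."""
    return sha256_of(path_a) == sha256_of(path_b)
```

If the hash of the file on the slave differs from the hash of the same frame on the workstation, the corruption happens somewhere in the transfer or cache rather than in the export.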

Jonathan_de_Blok · August 31, 2018, 2:16pm

Now that I read it again… ‘SGMTMRKR’ could be an 8-character shorthand for ‘segment marker’… that might be a clue?

Anyway… it happens every time. I’ve started/stopped the whole EC2 infrastructure a couple of times during testing; I assume that’s the equivalent of a fresh start cache-wise. The only thing I didn’t do is empty the bucket; I’ll give that a go later when I get back to it.

As far as reproducing it… it’s just a simple RS standalone submission, to rule things out here is my test sequence of rs files, nothing pretty, just a torus, light and a camera: _assets.zip (355.4 KB)

I’ll let you know if clearing the bucket helped…

Jonathan_de_Blok · September 1, 2018, 10:31am

I’ve emptied out the bucket and it got stranger… the first few frames rendered fine and I thought it was fixed… and then after a few frames it was the same SGMTMRKR problem again…

This is all using a single render slave, so it’s not a case of some machines working and others not…

FWIW:
Deadline Client Version: 10.0.12.1 Release (1281798c0)
FranticX Client Version: 2.4.0.0 Release (69a5159bb)
License Mode: Usage Based
Repository Version: 10.0.12.1 (1281798c0)
Integration Version: 10.0.12.1 (1281798c0)
3PL Settings Version: 28/11/2017

eamsler · September 2, 2018, 2:40am

Hmm. Good point about “Segment Marker”. I kind of glossed over that.

Your version of Deadline is pretty old at this point (we’re at 10.0.20 now), but if it’s happening reliably, we should try to upgrade Redshift on the AMI to match what you have in-office. There are some docs on how to create a custom AMI over here.

The base images are also designed to match whatever version of Deadline you’re currently on, but I’m not sure if we’ve upgraded Redshift since SP12. If you do make a custom image, it may be worth upgrading Deadline first so you’ll have more of a runway for your efforts. You’d want to stop the AWS Portal services, upgrade Deadline, then install the new AWS Portal components.

Jonathan_de_Blok · September 3, 2018, 1:41pm

Ok, I’m going to upgrade everything to the latest version and see what happens

I’m also tempted to go for a custom AMI based on the Thinkbox base AMI so I can have some control… but a few things aren’t clear at the moment:

- I can’t find a base Redshift Standalone AMI, only a Redshift+Maya one… are those the same?

- I’m planning on using Redshift on-demand through Deadline; is that supported using a customized base AMI? (So the Redshift usage is burning credits per hour and Deadline handles all the licensing etc.)

- When Redshift releases an update/new version, what’s the usual time frame before it becomes available in one of the official Thinkbox AMIs?

Jonathan_de_Blok · September 3, 2018, 2:27pm

mmm… now the remote connection server is borked, see image. I’ve disabled the firewall to exclude any port issues.

Going to wipe all of deadline and start fresh…

Jonathan_de_Blok · September 3, 2018, 3:19pm

Got a bit further… now I get this error when rendering with Redshift Standalone:

Loading: /mnt/Data/CCloutput_assets0e07af1e79b51dbb794d5f065b4c793e/test_0000.rs
2018-09-03 18:34:59: 0: STDOUT: Failed to load proxy ‘/mnt/Data/CCloutput_assets0e07af1e79b51dbb794d5f065b4c793e/test_0000.rs’. Proxy version mismatch. Found version ‘46’, current version is ‘44’.

So I guess that means you guys need to update the RS version on the AMI?
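
For anyone scanning many task logs for this message, the two version numbers can be pulled out with a small regular expression. This is just a sketch built around the exact wording of the log line above; `parse_proxy_mismatch` is a hypothetical helper, and reading "found" as the file's version and "current" as the renderer's version is an interpretation of the message, not something Redshift's documentation is quoted on here.

```python
import re

# Matches the message from the log above; accepts straight or curly quotes.
_MISMATCH = re.compile(
    r"Proxy version mismatch\. Found version [‘'’](\d+)[‘'’], "
    r"current version is [‘'’](\d+)[‘'’]"
)


def parse_proxy_mismatch(line):
    """Return (found_version, current_version) or None if the line doesn't match."""
    m = _MISMATCH.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))
```

A found version higher than the current one, as in the log above (46 vs 44), would be consistent with the .rs files having been exported by a newer Redshift than the one installed on the render node.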

cmoore · September 3, 2018, 6:56pm

What Redshift version are you using? If there is a version mismatch you are able to create a custom AMI with the version you need. You’d just load up our base Redshift AMI and then install the version of Redshift you need. Here is a document that covers creating a custom AMI.

https://docs.thinkboxsoftware.com/products/deadline/10.0/1_User%20Manual/manual/aws-custom-ami.html

Regards,

Charles

Jonathan_de_Blok · September 3, 2018, 7:16pm

I’m on the latest public v2.6.20.

I’m working on the custom AMI right now… my l337 Linux skills are a bit rusty but I think I can manage

I took the Maya+RS base AMI since that’s what the portal starts up if you select ‘redshift’.

While I’m at it, is there a magic command to upgrade Deadline itself to the latest version? The AMI is ‘still’ on 10.0.17.3 iirc and locally I’m on 10.0.20 now.

cmoore · September 3, 2018, 7:24pm

Hey Jonathan,

At the moment I recommend starting the AMI you are modifying from the latest version that matches your local installation instead of upgrading Deadline on the AMI. Sometimes there may be fixes on the AMI itself that may not be implemented through an upgrade of the Deadline Client.

Regards,

Charles

Jonathan_de_Blok · September 3, 2018, 7:37pm

Thanks for the heads-up! Ahh, by looking a bit more closely at the AMI list from user 357466774442 I found this one, which has the right Deadline version on it… I must have missed that one.

Deadline Slave Base Image Linux 10.0.20.1 with Maya 2017_Update4 and Redshift 2.5.62 2018-08-24T155329Z - ami-09924e55c29f637c7

round 2…

Jonathan_de_Blok · September 3, 2018, 8:26pm

For future reference:

After updating RS I got an error when running redshiftCmdLine:

STDOUT: /usr/redshift/bin/redshiftCmdLine: error while loading shared libraries: libgomp.so.1: cannot open shared object file: No such file or directory

Running

sudo yum install '*/libgomp.so.1'

fixed that…
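
When building a custom AMI it can save a render round-trip to check for unresolved shared libraries up front. A small sketch, assuming `ldd` is available on the image (it normally is on Linux) and relying on its usual `libname => not found` output; `missing_libs` and `check_binary` are names invented for this example:

```python
import subprocess


def missing_libs(ldd_output):
    """Return library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path):
    """Run ldd on a binary and return its unresolved shared libraries."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=False
    )
    return missing_libs(result.stdout)
```

For example, `check_binary("/usr/redshift/bin/redshiftCmdLine")` on the image before the fix would be expected to list libgomp.so.1, and an empty list afterwards.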

Jonathan_de_Blok · September 3, 2018, 8:35pm

Great success

Jonathan_de_Blok · September 3, 2018, 9:27pm

One mystery left… the output files are missing. The output path is not modified by the path remapper?

My output path is ‘C:/Cloutput/assets/CGI/cam1/cam1_v0000_$F4.exr’ (the $F4 is a Houdini thing that puts the frame number in the filename).

The loading seems fine from the asset server bucket:
Loading: /mnt/Data/CCloutput_assets0e07af1e79b51dbb794d5f065b4c793e/test_0002.rs

But saving seems a bit weird: on Linux it goes to the original, unmapped path:

2018-09-03 21:23:28: 0: STDOUT: Saving: **C:/Cloutput/_assets/CGI/cam1/cam1_v0000_0000.exr**
2018-09-03 21:23:28: 0: STDOUT: Shutdown Rendering Sub-Systems...
2018-09-03 21:23:28: 0: STDOUT: Shutdown mem management thread...
2018-09-03 21:23:28: 0: STDOUT: Freeing GPU mem...(device 0)
2018-09-03 21:23:28: 0: STDOUT: Done (CUDA reported free mem before: 763 MB, after: 7165 MB)
2018-09-03 21:23:29: 0: STDOUT: Shutdown GPU Devices...
2018-09-03 21:23:29: 0: STDOUT: Device 0
2018-09-03 21:23:29: 0: STDOUT: Devices shut down ok
2018-09-03 21:23:29: 0: STDOUT: Finished Shutting down Rendering Sub-Systems
2018-09-03 21:23:30: 0: INFO: Process exit code: 0
2018-09-03 21:23:30: 0: INFO: Notifying AWS Portal File Transfer about task end.
2018-09-03 21:23:30: 0: INFO: AWS Portal File Transfer end...
2018-09-03 21:23:30: 0: INFO: Executing file transfer command: python /opt/Thinkbox/S3BackedCache/bin/task.py end 5b8da63f7741f4061c7814bc
2018-09-03 21:23:30: 0: INFO: Result: In new task.py 2
2018-09-03 21:23:30: 0: task.py received the following known arguments: Namespace(cmd=u'end', job_id='5b8da63f7741f4061c7814bc', url='localhost:4002')
2018-09-03 21:23:30: 0: ----- EndTask -----
2018-09-03 21:23:30: 0: job_id: "5b8da63f7741f4061c7814bc"
2018-09-03 21:23:30: 0: EOP
2018-09-03 21:23:30: 0: INFO: Done AWS Portal File Transfer end.

And the EXR is not in the bucket, and it’s nowhere in the asset server folder either?
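
A quick way to spot this in other task logs is to look for "Saving:" lines whose path still starts with a Windows drive letter on a Linux slave. This is a sketch that assumes only the log wording shown above; `unmapped_output_paths` is a hypothetical helper, and it says nothing about how Deadline's path mapping is configured, only whether a path appears to have gone through unmapped.

```python
import re

_SAVING = re.compile(r"Saving:\s*(.+)$")   # "Saving: <path>" as printed in the log
_DRIVE = re.compile(r"^[A-Za-z]:[\\/]")    # Windows drive prefix such as C:/ or C:\


def unmapped_output_paths(log_text):
    """Return output paths from 'Saving:' lines that still look like Windows paths."""
    paths = []
    for line in log_text.splitlines():
        m = _SAVING.search(line)
        if m is None:
            continue
        # Drop surrounding whitespace and any ** emphasis markers around the path.
        path = m.group(1).strip().strip("*")
        if _DRIVE.match(path):
            paths.append(path)
    return paths
```

Any path this returns from a Linux render log is one that the renderer tried to write literally, which would explain an output file never reaching the bucket.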