Failed to Communicate with Infrastructure Asset Server

chancenorris · November 1, 2018, 2:37am

Again, following the video on how to set up AWS Deadline Portal. Here are the steps taken so far…

Submit Maya Arnold job from Maya
Monitor Accepted Job
Create Deadline Infrastructure
Set Up limits for Usage Based License (Maya UBL)
Set Up Limits for Arnold (Arnold Local Licenses)
Right clicked on Deadline Infrastructure to create spot (Get the following error)

Error Message Start…

There was an error when attempting to communicate with the AWS Portal CentralServer on your infrastructure.

A possible cause of this error is the port 4001 being blocked on the machine running AWS Portal Link. Please un-block this port and try again. If that’s not the issue, there may be something wrong with either your infrastructure or your AWS user account permissions.

Would you still like to start a Spot Fleet?

Error Message End …

Thanks,

Chance

chancenorris · November 1, 2018, 3:56am

Steps taken so far…

unblocked 4001
Turned off Firewall (didn’t help so I turned it back on)

Still getting same error

" There was an error when attempting to communicate with the AWS Portal CentralServer on your infrastructure."

I have the log file if you need it.
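For anyone retracing these steps: one way to confirm that something is actually listening on port 4001 is a plain TCP connect test. This is only a minimal sketch. The host name in the usage lines is a placeholder (not something taken from this thread), and a successful connect only shows the port is reachable, not that AWS Portal Link itself is healthy.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # "localhost" is a placeholder; use the machine running AWS Portal Link.
    print("4001 reachable:", port_is_open("localhost", 4001))
```

Running it once with the firewall on and once with it off should show whether the firewall is what changes the result.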

eamsler · November 1, 2018, 4:19pm

So we just got off the phone here and I wish I’d read this first, but is @cmoore helping you with this via the ticket system?

squeakybadger · November 22, 2018, 7:10pm

Hi,

I seem to be having this error now as well. Was working fine earlier in the week, but today I’m just getting the 4001 error when trying to start a spot fleet.

Updated to 10.0.22.3 to see if it still happens, but I get the same error (I updated the IAM policy as well).

Any ideas on what to try now?

Thanks.

chancenorris · November 22, 2018, 7:49pm

I had to contact AWS to increase my limits for EC2 instances. The instance type used for the Deadline infrastructure is, I believe, m5.4xlarge, and my limit for it was set to 0.
You need to go into your AWS account, open the EC2 instances page, and request a limit increase if it’s set to 0.

Hope this helps.
Chance

eamsler · November 22, 2018, 11:27pm

Thanks Chance! We actually reverted that one because new users were hitting issues.

Methinks you need to check that your AWS Portal services are running, and that you don’t have a firewall that’s popped back up unexpectedly and is blocking 4001 on the machine running those services.

squeakybadger · November 23, 2018, 12:07pm

Hi Edwin,

Tried again this morning to start up a spot fleet - failed 4 times with the 4001 error and then started working and I was able to get some instances running.

I’ve attached some log snippets from around that time, nothing was changed during that period of trying to start a spot fleet/failing/retrying/successful, just me telling it no on the error and then starting a new one.

I always make sure the 2 services are running or have been restarted before firing up a cloud render (and Deadline usually moans in red in the Monitor if one of the services isn’t communicating properly).

Firewall has all the correct permissions to let things in and out, but I usually disable it when starting up a cloud render.

logSnippets.zip (2.9 KB)

cmoore · November 23, 2018, 4:50pm

How quickly are you trying to start a fleet after launching the infrastructure? The error might be popping up if the Gateway machine’s components have not fully initialized before you try to start a fleet. Could this be the case? Or is the timing of the error random? It sounds like if you ignore that error and continue, the fleet would still launch and connect.
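If timing does turn out to be the cause, one workaround is to poll a readiness check instead of starting the fleet after a fixed delay. A rough sketch only: `check` is a hypothetical callable you would supply yourself (for instance a port test, or whatever status probe you trust). It is not part of Deadline or AWS Portal, and the default timings are arbitrary.

```python
import time


def wait_until_ready(check, timeout_s=600.0, interval_s=15.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call check() repeatedly until it returns True or timeout_s elapses.

    Returns True as soon as check() succeeds, False if the timeout is hit.
    sleep and clock are injectable so the helper can be tested quickly.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A call such as `wait_until_ready(my_probe, timeout_s=900)` would then block until the probe passes or 15 minutes have gone by.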

squeakybadger · November 26, 2018, 5:59pm

I give it 5-10 mins after starting the infrastructure.

If I ignore the error and continue on, either the slaves won’t appear in the Deadline Monitor, or they don’t start the job (can’t remember which, I haven’t gone ahead with it for a while).

I’ve had our network guy have a fiddle with the firewall ports, so I’ll give it another bash.

cmoore · December 4, 2018, 9:49pm

You should not have to open any inbound ports in your firewall to make AWS Portal work. You only need outbound port 22 and outbound HTTPS to AWS endpoints. If slaves are not connecting, we want to look at the AWS Portal Link logs from that point in time to see if you are getting connection errors. AWS Portal Link handles establishing SSH tunnels to the Gateway machine living in the AWS infrastructure.

https://docs.thinkboxsoftware.com/products/deadline/10.0/1_User%20Manual/manual/aws-portal-troubleshooting/aws-portal-log-locations.html

If you alternate between more than one public IP in your local network, it is possible that the wrong public IP is getting set on your Gateway’s security group for port 22. Troubleshooting documents on slaves not connecting to the local repository can be found here:

https://docs.thinkboxsoftware.com/products/deadline/10.0/1_User%20Manual/manual/aws-portal-troubleshooting/slaves-not-connecting.html
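To check the public IP theory by hand, you can compare the address your site currently presents with the CIDR ranges allowed on port 22 in the Gateway's security group. A small sketch, assuming you have already looked up both values yourself (for example in the EC2 console and with an IP lookup service). The addresses in the usage lines are documentation placeholders, not real values.

```python
import ipaddress


def ip_allowed(public_ip, allowed_cidrs):
    """Return True if public_ip falls inside any of the allowed CIDR ranges."""
    addr = ipaddress.ip_address(public_ip)
    return any(
        addr in ipaddress.ip_network(cidr, strict=False)
        for cidr in allowed_cidrs
    )


if __name__ == "__main__":
    # Placeholder values; substitute your own public IP and the port 22 rules.
    current_ip = "203.0.113.7"
    port_22_rules = ["198.51.100.9/32"]
    print("current IP allowed on port 22:", ip_allowed(current_ip, port_22_rules))
```

If this reports that the address is not allowed, the SSH tunnel from AWS Portal Link would be blocked, which would fit the alternating-IP explanation above.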

Regards,

Charles

JamieMurray · December 5, 2018, 2:38pm

Hi Charles,

I’m only now getting the chance to test out an AWS infrastructure we spun up earlier in the week, and when I go to spawn a spot fleet I’m getting the exact same error.

Currently using Deadline 10.0.21.5 onsite and I believe 10.0.22.3 offsite. We haven’t customised anything at all permissions wise on the external infrastructure other than to connect to EFS storage within the VPC.

We’re using the stock Gateway instance as set in the infrastructure for 10.0.21.5. There’s no port blocking nor anything trying to use it.

I’ve tried rebooting the gateway instance and restarting AWS Portal Link and the Deadline RCS process, all to no avail.

We have not deployed AWS asset server here as we’re managing our assets using our own introspection and asset transfer systems. We’re all good on EC2 Limits etc as well.

Logs from console view show the below:

2018-12-05 14:50:45: Error when attempting to communicate with the CentralServer on your Infrastructure: Status(StatusCode=Unknown, Detail="Exception in central controller: <_Rendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = "Socket closed"\n\tdebug_error_string = "{"created":"@1544021445.289086347","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1099,"grpc_message":"Socket closed","grpc_status":14}"\n>")
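For reference, the `grpc_status` number in a line like this maps to a standard gRPC status code, and 14 is UNAVAILABLE, which matches the `StatusCode.UNAVAILABLE` in the same message. Below is a small sketch for pulling those fields out of a console log line. The function name and the regular expressions are my own, based only on the line above, and the lookup table is a deliberately partial subset of the gRPC codes.

```python
import re

# Partial subset of the standard gRPC status codes.
GRPC_STATUS_NAMES = {
    0: "OK",
    4: "DEADLINE_EXCEEDED",
    7: "PERMISSION_DENIED",
    13: "INTERNAL",
    14: "UNAVAILABLE",
    16: "UNAUTHENTICATED",
}


def parse_grpc_error(line):
    """Extract grpc_status and grpc_message from a log line, or return None."""
    status = re.search(r'"grpc_status":(\d+)', line)
    if status is None:
        return None
    message = re.search(r'"grpc_message":"([^"]*)"', line)
    code = int(status.group(1))
    return {
        "code": code,
        "name": GRPC_STATUS_NAMES.get(code, "NOT_IN_TABLE"),
        "message": message.group(1) if message else None,
    }
```

UNAVAILABLE with "Socket closed" generally means the connection to the peer dropped or was never properly established, so it points at connectivity rather than at permissions.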

Should add that even if we say ‘create spot fleet anyway’, it continues to the “no AWS asset service running” warning, which is fine, and we continue, but it then gives us the below in the console along with an invalid attribute error:

2018-12-05 14:58:36: Traceback (most recent call last):

2018-12-05 14:58:36: File "C:\Anaconda3\conda-bld\deadline_1539267647748\work\DeadlineProject\DeadlineUI\Commands\DashCommands.py", line 501, in InnerExecute

2018-12-05 14:58:36: AttributeError: 'NoneType' object has no attribute 'Length'

Have run some analysis, and the process that’s running the AWS Portal Link service is present and correct and is listening on port 4001 onsite.

Can we just hand create spot fleets in AWS console till we get a solution?

Cheers

Jamie

eamsler · December 5, 2018, 4:44pm

The issue with hand-creating fleets is that you need to provide the user data field we pass into the instances. I’ll see if I can get an example of that, but given that the pressure is building on your side, I’d rather we schedule a call. I’ll set something up for tomorrow at 3:30pm your time and use your contact info from past support requests to invite you.

Update: I’m kicking Chance out of our meeting.

cmoore · December 5, 2018, 5:48pm

The 4001 error comes from a status check that tests whether the central asset server (a component of the asset server on the Gateway) can communicate with the on-premise controller. The host you are running the GUI on requests a status check from the AWS Portal machine to see if they are communicating.

Two known reasons:

#1 - The Gateway + Daemons have not fully initialized yet.
#2 - Issues with AWS Portal establishing communication to the Gateway. Check AWS Portal Link logs.

https://docs.thinkboxsoftware.com/products/deadline/10.0/1_User%20Manual/manual/aws-portal-troubleshooting/aws-portal-log-locations.html

Jamie - You are most likely getting this error because you do not use the asset server component. So all the status checks are failing. Is the attribute error stopping you from launching a fleet through the UI?

JamieMurray · December 6, 2018, 1:50pm

Hi Edwin,

More than happy to take the call and have accepted invite.

Ahead of the call I can take some time now and re-deploy a new infrastructure with both AWS Portal Link and Asset Manager elements installed to test if this is causing the problem.

cheers

Jamie

JamieMurray · December 6, 2018, 1:54pm

Interesting to note that when executing AWS Portal Link exe I’m seeing the below warning:

eamsler · December 6, 2018, 3:20pm

Silly question we’ll answer in a bit here, but did you uninstall the old version and install the version that ships with your version of Deadline?

RomboBellogia · December 6, 2018, 4:20pm

Hi, we’re going through the same issue here, we can force start the spot fleet, but then none of the slaves get anything from even the simplest jobs.

JamieMurray · December 6, 2018, 4:36pm

Hi there,

On the call I was finishing up the last of the re-install of AWS portal link and asset server services. I used fresh AWS credentials on the asset server service installation. Thanks for taking time to call Edwin, helpful as ever.

We’re able to spin up a new infrastructure and spawn some spot instances. I’ve not got to the point where we’re rendering just yet, so I will give it a test next week once back from leave. It would appear it was the lack of the AWS Asset Server service that was tripping us up on this build (10.0.21.5).

Suspect I’d like to strip out my AWS services one more time and set them up using new IAM credentials for sanity’s sake. @eamsler : Where do I ‘remove’ the preferences from so that AWS setup is kind of asking for these new credentials again?

Cheers

Jamie

eamsler · December 20, 2018, 3:24pm

@RomboBellogia have you made any progress on your end?

RomboBellogia · December 21, 2018, 12:10pm

Hi Eamsler, we were in a hurry to render before setting this up, and the trouble we had getting the license servers to run, even with all the firewalls down for testing purposes, made us give up in favor of a more traditional and simpler renderfarm service. I will come back to this after this job, as I see this to be the way to go, but for now it is still too tricky for smaller studios to use.
I will try again soon.