GCSv5.4 Troubleshooting Guide
- 1. Introduction
- 2. Troubleshooting Firewall Issues
- 3. Troubleshooting Certificate Issues
- Appendix A: Globus Logging Locations
- Appendix B: Obtaining Debug Log Events
1. Introduction
This document discusses methods for troubleshooting common issues with GCSv5.4-based endpoints.
2. Troubleshooting Firewall Issues
Firewall issues can be a source of problems for endpoint admins and users. Below we’ll discuss some of the most common firewall-related issues.
2.1. Troubleshooting GCS Manager Related Firewall Issues
The GCS Manager provides the interface for your endpoint that the Globus service uses to communicate with and manage it. If the GCS Manager on your endpoint is not accessible, you will encounter problems when attempting to use your endpoint.
2.1.1. Common Errors
Below are some common errors that may indicate the GCS Manager for the involved endpoint is not accessible.
Example Timeout Error
You may see the above error when attempting to browse a collection hosted on your endpoint in the Globus file manager. This sort of error could also be due to attempting to access a directory with a very large number of files and sub-directories.
If you see this sort of error for all paths you attempt to browse for all collections on an endpoint, then you should suspect an issue with the GCS Manager.
If you only see this issue when browsing particular paths, then this issue is likely due to the path in question simply having too many files and sub-directories for them to be listed in a timely manner. In such a case, arranging the directory in question so as to have a smaller number of entries should resolve the issue.
Example Unexpected Error
You may see the above error when attempting to modify the properties of your endpoint, the subscription status of your endpoint, or the properties of a collection hosted on your endpoint in the Globus webapp. In such a context, this error is usually a sign that GCS Manager on the endpoint is not accessible.
2.1.2. Troubleshooting Steps
To check if the GCS Manager is properly accessible on a given node in your endpoint, we’ll first want to check that the GCS Manager Service and Apache Service are both up and running. You can easily check this by looking at the outputs of the 'globus-connect-server self-diagnostic' command.
If you find that either of these services is down, then the GCS Manager will not be accessible.
If they’re down, you may simply be able to start the services with the 'systemctl start' command.
If you find that they won’t properly start, then you’ll likely want to contact Globus Support for further assistance.
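The service check described above can also be scripted. The following is a minimal sketch, not an official Globus tool. The unit names in the example are assumptions: 'gcs_manager.service' is the name used elsewhere in this guide, while the Apache unit is typically 'httpd.service' on RHEL derived distributions and 'apache2.service' on Debian derived distributions. The SYSTEMCTL variable is a hypothetical hook that lets you substitute a different command.

```shell
# Report which of the given systemd units are not active.
# SYSTEMCTL is a hypothetical override hook; it defaults to the real systemctl.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

inactive_units() {
    # Prints the name of each unit that is not active, one per line.
    # Returns 0 if every unit is active, 1 otherwise.
    status=0
    for unit in "$@"; do
        if ! "$SYSTEMCTL" is-active --quiet "$unit"; then
            echo "$unit"
            status=1
        fi
    done
    return "$status"
}

# Example (unit names are assumptions; adjust for your distribution):
#   inactive_units gcs_manager.service httpd.service
```

Any unit name printed by the function is a candidate for 'systemctl start'.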
Assuming that the needed services are up and running, we’ll next want to see what public DNS shows for the FQDN of your endpoint’s GCS Manager. We’ll start by first looking up the GCS Manager URL by running the following command on one of the nodes in your endpoint:
$ globus-connect-server endpoint show
Display Name: ABC University Endpoint
ID: 00000000-1111-2222-3333-444444444444
Subscription ID: 01234567-89ab-cdef-0123-456789abcdef
Public: True
GCS Manager URL: https://a1b2c.9f8e.data.globus.org
Network Use: normal
Organization: ABC University
We’ll now use 'dig' to check the IP addresses associated with the GCS Manager’s FQDN in DNS like so:
$ dig +short a1b2c.9f8e.data.globus.org
198.51.100.2
198.51.100.3
You should see the public IP address of each node in your endpoint returned.
If you find that the public IP address of the node in question isn’t returned, then that is a sign that the node wasn’t properly deployed.
If you find that a private IP address for a node is returned, then this is also a sign that the node wasn’t properly deployed.
If this endpoint consists of only a single node and you don’t see the expected IP address, then review our documentation here as to how to deploy the first node in an endpoint.
If this endpoint consists of multiple nodes and you don’t see the expected IP addresses, then review our documentation here as to how to deploy additional nodes beyond the first node in an endpoint.
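To spot private addresses in the 'dig' output quickly, something like the following sketch may help. It assumes IPv4 addresses only and treats the RFC 1918 ranges plus loopback as private. The function names are hypothetical and are not part of any Globus tooling.

```shell
# Return 0 if the IPv4 address is in a private (RFC 1918) or loopback range.
is_private_ipv4() {
    case "$1" in
        10.*|192.168.*|127.*) return 0 ;;
        172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
        *) return 1 ;;
    esac
}

# Read one address per line (the format of 'dig +short') and print the
# private ones. Returns 0 if at least one private address was found.
flag_private_addresses() {
    found=1
    while read -r addr; do
        if is_private_ipv4 "$addr"; then
            echo "private: $addr"
            found=0
        fi
    done
    return "$found"
}

# Example:
#   dig +short a1b2c.9f8e.data.globus.org | flag_private_addresses
```

No output means that none of the returned addresses fall in a private range. You still need to confirm that each node’s public IP address is present.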
GCS Manager Connectivity Testing Process
After checking to make sure that your node’s IP address is properly listed in the DNS record for your GCS Manager’s FQDN, we’ll want to attempt to connect to the GCS Manager at that IP address from various different locations to test that it is accessible from those locations.
You’ll want to run these tests in the order presented, as each subsequent test assumes the previous tests were successful, and the suggested diagnoses are not necessarily accurate if the tests are run out of order. The examples below assume that we want to test access to the GCS Manager with an FQDN of "a1b2c.9f8e.data.globus.org" on the 198.51.100.2 system. You will of course want to alter these commands to suit your own tests on your own systems. A successful connection attempt should reach the GCS Manager service and pull down a JSON document containing various bits of information about the endpoint. If you encounter a failure at any step, you’ll need to resolve that issue before moving on to the next step.
The commands used for the 'curl' tests will vary depending on which system we’re running the test from.
The following commands will be run from the terminal on the 198.51.100.2 system:
$ curl -vk --resolve a1b2c.9f8e.data.globus.org:443:127.0.0.1 https://a1b2c.9f8e.data.globus.org/api/info
$ curl -vk --resolve a1b2c.9f8e.data.globus.org:443:198.51.100.2 https://a1b2c.9f8e.data.globus.org/api/info
The following command will be run from the terminal on all other systems:
$ curl -vk --resolve a1b2c.9f8e.data.globus.org:443:198.51.100.2 https://a1b2c.9f8e.data.globus.org/api/info
1) Run the 'curl' test on the 198.51.100.2 system. Please remember that the 'curl' test in this case consists of two commands.
A "timeout" error or a "no route to host" error for the first command suggests a host firewall issue related to policy for self connections directed at loopback.
A success with the first command, but a "timeout" error or a "no route to host" error for the second command suggests a host firewall issue related to policy for self connections directed at the public IP address or possibly a networking issue.
A successful https connection to the GCS Manager address that doesn’t produce the expected output suggests a problem with the GCS Manager itself.
2) Run the 'curl' test on a second host on the same network segment as the 198.51.100.2 host. To be clear, there should be no network firewall between the second host and the 198.51.100.2 host.
A "timeout" error or a "no route to host" error here suggests a host firewall issue related to policy for inbound connections from other hosts.
3) Run the 'curl' test on a third host on your campus network, but on a different subnetwork than the 198.51.100.2 host. For example, if the 198.51.100.2 host is in a DMZ, then the third host should be outside of the DMZ. Both hosts should still be behind the campus border firewall.
A "timeout" error or a "no route to host" error here suggests an issue related to an internal network firewall.
4) Run the 'curl' test on a fourth host that is not on your campus network. The fourth host should be outside of your campus border firewall.
A "timeout" error or a "no route to host" error here suggests an issue related to the campus border firewall.
If you are able to successfully complete the above troubleshooting steps but still find that you’re having problems related to your GCS Manager service, then you’ll likely want to open a ticket with Globus Support to look into the matter further.
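If you run the 'curl' test from several hosts, a small wrapper can make the results easier to compare. The sketch below is one possible approach and is not an official tool. The function names and the default 15 second timeout are assumptions. The exit statuses it interprets (6, 7 and 28) are standard 'curl' exit codes, and the hints it prints are only a starting point for the diagnosis steps above.

```shell
# Translate a curl exit status into a short hint.
explain_curl_exit() {
    case "$1" in
        0)  echo "connected: inspect the response body" ;;
        6)  echo "could not resolve host: check the FQDN" ;;
        7)  echo "failed to connect: suspect a firewall or routing issue" ;;
        28) echo "timed out: suspect a firewall dropping traffic" ;;
        *)  echo "curl exit status $1: see the curl manual" ;;
    esac
}

# Probe the GCS Manager at a specific IP address.
# $1 = GCS Manager FQDN, $2 = IP address to test,
# $3 = timeout in seconds (optional, defaults to 15)
probe_gcs_manager() {
    curl -sk --max-time "${3:-15}" \
        --resolve "$1:443:$2" \
        -o /dev/null "https://$1/api/info"
    rc=$?
    echo "$2: $(explain_curl_exit "$rc")"
    return "$rc"
}

# Example, run on the 198.51.100.2 system:
#   probe_gcs_manager a1b2c.9f8e.data.globus.org 127.0.0.1
#   probe_gcs_manager a1b2c.9f8e.data.globus.org 198.51.100.2
```

The wrapper discards the response body, so when a probe reports that it connected, rerun the original 'curl -vk' command to confirm that the expected document is returned.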
2.2. Troubleshooting Data Channel Related Firewall Issues
During a transfer, data is moved between endpoints using Data Channel connections. If there are problems establishing these Data Channel connections between endpoints, then transfers will not work correctly.
2.2.1. Common Errors
You will most likely become aware of Data Channel issues with your endpoint after you or your users notice that transfers to or from your endpoint appear to fail. You can see the details for your transfers by going to the Activity page in the Globus web interface. When looking at the Event Log tab for a job that involves an endpoint that has Data Channel connectivity issues, you’ll see fault events that will clue you in to the problem. We’ll discuss these events generally below.
A fault event will look something like this:
Error (transfer)
Endpoint: XYZ University Endpoint
Server: dtn-hostname.xyzu.edu:2811
File: (Varies)
Command: (Varies)
Message: Data channel authentication failed (This is common for data channel issues, but the 'Message' value may be different)
Details: (See below)
The above is telling us that it is the "dtn-hostname.xyzu.edu" node in the "XYZ University Endpoint" endpoint that is reporting the problem. Faults involving Data Channel issues will often have "Data channel authentication failed" as the fault "Message" value - but this is not always the case. The "Details" field for the fault event will contain a more complete explanation of the nature of the fault event. There are many possible variations on the values that could be seen for this field for a fault event related to data channel issues.
We’ll go over a few representative examples to give admins a better idea of what this field might be telling them. All "Details" field examples are given within the context of being part of a fault event with the other fault field values set as shown above.
Example Data Channel Error A - Connection Reset By Peer
Details: 500-Command failed. : globus_xio: The GSI XIO driver failed to establish a secure connection. The failure occured during a handshake read.\r\n500-globus_xio: System error in recv: Connection reset by peer\r\n500-globus_xio: A system call failed: Connection reset by peer\r\n500 End.\r\n
Example Data Channel Error B - An Existing Connection Was Forcibly Closed By The Remote Host
Details: 500-Command failed. : globus_xio: The GSI XIO driver failed to establish a secure connection. The failure occured during a handshake read.\r\n500-globus_xio: System error in recv: Unknown error\r\n500-globus_xio: A system call failed: An existing connection was forcibly closed by the remote host.\r\r\n500-\r\n500 End.\r\n
The examples above are telling us that the endpoint reporting the problem encountered an issue when attempting to negotiate a data channel session with the remote endpoint. A "Details" field value like this often means that the session negotiation process was able to start, but something interfered with that process. Firewall policy is often the root cause of such issues - especially policy that selectively filters ssl/tls traffic.
Example Data Channel Error C - Connection Timed Out
Details: 500-Command failed. : globus_gridftp_server_file.c:globus_l_gfs_file_server_write_cb:3163: \r\n500-callback failed.\r\n500-globus_xio_tcp_driver.c:globus_l_xio_tcp_system_connect_cb:2022: \r\n500-Unable to connect to 198.51.100.10:50672\r\n500-globus_xio_system_select.c: globus_l_xio_system_handle_write:1108:\r\n500-System error in connect: Connection timed out\r\n500-globus_xio: A system call failed: Connection timed out\r\n500 End.\r\n
Example Data Channel Error D - Connection Timed Out
Details: 500-Command failed. : globus_xio: The GSI XIO driver failed to establish a connection via the underlying protocol.\r\n500-globus_xio: Unable to connect to 198.51.100.10:50778\r\n500-globus_xio: System error in connect: Connection timed out\r\n500-globus_xio: A system call failed: Connection timed out\r\n500 End.\r\n
The examples above are telling us that the endpoint reporting the problem encountered an issue when attempting to negotiate a data channel session with the remote endpoint. A "Details" field value like this often means that the session negotiation process was unable to even be started. This sort of issue is often due to firewall policy blocking traffic in the data port range.
Example Data Channel Error E - Could Not Verify Credential
Details: 500-Command failed. : an authentication operation failed \r\n500-globus_xio_gsi: gss_init_sec_context failed.\r\n500-GSS failure: \r\n500-GSS Major Status: Authentication Failed\r\n 500-GSS Minor Status Error Chain:\r\n500-globus_gsi_gssapi: SSL handshake problems\r\n500-OpenSSL Error: ssl/statem/statem_clnt.c:1914: in library: SSL routines, function tls_process_server_certificate: certificate verify failed\r\n500-globus_gsi_callback_module: Could not verify credential\r\n500-globus_gsi_callback_module: Can't get the local trusted CA certificate: Untrusted self-signed certificate in chain with hash d4c3b2a1\r\n500-\r\n500 End.\r\n
This sort of error tells us that the endpoint doesn’t trust the cert being offered for the data channel connection. This generally only happens if there is something interfering with the establishment of the data channel session between the two endpoints involved in the transfer. Data channel traffic looks similar to https traffic in some ways, so firewall or network policy designed to limit or monitor such traffic can interfere with the establishment of data channel sessions between endpoints.
We sometimes see these sorts of errors for endpoints located behind https intercept proxies or similar devices. Globus data channel traffic cannot be proxied in this way. Sites that do operate with policy designed to intercept https/ssl traffic will need to configure exceptions for Globus data channel traffic for endpoints operating on their network.
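As a rough triage aid, the representative "Details" values above can be matched by a few key phrases. The sketch below only encodes the interpretations given in this section. The function name is hypothetical, and real "Details" values vary, so treat its output as a first guess rather than a diagnosis.

```shell
# Map the "Details" text of a fault event to the likely cause discussed above.
classify_details() {
    case "$1" in
        *"Could not verify credential"*)
            echo "untrusted certificate: look for an https intercept proxy" ;;
        *"Connection reset by peer"*|*"forcibly closed by the remote host"*)
            echo "handshake interrupted: look for policy filtering ssl/tls traffic" ;;
        *"Connection timed out"*)
            echo "connection never started: look for a blocked data port range" ;;
        *)
            echo "unrecognized: review the full Details text" ;;
    esac
}

# Example:
#   classify_details "500-globus_xio: System error in connect: Connection timed out"
```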
2.2.2. Troubleshooting Steps
If you have not already done so, you’ll want to read our doc discussing the basics of Data Channel traffic.
When troubleshooting Data Channel issues it’s important to remember all transfers involve a source endpoint and a destination endpoint and that the factors causing the issues could be located at the site of the source endpoint, the destination endpoint, or possibly even both. You’ll want to first verify that firewall policy at your own site is consistent with the requirements for GCSv5.4 as given in our doc.
Data Channel traffic looks very similar to ssl/tls traffic in many ways, so you can test whether firewall policy at your site permits Data Channel traffic to and from your endpoint using tools that allow you to set up ssl/tls sessions. A convenient way to do this is to use the 'openssl' utility.
We’ll want to create a cert and key pair that we can use for the tests we do with the 'openssl' utility. We’ll then use that cert and key pair in our commands to create a simple 'openssl' listener and a simple 'openssl' client that will connect to the listener. It is important that both the 'openssl' listener and client use this same cert and key pair or the tests discussed will not work correctly.
A successful connection attempt from the client to the listener will generate debug outputs on both sides showing a successful ssl session negotiation. Users at the terminal on both sides can communicate back and forth by simulating a sort of crude text chat via typing messages into the terminal and pressing the ENTER key, so long as the connection was successful and remains in place. By checking to ensure that the 'openssl' client is able to connect to the 'openssl' listener, and verifying that the client and listener are able to 'talk' back and forth to each other, we can determine if data channel traffic appears to be blocked and can also get an idea as to where it might be blocked.
In addition to ensuring that communication is possible over the connection between the 'openssl' listener and client, you’ll also want to verify that the cert offered by the listener in the connection test actually matches what the listener is expected to offer. When the session is initiated between the client and listener, the outputs in the terminal of the client will show the cert for the listener that the client was presented with for the connection attempt and will also show the verification status for that cert. The operator of the client will want to ensure that they see the line "Verify return code: 0 (ok)" for the cert offered by the listener.
If any return code other than '0 (ok)' is shown in the outputs for the client for the cert offered by the listener, then the operator of the client will want to compare the cert offered by the listener in the outputs to the local copy of the cert that was initially created for use in the testing. The reason you want to manually verify this is that it can sometimes happen that sites can have devices set up (https intercept proxy or similar) which will replace certs for ssl sessions so as to allow the device to monitor (intercept) the connection. This sort of behavior will interfere with data channel traffic, so we’ll want to catch this if it’s happening.
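The verification check described above can be automated by scanning the client’s output for the "Verify return code" line. This is a minimal sketch that assumes you pipe or save the output of the 'openssl' client. The function name and its return values are hypothetical.

```shell
# Read 'openssl s_client' output on stdin and report the verification result.
# Returns 0 for "Verify return code: 0 (ok)", 1 for any other return code,
# and 2 if no verification line is present at all.
check_verify_result() {
    line=$(grep 'Verify return code:' | head -n 1 | sed 's/^[[:space:]]*//')
    case "$line" in
        "Verify return code: 0 (ok)"*) echo "ok"; return 0 ;;
        "") echo "no verification result found"; return 2 ;;
        *)  echo "mismatch suspected: $line"; return 1 ;;
    esac
}

# Example (redirecting stdin from /dev/null makes the client exit after the handshake):
#   openssl s_client -tls1_2 -connect 198.51.100.20:50500 -key key.pem \
#       -cert cert.pem -CAfile cert.pem </dev/null 2>&1 | check_verify_result
```

A result other than "ok" is the cue to compare the cert shown in the client’s output against your local copy of the test cert.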
Data Channel Connectivity Testing Process
We’ll use a few 'openssl' commands to create our cert/key pair and to create our 'openssl' listeners and clients that we’ll use for the testing.
The cert and key pair we’ll use can be created using the 'openssl' utility. The cert/key pair will be generated only once and that same cert/key pair will then need to be copied to each system involved in the testing.
openssl req -x509 -newkey rsa:2048 -nodes -sha256 -days 7 -subj "/C=US/O=Globus Online/CN=FXP DCAU Cert" -keyout key.pem -out cert.pem
A simple ssl listener can be created using the 'openssl' command.
openssl s_server -tls1_2 -port 50500 -key key.pem -cert cert.pem -CAfile cert.pem
The 'openssl' utility can be used to connect to such a listener as a client.
openssl s_client -tls1_2 -connect 198.51.100.20:50500 -key key.pem -cert cert.pem -CAfile cert.pem
To troubleshoot suspected issues with inbound Data Channel connections to your endpoint, you’ll want to follow the steps below. You’ll want to perform these steps in the order presented, as each subsequent step assumes the previous steps were successful, and the suggested diagnoses are not necessarily accurate if the steps are run out of order. If you encounter a failure at any step, you’ll need to resolve that issue before moving on to subsequent steps.
1) Set up an 'openssl' listener on the system hosting your endpoint bound to a port in the data port range. Attempt to connect to that 'openssl' listener with an 'openssl' client on the same host via both loopback and your system’s public IP address.
A failure with the connection to loopback suggests a host firewall policy issue related to self connections directed at loopback.
A failure with the connection to the system’s public IP address suggests either a host firewall issue related to self connections directed at the system’s public IP address or possibly a networking issue.
2) Set up an 'openssl' listener on the system hosting your endpoint as discussed previously. Attempt to connect to that 'openssl' listener with an 'openssl' client on a second host on the same network segment as the system hosting the endpoint. To be clear, there should be no network firewall between these hosts.
A failure here suggests an issue with host firewall policy related to inbound connections from other hosts.
3) Set up an 'openssl' listener on the system hosting your endpoint as discussed previously. Attempt to connect to that 'openssl' listener with an 'openssl' client on a third host on your campus network, but on a different subnetwork than the system hosting the endpoint. For example, if the system hosting the endpoint is in a DMZ, then the third host should be outside of the DMZ. Both hosts should still be behind the campus border firewall.
A failure here suggests an issue related to an internal network firewall.
4) Set up an 'openssl' listener on the system hosting your endpoint as discussed previously. Attempt to connect to that 'openssl' listener with an 'openssl' client on a fourth host that is not on your campus network. The fourth host should be outside of your campus border firewall.
A failure here suggests an issue related to the campus border firewall.
To troubleshoot suspected issues with outbound Data Channel connections from your endpoint you’ll use the same process described above for troubleshooting issues with inbound Data Channel traffic, except you’ll swap the locations where the 'openssl' listener and 'openssl' client are located.
If an admin at site A has gone through the above steps and found that their endpoint passes the tests for both inbound and outbound data channel connections, but there still appears to be a data channel issue with transfers involving endpoints at some other site B, the next step is to reach out to the admins at site B and request that they verify data channel connectivity for the systems hosting their endpoints in the same manner.
If the site B admins report success in such verification, but data channel issues for transfers between endpoints at site A and site B persist, the next step is to attempt to directly test data channel connectivity between the endpoints involved. This is done simply by setting up an 'openssl' listener on one endpoint and attempting to connect to it with an 'openssl' client on the other endpoint. It is important to remember that a proper test will involve a set of tests in which each node in each endpoint participates running as the 'openssl' client and also separately as the 'openssl' listener so that the ability to establish data channel sessions in both directions (from site A to site B, as well as from site B to site A) is properly tested.
If admins at both site A and site B find that they are not able to properly establish ssl sessions in one (or both) directions between the systems hosting their endpoints, then they will need to reach out to their networking teams for further assistance to attempt to discover why this is so.
If you are able to successfully complete the above troubleshooting steps but still find that you’re having problems related to Data Channel traffic on your endpoint, then you’ll likely want to open a ticket with Globus Support to look into the matter further.
3. Troubleshooting Certificate Issues
The GCSv5.4 software makes use of certificates so that your endpoint can identify itself and interoperate with other parts of the Globus ecosystem. These certificates must be valid or your endpoint will not work correctly.
3.1. Troubleshooting Certificate Expiration Issues
If the certificates being used by your endpoint expire then your endpoint will stop working.
3.1.1. Common Errors
Below are some common errors that may indicate that your endpoint’s certificate has expired.
Example Certificate Has Expired Error
Command Failed: Error (connect)
Endpoint: XYZ University Endpoint
Server: dtn-hostname.xyzu.edu:443
Message: Could not connect to server
---
Details: an authentication operation failed\nglobus_xio_gsi: gss_init_sec_context failed.\nGSS failure: \nGSS Major Status: Authentication Failed\nGSS Minor Status Error Chain:\nglobus_gsi_gssapi: SSL handshake problems\nOpenSSL Error: ../ssl/statem/statem_clnt.c:1913: in library: SSL routines, function tls_process_server_certificate: certificate verify failed\nglobus_gsi_callback_module: Could not verify credential\nglobus_gsi_callback_module: The certificate has expired: Credential with subject: /CN=a1b2c.9f8e.data.globus.org has expired.\n\n
This error is letting us know that the certificate being used by the "XYZ University Endpoint" endpoint has expired. This error will most commonly be encountered by users of the Globus webapp attempting to access a collection on an endpoint with an expired certificate.
3.1.2. Troubleshooting Steps
By default, the GCSv5.4 software will configure your endpoint to use a certificate issued by the Let’s Encrypt service. The GCSv5.4 software will automatically renew such a certificate for you. If the automatic renewal process for the Let’s Encrypt certificate malfunctions in some way, then the certificate will expire and you can see errors such as shown above. This automatic renewal process is handled by the GCS Manager Assistant Service. If this service is down, then the automatic renewal of the endpoint’s Let’s Encrypt certificate will fail.
You can check the status of the GCS Manager Assistant Service with this command:
systemctl -l status gcs_manager_assistant.service
If the service is down, then you can try to restart it with a command such as this:
systemctl start gcs_manager_assistant.service
If the service was down, wait ~5 minutes after restarting it before attempting to access your endpoint’s collections so as to give the service time to catch up on its tasks. You’ll also want to ensure that the GCS Manager Assistant Service is enabled so that it will automatically restart when the system is rebooted. If the GCS Manager Assistant Service is disabled, you can enable it with a command such as this:
systemctl enable gcs_manager_assistant.service
It is also possible to configure an endpoint to use a custom certificate issued by a CA other than Let’s Encrypt. Such certificates will NOT be automatically renewed by the GCSv5.4 software, so you will need to handle such renewals yourself.
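Whichever CA issued the certificate, you can check how close it is to expiring with the 'openssl' utility. The following is a sketch rather than an official procedure. The function names and the 14 day default are assumptions, and it assumes that the certificate presented on port 443 of the GCS Manager FQDN is the one of interest.

```shell
# Return 0 if the PEM certificate in file $1 is still valid $2 days from now
# (default 14), and 1 if it will have expired by then.
cert_valid_for_days() {
    openssl x509 -noout -checkend "$(( ${2:-14} * 86400 ))" -in "$1" >/dev/null
}

# Fetch the certificate a server presents and save it as PEM.
# $1 = host name, $2 = output file, $3 = port (optional, defaults to 443)
fetch_server_cert() {
    openssl s_client -connect "$1:${3:-443}" -servername "$1" </dev/null 2>/dev/null \
        | openssl x509 -outform PEM > "$2"
}

# Example:
#   fetch_server_cert a1b2c.9f8e.data.globus.org /tmp/gcs-cert.pem
#   cert_valid_for_days /tmp/gcs-cert.pem 14 || echo "certificate expires within 14 days"
```

Running a check like this periodically gives advance warning when a manually managed certificate is due for renewal.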
If you are able to successfully complete the above troubleshooting steps but still find that you’re having problems related to the certificate on your endpoint, then you’ll likely want to open a ticket with Globus Support to look into the matter further.
Appendix A: Globus Logging Locations
Logging for the Globus Connect Server and GridFTP daemon, as well as HTTPS Transfer logs, can be found in the below locations:
1) Globus Connect Server application log:
2) GridFTP log:
3) HTTPS transfer logs (Apache/HTTPD access and error logs):
4) In addition to standard logging, High Assurance (HA) Endpoint Collections also provide audit logging capabilities which are further detailed on the Globus Connect Server Audit page.
Appendix B: Obtaining Debug Log Events
When troubleshooting an issue, it can often be helpful to obtain debug log events from the GridFTP and GCS Manager services that correspond to the problem you’re having. The following steps will allow you to gather such log events.
1) Enable debug logging for the GridFTP service by creating a file named '/etc/gridftp.d/z_logging' that contains only the following:
2) Enable debug logging for the GCS Manager by creating a file named '/etc/sysconfig/gcs_manager' (for RHEL derived distributions) or '/etc/default/gcs_manager' (for Debian derived distributions) that contains only the following:
After that, you’ll need to restart the service like so:
systemctl restart gcs_manager.service
If you’re simply wanting to enable debug logging for these services then you can stop here. If you’re wanting to capture log events related to a specific action or associated with a particular error that you can reproduce, then continue below.
3) We’ll now put a marker in the log files for the GridFTP service and GCS Manager service to make the logs easier to parse:
for log in /var/log/gridftp.log /var/log/globus-connect-server/gcs-manager/gcs.log; do echo "----Start Test $(date)-----" >> "$log"; done
4) At this point, go ahead and take the actions needed to reproduce the error you’re seeing so we can capture the log events associated with the attempt.
5) We’ll wait 60 seconds after completing step 4, and then put more markers in the log files to make them easier to parse:
for log in /var/log/gridftp.log /var/log/globus-connect-server/gcs-manager/gcs.log; do echo "----End Test $(date)-----" >> "$log"; done
At this point, you can create a copy of the marked portions of the '/var/log/gridftp.log' and '/var/log/globus-connect-server/gcs-manager/gcs.log' files and use them to assist in your own troubleshooting or provide them to Globus support if you’ve been directed to follow these steps in a support ticket.
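One way to copy out the marked portions is with 'awk'. The sketch below assumes the markers were written exactly as in steps 3 and 5. The function name and the output file names in the example are hypothetical.

```shell
# Print the lines of log file $1 that lie between the Start Test and End Test
# markers written in steps 3 and 5, with the marker lines included.
extract_marked() {
    awk '/----Start Test /{inside=1} inside{print} /----End Test /{inside=0}' "$1"
}

# Example:
#   extract_marked /var/log/gridftp.log > gridftp-debug.log
#   extract_marked /var/log/globus-connect-server/gcs-manager/gcs.log > gcs-debug.log
```

If a log contains more than one marked test run, all of the marked regions are printed in order.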