Automating Transfer and Sharing of Data from Instruments
Introduction
This document addresses common use cases for data-intensive scientific instruments and core facilities, specifically for making data collected at the instrument accessible. After an experimental run, the collected data needs to be transferred from the acquisition machine to centralized storage and shared with researchers and their collaborators. This document outlines steps for automating this process, including methods for automatically triggering data transfers once data processing is complete.
This document addresses the following user story:
As a facility operator, I want to provide automated tools for moving data gathered at an instrument off the acquisition machine, so that PIs and researchers can efficiently manage access to the data and ensure its availability for analysis and collaboration.
Solution overview
This use case uses the following constructs in the Globus ecosystem:
- Globus guest collections on the instrument machine and on the destination storage, for data transfers and sharing. Note: this requires an institutional subscription.
- A service account/application credential (client ID and secret) for secure data and permissions management independent of staff rotations.
- Globus Groups for convenient roles and permissions management.
- Globus Flows to define a set of steps once and execute it reliably and repeatedly in response to a variety of triggers (manual, scheduled, or event-driven).
One key consideration throughout is which accounts and identities are used to install and configure the various entities. Administrator accounts can be used, and permissions can be granted for other staff to manage these entities. For example, the flow used for automation can be deployed under an administrator’s identity, with permissions set for others to update, manage, and monitor it. Alternatively, service accounts or application credentials can be used to create and own these entities, which ensures the resources are not tied to any particular user’s account. This document outlines the setup using the administrator’s identity and refers to tools available for setting it up with service accounts.
Steps
1. Application credential for task automation
In addition to user identities, Globus also supports identities and credentials specifically for applications, which can be used to automate end-to-end workflows. This application identity enables automated actions, such as running a flow to transfer data and granting a user or group permission to access data at the destination.
To create an application credential, register the application as a client with Globus, which generates a unique identifier (the client ID) and a secret. Registered applications receive an identity in the format: CLIENT_ID@clients.auth.globus.org
This application identity functions like any other identity within the Globus ecosystem and can be used to assign permissions and roles in various services. For instance, the application can be assigned roles like 'monitor' on an endpoint, 'read' or 'write' on a collection, or 'executor' on a flow.
- Follow the instructions in Step 2.1 of this recipe to create an application identity. Be sure to copy and save the secret, as it will be required later.
- Finally, ensure that other administrators are set up on the project where the application credential is registered to enable project management and continuity (see Managing projects).
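Once the client ID and secret are saved, automation scripts can authenticate as this application identity with the Globus Python SDK using the client credentials grant. The sketch below is a minimal example that obtains a Transfer token; the placeholder values are assumptions, and the scopes requested will vary with the services being automated.

```python
import globus_sdk

# Placeholders: the client ID and secret saved in Step 1.
CLIENT_ID = "YOUR_CLIENT_ID"
CLIENT_SECRET = "YOUR_CLIENT_SECRET"

# Authenticate as the application identity (CLIENT_ID@clients.auth.globus.org).
auth_client = globus_sdk.ConfidentialAppAuthClient(CLIENT_ID, CLIENT_SECRET)

# Request a Transfer token via the client credentials grant.
tokens = auth_client.oauth2_client_credentials_tokens(
    requested_scopes=globus_sdk.scopes.TransferScopes.all
)
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

# A TransferClient that acts as the application identity.
transfer_client = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
)
```

The same pattern applies to other Globus services (for example, Flows or Groups) by requesting their scopes instead of the Transfer scope.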
2. Data access on the acquisition machine
Install the Globus Connect agent on the acquisition machine or on a system that has mounted access to the storage where data is written. This setup will enable data transfer from the acquisition machine.
For Windows systems, use Globus Connect Personal (GCP). For Linux systems, you can use either Globus Connect Personal (GCP) or Globus Connect Server (GCS).
- If the acquisition machine runs Windows, install Globus Connect Personal (GCP).
  - Follow the installation guide and the open ports and firewall requirements.
  - The installation can be done using an administrator’s Globus account, with roles configured for others to manage the deployment. Alternatively, installation can be performed with a service account.
  - For Windows machines that cannot be networked due to security restrictions (e.g., if the OS version is outdated or unsupported), consider setting up a dual-homed Linux machine nearby. This machine can mount the acquisition storage and follow the Linux installation recommendations.
  - GCP can be run under the local account that users log into or, ideally, configured to start as a background service under any appropriate local account.
  - Ensure that the GCP installation is configured with read/write and sharing access to the acquisition storage. You may also need to grant the necessary OS-level permissions, especially if the account running GCP differs from the one used by the instrument software.
- If the machine is Linux, install either Globus Connect Server (GCS) or Globus Connect Personal (GCP), depending on network requirements.
  - If the required ports for Globus Connect Server can be opened, install GCS. Refer to the installation guide and check the open ports and firewall requirements. You can complete the installation using an administrator’s Globus account, which allows sharing deployment administration and management permissions with other admins. Alternatively, a service account can be used for installation.
  - If no inbound ports can be opened, install GCP instead. Follow the installation guide and the open ports and firewall requirements.
  - You can run GCP under the local account used by instrument users or the account used by the instrument software. Alternatively, configure it to start as a background service, such as a systemd user unit or another preferred background method.
  - Be sure to configure the GCP installation with read/write and sharing access to the acquisition storage. Additionally, ensure that any required OS-level permissions are granted, especially if the account running GCP differs from the one used by the instrument software.
- Associate your endpoint with your institution’s subscription.
  - If your identity is already a member of your organization’s Globus Subscription group (see how to check), you can do it yourself (instructions).
  - Alternatively, you can request assistance from your subscription administrators or managers. Be sure to include all necessary information in your request (what to include in your request).
- Create a guest collection.
  - The previous steps have generated a mapped collection on the instrument storage. A guest collection, which enables automation, can now be created on this mapped collection. Roles, permissions, and activity will apply to the guest collection rather than the underlying mapped collection. This setup is a one-time step.
  - You can create a single guest collection for multiple instruments and users, or create individual guest collections for each instrument or group, depending on your needs.
  - Follow steps 1 through 5 in the instructions to log in and create a guest collection. Make sure that the base directory for the guest collection encompasses all of the acquisition storage paths you want to be accessible for future transfers.
  - To create the guest collection, you can use the administrator’s identity. Alternatively, if you prefer to use app credentials, follow the steps outlined here; a minimal programmatic sketch follows this list.
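For reference, when the acquisition endpoint is a GCP (or other host) mapped collection, a guest collection can also be created programmatically as a shared endpoint through the Transfer API. This is a minimal sketch assuming an authenticated TransferClient (see the Step 1 sketch), the mapped collection’s ID, and a base path; the display name is illustrative. For GCS v5 collections, guest collections are created through the GCS Manager API instead.

```python
import globus_sdk

def create_acquisition_guest_collection(
    transfer_client: globus_sdk.TransferClient,
    mapped_collection_id: str,
    base_path: str,
) -> str:
    """Create a guest collection (shared endpoint) rooted at base_path."""
    result = transfer_client.create_shared_endpoint(
        {
            "DATA_TYPE": "shared_endpoint",
            "host_endpoint": mapped_collection_id,  # the GCP mapped collection ID
            "host_path": base_path,                 # e.g. "/data/acquisition/" (assumed)
            "display_name": "Instrument Acquisition Data",  # illustrative name
        }
    )
    return result["id"]
```

The returned ID is the guest collection ID referenced in the later steps.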
- Set monitor/manager roles.
  - The steps above have created a guest collection owned and administered by your Globus identity. It is very beneficial to add other staff members with the 'administrator' or 'activity manager' roles on the guest collection. For more details, refer to the GCS documentation, which describes roles that apply to GCP collections as well.
  - To grant roles, navigate to your guest collection in the Globus Web App, click the chevron to expand the collection properties, and go to the "Roles" tab.
  - You can assign roles to individual users or to Globus groups. Using groups can simplify management in the future; for example, instead of updating permissions on individual instrument endpoints, you can add or remove staff members in a single "My Facility Administrators" group.
- Set permissions for the application identity you created in Step 1 of this document.
  - Grant the application identity permission to read data from the guest collection (see Setting permissions using the webapp); a minimal programmatic sketch follows this list.
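For reference, the same read permission can be set programmatically through the Transfer API access rules rather than the web app. This is a minimal sketch assuming an authenticated TransferClient and the identity UUID of the application (the identity corresponding to CLIENT_ID@clients.auth.globus.org):

```python
import globus_sdk

def grant_app_read_access(
    transfer_client: globus_sdk.TransferClient,
    guest_collection_id: str,
    app_identity_id: str,
) -> None:
    """Allow the application identity to read everything in the guest collection."""
    transfer_client.add_endpoint_acl_rule(
        guest_collection_id,
        {
            "DATA_TYPE": "access",
            "principal_type": "identity",
            # Identity UUID of the application; look it up in Globus Auth if needed.
            "principal": app_identity_id,
            "path": "/",
            "permissions": "r",
        },
    )
```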
3. Data access on the target storage
Similar to the acquisition machine above, configure a guest collection on the target storage.
- If your target storage system already has a subscribed Globus Connect Server (GCS) installation, you can skip the installation step. If not, install and subscribe a GCS (or GCP) endpoint as described previously.
- Next, create a guest collection on the target storage system, similar to the one created on the acquisition machine. Assign roles on the guest collection, typically granting access to the same individuals or groups.
- Grant the application identity from Step 1 of this document permission to manage access and read/write data on the guest collection. This requires assigning the identity the access manager role. These permissions can be set through the web application; a minimal programmatic sketch follows this list.
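For reference, these assignments can also be made programmatically. This is a minimal sketch assuming an authenticated TransferClient and the application’s identity UUID; it uses the Transfer API role and access documents, where "access_manager" is the access manager role:

```python
import globus_sdk

def grant_app_destination_access(
    transfer_client: globus_sdk.TransferClient,
    destination_collection_id: str,
    app_identity_id: str,
) -> None:
    """Give the application identity the access manager role and read/write access."""
    # Access manager role: lets the application manage permissions for end users.
    transfer_client.add_endpoint_role(
        destination_collection_id,
        {
            "DATA_TYPE": "role",
            "principal_type": "identity",
            "principal": app_identity_id,
            "role": "access_manager",
        },
    )
    # Read/write permission on the collection, analogous to the earlier sketch.
    transfer_client.add_endpoint_acl_rule(
        destination_collection_id,
        {
            "DATA_TYPE": "access",
            "principal_type": "identity",
            "principal": app_identity_id,
            "path": "/",
            "permissions": "rw",
        },
    )
```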
4. Flows for task automation
Deploy a flow to automate data transfer and set permissions for data access on the target storage system.
- Refer to How to Create a Flow for detailed instructions on creating and deploying a flow.
- A flow definition and input schema designed for transfer and sharing are available here. You can modify this flow using the Globus Flows IDE to suit your specific requirements. (Flows IDE with definition selected)
- It is recommended to maintain the flow definition and input schema in a version control system such as GitHub for better version management and collaboration.
- The flow can be preconfigured with the source guest collection (from Step 2 of this document) and the destination collection (from Step 3 of this document).
- Set roles for other staff members, such as administrator or viewer, to manage or view the flow. Additionally, grant the application credential (from Step 1 of this document) the necessary permissions to execute the flow. For more details, see How to Create a Flow and refer to the role assignment section.
- Grant the starter role to the application credential from Step 1 of this document to allow it to run the flow; a minimal deployment sketch follows this list.
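For reference, the flow can also be deployed and its roles set programmatically with the Globus Python SDK. This is a minimal sketch: the title is illustrative, the definition and input schema would come from the files kept in version control, and principals are passed as identity or group URNs.

```python
import globus_sdk

def deploy_transfer_flow(
    flows_client: globus_sdk.FlowsClient,
    definition: dict,
    input_schema: dict,
    app_identity_urn: str,   # e.g. "urn:globus:auth:identity:<uuid>"
    staff_group_urn: str,    # e.g. "urn:globus:groups:id:<uuid>"
) -> str:
    """Deploy the flow and assign administrator, viewer, and starter roles."""
    flow = flows_client.create_flow(
        title="Instrument data transfer and share",  # illustrative title
        definition=definition,        # flow definition kept in version control
        input_schema=input_schema,    # input schema kept in version control
        flow_administrators=[staff_group_urn],
        flow_viewers=[staff_group_urn],
        flow_starters=[app_identity_urn],  # the application credential from Step 1
    )
    return flow["id"]
```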
5. Triggering a flow run
A common use case for instruments is triggering a flow based on specific events, such as the appearance of a particular file indicating the completion of a task on the instrument. This use case can be handled by deploying a watcher script, which monitors for the desired event and triggers the flow when the event occurs.
- If no automation is required, a run of the flow can be started using the Globus Web App or the CLI.
- For automation, we provide a watcher script template and several examples of flows that can be initiated by trigger events here: Globus Flows Trigger Examples (a minimal sketch also appears at the end of this section).
  - The example script utilizes inotify to monitor the file system for changes and trigger a flow run when the desired event occurs.
  - Use the application credential from Step 1 of this document in the trigger script to authenticate and initiate the flow run.
- With either method, it is recommended to set monitor permissions for administrators on the runs. The available permissions and roles are outlined in the documentation, and they can be configured at the time a run is started or at a later time.
- You can also use the Globus CLI to call the flow run command with client credentials; see Client credentials configured for use with CLI.
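The sketch below illustrates the overall pattern of a watcher that starts a run as the application identity. To stay self-contained it polls for a marker file rather than using inotify; the watch directory, marker file name, and flow input keys are assumptions and must be adapted to your instrument and the flow’s input schema.

```python
import time
from pathlib import Path

import globus_sdk

# Placeholders/assumptions for this sketch.
CLIENT_ID = "YOUR_CLIENT_ID"
CLIENT_SECRET = "YOUR_CLIENT_SECRET"
FLOW_ID = "YOUR_FLOW_UUID"
WATCH_DIR = Path("/data/acquisition")  # acquisition storage (assumed path)
DONE_MARKER = "run_complete.txt"       # marker file written by the instrument (assumed)

def start_flow_run(source_path: str) -> None:
    """Start a run of the flow, authenticating as the application identity."""
    auth_client = globus_sdk.ConfidentialAppAuthClient(CLIENT_ID, CLIENT_SECRET)
    flow_scope = globus_sdk.SpecificFlowClient(FLOW_ID).scopes.user
    tokens = auth_client.oauth2_client_credentials_tokens(requested_scopes=flow_scope)
    token = tokens.by_resource_server[FLOW_ID]["access_token"]
    flow_client = globus_sdk.SpecificFlowClient(
        FLOW_ID, authorizer=globus_sdk.AccessTokenAuthorizer(token)
    )
    # The body keys must match the flow's input schema (assumed here).
    flow_client.run_flow(
        body={"source_path": source_path},
        label=f"Instrument transfer: {source_path}",
    )

# Simple polling loop; the linked trigger examples use inotify instead.
seen = set()
while True:
    for marker in WATCH_DIR.glob(f"*/{DONE_MARKER}"):
        if marker not in seen:
            seen.add(marker)
            start_flow_run(str(marker.parent))
    time.sleep(30)
```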
Monitoring and managing runs
Any user who has permissions to monitor the runs can use the Globus Web App, as outlined in this document.