Amazon Web Services S3 Connector
The Globus AWS S3 storage connector enables access to and sharing of data on AWS S3. The connector is available as an add-on subscription to organizations with a Globus subscription - please contact us for pricing.
This document describes how to use the AWS S3 Connector to configure AWS S3 Storage Gateways and Collections. After these steps are complete, any Globus user you have authorized can register a credential to access the AWS S3 buckets they have access to and, if enabled, can create guest collections that share data using those credentials by following the instructions in How To Share Data Using Globus.
This document assumes that you or another administrator has already installed Globus Connect Server v5 on one or more data transfer nodes, and that you have an administrator role on that endpoint.
The installation must be done by a system administrator and consists of the following steps:
- Create a storage gateway on the endpoint configured to use the AWS S3 Connector.
- Create a mapped collection using the AWS S3 Storage Gateway to provide access to AWS S3 data.
Please contact us at support@globus.org if you have questions or need help with installation and use of the AWS S3 Connector.
S3 Connector Virtual Filesystem
The S3 connector provides access to a distributed object store, where each data object is addressed by a bucket name and an object name.
The S3 connector attempts to make this look like a regular filesystem,
by treating the bucket name as the name of a directory in the root of
the storage gateway’s file system. For example, if a user has access
to buckets bucket1 and bucket2, then those buckets would show up as
directories when listing /.
By default, the buckets listed in the root directory are generated by
listing the buckets owned by the registered default key, as well as any
buckets included as a path prefix of additional keys.
However, when the storage gateway --bucket policy option is used, only
those buckets are shown. If bucket listing is not possible with the default
key, no additional keys have been registered, and the storage gateway
--bucket policy is not set, then a root listing is not possible. In that case, users may still access any bucket path directly.
The S3 connector treats the / character as a delimiter in the S3
API so that it can present something that looks like
subdirectories. For example, the object object1 in bucket1 would
appear as /bucket1/object1 to the S3 connector, and the object
object2/object3 in bucket2 would appear as a file called object3
in the directory /bucket2/object2.
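As an illustration only (this is not the connector’s internal implementation), the same directory-like grouping can be reproduced with the AWS CLI by listing a bucket with a / delimiter; the bucket and object names below come from the example above.
% aws s3api list-objects-v2 --bucket bucket2 --prefix object2/ --delimiter /
This returns the key object2/object3, which the connector presents as the file object3 inside the directory /bucket2/object2.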
Authenticated and Anonymous Access
Each S3 Storage Gateway can be configured to perform either authenticated or unauthenticated access to S3 data. When creating an S3 Storage Gateway, you must choose which type of access to require.
- authenticated: Globus users must register an S3 credential with Globus Connect Server in order to access data on its collections. The credential must be associated with a policy that allows the IAM permissions used by the AWS S3 Connector.
- unauthenticated: Globus users can only access public AWS S3 buckets.
AWS S3 Configuration Encryption
All configuration information, including AWS S3 secrets and user credential information, is encrypted with a secret key on the node servicing the request before storing it locally and uploading it to GCS cloud services for distribution to other nodes in the endpoint. The encryption key is only available locally to the node and is secured such that only the node admin has access.
Storage Gateway
An S3 Storage Gateway is created with the command globus-connect-server storage-gateway create s3, and can be updated with the command globus-connect-server storage-gateway update s3.
Before looking into the policy options specific to the AWS S3 Connector, please familiarize yourself with the Globus Connect Server v5 Data Access Guide which describes the steps to create and update a storage gateway, using the POSIX connector as an example. The commands to create and update a storage gateway for the AWS S3 Connector are similar.
S3 Storage Gateway Policies
The --s3-user-credential, --s3-requester-pays, --s3-unauthenticated, --bucket, and --s3-endpoint command-line options control access to an Amazon S3 or compatible resource.
Endpoint
The --s3-endpoint command-line option specifies the URL that Globus Connect Server uses to contact the S3 API when accessing data on this storage gateway. This may be an Amazon S3 URL, a regional Amazon S3 URL, or the endpoint URL of another compatible storage system.
For our example, we’ll use Amazon S3’s standard us-east-1 regional S3 endpoint, which is located at https://s3.amazonaws.com
--s3-endpoint https://s3.amazonaws.com
IPv6 Configuration
This connector supports transferring files over IPv6 networks. This requires
the s3_endpoint value to be one of the Amazon dualstack endpoints. The list
of regions and their dualstack endpoints is available from the
S3 documentation.
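For example, assuming the us-east-1 region (check the S3 documentation for your region’s exact dual-stack hostname), the storage gateway would be configured with:
--s3-endpoint https://s3.dualstack.us-east-1.amazonaws.com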
Access Mode
The --s3-user-credential and --s3-unauthenticated command-line options are mutually exclusive.
If the --s3-user-credential command-line option is enabled, then each user accessing collections on this storage gateway must register an S3 access key id and secret key with the storage gateway.
The --admin-managed-credential command-line option can also be set to allow admins to register an S3 access key id and secret key on behalf of users.
If the --s3-unauthenticated command-line option is enabled, then all accesses to collections on this storage gateway will be done using unauthenticated access. In this case, the root of the S3 Connector Virtual Filesystem will only be able to list buckets that are explicitly made visible by using the --bucket command-line option.
For our example, we’ll create a Storage Gateway that provides authenticated access to data buckets.
--s3-user-credential
Glacier and Archival Restore Support (new in 5.4.90)
If the --s3-restore command-line option is enabled, then requests made to S3 will check the current archival status of an object before downloading. When the object is in one of the Glacier storage classes, or in an archive tier of Intelligent Tiering, a restore request will be issued; the task will monitor the status of that restore and will download the object when it is ready. See the Glacier Restore section below for additional information.
For our example, we’ll create a Storage Gateway that automatically restores objects when necessary.
--s3-restore
Storage Class (new in 5.4.90)
If the --s3-storage-class command-line option is set, then the connector will set the storage class of newly uploaded objects to the provided value. See the Setting Storage Class to Glacier note below if considering setting this to a Glacier storage class.
For our example, we’ll create a Storage Gateway that writes all new objects to
the STANDARD_IA storage class.
--s3-storage-class STANDARD_IA
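As an optional, hedged check that the policy is taking effect (the object key below is a hypothetical placeholder), you can inspect a newly uploaded object’s storage class with the AWS CLI:
% aws s3api head-object \
    --bucket research-data-bucket-1 \
    --key path/to/new-object \
    --query StorageClass
For objects written through this storage gateway the reported value should be STANDARD_IA.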
Requester Pays (new in 5.4.59)
If the --s3-requester-pays command-line option is enabled, then requests made to S3 will include the request-payer parameter which allows the costs associated with those requests to be charged to the AWS account making the request. See the AWS Requester Pays documentation for more information.
The --s3-requester-pays command-line option requires the --s3-user-credential command-line option.
If the --s3-requester-pays command-line option is enabled, S3 operations from mapped and guest collection accesses will be charged to the AWS account associated with the registered user credential. Globus users must acknowledge this behavior when creating the user credential. This can be done when registering credentials via the Globus Web App, or using the globus-connect-server user-credentials s3-create command with the --s3-requester-pays option.
For our example, we’ll create a Storage Gateway that allows access to Requester Pays buckets. Users will need to acknowledge this when registering credentials.
--s3-requester-pays
Bucket Restrictions
The argument to the --bucket command-line option is the name of a bucket that this storage gateway is allowed to access.
For our example, we’ll create a Storage Gateway that restricts access to two buckets owned by our organization: research-data-bucket-1, and research-data-bucket-2. Users will be restricted to only those buckets when using collections created on this storage gateway, and only if their credential has permissions to do so.
--bucket research-data-bucket-1 --bucket research-data-bucket-2
If no buckets are configured, then any buckets accessible using the user’s registered S3 key_id and secret_key may be accessed by collections on this storage gateway. If any are configured, then they act as restrictions to which buckets are visible and accessible on collections on this storage gateway.
Creating the Storage Gateway
Now that we have decided on all our policies, we’ll use the command to create the storage gateway.
% globus-connect-server storage-gateway create s3 \
"S3 Storage Gateway" \
--domain example.org \
--s3-endpoint https://s3.amazonaws.com \
--s3-user-credential \
--bucket research-data-bucket-1 --bucket research-data-bucket-2
Storage Gateway Created: 7187a9a0-68e4-48ea-b3b9-7fd06630f8ab
This was successful, and the command outputs the ID of the new storage gateway (7187a9a0-68e4-48ea-b3b9-7fd06630f8ab in this case) for our reference. Note that this ID will be a unique value each time you run the command. If you forget the ID of a storage gateway, you can always use the command globus-connect-server storage-gateway list to get a list of the storage gateways on the endpoint.
You can also add other policies to configure additional identity mapping and path restriction policies as described in the Globus Connect Server v5 Data Access Guide.
Note that this creates the storage gateway, but does not yet make it accessible via Globus and HTTPS. You’ll need to follow the steps in the next section.
Collection
An AWS S3 Collection is created with the command globus-connect-server collection create, and can be updated with the command globus-connect-server collection update.
As the AWS S3 Connector does not introduce any policies beyond those used by the base collection type, you can follow the sequence in the Collections Section of the Globus Connect Server v5 Data Access Guide. Recall however, that the paths are interpreted as described above in S3 Connector Virtual Filesystem.
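As a minimal sketch, assuming the storage gateway ID from the example above and an arbitrary display name (see the Data Access Guide for the full set of options), a mapped collection exposing the whole virtual filesystem could be created with:
% globus-connect-server collection create \
    7187a9a0-68e4-48ea-b3b9-7fd06630f8ab \
    / \
    "AWS S3 Research Data"
Using a base path such as /research-data-bucket-1/ instead of / scopes the collection to a single bucket, which also removes the need for the s3:ListAllMyBuckets permission described in Appendix A.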
User Credential
As mentioned above, when the storage gateway is configured to provide authenticated access to AWS S3, users must register a default S3 access key and secret key and can optionally register additional keys for accessing specified buckets.
Additional S3 Keys (new in 5.4.78)
Users may register S3 key pairs and associated bucket prefixes in addition to the required default S3 key pair. When accessing an object, the S3 access key associated with the bucket prefix that matches the object’s bucket is used. If no bucket prefix matches the object’s bucket, then the default key is used.
For example, if a user would like to access an object in the bucket lockedbucket,
but their default access key does not have permissions to access lockedbucket,
the user can register an additional key pair with a bucket prefix of /lockedbucket/.
When the connector attempts to access any object in lockedbucket, it will attempt to do so with the additional key, rather than the default key,
because lockedbucket matches the bucket prefix /lockedbucket/.
The following globus-connect-server CLI commands can be used to manage additional keys.
Appendix A: Notes
All registered AWS credentials must have the following IAM permissions when accessing S3:
Required S3 Permissions
In order for the AWS S3 Connector to properly access S3 resources on a user’s behalf, credentials that have been granted the following S3 permissions are required.
- s3:ListAllMyBuckets is required on the * resource to automatically populate the root listing. This is not required when the storage-gateway --bucket option is used, or when a collection base path includes the bucket.
- s3:ListBucket and s3:ListBucketMultipartUploads are required on the bucket resource arn:aws:s3:::[bucket-name].
- s3:GetObject, s3:PutObject, s3:DeleteObject, s3:ListMultipartUploadParts and s3:AbortMultipartUpload are required on the object resource arn:aws:s3:::[bucket-name]/*.
Example JSON policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllBuckets",
"Effect": "Allow",
"Action": [
"s3:ListAllMyBuckets"
],
"Resource": "*"
},
{
"Sid": "Bucket",
"Effect": "Allow",
"Action": [
"s3:ListBucket",
"s3:ListBucketMultipartUploads"
],
"Resource": "arn:aws:s3:::example-bucket"
},
{
"Sid": "Objects",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListMultipartUploadParts",
"s3:AbortMultipartUpload"
],
"Resource": "arn:aws:s3:::example-bucket/*"
}
]
}
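One hedged way to apply this example policy (the IAM user name, policy name, and file name below are placeholders) is to save it to a file and attach it as an inline policy with the AWS CLI:
% aws iam put-user-policy \
    --user-name example-globus-user \
    --policy-name globus-s3-connector-access \
    --policy-document file://globus-s3-policy.json
The access key id and secret key registered with the storage gateway must belong to that IAM user, or to another principal granted an equivalent policy.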
Checksum Verification
The S3 connector uses MD5 for checksum verification. When an MD5 checksum is required, it will either be read from S3 metadata, be calculated from the transfer stream, or require a download of the data from S3.
Checksums can be read from metadata (eTag) on single-part objects. For objects that were previously uploaded by this connector, this means any objects smaller than 500MB by default. However, the single-part threshold will differ if the part size configuration is changed, or if the objects were created by a different client.
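As an illustration of why the eTag optimization works (bucket and object names are taken from the earlier examples, and a local copy of the file is assumed), the eTag of a single-part object that was not encrypted with SSE-KMS matches its MD5 checksum:
% aws s3api head-object --bucket bucket1 --key object1 --query ETag
% md5sum object1
The two values match once the quotes around the eTag are removed; multipart eTags instead carry a -<part count> suffix and are not MD5 checksums.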
When the checksum isn’t available in metadata, it will be calculated on the data stream during the transfer. This calculated checksum will then be used for the subsequent checksum comparison.
When the checksum is neither available in metadata nor from a recent transfer of the object, calculation will require downloading the full object from S3. This can happen in the following scenarios.
- If there is a fault during a transfer of a multi-part object, that operation will be retried and can restart from the middle of the object. Since only part of the data will be transferred, it isn’t possible to calculate the checksum using that transfer stream.
- If you perform a sync transfer using checksum as the comparison, and objects don’t have a checksum available in metadata. Since the checksum request happens before the transfer data stream, this optimization is not possible.
- If the bucket’s encryption method is SSE-KMS. In this case the eTag values are not MD5 checksums, so this optimization is not possible.
Setting Storage Class to Glacier
When setting the storage class policy to the Glacier classes GLACIER or DEEP_ARCHIVE, partial restarts are disabled, and any faults during upload will be retried with a full transfer.
The reason for this is that checksum verification on a restarted transfer requires downloading the full data (see above). Because writes to one of these classes are not available for immediate download, it would not be possible to verify a restarted partial transfer. If the bucket’s encryption method is SSE-KMS, checksum verification is not possible at all; you can use the collection update --verify option to disable verification in this case.
It may be preferable to configure a lifecycle policy to transition objects to these storage classes some time after they are created.
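A hedged sketch of such a lifecycle rule (the bucket name comes from the earlier example; the 30-day delay is arbitrary) using the AWS CLI:
% aws s3api put-bucket-lifecycle-configuration \
    --bucket research-data-bucket-1 \
    --lifecycle-configuration '{
      "Rules": [
        {
          "ID": "archive-after-30-days",
          "Status": "Enabled",
          "Filter": {"Prefix": ""},
          "Transitions": [{"Days": 30, "StorageClass": "DEEP_ARCHIVE"}]
        }
      ]
    }'
This keeps new uploads immediately readable, so checksum verification and restarts behave normally, while still moving data to archival storage later.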
Glacier Restore
Best Practices
- Enabling restore on a storage gateway does not prevent it from accessing objects in non-archival storage classes. However, these operations will be slightly less efficient than with the restore option disabled, as each object must be checked for the need to restore.
- Glacier restore operations can take from minutes to multiple days, depending on the storage class and configuration. The Globus task will monitor the restore progress from the request to the completion of the restore. Objects will be restored in batches, so transfers of large numbers of files will result in multiple batches of restores and transfers before the full task can complete. To improve overall performance when attempting to transfer many files, especially from DEEP_ARCHIVE or when using the Bulk restore tier, you can initiate a restore operation directly with AWS in parallel with the Globus transfer request (a sketch of such a request follows this list). When the connector detects that an object has a restore request in progress, it will wait on that object and transfer it successfully, whether or not the request was initiated by the connector.
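As a hedged sketch (the bucket name comes from the earlier example; the object key is a placeholder), such a restore request can be issued directly with the AWS CLI:
% aws s3api restore-object \
    --bucket research-data-bucket-1 \
    --key path/to/archived-object \
    --restore-request '{"Days": 4, "GlacierJobParameters": {"Tier": "Bulk"}}'
The connector will detect the in-progress restore and transfer the object once it becomes available.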
Configuration
The default when restoring objects is to use the Standard retrieval tier, which works for all archival storage classes; restore times depend on the class (typically 3-5 hours for Glacier Flexible Retrieval and up to 12 hours for DEEP_ARCHIVE).
It is possible to change this setting by editing the restore_tier value in the configuration file /etc/globus/globus-gridftp-server-s3.conf. See the
AWS documentation for more information on the timing and cost of these tiers.
You can also set the restore_days value in the configuration to change the number of days that the restored object will remain accessible. The default value is 4 days. This days parameter does not affect restores from Intelligent Tiering archive classes.
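The option names restore_tier and restore_days come from this guide, but the exact file layout below is an assumption; verify it against the globus-gridftp-server-s3.conf shipped on your data transfer nodes before editing.
# /etc/globus/globus-gridftp-server-s3.conf (assumed key = value layout)
restore_tier = Bulk
restore_days = 10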
Multipart Upload Configuration
By default, multipart uploads created by the connector will use a part size of 500MB. This allows for uploading files near the maximum allowed object size, and strikes a good balance between performance and restart efficiency.
In most cases it is unnecessary to change this default value, but it is possible by editing the part_size value in the configuration file /etc/globus/globus-gridftp-server-s3.conf. Setting a small value will limit the maximum object size, and will make metadata checksums unavailable on objects larger than that size. Setting a large value will limit the points at which a transfer can be restarted after a fault (transfers can only restart at part boundaries).
The part_size configuration must be set the same on all nodes of an endpoint.
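For a rough sense of the trade-off (S3 permits at most 10,000 parts per multipart upload and objects up to 5 TB), the maximum object size scales directly with the part size:
% echo "$(( 500 * 10000 )) MB"   # default 500MB parts: 5,000,000 MB, near the 5 TB object limit
% echo "$(( 100 * 10000 )) MB"   # a 100MB part_size would cap objects at roughly 1 TB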
AWS S3 Storage Connector Modification Times Note
- Because modification times cannot be modified on S3 storage, the timestamp for objects transferred into S3-backed collections will reflect the time/date that the objects are written to S3 storage; however, once transferred out of S3 (to a filesystem that supports modification-time changes) the timestamps will reflect the object’s original modification time from when it was initially transferred.
- The original modification time for objects is stored in the mtime metadata tag on the object(s) (see the example after this list).
- Timestamp preservation is currently only supported for files.
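As a hedged illustration (the object key is a placeholder), the preserved timestamp can be inspected in the object’s user metadata with the AWS CLI:
% aws s3api head-object \
    --bucket research-data-bucket-1 \
    --key path/to/transferred-file \
    --query Metadata
The output includes an mtime entry holding the original modification time.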
Appendix B: Document Types for the AWS S3 Connector
S3StoragePolicies Document
Connector-specific storage gateway policies for the S3 connector
Version 1.1.0 adds support for the s3_requester_pays property.
Version 1.2.0 adds support for the s3_allow_multi_keys property.
Version 1.3.0 adds support for the s3_storage_class and s3_restore properties.
One of the following schemas:
{
"DATA_TYPE": "s3_storage_policies#1.0.0",
"s3_buckets": [
"string"
],
"s3_endpoint": "https://s3.amazonaws.com",
"s3_user_credential_required": true
}
S3UserCredentialPolicies Document
Connector-specific user credential policies for the S3 connector
Version 1.1.0 adds support for the s3_requester_pays property.
Version 1.2.0 adds support for the s3_multi_keys property list.
One of the following schemas:
{
"DATA_TYPE": "s3_user_credential_policies#1.0.0",
"s3_key_id": "string",
"s3_secret_key": "string"
}