This guide provides an overview of making file metadata searchable, resulting
in a Search index of files. Once information about files has been indexed,
the index can be queried for files in a manner similar to searching the filesystem directly with classic tools like `find`.
This guide is written as a companion to the Searchable Files Demo App. The demo app demonstrates a straightforward approach for extracting file metadata and making it searchable, but it is just that: a demo. For a fully fledged solution, users must modify or replace some or all of the demo app. We devote a section of this doc to guidance on how the demo app can be extended or modified.
If you want to jump straight in, the demo app can be run without reading this guide. Then return here for more details on why the demo is a useful starting point!
There are three strong reasons to make file metadata searchable in Globus Search.
No requirement for local access
Using a Searchable Files index instead of `find` or other classic tools means that users don’t need shell access on a system storing data in order to search for files. This is especially appropriate if a dataset is exposed on a Globus Collection, but local accounts cannot or should not be provisioned for the end users.
Search queries do not create filesystem load
When a user runs
`find . -name '*.tiff'`,
`find` recursively traverses directories looking for matching filenames. All of those reads can be difficult for a filesystem to serve, especially on high-utilization multi-user systems. If the filesystem is backed by spinning disk or tape, or serves as a frontend to S3, Ceph, or another storage system, even one
`find` command can be expensive and slow.
When queries are sent to the Search service instead, no local load is created. The work of examining files, as
`find` would do, has been done up-front, and users can query the data hundreds or even thousands of times more efficiently than the filesystem could sustain.
Permissions and visibility from Globus Groups and Globus Auth
Globus Search uses Globus Groups and Globus Auth identities for its visibility and permissions system. These are the same systems used by Globus Transfer and Collections to handle access control for their actual data.
As a result, it is easy to harmonize the visibility of file metadata in a Searchable Files index with the accessibility of that data via Globus Transfer and other applications.
Globus provides a fully open-source demo application for building a Searchable Files index, the Searchable Files Demo App.
The demo application provides a scaffold upon which more complex solutions can be built, and serves as a walkthrough for this type of application. Research computing teams and other users can use all or part of the demo app to build a Searchable Files index, or merely treat it as guidance about how their own indexing needs should be met.
The demo app does not seek to support a wide variety of filesystem types and
use-cases. It operates by doing a recursive directory crawl, just like
`find` would do, to look up file metadata. However, even this simplistic approach
satisfies all of the major goals for a Searchable Files index, as it performs
this operation only once (as opposed to once for every user query).
The application is also written to be run by a single user, either manually or via
`cron`. It does not support filesystem monitoring or
other advanced features out of the box.
The demo app provides several utilities, along with four major components. These provide the main design for a pipeline which indexes filesystem metadata.
The pipeline runs as follows:

/---------------------\
| Metadata Extraction |   extract
\---------------------/
           |
           V
/-----------------------\
| Annotation &          |   assemble
| Visibility Policy     |
\-----------------------/
           |
           V
/-----------------\
| Data Submission |   submit
\-----------------/
           |
           V
/-----------------\
| Task Monitoring |   watch
\-----------------/
The Extractor is a component which either traverses a filesystem or responds
to filesystem events, for example via
inotify. Its job is to produce
meaningful file metadata on a per-file basis, in a format which is understood
by the next stage in the pipeline. The Extractor may be a sophisticated or
unsophisticated script, depending on your exact needs.
Projects like Xtract are purpose-built to get various kinds of metadata from different filetypes and filesystems.
For more advanced file metadata, replace the extractor with a tool which understands your data.
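To make the Extractor's role concrete, here is a minimal sketch of a crawling extractor using only the Python standard library. The field names (`path`, `size_bytes`, etc.) are illustrative, not the demo app's exact schema.

```python
import os
import stat

def extract(root):
    """Walk a directory tree and yield basic metadata for each file.

    A minimal stand-in for an extractor: it records each file's
    relative path, name, size, permission string, and mtime.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            yield {
                "path": os.path.relpath(path, root),
                "filename": name,
                "size_bytes": info.st_size,
                "mode": stat.filemode(info.st_mode),
                "mtime": int(info.st_mtime),
            }
```

A real extractor would layer domain-specific fields (image dimensions, instrument IDs, and so on) on top of this basic stat-derived record.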
The Assembler is the component responsible for consuming extracted file metadata. It may need access to a database, the Globus Transfer API, or other resources in order to make decisions about data visibility, or to add specialized annotations to your datasets.
For example, the filesystem may be organized such that
/projects is a
directory, and files in
/projects/foo/ are part of project
foo. In that
case, the assembler can be customized to read the absolute path to files and
add the attribute
project_name: foo as appropriate. This will enable users to
query for project_name:foo AND filename:*.tiff to find TIFF files
in that project.
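The path-based annotation described above can be sketched as a small pure function. This assumes extractor records carry a `path` field and a `/projects/<name>/...` layout; a real assembler would encode whatever conventions the filesystem actually uses.

```python
def annotate(record):
    """Add a project_name attribute derived from the file's path.

    Hypothetical layout: files under /projects/<name>/ belong to
    project <name>. Records outside that tree are left unchanged.
    """
    parts = record["path"].strip("/").split("/")
    if len(parts) >= 2 and parts[0] == "projects":
        return {**record, "project_name": parts[1]}
    return record
```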
This stage of the pipeline produces documents for Globus Search as its output. The reference for these documents is the Ingest API.
In the demo app, the assembler splits the fields containing
the first 1000 characters of text files and the permissions (`mode`) of files from the
rest of the data, and assigns special visibility to these fields. This is a
technique which can be used generally to make some parts of the metadata
restricted while other parts may be public.
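A sketch of that splitting technique follows. It emits two entries which share a subject but carry different `visible_to` lists, following the shape of Search ingest documents; the entry `id` values and the principal URN in the usage example are illustrative.

```python
def to_entries(subject, public_fields, restricted_fields, restricted_principals):
    """Split one file's metadata into two Search entries on the same
    subject, so that each part can have its own visibility.

    public_fields become world-visible; restricted_fields are visible
    only to the given Globus Auth / Groups principal URNs.
    """
    return [
        {
            "subject": subject,
            "id": "public",
            "visible_to": ["public"],
            "content": public_fields,
        },
        {
            "subject": subject,
            "id": "restricted",
            "visible_to": restricted_principals,
            "content": restricted_fields,
        },
    ]
```

For instance, `to_entries("/data/a.txt", {"filename": "a.txt"}, {"mode": "-rw-r--r--"}, ["urn:globus:auth:identity:SOME_ID"])` keeps the filename public while hiding the permissions from anonymous queries.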
The two final stages of the pipeline are all about getting the resulting data into the Search Index.
When the data is submitted to Search, it will not be immediately available for queries — not until the ingest task which processes that data is complete. Well-behaved Globus Search clients should not only ensure that submission succeeds, but also wait for and monitor the status of the tasks which they create.
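The submit-then-watch pattern can be sketched as a generic polling loop. `get_status` here is any callable returning the task state as a string; with the Globus SDK it could wrap a call like `SearchClient.get_task`. The terminal state names mirror Search's "SUCCESS"/"FAILED" task states.

```python
import time

def wait_for_task(get_status, poll_interval=2.0, timeout=300.0):
    """Poll a task until it reaches a terminal state or times out.

    Returns the terminal state string; raises TimeoutError if the
    task is still running when the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_status()
        if state in ("SUCCESS", "FAILED"):
            return state
        time.sleep(poll_interval)
    raise TimeoutError("task did not complete in time")
```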
For very simple use-cases, it may be possible to use the demo app as-is, or with minimal modifications. The more sophisticated the scenario, the more extensive the changes and replacements will have to be.
Here are a few ways in which the app can be updated to suit various needs.
One of the more straightforward changes possible is to adjust what metadata is computed from files or how it is stored.
For example, using python’s hashlib, the extractor can compute file checksums as part of data extraction.
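A minimal sketch of such a checksum step, reading the file in chunks so large files are never loaded into memory all at once:

```python
import hashlib

def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Compute a file checksum incrementally, 1 MiB at a time."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting hex digest can simply be added as another field in each extracted record.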
For a real Searchable Files index, there’s no need for the index creation
command, or any facility for setting the index ID. Simply replace all loading
of the index_id from storage with a known constant: the ID of the index which
will be used.
The demo app has
login and logout commands which require that a specific
user is used to submit data to Globus Search.
If a new application is created in the
Globus Auth Developers Site, it can have a
client ID, secret, and "client identity". It’s then possible to replace the
login requirement with
Client Credentials Authorization.
In order for this to work, the "client identity" will need permissions to write to the index. This can be achieved with the Role Create API.
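For reference, the underlying token exchange looks like the sketch below, which builds (but does not send) the OAuth2 client credentials request to Globus Auth. In practice the Globus SDK handles this exchange and token refresh automatically; the scope string in the usage note is the Search service's "all" scope.

```python
import base64
import urllib.parse
import urllib.request

def build_client_credentials_request(client_id, client_secret, scope):
    """Build a client_credentials grant request to the Globus Auth
    token endpoint, authenticated via HTTP Basic auth."""
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": scope}
    ).encode()
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    return urllib.request.Request(
        "https://auth.globus.org/v2/oauth2/token",
        data=body,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )
```

For Globus Search, the scope would be `urn:globus:auth:scope:search.api.globus.org:all`; the access token in the response is then used as a Bearer token on Search API calls.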
As written, the various stages in the pipeline are separate commands, each running independently. As these steps are always meant to run in series, there’s no need for these various commands to be separate.
A single command — suitable for running via cron — or a daemon which runs the steps periodically, can replace the entire application.
This change is most suitable once user sign-in has been replaced with client credentials.
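Combined into one command, the pipeline reduces to calling the four stages in series. In this sketch each stage is passed in as a callable; in a real merge of the demo app these would be the existing extract/assemble/submit/watch implementations.

```python
def run_pipeline(root, extract, assemble, submit, watch):
    """Run the four pipeline stages in series as a single command.

    extract(root) yields per-file records, assemble(record) turns each
    into an ingest document, submit(docs) returns a task ID, and
    watch(task_id) blocks until the task completes.
    """
    extracted = list(extract(root))
    documents = [assemble(record) for record in extracted]
    task_id = submit(documents)
    return watch(task_id)
```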
The Linux kernel provides an API for monitoring filesystem events,
inotify, which can be used to watch a directory for new, modified, or deleted files.
These events can, in turn, be used to trigger the same data indexing pipeline
used by the demo app.
On a large filesystem with many events, submitting each file update as a separate task to Globus Search will become slow. If this becomes an issue, the events can be batched and sent every few seconds or minutes.
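The batching logic can be sketched independently of any particular event source. This hypothetical `EventBatcher` accumulates events and flushes them as one submission once the batch is large enough or old enough.

```python
import time

class EventBatcher:
    """Accumulate filesystem events and flush them in batches.

    flush is any callable taking a list of events, e.g. a function
    which submits one ingest task to Globus Search per batch.
    """
    def __init__(self, flush, max_events=100, max_age=5.0):
        self.flush = flush
        self.max_events = max_events
        self.max_age = max_age
        self.pending = []
        self.last_flush = time.monotonic()

    def add(self, event):
        self.pending.append(event)
        now = time.monotonic()
        if len(self.pending) >= self.max_events or now - self.last_flush >= self.max_age:
            self.flush(self.pending)
            self.pending = []
            self.last_flush = now
```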
When a file is removed, whether detected via
inotify or by checking against
some database, it should be removed from the Searchable Files index.
This can be done either via delete-by-query operations or, more simply, using the Delete by Subject API.
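As a sketch, a Delete by Subject call is a DELETE against the index's subject resource, as shown below (the request is built but not sent). With the Globus SDK, `SearchClient.delete_subject` is the usual way to do this; the index ID and token here are placeholders.

```python
import urllib.parse
import urllib.request

def build_delete_by_subject_request(index_id, subject, access_token):
    """Build a Delete by Subject request against the Globus Search API,
    URL-encoding the subject into the query string."""
    url = (
        f"https://search.api.globus.org/v1/index/{index_id}/subject"
        f"?subject={urllib.parse.quote(subject, safe='')}"
    )
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {access_token}"},
        method="DELETE",
    )
```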
In the existing Searchable Files Demo App, the
subject field is always set to
the file's path.
(i.e. Files are identified by their path.)