Searchable Files

This guide gives an overview of how to make file metadata searchable by building a Globus Search index of files. Once information about files has been indexed, the index can be queried for files in much the same way as with find or locate.

This guide is written as a companion to the Searchable Files Demo App. The demo app demonstrates a straightforward approach for extracting file metadata and making it searchable, but it is just that: a demo. For a fully fledged solution, users must modify or replace some or all of the demo app. We devote a section of this doc to guidance on how the demo app can be extended or modified.

Note

If you want to jump straight in, the demo app can be run without reading this guide. Then return here for more details on why the demo is a useful starting point!

Motivation

There are three strong reasons to make file metadata searchable in Globus Search.

No requirement for local access

Using a Searchable Files index instead of find or other classic tools means that users don’t need shell access on a system storing data in order to search for files. This is especially appropriate if a dataset is exposed on a Globus Collection, but local accounts cannot or should not be provisioned for the end users.

Search queries do not create filesystem load

When a user runs find . -name '*.tiff', find recursively traverses directories looking for matching filenames. All of those reads can be difficult for a filesystem to serve, especially on high-utilization multi-user systems. If the filesystem is backed by spinning disk or tape, or is actually a front end to S3, Ceph, or another storage system, even one find command can be expensive and slow.

When queries are sent to the Search service instead, no local load is created. The work of examining files, as find would do, has been done up-front, and users can query the data hundreds or even thousands of times more efficiently than the filesystem could sustain.

Permissions and visibility from Globus Groups and Globus Auth

Globus Search uses Globus Groups and Globus Auth identities for its visibility and permissions system. These are the same systems used by Globus Transfer and Collections to handle access control for their actual data.

As a result, it is easy to harmonize the visibility of file metadata in a Searchable Files index with the accessibility of that data via Globus Transfer and other applications.

The Searchable Files Demo Application

Globus provides a fully open-source demo application for building a Searchable Files index, the Searchable Files Demo App.

The demo application provides a scaffold upon which more complex solutions can be built, and serves as a walkthrough for this type of application. Research computing teams and other users can use all or part of the demo app to build a Searchable Files index, or merely treat it as guidance about how their own indexing needs should be met.

Demo App Limitations

The demo app does not seek to support a wide variety of filesystem types and use-cases. It operates by doing a recursive directory crawl, just like find would do, to look up file metadata. However, even this simple approach satisfies all of the major goals for a Searchable Files index, because the crawl happens once per indexing run rather than once for every user query.

The application is also written to be run by a single user, and to be run either manually or via cron. It does not support filesystem monitoring or other advanced features out of the box.

A Pattern for File Indexing

The demo app provides several utilities, along with four major components. These provide the main design for a pipeline which indexes filesystem metadata.

The pipeline runs as follows:

/---------------------\
| Metadata Extraction |     extract
\---------------------/
        |
        V
/-----------------------\
|     Annotation &      |   assemble
|   Visibility Policy   |
\-----------------------/
        |
        V
/-----------------\
| Data Submission |         submit
\-----------------/
        |
        V
/-----------------\
| Task Monitoring |         watch
\-----------------/

Metadata Extraction

The Extractor is a component which either traverses a filesystem or responds to filesystem events, for example via inotify. Its job is to produce meaningful file metadata on a per-file basis, in a format which is understood by the next stage in the pipeline. The Extractor may be a sophisticated or unsophisticated script, depending on your exact needs.
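
As an illustration of the crawl-based approach, here is a minimal sketch of an extractor built on os.walk. It is not the demo app’s implementation. The relpath, filename, and mode field names appear elsewhere in this guide, while size_bytes and mtime are names assumed here for illustration.

```python
import os
import stat
from pathlib import Path


def extract_metadata(root):
    """Yield one metadata dict per regular file found under ``root``."""
    root = Path(root)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = Path(dirpath) / name
            try:
                info = path.lstat()
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
            if not stat.S_ISREG(info.st_mode):
                # Ignore symlinks, sockets, and other special files.
                continue
            yield {
                "relpath": path.relative_to(root).as_posix(),
                "filename": name,
                "size_bytes": info.st_size,      # assumed field name
                "mtime": info.st_mtime,          # assumed field name
                "mode": stat.filemode(info.st_mode),
            }
```

Because the function is a generator, a later stage can consume files one at a time instead of holding the whole crawl in memory.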

Tip

Projects like Xtract are purpose-built to get various kinds of metadata from different filetypes and filesystems.

For more advanced file metadata, replace the extractor with a tool which understands your data.

Annotation

The Assembler is the component responsible for consuming extracted file metadata. It may need access to a database, the Globus Transfer API, or other resources in order to make decisions about data visibility, or to add specialized annotations to your datasets.

For example, the filesystem may be organized such that /projects is a directory, and files in /projects/foo/ are part of project foo. In that case, the assembler can be customized to read the absolute path to files and add the attribute project_name: foo as appropriate. This will enable users to query for project_name:foo AND filename:*.tiff to find TIFF files associated with foo.
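
The /projects example could be sketched as follows. The function name annotate_project and its parameters are hypothetical; only the /projects layout and the project_name attribute come from the example above.

```python
from pathlib import PurePosixPath


def annotate_project(metadata, abs_path, projects_root="/projects"):
    """Return a copy of ``metadata`` with ``project_name`` added when applicable."""
    annotated = dict(metadata)
    path = PurePosixPath(abs_path)
    try:
        relative = path.relative_to(projects_root)
    except ValueError:
        # The file is not under the projects directory.
        return annotated
    # A file belongs to a project only if it sits inside a project
    # subdirectory, i.e. the relative path is at least "<project>/<file>".
    if len(relative.parts) >= 2:
        annotated["project_name"] = relative.parts[0]
    return annotated
```

For instance, annotate_project({}, "/projects/foo/scan.tiff") adds project_name: foo, while a file outside /projects is returned unchanged.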

This stage of the pipeline produces documents for Globus Search as its output. The reference for these documents is the Ingest API.

In the demo app, the assembler separates two fields from the rest of the data: head, which contains the first 1000 characters of text files, and mode, which contains the permissions of files. It then assigns special visibility to these fields. This technique can be used generally to restrict some parts of the metadata while leaving other parts public.
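
A minimal sketch of this splitting technique is shown below; it is not the demo app’s code. It assumes the GMetaList and GMetaEntry document shapes described in the Ingest API reference (subject, id, visible_to, content), and the entry ids "public" and "restricted" are names chosen here for illustration. Check the exact document format and principal URN syntax against the Ingest API reference before relying on it.

```python
def assemble_entries(metadata, restricted_principals,
                     restricted_fields=("head", "mode")):
    """Split one file's metadata into a public and a restricted entry."""
    subject = metadata["relpath"]
    public = {k: v for k, v in metadata.items() if k not in restricted_fields}
    restricted = {k: v for k, v in metadata.items() if k in restricted_fields}

    entries = [{
        "subject": subject,
        "id": "public",                # hypothetical entry id
        "visible_to": ["public"],
        "content": public,
    }]
    if restricted:
        entries.append({
            "subject": subject,
            "id": "restricted",        # hypothetical entry id
            "visible_to": list(restricted_principals),
            "content": restricted,
        })
    return entries


def build_ingest_document(entries):
    """Wrap entries in the list-style ingest document (assumed shape)."""
    return {"ingest_type": "GMetaList", "ingest_data": {"gmeta": list(entries)}}
```

Both entries share the same subject, so a query returns the public fields to everyone and the restricted fields only to the listed principals.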

Submission & Monitoring

The two final stages of the pipeline are all about getting the resulting data into the Search Index.

When the data is submitted to Search, it will not be immediately available for queries — not until the ingest task which processes that data is complete. Well-behaved Globus Search clients should not only ensure that submission succeeds, but also wait for and monitor the status of the tasks which they create.
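
The submit-then-watch behaviour can be sketched as a single helper. This is an illustrative sketch: client is assumed to behave like the Globus SDK’s SearchClient, with ingest(index_id, document) returning a task_id and get_task(task_id) returning a state of PENDING, PROGRESS, SUCCESS, or FAILED. The function name, polling interval, and timeout are choices made here for illustration.

```python
import time


def submit_and_wait(client, index_id, ingest_document, poll_interval=2.0,
                    timeout=600.0, sleep=time.sleep, clock=time.monotonic):
    """Submit one ingest document, then poll its task until it finishes."""
    task_id = client.ingest(index_id, ingest_document)["task_id"]
    deadline = clock() + timeout
    while True:
        state = client.get_task(task_id)["state"]
        if state == "SUCCESS":
            return task_id
        if state == "FAILED":
            raise RuntimeError(f"ingest task {task_id} failed")
        if clock() >= deadline:
            raise TimeoutError(
                f"ingest task {task_id} still {state} after {timeout}s"
            )
        # Still PENDING or PROGRESS: wait before asking again.
        sleep(poll_interval)
```

Raising on failure or timeout lets the calling script exit non-zero, so a scheduler or operator can notice that the data never became searchable.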

Extending and Replacing the Demo App

For very simple use-cases, it may be possible to use the demo app as-is, or with minimal modifications. The more sophisticated the scenario, the more extensive the changes and replacements will have to be.

Here are a few ways in which the app can be updated to suit various needs.

Changing the File Metadata

One of the more straightforward changes possible is to adjust what metadata is computed from files or how it is stored.

For example, using Python’s hashlib, the extractor can compute file checksums as part of data extraction.
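
A minimal sketch of such a checksum helper, reading the file in chunks so that large files are not loaded into memory at once. The function name, the default algorithm, and the chunk size are assumptions made here, not part of the demo app.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of the file at ``path``."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The extractor could store the result in the file’s metadata under a field name of your choosing, alongside the name of the algorithm used.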

Tip

Checksums could easily be fed into the Globus Transfer API as part of the Transfer Item, to validate data transfers against data found in the search index.

The Globus SDK Helper for handling data transfers supports setting external_checksum as well.

Using a fixed Index ID

For a real Searchable Files index, there’s no need for the index creation command, or any facility for setting the index ID. Simply replace all loading of the index_id from storage with a known constant, the ID of the index which will be used.

Removing User Sign-In

The demo app has login and logout commands, and requires a specific user to be logged in to submit data to Globus Search.

If a new application is created on the Globus Auth Developers Site, it can have a client ID, secret, and "client identity". It’s then possible to replace the login requirement with Client Credentials Authorization.

In order for this to work, the "client identity" will need permissions to write to the index. This can be achieved with the Role Create API.

Rebuilding as a Daemon or Cron Job

As written, the various stages in the pipeline are separate commands, each running independently. As these steps are always meant to run in series, there’s no need for these various commands to be separate.

A single command — suitable for running via cron — or a daemon which runs the steps periodically, can replace the entire application.
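
One way to sketch such a single command is a function that runs the four stages in series and reports success or failure through its return value. The function run_once and the four callables it takes are hypothetical stand-ins for your own extract, assemble, submit, and watch logic, and the crontab line in the comment uses a made-up script path.

```python
import sys


def run_once(extract, assemble, submit, watch):
    """Run the four pipeline stages in series; return a process exit code."""
    try:
        metadata = extract()
        documents = assemble(metadata)
        task_ids = submit(documents)
        watch(task_ids)
    except Exception as error:
        # Report the failure on stderr so cron can mail or log it.
        print(f"indexing run failed: {error}", file=sys.stderr)
        return 1
    return 0


# A script would end with something like:
#     sys.exit(run_once(extract, assemble, submit, watch))
# and a hypothetical crontab entry to run it every 30 minutes could be:
#     */30 * * * * /usr/bin/python3 /opt/indexer/run_index.py
```

A daemon could call the same function in a loop with a sleep between runs.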

This change is most suitable once user sign-in has been replaced with client credentials.

Replacing the Directory Crawl with inotify

The Linux Kernel provides an API for monitoring filesystem events, inotify. inotify can be used to watch a directory for new, modified, or deleted files. These events can, in turn, be used to trigger the same data indexing pipeline used by the demo app.

There are many Python libraries that can deliver these events, such as watchdog, which uses inotify on Linux. Alternatively, the inotifywait command-line utility can be used to get inotify events as text output.
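
As a sketch of the text-output route, the parser below turns one line of inotifywait output into an action for the indexing pipeline. It assumes inotifywait is started with a format that puts the comma-separated event names first, for example inotifywait -m -r -e create,modify,delete,moved_to,moved_from --format '%e|%w%f' followed by the directory to watch. The "index" and "delete" action names are hypothetical, and filenames containing newlines are not handled.

```python
def parse_inotifywait_line(line):
    """Map one '%e|%w%f' output line to ("index" | "delete", path), or None."""
    # Event names never contain "|", so split only on the first one;
    # anything after it is the path, even if the path contains "|".
    events_text, _, path = line.rstrip("\n").partition("|")
    events = set(events_text.split(","))
    if not path or "ISDIR" in events:
        # Skip malformed lines and directory events.
        return None
    if events & {"DELETE", "MOVED_FROM"}:
        return ("delete", path)
    if events & {"CREATE", "MODIFY", "CLOSE_WRITE", "MOVED_TO"}:
        return ("index", path)
    return None
```

The lines themselves could be read from the standard output of an inotifywait process started with subprocess.Popen.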

On a large filesystem with many events, submitting each file update as a separate task to Globus Search will become slow. If this becomes an issue, the events can be batched and sent every few seconds or minutes.
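
A minimal sketch of such batching: events are collected in memory and handed to a flush callback once enough have accumulated or the oldest one has waited long enough. The EventBatcher class, its thresholds, and the flush callback are all hypothetical; the callback is where one ingest task per batch would be submitted.

```python
import time


class EventBatcher:
    """Collect per-file events and flush them in batches."""

    def __init__(self, flush, max_events=500, max_age_seconds=30.0,
                 clock=time.monotonic):
        self._flush = flush
        self._max_events = max_events
        self._max_age = max_age_seconds
        self._clock = clock
        self._pending = {}
        self._oldest = None

    def add(self, action, path):
        if not self._pending:
            self._oldest = self._clock()
        # A later event for the same path replaces the earlier one.
        self._pending[path] = action
        self.flush_if_due()

    def flush_if_due(self):
        """Flush when the batch is full or too old; return True if flushed."""
        if not self._pending:
            return False
        too_many = len(self._pending) >= self._max_events
        too_old = self._clock() - self._oldest >= self._max_age
        if too_many or too_old:
            self.flush()
            return True
        return False

    def flush(self):
        if self._pending:
            batch, self._pending = self._pending, {}
            self._oldest = None
            self._flush(batch)
```

Since add only runs when an event arrives, a quiet filesystem needs a periodic call to flush_if_due (for example from a timer loop) so that the last few events are not left waiting.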

Deleting Removed Files

When a file is removed, whether this is detected via inotify or by checking against a database, it should also be removed from the Searchable Files index.

This needs to be done either via delete-by-query operations or, more simply, using the Delete by Subject API.

In the existing Searchable Files Demo App, the subject field is always set to the same value as the relpath field; that is, files are identified by their path.
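
Because subjects are paths, finding what to delete can be as simple as comparing the paths indexed on a previous run with the paths present now. The sketch below assumes that client behaves like the Globus SDK’s SearchClient, whose delete_subject(index_id, subject) method is taken here to correspond to the Delete by Subject API. The function name and the idea of keeping a record of previously indexed paths are assumptions made for illustration.

```python
def delete_removed_files(client, index_id, previously_indexed, currently_present):
    """Delete index entries for files that no longer exist.

    ``previously_indexed`` and ``currently_present`` are iterables of
    relpath values, which double as Search subjects.
    """
    removed = sorted(set(previously_indexed) - set(currently_present))
    for subject in removed:
        client.delete_subject(index_id, subject)
    return removed
```

The returned list can be logged, or used to update whatever record of indexed paths the application keeps.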