Searchable Files
This guide provides an overview for making file metadata searchable, resulting in a Search index of files. Once information about files has been indexed, the index can be queried for files in a manner similar to find or locatedb.
This guide is written as a companion to the Searchable Files Demo App. The demo app demonstrates a straightforward approach for extracting file metadata and making it searchable, but it is just that: a demo. For a fully fledged solution, users must modify or replace some or all of the demo app. We devote a section of this doc to guidance on how the demo app can be extended or modified.
If you want to jump straight in, the demo app can be run without reading this guide. Then return here for more details on why the demo is a useful starting point!
Motivation
There are three strong reasons to make file metadata searchable in Globus Search.
- No requirement for local access
  Using a Searchable Files index instead of find or other classic tools means that users don’t need shell access on a system storing data in order to search for files. This is especially appropriate if a dataset is exposed on a Globus Collection, but local accounts cannot or should not be provisioned for the end users.
- Search queries do not create filesystem load
  When a user runs find . -name '*.tiff', find recursively traverses directories looking for matching filenames. All of those reads can be difficult for a filesystem to serve, especially with high-utilization multi-user systems. If the filesystem is backed by spinning disk or tape, or really serves as a frontend to S3, Ceph, or another storage system, even one find command can be expensive and slow.
  When queries are sent to the Search service instead, no local load is created. The work of examining files, as find would do, has been done up-front, and users can query the data hundreds or even thousands of times more efficiently than the filesystem could sustain.
- Permissions and visibility from Globus Groups and Globus Auth
  Globus Search uses Globus Groups and Globus Auth identities for its visibility and permissions system. These are the same systems used by Globus Transfer and Collections to handle access control for their actual data.
  As a result, it is easy to harmonize the visibility of file metadata in a Searchable Files index with the accessibility of that data via Globus Transfer and other applications.
The Searchable Files Demo Application
Globus provides a fully open-source demo application for building a Searchable Files index, the Searchable Files Demo App.
The demo application provides a scaffold upon which more complex solutions can be built, and serves as a walkthrough for this type of application. Research computing teams and other users can use all or part of the demo app to build a Searchable Files index, or merely treat it as guidance about how their own indexing needs should be met.
Demo App Limitations
The demo app does not seek to support a wide variety of filesystem types and use-cases. It operates by doing a recursive directory crawl, just like find would do, to look up file metadata. However, even this simplistic approach satisfies all of the major goals for a Searchable Files index, as it only does this operation once (as opposed to once for every user query).
The application is also written to be run by a single user, and to be run either manually or via cron. It does not support filesystem monitoring or other advanced features out of the box.
A Pattern for File Indexing
The demo app provides several utilities, along with four major components. These provide the main design for a pipeline which indexes filesystem metadata.
The pipeline runs as follows:
/---------------------\
| Metadata Extraction | extract
\---------------------/
|
V
/-----------------------\
| Annotation & | assemble
| Visibility Policy |
\-----------------------/
|
V
/-----------------\
| Data Submission | submit
\-----------------/
|
V
/-----------------\
| Task Monitoring | watch
\-----------------/
Metadata Extraction
The Extractor is a component which either traverses a filesystem or responds to filesystem events, for example via inotify. Its job is to produce meaningful file metadata on a per-file basis, in a format which is understood by the next stage in the pipeline. The Extractor may be a sophisticated or unsophisticated script, depending on your exact needs.
Projects like Xtract are purpose-built to get various kinds of metadata from different filetypes and filesystems.
For more advanced file metadata, replace the extractor with a tool which understands your data.
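As a rough illustration of the simple end of that spectrum, the sketch below crawls a directory tree and yields one metadata dictionary per regular file. It is not the demo app's extractor. The relpath and mode field names come from this guide, while the function name and the remaining field names (filename, extension, size_bytes, mtime) are hypothetical choices made for the example.

```python
import os
import stat
from datetime import datetime, timezone


def extract_metadata(root):
    """Yield one metadata dict per regular file found under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symlinks, so links are never
                # reported as if they were regular files.
                info = os.lstat(path)
            except OSError:
                # The file vanished or cannot be read; skip it.
                continue
            if not stat.S_ISREG(info.st_mode):
                continue
            modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
            yield {
                "relpath": os.path.relpath(path, root),
                "filename": name,  # hypothetical field name
                "extension": os.path.splitext(name)[1].lstrip("."),
                "size_bytes": info.st_size,
                "mode": stat.filemode(info.st_mode),
                "mtime": modified.isoformat(),
            }
```

Because the function is a generator, the next stage can consume records one at a time without holding the whole crawl in memory.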
Annotation
The Assembler is the component responsible for consuming extracted file metadata. It may need access to a database, the Globus Transfer API, or other resources in order to make decisions about data visibility, or to add specialized annotations to your datasets.
For example, the filesystem may be organized such that /projects is a directory, and files in /projects/foo/ are part of project foo. In that case, the assembler can be customized to read the absolute path to files and add the attribute project_name: foo as appropriate. This will enable users to query for project_name:foo AND filename:*.tiff to find TIFF files associated with foo.
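A minimal sketch of that kind of path-based annotation might look like the following. The function name and the abspath key are hypothetical, and the /projects layout is the example layout described above rather than anything the demo app requires.

```python
from pathlib import PurePosixPath


def annotate_project(metadata, projects_root="/projects"):
    """Return a copy of metadata, adding project_name for files under projects_root."""
    annotated = dict(metadata)
    # "abspath" is a hypothetical key holding the file's absolute path.
    path = PurePosixPath(metadata["abspath"])
    try:
        relative = path.relative_to(projects_root)
    except ValueError:
        # The file is not under the projects directory; leave it unannotated.
        return annotated
    # Require a project directory plus at least a file name, so that files
    # sitting directly in /projects are not treated as projects themselves.
    if len(relative.parts) >= 2:
        annotated["project_name"] = relative.parts[0]
    return annotated
```

For a file at /projects/foo/images/a.tiff this adds project_name with the value foo, while files outside /projects pass through unchanged.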
This stage of the pipeline produces documents for Globus Search as its output. The reference for these documents is the Ingest API.
In the demo app, the assembler splits the head and mode fields, containing the first 1000 characters of text files and the permissions of files, from the rest of the data, and assigns special visibility to these fields. This is a technique which can be used generally to make some parts of the metadata restricted while other parts may be public.
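The splitting technique can be sketched as two entries that share a subject but carry different visible_to lists. This is an illustration under assumptions, not the demo app's code: the function name and the entry IDs "public" and "restricted" are hypothetical, and the document shape should be checked against the Ingest API reference.

```python
RESTRICTED_FIELDS = ("head", "mode")


def build_ingest_document(file_metadata, restricted_principals):
    """Split one file's metadata into a public entry and a restricted entry."""
    subject = file_metadata["relpath"]
    public_content = {
        key: value for key, value in file_metadata.items()
        if key not in RESTRICTED_FIELDS
    }
    restricted_content = {
        key: value for key, value in file_metadata.items()
        if key in RESTRICTED_FIELDS
    }
    entries = [
        {
            "subject": subject,
            "id": "public",  # hypothetical entry ID
            "visible_to": ["public"],
            "content": public_content,
        }
    ]
    if restricted_content:
        entries.append(
            {
                "subject": subject,
                "id": "restricted",  # hypothetical entry ID
                # Principal URNs, e.g. a Globus Group or Globus Auth identity.
                "visible_to": list(restricted_principals),
                "content": restricted_content,
            }
        )
    return {"ingest_type": "GMetaList", "ingest_data": {"gmeta": entries}}
```

A caller would pass principal URNs for the restricted entry, for example a group URN of the form urn:globus:groups:id:<group-uuid>, where the UUID is a placeholder for a real group ID.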
Submission & Monitoring
The two final stages of the pipeline are all about getting the resulting data into the Search Index.
When the data is submitted to Search, it will not be immediately available for queries — not until the ingest task which processes that data is complete. Well-behaved Globus Search clients should not only ensure that submission succeeds, but also wait for and monitor the status of the tasks which they create.
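The submit-then-watch behavior can be sketched as a single helper that polls until the task reaches a terminal state. The helper name, poll interval, and timeout are hypothetical. The client argument is assumed to behave like the Globus SDK's SearchClient, with ingest and get_task methods whose responses expose task_id and state, and the terminal state names SUCCESS and FAILED are assumed to match the Search task documentation.

```python
import time


def submit_and_wait(client, index_id, document, poll_interval=5.0, timeout=300.0):
    """Submit a document for ingest, then poll the task until it finishes.

    Returns (task_id, state), where state is "SUCCESS" or "FAILED".
    Raises TimeoutError if the task is still running after timeout seconds.
    """
    task_id = client.ingest(index_id, document)["task_id"]
    deadline = time.monotonic() + timeout
    while True:
        state = client.get_task(task_id)["state"]
        if state in ("SUCCESS", "FAILED"):
            return task_id, state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"ingest task {task_id} still in state {state}")
        time.sleep(poll_interval)
```

Callers should treat a FAILED result as an error to log or retry, since the submitted data will not be queryable in that case.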
Extending and Replacing the Demo App
For very simple use-cases, it may be possible to use the demo app as-is, or with minimal modifications. The more sophisticated the scenario, the more extensive the changes and replacements will have to be.
Here are a few ways in which the app can be updated to suit various needs.
Changing the File Metadata
One of the more straightforward changes possible is to adjust what metadata is computed from files or how it is stored.
For example, using Python’s hashlib, the extractor can compute file checksums as part of data extraction.
Checksums could easily be fed into the Globus Transfer API as part of the Transfer Item, to validate data transfers against data found in the search index.
The Globus SDK Helper for handling data transfers supports setting external_checksum as well.
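A checksum step of this kind could be sketched as follows. The function name is hypothetical, and SHA-256 is an assumption made for the example; use whichever algorithm the collections involved in the transfer actually support.

```python
import hashlib


def sha256_checksum(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting hex string can be stored as an extra metadata field alongside the file's other attributes.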
Using a fixed Index ID
For a real Searchable Files index, there’s no need for the index creation command, or any facility for setting the index ID. Simply replace all loading of the index_id from storage with a known constant, the ID of the index which will be used.
Removing User Sign-In
The demo app has login and logout commands which require that a specific user is used to submit data to Globus Search.
If a new application is created in the Globus Auth Developers Site, it can have a client ID, secret, and "client identity". It’s then possible to replace the login requirement with Client Credentials Authorization.
In order for this to work, the "client identity" will need permissions to write to the index. This can be achieved with the Role Create API.
Rebuilding as a Daemon or Cron
As written, the various stages in the pipeline are separate commands, each running independently. As these steps are always meant to run in series, there’s no need for these various commands to be separate.
A single command — suitable for running via cron — or a daemon which runs the steps periodically, can replace the entire application.
This change is most suitable once user sign-in has been replaced with client credentials.
Replacing the Directory Crawl with inotify
The Linux Kernel provides an API for monitoring filesystem events, inotify. inotify can be used to watch a directory for new, modified, or deleted files. These events can, in turn, be used to trigger the same data indexing pipeline used by the demo app.
There are many Python libraries for inotify, such as watchdog. Alternatively, the inotifywait command-line utility can be used to get inotify events as text output.
On a large filesystem with many events, submitting each file update as a separate task to Globus Search will become slow. If this becomes an issue, the events can be batched and sent every few seconds or minutes.
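One way to batch events is sketched below: a small collector that keeps only the latest event per path and hands the batch to a callback once it is large enough or old enough. The class name, thresholds, and callback signature are all hypothetical and are not part of the demo app or of any inotify library.

```python
import time


class EventBatcher:
    """Collect filesystem events and flush them in batches."""

    def __init__(self, flush, max_events=100, max_age_seconds=30.0,
                 clock=time.monotonic):
        self._flush = flush            # called with a {path: event_type} dict
        self._max_events = max_events
        self._max_age = max_age_seconds
        self._clock = clock
        self._pending = {}
        self._oldest = None            # time the current batch was started

    def add(self, path, event_type):
        if not self._pending:
            self._oldest = self._clock()
        # A later event for the same path replaces the earlier one, so each
        # file is submitted at most once per batch.
        self._pending[path] = event_type
        self.flush_if_due()

    def flush_if_due(self):
        if not self._pending:
            return
        too_many = len(self._pending) >= self._max_events
        too_old = self._clock() - self._oldest >= self._max_age
        if too_many or too_old:
            self.flush()

    def flush(self):
        if self._pending:
            batch, self._pending = self._pending, {}
            self._flush(batch)
```

The event handler would call add for every event, and a periodic timer would call flush_if_due so that a partially filled batch is still sent during quiet periods.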
Deleting Removed Files
When a file is removed, either as detected by inotify or by checking against some database, it should be removed from the Searchable Files index.
This needs to be done either via delete-by-query operations or, more simply, using the Delete by Subject API.
In the existing Searchable Files Demo App, the subject field is always set to the same value as the relpath field. That is, files are identified by their path.
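Since subjects are file paths, cleanup can be sketched as a set difference followed by one delete call per missing file. The helper name and its inputs are hypothetical. The client argument is assumed to behave like the Globus SDK's SearchClient, whose delete_subject method takes an index ID and a subject, and how the list of previously indexed subjects is tracked is left to the application.

```python
def delete_removed_files(client, index_id, indexed_subjects, existing_relpaths):
    """Delete index entries whose files no longer exist.

    indexed_subjects: subjects (relative paths) previously submitted.
    existing_relpaths: relative paths currently present on the filesystem.
    Returns a list of (subject, response) pairs, one per deletion requested.
    """
    removed = sorted(set(indexed_subjects) - set(existing_relpaths))
    results = []
    for subject in removed:
        # Each call asks Search to drop every entry for this subject.
        response = client.delete_subject(index_id, subject)
        results.append((subject, response))
    return results
```

As with ingest, the deletions are not guaranteed to be visible in query results immediately, so the responses are returned for the caller to inspect or monitor.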