Globus Search Overview
This is a high-level overview of the Globus Search (hereafter just "Search") API, introducing basic concepts and terminology.
1. What is it for?
The Search API allows you to store data, set policies on its visibility, and then retrieve that data through search queries.
A few examples of the service’s capabilities include:
- On documents describing tweets, sorting by number of retweets
- On documents describing novels, filtering to those published in 1870 and 1890, but not the years in between
- On documents describing clothing, counting the number of matching items which contain cotton, polyester, or both
- On documents describing academic papers, counts of those with Paul Erdos as an author, but not as the first author
2. Search Indices
A Search index is a place for you to store and search over your data. Every operation performed against Search (storing data, setting permissions, and performing searches) is done with respect to a particular index.
Indices can be used to create logical groupings of data for your applications to use. They also provide a level of control and separation between datasets, and can be used to enforce policies about who can write different datasets.
You would not, however, typically use a separate index for public vs. private data: for that, you can employ Search’s granular visibility controls.
3. Data Format
To use the Search API, it’s necessary to understand the two basic concepts which are used to break down data: Subjects and Entries.
3.1. Subjects
The Subject is the ID for a document. Subjects are arbitrary strings which must be unique per index. Oftentimes, the subject is the ID or URI of some external object, and the data attached to it is metadata.
Subjects are not automatically included in the query criteria for a search. Search matching is done on the "contents" of the subject: the entries.
3.2. Entries
Even though a Subject may only appear once per index, Search permits multiple pieces of data (up to 10) to be applied to that Subject. We refer to each of these subdocuments as an entry.
Different Entries may have different visibility rules even though they describe the same Subject. This is intentional, and allows for situations in which a single conceptual object is represented as a subject, split into parts (entries) with different visibilities.
When using multiple Entries to describe a single Subject, you will supply distinct entry ID values for each entry under that Subject.
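The relationship between an index, its Subjects, and their Entries can be sketched with a small in-memory model. This is only an illustration and not the Search API: the class `ToyIndex`, its method names, the `content` and `visible_to` keys, and the replace-on-rewrite behaviour are all assumptions made for this example. Only the limit of 10 entries per Subject and the idea of a distinct entry ID per Entry come from the text above.

```python
from typing import Any, Dict, List, Optional, Tuple

MAX_ENTRIES_PER_SUBJECT = 10  # the limit stated in the text above


class ToyIndex:
    """In-memory stand-in for one Search index (illustration only)."""

    def __init__(self) -> None:
        # Each entry is keyed by (subject, entry_id); entry_id may be None.
        self._entries: Dict[Tuple[str, Optional[str]], Dict[str, Any]] = {}

    def put_entry(
        self,
        subject: str,
        content: Dict[str, Any],
        visible_to: List[str],
        entry_id: Optional[str] = None,
    ) -> None:
        key = (subject, entry_id)
        count = sum(1 for (s, _) in self._entries if s == subject)
        if key not in self._entries and count >= MAX_ENTRIES_PER_SUBJECT:
            raise ValueError(f"subject {subject!r} already has {count} entries")
        # Assumption: writing an existing (subject, entry_id) replaces it.
        self._entries[key] = {"content": content, "visible_to": visible_to}

    def entries_for(self, subject: str) -> Dict[Optional[str], Dict[str, Any]]:
        """Return every entry stored under one subject, keyed by entry ID."""
        return {eid: e for (s, eid), e in self._entries.items() if s == subject}


index = ToyIndex()
paper = "https://example.org/papers/1"  # hypothetical subject (a URI)

# One conceptual object, two entries with different visibility.
index.put_entry(paper, {"title": "An Example Paper"}, ["public"], entry_id="abstract")
index.put_entry(
    paper,
    {"notes": "Internal review comments"},
    ["urn:globus:groups:id:fdb38a24-03c1-11e3-86f7-12313809f035"],
    entry_id="reviews",
)
```

Here the Subject appears once in the index, while its two Entries carry separate content and separate visibility lists.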
If you omit the entry ID, that is equivalent to an Entry ID of null. An entry_id of null is valid for all operations (e.g. Get Entry). Where null and "null" cannot be distinguished, the null ID is specified by omission. When JSON data is used, an explicit null may also be used.
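The difference between an omitted ID, an explicit JSON null, and the string "null" can be sketched as follows. The document shape used here (a dict with `subject` and `entry_id` keys) and the helper `entry_key` are assumptions for illustration, not the service's actual request format.

```python
import json
from typing import Any, Dict, Optional, Tuple


def entry_key(doc: Dict[str, Any]) -> Tuple[str, Optional[str]]:
    """Return (subject, entry_id); a missing entry_id is treated as None."""
    return (doc["subject"], doc.get("entry_id"))


# Entry ID omitted entirely.
omitted = {"subject": "doc-1"}

# Entry ID given as an explicit JSON null, which Python decodes to None.
explicit_null = json.loads('{"subject": "doc-1", "entry_id": null}')

# Entry ID given as the four-character string "null": a different, non-null ID.
string_null = json.loads('{"subject": "doc-1", "entry_id": "null"}')

same_entry = entry_key(omitted) == entry_key(explicit_null)
different_entry = entry_key(omitted) != entry_key(string_null)
```

In a context without a typed null, such as a URL query string, only omission can express the null ID unambiguously, which is why the text above specifies omission there.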
4. Principal URNs
Principal URNs are a way of phrasing Globus Identities and Globus Groups as URNs. This format is used by Search to represent users and permissions.
To formulate a Principal URN, prefix Identity IDs with urn:globus:auth:identity: and Group IDs with urn:globus:groups:id:.
For example:
- urn:globus:auth:identity:46bd0f56-e24f-11e5-a510-131bef46955c
- urn:globus:groups:id:fdb38a24-03c1-11e3-86f7-12313809f035
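The prefixing rule can be written as two small helpers. The function and constant names are hypothetical. Only the two prefixes and the example IDs come from the text above.

```python
IDENTITY_PREFIX = "urn:globus:auth:identity:"
GROUP_PREFIX = "urn:globus:groups:id:"


def identity_urn(identity_id: str) -> str:
    """Build a Principal URN for a Globus Identity ID."""
    return IDENTITY_PREFIX + identity_id


def group_urn(group_id: str) -> str:
    """Build a Principal URN for a Globus Group ID."""
    return GROUP_PREFIX + group_id


user = identity_urn("46bd0f56-e24f-11e5-a510-131bef46955c")
group = group_urn("fdb38a24-03c1-11e3-86f7-12313809f035")
```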
5. Visibility Values
Visibility of documents in Search can be controlled at the document level. Visibility is set using either Principal URNs or the special values defined here.
Principal URNs indicate that a document is visible to the user or users who satisfy that principal. For an identity URN, that refers to the user who has that identity in their identity set. For a group URN, that refers to all of the members of the group.
In addition to Principal URNs, two special values are defined for visibility rules:
- public: The document is visible to all users and to unauthenticated queries.
- all_authenticated_users: The document is visible to all users but not to unauthenticated queries.
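The visibility rules described in this section can be summarised as a small check. This is a sketch of the logic as described here, not the service's implementation. The function `is_visible` and its parameters are hypothetical, and `caller_urns` is assumed to hold the Principal URNs for every identity in the caller's identity set plus every group the caller belongs to.

```python
from typing import Iterable, Set


def is_visible(visible_to: Iterable[str], caller_urns: Set[str], authenticated: bool) -> bool:
    """Return True if any visibility value on the document admits the caller."""
    for value in visible_to:
        if value == "public":
            return True  # everyone, including unauthenticated queries
        if not authenticated:
            continue  # every remaining rule requires an authenticated caller
        if value == "all_authenticated_users":
            return True
        if value in caller_urns:
            return True  # identity URN or group URN matches the caller
    return False


alice = {
    "urn:globus:auth:identity:46bd0f56-e24f-11e5-a510-131bef46955c",
    "urn:globus:groups:id:fdb38a24-03c1-11e3-86f7-12313809f035",
}
group_only = ["urn:globus:groups:id:fdb38a24-03c1-11e3-86f7-12313809f035"]

anonymous_sees_public = is_visible(["public"], set(), authenticated=False)
anonymous_sees_authenticated_only = is_visible(["all_authenticated_users"], set(), authenticated=False)
alice_sees_group_doc = is_visible(group_only, alice, authenticated=True)
```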