Ingest
1. Overview
The Ingest API is used to send metadata into Globus Search. It is the primary way to add data to an index.
You send documents by POSTing them to the Ingest API, which returns a Task ID.
The Task ID lets you check on the status of your request to add data to an index. Globus Search will automatically retry certain failures and will guarantee the ordered delivery of ingest requests.
You can check the status of that Task using the Get Task API. Once your task is complete, the data will be available to search queries.
2. Ingest Document Walkthrough
The Ingest API accepts GIngest documents.
You can read the full GIngest specification below for a more rigorous definition and some examples, but we will first cover the two forms this document can take and include several examples.
Every GIngest document is either a single GMetaEntry document:
{ "ingest_type": "GMetaEntry", "ingest_data": { ... } }
or a single GMetaList document:
{ "ingest_type": "GMetaList", "ingest_data": { "gmeta": [ { ... } ] } }
A GMetaList has the field gmeta, containing an array of GMetaEntry documents. There’s no constraint on the documents themselves and they do not have to be related. So from here on out we’ll really focus on the GMetaEntry.
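As an illustration of the two envelope forms above, here is a minimal Python sketch that wraps entries in a GIngest document. The helper name make_gingest is hypothetical and not part of any Globus library; it only mirrors the JSON shapes shown above.

```python
from typing import Any


def make_gingest(entries: list[dict[str, Any]]) -> dict[str, Any]:
    """Wrap GMetaEntry dicts in a GIngest envelope (hypothetical helper).

    One entry produces the GMetaEntry form; several produce the
    GMetaList form with the entries under the "gmeta" field.
    """
    if not entries:
        raise ValueError("at least one GMetaEntry is required")
    if len(entries) == 1:
        return {"ingest_type": "GMetaEntry", "ingest_data": entries[0]}
    return {"ingest_type": "GMetaList", "ingest_data": {"gmeta": list(entries)}}
```

Sending a one-element GMetaList would presumably work just as well; the single-entry branch is only there to show both forms.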
2.1. GMetaEntry, Subjects, and Entries
If you haven’t read the Overview of Globus Search, you should really stop and read it now.
We’re going to talk about Subjects and Entries, and you’ll need to know what these are to read and write sensible GMetaEntry documents.
Let’s start with a really simple GMetaEntry document:
{
  "subject": "https://search.api.globus.org/abc.txt",
  "visible_to": [
    "public"
  ],
  "content": {
    "name": "abc.txt",
    "extension": "txt",
    "type": "file"
  }
}
This describes the Subject https://search.api.globus.org/abc.txt, a "Search result", with public visibility and three searchable fields in its content: name, extension, and type.
The subject is any string you wish to use as a search result (we just made this one up), and the content is an almost arbitrary JSON blob describing it.
As mentioned in the Overview, there is no Entry ID for this data, so the Entry ID is null.
We’ll cover the few restrictions on content in a moment, but first, let’s look at that special field: visible_to.
visible_to is a list of security principals allowed to read the metadata.
That is to say, people, or descriptors for groups of people, who can see this result when they query Globus Search.
Each string must be a Principal URN, or one of the special strings "public" or "all_authenticated_users".
The meaning of these strings is covered in the overview of visibility values.
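To catch obvious mistakes before submitting, a client could pre-check visible_to values locally. The sketch below is not an official validator: the function name is hypothetical, and the URN patterns are an assumption based on the identity URN form used in this document's examples plus the commonly seen group URN form. The service itself remains the authority on what it accepts.

```python
import re

# The two special strings named in the text above.
SPECIAL_VALUES = {"public", "all_authenticated_users"}

# ASSUMPTION: Principal URNs look like an identity or group prefix
# followed by a lowercase UUID. Check the Principal URN documentation.
PRINCIPAL_URN = re.compile(
    r"^urn:globus:(auth:identity|groups:id):"
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def invalid_principals(visible_to):
    """Return the visible_to values that fail the local check."""
    return [
        value
        for value in visible_to
        if not (
            isinstance(value, str)
            and (value in SPECIAL_VALUES or PRINCIPAL_URN.match(value))
        )
    ]
```

An empty result means every value passed the local check, not that the service is guaranteed to accept the document.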
2.1.1. GMetaEntry.content
content is the main body of a GMetaEntry, and it is this data which will be indexed and queryable in Search.
2.1.2. GMetaEntry.id
Our original example entry document did not specify an id and therefore used a null ID.
The ID is used to distinguish between multiple Entries for a single Subject.
It is also needed to access entry operations such as the Get Entry API, which provides read capabilities for individual entries.
id is an arbitrary string field in a GMetaEntry document. For example, here’s a GMetaEntry with an explicit id:
{
  "id": "filetype",
  "subject": "https://search.api.globus.org/abc.txt",
  "visible_to": [
    "public"
  ],
  "content": {
    "type": "file"
  }
}
2.2. Complete Example Document
Now that everything has been introduced, let’s combine it all into a single GIngest document with multiple subjects and multiple entries.
{
  "ingest_type": "GMetaList",
  "ingest_data": {
    "gmeta": [
      {
        "id": "filetype",
        "subject": "https://search.api.globus.org/abc.txt",
        "visible_to": [
          "public"
        ],
        "content": {
          "type": "file"
        }
      },
      {
        "id": "size",
        "subject": "https://search.api.globus.org/abc.txt",
        "visible_to": [
          "urn:globus:auth:identity:46bd0f56-e24f-11e5-a510-131bef46955c"
        ],
        "content": {
          "size": "1000000",
          "size_human": "1MB"
        }
      },
      {
        "subject": "https://search.api.globus.org/def.txt",
        "visible_to": [
          "public"
        ],
        "content": {
          "type": "file",
          "size": "1000000",
          "size_human": "1MB"
        }
      }
    ]
  }
}
Two of the entries have explicit id fields and one is using the implicit null id.
One of them sets visible_to so that its data can be viewed only by a single specific identity, while the other two are public.
Two entries describe https://search.api.globus.org/abc.txt and one describes https://search.api.globus.org/def.txt.
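The relationship between Subjects and Entry IDs in a document like this one can be checked with a few lines of Python. This is only a sketch; the function name entry_ids_by_subject is hypothetical, and a missing id is represented as None to match the null ID described above.

```python
from collections import defaultdict


def entry_ids_by_subject(gmeta):
    """Map each subject to the list of entry IDs that describe it.

    Entries without an "id" field are recorded as None, mirroring
    the implicit null Entry ID.
    """
    grouped = defaultdict(list)
    for entry in gmeta:
        grouped[entry["subject"]].append(entry.get("id"))
    return dict(grouped)
```

Applied to the gmeta array of the example document, this groups the IDs "filetype" and "size" under the abc.txt subject and a single None under the def.txt subject.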
To submit this to Search, use the Ingest API.
For example, to submit the above document to index 4de0e89e-a395-11e7-bc54-8c705ad34f60 using a token xxxxx, you could run the following command:
curl \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer xxxxx' \
  -XPOST 'https://search.api.globus.org/v1/index/4de0e89e-a395-11e7-bc54-8c705ad34f60/ingest' \
  --data '
{
  "ingest_type": "GMetaList",
  "ingest_data": {
    "gmeta": [
      {
        "id": "filetype",
        "subject": "https://search.api.globus.org/abc.txt",
        "visible_to": [
          "public"
        ],
        "content": {
          "type": "file"
        }
      },
      {
        "id": "size",
        "subject": "https://search.api.globus.org/abc.txt",
        "visible_to": [
          "urn:globus:auth:identity:46bd0f56-e24f-11e5-a510-131bef46955c"
        ],
        "content": {
          "size": "1000000",
          "size_human": "1MB"
        }
      },
      {
        "subject": "https://search.api.globus.org/def.txt",
        "visible_to": [
          "public"
        ],
        "content": {
          "type": "file",
          "size": "1000000",
          "size_human": "1MB"
        }
      }
    ]
  }
}
'
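The same request can be prepared from Python using only the standard library. This is a sketch that mirrors the curl command above rather than an official client: the helper name build_ingest_request is hypothetical, and the token and index ID are the placeholders already used in this section.

```python
import json
import urllib.request

SEARCH_BASE_URL = "https://search.api.globus.org/v1"


def build_ingest_request(index_id, token, document):
    """Build (but do not send) a POST request for the Ingest API."""
    return urllib.request.Request(
        url=f"{SEARCH_BASE_URL}/index/{index_id}/ingest",
        data=json.dumps(document).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

To actually submit, pass the request to urllib.request.urlopen and parse the JSON response body; error handling and retries on the client side are left out of this sketch.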
3. Monitoring an Ingest Task
When you submit an Ingest task, the response will include a task_id.
You can then poll the status of the task to wait for it to succeed or fail. For example:
task_id="05c1ec1b-2400-44e2-9797-922c29199042"
curl \
  -H 'Authorization: Bearer xxxxx' \
  -XGET "https://search.api.globus.org/v1/task/${task_id}"
may output:
{
  "state_description": "Task succeeded",
  "task_id": "05c1ec1b-2400-44e2-9797-922c29199042",
  "state": "SUCCESS",
  "creation_date": "2018-12-13T18:08:42.746911",
  "completion_date": "2018-12-13T18:08:44.539611",
  "additional_details": null,
  "message": null,
  "index_id": "696af25c-8c24-469a-b5e0-67d3e4b71df7"
}
When state is SUCCESS or FAILED, the task is complete.
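A polling loop built on that rule might look like the following Python sketch. The function wait_for_task and its parameters are hypothetical; fetch_task stands in for whatever function you use to call the Get Task API and return the parsed JSON, and the interval and poll limit are arbitrary defaults, not documented recommendations.

```python
import time

# The two states the text above identifies as complete.
TERMINAL_STATES = {"SUCCESS", "FAILED"}


def wait_for_task(fetch_task, task_id, interval=1.0, max_polls=60, sleep=time.sleep):
    """Poll fetch_task(task_id) until the task reaches a terminal state.

    fetch_task must return a dict with a "state" key, as in the
    example response above. Raises TimeoutError after max_polls.
    """
    for attempt in range(max_polls):
        task = fetch_task(task_id)
        if task["state"] in TERMINAL_STATES:
            return task
        # Wait between polls, but not after the final attempt.
        if attempt < max_polls - 1:
            sleep(interval)
    raise TimeoutError(f"task {task_id} not complete after {max_polls} polls")
```

Passing the fetch function in as an argument keeps the loop independent of any particular HTTP client and makes it easy to exercise with a stub. A returned task with state FAILED still needs to be handled by the caller.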