1. Description

The Ingest API is used to send metadata into Globus Search. It is the primary way in which you add data to an index.

You send documents by POSTing them to the Ingest API, which returns a Task ID. The Task ID lets you check on the status of your request to add data to an index. Globus Search will automatically retry certain failures and will guarantee the ordered delivery of ingest requests.

You can then check the status of that Task using the Task API. Once your task is complete, the data will be available to search queries.

2. Ingest Document Walkthrough

The Ingest API accepts GIngest documents.

You can read the full GIngest specification below for a more rigorous definition, but we will first cover the two forms this document can take, with several examples.

Every GIngest document is either a single GMetaEntry document:

{
  "ingest_type": "GMetaEntry",
  "ingest_data": { ... }
}

or a single GMetaList document:

{
  "ingest_type": "GMetaList",
  "ingest_data": {
    "gmeta": [
      {
        ...
      }
    ]
  }
}

A GMetaList has the field gmeta, containing an array of GMetaEntry documents. There’s no constraint on the documents themselves and they do not have to be related. So from here on out we’ll really focus on the GMetaEntry.
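
As a rough illustration of the two forms, the sketch below wraps a document in the GIngest envelope. The helper name wrap_ingest is hypothetical, and the rule it uses (a dict with a "gmeta" key is a GMetaList, anything else is a GMetaEntry) is an assumption made for this example, not something the service defines. The entry fields used here are introduced in the next section.

```python
def wrap_ingest(document):
    """Wrap a GMetaEntry or GMetaList dict in a GIngest envelope."""
    # Assumption for this sketch: a dict with a "gmeta" key is a GMetaList;
    # anything else is treated as a single GMetaEntry.
    ingest_type = "GMetaList" if "gmeta" in document else "GMetaEntry"
    return {"ingest_type": ingest_type, "ingest_data": document}


entry = {
    "subject": "https://search.api.globus.org/abc.txt",
    "visible_to": ["public"],
    "content": {"metadata-schema/file#type": "file"},
}

single = wrap_ingest(entry)             # GMetaEntry form
batch = wrap_ingest({"gmeta": [entry]})  # GMetaList form
```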

Note:

Why all the "G"s?

We could have called these things MetaEntry and MetaList and Ingest. That would have been fine too. The "G" just stands for "Globus". We didn’t want to call things GlobusMetaEntry though, what a mouthful!

And who hasn’t misspelled "Gloubs" at least once? So GMeta is easy to say, unambiguous, and that much shorter to type. That’s all. No secrets here.

2.1. GMetaEntry, Subjects, and Entries

If you haven’t read the Overview of Globus Search, you should really stop and read it now. We’re going to talk about Subjects and Entries and you’ll need to know what these are to read and write sensible GMetaEntry documents.

Let’s start with a really simple GMetaEntry document:

{
  "subject": "https://search.api.globus.org/abc.txt",
  "visible_to": ["public"],
  "content": {
    "metadata-schema/file#type": "file"
  }
}

This describes the Subject https://search.api.globus.org/abc.txt, a "Search result", with public visibility and one searchable field: metadata-schema/file#type: file.

The subject is any string you wish to use as a search result (we just made this one up), and the content is an almost arbitrary JSON blob describing it.

As mentioned in the Overview, there is no Entry ID for this data, so the Entry ID is null. We’ll cover the few restrictions on content in a moment, but first, let’s look at that special field: visible_to.

visible_to is a list of security principals allowed to read the metadata. That is to say people, or descriptors for groups of people, who can see this result when they query Globus Search. Each string will be in the form of a Principal URN, or the special string "public" which means… drumroll … that the data is public and anyone can see it via a Search Query.

Note: "public" applies even to unauthenticated users who search without having logged in anywhere.
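
The visibility rule can be modelled in a few lines. This is only a client-side sketch of the rule as described above, not how Globus Search evaluates it; the function name is_visible and the idea of passing the caller's principals as a list are assumptions for illustration.

```python
PUBLIC = "public"


def is_visible(entry, principals):
    """Return True if a caller holding `principals` may read the entry."""
    allowed = set(entry["visible_to"])
    # "public" grants access to everyone, including unauthenticated callers
    # (modelled here as an empty list of principals).
    return PUBLIC in allowed or bool(allowed & set(principals))


identity = "urn:globus:auth:identity:46bd0f56-e24f-11e5-a510-131bef46955c"
public_entry = {"subject": "s", "visible_to": ["public"], "content": {}}
private_entry = {"subject": "s", "visible_to": [identity], "content": {}}

anonymous_sees_public = is_visible(public_entry, [])
anonymous_sees_private = is_visible(private_entry, [])
owner_sees_private = is_visible(private_entry, [identity])
```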

2.1.1. GMetaEntry.content

content is the "meat" of a GMetaEntry. It’s the… well, the content. The stuff you want to put into Search to describe a subject. It has almost no constraints, but for one: the special @context field.

The @context is used to implement shorthand for long and complex field names. For example, the following two content blocks are identical in meaning:

{
  "@context": {
    "f": "file_meta"
  },
  "f:type": "file",
  "f:extension": "txt",
  "f:name" : "abc.txt"
}

is the same as

{
  "file_meta#type": "file",
  "file_meta#extension": "txt",
  "file_meta#name": "abc.txt"
}

@context never gets indexed in Globus Search.

The GMetaContent document type is fully and formally defined below, but it’s quite short and simple.
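
To make the shorthand concrete, here is a minimal sketch of the expansion. It assumes that only top-level keys are expanded and that the prefix and the remainder are joined with "#", both inferred from the example above; the function name expand_context is hypothetical and this is not the service's implementation.

```python
def expand_context(content):
    """Expand "@context" shorthands in the top-level keys of a content dict."""
    context = content.get("@context", {})
    expanded = {}
    for key, value in content.items():
        if key == "@context":
            continue  # the context itself is dropped from the result
        prefix, sep, rest = key.partition(":")
        # Only rewrite keys whose prefix is declared in @context.
        if sep and prefix in context:
            key = f"{context[prefix]}#{rest}"
        expanded[key] = value
    return expanded


short_form = {
    "@context": {"f": "file_meta"},
    "f:type": "file",
    "f:extension": "txt",
    "f:name": "abc.txt",
}
long_form = expand_context(short_form)
```

Keys whose prefix is not declared in @context, such as full URLs containing a colon, pass through unchanged in this sketch.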

2.1.2. GMetaEntry.id

Our original example entry document did not specify an id and therefore used a null ID.

The ID is used to distinguish between multiple Entries for a single Subject. It is also needed to access the Entry Ops API, which provides create, read, update, and delete operations on individual entries.

id is an arbitrary string field in a GMetaEntry document. For example, here’s a GMetaEntry with an explicit id:

{
  "id": "filetype",
  "subject": "https://search.api.globus.org/abc.txt",
  "visible_to": ["public"],
  "content": {
    "metadata-schema/file#type": "file"
  }
}
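
One way to picture how id separates Entries for a Subject is to index them by the pair (subject, id). The sketch below does this locally; group_by_subject is a hypothetical helper, Python's None stands in for the null ID, and the fact that a later entry with the same pair overwrites an earlier one is a property of this sketch only.

```python
from collections import defaultdict


def group_by_subject(entries):
    """Map each subject to a dict of its entries keyed by id (None = null id)."""
    grouped = defaultdict(dict)
    for entry in entries:
        grouped[entry["subject"]][entry.get("id")] = entry
    return dict(grouped)


subject = "https://search.api.globus.org/abc.txt"
entries = [
    {"subject": subject, "visible_to": ["public"],
     "content": {"metadata-schema/file#type": "file"}},
    {"id": "filetype", "subject": subject, "visible_to": ["public"],
     "content": {"metadata-schema/file#type": "file"}},
]
grouped = group_by_subject(entries)
```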

2.2. Complete Example Document

Now that everything has been introduced, let’s combine them all into a single GMetaList document (the ingest_data of a GIngest document) with multiple subjects and multiple entries.

{
  "gmeta": [
    {
      "id": "filetype",
      "subject": "https://search.api.globus.org/abc.txt",
      "visible_to": ["public"],
      "content": {
        "metadata-schema/file#type": "file"
      }
    },
    {
      "id": "size",
      "subject": "https://search.api.globus.org/abc.txt",
      "visible_to": ["urn:globus:auth:identity:46bd0f56-e24f-11e5-a510-131bef46955c"],
      "content": {
        "metadata-schema/file#size": "1000000",
        "metadata-schema/file#size_human": "1MB"
      }
    },
    {
      "subject": "https://search.api.globus.org/def.txt",
      "visible_to": ["public"],
      "content": {
        "@context": {
          "f": "metadata-schema/file"
        },
        "f:type": "file",
        "f:size": "1000000",
        "f:size_human": "1MB"
      }
    }
  ]
}

Two of the entries have explicit id fields and one is using the implicit null id. One document uses the @context shorthand while the other two do not.

One of them sets visible_to to only let data be viewed by a single specific identity, while the other two are public.

Two documents describe https://search.api.globus.org/abc.txt and one describes https://search.api.globus.org/def.txt.

To submit this to Search, use the Ingest API below…

3. Document Types

3.1. GMetaContent

GMetaContent is arbitrary structured data provided by data sources for Globus Search. It has only one special field, @context.

| Field Name | Type   | Description |
|------------|--------|-------------|
| @context   | Object | A set of shorthands which will be expanded in all other fields of the document |

The @context field defines shorthands which are interpolated into the document keys. The example at the end of this section shows the expansion.

Special Note: Long Fields

All text or string type fields are constrained on their total length when used for faceting or sorting. A record containing more than 10,000 characters in a field will not appear in any facet buckets for that field. Such a record will also appear at the end of any sort operation on that field, even though its value may lexically sort earlier in the list.

Example of the @context shorthand:

{
  "@context": {
    "f": "file_meta"
  },
  "f:type": "file",
  "f:extension": "txt",
  "f:name" : "abc.txt"
}

which is equivalent to and will be expanded as:

{
  "file_meta#type": "file",
  "file_meta#extension": "txt",
  "file_meta#name": "abc.txt"
}
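
The length limit in the Special Note above can be checked before ingest. The following sketch walks a content dict and reports string values longer than 10,000 characters. The helper name find_long_fields and the dotted-path output are assumptions for illustration, Python's len counts code points (which may differ from how the service counts characters), and strings inside lists are not examined.

```python
MAX_FACET_SORT_CHARS = 10_000  # limit quoted in the Special Note


def find_long_fields(content, limit=MAX_FACET_SORT_CHARS, path=""):
    """Yield dotted paths of string values longer than `limit` characters."""
    for key, value in content.items():
        here = f"{path}.{key}" if path else key
        if isinstance(value, str) and len(value) > limit:
            yield here
        elif isinstance(value, dict):
            # Recurse into nested objects, extending the dotted path.
            yield from find_long_fields(value, limit, here)


content = {
    "title": "x" * 10_001,
    "nested": {"ok": "short", "big": "y" * 20_000},
    "exact": "z" * 10_000,  # exactly at the limit, so not reported
}
too_long = list(find_long_fields(content))
```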

3.2. GMetaEntry

A GMetaEntry is a single assertion of metadata pertaining to a given subject.

| Field Name | Type             | Description |
|------------|------------------|-------------|
| subject    | String           | The entity described by this metadata, typically a URI |
| visible_to | Array of Strings | A list of security principals allowed to read the metadata. Each string is a Principal URN or the special string "public". |
| content    | Object           | A GMetaContent. This is the actual metadata to assert about subject |
| id         | String           | Optional. A unique identifier for this metadata entry. This value is used in further API operations which reference this entry, such as updates or deletes. When id is not provided, it is assumed to have a default "null" value. |
| mimetype   | String           | Should be "application/json" if used |

{
  "subject": "https://search.api.globus.org/abc.txt",
  "visible_to": ["public"],
  "content": {
    "http://transfer.api.globus.org/metadata-schema/file#type": "file"
  }
}

{
  "subject": "https://search.api.globus.org/abc.txt",
  "mimetype": "application/json",
  "visible_to": ["urn:globus:auth:identity:46bd0f56-e24f-11e5-a510-131bef46955c"],
  "id" : "visible_to_globus@globus.org",
  "content": {
    "http://transfer.api.globus.org/metadata-schema/file#type": "file",
    "http://transfer.api.globus.org/metadata-schema/file#extension": "txt",
    "http://transfer.api.globus.org/metadata-schema/file#name" : "abc.txt"
  }
}

This second document is a superset of the first example, but it is visible only to the user globus@globus.org. It demonstrates how multiple entries about the same subject, but with different IDs, can be useful: some data is visible only to certain users or groups, while other data is public.
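
The field table above translates fairly directly into a local shape check. The sketch below is one strict reading of that table: subject, visible_to, and content are treated as required because only id is marked optional, which is an assumption. The name validate_gmeta_entry is hypothetical, and passing this check says nothing about whether the service will accept the document.

```python
def validate_gmeta_entry(entry):
    """Return a list of problems found in a GMetaEntry dict (empty if none)."""
    problems = []
    if not isinstance(entry.get("subject"), str):
        problems.append("subject must be a string")
    visible_to = entry.get("visible_to")
    if not (isinstance(visible_to, list)
            and all(isinstance(p, str) for p in visible_to)):
        problems.append("visible_to must be an array of strings")
    if not isinstance(entry.get("content"), dict):
        problems.append("content must be an object")
    # id is optional; when present it must be a string.
    if "id" in entry and not isinstance(entry["id"], str):
        problems.append("id must be a string when present")
    if entry.get("mimetype", "application/json") != "application/json":
        problems.append('mimetype should be "application/json" if used')
    return problems


good = {
    "subject": "https://search.api.globus.org/abc.txt",
    "visible_to": ["public"],
    "content": {"http://transfer.api.globus.org/metadata-schema/file#type": "file"},
}
bad = {"subject": 1, "visible_to": "public", "content": []}

good_problems = validate_gmeta_entry(good)
bad_problems = validate_gmeta_entry(bad)
```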

3.3. GMetaList

A GMetaList is a collection of GMetaEntry documents.

| Field Name | Type  | Description |
|------------|-------|-------------|
| gmeta      | Array | An array of GMetaEntry documents |

{
  "gmeta": [
    {
      "subject": "https://datasearch.demo.globus.org/",
      "mimetype": "application/json",
      "visible_to": ["public"],
      "id" : "valid_doc_1",
      "content": {
          "type": "file",
          "extension": "txt",
          "name" : "abc.txt"
      }
    }
  ]
}

3.4. GIngest

A GIngest document is a wrapper around a GMetaList or GMetaEntry which supplies attributes relevant to the ingest and indexing of metadata into the Globus Search service.

| Field Name  | Type   | Description |
|-------------|--------|-------------|
| ingest_type | String | Must be one of {"GMetaList", "GMetaEntry"}. Describes the type of ingest_data |
| ingest_data | Object | Must be a document of the type named in ingest_type. This is the data to add to the Globus Search index |

{
  "ingest_type": "GMetaEntry",
  "ingest_data": {
    "subject": "https://search.api.globus.org/",
    "mimetype": "application/json",
    "visible_to": ["public"],
    "id": "stephen_test_doc_2016_11_13",
    "content": {
      "type": "file",
      "extension": "txt",
      "name" : "stephen's test document with spaces.txt"
    }
  }
}

{
  "ingest_type": "GMetaEntry",
  "ingest_data": {
    "subject": "https://search.api.globus.org/",
    "mimetype": "application/json",
    "visible_to": ["public"],
    "id": "test_doc_2017_06_14",
    "content": {
      "type": "file",
      "extension": "txt",
      "name" : "another_document_without_spaces.txt"
    }
  }
}
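
The relationship between ingest_type and ingest_data can be checked locally before submitting. This is a rough sketch, not the service's validation: check_gingest is a hypothetical name, and deciding that a GMetaList needs a gmeta array and a GMetaEntry needs a subject is this example's shorthand for "a document of the type named in ingest_type".

```python
INGEST_TYPES = {"GMetaList", "GMetaEntry"}


def check_gingest(doc):
    """Return the ingest_type, or raise ValueError if doc is not GIngest-shaped."""
    ingest_type = doc.get("ingest_type")
    if ingest_type not in INGEST_TYPES:
        raise ValueError(f"ingest_type must be one of {sorted(INGEST_TYPES)}")
    data = doc.get("ingest_data")
    if not isinstance(data, dict):
        raise ValueError("ingest_data must be an object")
    # Shallow check that ingest_data matches the declared type.
    if ingest_type == "GMetaList" and not isinstance(data.get("gmeta"), list):
        raise ValueError("GMetaList ingest_data needs a gmeta array")
    if ingest_type == "GMetaEntry" and "subject" not in data:
        raise ValueError("GMetaEntry ingest_data needs a subject")
    return ingest_type


doc = {
    "ingest_type": "GMetaEntry",
    "ingest_data": {
        "subject": "https://search.api.globus.org/",
        "visible_to": ["public"],
        "content": {"type": "file"},
    },
}
checked_type = check_gingest(doc)
```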

4. API Methods

The Ingest API provides only one method, suitable for submitting a new GIngest document and getting back a task ID.

4.1. Ingest

Queue the data to be added to index_id, and return a TaskID which can be used with the Task API to check the status of the ingest request.

URL

/v1/index/<index_id>/ingest

Method

POST

HTTP Headers

Authorization: Bearer <Globus Auth token> [1]
Content-Type: application/json

Request Body

a GIngest document

Response Body

{
  "acknowledged": true,
  "task_id": TaskID,
  "as_identity": IdentityID
}

[1] The token must have the urn:globus:scopes:search.api.globus.org:all or urn:globus:scopes:search.api.globus.org:ingest scope, and must belong to a user with admin or write permissions against <index_id>.
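
The method description above can be sketched with Python's standard library. The code builds the request without sending it and reads the task ID out of a response body. The function names are hypothetical, the host is taken from the curl examples in this document, and the token is a placeholder you must supply yourself.

```python
import json
import urllib.request

BASE_URL = "https://search.api.globus.org"  # host used by the curl examples


def build_ingest_request(index_id, gingest_doc, token):
    """Build (but do not send) the POST request for the Ingest method."""
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/index/{index_id}/ingest",
        data=json.dumps(gingest_doc).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_ingest_response(body):
    """Extract the task ID from a response body shaped like the one above."""
    payload = json.loads(body)
    if not payload.get("acknowledged"):
        raise RuntimeError("ingest request was not acknowledged")
    return payload["task_id"]


request = build_ingest_request(
    "4de0e89e-a395-11e7-bc54-8c705ad34f60",
    {"ingest_type": "GMetaEntry", "ingest_data": {
        "subject": "https://example.com/foo/bar",
        "visible_to": ["public"],
        "content": {"foo/bar": "some val"}}},
    token="PLACEHOLDER_TOKEN",  # hypothetical value, not a real token
)
```

Passing the request to urllib.request.urlopen would send it; the response body can then be handed to parse_ingest_response.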

4.1.1. Examples

Ingest a single entry:

  • in the index 4de0e89e-a395-11e7-bc54-8c705ad34f60

  • with a subject of https://example.com/foo/bar

  • with a null entry_id

  • with public visibility

curl -XPOST 'https://search.api.globus.org/v1/index/4de0e89e-a395-11e7-bc54-8c705ad34f60/ingest' \
    -H 'Authorization: Bearer <Globus Auth token>' \
    -H 'Content-Type: application/json' \
    --data '{
      "ingest_type": "GMetaEntry",
      "ingest_data": {
        "subject": "https://example.com/foo/bar",
        "visible_to": ["public"],
        "content": {
          "foo/bar": "some val"
        }
      }
    }'

  1. The Index ID is provided in the URL

  2. ingest_type names the datatype of the ingest_data document, which is GMetaEntry for a single entry

  3. content is an arbitrary JSON body

Ingest multiple entries:

  • in the index 4de0e89e-a395-11e7-bc54-8c705ad34f60

  • with subject values of https://example.com/foo/bar and https://example.com/foo/bar/baz

  • with entry_id values of null, "alpha", and "beta"

  • with public visibility and visibility only to the user globus@globus.org

    • The ID of globus@globus.org is 46bd0f56-e24f-11e5-a510-131bef46955c, so this is the value which will be used

curl -XPOST 'https://search.api.globus.org/v1/index/4de0e89e-a395-11e7-bc54-8c705ad34f60/ingest' \
    -H 'Authorization: Bearer <Globus Auth token>' \
    -H 'Content-Type: application/json' \
    --data '{
      "ingest_type": "GMetaList",
      "ingest_data": {
        "gmeta": [
          {
            "subject": "https://example.com/foo/bar",
            "visible_to": ["public"],
            "content": {
              "foo/bar": "some val"
            }
          },
          {
            "subject": "https://example.com/foo/bar",
            "id": "alpha",
            "visible_to": [
              "urn:globus:auth:identity:46bd0f56-e24f-11e5-a510-131bef46955c"
            ],
            "content": {
              "foo/bar": "some otherval"
            }
          },
          {
            "subject": "https://example.com/foo/bar/baz",
            "id": "alpha",
            "visible_to": [
              "urn:globus:auth:identity:46bd0f56-e24f-11e5-a510-131bef46955c"
            ],
            "content": {
              "foo/bar/baz": "some val"
            }
          },
          {
            "subject": "https://example.com/foo/bar/baz",
            "id": "beta",
            "visible_to": ["public"],
            "content": {
              "foo/bar/baz": "some otherval"
            }
          }
        ]
      }
    }'

  1. This time, ingest_type is GMetaList

  2. GMetaList.gmeta is an array of GMetaEntry documents

  3. The first entry does not specify an id, so its entry_id is null

  4. The urn:globus:auth:identity:46bd0f56-e24f-11e5-a510-131bef46955c value in visible_to is a Principal URN

Ingest entries with Group visibility:

  • in the index 4de0e89e-a395-11e7-bc54-8c705ad34f60

  • with a subject of https://example.com/foo

  • with entry_id values of "alpha" and "beta"

  • with public visibility and visibility only to the Group with ID 0a4dea26-44cd-11e8-847f-0e6e723ad808

curl -XPOST 'https://search.api.globus.org/v1/index/4de0e89e-a395-11e7-bc54-8c705ad34f60/ingest' \
    -H 'Authorization: Bearer <Globus Auth token>' \
    -H 'Content-Type: application/json' \
    --data '{
      "ingest_type": "GMetaList",
      "ingest_data": {
        "gmeta": [
          {
            "subject": "https://example.com/foo",
            "id": "alpha",
            "visible_to": [
              "urn:globus:group:id:0a4dea26-44cd-11e8-847f-0e6e723ad808"
            ],
            "content": {
              "foo/bar": "some val"
            }
          },
          {
            "subject": "https://example.com/foo",
            "id": "beta",
            "visible_to": ["public"],
            "content": {
              "foo/bar/baz": "some otherval"
            }
          }
        ]
      }
    }'

  1. The urn:globus:group:id:0a4dea26-44cd-11e8-847f-0e6e723ad808 value in visible_to is a Principal URN for a Group


© 2010- The University of Chicago