Basic Document Representation

The “Basic Document” or “Bdoc” representation is a simple way to represent GATE documents, features, annotation sets and annotations using only basic data types, so that the same representation can easily be used from several programming languages. It is limited to the following data types: string, integer, float, boolean, array/list and map (essentially what basic JSON supports).
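As an illustration of this restriction, here is a minimal Python sketch that checks whether a value uses only the data types listed above. The function name `is_bdoc_value` is hypothetical and not part of the plugin. The requirement that map keys be strings is an assumption taken from JSON, not something stated here.

```python
def is_bdoc_value(value):
    """Return True if value is built only from the basic Bdoc data types.

    Hypothetical helper for illustration; not part of the plugin.
    """
    # Scalar leaf values: string, boolean, integer, float.
    if isinstance(value, (str, bool, int, float)):
        return True
    # Array/list: every element must itself be a valid value.
    if isinstance(value, list):
        return all(is_bdoc_value(item) for item in value)
    # Map: assumption (as in JSON) that keys are strings.
    if isinstance(value, dict):
        return all(
            isinstance(key, str) and is_bdoc_value(item)
            for key, item in value.items()
        )
    # Anything else (sets, custom objects, ...) is outside the representation.
    return False
```

A feature map such as `{"a": 1, "b": True, "c": "some string"}` passes this check, while a map containing a Python `set` does not.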

The various serialization formats supported by the plugin (JSON, YAML, MsgPack) simply serialize that representation using one of those formats. There may be additional fields in the serialization representation in order to deal with format versions or for distinguishing object types. These are mentioned below.

The abstract BdocDocument representation

A document is a map with the following keys:

The document text must be able to represent any Unicode text; different serialization methods may encode the text in different ways.

Features are represented as a map:

An annotation set is represented as a map with the following keys:

Each annotation is represented as a map with the following keys:

Examples

Here is a simple example document serialized as JSON (bdocjs):

{
   "offset_type" : "p",
   "name" : "",
   "features" : {
      "feat1" : "value1"
   },
   "annotation_sets" : {
      "" : {
         "annotations" : [
            {
               "end" : 2,
               "id" : 0,
               "features" : {
                  "a" : 1,
                  "b" : true,
                  "c" : "some string"
               },
               "start" : 0,
               "type" : "Type1"
            }
         ],
         "name" : "",
         "next_annid" : 1
      },
      "Set2" : {
         "annotations" : [
            {
               "id" : 0,
               "start" : 2,
               "features" : {},
               "type" : "Type2",
               "end" : 8
            }
         ],
         "next_annid" : 1,
         "name" : "Set2"
      }
   },
   "text" : "A simple document"
}
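
Because the representation uses only basic data types, a document like the one above can be read with any standard JSON parser. The following Python sketch loads a compact copy of the example and lists the text covered by each annotation. The helper name `covered_texts` is hypothetical. The sketch assumes that the `start` and `end` offsets can be used directly as Python string indices, which holds for this example because the text is plain ASCII.

```python
import json

# Compact copy of the example document shown above.
BDOC_JSON = """
{
  "offset_type": "p",
  "name": "",
  "features": {"feat1": "value1"},
  "annotation_sets": {
    "": {
      "annotations": [
        {"id": 0, "type": "Type1", "start": 0, "end": 2,
         "features": {"a": 1, "b": true, "c": "some string"}}
      ],
      "name": "",
      "next_annid": 1
    },
    "Set2": {
      "annotations": [
        {"id": 0, "type": "Type2", "start": 2, "end": 8, "features": {}}
      ],
      "name": "Set2",
      "next_annid": 1
    }
  },
  "text": "A simple document"
}
"""


def covered_texts(bdoc):
    """Return (set name, annotation type, covered text) for each annotation.

    Hypothetical helper; assumes offsets are valid Python string indices.
    """
    text = bdoc["text"]
    result = []
    for set_name, annset in bdoc["annotation_sets"].items():
        for ann in annset["annotations"]:
            result.append((set_name, ann["type"], text[ann["start"]:ann["end"]]))
    return result


bdoc = json.loads(BDOC_JSON)
for set_name, ann_type, covered in covered_texts(bdoc):
    print(repr(set_name), ann_type, repr(covered))
```

For this document, the `Type1` annotation in the default set (name `""`) covers `"A "` and the `Type2` annotation in `Set2` covers `"simple"`.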

The same document serialized as YAML (bdocym):

annotation_sets:
  ? ''
  : annotations:
    - end: 2
      features:
        a: 1
        b: true
        c: some string
      id: 0
      start: 0
      type: Type1
    name: ''
    next_annid: 1
  Set2:
    annotations:
    - end: 8
      features: {}
      id: 0
      start: 2
      type: Type2
    name: Set2
    next_annid: 1
features:
  feat1: value1
name: ''
offset_type: p
text: A simple document