Supported formats for documents and corpora
NOTE: this document is still work in progress!!
This document gives an overview over which formats GateNLP can use to
- load individual Documents from
- save individual Documents to
- read a sequence of documents (DocumentSource)
- write a sequence of documents (DocumentDestination)
NOTE: the standard format for loading/saving documents is a JSON representation of the document content
(optionally gzip compressed) and called bdocjs
. A detailed description of this format can be found
in Format bdocjs
Loading individual Documents
To load a document the method Document.load(source, ...)
can be used (see PythonDoc). The source
parameter is used to specify where the document should get loaded from:
- if
source
is a string, it is interpreted as the path to a file to load, except if the string starts with “http://” or “https://” in which case the string is interpreted as the URL to load the document from. - if
source
is apathlib.Path
object, the corresponding file is loaded, even if the path starts with “http://” or “https://” - if
source
is the result ofurllib.parse.urlparse
, that URL is always used for loading - NOTE: not all formats support loading from an URL, in those cases a string is always interpreted as the file path
Supported formats for loading
If the fmt
parameter for the load
method is None (the default), then an attempt is made to infer the format from the file extension of the path or URL
(the characters after the last dot in the string that follows any path separator).
If the extension is missing or does not properly indicate the format, the fmt
parameter should specify a known mime-type like format specification or
a known format identifier. The following list shows the supported formats and which format identifiers and extensions are associated with them:
- Format: bdocjs
- extensions:
bdocjs
- identifier:
json
,bdocjs
,text/bdocjs
- A JSON-based format, can be exchanged with Java GATE using the
Format_bdoc
plugin. This format fully supports all document content and all feature values which are JSON serializable, does not support shared objects or reference cycles in feature values. - this is the recommended standard format
- extensions:
- Format: bdocjsgz
- extensions:
bdocjs.gz
,bdoc.gz
- identifier:
bdocjsgz
,jsongz
,text/bdocjs+gzip
- Same as bdocjs, but with gzip-compression
- extensions:
- Format: yaml
- Format: yamlgz
- Format: msgpack
- Format: text
- Format: html
- Format: html-rendered
- Format: gatexml
- Format: tweet-v1
- Format: pickle
Supported formats for saving
- Format: bdocjs
- Format: bdocjsgz
- Format: yaml
- Format: yamlgz
- Format: msgpack
- Format: tweet-v1
- Format: html-ann-viewer