Supported formats for documents and corpora

NOTE: this document is still a work in progress.

This document gives an overview of the formats GateNLP can use to load and save documents and corpora.

NOTE: the standard format for loading and saving documents is called bdocjs. It is a JSON representation of the document content, optionally gzip-compressed. A detailed description of this format can be found in Format bdocjs.
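The general idea of "JSON, optionally gzip-compressed" can be sketched with the Python standard library alone. This is only an illustration of the storage approach: the helper names save_json and load_json are hypothetical, the payload is a placeholder and not the bdocjs schema, and none of this is GateNLP's own code.

```python
import gzip
import json
import os
import tempfile


def save_json(data, path, compress=False):
    """Write data as JSON text, gzip-compressed if compress is True."""
    opener = gzip.open if compress else open
    with opener(path, "wt", encoding="utf-8") as outfp:
        json.dump(data, outfp)


def load_json(path, compress=False):
    """Read JSON text written by save_json."""
    opener = gzip.open if compress else open
    with opener(path, "rt", encoding="utf-8") as infp:
        return json.load(infp)


# Placeholder payload for illustration only; this is NOT the bdocjs layout.
payload = {"text": "Some example text"}

with tempfile.TemporaryDirectory() as tmpdir:
    plain_path = os.path.join(tmpdir, "example.json")
    packed_path = os.path.join(tmpdir, "example.json.gz")
    save_json(payload, plain_path)
    save_json(payload, packed_path, compress=True)
    roundtrip_plain = load_json(plain_path)
    roundtrip_packed = load_json(packed_path, compress=True)
```

Both variants carry the same JSON content; compression only changes how the bytes are stored on disk.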

Loading individual Documents

To load a document, use the method Document.load(source, ...) (see PythonDoc). The source parameter specifies where the document is loaded from:

Supported formats for loading

If the fmt parameter for the load method is None (the default), then an attempt is made to infer the format from the file extension of the path or URL (the characters after the last dot in the string that follows any path separator).
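The extension rule described above can be sketched in a few lines of Python. This is a minimal illustration of the rule as stated, not GateNLP's actual implementation; the function name infer_extension and the example paths are hypothetical.

```python
from typing import Optional
from urllib.parse import urlparse


def infer_extension(source: str) -> Optional[str]:
    """Return the characters after the last dot of the final path component.

    Returns None if the final component contains no dot or ends with a dot.
    """
    # For URLs, look only at the path part so query strings are ignored.
    path = urlparse(source).path if "://" in source else source
    # Keep only what follows the last path separator.
    name = path.replace("\\", "/").rsplit("/", 1)[-1]
    if "." not in name:
        return None
    ext = name.rsplit(".", 1)[-1]
    return ext or None


for src in ["corpus/doc1.bdocjs", "https://example.org/data/page.html", "notes.v2/README"]:
    print(src, "->", infer_extension(src))
```

Note that a dot in a directory name, as in the last example, does not count: only the part after the final path separator is inspected. When no extension can be found this way, the format has to be given explicitly through fmt.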

If the extension is missing or does not properly indicate the format, the fmt parameter should specify either a known MIME-type-like format specification or a known format identifier. The following list shows the supported formats and which format identifiers and extensions are associated with them:

Supported formats for saving

Saving individual Documents

Reading a sequence of documents

Writing a sequence of documents