Module gatenlp.corpora
Module that defines base and implementation classes for representing document collections.
Corpus subclasses represent collections with a fixed number of documents, where each document can be accessed and stored by its index number, much like lists/arrays of documents.
DocumentSource subclasses represent collections that can be iterated over, producing a sequence of Documents, one document at a time.
DocumentDestination subclasses represent collections that can receive Documents, one document at a time.
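The three roles can be illustrated with a minimal, self-contained sketch. It does not use gatenlp itself: `Doc`, `SimpleListCorpus`, `LinesSource` and `CollectingDestination` are hypothetical stand-ins written for this example, and the `append` method name on the destination is an assumption about how a destination receives documents.

```python
from typing import Iterator, List


class Doc:
    """Hypothetical stand-in for a document (not gatenlp.Document)."""

    def __init__(self, text: str):
        self.text = text


class SimpleListCorpus:
    """Corpus role: fixed number of documents, read and stored by index."""

    def __init__(self, docs: List[Doc]):
        self._docs = list(docs)

    def __len__(self) -> int:
        return len(self._docs)

    def __getitem__(self, idx: int) -> Doc:
        return self._docs[idx]

    def __setitem__(self, idx: int, doc: Doc) -> None:
        # Storing replaces the document at an existing index;
        # the size of the corpus does not change.
        self._docs[idx] = doc


class LinesSource:
    """Source role: can only be iterated, yielding one document at a time."""

    def __init__(self, lines: List[str]):
        self._lines = lines

    def __iter__(self) -> Iterator[Doc]:
        for line in self._lines:
            yield Doc(line)


class CollectingDestination:
    """Destination role: receives documents one at a time."""

    def __init__(self) -> None:
        self.received: List[Doc] = []

    def append(self, doc: Doc) -> None:
        self.received.append(doc)


# Stream documents from a source into a destination.
source = LinesSource(["first text", "second text"])
dest = CollectingDestination()
for doc in source:
    dest.append(doc)

# Wrap the collected documents in a corpus and update one by index.
corpus = SimpleListCorpus(dest.received)
corpus[0] = Doc(corpus[0].text.upper())
```

The sketch shows the main difference between the roles: a corpus supports random access and in-place replacement, while a source or destination only supports sequential reading or writing.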
Source code
"""
Module that defines base and implementation classes for representing document collections.
Corpus subclasses represent collections with a fixed number of documents, where each document can be
accessed and stored by its index number, much like lists/arrays of documents.
DocumentSource subclasses represent collections that can be iterated over, producing a sequence of Documents,
one document at a time.
DocumentDestination subclasses represent collections that can receive Documents, one document at a time.
"""
from gatenlp.corpora.base import Corpus, DocumentSource, DocumentDestination, NullDestination
from gatenlp.corpora.base import MultiProcessingAble, DistributedProcessingAble
from gatenlp.corpora.base import EveryNthCorpus, EveryNthSource, ShuffledCorpus, CachedCorpus
from gatenlp.corpora.memory import ListCorpus, PandasDfSource
from gatenlp.corpora.files import BdocjsLinesFileSource, BdocjsLinesFileDestination
from gatenlp.corpora.files import JsonLinesFileSource, JsonLinesFileDestination
from gatenlp.corpora.files import TsvFileSource
from gatenlp.corpora.dirs import DirFilesCorpus, DirFilesSource, DirFilesDestination, NumberedDirFilesCorpus
Sub-modules
gatenlp.corpora.base
Module that defines base classes for representing document collections …
gatenlp.corpora.conll
Module that provides document source/destination classes for importing and exporting documents from/to various CoNLL formats.
gatenlp.corpora.dirs
Module that defines Corpus and DocumentSource/DocumentDestination classes which access documents as files in a directory.
gatenlp.corpora.export
Module that defines DocumentDestination classes for exporting specific formats.
gatenlp.corpora.files
Module that defines Corpus and DocumentSource/DocumentDestination classes which access documents as lines or parts in a file.
gatenlp.corpora.memory
Module that defines Corpus and DocumentSource/DocumentDestination classes which access documents from in-memory objects.