public interface SimpleCorpus extends LanguageResource, List<Document>, NameBearer
Modifier and Type | Field and Description |
---|---|
static String | CORPUS_DOCLIST_PARAMETER_NAME |
static String | CORPUS_NAME_PARAMETER_NAME |
Modifier and Type | Method and Description |
---|---|
String | getDocumentName(int index): Gets the name of a document in this corpus. |
List<String> | getDocumentNames(): Gets the names of the documents in this corpus. |
void | populate(URL directory, FileFilter filter, String encoding, boolean recurseDirectories): Fills this corpus with documents created on the fly from selected files in a directory. |
void | populate(URL directory, FileFilter filter, String encoding, String mimeType, boolean recurseDirectories): Fills this corpus with documents created on the fly from selected files in a directory. |
long | populate(URL singleConcatenatedFile, String documentRootElement, String encoding, int numberOfDocumentsToExtract, String documentNamePrefix, String mimeType, boolean includeRootElement): Fills the provided corpus with documents extracted from the provided TREC file. |
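Methods inherited from interface LanguageResource: getDataStore, getLRPersistenceId, getParent, isModified, setDataStore, setLRPersistenceId, setParent, sync
Methods inherited from interface Resource: cleanup, getParameterValue, init, setParameterValue, setParameterValues
Methods inherited from interface FeatureBearer: getFeatures, setFeatures
Methods inherited from interface NameBearer: getName, setName
Methods inherited from interface java.util.List: add, add, addAll, addAll, clear, contains, containsAll, equals, get, hashCode, indexOf, isEmpty, iterator, lastIndexOf, listIterator, listIterator, remove, remove, removeAll, replaceAll, retainAll, set, size, sort, spliterator, subList, toArray, toArray
Methods inherited from interface java.util.Collection: parallelStream, removeIf, stream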
static final String CORPUS_NAME_PARAMETER_NAME
static final String CORPUS_DOCLIST_PARAMETER_NAME
List<String> getDocumentNames()
Returns:
a List of Strings representing the names of the documents in this corpus.

String getDocumentName(int index)
Parameters:
index - the index of the document

void populate(URL directory, FileFilter filter, String encoding, boolean recurseDirectories) throws IOException, ResourceInstantiationException
Fills this corpus with documents created on the fly from selected files in a directory. Uses a FileFilter to select which files will be used and which will be ignored. A simple file filter based on extensions is provided in the GATE distribution (ExtensionFileFilter).
Parameters:
directory - the directory from which the files will be picked. This parameter is a URL for uniformity. It must be a URL of type file; otherwise an InvalidArgumentException will be thrown. An implementation of this method is provided as a static method at CorpusImpl.populate(Corpus, URL, FileFilter, String, boolean).
filter - the file filter used to select files from the target directory. If the filter is null, all the files will be accepted.
encoding - the encoding to be used for reading the documents
recurseDirectories - whether the directory should be traversed recursively. If true, all the files from the provided directory and all its child directories (on as many levels as necessary) will be picked if accepted by the filter; otherwise the child directories will be ignored.
Throws:
IOException
ResourceInstantiationException
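The selection rules described above (a null filter accepts every file, and child directories are visited only when recurseDirectories is true) can be illustrated with plain java.io. This is a minimal sketch under stated assumptions, not GATE's implementation: FileSelectionSketch, byExtension, select and createSampleTree are hypothetical names, and byExtension is only a stand-in for the ExtensionFileFilter mentioned in the text.

```java
import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative only: none of these names are part of the GATE API.
class FileSelectionSketch {

  // Hypothetical extension-based filter (a stand-in, not GATE's ExtensionFileFilter).
  static FileFilter byExtension(String... extensions) {
    return file -> {
      String name = file.getName().toLowerCase(Locale.ROOT);
      return Arrays.stream(extensions)
          .anyMatch(ext -> name.endsWith("." + ext.toLowerCase(Locale.ROOT)));
    };
  }

  // Collects the files that the documented selection rules would pick.
  static List<File> select(File directory, FileFilter filter, boolean recurseDirectories) {
    List<File> selected = new ArrayList<>();
    File[] entries = directory.listFiles();
    if (entries == null) {
      return selected; // not a directory, or not readable
    }
    Arrays.sort(entries); // deterministic order for the demo
    for (File entry : entries) {
      if (entry.isDirectory()) {
        if (recurseDirectories) {
          selected.addAll(select(entry, filter, true));
        }
      } else if (filter == null || filter.accept(entry)) {
        selected.add(entry); // a null filter accepts every file
      }
    }
    return selected;
  }

  // Builds a small temporary tree: a.txt, b.xml, sub/c.txt.
  static File createSampleTree() {
    try {
      File root = Files.createTempDirectory("corpus-sketch").toFile();
      File sub = new File(root, "sub");
      sub.mkdir();
      new File(root, "a.txt").createNewFile();
      new File(root, "b.xml").createNewFile();
      new File(sub, "c.txt").createNewFile();
      return root;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    File root = createSampleTree();
    System.out.println(select(root, byExtension("txt"), false).size()); // 1
    System.out.println(select(root, byExtension("txt"), true).size());  // 2
    System.out.println(select(root, null, true).size());                // 3
  }
}
```

In an actual call, the filter and the recursion flag would be passed straight to populate along with the directory URL and the encoding; the sketch only shows which files such a call would be expected to consider.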

void populate(URL directory, FileFilter filter, String encoding, String mimeType, boolean recurseDirectories) throws IOException, ResourceInstantiationException
Fills this corpus with documents created on the fly from selected files in a directory. Uses a FileFilter to select which files will be used and which will be ignored. A simple file filter based on extensions is provided in the GATE distribution (ExtensionFileFilter).
Parameters:
directory - the directory from which the files will be picked. This parameter is a URL for uniformity. It must be a URL of type file; otherwise an InvalidArgumentException will be thrown. An implementation of this method is provided as a static method at CorpusImpl.populate(Corpus, URL, FileFilter, String, boolean).
filter - the file filter used to select files from the target directory. If the filter is null, all the files will be accepted.
encoding - the encoding to be used for reading the documents
mimeType - the mime type to be used when loading documents. If null, the mime type will be determined automatically.
recurseDirectories - whether the directory should be traversed recursively. If true, all the files from the provided directory and all its child directories (on as many levels as necessary) will be picked if accepted by the filter; otherwise the child directories will be ignored.
Throws:
IOException
ResourceInstantiationException
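Because the directory parameter must be a URL of type file, callers may find it useful to build the URL from a java.io.File and to check the protocol before calling populate. The following is a minimal sketch using only the JDK. FileUrlSketch and toLocalDirectory are hypothetical names, and the use of IllegalArgumentException here is an assumption of the sketch, not a statement about which exception GATE throws.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Illustrative only: not part of the GATE API.
class FileUrlSketch {

  // Converts a file: URL to a local File, rejecting any other protocol.
  static File toLocalDirectory(URL directory) {
    if (!"file".equals(directory.getProtocol())) {
      // Assumption: the sketch signals the problem with IllegalArgumentException.
      throw new IllegalArgumentException("Expected a file: URL but got " + directory);
    }
    try {
      return new File(directory.toURI());
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException("Not a valid URI: " + directory, e);
    }
  }

  public static void main(String[] args) throws MalformedURLException {
    // A directory URL built from a File always uses the file protocol.
    URL local = new File(System.getProperty("java.io.tmpdir")).toURI().toURL();
    System.out.println(toLocalDirectory(local).isDirectory());

    // A non-file URL is rejected before any directory access is attempted.
    URL remote = URI.create("http://example.org/corpus/").toURL();
    try {
      toLocalDirectory(remote);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```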

long populate(URL singleConcatenatedFile, String documentRootElement, String encoding, int numberOfDocumentsToExtract, String documentNamePrefix, String mimeType, boolean includeRootElement) throws IOException, ResourceInstantiationException
Fills the provided corpus with documents extracted from the provided TREC file.
Parameters:
singleConcatenatedFile - the file with multiple documents in it.
documentRootElement - content between the start and end of this element is considered for documents.
encoding - the encoding of the TREC file.
numberOfDocumentsToExtract - the number of documents to extract from the concatenated file. Use -1 to extract all documents.
documentNamePrefix - the prefix to use for document names when creating from
mimeType - the mime type which determines how the document is handled
Throws:
IOException
ResourceInstantiationException
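The idea of cutting one concatenated file into several documents by a root element can be sketched with a regular expression over already-decoded text. This is an illustration under stated assumptions, not GATE's extraction code: ConcatenatedFileSketch and extract are hypothetical names, the prefix-plus-counter naming scheme is invented for the demo, and the reading of includeRootElement (keep or drop the surrounding tags) is inferred from the parameter name only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: not part of the GATE API.
class ConcatenatedFileSketch {

  // Returns document name -> document text, in order of appearance.
  static Map<String, String> extract(String content,
                                     String documentRootElement,
                                     int numberOfDocumentsToExtract,
                                     String documentNamePrefix,
                                     boolean includeRootElement) {
    String tag = Pattern.quote(documentRootElement);
    // group(1) captures the text between the start and end tags.
    Pattern pattern = Pattern.compile(
        "<" + tag + "(?:\\s[^>]*)?>(.*?)</" + tag + ">", Pattern.DOTALL);
    Matcher matcher = pattern.matcher(content);
    Map<String, String> documents = new LinkedHashMap<>();
    // -1 means no limit; otherwise stop once enough documents were found.
    while ((numberOfDocumentsToExtract == -1
            || documents.size() < numberOfDocumentsToExtract)
        && matcher.find()) {
      // Assumption: includeRootElement keeps the surrounding tags.
      String text = includeRootElement ? matcher.group() : matcher.group(1);
      // Assumption: names are the prefix followed by a running counter.
      documents.put(documentNamePrefix + (documents.size() + 1), text);
    }
    return documents;
  }

  public static void main(String[] args) {
    String trec = "<DOC>first</DOC>\n<DOC>second</DOC>\n<DOC>third</DOC>";
    System.out.println(extract(trec, "DOC", -1, "doc_", false));
    // {doc_1=first, doc_2=second, doc_3=third}
    System.out.println(extract(trec, "DOC", 2, "doc_", true));
    // {doc_1=<DOC>first</DOC>, doc_2=<DOC>second</DOC>}
  }
}
```

The sketch takes a String, so the encoding parameter has no counterpart here; in the real method it governs how the concatenated file is decoded before any splitting takes place.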
Copyright © 2024 GATE. All rights reserved.