@CreoleResource(name="GATE HTML Document Format", isPrivate=true, autoinstances=)
public class NekoHtmlDocumentFormat extends TextualDocumentFormat
DocumentFormat that uses Andy Clark's NekoHTML parser to parse HTML documents. It tries to render HTML in a similar way to a web browser, i.e. whitespace is normalized, paragraphs are separated by a blank line, etc. By default the text content of script, style and iframe tags is ignored completely (the parameter's default value is script;style;iframe), though the set of tags treated in this way is configurable via a CREOLE parameter.
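
The browser-like rendering described above can be illustrated with a minimal, self-contained sketch. This is not the NekoHTML-based code used by this class: the class name BrowserLikeTextSketch and both of its methods are hypothetical, and paragraph boundaries are passed in as a list rather than derived from HTML tags. It only shows the two stated effects, whitespace runs collapsing to a single space and paragraphs being separated by a blank line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: a hypothetical helper, not part of GATE or NekoHTML. */
class BrowserLikeTextSketch {

    /** Collapses every run of whitespace to one space and trims both ends. */
    static String normalizeWhitespace(String raw) {
        return raw.replaceAll("\\s+", " ").trim();
    }

    /** Normalizes each paragraph, drops empty ones, and joins the rest with a blank line. */
    static String joinParagraphs(List<String> paragraphs) {
        List<String> cleaned = new ArrayList<>();
        for (String paragraph : paragraphs) {
            String text = normalizeWhitespace(paragraph);
            if (!text.isEmpty()) {
                cleaned.add(text);
            }
        }
        return String.join("\n\n", cleaned);
    }

    public static void main(String[] args) {
        List<String> paragraphs =
            Arrays.asList("Hello,\n   world!", "  ", "Second\tparagraph.");
        System.out.println(joinParagraphs(paragraphs));
    }
}
```

Running main prints the two non-empty paragraphs with a single blank line between them; the whitespace-only entry is dropped.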
Inherited fields:
element2StringMap, magic2mimeTypeMap, markupElementsMap, mimeString2ClassHandlerMap, mimeString2mimeTypeMap, suffixes2mimeTypeMap
dataStore, lrPersistentId
name

| Constructor and Description |
|---|
| NekoHtmlDocumentFormat() Default construction |

| Modifier and Type | Method and Description |
|---|---|
| Set<String> | getIgnorableTags() |
| Resource | init() Initialise this resource, and return it. |
| void | setIgnorableTags(Set<String> newTags) |
| Boolean | supportsRepositioning() We support repositioning info for HTML files. |
| void | unpackMarkup(Document doc) Old-style unpackMarkup, without repositioning info. |
| void | unpackMarkup(Document doc, RepositioningInfo repInfo, RepositioningInfo ampCodingInfo) Unpack the markup in the document. |
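
The repositioning support mentioned in the method summary concerns mapping offsets in the extracted text back to offsets in the original markup. The sketch below illustrates that idea for entity decoding only. It does not use or reproduce GATE's RepositioningInfo API: the class RepositioningSketch, its fields, and the toOriginal method are hypothetical names chosen for illustration, and only three entities are handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a hypothetical offset map, not GATE's RepositioningInfo. */
class RepositioningSketch {

    /** Text after entity decoding. */
    final StringBuilder extracted = new StringBuilder();

    /** For each character of the extracted text, its start offset in the original. */
    final List<Integer> originalOffsets = new ArrayList<>();

    /** Decodes a few entities and records where each output character came from. */
    static RepositioningSketch decode(String original) {
        RepositioningSketch result = new RepositioningSketch();
        String[][] entities = {{"&amp;", "&"}, {"&lt;", "<"}, {"&gt;", ">"}};
        int i = 0;
        while (i < original.length()) {
            boolean matched = false;
            for (String[] entity : entities) {
                if (original.startsWith(entity[0], i)) {
                    result.extracted.append(entity[1]);
                    result.originalOffsets.add(i);
                    i += entity[0].length(); // skip the whole entity in the source
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                result.extracted.append(original.charAt(i));
                result.originalOffsets.add(i);
                i++;
            }
        }
        return result;
    }

    /** Maps an offset in the extracted text back to the original text. */
    int toOriginal(int extractedOffset) {
        return originalOffsets.get(extractedOffset);
    }

    public static void main(String[] args) {
        RepositioningSketch r = decode("a &amp; b");
        System.out.println(r.extracted);     // a & b
        System.out.println(r.toOriginal(4)); // 8
    }
}
```

Keeping one original offset per extracted character is the simplest possible representation; a real implementation would more likely store one record per changed span to save memory.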
Inherited methods:
annotateParagraphs, getDataStore, hasContentButNoValidUrl, setNewLineProperty
addStatusListener, areEqual, decideBetweenThreeMimeTypes, decideBetweenTwoMimeTypes, fireStatusChanged, getDocumentFormat, getDocumentFormat, getDocumentFormat, getDocumentFormat, getElement2StringMap, getFeatures, getMarkupElementsMap, getMimeType, getMimeTypeForString, getShouldCollectRepositioning, getSupportedFileSuffixes, getSupportedMimeTypes, guessTypeUsingMagicNumbers, removeStatusListener, runMagicNumbers, setElement2StringMap, setFeatures, setMarkupElementsMap, setMimeType, setShouldCollectRepositioning, unpackMarkup, willReadFromUrl
cleanup, getLRPersistenceId, getParent, isModified, setDataStore, setLRPersistenceId, setParent, sync
checkParameterValues, flushBeanInfoCache, forgetBeanInfo, getBeanInfo, getInitParameterValues, getInitParameterValues, getName, getParameterValue, getParameterValue, getParameterValues, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners, toString
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
getParameterValue, setParameterValue, setParameterValues
getName, setName

@CreoleParameter(comment="HTML tags whose text content should be ignored", defaultValue="script;style;iframe")
public void setIgnorableTags(Set<String> newTags)
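
The default value above is written as a semicolon-separated string, while the setter takes a Set<String>. As a rough sketch of how such a string corresponds to a tag set, the example below splits the list and checks element names against it. The class IgnorableTagsSketch and its methods are hypothetical, and both the splitting rule and the case-insensitive comparison are assumptions made for illustration rather than documented behaviour of this class.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative only: a hypothetical helper, not part of GATE. */
class IgnorableTagsSketch {

    /** Splits a semicolon-separated list into a set of lower-cased tag names. */
    static Set<String> parseTagList(String value) {
        Set<String> tags = new LinkedHashSet<>();
        for (String item : value.split(";")) {
            String tag = item.trim().toLowerCase(Locale.ROOT);
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }

    /** Assumption: element names are compared case-insensitively. */
    static boolean isIgnorable(Set<String> tags, String elementName) {
        return tags.contains(elementName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> tags = parseTagList("script;style;iframe");
        System.out.println(tags);
        System.out.println(isIgnorable(tags, "SCRIPT"));
        System.out.println(isIgnorable(tags, "p"));
    }
}
```

A set built this way has the type that setIgnorableTags(Set<String> newTags) accepts, so a caller wanting to ignore additional tags could extend the list before passing it in.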

public Boolean supportsRepositioning()
supportsRepositioning in class DocumentFormat

public void unpackMarkup(Document doc) throws DocumentFormatException
unpackMarkup in class TextualDocumentFormat
Throws: DocumentFormatException

public void unpackMarkup(Document doc, RepositioningInfo repInfo, RepositioningInfo ampCodingInfo) throws DocumentFormatException
unpackMarkup in class TextualDocumentFormat
Parameters: doc - The GATE document you want to parse. If doc.getSourceUrl() returns null, then the content of doc will be parsed. Using a URL is recommended because the parser will report errors correctly if the document is not well formed.
Throws: DocumentFormatException

public Resource init() throws ResourceInstantiationException
init in interface Resource
init in class TextualDocumentFormat
Throws: ResourceInstantiationException

Copyright © 2024 GATE. All rights reserved.