AnnotatedPlainTextDocument {NLP} | R Documentation |
Create annotated plain text documents from plain text and collections of annotations for this text.
AnnotatedPlainTextDocument(s, annotations, meta = list())
annotations(x)
s |
a |
annotations |
an |
meta |
a named or empty list of document metadata tag-value pairs. |
x |
an object inheriting from class "AnnotatedPlainTextDocument". |
Annotated plain text documents combine plain text with collections (“sets”, implemented as lists) of objects with annotations for the text.
A typical workflow is to use annotate() with suitable annotator pipelines to obtain the annotations, and then use AnnotatedPlainTextDocument() to combine these with the text being annotated. This yields an object inheriting from "AnnotatedPlainTextDocument" and "TextDocument", from which the text and collection of annotations can be obtained using, respectively, as.character() and annotations().
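The workflow above can be sketched with the simple annotator generators that ship with package NLP, so that no external pipeline is needed. This is only an illustration under stated assumptions: simple_sent_tokenizer() is a hypothetical helper written for this example (it naively treats every period as a sentence end), and the choice of wordpunct_tokenizer() for words is arbitrary.

```r
library(NLP)

s <- as.String(paste("Stanford University is located in California.",
                     "It is a great university."))

## Hypothetical helper (not part of NLP): a naive sentence tokenizer that
## returns the spans of runs ending in a period.
simple_sent_tokenizer <- function(s) {
    s <- as.String(s)
    m <- gregexpr("[^[:space:]][^.]*\\.", s)[[1L]]
    Span(m, m + attr(m, "match.length") - 1L)
}

## Annotator pipeline: sentence tokens first, then word tokens within them.
pipeline <- list(Simple_Sent_Token_Annotator(simple_sent_tokenizer),
                 Simple_Word_Token_Annotator(wordpunct_tokenizer))

## Obtain the annotations and combine them with the annotated text.
a <- annotate(s, pipeline)
doc <- AnnotatedPlainTextDocument(s, a)

## The text, and structured views based on the annotations.
as.character(doc)
sents(doc)
words(doc)
```

A real application would typically replace the two simple annotators by a proper pipeline, such as the 'StanfordCoreNLP' one shown in the examples.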
There are methods for generics words(), sents(), paras(), tagged_words(), tagged_sents(), tagged_paras(), chunked_sents(), parsed_sents() and parsed_paras() and class "AnnotatedPlainTextDocument" providing structured views of the text in such documents. These methods all accept an additional argument called which, specifying the annotation object to use (by default, the first one is taken), and they require the necessary annotations to be available in the annotation object used.
The methods for generics tagged_words(), tagged_sents() and tagged_paras() provide a mechanism for mapping POS tags via the map argument, see section Details in the help page for tagged_words() for more information. The POS tagset used will be inferred from the POS_tagset metadata element of the annotation object used.
For AnnotatedPlainTextDocument(), an object inheriting from "AnnotatedPlainTextDocument" and "TextDocument".
For annotations(), a list of Annotation objects.
TextDocument for basic information on the text document infrastructure employed by package NLP.
## Use a pre-built annotated plain text document obtained by employing an
## annotator pipeline from package 'StanfordCoreNLP', available from the
## repository at <http://datacube.wu.ac.at>, using the following code:
##   require("StanfordCoreNLP")
##   s <- paste("Stanford University is located in California.",
##              "It is a great university.")
##   p <- StanfordCoreNLP_Pipeline(c("pos", "lemma", "parse"))
##   doc <- AnnotatedPlainTextDocument(s, p(s))

doc <- readRDS(system.file("texts", "stanford.rds", package = "NLP"))

doc

## Extract available annotation:
a <- annotations(doc)[[1L]]
a

## Structured views:
sents(doc)
tagged_sents(doc)
tagged_sents(doc, map = Universal_POS_tags_map)
parsed_sents(doc)

## Add (trivial) paragraph annotation:
s <- as.character(doc)
a <- annotate(s, Simple_Para_Token_Annotator(blankline_tokenizer), a)
doc <- AnnotatedPlainTextDocument(s, a)
## Structured view:
paras(doc)