Many existing text mining datasets are in the form of a DocumentTermMatrix class (from the tm package). For example, consider the corpus of 2246 Associated Press articles from the topicmodels package:
library(tm)
data("AssociatedPress", package = "topicmodels")
AssociatedPress
## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity : 99%
## Maximal term length: 18
## Weighting : term frequency (tf)
If we want to analyze this with tidy tools, we need to turn it into a one-term-per-document-per-row data frame first. The tidy function does this. (For more on the tidy verb, see the broom package.)
library(dplyr)
library(tidytext)
ap_td <- tidy(AssociatedPress)
Having the text in this one-term-per-row format is convenient for analysis with the tidytext package. For example, we can perform sentiment analysis on these newspaper articles:
ap_sentiments <- ap_td %>%
inner_join(get_sentiments("bing"), by = c(term = "word"))
ap_sentiments
## # A tibble: 30,094 × 4
## document term count sentiment
## <int> <chr> <dbl> <chr>
## 1 1 assault 1 negative
## 2 1 complex 1 negative
## 3 1 death 1 negative
## 4 1 died 1 negative
## 5 1 good 2 positive
## 6 1 illness 1 negative
## 7 1 killed 2 negative
## 8 1 like 2 positive
## 9 1 liked 1 positive
## 10 1 miracle 1 positive
## # ... with 30,084 more rows
We can find the most negative documents:
library(tidyr)
ap_sentiments %>%
count(document, sentiment, wt = count) %>%
ungroup() %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative) %>%
arrange(sentiment)
## # A tibble: 2,190 × 4
## document negative positive sentiment
## <int> <dbl> <dbl> <dbl>
## 1 1251 54 6 -48
## 2 1380 53 5 -48
## 3 531 51 9 -42
## 4 43 45 11 -34
## 5 1263 44 10 -34
## 6 2178 40 6 -34
## 7 334 45 12 -33
## 8 1664 38 5 -33
## 9 2147 47 14 -33
## 10 516 38 6 -32
## # ... with 2,180 more rows
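Sorting the same summary in the opposite direction surfaces the most positive documents; a minimal sketch:
# same pipeline as above, arranged in descending order of net sentiment
ap_sentiments %>%
count(document, sentiment, wt = count) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative) %>%
arrange(desc(sentiment))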
Or visualize which words contributed to positive and negative sentiment:
library(ggplot2)
ap_sentiments %>%
count(sentiment, term, wt = count) %>%
ungroup() %>%
filter(n >= 150) %>%
mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
mutate(term = reorder(term, n)) %>%
ggplot(aes(term, n, fill = sentiment)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab("Contribution to sentiment")
Note that a tidier is also available for the dfm class from the quanteda package:
data("inaugCorpus", package = "quanteda")
d <- quanteda::dfm(inaugCorpus)
d
## Document-feature matrix of: 57 documents, 9,215 features.
tidy(d)
## # A tibble: 43,719 × 3
## document term count
## <chr> <chr> <dbl>
## 1 1789-Washington fellow-citizens 1
## 2 1797-Adams fellow-citizens 3
## 3 1801-Jefferson fellow-citizens 2
## 4 1809-Madison fellow-citizens 1
## 5 1813-Madison fellow-citizens 1
## 6 1817-Monroe fellow-citizens 5
## 7 1821-Monroe fellow-citizens 1
## 8 1841-Harrison fellow-citizens 11
## 9 1845-Polk fellow-citizens 1
## 10 1849-Taylor fellow-citizens 1
## # ... with 43,709 more rows
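This tidied table can be manipulated with the same tidy tools. For instance, a small sketch that finds each address's most frequent term (ties included):
# the most common term in each inaugural address
tidy(d) %>%
group_by(document) %>%
top_n(1, count) %>%
ungroup() %>%
arrange(document)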
Some existing text mining tools or algorithms work only on sparse document-term matrices. Therefore, tidytext provides cast_ verbs for converting from a tidy form to these matrices.
ap_td
## # A tibble: 302,031 × 3
## document term count
## <int> <chr> <dbl>
## 1 1 adding 1
## 2 1 adult 2
## 3 1 ago 1
## 4 1 alcohol 1
## 5 1 allegedly 1
## 6 1 allen 1
## 7 1 apparently 2
## 8 1 appeared 1
## 9 1 arrested 1
## 10 1 assault 1
## # ... with 302,021 more rows
# cast into a Document-Term Matrix
ap_td %>%
cast_dtm(document, term, count)
## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity : 99%
## Maximal term length: 18
## Weighting : term frequency (tf)
# cast into a Term-Document Matrix
ap_td %>%
cast_tdm(term, document, count)
## <<TermDocumentMatrix (terms: 10473, documents: 2246)>>
## Non-/sparse entries: 302031/23220327
## Sparsity : 99%
## Maximal term length: 18
## Weighting : term frequency (tf)
# cast into quanteda's dfm (documents as rows, terms as features)
ap_td %>%
cast_dfm(document, term, count)
## Document-feature matrix of: 2,246 documents, 10,473 features.
# cast into a Matrix object
m <- ap_td %>%
cast_sparse(document, term, count)
class(m)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
dim(m)
## [1] 2246 10473
This allows reading, filtering, and processing to be done with dplyr and other tidy tools, after which the data can be converted back into a document-term matrix for machine learning applications.
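As a minimal sketch of that round trip (the stop-word filtering and the name ap_filtered_dtm are illustrative choices, not output shown above), we might clean the tidy table before casting it back:
# drop stop words in tidy form, then cast back to a DocumentTermMatrix;
# the result could be handed to modeling functions such as topicmodels::LDA()
ap_filtered_dtm <- ap_td %>%
anti_join(stop_words, by = c(term = "word")) %>%
cast_dtm(document, term, count)
ap_filtered_dtm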
You can also tidy Corpus objects from the tm package. For example, consider the crude corpus of 20 Reuters articles about crude oil that ships with tm:
reut21578 <- system.file("texts", "crude", package = "tm")
reuters <- VCorpus(DirSource(reut21578),
readerControl = list(reader = readReut21578XMLasPlain))
reuters
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 20
The tidy verb creates a table with one row per document:
reuters_td <- tidy(reuters)
reuters_td
## # A tibble: 20 × 17
## author datetimestamp description
## <chr> <dttm> <chr>
## 1 <NA> 1987-02-26 10:00:56
## 2 BY TED D'AFFLISIO, Reuters 1987-02-26 10:34:11
## 3 <NA> 1987-02-26 11:18:00
## 4 <NA> 1987-02-26 11:21:01
## 5 <NA> 1987-02-26 12:00:57
## 6 <NA> 1987-02-28 20:25:46
## 7 By Jeremy Clift, Reuters 1987-02-28 20:39:14
## 8 <NA> 1987-02-28 22:27:27
## 9 <NA> 1987-03-01 01:22:30
## 10 <NA> 1987-03-01 11:31:44
## 11 <NA> 1987-03-01 18:05:49
## 12 <NA> 1987-03-02 00:39:23
## 13 <NA> 1987-03-02 00:43:22
## 14 <NA> 1987-03-02 00:43:41
## 15 <NA> 1987-03-02 01:25:42
## 16 <NA> 1987-03-02 04:20:05
## 17 <NA> 1987-03-02 04:28:26
## 18 <NA> 1987-03-02 05:13:46
## 19 By BERNICE NAPACH, Reuters 1987-03-02 07:38:34
## 20 <NA> 1987-03-02 07:49:06
## # ... with 14 more variables: heading <chr>, id <chr>, language <chr>,
## # origin <chr>, topics <chr>, lewissplit <chr>, cgisplit <chr>,
## # oldid <chr>, topics_cat <list>, places <list>, people <chr>,
## # orgs <chr>, exchanges <chr>, text <chr>
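Because the text is now just another column, it can be tokenized while keeping whichever metadata columns are useful; a small sketch (the choice of id and heading is arbitrary):
# one row per word, carrying the document id and headline along
reuters_td %>%
select(id, heading, text) %>%
unnest_tokens(word, text) %>%
anti_join(stop_words, by = "word")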
Similarly, you can tidy a corpus object from the quanteda package:
library(quanteda)
data("inaugCorpus")
inaugCorpus
## Corpus consisting of 57 documents and 3 docvars.
inaug_td <- tidy(inaugCorpus)
inaug_td
## # A tibble: 57 × 4
## text
## * <chr>
## 1 Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong t
## 2 Fellow citizens, I am again called upon by the voice of my country to execu
## 3 When it was first perceived, in early times, that no middle course for Amer
## 4 Friends and Fellow Citizens:\n\nCalled upon to undertake the duties of the fi
## 5 Proceeding, fellow citizens, to that qualification which the Constitution r
## 6 Unwilling to depart from examples of the most revered authority, I avail my
## 7 About to add the solemnity of an oath to the obligations imposed by a secon
## 8 I should be destitute of feeling if I was not deeply affected by the strong
## 9 Fellow citizens, I shall not attempt to describe the grateful emotions whic
## 10 In compliance with an usage coeval with the existence of our Federal Consti
## # ... with 47 more rows, and 3 more variables: Year <int>,
## # President <chr>, FirstName <chr>
This lets us work with tidy tools like unnest_tokens to analyze the text alongside the metadata.
inaug_words <- inaug_td %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
inaug_words
## # A tibble: 49,621 × 4
## Year President FirstName word
## <int> <chr> <chr> <chr>
## 1 2013 Obama Barack waves
## 2 2013 Obama Barack realizes
## 3 2013 Obama Barack philadelphia
## 4 2013 Obama Barack 400
## 5 2013 Obama Barack 40
## 6 2013 Obama Barack absolutism
## 7 2013 Obama Barack contour
## 8 2013 Obama Barack newtown
## 9 2013 Obama Barack lanes
## 10 2013 Obama Barack appalachia
## # ... with 49,611 more rows
We could then, for example, see how the frequency of each word changes over time:
inaug_freq <- inaug_words %>%
count(Year, word) %>%
ungroup() %>%
complete(Year, word, fill = list(n = 0)) %>%
group_by(Year) %>%
mutate(year_total = sum(n),
percent = n / year_total) %>%
ungroup()
inaug_freq
## # A tibble: 490,200 × 5
## Year word n year_total percent
## <int> <chr> <dbl> <dbl> <dbl>
## 1 1789 1 0 529 0.000000000
## 2 1789 1,000 0 529 0.000000000
## 3 1789 100 0 529 0.000000000
## 4 1789 100,000,000 0 529 0.000000000
## 5 1789 120,000,000 0 529 0.000000000
## 6 1789 125 0 529 0.000000000
## 7 1789 13 0 529 0.000000000
## 8 1789 14th 1 529 0.001890359
## 9 1789 15th 0 529 0.000000000
## 10 1789 16 0 529 0.000000000
## # ... with 490,190 more rows
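A quick sketch of that idea, tracing one word across the inaugural addresses (the choice of "freedom" is arbitrary):
# plot one word's share of each year's words
inaug_freq %>%
filter(word == "freedom") %>%
ggplot(aes(Year, percent)) +
geom_point() +
geom_line()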
We can then use the broom package to fit a logistic regression for each word, modeling its share of each year's words as a function of Year:
models <- inaug_freq %>%
group_by(word) %>%
filter(sum(n) > 50) %>%
do(tidy(glm(cbind(n, year_total - n) ~ Year, .,
family = "binomial"))) %>%
ungroup() %>%
filter(term == "Year")
models
## # A tibble: 113 × 6
## word term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 act Year 0.006894234 0.002191596 3.1457591 1.656564e-03
## 2 action Year 0.001634417 0.001959204 0.8342250 4.041542e-01
## 3 administration Year -0.006979577 0.001882474 -3.7076616 2.091819e-04
## 4 america Year 0.018890081 0.001584306 11.9232506 8.954525e-33
## 5 american Year 0.007084142 0.001321897 5.3590709 8.365105e-08
## 6 americans Year 0.032657656 0.003659114 8.9250184 4.456252e-19
## 7 authority Year -0.005640373 0.002336159 -2.4143787 1.576207e-02
## 8 business Year 0.003745929 0.002016455 1.8576801 6.321445e-02
## 9 called Year -0.001935068 0.002088388 -0.9265844 3.541423e-01
## 10 century Year 0.016480566 0.002495844 6.6032027 4.023687e-11
## # ... with 103 more rows
models %>%
arrange(desc(abs(estimate)))
## # A tibble: 113 × 6
## word term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 americans Year 0.03265766 0.003659114 8.925018 4.456252e-19
## 2 america Year 0.01889008 0.001584306 11.923251 8.954525e-33
## 3 century Year 0.01648057 0.002495844 6.603203 4.023687e-11
## 4 live Year 0.01448914 0.002490610 5.817506 5.973212e-09
## 5 democracy Year 0.01432438 0.002394738 5.981606 2.209489e-09
## 6 god Year 0.01402582 0.001921362 7.299935 2.879058e-13
## 7 freedom Year 0.01366336 0.001320242 10.349129 4.223092e-25
## 8 foreign Year -0.01364998 0.002058045 -6.632497 3.300543e-11
## 9 earth Year 0.01303351 0.002291996 5.686532 1.296449e-08
## 10 world Year 0.01233715 0.001000739 12.328042 6.398240e-35
## # ... with 103 more rows
You can show these models as a volcano plot, comparing each word's effect size with its statistical significance:
library(ggplot2)
models %>%
mutate(adjusted.p.value = p.adjust(p.value)) %>%
ggplot(aes(estimate, adjusted.p.value)) +
geom_point() +
scale_y_log10() +
geom_text(aes(label = word), vjust = 1, hjust = 1,
check_overlap = TRUE) +
xlab("Estimated change over time") +
ylab("Adjusted p-value")
We can also use the ggplot2 package to display the 6 terms whose frequency has changed most over time:
library(scales)
models %>%
top_n(6, abs(estimate)) %>%
inner_join(inaug_freq) %>%
ggplot(aes(Year, percent)) +
geom_point() +
geom_smooth() +
facet_wrap(~ word) +
scale_y_continuous(labels = percent_format()) +
ylab("Frequency of word in speech")