BTM {BTM}    R Documentation
Description:

The Biterm Topic Model (BTM) is a word co-occurrence based topic model that learns topics by modeling word-word co-occurrence patterns (i.e., biterms). A biterm consists of two words co-occurring in the same context, for example, in the same short text window. BTM models the biterm occurrences in a corpus (unlike LDA models which model the word occurrences in a document).
It is a generative model. In the generation procedure, a biterm is generated by drawing two words independently from the same topic z. In other words, the distribution of a biterm b = (wi, wj) is defined as P(b) = ∑_z P(wi|z) * P(wj|z) * P(z), where the sum runs over the k topics you want to extract.
Estimation of the topic model is done with the Gibbs sampling algorithm, which provides estimates for P(w|z) = phi and P(z) = theta.
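For illustration, the sketch below scores a single biterm under this formula, using the phi and theta estimates described under Value. This is a minimal sketch: the helper name biterm_probability and the commented example call are hypothetical and not part of the package API.

## Probability of a biterm b = (wi, wj) under a fitted BTM model, computed
## from phi (a W x K matrix of P(w|z), rownames = tokens) and theta (P(z)).
## Hypothetical helper for illustration only.
biterm_probability <- function(model, wi, wj) {
  sum(model$phi[wi, ] * model$phi[wj, ] * model$theta)
}
## biterm_probability(model, "appartement", "centrum")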
Usage:

BTM(data, k = 5, alpha = 50/k, beta = 0.01, iter = 1000,
    window = 15, background = FALSE, trace = FALSE)
Arguments:

data: a tokenised data frame containing one row per token, with 2 columns: the first column is a context identifier (e.g. the document identifier) and the second column is the token. See the sketch after this list for a concrete example.
k: integer with the number of topics to identify.
alpha: numeric, indicating the symmetric Dirichlet prior probability of a topic P(z). Defaults to 50/k.
beta: numeric, indicating the symmetric Dirichlet prior probability of a word given the topic P(w|z). Defaults to 0.01.
iter: integer with the number of iterations of Gibbs sampling. Defaults to 1000.
window: integer with the window size for biterm extraction. Defaults to 15.
background: logical, if set to TRUE the first topic is set to a background topic that equals the empirical word distribution. Defaults to FALSE.
trace: logical indicating to print out the evolution of the Gibbs sampling iterations. Defaults to FALSE.
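To make the expected input concrete, the following toy example builds a tokenised data frame and fits a tiny model on it. This is a minimal sketch: the column names, tokens and parameter values are invented for illustration.

## Toy input: one row per token; first column identifies the context
## (here a document), second column holds the token. Values are invented.
toy <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2", "d2"),
  token  = c("rain", "umbrella", "rain", "coffee", "morning", "coffee"),
  stringsAsFactors = FALSE
)
## Fit a very small model; k and iter are kept tiny only to keep the run fast.
m <- BTM(toy, k = 2, iter = 5)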
Value:

an object of class BTM which is a list containing

model: a pointer to the C++ BTM model
K: the number of topics
W: the number of tokens in the data
alpha: the symmetric Dirichlet prior probability of a topic P(z)
beta: the symmetric Dirichlet prior probability of a word given the topic P(w|z)
iter: the number of iterations of Gibbs sampling
background: indicator if the first topic is set to the background topic that equals the empirical word distribution
theta: a vector with the topic probabilities P(z), determined by the overall proportions of biterms assigned to each topic
phi: a matrix of dimension W x K with one row for each token in the data, containing the probability of the token given the topic P(w|z); the rownames of the matrix indicate the token w
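These components can be inspected directly on the fitted object. The short sketch below assumes a model fitted as in the Examples section; it only reads the theta and phi estimates described above.

## Inspect a fitted model (assumes 'model' was created as in the Examples).
model$theta                                     # topic proportions P(z), length K
dim(model$phi)                                  # W x K matrix of P(w|z)
head(sort(model$phi[, 1], decreasing = TRUE))   # most probable tokens of topic 1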
Details:

A biterm is defined as a pair of words co-occurring in the same text window. If you have as an example a document with the sequence of words 'A B C B', and assuming the window size is set to 3, there are two text windows which can generate biterms, namely text window 'A B C' with biterms 'A B', 'B C', 'A C' and text window 'B C B' with biterms 'B C', 'C B', 'B B'. A biterm is an unordered word pair, where 'B C' = 'C B'. Thus, the document 'A B C B' will have the following biterm frequencies:

'A B': 1
'B C': 3
'A C': 1
'B B': 1

These biterms are used to create the model.
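To verify these counts, the plain-R sketch below slides a window of size 3 over the document and tallies the unordered word pairs. It mirrors the walkthrough above and makes no claim about the package's internal biterm extraction.

## Recompute the biterm frequencies of the document 'A B C B' with a
## window size of 3 (illustration only, not the package internals).
doc <- c("A", "B", "C", "B")
win <- 3
biterms <- character(0)
for (start in seq_len(length(doc) - win + 1)) {
  w <- doc[start:(start + win - 1)]
  ## all unordered word pairs inside the current window
  biterms <- c(biterms, combn(w, 2, FUN = function(p) paste(sort(p), collapse = " ")))
}
table(biterms)
## gives: 'A B': 1, 'A C': 1, 'B B': 1, 'B C': 3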
References:

Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng. A Biterm Topic Model for Short Texts. WWW 2013. https://github.com/xiaohuiyan/BTM, https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf
See Also:

predict.BTM, terms.BTM, logLik.BTM
Examples:

library(udpipe)
data("brussels_reviews_anno", package = "udpipe")
x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))
x <- x[, c("doc_id", "lemma")]
model <- BTM(x, k = 5, alpha = 1, beta = 0.01, iter = 10, trace = TRUE)
model
terms(model)
scores <- predict(model, newdata = x)

## Another small run with the first topic set to the background word distribution
set.seed(123456)
model <- BTM(x, k = 5, beta = 0.01, iter = 10, background = TRUE)
model
terms(model)