BTM {BTM}    R Documentation

Construct a Biterm Topic Model on Short Text

Description

The Biterm Topic Model (BTM) is a word co-occurrence based topic model that learns topics by modeling word-word co-occurrence patterns (i.e., biterms).

Usage

BTM(data, k = 5, alpha = 50/k, beta = 0.01, iter = 1000, window = 15,
  background = FALSE, trace = FALSE)

Arguments

data

a tokenised data frame containing one row per token with 2 columns

  • the first column is a context identifier (e.g. a tweet id, a document id, a sentence id, an identifier of a survey answer, an identifier of a part of a text)

  • the second column is a column of type character containing the sequence of words occurring within the context identifier
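
As an illustration of the expected layout, here is a minimal base R sketch that turns raw text into such a two-column data frame. The toy documents, the column names doc_id and token, and the whitespace tokenisation are assumptions made for this example; any tokeniser that yields one row per token would serve equally well.

```r
# Toy input: two short documents (made-up text, for illustration only)
docs <- data.frame(
  doc_id = c("d1", "d2"),
  text   = c("cheap flights to brussels", "brussels hotel near station"),
  stringsAsFactors = FALSE
)

# Split each text on whitespace and stack the tokens, one row per token
tokens <- strsplit(docs$text, "\\s+")
x <- data.frame(
  doc_id = rep(docs$doc_id, lengths(tokens)),  # first column: context identifier
  token  = unlist(tokens),                     # second column: character tokens
  stringsAsFactors = FALSE
)
str(x)
```

The resulting x has the context identifier first and the character token column second, which is the column order described above.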

k

integer with the number of topics to identify. Defaults to 5.

alpha

numeric, indicating the symmetric Dirichlet prior probability of a topic P(z). Defaults to 50/k.

beta

numeric, indicating the symmetric Dirichlet prior probability of a word given the topic P(w|z). Defaults to 0.01.

iter

integer with the number of iterations of Gibbs sampling. Defaults to 1000.

window

integer with the window size for biterm extraction. Defaults to 15.

background

logical. If set to TRUE, the first topic is set to a background topic that is equal to the empirical word distribution. This can be used to filter out common words. Defaults to FALSE.
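
To make the term "empirical word distribution" concrete, the following base R sketch computes the relative frequency of each word in a toy token vector. The tokens are made up, and this only illustrates the concept; it is not a description of how the package computes the background topic internally.

```r
# Toy token vector (made-up, for illustration only)
tokens <- c("hotel", "room", "hotel", "station", "hotel", "room")

# Relative frequency of each word over all tokens
word_dist <- prop.table(table(tokens))
print(sort(word_dist, decreasing = TRUE))
```

Words that dominate this distribution are the common words that a background topic would be expected to absorb.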

trace

logical indicating whether to print out the progress of the Gibbs sampling iterations. Defaults to FALSE.

Value

an object of class BTM which is a list containing

Note

A biterm is defined as a pair of words co-occurring in the same text window. As an example, take a document with the sequence of words 'A B C B' and a window size of 3. There are then two text windows which can generate biterms: text window 'A B C' with biterms 'A B', 'B C', 'A C', and text window 'B C B' with biterms 'B C', 'C B', 'B B'. A biterm is an unordered word pair, so 'B C' = 'C B'. Thus, the document 'A B C B' will have the following biterm frequencies:

  • 'A B': 1

  • 'B C': 3

  • 'A C': 1

  • 'B B': 1

These biterms are used to create the model.
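
The window-based counting described in this note can be sketched in base R as follows. The function count_biterms is a hypothetical helper written for this illustration and is not part of the BTM package; it follows the wording of the note literally and may differ from the package's internal implementation.

```r
# Hypothetical helper, not part of the BTM package: count unordered biterms
# over all sliding windows of a token sequence.
count_biterms <- function(tokens, window = 3) {
  n <- length(tokens)
  if (n < 2) return(table(character(0)))
  # One window per start position; a single window if the text is short
  starts <- seq_len(max(1, n - window + 1))
  pairs <- unlist(lapply(starts, function(s) {
    w <- tokens[s:min(n, s + window - 1)]
    if (length(w) < 2) return(character(0))
    idx <- utils::combn(length(w), 2)
    # Sort each pair so that 'C B' and 'B C' count as the same biterm
    apply(idx, 2, function(ij) paste(sort(w[ij]), collapse = " "))
  }))
  table(pairs)
}

freq <- count_biterms(c("A", "B", "C", "B"), window = 3)
print(freq)
```

For the document 'A B C B' with a window of 3, this counts six biterms in total across the two windows, with 'B C' occurring three times.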

References

Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng. A Biterm Topic Model For Short Text. WWW2013, https://github.com/xiaohuiyan/BTM, https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf

See Also

predict.BTM, terms.BTM, logLik.BTM

Examples

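library(BTM)
library(udpipe)
## Dutch reviews from the udpipe example data, restricted to nouns
data("brussels_reviews_anno", package = "udpipe")
x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))
## Keep the context identifier and the token column, in that order
x <- x[, c("doc_id", "lemma")]
## Small run with only 10 Gibbs iterations to keep the example fast
set.seed(123456)
model  <- BTM(x, k = 5, alpha = 1, beta = 0.01, iter = 10, trace = TRUE)
model
terms(model)
## Score the documents against the fitted topics
scores <- predict(model, newdata = x)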

## Another small run with first topic the background word distribution
set.seed(123456)
model  <- BTM(x, k = 5, beta = 0.01, iter = 10, background = TRUE)
model
terms(model)

[Package BTM version 0.2.1]