madlib.lda {PivotalR} | R Documentation |
This function is a wrapper for MADlib's Latent Dirichlet Allocation. The computation is parallelized by MADlib if the connected database is distributed. Please refer to MADlib documentation for details of the algorithm implementation [1].
madlib.lda(data, topic_num, alpha, beta, iter_num = 20, nstart = 1, best = TRUE, ...)
data |
An object of |
topic_num |
Number of topics. |
alpha |
Dirichlet parameter for the per-doc topic multinomial. |
beta |
Dirichlet parameter for the per-topic word multinomial. |
iter_num |
Number of iterations. |
nstart |
Number of repeated random starts. |
best |
If TRUE, only the model with the minimum perplexity is returned. If FALSE, all nstart models are returned as a list. |
... |
Other optional parameters. Not implemented. |
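The interaction between nstart and best can be pictured with the minimal sketch below. It is not PivotalR's internal code and does not touch a database: mock_fit and select_models are hypothetical helpers, and the perplexity values are random placeholders. In real use, each element would be an lda.madlib object and its perplexity would come from perplexity.lda.madlib.

```r
# Hypothetical stand-in for one LDA run: returns a list carrying a
# placeholder perplexity. A real run would return an lda.madlib object.
mock_fit <- function(seed) {
  set.seed(seed)
  list(seed = seed, perplexity = runif(1, min = 50, max = 100))
}

# Hypothetical selection step mirroring the documented meaning of `best`:
# best = TRUE  -> keep only the fit with the minimum perplexity
# best = FALSE -> keep every fit, as a list
select_models <- function(fits, best = TRUE) {
  if (!best) {
    return(fits)
  }
  perplexities <- vapply(fits, function(f) f$perplexity, numeric(1))
  fits[[which.min(perplexities)]]
}

# Three random starts, analogous to nstart = 3
fits <- lapply(1:3, mock_fit)

chosen <- select_models(fits, best = TRUE)      # a single fit
all_models <- select_models(fits, best = FALSE) # a list of three fits
```

With best = FALSE the result must be indexed (for example all_models[[1]]) before it is passed to anything that expects a single model.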
An lda.madlib object, or a list of such objects when more than one model is returned. Each lda.madlib object is a list that contains the following items:
assignments |
The per-document topic assignments. |
document_sums |
The per-document topic counts. |
model_table |
The |
output_table |
The |
tf_table |
The |
topic_sums |
The per-topic sum of assignments. |
topics |
The per-word association with topics. |
Author: Predictive Analytics Team at Pivotal Inc.
Maintainer: Frank McQuillan, Pivotal Inc. fmcquillan@pivotal.io
[1] Documentation of LDA in the latest MADlib release, http://madlib.incubator.apache.org/docs/latest/group__grp__lda.html
predict.lda.madlib is used to label test documents (prediction) using a learned lda.madlib model.
perplexity.lda.madlib is used for computing the perplexity of a learned lda.madlib model.
## Not run: 
## set up the database connection
## Assume that .port is port number and .dbname is the database name
cid <- db.connect(port = .port, dbname = .dbname, verbose = FALSE)

dat <- db.data.frame("__madlib_pivotalr_lda_data__", conn.id = cid, verbose = FALSE)

output.db <- madlib.lda(dat, 2, 0.1, 0.1, 50)
perplexity.db <- perplexity.lda.madlib(output.db)
print(perplexity.db)

## Run LDA multiple times and get the best one
output.db <- madlib.lda(dat, 2, 0.1, 0.1, 50, nstart = 2)
perplexity.db <- perplexity.lda.madlib(output.db)
print(perplexity.db)

## Run LDA multiple times and keep all models
output.db <- madlib.lda(dat, 2, 0.1, 0.1, 50, nstart = 2, best = FALSE)
perplexity.db <- perplexity.lda.madlib(output.db[[1]])
print(perplexity.db)
perplexity.db <- perplexity.lda.madlib(output.db[[2]])
print(perplexity.db)

db.disconnect(cid)

## End(Not run)