ml_gradient_boosted_trees {sparklyr} | R Documentation |
Perform regression or classification using gradient-boosted trees.
ml_gradient_boosted_trees(x, response, features,
  impurity = c("auto", "gini", "entropy", "variance"),
  loss.type = c("auto", "logistic", "squared", "absolute"),
  max.bins = 32L, max.depth = 5L, num.trees = 20L, min.info.gain = 0,
  min.rows = 1L, learn.rate = 0.1, sample.rate = 1,
  type = c("auto", "regression", "classification"), thresholds = NULL,
  seed = NULL, checkpoint.interval = 10L, cache.node.ids = FALSE,
  max.memory = 256L, ml.options = ml_options(), ...)
x |
An object coercible to a Spark DataFrame (typically, a
|
response |
The name of the response vector (as a length-one character
vector), or a formula, giving a symbolic description of the model to be
fitted. When |
features |
The names of the features (terms) to use for the model fit. |
impurity |
Criterion used for the information gain calculation. One of 'auto', 'gini', 'entropy', or 'variance'. 'auto' defaults to 'gini' for classification and 'variance' for regression. |
loss.type |
Loss function which the algorithm tries to minimize. Defaults to |
max.bins |
The maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity. |
max.depth |
Maximum depth of the tree (>= 0); that is, the maximum number of nodes separating any leaves from the root of the tree. |
num.trees |
Number of trees to train (>= 1). Defaults to 20. |
min.info.gain |
Minimum information gain for a split to be considered at a tree node. Must be >= 0. Defaults to 0. |
min.rows |
Minimum number of instances each child must have after split. |
learn.rate |
The learning rate or step size. Defaults to 0.1. |
sample.rate |
Fraction of the training data used for learning each decision tree. Defaults to 1.0. |
type |
The type of model to fit. |
thresholds |
Thresholds in multi-class classification to adjust the probability of predicting each class. The vector must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value of p/t is predicted, where p is the original probability of that class and t is the class's threshold. |
seed |
Seed for random numbers. |
checkpoint.interval |
Checkpoint interval (>= 1), or -1 to disable checkpointing. For example, 10 means that the cache is checkpointed every 10 iterations. Defaults to 10. |
cache.node.ids |
If |
max.memory |
Maximum memory, in MB, allocated to histogram aggregation. If this is too small, then one node will be split per iteration, and its aggregates may exceed this size. Defaults to 256. |
ml.options |
Optional arguments, used to affect the model generated. See
|
... |
Optional arguments. The |
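The p/t rule described for thresholds can be worked through by hand for a single observation. The sketch below uses base R only; the class labels, probabilities, and threshold values are hypothetical and chosen purely to show how a threshold below 1 can shift the predicted class. It illustrates the arithmetic of the rule, not the internals of the Spark implementation.

```r
# Hypothetical class probabilities for one observation (they sum to 1).
p <- c(A = 0.5, B = 0.3, C = 0.2)

# Hypothetical thresholds, one per class. A smaller threshold
# inflates p/t for that class and so makes it more likely to be chosen.
thresholds <- c(A = 1.0, B = 0.4, C = 1.0)

# Scale each probability by its threshold and pick the largest ratio.
scaled <- p / thresholds
predicted <- names(which.max(scaled))

# Class with the highest raw probability, for comparison.
unadjusted <- names(which.max(p))

print(scaled)
print(predicted)
print(unadjusted)
```

With these made-up values, class B has the largest ratio (0.3 / 0.4) even though class A has the highest raw probability, so the thresholded prediction differs from the unadjusted one.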
Other Spark ML routines: ml_als_factorization, ml_decision_tree, ml_generalized_linear_regression, ml_kmeans, ml_lda, ml_linear_regression, ml_logistic_regression, ml_multilayer_perceptron, ml_naive_bayes, ml_one_vs_rest, ml_pca, ml_random_forest, ml_survival_regression
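
A minimal sketch of a regression fit follows. It assumes a local Spark installation that sparklyr can connect to and a sparklyr version exposing the signature documented on this page (dot-separated argument names). The choice of the built-in mtcars data set, the selected columns, and the tuning values are illustrative assumptions rather than recommendations.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available to sparklyr.
sc <- spark_connect(master = "local")

# Copy a small built-in data set into Spark for illustration.
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# Regression fit; argument names follow the usage shown on this page.
fit <- ml_gradient_boosted_trees(
  mtcars_tbl,
  response = "mpg",
  features = c("wt", "cyl", "hp"),
  type = "regression",
  num.trees = 50L,   # more trees than the default of 20
  max.depth = 3L,    # shallower trees than the default of 5
  learn.rate = 0.05, # smaller step size than the default of 0.1
  seed = 42L         # fixed seed so repeated runs are comparable
)

print(fit)

spark_disconnect(sc)
```

Lowering learn.rate while raising num.trees is a common way to trade training time for a smoother fit; whether it helps on a given data set has to be checked empirically.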