ml_prepare_dataframe {sparklyr} | R Documentation |
This routine prepares a Spark DataFrame for use by Spark ML routines.
ml_prepare_dataframe(x, features, response = NULL, ..., ml.options = ml_options(), envir = new.env(parent = emptyenv()))
x |
An object coercible to a Spark DataFrame (typically, a
|
features |
The names of the features (terms) to use for the model fit. |
response |
The name of the response vector (as a length-one character
vector), or a formula, giving a symbolic description of the model to be
fitted. When |
... |
Optional arguments. The |
ml.options |
Optional arguments, used to affect the model generated. See
|
envir |
An R environment. When supplied, it is filled with metadata describing the transformations that have taken place. |
Spark DataFrames are prepared through the following transformations:
1. All specified columns are transformed into a numeric data type (using a simple cast for integer / logical columns, and ft_string_indexer for strings).
2. The ft_vector_assembler is used to combine the specified features into a single 'feature' vector, suitable for use with Spark ML routines.
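The two transformations can be pictured with a small local analogy in base R. This is only a sketch on an ordinary data.frame, not the Spark code path: the column names and data are made up, and the label ordering (most frequent label first, ties broken alphabetically) is an assumption chosen for illustration.

```r
# Local, base-R analogy of the two preparation steps (not the Spark code path).
df <- data.frame(
  age    = c(23L, 35L, 41L),           # integer column
  smoker = c(TRUE, FALSE, TRUE),       # logical column
  region = c("west", "east", "west"),  # string column
  stringsAsFactors = FALSE
)

# Step 1: make every column numeric.
to_numeric <- function(col) {
  if (is.character(col)) {
    # Assumption: labels are ordered by descending frequency, ties broken
    # alphabetically; indices start at 0, following the [0:n) convention.
    counts <- table(col)
    labels <- names(counts)[order(-as.integer(counts), names(counts))]
    as.numeric(match(col, labels) - 1L)
  } else {
    as.numeric(col)  # simple cast for integer / logical columns
  }
}
numeric_df <- as.data.frame(lapply(df, to_numeric))

# Step 2: combine the chosen features into one numeric row per observation.
feature_names <- c("age", "smoker", "region")
features <- as.matrix(numeric_df[, feature_names])
```

Each row of the resulting matrix plays the role of the assembled 'feature' vector for one observation.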
After calling this function, the envir environment (when supplied) will be populated with a set of variables:
features : | The name of the generated features vector. |
response : | The name of the generated response vector. |
labels : | When the response column is a string vector, the ft_string_indexer is used to transform the vector into a [0:n) numeric vector. The ordered labels are injected here to allow for easier mapping from the [0:n) values back to the original label. |
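Mapping the [0:n) values back to the original labels might look like the sketch below. The environment is filled in by hand here as a stand-in, the label values and predicted indices are invented, and it assumes labels is held as an ordered character vector.

```r
# Hypothetical stand-in for an environment populated by ml_prepare_dataframe.
envir <- new.env(parent = emptyenv())
envir$labels <- c("setosa", "versicolor", "virginica")  # made-up ordered labels

# Hypothetical model output expressed as indices in [0:n).
predicted_index <- c(2, 0, 1, 0)

# R vectors are 1-based, so shift the 0-based indices by one before subsetting.
predicted_label <- envir$labels[predicted_index + 1]
```

The only subtlety is the off-by-one shift between the 0-based indices and R's 1-based vector indexing.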
## Not run:
# example of how 'ml_prepare_dataframe' might be used to invoke
# Spark's LinearRegression routine from the 'ml' package
envir <- new.env(parent = emptyenv())
tdf <- ml_prepare_dataframe(df, features, response, envir = envir)

lr <- invoke_new(
  sc,
  "org.apache.spark.ml.regression.LinearRegression"
)

# use generated 'features', 'response' vector names in model fit
model <- lr %>%
  invoke("setFeaturesCol", envir$features) %>%
  invoke("setLabelCol", envir$response)

## End(Not run)