sqlrutils Automation R Package

Microsoft

09/03/2016

Contents:

Package Info

sqlrutils package provides a mechanism for R users to wrap their R script into a TSQL stored procedure, register that stored procedure with a database, and test it from an R development environment. It enables the users to:

  1. Generate a TSQL script needed for creating a stored procedure by embedding the R script as a parameter to sp_execute_external_script and setting other parameters appropriately.
  2. Capture the TSQL script as a string inside an R data structure
  3. Optionally, generate a .sql file
  4. Register the SP using the generated script from R development environment
  5. Execute registered stored procedure from R development environment and process output1
  6. Support common user scenarios for ETL, Training, and Scoring

API overview

Step by step guide for the users

Part I: Format your R script

  1. Wrap the code into a function.
  2. All of the variables that the function relies on should be defined either inside the function or come in as input parameters.
  3. The input parameters of the function will become the input parameters of the sql stored procedure, and thus should conform to the following type requirements:
    • Among the input parameters there can be at most 1 data frame.
    • The objects inside the data frame as well as all other input parameters of the function should be of the following types: POSIXct , numeric, character, integer, logical, raw.
    • If an input type is not one of the above, it needs to be serialized and passed into the function as “raw”. The function should take the necessary steps to unserialize it.
  4. The function should output one of the following:
    • Data frame containing objects of classes mentioned in 3(ii).
    • Named list, where the names are the names of the variables in the list (i.e. list(x = x)). The list can contain at most 1 data frame. The objects in the data frame as well as all other members of the list should have types specified in 3.
    • NULL

Part II: Make calls to our package

  1. If your function takes input parameters, for each one of them create the following objects:
    • For the input parameter that is a data frame create a InputData object
    • For all other input parameters create InputParameter objects
  2. If your function outputs a list:
    • For the variable in the list that is a data frame create a OutputData object
    • For all other members of the list create OutputParameter objects
    • If you function outputs a data frame directly, without first wrapping it into a list, you do not have create any of such objects and can simply skip the step. Same if your function returns NULL.
  3. Utilize one the StoredProcedure constructor to generate a query that will create a SQL stored procedure containing the R function
  4. Use registerStoredProcedure function to register the stored procedure with the database
  5. Before you execute the stored procedure: if the stored procedure has input parameters those need to be set before the function can be executed:
    • Call getInputParameters to get a list of input parameter objects
    • Set either $query or $value for each one of them (but not both!)
    • Use executeStoredProcedure to execute the stored procedure from the R development environment. If you took step 9 pass it the list of input parameter objects that you set.

Code Samples

ETL Example: reading from the database and writing back to the database

> etl1 <- function() {
+   # The query to get the data
+   qq <- "select top 10000 ArrDelay,CRSDepTime,DayOfWeek from AirlineDemoSmall"
+   # The connection string. For executeStoredProcedure to work, ODBC 3.8 driver is needed.
+   conStr <- paste("Driver={ODBC Driver 13 for SQL Server};Server=.;Database=RevoTestDB;",
+                   "Trusted_Connection=Yes;", sep = "")
+   # The data source - retrieves the data from the database
+   dsSqls <- RxSqlServerData(sqlQuery = qq, connectionString = conStr)
+   # The destination data source
+   dsSqls2 <- RxSqlServerData(table ="cleanData",  connectionString = conStr)
+   # A transformation function
+   transformFunc <- function(data) {
+     data$CRSDepHour <- as.integer(trunc(data$CRSDepTime))
+     return(data)
+   }
+   # The transformation variables
+   transformVars <- c("CRSDepTime")
+   # set the compute context
+   sqlCompute <- RxInSqlServer(numTasks = 4, connectionString = conStr)
+   rxSetComputeContext(sqlCompute)
+   # drop table if necessary
+   if (rxSqlServerTableExists("cleanData")) {
+     rxSqlServerDropTable("cleanData")
+   }
+   # perform the transformation
+   rxDataStep(inData = dsSqls,
+              outFile = dsSqls2,
+              transformFunc = transformFunc,
+              transformVars = transformVars,
+              overwrite = TRUE)
+   return(NULL)
+ }
> 
> # create the sql server stored procedure object
> etlSP1 <- StoredProcedure("etl1", "spETL_ds_to_ds",
+                       filePath = "C:\\Users\\user\\Documents",
+                       dbName ="RevoTestDB")
> # connection string necessary for registrations and execution
> # since we did not pass it to StoredProcedure
> conStr <- "Driver={ODBC Driver 13 for SQL Server};Server=.;Database=RevoTestDB;Trusted_Connection=Yes;"
> #register the stored procedure with the database
> registerStoredProcedure(etlSP1, conStr)
[1] TRUE
> #execute the stored procedure
> executeStoredProcedure(etlSP1, connectionString = conStr, verbose = TRUE)
exec spETL_ds_to_ds 
named list()

Training Example: reading from a data frame and writing to a data frame

> train1 <- function(in_df) {
+   factorLevels <- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")
+   in_df[,"DayOfWeek"] <- factor(in_df[,"DayOfWeek"], levels=factorLevels)
+   # The model formula
+   formula <- ArrDelay ~ CRSDepTime + DayOfWeek + CRSDepHour:DayOfWeek
+ 
+   # Train the model
+   mm <- rxLinMod(formula, data = in_df, transformFunc = NULL, transformVars = NULL)
+ 
+   # Store the model into the database
+   # rdata needs to be created beforehand
+   conStr <- paste0("Driver={ODBC Driver 13 for SQL Server};Server=.;",
+                    "Database=RevoTestDB;Trusted_Connection=Yes;")
+   out.table = "rdata"
+   # write the model to the table
+   ds = RxOdbcData(table = out.table, connectionString = conStr)
+ 
+   rxWriteObject(ds, "linmod.v1", mm, keyName = "key",
+                 valueName = "value")
+ 
+   # the model needs to be serialized before it can be added to the return list
+   mm <- memCompress(serialize(mm, connection = NULL), type="gzip")
+   return(data.frame(mm))
+ }
> 
> conStr <- "Driver={ODBC Driver 13 for SQL Server};Server=.;Database=RevoTestDB;Trusted_Connection=Yes;"
> # create  an InputData object for the input data frame in_df
> indata <- InputData("in_df",
+                defaultQuery = paste0("select top 10000 ArrDelay,CRSDepTime,",
+                                      "DayOfWeek,CRSDepHour from cleanData"))
> # create the sql server stored procedure object
> trainSP1 <- StoredProcedure('train1', "spTrain_df_to_df", indata,
+                        dbName = "RevoTestDB",
+                        connectionString = conStr,
+                        filePath = "C:\\Users\\user\\Documents")
> # spRegisterSp and executeStoredProcedure do not require a connection string since we
> # provided one when we created trainSP1
> registerStoredProcedure(trainSP1)
[1] TRUE
> model <- executeStoredProcedure(trainSP1, verbose = TRUE)
exec spTrain_df_to_df 
> result2v2 <- rxReadObject(model$data[1,1][[1]])
> result2v2
Call:
rxLinMod(formula = formula, data = in_df, transformFunc = NULL, 
    transformVars = NULL)

Linear Regression Results for: ArrDelay ~ CRSDepTime + DayOfWeek + CRSDepHour:DayOfWeek
Data: in_df
Dependent variable(s): ArrDelay
Total independent variables: 16 (Including number dropped: 7)
Number of valid observations: 9713
Number of missing observations: 287 
 
Coefficients:
                                      ArrDelay
(Intercept)                        -12.2606079
CRSDepTime                           0.8384946
DayOfWeek=Monday                     9.9722108
DayOfWeek=Tuesday                   10.7932985
DayOfWeek=Wednesday                  0.2561862
DayOfWeek=Thursday                     Dropped
DayOfWeek=Friday                       Dropped
DayOfWeek=Saturday                     Dropped
DayOfWeek=Sunday                       Dropped
CRSDepHour for DayOfWeek=Monday     -0.4449961
CRSDepHour for DayOfWeek=Tuesday    -0.6194943
CRSDepHour for DayOfWeek=Wednesday   1.0483718
CRSDepHour for DayOfWeek=Thursday    0.5470396
CRSDepHour for DayOfWeek=Friday        Dropped
CRSDepHour for DayOfWeek=Saturday      Dropped
CRSDepHour for DayOfWeek=Sunday        Dropped

Training Example: reading from a data frame and return a model as an output parameter

> train2 <- function(in_df) {
+     factorLevels <- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")
+   in_df[,"DayOfWeek"] <- factor(in_df[,"DayOfWeek"], levels=factorLevels)
+   # The model formula
+   formula <- ArrDelay ~ CRSDepTime + DayOfWeek + CRSDepHour:DayOfWeek
+   # Train the model
+   rxSetComputeContext("local")
+   mm <- rxLinMod(formula, data = in_df, transformFunc = NULL, transformVars = NULL)
+   # the model needs to be serialized before it can be added to the return list
+   mm <- memCompress(serialize(mm, connection = NULL),type="gzip")
+   return(list(mm = mm))
+ }
> # create  an InputData object for the input data frame in_df
> indata <- InputData(name = "in_df",
+                 query = paste0("select top 10000 ArrDelay,CRSDepTime,",
+                                "DayOfWeek,CRSDepHour from cleanData"))
> # create an OutputParameter object for the model inside the return list
> outModelParam <- OutputParameter("mm", "raw")
> trainSP2 <- StoredProcedure(train2, "spTrain_df_to_op", indata, outModelParam,
+                         filePath = "C:\\Users\\user\\Documents")
> conStr <- "Driver={ODBC Driver 13 for SQL Server};Server=.;Database=RevoTestDB;Trusted_Connection=Yes;"
> # need to pass connection string to registerStoredProcedure and executeStoredProcedure since we did not
> # provide one when creating trainSP2
> registerStoredProcedure(trainSP2, conStr)
[1] TRUE
> model <- executeStoredProcedure(trainSP2, indata, connectionString = conStr, verbose = TRUE)
exec spTrain_df_to_op  @input_data_1_outer = ?,  @mm_outer = ?
> rxReadObject(model$params[[1]])
Call:
rxLinMod(formula = formula, data = in_df, transformFunc = NULL, 
    transformVars = NULL)

Linear Regression Results for: ArrDelay ~ CRSDepTime + DayOfWeek + CRSDepHour:DayOfWeek
Data: in_df
Dependent variable(s): ArrDelay
Total independent variables: 16 (Including number dropped: 7)
Number of valid observations: 9713
Number of missing observations: 287 
 
Coefficients:
                                      ArrDelay
(Intercept)                        -12.2606079
CRSDepTime                           0.8384946
DayOfWeek=Monday                     9.9722108
DayOfWeek=Tuesday                   10.7932985
DayOfWeek=Wednesday                  0.2561862
DayOfWeek=Thursday                     Dropped
DayOfWeek=Friday                       Dropped
DayOfWeek=Saturday                     Dropped
DayOfWeek=Sunday                       Dropped
CRSDepHour for DayOfWeek=Monday     -0.4449961
CRSDepHour for DayOfWeek=Tuesday    -0.6194943
CRSDepHour for DayOfWeek=Wednesday   1.0483718
CRSDepHour for DayOfWeek=Thursday    0.5470396
CRSDepHour for DayOfWeek=Friday        Dropped
CRSDepHour for DayOfWeek=Saturday      Dropped
CRSDepHour for DayOfWeek=Sunday        Dropped

Scoring Example: batch prediction

The data comes into the function as a data frame, the model comes in as a parameter, the name assigned to the predicted variable comes in as a parameter as well. The prediction is returned as a data frame. P-value becomes an output parameter of the stored procedure. It does not get propagated back to R, however, it can be captured from SSMS.

> # indata - input data frame
> # model_param - serialized model object
> # predVarName - name of the predicted variable
> score1 <- function(in_df, model_param, predVarNameInParam) {
+     factorLevels <- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")
+     in_df[,"DayOfWeek"] <- factor(in_df[,"DayOfWeek"], levels=factorLevels)
+   mm <- rxReadObject(as.raw(model_param))
+   # Predict
+   result <- rxPredict(modelObject = mm,
+                       data = in_df,
+                       outData = NULL,
+                       predVarNames = predVarNameInParam,
+                       extraVarsToWrite = c("ArrDelay"),
+                       writeModelVars = TRUE,
+                       overwrite = TRUE)
+   return(list(result = result, pvOutParam = mm$f.pvalue))
+ }
> 
> # create  an InputData object for the input data frame in_df
> indata <- InputData(name = "in_df", defaultQuery = "SELECT top 10 * from cleanData")
> # create InputParameter objects for model_param and predVarNameInParam
> model <- InputParameter("model_param", "raw",
+                       defaultQuery = paste("select top 1 value from rdata",
+                                            "where [key] = 'linmod.v1'"))
> predVarNameInParam <- InputParameter("predVarNameInParam", "character")
> # create OutputData object for the data frame inside the return list
> outData <- OutputData("result")
> # create OutputParameter object for non data frame variable inside the return list
> pvOutParam <- OutputParameter("pvOutParam", "numeric")
> scoreSP1 <- StoredProcedure(score1, "spScore_df_param_df", indata, model, predVarNameInParam, outData, pvOutParam,
+                       filePath = "C:\\Users\\user\\Documents")
> conStr <- "Driver={ODBC Driver 13 for SQL Server};Server=.;Database=RevoTestDB;Trusted_Connection=Yes;"
> # connection string necessary for registrations and execution
> # since we did not pass it to StoredProcedure
> registerStoredProcedure(scoreSP1, conStr)
[1] TRUE
> model <- executeStoredProcedure(scoreSP1, predVarNameInParam = "ArrDelayEstimate", connectionString = conStr, verbose = TRUE)
exec spScore_df_param_df  @predVarNameInParam_outer = ?,  @pvOutParam_outer = ?
> model$data
     column1 column2   column3 column4 column5
1  1.8120845       6  9.666666  Monday       9
2  0.4219664      -2  6.416667  Monday       6
3  6.2803173      -2 20.833334  Monday      20
4  5.6771952     -15 19.583334  Monday      19
5  2.1357094      -7 10.583334  Monday      10
6  6.0449431      16 21.083332  Monday      21
7  2.5990813       0 11.666666  Monday      11
8  5.7213215      -9 20.166668  Monday      20
9  0.8853391       2  7.500000  Monday       7
10 1.2788375      19  8.500000  Monday       8
> model$params[[1]]
[1] 4.032124e-154