vigor {VIGoR} | R Documentation |
This function performs Bayesian genome-wide regression using variational Bayesian algorithms. The available regression methods are Bayesian lasso (BL), extended Bayesian lasso (EBL), weighted Bayesian shrinkage regression (wBSR), BayesB, BayesC, stochastic search variable selection (SSVS), and Bayesian mixture model (MIX).
vigor(Pheno, Geno, Method = c("BL", "EBL", "wBSR", "BayesB", "BayesC", "SSVS", "MIX"), Hyperparameters, Function = "fitting", Nfold = 10, CVFoldTuning = 5, Partition=NULL, Covariates = "Intercept", Threshold = 2+log10(ncol(Geno)), Maxiterations=1000, RandomIni=FALSE, Printinfo=TRUE)
Pheno |
An N-length Vector of response variables (e.g., phenotypic values), where N is the number of individuals. Missing data (coded as NA) are allowed. |
Geno |
An N x P matrix of marker genotypes, where P is the number of markers. Both integers and doubles can be used. Because missing values are not allowed, missing genotypes should be imputed before analysis. |
Method |
String representing the selected regression method. See details below. |
Hyperparameters |
A vector or matrix of hyperparameter values. When multiple combinations of hyperparameter values (hyperparameter sets) are used, the columns of the matrix correspond to the hyperparameters, and the rows correspond to the combinations (sets). See details below. |
Function |
One of the strings "fitting", "tuning", and "cv". See details below. |
Nfold |
An integer value. When n > 1, n-fold cross-validation (CV) is performed on randomly partitioned individuals. When the integer is -1, leave-one-out CV is conducted. When the integer is -9, the partitioning for the CV is defined by the argument Partition. Used when Function = "cv". |
CVFoldTuning |
An integer specifying the fold number of the CV in hyperparameter tuning. Used when Function = "cv" or "tuning" and the number of hyperparameter sets (number of rows of Hyperparameters) > 1. |
Partition |
A matrix defining the partitions of CV. See details and examples below. Used when Function = "cv" or "tuning" and Nfold = -9. |
Covariates |
If Covariates = "Intercept", the intercept is automatically added to the regression model. An N x F matrix where F is the number of covariates also can be input as the covariates. Both integers and doubles are permitted. Missing values are not allowed. See details below. |
Threshold |
Specifies the convergence threshold. Calculation terminates if the convergence metric is < 1e-Threshold. See the pdf manual of VIGoR (Onogi 2015) for the metric. |
Maxiterations |
Maximum number of iterations. |
RandomIni |
If TRUE, the initial values of the marker effects are randomly determined. Otherwise, they are set to 0. |
Printinfo |
If TRUE, print the run information to the console. |
For details of vigor, the user is referred to the pdf manual (Onogi 2015).
Regression methods
Vigor assumes the following linear model for individual i;
Pheno[i] = sum(Covariates[i,]*Alpha) + sum(Gamma*Geno[i,]*Beta) + Ei
where Pheno[i] is the response variable, Alpha contains the covariate coefficients, and Gamma contains the variables that indicate whether the corresponding markers are included in the model (1) or not (0). Beta contains the marker effects, and Ei is the residual. Gamma is set to 1 except in wBSR, which infers Gamma. Ei is assumed to follow a Normal (0, 1/Tau02) distribution, where Tau02 is assumed to follow 1/Tau02.
The methods assume different prior distributions of the marker effect p (Beta[p]).
BL
Beta[p] ~ normal (0, sqrt(1/Tau2[p]/Tau02))
Tau2[p] ~ inverse gamma (1, Lambda2/2)
Lambda2 ~ gamma (Phi, Omega)
EBL
Beta[p] ~ normal (0, sqrt(1/Tau2[p]/Tau02))
Tau2[p] ~ inverse gamma (1, Delta2*Eta2[p]/2)
Delta2 ~ gamma (Phi, Omega)
Eta2[p] ~ gamma (Psi, Theta)
wBSR
Beta[p] ~ normal (0, sqrt(Sigma2[p]))
Sigma2[p] ~ scaled-inverse-chi square (Nu, S2)
Gamma[p] ~ Bernoulli (Kappa)
BayesB
Beta[p] ~ normal (0, sqrt(Sigma2[p])) if Rho[p]=1, and 0 if Rho[p]=0
Sigma2[p] ~ scaled-inverse-chi square (Nu, S2)
Rho[p] ~ Bernoulli (Kappa)
BayesC
Beta[p] ~ normal (0, sqrt(Sigma2)) if Rho[p]=1, and 0 if Rho[p]=0
Sigma2 ~ scaled-inverse-chi square (Nu, S2)
Rho[p] ~ Bernoulli (Kappa)
SSVS
Beta[p] ~ normal (0, sqrt(Sigma2)) if Rho[p]=1, and normal(0, sqrt(c*Sigma2)) if Rho[p]=0
Sigma2 ~ scaled-inverse-chi square (Nu, S2)
Rho[p] ~ Bernoulli (Kappa)
MIX
Beta[p] ~ normal (0, sqrt(Sigma2[1])) if Rho[p]=1, and normal(0, sqrt(Sigma2[2])) if Rho[p]=0
Sigma2[1] ~ scaled-inverse-chi square (Nu, S2)
Sigma2[2] ~ scaled-inverse-chi square (Nu, c*S2)
Rho[p] ~ Bernoulli (Kappa)
Hyperparameters
To run vigor, the following hyperparameter values must be declared as arguments for Hyperparameters.
BL: Phi, Omega
EBL: Phi, Omega, Psi, Theta
wBSR: Nu, S2, Kappa
BayesB: Nu, S2, Kappa
BayesC: Nu, S2, Kappa
SSVS: c, Nu, S2, Kappa
MIX: c, Nu, S2, Kappa
The argument Hyperparameters is a Nh-length vector, or a (Nh x Nset) matrix, where Nh and Nset are the number of hyperparameters and the number of hyperparameter sets, respectively. For example, when the regression method is BL, the matrix
1 0.001
1 0.01
1 0.1
indicates that Phi = 1 and Omega = 0.001 in the first set, Phi = 1 and Omega = 0.01 in the second set, and Phi = 1 and Omega = 0.1 in the third set. As another example, when the regression model is BayesB, the vector
4 0.5 0.001
indicates that Nu = 4, S2 = 0.5, and Kappa = 0.001. Vectors or matrices for Hyperparameters can be created using hyperpara.
Functions
The functions of vigor are "fitting", "tuning", and "cv". "Fitting" fits the selected regression model to the data. When Hyperparameters includes multiple hyperparameter sets (i.e., is input as a matrix), only the first set is used.
"Tuning" selects the hyperparameter set with the lowest MSE in CV. This set is then used in the model fitting. When tuning the hyperparameters, CV is performed on randomly partitioned data. The number of folds is determined by CVFoldTuning.
The "cv" function conducts CV, and returned the predicted values. When Hyperparameters includes multiple hyperparameter sets, tuning is performed at each fold of the CV.
Partition matrix
The following is a possible Partition of 20 individuals evaluated in a five-fold CV:
14 11 3 2 7
5 4 20 10 9
6 8 16 15 12
18 13 17 1 19
Individuals (row numbers in Pheno/Geno/Covariates) 14, 5, 6, and 18 are removed from the training set at the first fold of the five-fold CV. Samples 11, 4, 8, and 13 are removed at the next fold. This process is repeated up to the fold number of the CV. If the number of individuals N is 19, the gap is filled with -9. For example,
8 6 3 14 18
12 4 1 15 5
17 9 13 11 10
19 16 7 2 -9
An example of random sampling validation in which individuals can be sampled more than once is shown below.
18 3 11 16 13
17 8 13 13 18
7 15 14 19 7
1 13 12 7 2
Individuals 18, 13, and 7 are repeatedly used as testing samples.
Random partitioning (i.e., Nfold = n) outputs a Partition matrix, which can be input as the Partition matrix in subsequent analysis.
Intercept
If Covariates = "Intercept", vigor automatically adds the intercept to the regression model. If Covariates is a user-specified matrix, vigor uses this matrix as the covariates, regarding the first column as the intercept. Thus, the first column of Covariates should be filled with 1s. However, when Covariates is a matrix of population-assignment probabilities when correcting for population stratification, the intercept is not necessary.
Standardization
Vigor standardizes Pheno (response variables). Although the returned values are generally scaled back to the original scale, some estimates are reported in the standardized scale. See value.
When Function = "fitting" or "tuning", a list containing the following elements is returned.
$LB |
Lower bound of the marginal log likelihood of Pheno. |
$ResidualVar |
Residual variances (1/Tau02) of the iterations. Reported in the standardized scale. |
$Beta |
Posterior means of the marker effects (E[Beta|Pheno]). |
$Sd.beta |
Posterior uncertainty (standard deviation) of the marker effects (sqrt(Var[Beta|Pheno])). |
$Tau2 |
Posterior mean of Tau2. |
$Sigma2 |
Posterior mean of Sigma2. |
$Alpha |
Posterior means of Alpha. |
$Sd.alpha |
Posterior uncertainty of Alpha. |
$Lambda2 |
Posterior mean of Lambda2. Reported in the standardized scale. Returned only by the BL model. |
$Delta2 |
Posterior mean of Delta2. Reported in the standardized scale. Returned only by the EBL model. |
$Eta2 |
Posterior means of Eta2. Reported in the standardized scale. Returned only by the EBL model. |
$Gamma |
Posterior means of Gamma. Returned only by the wBSR model. |
$Rho |
Posterior means of Rho. Returned only by the BayesB, BayesC, SSVS, and MIX models. |
$MSE |
A data frame with (2 + Nh) columns, where Nh is the number of hyperparameters. Returned when Function = "tuning".
|
When Function = "cv", a list containing the following elements is returned.
$Prediction |
A data frame with 4 columns, Test, Y, Yhat, and BV:
|
$MSE |
A data frame with (3 + Nh) columns, returned when analyzing multiple hyperparameter sets.
|
$Partition |
A matrix representing the partition used in random partitioning. This matrix can be used as the argument Partition in subsequent analyses. |
Akio Onogi
Hiroyoshi Iwata
Onogi & Iwata, VIGoR: variational Bayesian inference for genome-wide regression, in prep.
Onogi A, 2015, Documents for VIGoR (May 2015).https://github.com/Onogi/VIGoR
#data data(sampledata) dim(Geno)#100 samples and 1000 markers dim(Pheno)#100 samples and 3 traits dim(Covariates)#100 samples and 2 covariates (including the intercept) #Use BL. Draw a simple Manhattan plot. Result<-vigor(Pheno$Height,Geno,"BL",c(1,1),Covariates=Covariates) plot(abs(Result$Beta),pch=20) which((abs(Result$Beta)-1.96*Result$Sd.beta)>0) #Significant markers (P<0.05) #Use BayesC without covariates. Result<-vigor(Pheno$Height,Geno,"BayesC",matrix(c(4,1,0.01,4,1,0.001),nr=2,byrow=TRUE)) plot(abs(Result$Beta),pch=20) which((abs(Result$Beta)-1.96*Result$Sd.beta)>0) print(Result$Alpha)#intercept is automatically added. #Tuning hyperparameters. Two hyperparameter sets are given as a matrix. Use BayesB. H<-matrix(c(5,1,0.001,5,1,0.01),nc=3,byrow=TRUE) print(H) Result<-vigor(Pheno$Height,Geno,"BayesB",H,Function="tuning",Covariates=Covariates) plot(abs(Result$Beta),pch=20) print(Result$MSE)#the set with the lowest MSE was used #When Function of vigor is "fitting", only the first set is used for regression. #to repeat analyses under the different sets, for example, Result<-as.list(numeric(2)) for(set in 1:2){Result[[set]]<-vigor(Pheno$Height,Geno,"wBSR",H[set,],Covariates=Covariates)} #Perform cross-validation. Use BL. Number of hyperparameter sets is 2. #the first set is c(1,0.01) and the second is c(1,0.1) #6-fold CV Result<-vigor(Pheno$Height,Geno,"BL", matrix(c(1,0.01,1,0.1),ncol=2,byrow=TRUE),Function="cv",Nfold=6,Covariates=Covariates) plot(Result$Prediction$Y,Result$Prediction$Yhat) #plot true and predicted values cor(Result$Prediction$Y,Result$Prediction$Yhat) #accuracy print(Result$MSE) #see which the set used at each fold. print(Result$Partition) #see the partition of CV #Perform CV using the same partition. Use BayesC. H<-matrix(c(5,1,0.01,5,1,0.1),nc=3,byrow=TRUE) Result2<-vigor(Pheno$Height,Geno,"BayesC",H,Function="cv",Nfold=-9, Partition=Result$Partition,Covariates=Covariates) cor(Result2$Prediction$Y,Result2$Prediction$Yhat) #accuracy