R Code Documentation

Feature Selection

.. FS_mRMR.R::
[sources]
#Runs mRMR feature selection in parallel as implemented in the mRMRe package: De Jay, N. et al. "mRMRe: an R package for parallelized mRMR ensemble feature selection." Bioinformatics (2013).
#Although run in parallel, the performance is still not very good for high-dimensional data sets (>20.000 features).
#The resulting scores sometimes seem to not be sorted.
#However, these scores are the individual features' scores and feature combinations can result in a different overall ranking.
#Invoked by python's featureselection.MRMRSelector class.
#
#@param args the input parameters parsed from the command line, consisting of
#         - absolute path to the input data set file.
#         - absolute path to the output file where the ranking will be stored.
#         - maximum number of features to select.

library(mRMRe)

args = commandArgs(trailingOnly=TRUE)

inputFile <- args[[1]]
outputLocation <- args[[2]]
maxFeatures <- args[[3]]


if (length(args)<3) {
  stop("Please supply three arguments: inputFile (data set), outputLocation (feature ranking), and maxFeatures (number of features to select)", call.=FALSE)
}

rawData <- read.csv(inputFile, check.names = FALSE, stringsAsFactors = TRUE)
#check.names= FALSE necessary because R introducing X for column names beginning with numbers
geneExpressionMatrix <- rawData[-c(1)]
geneExpressionMatrix[,1] <- as.numeric(geneExpressionMatrix[,1])

#do feature selection here
dd <- mRMR.data(data = geneExpressionMatrix)
features <- mRMR.classic(data = dd, target_indices = c(1), feature_count = maxFeatures)

ranking <- solutions(features)
scores <- features@scores[[1]]

colNames <- names(geneExpressionMatrix)
featureNames <- list(colNames[unlist(ranking)])

outputData <- list(featureNames,scores)

fileOutput <- as.data.frame(outputData)

colnames(fileOutput) <- c("attributeName", "score")

write.csv(fileOutput, file = outputLocation, row.names = FALSE, sep = "\t")

Runs mRMR feature selection in parallel as implemented in the mRMRe package: De Jay, N. et al. “mRMRe: an R package for parallelized mRMR ensemble feature selection.” Bioinformatics (2013). Although run in parallel, the performance is still not very good for high-dimensional data sets (>20.000 features). The resulting scores sometimes seem to not be sorted. However, these scores are the individual features’ scores and feature combinations can result in a different overall ranking. Invoked by featureselection.MRMRSelector.

Param args:the input parameters parsed from the command line, consisting of a) the absolute path to the input data set file, b) the absolute path to the output file where the ranking will be stored, and c)the maximum number of features to select.
.. FS_LassoPenalty.R::
[sources]
#Runs Lasso feature selection with individual penalty scores for each feature as implemented in the xtune package:
#Zeng, C. et al.: "Incorporating prior knowledge into regularized regression", Bioinformatics (2020), https://doi.org/10.1093/bioinformatics/btaa776
#Invoked by python's featureselection.LassoPenaltySelector class.
# #
#@param args the input parameters parsed from the command line, consisting of
#         - absolute path to the input data set file.
#         - absolute path to the output file where the ranking will be stored.
#         - absolute path to the input ranking file (where the external rankings that will serve as penalty scores are stored).

library(xtune)

args = commandArgs(trailingOnly=TRUE)

input.filename <- args[[1]]
output.filename <- args[[2]]
external.filename <- args[[3]]

if (length(args)<3) {
  stop("Please supply three arguments: inputFile (gene expression data), outputLocation (feature ranking), and filename for external scores", call.=FALSE)
}

expression.matrix <- read.csv(input.filename, check.names = FALSE, row.names = 1)
external.scores <- read.csv(external.filename, check.names = FALSE, row.names = 1)
#check.names= FALSE necessary because R introducing X for column names beginning with numbers

expression.levels  <- expression.matrix[-c(1,1)]

#we want to predict the labels = classes
labels <- expression.matrix[1]
#make labels numeric
label.types = unique(labels[,1])

labels <- as.numeric(factor(labels[,1], label.types, labels = 1:length(label.types) ))

#train the model
prior.knowledge.model <- xtune(as.matrix(expression.levels), labels, external.scores)

coefs <- coef(prior.knowledge.model)
final.scores <- as.data.frame(as.matrix(coefs))
#rename score column
colnames(final.scores) <- c("score")
#remove intercept element (drop param keeps row names)
final.scores <- final.scores[-c(1),, drop=FALSE]
feature.indexes <- order(final.scores$score,decreasing = TRUE)
final.scores <- final.scores[feature.indexes,, drop = FALSE] #drop retains row names
final.scores <- cbind(attributeName = rownames(final.scores), final.scores)

#write dataframe
write.table(final.scores, output.filename, row.names = FALSE, sep = "\t")

Runs Lasso feature selection with individual penalty scores for each feature as implemented in the xtune package: Zeng, C. et al.: “Incorporating prior knowledge into regularized regression”, Bioinformatics (2020), https://doi.org/10.1093/bioinformatics/btaa776 Invoked by featureselection.LassoPenaltySelector.

Param args:the input parameters parsed from the command line, consisting of a) the absolute path to the input data set file, b) the absolute path to the output file where the ranking will be stored, and c) the absolute path to the input ranking file (where the external rankings that will serve as penalty scores are stored).
.. FS_Variance.R::
[sources]
#Runs variance-based feature selection as implemented in the genefilter package.
#For every feature, computes its variance across all samples.
#Features are then ranked in descending order - highly variant features are more likely to separate samples into classes and seem to be the most interesting ones.
#Invoked by python's featureselection.VarianceSelector class.
#
#@param args the input parameters parsed from the command line, consisting of
#         - absolute path to the input data set file.
#         - absolute path to the output file where the ranking will be stored.

library(genefilter)

args = commandArgs(trailingOnly=TRUE)

if (length(args)<2) {
  stop("Please supply two arguments: inputFile (gene expression data) and outputLocation (feature ranking)", call.=FALSE)
}

rawData <- read.csv(args[1], check.names = FALSE)
#check.names= FALSE necessary because R introducing X for column names beginning with numbers

geneExpressionMatrix <- rawData[-c(1,2)]

rV <- rowVars(t(geneExpressionMatrix))

ordered <- rV[order(-rV) , drop = FALSE]

orderedNameList <-  names(ordered)

orderedNameValueList <- c(orderedNameList,data.frame(ordered)[,1])

orderedNameValueMatrix <- matrix(orderedNameValueList,ncol=2)

colnames(orderedNameValueMatrix) <- c("attributeName", "score")

write.table(orderedNameValueMatrix, file = args[2], row.names = FALSE, sep = "\t")

Runs variance-based feature selection as implemented in the genefilter package. For every feature, computes its variance across all samples. Features are then ranked in descending order - highly variant features are more likely to separate samples into classes and seem to be the most interesting ones. Invoked by featureselection.VarianceSelector.

Param args:the input parameters parsed from the command line, consisting of ab) the absolute path to the input data set file and b) the absolute path to the output file where the ranking will be stored.

Utility

.. DataCharacteristicsPlotting.R::
[sources]
#Creates plots to show some characteristics of a given expression data set and its class labels.
#Currently supported plots that can be selected:
# - density plots (density)
# - box plot (box)
# - mds plot (mds)
#Invoked by python's evaluation.DatasetEvaluator class.
#
#@param args the input parameters parsed from the command line, consisting of
#         - absolute path to the input expression file.
#         - absolute path to the output directory where the plots will be stored.
#         - separator to use for reading the input expression file.
#         - a boolean value whether the input expression data is labeled or not.
#         - a list of option names that define what plots are created. Currently supported: density (density plot), box (boxplot of expression values), mds).
library(ggplot2)
library(dplyr)
library(tidyr)
library(tools)

args = commandArgs(trailingOnly=TRUE)

input.filename <- args[[1]]
output.dir <- args[[2]]
separator <- args[[3]]
labeledData <- args[[4]]
options <- args[5:length(args)]

if (length(options) == 0){
  stop("No data characteristics were selected in config file", call.=FALSE)

}
rownamescol <- 1

data <- read.table(input.filename, sep= separator, header=TRUE, row.names = rownamescol, check.names = FALSE, stringsAsFactors = FALSE)
if (labeledData == "TRUE"){
  unlabeleddata <- data[-1]   #remove label column
  labelCol <- colnames(data)[1] 
}
filename = file_path_sans_ext(basename(input.filename))

#################### PLOT DENSITIES ####################
if ("density" %in% options){
  output.filename = paste0(output.dir, "density_", filename, ".pdf")
  pl2 <- data %>% gather(key = "Gene", value = "Expression", -1) %>% ggplot(., aes(x=Expression, color = get(labelCol))) + geom_density()
  pdf(output.filename)
  print(pl2)
  dev.off()
}

#################### PLOT DISTRIBUTION ####################
if ("box" %in% options){
  output.filename = paste0(output.dir, "distribution_", filename, ".pdf")
  boxpl <- data %>% gather(key = "Gene", value = "Expression", -1) %>% ggplot(., aes(get(labelCol), Expression, color = get(labelCol))) + geom_boxplot()
  pdf(output.filename)
  print(boxpl)
  dev.off()
}

#################### PLOT MDS ####################
if ("mds" %in% options){
  output.filename = paste0(output.dir, "mds_", filename, ".pdf")
  #compute euclidean distance matrix (euclidean distance is used in the limma package)
  dist.eu <- as.matrix(dist(unlabeleddata, method = "euclidean"))
  mds.eu <- as.data.frame(cmdscale(dist.eu))
  mds.eu <- merge(mds.eu, data[1], by="row.names", all=TRUE)

  mdspl <- ggplot(mds.eu, aes(V1, V2, label = classLabel, color = classLabel)) + 
    geom_point(size=2) + 
    labs(x="", y="", title="MDS by Euclidean") + theme_bw()

  pdf(output.filename)
  print(mdspl)
  dev.off()
}

Creates plots to show some characteristics of a given expression data set and its class labels. Currently supported plots that can be selected: density plots (density), box plot (box), and mds plot (mds). Invoked by evaluation.DatasetEvaluator.

Param args:the input parameters parsed from the command line, consisting of a) the absolute path to the input expression file, b) the absolute path to the output directory where the plots will be stored, c) the separator to use for reading the input expression file, d) a boolean value whether the input expression data is labeled or not, and e) a list of option names that define what plots are created. Currently supported: density (density plot), box (boxplot of expression values), mds).
.. UpsetDiagramCreation.R::
[sources]
#Uses the UpSetR package to create an upset diagram for a given set of features.
#Invoked by python's evaluation.RankingsEvaluator and evaluation.AnnotationEvaluator classes.
#
#@param args the input parameters parsed from the command line, consisting of
#         - absolute path to the output file where to store the created plot.
#         - number of top k features for which to compute the feature set overlaps.
#         - absolute path to the input directory containing the input files.
#         - string of color ids separated by "_", used for giving every feature ranking a unique color.
#         - list of input files containing the rankings. their order corresponds to the order of colors.
library(UpSetR)
library(stringr)

args = commandArgs(trailingOnly=TRUE)

outputFile = args[[1]]

topK = strtoi(args[[2]])

inputPath = args[[3]]

#remove training _
colorstring = substring(args[[4]], 2)

#split by _
colors = str_split(colorstring, "_")
rankings = list()
setCount = 0

for (filename in args[5:length(args)]) {
  approach.topK = topK
  topGenes = list()
  #find fileending
  parts = gregexpr(pattern ='\\.', filename)[[1]]
  method = substr(filename, 1, parts[length(parts)] - 1)
  file = paste(inputPath, filename, sep="")
  ranking <- read.csv(file, sep = "\t", stringsAsFactors = FALSE)
  #get top n genes for the overlap and adapt topK if we have fewer genes
  numRows = nrow(ranking)
  if (numRows > 0){
    if (numRows < approach.topK){
      approach.topK = numRows
    }
    genes = head(ranking, n = approach.topK)[[1]]
    #as the genes in the columns are by default detected as factors, we just get the levels (=distinct names)
    topGenes = genes
    
  }
    
  #only add approach results if they are not empty
  if (length(topGenes) > 0){
    #make a list out of topGenes otherwise we have a format issue
    rankings[method] = list(topGenes)
    setCount = setCount + 1
  }
}
print(paste0("UPSETR SETCOUNT ", toString(setCount)))
print(paste0("UPSETR COLORCOUNT ", toString(length(colors))))
pdf(file=outputFile, onefile=FALSE)
upset(fromList(rankings), nsets = setCount, order.by = "freq", sets.bar.color = unlist(colors, use.names=FALSE))

dev.off()
Uses the UpSetR package to create an upset diagram for a given set of features. Invoked by :class:`evaluation.RankingsEvaluator` and :class:`evaluation.AnnotationEvaluator`. :param args: the input parameters parsed from the command line, consisting of a) the absolute path to the output file where to store the created plot, b) the number of top k features for which to compute the feature set overlaps, c) absolute path to the input directory containing the input files, d) a string of color ids separated by "_", used for giving every feature ranking a unique color, and e) a list of input files containing the rankings, where the rankings order corresponds to the order of colors.
.. IdentifierMapping.R::
[sources]
#Maps a set of identifiers to a desired format by using the biomaRt package.
#Currently, usage of this functionality is not enabled as biomaRt showed to be very unstable for returning queries in parallel
#(the server is not reachable, the connection is blocked, ...).
#The current implementation sends the identifiers for mapping in chunks of 10.000 identifiers
#(that was one desperate try to improve biomaRt stability, but it probably did not help...).
# Invoked by python's knowledgebases.BioMART class.
#
#@param args the input parameters parsed from the command line, consisting of
#         - original ID format (corresponding to biomaRt identifiers), e.g. ensembl_gene_id
#         - desired ID format (corresponding to biomaRt identifiers), e.g. hgnc_symbol
#         - absolute path to the input file, which contains one identifier per line
#         - absolute path to the output file where the mapping will be stored.

library(biomaRt)
library(httr)

args = commandArgs(trailingOnly=TRUE)
originalIDFormat <- args[[1]]
requiredIDFormat <- args[[2]]
inputFile <- args[[3]]
outputFile <- args[[4]]
httr::set_config(httr::config(ssl_verifypeer = FALSE))
mart<-useEnsembl(biomart = "ENSEMBL_MART_ENSEMBL",
                 dataset = "hsapiens_gene_ensembl", mirror = "useast")
print(inputFile)
#load data set from file
data <- read.csv(inputFile, header = FALSE, stringsAsFactors = FALSE)
#fetch gene mapping
#attributes: what your results shall include
#filters: what platform your IDs are currently from
#values: the concrete IDs as input
#mapping <- getBM(attributes = c(originalIDFormat, requiredIDFormat), filters = originalIDFormat, values=data, mart = mart)
chunksize = 10000

final_mapping = NULL
chunks = as.integer(nrow(data)/chunksize)
if (chunks > 0){
  for (i in 1:chunks){
    start = ((i-1) * chunksize) + 1
    end = i * chunksize
    genechunk = data[start:end,1]
    mapping <- getBM(attributes = c(originalIDFormat, requiredIDFormat), filters = originalIDFormat, values=(genechunk), mart = mart)
    if (is.null(final_mapping)){
      final_mapping = mapping
    } else{
      final_mapping = rbind(final_mapping, mapping)
    }
  }
}

leftover = as.integer(nrow(data) - (chunks * chunksize))
if (leftover > 0){
  start = (chunks * chunksize) + 1
  end = nrow(data)
  last_genechunk = data[start:end,1]
  mapping <- getBM(attributes = c(originalIDFormat, requiredIDFormat), filters = originalIDFormat, values=(last_genechunk), mart = mart)
  if (is.null(final_mapping)){
    final_mapping = mapping
  } else{
    final_mapping = rbind(final_mapping, mapping)
  }
}
write.csv(final_mapping, outputFile, row.names = FALSE)

The current implementation sends the identifiers for mapping in chunks of 10.000 identifiers (that was one desperate try to improve biomaRt stability, but it probably did not help…). Invoked by knowledgebases.BioMART.

Param args:the input parameters parsed from the command line, consisting of a) the original ID format (corresponding to biomaRt identifiers), e.g. ensembl_gene_id, b) the desired ID format (corresponding to biomaRt identifiers), e.g. hgnc_symbol, c) the absolute path to the input file, which contains one identifier per line, and d) the absolute path to the output file where the mapping will be stored.