Prior Knowledge Approaches

Comprior provides multiple prior knowledge approaches of the types Modifying Prior Knowledge Approaches, Combining Approaches, and Network/Pathway Approaches, as defined by Perscheid: “Integrative biomarker detection on high-dimensional gene expression data sets: a survey on prior knowledge approaches”.
All of them can be flexibly combined with any of the available knowledge bases (see the configuration parameter description on how to do that).

Modifying Prior Knowledge Approaches

type of prior knowledge used: list of relevant genes (no association scores)
traditional feature selection and prior knowledge retrieval are carried out independently
Comprior lets you design flexible modifying prior knowledge approaches that can be combined with any knowledge base and any traditional approach
these are two-level approaches that introduce an additional filtering or extension step before or after a traditional feature selection approach

Prefilter: prior knowledge is retrieved first and the input data set is filtered for those genes that were retrieved from the knowledge base; traditional feature selection is carried out afterwards
Postfilter: traditional feature selection is carried out first, and the resulting features are then filtered to keep only those that were also retrieved from the knowledge base
prefilter and postfilter approaches produce the same results for univariate feature selection approaches, e.g. Variance, because such approaches score every gene independently of all other genes
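
The equivalence of prefilter and postfilter for a univariate selector can be illustrated with a small sketch. It is not Comprior's implementation: the function names (`variance_scores`, `rank`, `prefilter`, `postfilter`), the toy expression values, and the choice of population variance are assumptions made for this example.

```python
from statistics import pvariance

def variance_scores(data):
    """Score every gene by the variance of its expression values."""
    return {gene: pvariance(values) for gene, values in data.items()}

def rank(scores):
    """Return gene names sorted by descending score."""
    return sorted(scores, key=scores.get, reverse=True)

def prefilter(data, kb_genes):
    # Keep only knowledge base genes, then run the traditional selector.
    filtered = {g: v for g, v in data.items() if g in kb_genes}
    return rank(variance_scores(filtered))

def postfilter(data, kb_genes):
    # Run the traditional selector first, then drop genes
    # that were not retrieved from the knowledge base.
    return [g for g in rank(variance_scores(data)) if g in kb_genes]

# Hypothetical toy data: gene -> expression values across three samples.
expression = {
    "TP53": [1.0, 5.0, 9.0],
    "BRCA1": [2.0, 2.5, 3.0],
    "EGFR": [0.0, 4.0, 4.0],
    "MYC": [3.0, 3.0, 3.1],
}
kb_genes = {"TP53", "BRCA1", "MYC"}

pre = prefilter(expression, kb_genes)
post = postfilter(expression, kb_genes)
```

Both calls rank the knowledge base genes identically here, since removing EGFR does not change the variance of any other gene. With a multivariate selector, the score of a gene can depend on which other genes are present, so the two orders of operation may diverge.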

for the extension step, Comprior retrieves relevant genes from the knowledge base and interleaves the gene ranking produced by a traditional approach with this set of relevant genes
this way, a feature set contains not only traditionally selected genes but also nearly as many genes retrieved from the knowledge base, so it can include genes that have a high statistical relevance but no (so far identified) biological relevance according to the knowledge base, and vice versa
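
One way to picture the interleaving is a simple alternation between the two sources. The sketch below is an assumption about how such a merge could look, not Comprior's actual procedure: the function name `interleave`, the strict turn-taking order, and the duplicate handling are all hypothetical.

```python
def interleave(trad_ranking, kb_genes, k):
    """Alternate between both sources until k unique genes are selected."""
    sources = [iter(trad_ranking), iter(kb_genes)]
    exhausted = [False, False]
    selected, seen = [], set()
    turn = 0  # 0 = traditional ranking, 1 = knowledge base genes
    while len(selected) < k and not all(exhausted):
        if not exhausted[turn]:
            for gene in sources[turn]:
                if gene not in seen:  # skip genes already taken from the other source
                    seen.add(gene)
                    selected.append(gene)
                    break
            else:
                # The loop ended without a break: this source has no new genes left.
                exhausted[turn] = True
        turn = 1 - turn
    return selected

# Hypothetical inputs: a statistical ranking and an ordered list of knowledge base genes.
trad_ranking = ["G1", "G2", "G3", "G4"]
kb_genes = ["G7", "G2", "G9"]

top5 = interleave(trad_ranking, kb_genes, 5)
everything = interleave(trad_ranking, kb_genes, 10)
```

Because a gene found in both sources is only counted once, the knowledge base contributes "nearly as many" genes as the traditional ranking rather than exactly half.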

type of prior knowledge used: relevant genes and their association scores (for the search terms)
traditional feature selection and prior knowledge retrieval are carried out in parallel and integrated more thoroughly
if a gene has multiple association scores (because it is associated with multiple search terms), Comprior will always keep the highest association score and remove the duplicate entries
potentially, network information can also be retrieved via Comprior and then be mapped to some kind of relevance score, e.g. by incorporating topological information of a gene
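
The rule for genes with multiple association scores (keep the highest, drop the rest) can be sketched as follows. The function name `keep_highest_scores` and the `(gene, score)` input format are assumptions for illustration; Comprior's internal data structures may differ.

```python
def keep_highest_scores(associations):
    """Collapse (gene, score) pairs to one entry per gene, keeping the maximum score."""
    best = {}
    for gene, score in associations:
        if gene not in best or score > best[gene]:
            best[gene] = score
    return best

# Hypothetical retrieval result: TP53 was returned for two different search terms.
retrieved = [("TP53", 0.6), ("EGFR", 0.4), ("TP53", 0.9)]
deduplicated = keep_highest_scores(retrieved)
```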

gene association scores are used as an individual penalty term per feature in Lasso regression
Comprior uses the xtune R package implementation by Zeng et al.: “Incorporating prior knowledge into regularized regression”
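
As a point of reference, a Lasso with one penalty per feature can be written as below. This is the generic feature-specific formulation, not a transcription of xtune's internals; how the association scores are translated into the penalties \(\lambda_i\) is handled by xtune and is not specified here. The symbols \(X\) (expression matrix), \(y\) (class labels), \(\beta\) (coefficients), \(n\) (number of samples), and \(m\) (number of genes) are introduced for illustration only.

\[
\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \sum_{i=1}^{m} \lambda_i \lvert \beta_i \rvert
\]

If all \(\lambda_i\) are equal, this reduces to the standard Lasso. A gene with a smaller \(\lambda_i\) is shrunk less and is therefore more likely to keep a nonzero coefficient.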

the final relevance score \(s_i\) for a gene \(i\) is made up of two parts: the association score from the knowledge base \(s_{i,kb}\), and the statistical relevance score \(s_{i,trad}\) from a traditional approach
both scores are equally weighted to compute the final relevance score for a gene: \(s_i = s_{i,kb} \times s_{i,trad}\)
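
The product of both scores can be sketched in a few lines. The function name `combine_scores` and the parameter `default_kb_score` are hypothetical; in particular, assigning a score of 0 to genes without a knowledge base entry is an assumption of this example, not documented behavior.

```python
def combine_scores(kb_scores, trad_scores, default_kb_score=0.0):
    """Compute s_i = s_{i,kb} * s_{i,trad} for every traditionally scored gene."""
    return {
        gene: kb_scores.get(gene, default_kb_score) * s_trad
        for gene, s_trad in trad_scores.items()
    }

# Hypothetical scores: MYC has no knowledge base entry.
kb_scores = {"TP53": 0.9, "EGFR": 0.5}
trad_scores = {"TP53": 0.4, "EGFR": 0.8, "MYC": 0.7}

final_scores = combine_scores(kb_scores, trad_scores)
ranking = sorted(final_scores, key=final_scores.get, reverse=True)
```

In this toy example EGFR outranks TP53 even though TP53 has the higher association score, because the product rewards genes that score reasonably well on both sides.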

network/pathway approaches use network information to identify (sub-)networks or pathways as new features and map the feature space from the original genes to the (sub-)networks
network/pathway approaches thus always have a) a feature, i.e. pathway/subnetwork, selection step and b) a mapping step where new feature values must be computed

feature selection as described by Tian et al.: “Discovering statistically significant pathways in expression profiling studies”
- a pathway/subnetwork is considered relevant if the gene expression profiles of its member genes correlate with the data set classes
- the pathway score is the average of the ANOVA scores computed for each pathway member gene against the class labels
- rank pathways (= new features) by their ANOVA scores
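
The pathway selection steps listed above can be sketched as follows. This is a minimal illustration that assumes the ANOVA score is the one-way ANOVA F statistic; the function names (`anova_f`, `pathway_score`), the toy data, and the pathway memberships are hypothetical and do not come from Comprior or from Tian et al.

```python
from statistics import mean

def anova_f(values, labels):
    """One-way ANOVA F statistic of one gene's expression values against class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    grand_mean = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    # Assumes ss_within > 0, i.e. at least one class has non-constant values.
    return (ss_between / df_between) / (ss_within / df_within)

def pathway_score(expression, member_genes, labels):
    """Average the per-gene ANOVA scores of all pathway member genes."""
    return mean(anova_f(expression[g], labels) for g in member_genes)

# Hypothetical toy data: four samples, two classes.
labels = ["control", "control", "tumor", "tumor"]
expression = {
    "A": [1, 2, 8, 9],  # separates the classes well
    "B": [5, 6, 5, 6],  # no class difference
    "C": [1, 3, 2, 4],  # weak class difference
}
pathways = {"P1": ["A", "C"], "P2": ["B", "C"]}

scores = {p: pathway_score(expression, genes, labels) for p, genes in pathways.items()}
ranked_pathways = sorted(scores, key=scores.get, reverse=True)
```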
feature mapping is based on Vert and Kanehisa’s definition of pathway relevance and smoothness: “Graph-driven feature extraction from microarray data using diffusion kernels and kernel CCA”
- computes pathway activity scores for every sample and pathway as new feature values
- the feature value \(v_{p,s}\) for a pathway \(p\) and sample \(s\) is computed by taking the expression levels of all member genes \(i\) (\(expr_i\)) and weighting these by the variance \(var_i\) of gene \(i\) and the average correlation score \(corr_{i,neighbors_i}\) of its neighbor genes in pathway \(p\): \(v_{p,s} = average(expr_i \times var_i \times corr_{i,neighbors_i})\)
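
The feature mapping can be sketched as below. Several details are assumptions of this example rather than statements about Comprior: variance and correlation are computed across all samples, the variance is the population variance, the correlation is Pearson's, and pathway topology is given as a hypothetical `neighbors` adjacency dictionary.

```python
from statistics import mean, pvariance

def pearson(x, y):
    """Pearson correlation of two equally long value lists (assumes non-constant input)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def pathway_activity(expression, member_genes, neighbors):
    """Return one activity value v_{p,s} per sample for a single pathway p."""
    n_samples = len(expression[member_genes[0]])
    weights = {}
    for gene in member_genes:
        var_i = pvariance(expression[gene])
        # Average correlation of gene i with its neighbor genes in the pathway.
        corr_i = mean(pearson(expression[gene], expression[n]) for n in neighbors[gene])
        weights[gene] = var_i * corr_i
    # Average the weighted expression levels over all member genes, per sample.
    return [
        mean(expression[g][s] * weights[g] for g in member_genes)
        for s in range(n_samples)
    ]

# Hypothetical toy pathway A - B - C with three samples.
expression = {"A": [1, 2, 3], "B": [2, 4, 6], "C": [3, 2, 1]}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}

activity = pathway_activity(expression, ["A", "B", "C"], neighbors)
```

In this toy pathway, gene B correlates positively with A and negatively with C, so its average neighbor correlation is zero and it does not contribute to the activity score.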