Prior Knowledge Approaches

Modifying Prior Knowledge Approaches

  • type of prior knowledge used: list of relevant genes (no association scores)
  • traditional feature selection and prior knowledge retrieval are carried out independently
  • Comprior makes it possible to design flexible modifying prior knowledge approaches that can be combined with any knowledge base and any traditional approach
  • these are two-level approaches that add a filtering or extension step before or after a traditional feature selection approach

Filtering

  • Prefilter: prior knowledge is retrieved first and the input data set is filtered for those genes that were retrieved from the knowledge base; traditional feature selection is carried out afterwards
  • Postfilter: Traditional feature selection is carried out first, and the resulting features are then filtered to keep only those that were also retrieved by the knowledge base
  • prefilter and postfilter approaches yield the same results for univariate feature selection approaches, e.g. Variance, because such approaches score each gene independently of all other genes
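
The two filtering variants can be sketched as follows. This is a minimal illustration under stated assumptions, not Comprior's API: `variance_ranking`, `prefilter`, and `postfilter` are hypothetical helpers, the data set is a plain dict mapping gene names to expression values, and the knowledge base result is a plain set of gene names.

```python
from statistics import pvariance

def variance_ranking(data):
    """Rank genes by descending variance (a univariate traditional approach)."""
    return sorted(data, key=lambda gene: pvariance(data[gene]), reverse=True)

def prefilter(data, kb_genes):
    """Hypothetical prefilter: reduce the data set to knowledge base genes, then rank."""
    reduced = {gene: values for gene, values in data.items() if gene in kb_genes}
    return variance_ranking(reduced)

def postfilter(data, kb_genes):
    """Hypothetical postfilter: rank all genes, then keep only knowledge base genes."""
    return [gene for gene in variance_ranking(data) if gene in kb_genes]

# Toy data: gene -> expression values across three samples (made up for illustration).
data = {
    "TP53": [1.0, 5.0, 9.0],
    "BRCA1": [2.0, 2.5, 3.0],
    "EGFR": [0.0, 4.0, 4.0],
    "GAPDH": [7.0, 7.0, 7.1],
}
# Genes returned by the knowledge base; MYC is not measured in the data set.
kb_genes = {"BRCA1", "EGFR", "MYC"}

pre = prefilter(data, kb_genes)
post = postfilter(data, kb_genes)
```

Both `pre` and `post` contain EGFR followed by BRCA1 here, since the variance of a gene does not depend on which other genes are present. For multivariate approaches the two orders of operation can give different rankings.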

Extension

  • Comprior retrieves relevant genes and interleaves the gene ranking retrieved by a traditional approach with the set of relevant genes from the knowledge base
  • this way, a feature set always contains not only traditionally selected genes but also nearly as many genes retrieved from a knowledge base; it can therefore contain genes that have a high statistical relevance but no (so far identified) biological relevance according to the knowledge base, and vice versa
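
One possible way to interleave the two gene lists is sketched below. The exact interleaving order used by Comprior is not stated here, so this sketch assumes strict alternation that starts with the traditional ranking and skips genes already taken; `interleave` is a hypothetical helper, and the knowledge base genes are assumed to arrive as an ordered list.

```python
from itertools import zip_longest

def interleave(traditional_ranking, kb_genes):
    """Alternate between both lists, keeping the first occurrence of each gene."""
    merged, seen = [], set()
    for pair in zip_longest(traditional_ranking, kb_genes):
        for gene in pair:
            # zip_longest pads the shorter list with None
            if gene is not None and gene not in seen:
                seen.add(gene)
                merged.append(gene)
    return merged

# Toy inputs (made up for illustration).
traditional_ranking = ["TP53", "EGFR", "BRCA1", "GAPDH"]
kb_genes = ["MYC", "EGFR", "KRAS"]

features = interleave(traditional_ranking, kb_genes)
top_four = features[:4]
```

With these inputs, `top_four` holds two genes that only the traditional approach ranked highly (TP53, BRCA1), one that only the knowledge base returned (MYC), and one found by both (EGFR).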

Combining Approaches

  • type of prior knowledge used: relevant genes and their association scores (for the search terms)
  • traditional feature selection and prior knowledge retrieval are carried out in parallel and integrated more thoroughly
  • if a gene has multiple association scores (because it is associated with multiple search terms), Comprior always keeps the highest association score and removes the duplicate entries
  • potentially, network information can also be retrieved via Comprior and then be mapped to some kind of relevance score, e.g. by incorporating topological information of a gene
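
The duplicate handling for association scores can be illustrated with a short sketch. It assumes the knowledge base returns (gene, score) pairs, one per matching search term; `deduplicate_scores` is a hypothetical name, not a Comprior function.

```python
def deduplicate_scores(associations):
    """Keep only the highest association score per gene."""
    best = {}
    for gene, score in associations:
        if gene not in best or score > best[gene]:
            best[gene] = score
    return best

# Toy result for two search terms (made up for illustration):
# TP53 and EGFR are each associated with both terms.
associations = [("TP53", 0.4), ("EGFR", 0.7), ("TP53", 0.9), ("EGFR", 0.2)]

kb_scores = deduplicate_scores(associations)
```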

LassoPenalty

WeightedScore

  • the final relevance score \(s_i\) for a gene \(i\) is made up of two parts: the association score from the knowledge base \(s_{i,kb}\), and the statistical relevance score \(s_{i,trad}\) from a traditional approach
  • both scores are equally weighted to compute the final relevance score for a gene: \(s_i = s_{i,kb} \times s_{i,trad}\)
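
The score combination \(s_i = s_{i,kb} \times s_{i,trad}\) can be sketched as below. How genes without an association score are treated is not stated here; this sketch simply skips them, which is an assumption. `weighted_scores` and the toy values are hypothetical.

```python
def weighted_scores(kb_scores, trad_scores):
    """Multiply knowledge base and traditional scores for genes that have both."""
    return {
        gene: kb_scores[gene] * trad_scores[gene]
        for gene in trad_scores
        if gene in kb_scores  # assumption: genes without a knowledge base score are skipped
    }

# Toy scores (made up for illustration).
kb_scores = {"TP53": 0.9, "EGFR": 0.5}
trad_scores = {"TP53": 2.0, "EGFR": 6.0, "GAPDH": 10.0}

final_scores = weighted_scores(kb_scores, trad_scores)
# Rank genes by descending final relevance score.
ranking = sorted(final_scores, key=final_scores.get, reverse=True)
```

In this toy case EGFR ranks above TP53: its high statistical score outweighs its lower association score.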

Network/Pathway Approaches

  • network/pathway approaches use network information to identify (sub-) networks or pathways as new features and map the feature space from the original genes to the (sub-)networks
  • network/pathway approaches thus always have a) a feature, i.e. pathway/subnetwork, selection step and b) a mapping step where new feature values must be computed

NetworkActivity

  • feature selection as described by Tian et al.: “Discovering statistically significant pathways in expression profiling studies”

    • a pathway/subnetwork is considered relevant if the gene expression profiles of its member genes correlate with the data set classes
    • average ANOVA score from all pathway member genes and class labels
    • rank pathways (= new features) by their ANOVA scores
  • feature mapping is based on Vert and Kanehisa’s definition of pathway relevance and smoothness: “Graph-driven feature extraction from microarray data using diffusion kernels and kernel CCA”

    • computes pathway activity scores for every sample and pathway as new feature values
    • the feature value \(v_{p,s}\) for a pathway \(p\) and sample \(s\) is computed by taking the expression levels of all member genes \(i\) (\(expr_i\)) and weighting these by the variance \(var_i\) of gene \(i\) and the average correlation score \(corr_{i,neighbors_i}\) of its neighbor genes in pathway \(p\): \(average(expr_i \times var_i \times corr_{i,neighbors_i})\)
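
Both NetworkActivity steps can be sketched in a few lines. This is an illustration under assumptions, not Comprior's implementation: per-gene ANOVA scores are taken as given, the correlation is assumed to be Pearson correlation between a gene and each of its pathway neighbors, the variance is the population variance across samples, and all function names and toy values are hypothetical.

```python
from statistics import mean, pvariance

def rank_pathways(gene_scores, pathways):
    """Selection step: rank pathways by the average ANOVA score of their member genes."""
    avg = {p: mean(gene_scores[g] for g in genes) for p, genes in pathways.items()}
    return sorted(avg, key=avg.get, reverse=True)

def pearson(x, y):
    """Pearson correlation of two equally long value lists (0.0 if undefined)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return cov / norm if norm else 0.0

def pathway_activity(expr, neighbors):
    """Mapping step: average(expr_i * var_i * corr_i) per sample for one pathway.

    expr: gene -> expression values, one per sample
    neighbors: gene -> neighbor genes inside the pathway
    """
    n_samples = len(next(iter(expr.values())))
    weights = {}
    for gene, nbrs in neighbors.items():
        avg_corr = mean(pearson(expr[gene], expr[n]) for n in nbrs) if nbrs else 0.0
        weights[gene] = pvariance(expr[gene]) * avg_corr
    return [mean(expr[g][s] * w for g, w in weights.items()) for s in range(n_samples)]

# Toy inputs (made up for illustration).
gene_scores = {"A": 4.0, "B": 2.0, "C": 9.0, "D": 1.0}
pathways = {"P1": ["A", "B"], "P2": ["C", "D"], "P3": ["B", "D"]}
pathway_ranking = rank_pathways(gene_scores, pathways)

expr = {"A": [1.0, 2.0, 3.0], "B": [2.0, 4.0, 6.0], "C": [3.0, 2.0, 1.0]}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
activity_values = pathway_activity(expr, neighbors)  # one value per sample
```

In the toy pathway, gene B receives a weight of zero because its two neighbors correlate with it in opposite directions, so only A and C contribute to the activity values.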

CorgsNetworkActivity

  • feature selection as described for NetworkActivity

  • feature mapping as described by Lee et al.: “Inferring pathway activity toward precise disease classification”
    • the feature value \(v_{p,s}\) for a pathway \(p\) and sample \(s\) is computed in the following way:

      • find the subset of genes (=CORGs) for which the score \(S(CORGs)\) is maximal (via greedy search)
      • \(S(CORGs)\) comes from a t-test between an activity vector \(a = (a_1, ..., a_n)\) and class vector \(c = (c_1, ..., c_n)\) with \(n = \#samples\), i.e. every sample \(i\) gets an activity score \(a_i\) for the particular set of genes, and \(c_i\) is the class label of that sample
      • \(a_i\) is computed from \(\frac{average(expr_{i,CORGs})}{\sqrt{k}}\), with \(k = \#CORGs\) and \(expr_{i,CORGs}\) being the expression values of all CORGs genes for sample \(i\)
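
The greedy CORGs search can be sketched as follows. The sketch makes several assumptions that are not confirmed by the text above: two classes labeled 0 and 1, the absolute Welch t-statistic as \(S(CORGs)\), candidate genes ordered by their individual t-scores, and a single forward pass that adds a gene only if the score improves. All names and toy values are hypothetical.

```python
from statistics import mean, variance

def activity(expr, genes):
    """a_i = average(expr_{i,CORGs}) / sqrt(k) for every sample i."""
    k = len(genes)
    n_samples = len(expr[genes[0]])
    return [mean(expr[g][s] for g in genes) / k ** 0.5 for s in range(n_samples)]

def t_score(values, labels):
    """Absolute Welch t-statistic between the two classes (assumed labels 0 and 1)."""
    g0 = [v for v, c in zip(values, labels) if c == 0]
    g1 = [v for v, c in zip(values, labels) if c == 1]
    se = (variance(g0) / len(g0) + variance(g1) / len(g1)) ** 0.5
    return abs(mean(g0) - mean(g1)) / se if se else 0.0

def find_corgs(expr, labels):
    """Greedy forward pass: add a gene only if it increases S(CORGs)."""
    candidates = sorted(expr, key=lambda g: t_score(expr[g], labels), reverse=True)
    corgs = [candidates[0]]
    best = t_score(activity(expr, corgs), labels)
    for gene in candidates[1:]:
        score = t_score(activity(expr, corgs + [gene]), labels)
        if score > best:
            corgs.append(gene)
            best = score
    return corgs, best

# Toy pathway with three member genes and six samples (made up for illustration).
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "G1": [1.0, 1.4, 0.6, 3.0, 3.4, 2.6],
    "G2": [2.0, 1.0, 3.0, 2.0, 3.0, 1.0],
    "G3": [1.0, 0.7, 1.3, 3.0, 2.7, 3.3],
}

corgs, score = find_corgs(expr, labels)
feature_values = activity(expr, corgs)  # new feature values for this pathway
```

Here G3 and G1 are selected together because their noise partly cancels when averaged, which sharpens the class separation, while the uninformative G2 lowers the score and is left out.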