Python Code Documentation

pipeline module

The pipeline module is responsible for orchestrating the complete benchmarking process. It is the starting point that is invoked when running Comprior. It tidies up and prepares working directories, and creates and coordinates the execution order of preprocessing modules, feature selectors, and evaluation procedures.

class pipeline.Pipeline(userConfig)[source]

Bases: object

Class that executes the complete benchmarking pipeline.

Parameters:outputRootPath (str) – absolute path to the overall output directory (every evaluation.Evaluator extends it with its own folders).
prepareExecution(userConfig)[source]

Prepares the pipeline execution by loading the configuration file, clearing intermediate directories, and creating output directories.

Parameters:userConfig (str) – absolute path to an additional user configuration file (config.ini will always be used by default) to overwrite default configuration.
evaluateInputData(inputfile)[source]

Run evaluation.DatasetEvaluator to create plots as specified by the config’s Evaluation-preanalysis parameter.

Parameters:inputfile (str) – absolute path to the input data set to be analyzed.
evaluateKnowledgeBases(labeledInputDataPath)[source]

Evaluates knowledge base coverage for all knowledge bases that are used in the specified feature selection methods. Uses the class labels and alternativeSearchTerms from the config, queries the knowledge bases, and creates corresponding plots regarding coverage of these search terms.

Parameters:labeledInputDataPath (str) – absolute path to the labeled input data set.
runFeatureSelector(selector, datasetLocation, outputDir, loggingDir)[source]

Runs a given feature selector.

Parameters:
  • selector (featureselection.FeatureSelector) – Any feature selector that inherits from featureselection.FeatureSelector.
  • datasetLocation (str) – absolute path to the input data set (from which features should be selected).
  • outputDir (str) – absolute path to the selector’s output directory (where ranking will be written to).
selectFeatures(datasetLocation)[source]

Creates and runs all feature selectors that are listed in the config file. Applies parallelization by running as many feature selectors in parallel as stated in the config’s General–>numCores attribute.

Parameters:
  • datasetLocation (str) – absolute path to the input data set (from which features should be selected).
  • outputRootPath (str) – absolute path to the selector’s output directory (where ranking will be written to).
Returns:

absolute path to directory that contains generated feature rankings.

Return type:

str
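The parallelization described for selectFeatures() could look roughly like the following minimal sketch. It assumes a hypothetical run_selector() stand-in and a thread pool; whether Comprior itself uses threads or processes is not stated here, and num_cores stands in for the config's General numCores value.

```python
from concurrent.futures import ThreadPoolExecutor


def run_selector(name):
    """Hypothetical stand-in for running one feature selector."""
    return f"{name}.csv"  # pretend this is the ranking file that was written


def run_all(selector_names, num_cores):
    # At most num_cores selectors are active at the same time.
    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        # map() returns the results in the order of selector_names.
        return list(pool.map(run_selector, selector_names))


rankings = run_all(["ANOVA", "InfoGain", "Random"], num_cores=2)
```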

assignColors(methods)[source]

Assigns each (feature selection) method a unique color. Will be delivered later on to every evaluation.Evaluator instance to create visualizations with consistent coloring for evaluated approaches.

Parameters:methods (List of str) – List of method names.
Returns:Dictionary containing hex color codes for every method
Return type:dict
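One possible way to build such a dictionary is sketched below. It assumes evenly spaced hues from the standard colorsys module, which gives distinct colors for a handful of methods; the palette that Comprior actually uses may differ, and assign_colors is an illustrative name.

```python
import colorsys


def assign_colors(methods):
    """Map every method name to a hex color taken from evenly spaced hues."""
    colors = {}
    for i, method in enumerate(methods):
        hue = i / max(len(methods), 1)
        r, g, b = colorsys.hsv_to_rgb(hue, 0.7, 0.9)
        colors[method] = "#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)
        )
    return colors


palette = assign_colors(["ANOVA", "InfoGain", "Random"])
```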
assignMarkers(approaches)[source]

Assigns each (feature selection) method a unique marker. Will be delivered later on to every evaluation.Evaluator instance to create visualizations with consistent markers for evaluated approaches.

Parameters:approaches (List of str) – List of method names.
Returns:Dictionary containing a marker for every method
Return type:dict
evaluateBiomarkers(inputDir, dataset, rankingsDir)[source]

Covers the evaluation phase. Processes input data to only contain the top k selected features per feature selection approach via the evaluation.AttributeRemover. Runs all selected evaluation strategies that cover assessment of rankings (evaluation.RankingsEvaluator), annotations (evaluation.AnnotationEvaluator), and classification performance (evaluation.ClassificationEvaluator). If selected, also conducts cross-validation across data sets with evaluation.CrossEvaluator.

Parameters:
  • inputDir (str) – absolute path to the directory where input data sets are located (for evaluation.AttributeRemover).
  • dataset (str) – absolute file path to the input data set (from which features should be selected).
  • rankingsDir (str) – absolute path to the directory that contains all rankings.
preprocessData()[source]

Preprocesses the input data set specified in the config file. Preprocessing consists of a) transposing the data so that features are in the columns (if necessary), b) mapping the features to the right format (if necessary), c) labeling the data with the user-specified metadata attribute, d) filtering features or samples that have too little information (optional, specified via config), and finally e) putting the analysis-ready data set to the right location for further processing.

Returns:A tuple (final_filename, mapped_input) consisting of the absolute path to the analysis-ready data set and the absolute path to the mapped input
Return type:tuple(str,str)
loadConfig(userConfig)[source]

Loads the config files. config.ini will always be loaded as default config file, all other config files provided by userConfig overwrite corresponding values.

Parameters:userConfig (str or List of str, optional) – absolute path(s) to user-defined config files that should be used. If config files specify the same parameter, the value specified by the last config file in the list will be used.
prepareDirectories()[source]

Prepares the directory structure for a benchmarking run. Creates all necessary directories in the output folder. Also cleans up the intermediate directory so that no old data is accidentally used.

Returns:absolute path to the directory where all results from this run will be stored.
Return type:str
executePipeline()[source]
The entry point for the overall benchmarking process. This method is invoked when running the framework; from here, all other steps of the benchmarking process are encapsulated in their own methods.
Parameters:userConfig (str) – absolute path to an additional user configuration file (config.ini will always be used by default) to overwrite default configuration.

benchutils module

Utility module that provides functionality that is repeatedly used across the system, e.g. directory handling and file loading, identifier mapping, logging, and running external code from R or Java. It also loads and stores the configuration parameters.

benchutils.loadConfig(path)[source]

Loads the config files.

Parameters:path (str or list of str) – absolute path or list of absolute paths to the config files. For multiple config files specifying the same parameters, the ones from the last config file in the list will be used.
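The "last config file wins" behaviour can be illustrated with the standard configparser module. This is a minimal sketch that reads from strings instead of files so that it is self-contained; the outputDir entry is a hypothetical parameter, and it is an assumption that the actual loader is built on configparser.

```python
import configparser

DEFAULT_CONFIG = """
[General]
numCores = 1
outputDir = /tmp/out
"""

USER_CONFIG = """
[General]
numCores = 4
"""

config = configparser.ConfigParser()
config.optionxform = str  # keep parameter names case-sensitive
# Later sources overwrite values from earlier ones, like the last file in the list.
for source in (DEFAULT_CONFIG, USER_CONFIG):
    config.read_string(source)

num_cores = config["General"]["numCores"]    # taken from the user config
output_dir = config["General"]["outputDir"]  # untouched default value
```

With real files, config.read([default_path, user_path]) applies the same precedence.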
benchutils.getConfig(category)[source]

Get the config entries for a particular category.

Parameters:category (str) – category name.
Returns:all parameters for that config category
Return type:dict
benchutils.getConfigValue(category, identifier)[source]

Get the value for a given config parameter.

Parameters:
  • category (str) – the parameter’s category name.
  • identifier (str) – the parameter name.
Returns:

the parameter value.

Return type:

str

benchutils.getConfigBoolean(category, identifier)[source]

Get the boolean value for a given config parameter.

Parameters:
  • category (str) – the parameter’s category name.
  • identifier (str) – the parameter name.
Returns:

the parameter boolean value.

Return type:

bool
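The three accessors above map naturally onto configparser calls. The following sketch assumes a module-level parser and uses hypothetical config entries; the function names are illustrative equivalents, not the benchutils source.

```python
import configparser

_config = configparser.ConfigParser()
_config.optionxform = str  # keep parameter names as written
# Hypothetical entries, for illustration only.
_config.read_string("""
[Evaluation]
preanalysis = true
topKmax = 20
""")


def get_config(category):
    """All parameters of one category as a dict."""
    return dict(_config[category])


def get_config_value(category, identifier):
    """A single parameter value, always returned as a string."""
    return _config.get(category, identifier)


def get_config_boolean(category, identifier):
    """A single parameter value parsed as a boolean."""
    return _config.getboolean(category, identifier)
```

Note that values are returned as strings, so numeric parameters need an explicit conversion such as int(get_config_value("Evaluation", "topKmax")).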

benchutils.loadRanking(rankingFile)[source]

Load a feature ranking from a file.

Parameters:rankingFile (str) – absolute path to the file containing a feature ranking.
Returns:the feature ranking as a DataFrame.
Return type:pandas.DataFrame
benchutils.createOrClearDirectory(directoryLocation)[source]

If the provided directory location already exists, remove all files in that directory. Create a new directory otherwise.

Parameters:directoryLocation (str) – absolute path to the directory that must be cleared or created.
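A minimal sketch of this create-or-clear behaviour using only the standard library is shown below. Removing subdirectories as well as plain files is an assumption; the description above only mentions files.

```python
import os
import shutil
import tempfile


def create_or_clear_directory(directory_location):
    """Empty the directory if it exists, otherwise create it."""
    if os.path.isdir(directory_location):
        for entry in os.listdir(directory_location):
            path = os.path.join(directory_location, entry)
            # Assumption: subdirectories are removed as well as plain files.
            if os.path.isdir(path) and not os.path.islink(path):
                shutil.rmtree(path)
            else:
                os.remove(path)
    else:
        os.makedirs(directory_location)


root = tempfile.mkdtemp()
target = os.path.join(root, "intermediate")
create_or_clear_directory(target)                   # creates the directory
open(os.path.join(target, "old.csv"), "w").close()  # leftover from a former run
create_or_clear_directory(target)                   # removes old.csv
remaining = os.listdir(target)
```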
benchutils.createDirectory(directoryLocation)[source]

Creates a directory.

Parameters:directoryLocation (str) – absolute path to the directory to be created.
benchutils.removeDirectoryContent(directoryLocation)[source]

Remove the files inside a directory.

Parameters:directoryLocation (str) – absolute path to the directory that must be cleared.
benchutils.removeFile(file)[source]

Delete a file.

Parameters:file (str) – absolute path to the file that must be deleted.
benchutils.cleanupResults()[source]

Remove all intermediate files from former runs, e.g. generated during preprocessing or mapping.

benchutils.createLogger(outputPath)[source]

Create a logger for Comprior. Creates two handlers for this logger: one for console output that only contains high-level status update logs and error messages, and one for an extra log file, to which warnings and other tracing information are written.

Parameters:outputPath (str) – absolute path to where the log file will be stored.
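The two-handler setup can be sketched with the standard logging module as follows. The filter rule (console shows INFO and ERROR or worse), the logger name, and the file name run.log are assumptions made for illustration.

```python
import logging
import os
import sys
import tempfile


class ConsoleFilter(logging.Filter):
    """Pass INFO status updates and ERROR-or-worse records only."""

    def filter(self, record):
        return record.levelno == logging.INFO or record.levelno >= logging.ERROR


def create_logger(output_path, name="comprior_sketch"):
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # avoid duplicate output via the root logger
    logger.handlers.clear()

    console = logging.StreamHandler(sys.stdout)
    console.addFilter(ConsoleFilter())
    logger.addHandler(console)

    # The file handler has no filter, so it receives every record.
    log_file = logging.FileHandler(os.path.join(output_path, "run.log"))
    logger.addHandler(log_file)
    return logger


log_dir = tempfile.mkdtemp()
logger = create_logger(log_dir)
logger.info("Starting feature selection")      # console and file
logger.warning("3 genes could not be mapped")  # file only
```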
benchutils.logDebug(message)[source]

Write a log at debug level.

Parameters:message (str) – the log message to print.
benchutils.logInfo(message)[source]

Write a log at info level.

Parameters:message (str) – the log message to print.
benchutils.logWarning(message)[source]

Write a log at warning level.

Parameters:message (str) – the log message to print.
benchutils.logError(message)[source]

Write a log at error level.

Parameters:message (str) – the log message to print.
benchutils.createTimeLog()[source]

Create the data structure for tracing runtimes of feature selection approaches.

Returns:the logging data structure.
Return type:pandas.DataFrame
benchutils.flushTimeLog(timeLogs, outputFilePath)[source]

Write the whole log (of runtimes) to a file.

Parameters:
  • timeLogs (pandas.DataFrame) – the logs in a DataFrame.
  • outputFilePath (str) – absolute path to the log file.
benchutils.logRuntime(timeLogs, start, end, message)[source]

Write a runtime log entry and add it to the runtime log data structure.

Parameters:
  • timeLogs (pandas.DataFrame) – logs to which the new entry should be added
  • start (str) – starting time.
  • end (str) – ending time.
  • message (str) – description of that entry.
Returns:

updated logs.

Return type:

pandas.DataFrame

benchutils.runRCommand(rConfig, scriptName, params)[source]

Run external R code.

Parameters:
  • rConfig (dict) – R config parameters (store paths to Rscript and the R code).
  • scriptName (str) – name of the R script to be executed.
  • params (list of str) – list of parameters that will be forwarded to the R script.
benchutils.runJavaCommand(javaConfig, scriptName, params)[source]

Run external Java code.

Parameters:
  • javaConfig (dict) – java config parameters (store paths to java and the java code).
  • scriptName (str) – name of the jar to be executed.
  • params (list of str) – list of parameters that will be forwarded to the jar.
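Running external R or Java code typically comes down to assembling a command line and handing it to subprocess. The sketch below shows one way to do that; the config keys (RscriptLocation, JavaLocation, code) and the script name are hypothetical and only illustrate the idea of storing executable and code paths in the config.

```python
import os
import subprocess


def build_r_command(r_config, script_name, params):
    # Hypothetical config keys: "RscriptLocation" and "code" are illustrative names.
    script = os.path.join(r_config["code"], script_name)
    return [r_config["RscriptLocation"], script, *params]


def build_java_command(java_config, jar_name, params):
    # Hypothetical config keys: "JavaLocation" and "code" are illustrative names.
    jar = os.path.join(java_config["code"], jar_name)
    return [java_config["JavaLocation"], "-jar", jar, *params]


def run_command(command):
    # check=True raises CalledProcessError if the external program fails.
    return subprocess.run(command, check=True, capture_output=True, text=True)


r_cmd = build_r_command(
    {"RscriptLocation": "Rscript", "code": "R"}, "myScript.R", ["in.csv", "out.csv"]
)
java_cmd = build_java_command(
    {"JavaLocation": "java", "code": "java"}, "mySelector.jar", ["in.csv"]
)
```

Passing the command as a list (rather than a single shell string) avoids quoting problems with paths that contain spaces.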
benchutils.mapIdentifiers(itemList, originalFormat, desiredFormat)[source]

Map a list of identifiers from their original format to the desired format.

Parameters:
  • itemList (list of str) – list of identifiers, e.g. gene names, to be mapped
  • originalFormat (str) – current format of the identifiers.
  • desiredFormat (str) – desired format to which the identifiers should be mapped.
Returns:

mapping table where every item from itemList is now mapped to desiredFormat.

Return type:

pandas.DataFrame

benchutils.mapGeneList(genes, originalFormat, desiredFormat, outputFile)[source]

Map a list of genes to the desired format.

Parameters:
  • genes (list of str) – list of gene names to be mapped
  • originalFormat (str) – current format of the gene names.
  • desiredFormat (str) – desired format to which the gene names should be mapped.
  • outputFile (str) – absolute path to the output file in which the mapping should be stored.
Returns:

list of mapped gene names.

Return type:

list of str

benchutils.mapRanking(ranking, originalFormat, desiredFormat, outputFile)[source]

Map the feature names of a ranking to the desired format.

Parameters:
  • ranking (pandas.DataFrame) – DataFrame of the ranking.
  • originalFormat (str) – current format of the feature names in the ranking.
  • desiredFormat (str) – desired format to which the feature names should be mapped.
  • outputFile (str) – absolute path to the output file in which the mapped feature ranking should be stored.
Returns:

mapped feature ranking.

Return type:

pandas.DataFrame

benchutils.retrieveMappings(itemList, originalFormat, desiredFormat)[source]

Query the knowledge base to map the identifiers. We have mapping via BiomaRt and gConvert available. gConvert is currently used because BiomaRt is unstable and blocks when parallel queries are sent.

Parameters:
  • itemList (list of str) – list of identifier names to be mapped
  • originalFormat (str) – current format of the identifiers.
  • desiredFormat (str) – desired format to which the identifiers should be mapped.
Returns:

mapping table for all identifiers.

Return type:

pandas.DataFrame

benchutils.mapDataMatrix(inputMatrix, genesInColumns, originalFormat, desiredFormat, outputFile, labeled)[source]

Map the features of a data set to the desired format.

Parameters:
  • inputMatrix (pandas.DataFrame) – DataFrame of the data set.
  • genesInColumns (bool) – if the genes/features are located in the columns.
  • originalFormat (str) – current format of the feature names in the data set.
  • desiredFormat (str) – desired format to which the feature names should be mapped.
  • outputFile (str) – absolute path to the output file in which the mapped data set should be stored.
  • labeled (bool) – if the data matrix is additionally labeled.
Returns:

mapped data set.

Return type:

pandas.DataFrame

preprocessing module

Contains all classes related to preprocessing. All classes providing preprocessing functionality have to inherit from the abstract class preprocessing.Preprocessor and implement its preprocessing.Preprocessor.preprocess() method. For a detailed look at the class architecture, have a look at ADD CLASS ARCHITECTURE LINK HERE.

class preprocessing.Preprocessor(input, metadata, output)[source]

Bases: object

Super class of all preprocessor implementations. Inherit from this class and implement preprocessing.Preprocessor.preprocess() if you want to add a new preprocessor class.

Parameters:
  • input (str) – absolute path to the input file.
  • metadata (str) – absolute path to the metadata file.
  • output (str) – absolute path to the output directory.
preprocess()[source]

Abstract method. Interface method that is invoked externally to trigger preprocessing.

Returns:absolute path to the preprocessed output file.
Return type:str
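Adding a new preprocessor could look like the sketch below. The Preprocessor class here is a stand-in that only mirrors the documented constructor and abstract method, and BlankLineRemover is a hypothetical example subclass.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class Preprocessor(ABC):
    """Stand-in for preprocessing.Preprocessor with the documented constructor."""

    def __init__(self, input, metadata, output):
        self.input = input
        self.metadata = metadata
        self.output = output

    @abstractmethod
    def preprocess(self):
        """Return the absolute path to the preprocessed output file."""


class BlankLineRemover(Preprocessor):
    """Hypothetical preprocessor that drops empty lines from the input file."""

    def preprocess(self):
        with open(self.input) as handle:
            lines = [line for line in handle if line.strip()]
        out_path = os.path.join(self.output, "cleaned_" + os.path.basename(self.input))
        with open(out_path, "w") as handle:
            handle.writelines(lines)
        return out_path


work_dir = tempfile.mkdtemp()
in_path = os.path.join(work_dir, "data.csv")
with open(in_path, "w") as handle:
    handle.write("sample,TP53\n\ns1,0.5\n")

result = BlankLineRemover(in_path, None, work_dir).preprocess()
```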
class preprocessing.MappingPreprocessor(input, output, currentFormat, desiredFormat, labeled)[source]

Bases: preprocessing.Preprocessor

Maps the input data set to a desired format.

Parameters:
  • input (str) – absolute path to the input file.
  • output (str) – absolute path to the output directory.
  • currentFormat (str) – current identifier format.
  • desiredFormat (str) – desired identifier format.
  • labeled (bool) – boolean value if the input data is labeled.
preprocess()[source]

Maps the identifiers in the input dataset to the desired format that was specified when constructing the preprocessor.

Returns:absolute path to the mapped file.
Return type:str
class preprocessing.FilterPreprocessor(input, metadata, output)[source]

Bases: preprocessing.Preprocessor

Filters features or samples above a user-defined threshold of missing values.

Parameters:
  • input (str) – absolute path to the input file.
  • metadata (str) – absolute path to the metadata file.
  • output (str) – absolute path to the output directory.
  • config (str) – configuration parameter for preprocessing as specified in the config file.
preprocess()[source]

Depending on what is specified in the config file, filter samples and/or features. Remove all samples/features that have missing values above the threshold specified in the config.

Returns:absolute path to the filtered output file.
Return type:str
filterMissings(threshold, data)[source]

Filter the data for entries that have missing information above the given threshold.

Parameters:
  • threshold (str) – maximum percentage of allowed missing items as string.
  • data (pandas.DataFrame) – a DataFrame to be filtered
Returns:

filtered DataFrame.

Return type:

pandas.DataFrame
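The filtering rule can be sketched without pandas as follows. A plain dict of columns stands in for the DataFrame, None marks a missing value, and treating the threshold as inclusive is an assumption; columns are assumed to be non-empty.

```python
def filter_missings(threshold, columns):
    """Keep columns whose share of missing values does not exceed threshold.

    threshold: maximum allowed percentage of missing items, given as a string.
    columns:   dict mapping a column name to its list of values (None = missing).
    """
    max_percent = float(threshold)
    kept = {}
    for name, values in columns.items():
        missing_percent = 100.0 * sum(v is None for v in values) / len(values)
        if missing_percent <= max_percent:
            kept[name] = values
    return kept


data = {
    "TP53": [0.5, 1.2, None, 0.9],     # 25 % missing
    "BRCA1": [None, None, None, 1.1],  # 75 % missing
}
filtered = filter_missings("50", data)
```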

class preprocessing.DataTransformationPreprocessor(input, metadata, output, dataSeparator)[source]

Bases: preprocessing.Preprocessor

Transform the input data to have features in the columns for subsequent processing.

Parameters:
  • input (str) – absolute path to the input file.
  • metadata (str) – absolute path to the metadata file.
  • output (str) – absolute path to the output directory.
  • dataSeparator (str) – delimiter to use when parsing the input file.
preprocess()[source]

If not already so, transpose the input data to have the features in the columns.

Returns:absolute path to the correctly formatted output file.
Return type:str
class preprocessing.MetaDataPreprocessor(input, metadata, output, separator)[source]

Bases: preprocessing.Preprocessor

Add labels to input data. Get labels from meta data attribute that was specified in the user config.

Parameters:
  • input (str) – absolute path to the input file.
  • metadata (str) – absolute path to the metadata file.
  • output (str) – absolute path to the output directory.
  • dataSeparator (str) – delimiter to use when parsing the input and metadata file.
  • diseaseColumn (str) – column name of the class labels.
  • transposeMetadataMatrix (bool) – boolean value if the identifier names are located in the columns, as specified in the config file.
preprocess()[source]

Labels all samples of a data set. Labels are taken from the corresponding metadata file and the metadata attribute that was specified in the config file. Samples without metadata information will be assigned to class “NotAvailable”.

Returns:absolute path to the labeled data set.
Return type:str
class preprocessing.DataMovePreprocessor(input, output)[source]

Bases: preprocessing.Preprocessor

Moves the input data set to the specified location.

Parameters:
  • input (str) – absolute path to the input file.
  • output (str) – absolute path to the output directory.
preprocess()[source]

Moves a file (self.input) to another location (self.output). Typically used at the end of preprocessing, when the final data set is moved to a new location for the actual analysis.

Returns:absolute path to the new file location.
Return type:str

featureselection module

Contains all classes related to feature selection. Each feature selection approach must be implemented in its own class inheriting from the abstract super class featureselection.FeatureSelector or one of its abstract subclasses, e.g. for including R or Java code. Each feature selection class must implement setParams() and selectFeatures(), as input or output parameters are just set at runtime.

Feature extraction methods are implemented in the same structure, except that they need to have an instance of a class inheriting from featureselection.PathwayMapper assigned to them so that the feature space can be transformed from the original to the new, e.g. pathways.

The creation of feature selectors is encapsulated by the class featureselection.FeatureSelectorFactory that takes care that every selector is equipped correspondingly, e.g. with a knowledge base or another feature selector. For a detailed look at the class architecture and the inheritance structure, have a look at ADD CLASS ARCHITECTURE LINK HERE.

class featureselection.FeatureSelectorFactory[source]

Bases: object

Singleton class. Python code encapsulates it in a way that is not shown in Sphinx, so have a look at the descriptions in the source code.

Creates a feature selector object based on a given name. New feature selection approaches must be registered here. Names for feature selectors must follow a particular scheme, with keywords separated by _:
  • the first keyword is the actual selector name.
  • if needed, the second keyword is the knowledge base.
  • if needed, the third keyword is the (traditional) approach to be combined.

Examples:
  • Traditional approaches have only one keyword, e.g. InfoGain or ANOVA.
  • LassoPenalty_KEGG provides KEGG information to the LassoPenalty feature selection approach.
  • Weighted_KEGG_InfoGain –> the factory creates an instance of KBweightedSelector, which uses KEGG as knowledge base and InfoGain as traditional selector.

While the focus here lies on the combination of traditional approaches with prior biological knowledge, it is theoretically possible to use ANY selector object for combination that inherits from FeatureSelector.

Parameters:config (dict) – configuration parameters for UMLS web service as specified in config file.
instance = None
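The naming scheme used by FeatureSelectorFactory (up to three keywords separated by _) can be illustrated with a small parser. This is a sketch of the scheme only, with an illustrative function name; it does not reproduce how the factory actually instantiates selectors.

```python
def parse_selector_name(name):
    """Split a selector name into (selector, knowledge base, traditional approach)."""
    keywords = name.split("_")
    if not 1 <= len(keywords) <= 3 or "" in keywords:
        raise ValueError(f"unexpected selector name: {name!r}")
    # Pad with None so that missing keywords are explicit.
    selector, knowledge_base, approach = keywords + [None] * (3 - len(keywords))
    return selector, knowledge_base, approach


traditional = parse_selector_name("ANOVA")
modifying = parse_selector_name("LassoPenalty_KEGG")
combining = parse_selector_name("Weighted_KEGG_InfoGain")
```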
class featureselection.FeatureSelector(name)[source]

Bases: object

Abstract super class for feature selection functionality. Every feature selection class has to inherit from this class and implement its FeatureSelector.selectFeatures() method and - if necessary - its FeatureSelector.setParams() method. Once created, feature selection can be triggered by first setting parameters (input, output, etc) as needed with FeatureSelector.setParams(). The actual feature selection is triggered by invoking FeatureSelector.selectFeatures().

Parameters:
  • input (str) – absolute path to input dataset.
  • output (str) – absolute path to output directory (where the ranking will be stored).
  • dataset (pandas.DataFrame) – the dataset for which to select features. Will be loaded dynamically based on self.input at first usage.
  • dataConfig (dict) – config parameters for input data set.
  • name (str) – selector name
selectFeatures()[source]

Abstract. Invoke feature selection functionality in this method when implementing a new selector.

Returns:absolute path to the output ranking file.
Return type:str
getTimeLogs()[source]

Gets all logs for this selector.

Returns:dataframe of logged events containing start/end time, duration, and a short description.
Return type:pandas.DataFrame
setTimeLogs(newTimeLogs)[source]

Overwrites the current logs with new ones.

Parameters:newTimeLogs (pandas.DataFrame) – new dataframe of logged events containing start/end time, duration, and a short description.
disableLogFlush()[source]

Disables log flushing (i.e., writing the log to a separate file) of the selector at the end of feature selection. This is needed when a CombiningSelector uses a second selector and wants to avoid that its log messages are written, potentially overwriting logs from another selector of the same name.

enableLogFlush()[source]

Enables log flushing, i.e. writing the logs to a separate file at the end of feature selection.

getName()[source]

Gets the selector’s name.

Returns:selector name.
Return type:str
getData()[source]

Gets the labeled dataset from which to select features.

Returns:dataframe containing the dataset with class labels.
Return type:pandas.DataFrame
getUnlabeledData()[source]

Gets the dataset without labels.

Returns:dataframe containing the dataset without class labels.
Return type:pandas.DataFrame
getFeatures()[source]

Gets features from the dataset.

Returns:list of features.
Return type:list of str
getUniqueLabels()[source]

Gets the unique class labels available in the dataset.

Returns:list of distinct class labels.
Return type:list of str
getLabels()[source]

Gets the labels in the data set.

Returns:all labels from the dataset.
Return type:list of str
setParams(inputPath, outputDir, loggingDir)[source]

Sets parameters for the feature selection run: path to the input dataset, path to the output directory, and path to the logging directory.

Parameters:
  • inputPath (str) – absolute path to the input file containing the dataset for analysis.
  • outputDir (str) – absolute path to the output directory (where to store the ranking)
  • loggingDir (str) – absolute path to the logging directory (where to store log files)
writeRankingToFile(ranking, outputFile, index=False)[source]

Writes a given ranking to a specified file.

Parameters:
  • ranking (pandas.DataFrame) – dataframe with the ranking.
  • outputFile (str) – absolute path of the file where ranking will be stored.
  • index (bool, default False) – whether to write the dataframe’s index or not.
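The life cycle described for FeatureSelector (create, setParams(), then selectFeatures()) is sketched below with a stand-in base class. MeanSelector, the classLabel column, and the attributeName/score ranking header are hypothetical and chosen only to make the example self-contained.

```python
import csv
import os
import tempfile
from abc import ABC, abstractmethod


class FeatureSelector(ABC):
    """Stand-in mirroring the documented setParams()/selectFeatures() interface."""

    def __init__(self, name):
        self.name = name
        self.input = self.output = self.loggingDir = None

    def setParams(self, inputPath, outputDir, loggingDir):
        self.input, self.output, self.loggingDir = inputPath, outputDir, loggingDir

    @abstractmethod
    def selectFeatures(self):
        """Return the absolute path to the output ranking file."""


class MeanSelector(FeatureSelector):
    """Hypothetical selector that ranks features by their mean value."""

    def selectFeatures(self):
        with open(self.input, newline="") as handle:
            rows = list(csv.DictReader(handle))
        features = [c for c in rows[0] if c != "classLabel"]
        means = {f: sum(float(r[f]) for r in rows) / len(rows) for f in features}
        ranking = sorted(means.items(), key=lambda item: item[1], reverse=True)
        out_path = os.path.join(self.output, self.name + ".csv")
        with open(out_path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["attributeName", "score"])
            writer.writerows(ranking)
        return out_path


work_dir = tempfile.mkdtemp()
data_path = os.path.join(work_dir, "data.csv")
with open(data_path, "w") as handle:
    handle.write("classLabel,TP53,BRCA1\ntumor,1.0,3.0\nhealthy,2.0,5.0\n")

selector = MeanSelector("Mean")
selector.setParams(data_path, work_dir, work_dir)  # set paths at runtime first
ranking_file = selector.selectFeatures()           # then trigger the selection
```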
class featureselection.PythonSelector(name)[source]

Bases: featureselection.FeatureSelector

Abstract. Inherit from this class when implementing a feature selector using any of scikit-learn’s functionality. As functionality invocation, input preprocessing and output postprocessing are typically very similar/the same for such implementations, this class already encapsulates it. Instead of implementing PythonSelector.selectFeatures(), implement PythonSelector.runSelector().

runSelector(data, labels)[source]

Abstract - implement this method when inheriting from this class. Runs the actual feature selector of scikit-learn. Is invoked by PythonSelector.selectFeatures().

Parameters:
  • data (pandas.DataFrame) – dataframe containing the unlabeled dataset.
  • labels (list of int) – numerically encoded class labels.
Returns:

sklearn/mlxtend selector that ran the selection (containing coefficients etc.).

selectFeatures()[source]

Executes the feature selection procedure. Prepares the input data set to match scikit-learn’s expected formats and postprocesses the output to create a ranking.

Returns:absolute path to the output ranking file.
Return type:str
prepareInput()[source]

Prepares the input data set before running any of scikit-learn’s selectors. Removes the labels from the input data set and encodes the labels in numbers.

Returns:dataset (without labels) and labels encoded in numbers.
Return type:pandas.DataFrame and list of int
prepareOutput(outputFile, data, selector)[source]

Transforms the selector output to a valid ranking and stores it into the specified file.

Parameters:
  • outputFile (str) – absolute path of the file to which to write the ranking.
  • data (pandas.DataFrame) – input dataset.
  • selector – selector object from scikit-learn.
class featureselection.RSelector(name)[source]

Bases: featureselection.FeatureSelector

Selector class for invoking R code for feature selection. Inherit from this class if you want to use R code, implement RSelector.createParams() with what your script requires, and set self.scriptName accordingly.

Parameters:rConfig (dict) – config parameters to execute R code.
createParams(filename)[source]

Abstract. Implement this method to set the parameters your R script requires.

Parameters:filename (str) – absolute path of the output file.
Returns:list of parameters to use for R code execution, e.g. input and output filenames.
Return type:list of str
selectFeatures()[source]

Triggers the feature selection. Actually a wrapper method that invokes external R code.

Returns:absolute path to the result file containing the ranking.
Return type:str
class featureselection.JavaSelector(name)[source]

Bases: featureselection.FeatureSelector

Selector class for invoking Java code for feature selection. Inherit from this class if you want to use Java code, implement JavaSelector.createParams() with what your code requires, and set self.scriptName accordingly.

Parameters:javaConfig (dict) – config parameters to execute java code.
createParams()[source]

Abstract. Implement this method to set the parameters your java code requires.

Returns:list of parameters to use for java code execution, e.g. input and output filenames.
Return type:list of str
selectFeatures()[source]

Triggers the feature selection. Actually a wrapper method that invokes external java code.

Returns:absolute path to the result file containing the ranking.
Return type:str
class featureselection.PriorKnowledgeSelector(name, knowledgebase)[source]

Bases: featureselection.FeatureSelector

Super class for all prior knowledge approaches. If you want to implement your own prior knowledge approach that uses a knowledge base (but neither a second selector nor a network approach), inherit from this class.

Parameters:
  • knowledgebase (knowledgebases.KnowledgeBase or inheriting class) – instance of a knowledge base.
  • alternativeSearchTerms (list of str) – list of alternative search terms to use for querying the knowledge base.
selectFeatures()[source]

Abstract. Implement this method when inheriting from this class.

Returns:absolute path to the output ranking file.
Return type:str
collectAlternativeSearchTerms()[source]

Gets all alternative search terms that were specified in the config file and put them into a list.

Returns:list of alternative search terms to use for querying the knowledge base.
Return type:list of str
getSearchTerms()[source]

Gets all search terms to use for querying a knowledge base. Search terms that will be used are a) the class labels in the dataset, and b) the alternative search terms that were specified in the config file.

Returns:list of search terms to use for querying the knowledge base.
Return type:list of str
getName()[source]

Returns the full name (including applied knowledge base) of this selector.

Returns:selector name.
Return type:str
class featureselection.CombiningSelector(name, knowledgebase, tradApproach)[source]

Bases: featureselection.PriorKnowledgeSelector

Super class for prior knowledge approaches that use a knowledge base AND combine it with any kind of selector, e.g. a traditional approach. Inherit from this class if you want to implement a feature selector that requires both a knowledge base and another selector, e.g. because it combines information from both.

Parameters:
  • knowledgebase (knowledgebases.KnowledgeBase or inheriting class) – instance of a knowledge base.
  • tradApproach (FeatureSelector) – any feature selector implementation to use internally, e.g. a traditional approach like ANOVA
selectFeatures()[source]

Abstract. Implement this method as desired when inheriting from this class.

Returns:absolute path to the output ranking file.
Return type:str
getName()[source]

Returns the full name (including applied knowledge base and feature selector) of this selector.

Returns:selector name.
Return type:str
getExternalGenes()[source]

Gets all genes related to the provided search terms from the knowledge base.

Returns:list of gene names.
Return type:list of str
class featureselection.NetworkSelector(name, knowledgebase, featuremapper)[source]

Bases: featureselection.PriorKnowledgeSelector

Abstract. Inherit from this class if you want to implement a new network approach that actually conducts feature EXTRACTION, i.e. maps the original data set to pathways/subnetworks as new features. Instead of FeatureSelector.selectFeatures(), implement NetworkSelector.selectPathways() when inheriting from this class.

Instances of NetworkSelector and inheriting classes also require a PathwayMapper object that transfers the dataset to the new feature space. Custom implementations thus need to implement a) a selection strategy to select pathways and b) a mapping strategy to compute new feature values for the selected pathways.

Parameters:featureMapper (FeatureMapper or inheriting class) – feature mapping object that transfers the feature space.
selectPathways(pathways)[source]

Selects the pathways that will become the new features of the data set. Implement this method (instead of FeatureSelector.selectFeatures()) when inheriting from this class.

Parameters:pathways (dict) – dict of pathways (pathway names as keys) to select from.
Returns:pathway ranking as dataframe
Return type:pandas.DataFrame
writeMappedFile(mapped_data, fileprefix)[source]

Writes the mapped dataset with new feature values to the same directory as the original file is located (it will be automatically processed then).

Parameters:
  • mapped_data (pandas.DataFrame) – dataframe containing the dataset with mapped feature space.
  • fileprefix (str) – prefix of the file name, e.g. the directory path
Returns:

absolute path of the file name to store the mapped data set.

Return type:

str

getName()[source]

Gets the selector name (including the knowledge base).

Returns:selector name.
Return type:str
filterPathways(pathways)[source]
selectFeatures()[source]

Instead of selecting existing features, instances of NetworkSelector select pathways or submodules as features. To do so, the selector first queries its knowledge base for pathways. It then selects the top k pathways (strategy to be implemented in NetworkSelector.selectPathways()) and subsequently maps the dataset to its new feature space. The mapping is conducted by an object of FeatureMapper or inheriting classes. If a second dataset for cross-validation is available, the feature space of this dataset is transformed as well.

Returns:absolute path to the pathway ranking.
Return type:str
class featureselection.RandomSelector[source]

Bases: featureselection.FeatureSelector

Baseline Selector: Randomly selects any features.

selectFeatures()[source]

Randomly selects features from the feature space. Assigns a score of 0.0 to every feature.

Returns:absolute path to the ranking file.
Return type:str
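
A minimal sketch of such a random baseline, assuming the features are available as a plain list of names. The function name random_ranking and the seed parameter are illustrative only and not part of Comprior, which writes the ranking to a file instead of returning it.

```python
import random


def random_ranking(features, seed=None):
    """Return (feature, score) pairs in random order, every score fixed at 0.0."""
    rng = random.Random(seed)   # local generator, so a seed gives a reproducible order
    shuffled = list(features)   # copy, so the caller's list stays untouched
    rng.shuffle(shuffled)
    return [(name, 0.0) for name in shuffled]


ranking = random_ranking(["TP53", "BRCA1", "EGFR", "MYC"], seed=42)
```

Because every score is 0.0, only the position in the returned list carries ranking information.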
class featureselection.AnovaSelector[source]

Bases: featureselection.PythonSelector

Runs ANOVA feature selection using the scikit-learn implementation.

runSelector(data, labels)[source]

Runs the ANOVA feature selector of scikit-learn. Is invoked by PythonSelector.selectFeatures().

Parameters:
  • data (pandas.DataFrame) – dataframe containing the unlabeled dataset.
  • labels (list of int) – numerically encoded class labels.
Returns:

sklearn/mlxtend selector that ran the selection (containing coefficients etc.).

class featureselection.Variance2Selector[source]

Bases: featureselection.PythonSelector

Runs variance-based feature selection using scikit-learn.

prepareOutput(outputFile, data, selector)[source]

Transforms the selector output to a valid ranking and stores it in the specified file. We need to override this method because the variance selector has no scores attribute but a variances attribute instead.

Parameters:
  • outputFile (str) – absolute path of the file to which to write the ranking.
  • data (pandas.DataFrame) – input dataset.
  • selector – selector object from scikit-learn.
runSelector(data, labels)[source]

Runs the actual variance-based feature selector of scikit-learn. Is invoked by PythonSelector.selectFeatures().

Parameters:
  • data (pandas.DataFrame) – dataframe containing the unlabeled dataset.
  • labels (list of int) – numerically encoded class labels.
Returns:

sklearn/mlxtend selector that ran the selection (containing coefficients etc.).

class featureselection.MRMRSelector[source]

Bases: featureselection.RSelector

Runs maximum Relevance minimum Redundancy (mRMR) feature selection using the mRMRe R implementation: https://cran.r-project.org/web/packages/mRMRe/index.html Actually a wrapper class for invoking the R code.

Parameters:
  • scriptName (str) – name of the R script to invoke.
  • maxFeatures (int) – maximum number of features to select. Currently all features (=0) are ranked.
createParams(outputFile)[source]

Sets the parameters the R script requires (input file, output file, maximum number of features).

Returns:list of parameters to use for mRMR execution in R.
Return type:list of str
class featureselection.VarianceSelector[source]

Bases: featureselection.RSelector

Runs variance-based feature selection using R genefilter library. Actually a wrapper class for invoking the R code.

Parameters:scriptName (str) – name of the R script to invoke.
createParams(outputFile)[source]

Sets the parameters the R script requires (input file, output file).

Parameters:outputFile (str) – absolute path to the output file that will contain the ranking.
Returns:list of parameters to use for the variance-based selection in R.
Return type:list of str
class featureselection.InfoGainSelector[source]

Bases: featureselection.JavaSelector

Runs InfoGain feature selection as provided by WEKA: https://www.cs.waikato.ac.nz/ml/weka/ Actually a wrapper class for invoking java code.

createParams()[source]

Sets the parameters the java program requires (input file, output file, selector name).

Returns:list of parameters to use for InfoGain execution in java.
Return type:list of str
class featureselection.ReliefFSelector[source]

Bases: featureselection.JavaSelector

Runs ReliefF feature selection as provided by WEKA: https://www.cs.waikato.ac.nz/ml/weka/ Actually a wrapper class for invoking java code.

createParams()[source]

Sets the parameters the java program requires (input file, output file, selector name).

Returns:list of parameters to use for ReliefF execution in java.
Return type:list of str
class featureselection.KbSelector(knowledgebase)[source]

Bases: featureselection.PriorKnowledgeSelector

Knowledge base selector. Selects features exclusively based on the information retrieved from a knowledge base.

Parameters:knowledgebase (knowledgebases.KnowledgeBase) – instance of a knowledge base.
updateScores(entry, newGeneScores)[source]

Updates a score entry with the new score retrieved from the knowledge base. Used by apply function.

Parameters:
  • entry (pandas.Series) – a gene score entry consisting of the gene name and its score
  • newGeneScores (pandas.DataFrame) – dataframe containing gene scores retrieved from the knowledge base.
Returns:

updated series element.

Return type:

pandas.Series

selectFeatures()[source]

Does the actual feature selection. Retrieves association scores for genes from the knowledge base based on the given search terms.

Returns:absolute path to the resulting ranking file.
Return type:str
class featureselection.KBweightedSelector(knowledgebase, tradApproach)[source]

Bases: featureselection.CombiningSelector

Selects features based on association scores retrieved from the knowledge base and the relevance score retrieved by the (traditional) approach. Computes the final score via tradScore * assocScore.

Parameters:
  • knowledgebase (knowledgebases.KnowledgeBase or inheriting class) – instance of a knowledge base.
  • tradApproach (FeatureSelector) – any feature selector implementation to use internally, e.g. a traditional approach like ANOVA
updateScores(entry, newGeneScores)[source]

Updates a score entry with the new score retrieved from the knowledge base. Used by apply function.

Parameters:
  • entry (pandas.Series) – a gene score entry consisting of the gene name and its score
  • newGeneScores (pandas.DataFrame) – dataframe containing gene scores retrieved from the knowledge base.
Returns:

updated series element.

Return type:

pandas.Series

getName()[source]

Gets the selector name (including the knowledge base and (traditional) selector).

Returns:selector name.
Return type:str
computeStatisticalRankings(intermediateDir)[source]

Computes the statistical relevance score of all features using the (traditional) selector.

Parameters:intermediateDir (str) – absolute path to output directory for (traditional) selector (where to write the statistical rankings).
Returns:dataframe with statistical ranking.
Return type:pandas.DataFrame
computeExternalRankings()[source]

Computes the association scores for every gene using the knowledge base. Genes for which no entry could be found receive a default score of 0.000001.

Returns:dataframe with the external ranking (association scores from the knowledge base).
Return type:pandas.DataFrame
combineRankings(externalRankings, statisticalRankings)[source]

Combines score rankings from both the knowledge base and the (traditional) selector (kb_score * trad_score) to retrieve a final score for every gene.

Parameters:
  • externalRankings (pandas.DataFrame) – dataframe with ranking from knowledge base.
  • statisticalRankings (pandas.DataFrame) – dataframe with statistical ranking.
Returns:

dataframe with final combined ranking.

Return type:

pandas.DataFrame
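
The combination rule (kb_score * trad_score, with the default score of 0.000001 for genes missing from the knowledge base) can be sketched as follows. This is a simplified illustration that uses plain dicts instead of the pandas DataFrames used by Comprior; the function name combine_rankings and the descending sort order are assumptions.

```python
DEFAULT_SCORE = 0.000001  # default for genes without a knowledge base entry (see above)


def combine_rankings(external_scores, statistical_scores):
    """Multiply knowledge base and statistical scores per gene, best score first."""
    combined = {
        gene: external_scores.get(gene, DEFAULT_SCORE) * trad_score
        for gene, trad_score in statistical_scores.items()
    }
    # Assumption: a higher combined score means a more relevant gene.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


external = {"TP53": 0.9, "BRCA1": 0.5}
statistical = {"TP53": 2.0, "BRCA1": 8.0, "GAPDH": 10.0}
ranking = combine_rankings(external, statistical)
```

In this example GAPDH has the highest statistical score but no knowledge base entry, so the tiny default factor moves it to the end of the combined ranking.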

selectFeatures()[source]

Runs the feature selection process. Retrieves scores from knowledge base and (traditional) selector and combines these to a single score.

Returns:absolute path to final output file containing the ranking.
Return type:str
class featureselection.LassoPenalty(knowledgebase)[source]

Bases: featureselection.PriorKnowledgeSelector, featureselection.RSelector

Runs feature selection by invoking xtune R package: https://cran.r-project.org/web/packages/xtune/index.html

xtune is a Lasso selector that uses feature-individual penalty scores. These penalty scores are retrieved from the knowledge base.

selectFeatures()

Triggers the feature selection. Actually a wrapper method that invokes external R code.

Returns:absolute path to the result file containing the ranking.
Return type:str
getName()

Returns the full name (including applied knowledge base) of this selector.

Returns:selector name.
Return type:str
createParams(outputFile)[source]

Sets the parameters the xtune R script requires (input file, output file, filename containing rankings from knowledge base).

Returns:list of parameters to use for xtune execution in R.
Return type:list of str
computeExternalRankings()[source]

Computes the association scores for each feature based on the scores retrieved from the knowledge base. Features that could not be found in the knowledge base receive a default score of 0.000001.

Returns:absolute path to the file containing the external rankings.
Return type:str
class featureselection.WrapperSelector(name)[source]

Bases: featureselection.PythonSelector

Selector implementation for wrapper selectors using scikit-learn. Currently implements recursive feature elimination (RFE) and sequential forward selection (SFS) strategies, which can be combined with nearly any classifier offered by scikit-learn, e.g. SVM.

Parameters:
  • selector – scikit-learn selector strategy (currently RFE and SFS)
  • classifier – scikit-learn classifier to use for wrapper selection.
createClassifier()[source]

Creates a classifier instance (from scikit-learn) to be used during the selection process. To enable the framework to use a new classifier, extend this method accordingly.

Returns:scikit-learn classifier instance.
createSelector()[source]

Creates a selector instance that leads the selection process. Currently, sequential forward selection (SFS) and recursive feature elimination (RFE) are implemented. Extend this method if you want to add another selection strategy.

Returns:scikit-learn selector instance.
prepareOutput(outputFile, data, selector)[source]

Overwrites the inherited prepareOutput method because we need to access the particular selector’s coefficients. The coefficients are extracted as feature scores and will be written to the rankings file.

Parameters:
  • outputFile (str) – absolute path of the file to which to write the ranking.
  • data (pandas.DataFrame) – input dataset to get the feature names.
  • selector – selector instance that is used during feature selection.
runSelector(data, labels)[source]

Runs the actual feature selector of scikit-learn. Is invoked by PythonSelector.selectFeatures().

Parameters:
  • data (pandas.DataFrame) – dataframe containing the unlabeled dataset.
  • labels (list of int) – numerically encoded class labels.
Returns:

sklearn/mlxtend selector that ran the selection (containing coefficients etc.).

class featureselection.SVMRFESelector[source]

Bases: featureselection.JavaSelector

Executes SVM-RFE with poly-kernel. Uses an efficient java implementation from WEKA and is thus just a wrapper class to invoke the corresponding jars.

createParams()[source]

Sets the parameters the java program requires (input file, output file, selector name).

Returns:list of parameters to use for SVM-RFE execution in java.
Return type:list of str
class featureselection.RandomForestSelector[source]

Bases: featureselection.PythonSelector

Selector class that implements RandomForest as provided by scikit-learn.

prepareOutput(outputFile, data, selector)[source]

Overwrites the inherited prepareOutput method because we need to access the RandomForest selector’s feature importances. These feature importances are extracted as feature scores and will be written to the rankings file.

Parameters:
  • outputFile (str) – absolute path of the file to which to write the ranking.
  • data (pandas.DataFrame) – input dataset to get the feature names.
  • selector – RandomForest selector instance that is used during feature selection.
runSelector(data, labels)[source]

Runs the actual feature selection using scikit-learn’s RandomForest classifier. Is invoked by PythonSelector.selectFeatures().

Parameters:
  • data (pandas.DataFrame) – dataframe containing the unlabeled dataset.
  • labels (list of int) – numerically encoded class labels.
Returns:

scikit-learn RandomForestClassifier that ran the selection.

class featureselection.LassoSelector[source]

Bases: featureselection.PythonSelector

Selector class that implements Lasso feature selection using scikit-learn.

prepareOutput(outputFile, data, selector)[source]

Overwrites the inherited prepareOutput method because we need to access Lasso’s coefficients. These coefficients are extracted as feature scores and will be written to the rankings file.

Parameters:
  • outputFile (str) – absolute path of the file to which to write the ranking.
  • data (pandas.DataFrame) – input dataset to get the feature names.
  • selector – Lasso selector instance that is used during feature selection.
runSelector(data, labels)[source]

Runs the actual Lasso feature selector using scikit-learn. Is invoked by PythonSelector.selectFeatures().

Parameters:
  • data (pandas.DataFrame) – dataframe containing the unlabeled dataset.
  • labels (list of int) – numerically encoded class labels.
Returns:

Lasso selector that ran the selection.

class featureselection.PreFilterSelector(knowledgebase, tradApproach)[source]

Bases: featureselection.CombiningSelector

Applies a two-level prefiltering strategy for feature selection. First removes all features that were not retrieved from the knowledge base for the search terms provided in the config file. Then applies a (traditional) feature selector to the remaining features.

For traditional univariate filter approaches, the results retrieved by this class and PostFilterSelector will be the same.

selectFeatures()[source]

Carries out feature selection. First queries the assigned knowledge base to get genes that are associated with the given search terms. Then filters the feature set of the input data set to contain only features that are in the retrieved gene set. Finally applies the (traditional) selector to the filtered data set.

Returns:absolute path to rankings file.
Return type:str
class featureselection.PostFilterSelector(knowledgebase, tradApproach)[source]

Bases: featureselection.CombiningSelector

Applies a two-level postfiltering strategy for feature selection. Applies (traditional) feature selection to the input data set. Afterwards, removes all genes for which no information in the corresponding knowledge base was found based on the search terms provided in the config file. For traditional univariate filter approaches, the results retrieved by this class and PreFilterSelector will be the same.

selectFeatures()[source]

Carries out feature selection. First executes (traditional) selector. Then queries the assigned knowledge base to get genes that are associated to the given search terms. Finally filters feature set to contain only features that are in the retrieved gene set.

Returns:absolute path to rankings file.
Return type:str
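
The equivalence of pre- and postfiltering for univariate filters can be illustrated with a small sketch. It assumes a toy data layout (a dict mapping feature names to value lists) and uses the sample variance as a stand-in univariate score; all function names are hypothetical and not part of Comprior.

```python
from statistics import variance


def univariate_ranking(data):
    """Rank features by variance; each score depends on that feature's values only."""
    scores = {feature: variance(values) for feature, values in data.items()}
    return sorted(scores, key=scores.get, reverse=True)


def prefilter(data, kb_genes):
    """Drop features unknown to the knowledge base, then rank the rest."""
    kept = {feature: values for feature, values in data.items() if feature in kb_genes}
    return univariate_ranking(kept)


def postfilter(data, kb_genes):
    """Rank all features, then drop those unknown to the knowledge base."""
    return [feature for feature in univariate_ranking(data) if feature in kb_genes]


data = {
    "TP53": [1.0, 5.0, 9.0],
    "BRCA1": [2.0, 2.5, 3.0],
    "GAPDH": [0.0, 10.0, 20.0],
    "EGFR": [4.0, 4.0, 7.0],
}
kb_genes = {"TP53", "BRCA1", "EGFR"}  # genes returned by a (hypothetical) knowledge base

pre = prefilter(data, kb_genes)
post = postfilter(data, kb_genes)
```

Both rankings agree because a univariate score never looks at other features, so removing features before or after scoring leaves the relative order unchanged. Multivariate or wrapper approaches score features in the context of the others, which is why the two strategies can differ for them.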
class featureselection.ExtensionSelector(knowledgebase, tradApproach)[source]

Bases: featureselection.CombiningSelector

Selector implementation inspired by SOFOCLES: “SoFoCles: Feature filtering for microarray classification based on Gene Ontology”, Papachristoudis et al., Journal of Biomedical Informatics, 2010

This selector carries out (traditional) feature selection and in parallel retrieves relevant genes from a knowledge base based on the provided search terms. The ranking is then adapted by alternating between the feature ranking retrieved by the (traditional) selection approach and the externally retrieved genes. This resembles an extension approach, where a feature ranking that was retrieved by a traditional approach is extended by such external genes.

selectFeatures()[source]

Carries out feature selection. Executes the (traditional) selector and separately retrieves genes from the assigned knowledge base based on the search terms specified in the config. Finally merges the two feature lists alternately to form an “extended” feature ranking.

Returns:absolute path to rankings file.
Return type:str
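
One possible way to merge two feature lists alternately is sketched below. The function name alternate_rankings is illustrative; starting with the statistical ranking and skipping genes that already appeared are assumptions, since the documentation does not specify either detail.

```python
from itertools import zip_longest


def alternate_rankings(statistical, external):
    """Interleave two ranked lists, starting with the statistical ranking."""
    merged, seen = [], set()
    for pair in zip_longest(statistical, external):
        for feature in pair:
            # zip_longest pads the shorter list with None; duplicates are skipped
            if feature is not None and feature not in seen:
                seen.add(feature)
                merged.append(feature)
    return merged


merged = alternate_rankings(["TP53", "EGFR", "MYC"], ["BRCA1", "TP53"])
```

Here TP53 occurs in both lists and is kept only at its first (best) position.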
class featureselection.NetworkActivitySelector(knowledgebase, featuremapper)[source]

Bases: featureselection.NetworkSelector

Selector implementation that selects a set of pathways from the knowledge base and maps the feature space to the pathways. Pathway ranking scores are computed based on the average ANOVA p-value of its member genes and the sample classes. This method is also used by Chuang et al. and Tian et al. (Discovering statistically significant pathways in expression profiling studies). Pathway feature values are computed with an instance of FeatureMapper or inheriting classes, whose mapping strategies can vary. If pathways should be selected according to another strategy, use this class as an example implementation to implement a new class that inherits from NetworkSelector.

selectPathways(pathways)[source]

Computes a pathway ranking for the input pathways. Computes a pathway score based on the average ANOVA’s f-test p-values of a pathway’s member genes and the sample classes.

Parameters:pathways (dict) – dict of pathways (pathway names as keys) to rank.
Returns:pathway ranking with pathway scores
Return type:pandas.DataFrame
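
The scoring idea can be sketched as follows, assuming the per-gene ANOVA p-values have already been computed and that pathways are given as plain gene lists (Comprior uses pypath.Network objects instead). The function name rank_pathways, the ascending sort, and the handling of unmeasured genes are assumptions.

```python
from statistics import mean


def rank_pathways(pathway_genes, gene_pvalues):
    """Score each pathway by the mean p-value of its measured member genes."""
    scores = {}
    for pathway, genes in pathway_genes.items():
        measured = [gene_pvalues[gene] for gene in genes if gene in gene_pvalues]
        if measured:  # pathways without any measured gene are skipped
            scores[pathway] = mean(measured)
    # Assumption: a lower average p-value indicates a more relevant pathway.
    return sorted(scores.items(), key=lambda item: item[1])


gene_pvalues = {"TP53": 0.01, "MDM2": 0.03, "EGFR": 0.20, "KRAS": 0.40}
pathway_genes = {
    "EGFR signaling": ["EGFR", "KRAS"],
    "p53 signaling": ["TP53", "MDM2", "CDKN1A"],  # CDKN1A is not in the data set
}
ranking = rank_pathways(pathway_genes, gene_pvalues)
```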
class featureselection.FeatureMapper[source]

Bases: object

Abstract. Inherit from this class and implement FeatureMapper.mapFeatures() to implement a new mapping strategy. Maps the feature space of the given input data to a given set of pathways. Computes a new feature value for every feature and sample based on the implemented strategy.

mapFeatures(original_data, pathways)[source]

Abstract method. Implement this method when inheriting from this class. Carries out the actual feature mapping.

Parameters:
  • original_data (pandas.DataFrame) – the original data set of which to map the feature space.
  • pathways (dict) – dict of pathway names as keys and corresponding pathway pypath.Network objects as values
Returns:

the transformed data set with new feature values

Return type:

pandas.DataFrame

getUnlabeledData(dataset)[source]

Removes the labels from the data set.

Parameters:dataset (pandas.DataFrame) – data set from which to remove the labels.
Returns:data set without labels.
Return type:pandas.DataFrame
getLabels(dataset)[source]

Gets the dataset labels.

Parameters:dataset (pandas.DataFrame) – data set from which to extract the labels.
Returns:label vector of the data set.
Return type:pandas.Series
getFeatures(dataset)[source]

Gets the features of a data set.

Parameters:dataset (pandas.DataFrame) – data set from which to extract the features.
Returns:feature vector of the data set.
Return type:pandas.Series
getSamples(dataset)[source]

Gets all samples in a data set.

Parameters:dataset (pandas.DataFrame) – data set from which to extract the samples.
Returns:list of samples from the data set.
Return type:list
getPathwayGenes(pathway, genes)[source]

Returns the intersection of a given set of genes and the genes contained in a given pathway.

Parameters:
  • pathway (pypath.Network) – pathway object from which to get the genes.
  • genes (list of str) – list of gene names.
Returns:

list of genes that are contained in both the pathway and the gene list.

Return type:

list of str

class featureselection.CORGSActivityMapper[source]

Bases: featureselection.FeatureMapper

Pathway mapper that implements the strategy described by Lee et al.: “Inferring Pathway Activity toward Precise Disease Classification”. Identifies CORGS genes for every pathway: uses random search to find the minimal set of genes for which the pathway activity score is maximal. First, every sample receives an activity score, which is the average expression level of the (CORGS) genes divided by the number of genes. The computed activity scores are then used for f-testing with the class labels, and the p-values are the new pathway feature values. These steps are repeated until the p-values no longer decrease.

getANOVAscores(data, labels)[source]

Applies ANOVA f-test to test the association/correlation of a feature (pathway) with a given label. The feature has activity scores (computed from CORGS genes) for every sample, which are to be tested for the labels.

Parameters:
  • data (pandas.DataFrame) – the data set which to test for correlation with the labels (typically feature scores of a pathway for samples).
  • labels (pandas.Series) – class labels to use for f-test.
Returns:

series of p-values for every sample.

Return type:

pandas.Series

computeActivityScore(sampleExpressionLevels)[source]

Computes the activity score of a given set of genes for a specific sample. The activity score of a sample is the mean expression value of the given genes divided by the overall number of given genes.

Parameters:sampleExpressionLevels (pandas.DataFrame) – data set containing expression levels from a given set of genes for samples.
Returns:activity scores for the given samples.
Return type:pandas.Series
computeActivityVector(expressionLevels)[source]

Computes the activity score of a given set of genes for all samples.

Parameters:expressionLevels (pandas.DataFrame) – input data set of expression levels for a given set of (CORGS) genes.
Returns:activity scores of the given genes for all samples.
Return type:pandas.DataFrame or inheriting class
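
A small sketch of the activity score as described above (mean expression of the given genes divided by the number of genes), applied per sample. It assumes a toy layout in which each sample maps to a list of expression values; the function names are illustrative and the real methods operate on pandas objects.

```python
from statistics import mean


def activity_score(expression_levels):
    """Mean expression of the given genes divided by the number of genes."""
    return mean(expression_levels) / len(expression_levels)


def activity_vector(samples):
    """Activity score for every sample; `samples` maps sample id to expression values."""
    return {sample: activity_score(levels) for sample, levels in samples.items()}


scores = activity_vector({"s1": [2.0, 4.0, 6.0], "s2": [1.0, 1.0, 4.0]})
```

For sample s1 the mean expression is 4.0 over three genes, giving a score of about 1.33; for s2 the mean is 2.0, giving about 0.67.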
mapFeatures(original_data, pathways)[source]

Carries out the actual feature mapping. Follows the strategy described by Lee et al.: “Inferring Pathway Activity toward Precise Disease Classification”. Identifies CORGS genes for every pathway: uses random search to find the minimal set of genes for which the pathway activity score is maximal. First, every sample receives an activity score, which is the average expression level of the (CORGS) genes divided by the number of genes. The computed activity scores are then used for f-testing with the class labels, and the p-values are the new pathway feature values. These steps are repeated until the p-values no longer decrease.

Parameters:
  • original_data (pandas.DataFrame) – the original data set of which to map the feature space.
  • pathways (dict) – dict of pathway names as keys and corresponding pathway pypath.Network objects as values
Returns:

the transformed data set with new feature values

Return type:

pandas.DataFrame

class featureselection.PathwayActivityMapper[source]

Bases: featureselection.FeatureMapper

Pathway mapper that implements a strategy that is related to Vert and Kanehisa’s strategy: Vert, Jean-Philippe, and Minoru Kanehisa. “Graph-driven feature extraction from microarray data using diffusion kernels and kernel CCA.” NIPS. 2002. Computes pathway activity scores for every sample and pathway as new feature values. The feature value is the average, over all genes in a pathway, of the expression level weighted by gene variance and neighbor correlation score.

getAverageCorrelation(correlations, gene, neighbors)[source]

Computes the average correlation from the correlations of a given gene and its neighbors.

Parameters:
  • correlations (pandas.DataFrame) – correlation matrix of all genes.
  • gene (str) – gene name whose average neighbor correlation to compute.
  • neighbors (list of str) – list of gene names that are neighbors of the given gene.
Returns:

average correlation value.

Return type:

float
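
A minimal sketch of averaging a gene's correlations with its neighbors, assuming the correlation matrix is given as a nested dict rather than a pandas.DataFrame. Ignoring neighbors that are missing from the matrix and returning 0.0 for a gene without usable neighbors are assumptions, as is the use of signed (not absolute) correlation values.

```python
def average_correlation(correlations, gene, neighbors):
    """Mean correlation between `gene` and each of its listed neighbors."""
    row = correlations[gene]
    values = [row[neighbor] for neighbor in neighbors if neighbor in row]
    # Assumption: a gene without usable neighbors gets an average of 0.0.
    return sum(values) / len(values) if values else 0.0


correlations = {
    "TP53": {"TP53": 1.0, "MDM2": 0.8, "ATM": 0.4},
    "MDM2": {"TP53": 0.8, "MDM2": 1.0, "ATM": 0.1},
    "ATM": {"TP53": 0.4, "MDM2": 0.1, "ATM": 1.0},
}
avg = average_correlation(correlations, "TP53", ["MDM2", "ATM"])
```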

computeGeneVariances(data)[source]

Computes the variances for every gene across all samples.

Parameters:data (pandas.DataFrame) – data set with expression values.
Returns:variance for every gene.
Return type:pandas.Series
mapFeatures(original_data, pathways)[source]

Executes the actual feature mapping procedure. A feature value is the average of (for every gene in a pathway): (expression level weighted by gene variance and neighbor correlation score)

Parameters:
  • original_data (pandas.DataFrame) – the original data set of which to map the feature space.
  • pathways (dict) – dict of pathway names as keys and corresponding pathway pypath.Network objects as values
Returns:

the transformed data set with new feature values

Return type:

pandas.DataFrame

knowledgebases module

Contains all classes related to knowledge bases. A knowledge base is realized with two classes:
  • A class inheriting from knowledgebases.KnowledgeBase and implementing the three interface methods knowledgebases.KnowledgeBase:getRelevantGenes(), knowledgebases.KnowledgeBase:getGeneScores(), and knowledgebases.KnowledgeBase:getRelevantPathways().
  • A class that is responsible for querying the corresponding web service and inherits from Bioservice’s REST class.

Those knowledge bases that retrieve pathway information also need an additional PathwayMapper class, which transforms the original pathway results from the knowledge base (which can range from SIF to any other pathway specification format) into the pathway representation that is used throughout Comprior. For Comprior’s internal pathway representation, we use pypath.

The creation of knowledge bases is encapsulated by the class knowledgebases.KnowledgeBaseFactory, which ensures that every knowledge base is equipped with a web service querying class and, if needed, the right type of knowledgebases.PathwayMapper. For a detailed look at the class architecture, have a look at ADD CLASS ARCHITECTURE LINK HERE.

knowledgebases.suppress_stdout(suppress=True)[source]
class knowledgebases.ENRICHR[source]

Bases: bioservices.services.REST

Queries some of the API endpoints of the EnrichR web service (https://maayanlab.cloud/Enrichr/help#api).

Parameters:config (dict) – configuration parameters for EnrichR web service (as specified in config file).
addlist(geneList)[source]

Queries EnrichR to annotate a given list of genes. Returns a userListID, which can be used to retrieve the actual results in a second query.

Parameters:geneList – list of genes to annotate
Returns:json response containing a userListID.
Return type:dict of str
export(params)[source]

Download file of enrichment results. Requires a userListId that was retrieved from a prior query.

Parameters:params (list of str) – list of parameters to use for that query (userListId: Identifier returned from addList endpoint, filename: Name of text file download, backgroundType: Gene set library for which to download results)
Returns:text file containing enrichment results.
Return type:str
genemap(params)[source]

Finds all terms, their descriptions, and optional categorizations, for a given gene identifier.

Parameters:params (list of str) – list of parameters to be used for the query (gene: Gene to use in search for terms; json (optional): Set “true” to return JSON rather than plaintext; setup (optional): Set “true” to include category information for the libraries)
Returns:json object of all terms containing the specified gene and their descriptions.
Return type:dict of str
enrich(params)[source]

Returns all terms that are available in the library (specified by the backgroundType param) and enriched in the given set of genes (specified by the userListId param).

Parameters:params (list of str) – list of parameters to be used for the query (userListId: Identifier returned from addList endpoint; backgroundType: Gene set library to enrich against)
Returns:dataframe object of all enriched terms (unsorted, unfiltered).
Return type:dataframe
class knowledgebases.UMLS_AUTH[source]

Bases: bioservices.services.REST

Singleton class. Python code encapsulates it in a way that is not shown in Sphinx, so have a look at the descriptions in the source code.

Authentication service to get access to the UMLS database (which we need for retrieving CUI disease codes for querying DisGeNET). You first have to get a ticket-granting ticket (tgt, valid for 8 hours) with the help of an API key. With the tgt, you can then request a service ticket for every new query to the UMLS database. The service ticket must then be used for the query. The task of this class is to generate a valid tgt and subsequent service ticket. Documentation on the authentication process: https://documentation.uts.nlm.nih.gov/rest/authentication.html

Parameters:
  • config (dict) – configuration parameters for UMLS web service as specified in config file.
  • tgt_timestamp (str) – timestamp of the tgt. If it is older than 8 hours, we need to request a new tgt.
  • tgt (list of str) – id of the ticket-granting ticket (valid for 8 hours). With this ticket, we can then query the actual UMLS web service.
  • service (str) – uri for the service login
instance = None
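
The 8-hour validity check for the tgt can be sketched as follows. This is an illustration only: the function name tgt_expired is hypothetical, and the sketch assumes the timestamp is available as a datetime object, whereas the tgt_timestamp parameter above is documented as str and would need to be parsed first.

```python
from datetime import datetime, timedelta

TGT_LIFETIME = timedelta(hours=8)  # validity period of a ticket-granting ticket (see above)


def tgt_expired(tgt_timestamp, now=None):
    """Return True if a tgt issued at `tgt_timestamp` has to be requested again."""
    if now is None:
        now = datetime.now()
    # Conservative choice (assumption): a ticket that is exactly 8 hours old counts as expired.
    return now - tgt_timestamp >= TGT_LIFETIME


issued = datetime(2021, 1, 1, 9, 0)
still_valid = not tgt_expired(issued, now=datetime(2021, 1, 1, 16, 59))
needs_renewal = tgt_expired(issued, now=datetime(2021, 1, 1, 17, 0))
```

A check like this would run before every query: if the tgt is expired, a new one is requested with the API key, otherwise the existing tgt is used to obtain the next service ticket.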
class knowledgebases.UMLS[source]

Bases: bioservices.services.REST

Retrieves UMLS CUI codes for labels, which can then be used for querying DisGeNET.

Parameters:
  • config (dict) – configuration parameters for UMLS web service (as specified in configuration file).
  • auth (UMLS_AUTH) – authentication component to generate a valid service ticket (required for every query).
getCUIs(labels)[source]

Get CUIs for the given labels.

Parameters:labels (list of str) – list of identifiers for which to retrieve CUIs, e.g. disease names.
Returns:list of CUIs.
Return type:list of str
class knowledgebases.DISGENET[source]

Bases: bioservices.services.REST

Queries the DisGeNET web service for a given set of labels and retrieves association scores for all genes related to the query labels. DisGeNET API documentation: https://www.disgenet.org/api/

Parameters:umls (UMLS) – UMLS service for transforming disease names to CUIs (required for query).
getVersion()[source]

Get the current version of the DisGeNET API endpoint.

Returns:web service version infos.
Return type:json dict
query(labels)[source]

Conducts the actual query to retrieve gene-disease association scores for a given list of disease labels. Beforehand, transforms the disease labels into CUIs with the help of the UMLS web service.

Parameters:labels (list of str) – list of disease labels for which to retrieve gene-disease associations.
Returns:DataFrame with gene-disease association scores.
Return type:pandas.DataFrame
class knowledgebases.GCONVERT[source]

Bases: object

Queries the g:Convert web service to map a list of identifiers to a desired format. g:Convert makes use of the Ensembl build. g:Convert API documentation: https://biit.cs.ut.ee/gprofiler/page/apis

Parameters:url (str) – API url as specified in the configuration file.
query(items, originalFormat, desiredFormat)[source]

Map a list of identifiers to the desired format.

Parameters:
  • items (list of str) – list of identifiers to be mapped
  • originalFormat (str) – current format of the identifiers
  • desiredFormat (str) – desired identifier format
Returns:

DataFrame containing the identifier mapping.

Return type:

pandas.DataFrame

class knowledgebases.PATHWAYCOMMONSWS[source]

Bases: bioservices.services.REST

Queries the PathwayCommons web service. Bioservices’ existing implementation to query PathwayCommons was not used because it contained outdated values for _valid_formats for pathway retrieval, so we used the original code and adapted it to work correctly.

getVersion()[source]

Map a list of genes to the desired format.

Parameters:genes (list of str) – list of gene names to be mapped
Returns:list of mapped gene names.
Return type:list of str
default_extension

set extension of the requests (default is json). Can be ‘json’ or ‘xml’

search(q, page=0, datasource=None, organism=None, type=None)[source]

Text search in PathwayCommons using Lucene query syntax

Some of the parameters are BioPAX properties, others are composite relationships.

All index fields are (case-sensitive): comment, ecnumber, keyword, name, pathway, term, xrefdb, xrefid, dataSource, and organism.

The pathway field maps to all participants of pathways that contain the keyword(s) in any of its text fields.

Finally, keyword is a transitive aggregate field that includes all searchable keywords of that element and its child elements.

All searches can also be filtered by data source and organism.

It is also possible to restrict the domain class using the ‘type’ parameter.

This query can be used standalone or to retrieve starting points for graph searches.

Parameters:
  • q (str) – requires a keyword, name, external identifier, or a Lucene query string.
  • page (int) – (N>=0, default is 0), search result page number.
  • datasource (str) – filter by data source (use names or URIs of pathway data sources or of any existing Provenance object). If multiple data source values are specified, a union of hits from specified sources is returned. datasource=[reactome,pid] returns hits associated with Reactome or PID.
  • organism (str) – The organism can be specified either by official name, e.g. “homo sapiens”, or by NCBI taxonomy id, e.g. “9606”. Similar to data sources, if multiple organisms are declared, a union of all hits from the specified organisms is returned. For example, organism=[9606, 10090] returns results for both human and mouse.
  • type (str) – BioPAX class filter
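
The filter parameters above end up as an HTTP query string. The following is a minimal sketch of how such a string could be assembled, not the actual implementation of search(). The helper name build_search_query is hypothetical, and sending multiple values as repeated keys is an assumption about how the web service expects lists.

```python
from urllib.parse import urlencode


def build_search_query(q, page=0, datasource=None, organism=None, type=None):
    """Assemble a query string for a PathwayCommons-style text search.

    Hypothetical helper for illustration; parameter names mirror search().
    """
    params = {"q": q, "page": page}
    # Optional filters are only sent when they are set.
    if datasource is not None:
        params["datasource"] = datasource
    if organism is not None:
        params["organism"] = organism
    if type is not None:
        params["type"] = type
    # doseq=True turns a list value into repeated keys (assumed convention).
    return urlencode(params, doseq=True)


query = build_search_query(
    "name:BRCA2", datasource=["reactome", "pid"], organism="9606", type="Pathway"
)
print(query)
```

Passing a list for datasource produces one datasource=... pair per entry, which matches the "union of hits" behaviour described above if the service accepts repeated keys.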
get(uri, frmt='BIOPAX')[source]

Retrieves full pathway information for a set of elements

elements can be for example pathway, interaction or physical entity given the RDF IDs. Get commands only retrieve the BioPAX elements that are directly mapped to the ID. Use the traverse() query to traverse BioPAX graph and obtain child/owner elements.

Parameters:
  • uri (str) – valid/existing BioPAX element’s URI (RDF ID); for utility classes that were “normalized”, such as entity references and controlled vocabularies, it is usually an Identifiers.org URL. Multiple IDs can be provided using a list, e.g. uri=[‘http://identifiers.org/uniprot/Q06609’, ‘http://identifiers.org/uniprot/Q549Z0’]. See also MIRIAM and Identifiers.org.
  • frmt (str) – output format (default is ‘BIOPAX’).
Returns:

A complete BioPAX representation for the record pointed to by the given URI is returned. Other output formats are produced by converting the BioPAX record on demand and can be specified by the optional output format parameter. Please be advised that with some output formats it might return a “no result found” error if the conversion is not applicable for the BioPAX result. For example, BINARY_SIF output usually works if there are some interactions, complexes, or pathways in the retrieved set and not only physical entities.

class knowledgebases.KnowledgeBaseFactory[source]

Bases: object

Singleton class. Python code encapsulates it in a way that is not shown in Sphinx, so have a look at the descriptions in the source code.

Creates knowledge bases based on the provided name and creates all corresponding objects, e.g. web service endpoints. Every knowledge base implementation must be registered here, otherwise it will not be accessible.

instance = None
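
Since the singleton mechanics are not visible in the generated documentation, here is a minimal sketch of one common way to combine a singleton with a name-based registry. It is not Comprior's code: the class name KnowledgeBaseRegistry, the methods register() and create(), and the use of __new__ are all assumptions made for illustration.

```python
class KnowledgeBaseRegistry:
    """Hypothetical stand-in for a singleton factory; not Comprior's code."""

    instance = None

    def __new__(cls):
        # Create the single shared object on first use, then always reuse it.
        if cls.instance is None:
            cls.instance = super().__new__(cls)
            cls.instance._creators = {}
        return cls.instance

    def register(self, name, creator):
        """Register a zero-argument callable that builds a knowledge base."""
        self._creators[name] = creator

    def create(self, name):
        """Build the knowledge base registered under the given name."""
        if name not in self._creators:
            raise KeyError(f"unknown knowledge base: {name}")
        return self._creators[name]()


factory = KnowledgeBaseRegistry()
factory.register("Demo", lambda: {"name": "Demo"})
same_factory = KnowledgeBaseRegistry()
print(factory is same_factory)
```

The lookup failure for an unregistered name illustrates why every knowledge base implementation has to be registered before it can be used.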
class knowledgebases.KnowledgeBase(name, kb_config, webservice, geneInfo, pathwayInfo)[source]

Bases: object

Super class for every knowledge base implementation. If a new knowledge base is implemented, it must inherit from this class and implement methods KnowledgeBase.getRelevantGenes(), KnowledgeBase.getGeneScores(), and KnowledgeBase.getRelevantPathways().

Parameters:
  • name (str) – name of the knowledge base
  • config (dict) – configuration parameter of the knowledge base as specified in the config file.
  • webservice (bioservices.REST or inheriting classes.) – web service querying object
  • hasGeneInformation (bool) – true if the knowledge base provides gene association information, false otherwise
  • hasPathwayInformation (bool) – true if the knowledge base also provides pathway information, false otherwise
getRelevantGenes(labels)[source]

Abstract. Get all genes that are associated to a list of labels, e.g. disease names.

Parameters:labels (list of str) – list of labels for which to retrieve the genes.
Returns:list of associated genes.
Return type:list of str
getGeneScores(labels)[source]

Abstract. Get all genes and their association scores for a given list of disease names.

Parameters:labels (list of str) – list of disease names for which to get gene-disease-association scores.
Returns:DataFrame of genes and their association scores.
Return type:pandas.DataFrame
getRelevantPathways(labels)[source]

Get all pathways related to a set of labels, e.g. disease names.

Parameters:labels (list of str) – list of labels for which to find related pathways.
Returns:dict of pathway names and pathway representations.
Return type:dict with pypath.Network as values
getName()[source]

Returns the name of the knowledge base.

Returns:knowledge base name.
Return type:str
hasPathways()[source]

Returns whether the knowledge base retrieves pathway information, i.e. whether KnowledgeBase.getRelevantPathways() is implemented.

Returns:true if knowledge base provides pathway information, false otherwise.
Return type:bool
hasGenes()[source]

Returns whether the knowledge base retrieves gene information, i.e. whether KnowledgeBase.getRelevantGenes() and KnowledgeBase.getGeneScores() are implemented.

Returns:true if knowledge base provides gene information, false otherwise.
Return type:bool
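
To illustrate the contract described for KnowledgeBase, the sketch below implements a gene-only knowledge base on top of a simplified stand-in base class. Everything in it is hypothetical: KnowledgeBaseSketch only imitates the documented interface, the association data is made up, and scores are returned as a list of tuples rather than a pandas.DataFrame to keep the example dependency-free.

```python
class KnowledgeBaseSketch:
    """Simplified stand-in for knowledgebases.KnowledgeBase (hypothetical)."""

    def __init__(self, name, has_genes, has_pathways):
        self.name = name
        self._has_genes = has_genes
        self._has_pathways = has_pathways

    def getRelevantGenes(self, labels):
        raise NotImplementedError()

    def getGeneScores(self, labels):
        raise NotImplementedError()

    def getRelevantPathways(self, labels):
        raise NotImplementedError()

    def getName(self):
        return self.name

    def hasGenes(self):
        return self._has_genes

    def hasPathways(self):
        return self._has_pathways


class ToyKnowledgeBase(KnowledgeBaseSketch):
    """Gene-only knowledge base backed by an in-memory dict (made-up data)."""

    ASSOCIATIONS = {
        "disease a": {"GENE1": 0.9, "GENE2": 0.4},
        "disease b": {"GENE2": 0.7},
    }

    def __init__(self):
        super().__init__("Toy", has_genes=True, has_pathways=False)

    def getGeneScores(self, labels):
        # Keep the best score per gene across all labels, highest first.
        best = {}
        for label in labels:
            for gene, score in self.ASSOCIATIONS.get(label.lower(), {}).items():
                best[gene] = max(score, best.get(gene, 0.0))
        return sorted(best.items(), key=lambda item: item[1], reverse=True)

    def getRelevantGenes(self, labels):
        return [gene for gene, _ in self.getGeneScores(labels)]


kb = ToyKnowledgeBase()
print(kb.getGeneScores(["Disease A", "Disease B"]))
```

Because the toy class leaves getRelevantPathways() untouched, calling it raises NotImplementedError, which is consistent with hasPathways() returning false.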
class knowledgebases.Enrichr[source]

Bases: knowledgebases.KnowledgeBase

Special knowledge base not intended to be used by feature selection approaches. Instead, it is used for evaluation purposes to annotate and enrich rankings.

Parameters:
  • name (str) – name of the knowledge base
  • config (dict) – configuration parameter of the knowledge base as specified in the config file.
  • webservice (bioservices.REST or inheriting classes.) – web service querying object
  • hasGeneInformation (bool) – true if the knowledge base provides gene association information, false otherwise
  • hasPathwayInformation (bool) – true if the knowledge base also provides pathway information, false otherwise
downloadEnrichedTerms(userIdList, filePrefix)[source]

Downloads enriched terms from a former query into a file. Filters these terms for those with an adjusted p-value > 0.05, then sorts by combined score in descending order.

Parameters:
  • userIdList (str) – userIdList to retrieve enrichment/annotation results from the original query.
  • filePrefix (str) – prefix to use in filename.
getRelevantGenes(labels)[source]

Is not implemented for EnrichR.

Parameters:labels (list of str) – list of gene names to be mapped
Returns:NotImplementedError as this knowledge base is not intended to be used for such analyses.
Return type:NotImplementedError
getGeneScores(labels)[source]

Is not implemented for EnrichR.

Parameters:labels (list of str) – list of gene names to be mapped
Returns:NotImplementedError as this knowledge base is not intended to be used for such analyses.
Return type:NotImplementedError
getRelevantPathways(labels)[source]

Is not implemented for EnrichR.

Parameters:labels (list of str) – list of labels for which to find related pathways.
Returns:NotImplementedError as this knowledge base is not intended to be used for such analyses.
Return type:NotImplementedError
enrichGeneset(geneList, filePrefix)[source]

Sends a list of identifiers (here, genes) to the EnrichR web service and stores all term enrichments in a file.

Parameters:
  • geneList (list of str) – list of gene names for which to retrieve enrichments.
  • filePrefix (str) – prefix to use in file name (to store enrichments).
annotateGene(gene)[source]

Annotates a gene with terms.

Parameters:gene (str) – gene name.
Returns:list of all annotations to the provided gene.
Return type:list of str
annotateGenes(geneList, filePrefix)[source]

Annotates a list of genes with relevant terms.

Parameters:
  • geneList (list of str) – list of gene names to annotate.
  • filePrefix (str) – prefix to use when storing results in a file.
Returns:

dict of gene names and lists of their annotations.

Return type:

dict

class knowledgebases.BioMART[source]

Bases: object

Maps identifiers or data sets with identifiers to the desired format by using BiomaRt. Wrapper class that internally invokes BiomaRt’s R code. Very unstable, so currently not used. However, it can be swapped in via the benchutils.retrieveMappings() function.

mapItems(itemList, originalFormat, desiredFormat)[source]

Map a list of identifiers to the desired format. Internally invokes external R code that uses the BiomaRt package.

Parameters:
  • itemList (list of str) – list of identifiers to be mapped
  • originalFormat (str) – original identifier format.
  • desiredFormat (str) – format to which to map identifiers.
Returns:

mapping data frame of identifiers (with original and desired format)

Return type:

pandas.DataFrame

getRelevantGenes(labels)[source]

Is not implemented for BiomaRt.

Parameters:labels (list of str) – list of gene names to be mapped
Returns:NotImplementedError as this knowledge base is not intended to be used for such analyses.
Return type:NotImplementedError
getGeneScores(labels)[source]

Is not implemented for BiomaRt.

Parameters:labels (list of str) – list of gene names to be mapped
Returns:NotImplementedError as this knowledge base is not intended to be used for such analyses.
Return type:NotImplementedError
getRelevantPathways(labels)[source]

Is not implemented for BiomaRt.

Parameters:labels (list of str) – list of labels for which to find related pathways.
Returns:NotImplementedError as this knowledge base is not intended to be used for such analyses.
Return type:NotImplementedError
class knowledgebases.Gconvert[source]

Bases: knowledgebases.KnowledgeBase

Maps identifiers or data sets containing identifiers to the desired format by using the g:Convert web service.

Parameters:
  • name (str) – name of the knowledge base
  • config (dict) – configuration parameter of the knowledge base as specified in the config file.
  • webservice (bioservices.REST or inheriting classes) – web service querying object.
  • hasGeneInformation (bool) – true if the knowledge base provides gene association information, false otherwise
  • hasPathwayInformation (bool) – true if the knowledge base also provides pathway information, false otherwise
mapItems(itemList, originalFormat, desiredFormat)[source]

Map a list of identifiers to the desired format.

Parameters:
  • itemList (list of str) – list of identifiers to be mapped.
  • originalFormat (str) – current format of the identifiers.
  • desiredFormat (str) – desired format to which to map identifiers.
Returns:

DataFrame table containing mappings of the identifiers from original to desired format.

Return type:

pandas.DataFrame

getRelevantGenes(labels)[source]

Is not implemented for g:Convert.

Parameters:labels (list of str) – list of gene names to be mapped
Returns:NotImplementedError as this knowledge base is not intended to be used for such analyses.
Return type:NotImplementedError
getGeneScores(labels)[source]

Is not implemented for g:Convert.

Parameters:labels (list of str) – list of gene names to be mapped
Returns:NotImplementedError as this knowledge base is not intended to be used for such analyses.
Return type:NotImplementedError
getRelevantPathways(labels)[source]

Is not implemented for g:Convert.

Parameters:labels (list of str) – list of labels for which to find related pathways.
Returns:NotImplementedError as this knowledge base is not intended to be used for such analyses.
Return type:NotImplementedError
class knowledgebases.OpenTargets[source]

Bases: knowledgebases.KnowledgeBase

Knowledge base implementation of OpenTargets. Uses the OpenTargetsClient Python implementation provided by OpenTargets to query the web service API.

Parameters:
  • name (str) – name of the knowledge base
  • config (dict) – configuration parameter of the knowledge base as specified in the config file.
  • webservice (opentargets.OpenTargetsClient) – web service querying implementation.
  • hasGeneInformation (bool) – true if the knowledge base provides gene association information, false otherwise
  • hasPathwayInformation (bool) – true if the knowledge base also provides pathway information, false otherwise
getAssociations(labels)[source]

Get all relevant information for a given set of labels, sorted by their association scores in descending order. Writes web service results into an intermediate file and maps the identifiers to have the correct format for further processing.

Parameters:labels (list of str) – list of labels, e.g. disease names.
Returns:DataFrame containing all related genes and their association scores.
Return type:pandas.DataFrame
getRelevantGenes(labels)[source]

Get all genes that are somehow associated to the given labels, e.g. disease names.

Parameters:labels (list of str) – list of identifiers, e.g. disease names, for which to find associated genes.
Returns:list of associated genes.
Return type:list of str
getGeneScores(labels)[source]

Get all genes and their association scores that are related to the given labels, e.g. disease names.

Parameters:labels (list of str) – list of identifiers, e.g. disease names, for which to find associated genes.
Returns:DataFrame of associated genes and their association scores, in descending order.
Return type:pandas.DataFrame
getRelevantPathways(labels)[source]

As OpenTargets currently does not provide pathway information, this feature is not implemented for OpenTargets.

Parameters:labels (list of str) – list of labels for which to find related pathways.
Returns:NotImplementedError as this knowledge base is not intended to be used for such analyses.
Return type:NotImplementedError
class knowledgebases.Kegg(pathwayparser)[source]

Bases: knowledgebases.KnowledgeBase

Knowledge base implementation for KEGG. Uses the KEGG web service implementation provided by bioservices. Requires an instance of KEGGPathwayParser to be able to map retrieved pathways into the internal pathway format.

Parameters:
  • name (str) – name of the knowledge base
  • config (dict) – configuration parameter of the knowledge base as specified in the config file.
  • webservice (bioservices.KEGG) – web service querying implementation.
  • hasGeneInformation (bool) – true if the knowledge base provides gene association information, false otherwise
  • hasPathwayInformation (bool) – true if the knowledge base also provides pathway information, false otherwise
  • pathwayparser (KEGGPathwayParser) – pathway mapping class that transforms KEGG pathways in SIF format into the internally used pathway format.
getPathwayNames(labels)[source]

Retrieve all pathway names related to the given labels, e.g. disease names.

Parameters:labels (list of str) – list of labels, e.g. disease names, for which to find pathways.
Returns:list of pathway names.
Return type:list of str
getRelevantGenes(labels)[source]

Get all genes that are related to a set of labels, e.g. disease names. For KEGG, this means we retrieve all genes that are contained in pathways associated to these labels.

Parameters:labels (list of str) – list of identifiers, e.g. disease names, for which to find associated genes.
Returns:list of associated genes.
Return type:list of str
getGeneScores(labels)[source]

Get association scores for all genes that are related to the provided labels, e.g. disease names. For KEGG, the association score for a gene is the sum of its degree percentile rank for every pathway, normalized by the overall number of pathways retrieved. This favors hub genes/genes having many interactions with other genes.

Parameters:labels (list of str) – list of identifiers, e.g. disease names, for which to find associated genes.
Returns:DataFrame of associated genes and their association scores, in descending order.
Return type:pandas.DataFrame
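
The scoring rule described for getGeneScores() can be sketched as follows. This is an illustration under assumptions, not the library's implementation: pathways are given as plain edge lists, and the percentile rank of a gene is taken to be the fraction of genes in the same pathway whose degree is less than or equal to its own. The function names are hypothetical.

```python
from collections import Counter


def degree_percentile_ranks(edges):
    """Percentile rank of every gene's degree within one pathway."""
    degrees = Counter()
    for source, target in edges:
        degrees[source] += 1
        degrees[target] += 1
    n = len(degrees)
    # Assumed definition: share of genes with a degree <= this gene's degree.
    return {
        gene: sum(1 for d in degrees.values() if d <= degree) / n
        for gene, degree in degrees.items()
    }


def gene_scores(pathways):
    """Sum percentile ranks per gene and normalize by the number of pathways."""
    totals = Counter()
    for edges in pathways.values():
        for gene, rank in degree_percentile_ranks(edges).items():
            totals[gene] += rank
    scores = {gene: total / len(pathways) for gene, total in totals.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


pathways = {
    "p1": [("A", "B"), ("A", "C"), ("A", "D")],
    "p2": [("A", "B"), ("B", "C")],
}
ranked = gene_scores(pathways)
print(ranked)
```

Gene D appears in only one of the two pathways, so its score is halved by the normalization, while genes that are well connected in several pathways end up at the top.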
getRelevantPathways(labels)[source]

Get all pathways related to a set of labels, e.g. disease names. Uses the KEGGPathwayParser to map KEGG’s pathways from SIF to pypath.Network.

Parameters:labels (list of str) – list of gene names to be mapped
Returns:dict of pathway names and their internal representation as pypath.Network.
Return type:dict
class knowledgebases.Disgenet[source]

Bases: knowledgebases.KnowledgeBase

Knowledge base implementation for DisGeNET.

Parameters:
  • name (str) – name of the knowledge base
  • config (dict) – configuration parameter of the knowledge base as specified in the config file.
  • webservice (DISGENET) – web service querying implementation.
  • hasGeneInformation (bool) – true if the knowledge base provides gene association information, false otherwise
  • hasPathwayInformation (bool) – true if the knowledge base also provides pathway information, false otherwise
getRelevantGenes(labels)[source]

Get all genes that are related to a set of labels, e.g. disease names.

Parameters:labels (list of str) – list of identifiers, e.g. disease names, for which to find associated genes.
Returns:list of associated genes.
Return type:list of str
getGeneScores(labels)[source]

Get association scores for all genes that are related to the provided labels, e.g. disease names. DisGeNET provides a couple of association scores to its genes (https://www.disgenet.org/dbinfo). Which score to use can be defined by the user in the config file.

Parameters:labels (list of str) – list of identifiers, e.g. disease names, for which to find associated genes.
Returns:DataFrame of associated genes and their association scores, in descending order.
Return type:pandas.DataFrame
getRelevantPathways(labels)[source]

As DisGeNET currently does not provide pathway information, this feature is not implemented.

Parameters:labels (list of str) – list of labels for which to find related pathways.
Returns:NotImplementedError as this knowledge base is not intended to be used for such analyses.
Return type:NotImplementedError
class knowledgebases.Pathwaycommons[source]

Bases: knowledgebases.KnowledgeBase

Knowledge base implementation for PathwayCommons.

Parameters:
  • name (str) – name of the knowledge base
  • config (dict) – configuration parameter of the knowledge base as specified in the config file.
  • webservice (opentargets.OpenTargetsClient) – web service querying implementation.
  • hasGeneInformation (bool) – true if the knowledge base provides gene association information, false otherwise
  • hasPathwayInformation (bool) – true if the knowledge base also provides pathway information, false otherwise
getGeneScores(labels)[source]

Get association scores for all genes that are related to the provided labels, e.g. disease names. For PathwayCommons, the association score for a gene is the sum of its degree percentile rank for every pathway, normalized by the overall number of pathways retrieved. This favors hub genes/genes having many interactions with other genes.

Parameters:labels (list of str) – list of identifiers, e.g. disease names, for which to find associated genes.
Returns:DataFrame of associated genes and their association scores, in descending order.
Return type:pandas.DataFrame
getRelevantGenes(labels)[source]

Get all genes that are related to a set of labels, e.g. disease names. For PathwayCommons, this means we retrieve all genes that are contained in pathways associated to these labels.

Parameters:labels (list of str) – list of identifiers, e.g. disease names, for which to find associated genes.
Returns:list of associated genes.
Return type:list of str
readPathway(pathway)[source]

Reads a pathway to create pypath.Network.

Parameters:pathway (str) – pathway string to parse
getRelevantPathways(labels)[source]

Get all pathways related to a set of labels, e.g. disease names, as pypath.Network.

Parameters:labels (list of str) – list of gene names to be mapped
Returns:dict of pathway names and their internal representation as pypath.Network.
Return type:dict
class knowledgebases.PathwayParser[source]

Bases: object

Super class that maps a pathway from its original format (provided by a knowledge base) to the internally used pypath.Network. When having to map pathways from a knowledge base, implement a new class that inherits from this one and implements PathwayParser.parsePathway().

parsePathway(pathway, pathwayID)[source]

Abstract method. Parse a pathway to the internally used format of pypath.Network.

Parameters:
  • pathway (str) – pathway string to parse
  • pathwayID (str) – name of the pathway
Returns:

pathway in the internally used format.

Return type:

pypath.Network
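
As a rough illustration of the parser contract, the sketch below parses a tab-separated SIF-like string (source, interaction type, target per line). It is a simplified assumption-laden example: both class names are hypothetical, and a plain dict of nodes and edges stands in for pypath.Network so that the example runs without pypath.

```python
class PathwayParserSketch:
    """Simplified stand-in for knowledgebases.PathwayParser (hypothetical)."""

    def parsePathway(self, pathway, pathwayID):
        raise NotImplementedError()


class SifParserSketch(PathwayParserSketch):
    """Parses 'source<TAB>interaction<TAB>target' lines into nodes and edges."""

    def parsePathway(self, pathway, pathwayID):
        edges = []
        for line in pathway.splitlines():
            fields = line.strip().split("\t")
            if len(fields) != 3:
                continue  # skip blank or malformed lines
            source, interaction, target = fields
            edges.append((source, interaction, target))
        nodes = sorted({node for s, _, t in edges for node in (s, t)})
        # A dict is used here in place of pypath.Network.
        return {"id": pathwayID, "nodes": nodes, "edges": edges}


sif = "TP53\tinteracts-with\tMDM2\n\nMDM2\tinteracts-with\tMDM4\n"
network = SifParserSketch().parsePathway(sif, "demo-pathway")
print(network["nodes"])
```

A real subclass would build and return the internal network object instead of the dict.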

class knowledgebases.KEGGPathwayParser[source]

Bases: knowledgebases.PathwayParser

Parse KEGG pathways, which are returned in KGML format.

readInteractions(interactions, geneIds)[source]

Parses interactions for a set of genes.

Parameters:
  • interactions (list) – interactions to parse
  • geneIds (list of str) – gene ids whose interactions to add
parsePathway(kgml_pathway, pathwayID)[source]

Parse KEGG pathway to the internally used format of pypath.Network.

Parameters:
  • kgml_pathway (str) – pathway string to parse
  • pathwayID (str) – name of the pathway
Returns:

pathway in the internally used format.

Return type:

pypath.Network

evaluation module

Contains all classes related to the evaluation part. There are distinct classes for the following evaluation aspects:

  • review of knowledge base coverage for the provided search terms (see evaluation.KnowledgeBaseEvaluator)
  • inspection of data set quality, e.g. via mds or density plots (see evaluation.DatasetEvaluator)
  • comparison and assessment of feature rankings, e.g. overlap (see evaluation.RankingsEvaluator)
  • annotation of feature rankings and enrichment via EnrichR (see evaluation.AnnotationEvaluator)
  • classification and subsequent visualization of standard metrics (see evaluation.ClassificationEvaluator)
  • cross-classification across a second data set and visualization of standard metrics (see evaluation.CrossEvaluator)

Every one of these classes inherits from the abstract evaluation.Evaluator and implements the evaluate() method. evaluation.AttributeRemover is used in the overall benchmarking process to prepare the input data to contain only the selected features. For a detailed look at the class architecture, have a look at ADD CLASS ARCHITECTURE LINK HERE.

class evaluation.AttributeRemover(dataDir, rankingsDir, topK, outputDir)[source]

Bases: object

Prepares the input data set for subsequent classification by removing lowly-ranked features and only keeping the top k features. Creates one “reduced” file for every ranking and for every number of features from one to k (so if k is 50, we will end up with 50 files per ranking, containing from one up to 50 features).

Parameters:
  • dataDir (str) – absolute path to the directory that contains the input data set whose features to reduce.
  • rankingsDir (str) – absolute path to the directory that contains the rankings.
  • topK (str) – maximum number of features to select.
  • outputDir (str) – absolute path to the directory where the reduced files will be stored.
loadTopKRankings()[source]

Loads all available rankings from files.

Returns:Dictionary with selection methods as keys and a ranked list of the (column) names of the top k features.
Return type:dict
removeAttributesFromDataset(method, ranking, dataset)[source]

Creates reduced data sets from dataset for the given method’s ranking that only contain the top x features. Creates multiple reduced data sets from topKmin to topKmax specified in the config.

Parameters:
  • method (str) – selection method applied for the ranking.
  • ranking (List of str) – (ranked) list of feature names from the top k features.
  • dataset (pandas.DataFrame) – original input data set
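
The reduction step can be sketched as below. This is a simplified illustration, not the AttributeRemover code: the data set is a list of dicts instead of a pandas.DataFrame, the function name reduce_dataset is hypothetical, and the name of the label column that is always kept (classLabel) is an assumption.

```python
def reduce_dataset(rows, ranking, k_min, k_max, keep=("classLabel",)):
    """Return {k: rows restricted to the top-k ranked features}.

    One reduced data set is built for every k from k_min to k_max.
    The columns in `keep` (assumed label column) are always retained.
    """
    reduced = {}
    for k in range(k_min, min(k_max, len(ranking)) + 1):
        columns = list(keep) + ranking[:k]
        reduced[k] = [{col: row[col] for col in columns} for row in rows]
    return reduced


rows = [
    {"classLabel": "tumor", "G1": 1.0, "G2": 2.0, "G3": 3.0},
    {"classLabel": "normal", "G1": 0.5, "G2": 0.1, "G3": 0.9},
]
reduced = reduce_dataset(rows, ranking=["G2", "G3", "G1"], k_min=1, k_max=2)
print(reduced[2])
```

Each entry of the result would correspond to one reduced file written for the given selection method.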
removeUnusedAttributes()[source]

For every method and its corresponding ranking, create reduced files with only the top x features.

class evaluation.Evaluator(input, output, methodColors)[source]

Bases: object

Abstract super class. Every evaluation class has to inherit from this class and implement its evaluate() method.

Parameters:
  • input (str) – absolute path to the directory where the input data is located.
  • output (str) – absolute path to the directory to which to save results.
  • methodColors (dict of str) – dictionary containing a color string for every selection method.
  • javaConfig (str) – configuration parameters for java code (as specified in the config file).
  • rConfig (str) – configuration parameters for R code (as specified in the config file).
  • evalConfig (str) – configuration parameters for evaluation, e.g. how many features to select (as specified in the config file).
  • classificationConfig (str) – configuration parameters for classification, e.g. which classifiers to use (as specified in the config file).
evaluate()[source]

Abstract. Must be implemented by inheriting class as this method is invoked by framework.Framework to run the evaluation.

loadRankings(inputDir, maxRank, keepOrder)[source]

Loads all rankings from a specified input directory. To keep only the top k features of each ranking, set maxRank to k; set it to 0 to load all features. If feature order is important in the returned rankings, set keepOrder to true; if you are only interested in which features are among the top maxRank, set it to false.

Parameters:
  • inputDir (str) – absolute path to directory where all rankings are located.
  • maxRank (int) – maximum number of features to have in ranking.
  • keepOrder (bool) – whether the order of the features in the ranking is important or not.
Returns:

Dictionary of rankings per method, either as ordered list or set (depending on keepOrder attribute)

Return type:

dict
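
The interplay of maxRank and keepOrder can be shown on a single in-memory ranking. This is only a sketch of the documented semantics; the helper name truncate_ranking is hypothetical and no files are read.

```python
def truncate_ranking(features, max_rank, keep_order):
    """Apply the documented maxRank/keepOrder semantics to one ranking."""
    # max_rank == 0 means "no limit", i.e. keep all features.
    top = features if max_rank == 0 else features[:max_rank]
    # Ordered list if the order matters, otherwise a set for membership tests.
    return list(top) if keep_order else set(top)


ranking = ["G7", "G2", "G9", "G1"]
print(truncate_ranking(ranking, 2, True))
print(truncate_ranking(ranking, 0, False))
```

A set is convenient for overlap computations, whereas rank-based measures such as Kendall's W need the ordered list.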

computeKendallsW(rankings)[source]

Computes Kendall’s W from two rankings. Note: the measure is not very meaningful if the two rankings are largely disjoint, which can happen especially for traditional approaches.

Parameters:rankings (matrix) – matrix containing two rankings for which to compute Kendall’s W.
Returns:Kendall’s W score.
Return type:float
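
For reference, Kendall's W for m rankings of n features without ties is commonly computed as W = 12 S / (m² (n³ − n)), where S is the sum of squared deviations of the per-feature rank sums from their mean. The sketch below implements this textbook formula on a plain list of lists; it assumes complete rankings without ties and is not necessarily how computeKendallsW() is implemented.

```python
def kendalls_w(rankings):
    """Kendall's W for a matrix with one row of ranks per approach (no ties)."""
    m = len(rankings)      # number of approaches
    n = len(rankings[0])   # number of ranked features
    rank_sums = [sum(row[i] for row in rankings) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    s = sum((total - mean_rank_sum) ** 2 for total in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


print(kendalls_w([[1, 2, 3], [1, 2, 3]]))
print(kendalls_w([[1, 2, 3], [3, 1, 2]]))
```

W is 1 for identical rankings and 0 when the rank sums of all features are equal, as is the case for two exactly reversed rankings.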
class evaluation.ClassificationEvaluator(inputDir, rankingsDir, intermediateDir, outputDir, methodColors, methodMarkers)[source]

Bases: evaluation.Evaluator

Evaluates selection methods via classification by using only the selected features and computing multiple standard metrics. Uses AttributeRemover to create reduced datasets containing only the top k features, which are then used for subsequent classification. Currently, classification and subsequent evaluation is wrapped here and is actually carried out by java jars using WEKA.

Parameters:
  • input (str) – absolute path to the directory where the input data for classification is located.
  • rankingsDir (str) – absolute path to the directory where the rankings are located.
  • intermediateDir (str) – absolute path to the directory where the reduced datasets (containing only the top k features) are written to.
  • output (str) – absolute path to the directory to which to save results.
  • methodColors (dict of str) – dictionary containing a color string for every selection method.
  • javaConfig (str) – configuration parameters for java code (as specified in the config file).
  • rConfig (str) – configuration parameters for R code (as specified in the config file).
  • evalConfig (str) – configuration parameters for evaluation, e.g. how many features to select (as specified in the config file).
  • classificationConfig (str) – configuration parameters for classification, e.g. which classifiers to use (as specified in the config file).
drawLinePlot(inputDir, outputDir, topK, metric)[source]

Draws a line plot for a given metric, using all files containing evaluation results for that metric in inputDir. In the end, the plot will have one line per feature selection approach for which classification results are available.

Parameters:
  • inputDir (str) – absolute path to directory containing all input files (from which to draw the graph).
  • outputDir (str) – absolute path to the output directory where the graph will be saved.
  • topK (int) – maximum x axis value
  • metric (str) – metric name for which to draw the graph.
evaluate()[source]

Triggers classification and evaluation in Java and creates corresponding plots for every metric that was selected in the config.

class evaluation.RankingsEvaluator(input, dataset, outputPath, methodColors)[source]

Bases: evaluation.Evaluator

Evaluates the rankings themselves by generating overlaps and comparing fold change differences.

Parameters:
  • input (str) – absolute path to the directory where the input data is located.
  • output (str) – absolute path to the directory to which to save results.
  • methodColors (dict of str) – dictionary containing a color string for every selection method.
  • javaConfig (str) – configuration parameters for java code (as specified in the config file).
  • rConfig (str) – configuration parameters for R code (as specified in the config file).
  • evalConfig (str) – configuration parameters for evaluation, e.g. how many features to select (as specified in the config file).
  • classificationConfig (str) – configuration parameters for classification, e.g. which classifiers to use (as specified in the config file).
  • dataset (str) – absolute file path to the input data set (from which features were selected).
  • metrics (List of str) – list of metrics to apply to ranking evaluation (as specified in the config file).
generateOverlaps()[source]

Creates overlap plots for the available rankings (set during creation as self.input). For up to two rankings, uses Python’s matplotlib to create Venn diagrams. For three rankings and above, creates UpSetR (https://github.com/hms-dbmi/UpSetR) diagrams via R.
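
The numbers behind such overlap plots are set intersections between rankings. The sketch below computes the number of shared features for every combination of two or more methods; it only illustrates the idea, does no plotting, and uses a hypothetical function name and made-up rankings.

```python
from itertools import combinations


def overlap_sizes(rankings):
    """Number of shared features for every combination of >= 2 methods."""
    sets = {method: set(features) for method, features in rankings.items()}
    sizes = {}
    for r in range(2, len(sets) + 1):
        for methods in combinations(sorted(sets), r):
            shared = set.intersection(*(sets[m] for m in methods))
            sizes[methods] = len(shared)
    return sizes


rankings = {
    "MethodA": ["G1", "G2", "G3"],
    "MethodB": ["G2", "G3", "G4"],
    "MethodC": ["G3", "G5", "G6"],
}
sizes = overlap_sizes(rankings)
print(sizes)
```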

loadGeneRanks(inputDir, topK)[source]

Used for computing Kendall’s W. Loads rankings and creates a table (approach x features) containing individual ranks per feature per approach, e.g.

    #approach  G1  G2  G3
    #Ranker1   1   2   3
    #Ranker2   3   1   2

Parameters:
  • inputDir (str) – absolute path to the directory containing all ranking files.
  • topK (int) – maximum number of features to use (=length of the rankings).
Returns:

Ranking table containing every assigned rank for every feature per ranking approach.

Return type:

numpy.array
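
A rank table of this shape can be derived from ordered rankings as sketched below. This is an illustration only: it works on in-memory lists instead of ranking files, returns nested lists rather than a numpy.array, uses a hypothetical function name, and assumes that all rankings contain the same features.

```python
def build_rank_table(rankings):
    """Build an (approach x feature) table of ranks from ordered rankings.

    Assumes every ranking contains the same set of features.
    """
    approaches = sorted(rankings)
    features = sorted(set().union(*rankings.values()))
    table = []
    for approach in approaches:
        # Position in the ordered ranking, starting at rank 1.
        position = {
            feature: rank
            for rank, feature in enumerate(rankings[approach], start=1)
        }
        table.append([position[feature] for feature in features])
    return approaches, features, table


approaches, features, table = build_rank_table(
    {"Ranker1": ["G1", "G2", "G3"], "Ranker2": ["G2", "G3", "G1"]}
)
print(features)
print(table)
```

With these two rankings the table rows are 1 2 3 for Ranker1 and 3 1 2 for Ranker2, i.e. the same layout as in the example table of the method description.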

computePValue(W, m, n)[source]

Computes the p-value for a given Kendall’s W score via a simple permutation (1000 times) test.

Parameters:
  • W (float) – Kendall’s W score.
  • m (int) – number of approaches/rankings to compare.
  • n (int) – number of features in each ranking.
Returns:

p-value of Kendall’s W score.

Return type:

float
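
A simple permutation test of this kind could look as follows. This is a sketch under assumptions and not the actual implementation: random rankings are drawn as independent permutations of the ranks 1..n, the p-value is the fraction of permutations whose W is at least as large as the observed one, and the seed parameter as well as both function names are hypothetical additions for reproducibility.

```python
import random


def kendalls_w(rankings):
    """Kendall's W for a matrix with one row of ranks per approach (no ties)."""
    m, n = len(rankings), len(rankings[0])
    rank_sums = [sum(row[i] for row in rankings) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    s = sum((total - mean_rank_sum) ** 2 for total in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def permutation_p_value(w, m, n, permutations=1000, seed=0):
    """Share of random ranking sets whose W is at least the observed W."""
    rng = random.Random(seed)
    at_least_as_high = 0
    for _ in range(permutations):
        # Draw m independent random rankings of n features.
        shuffled = [rng.sample(range(1, n + 1), n) for _ in range(m)]
        if kendalls_w(shuffled) >= w:
            at_least_as_high += 1
    return at_least_as_high / permutations


p_perfect = permutation_p_value(1.0, m=3, n=6)
p_trivial = permutation_p_value(0.0, m=2, n=4)
print(p_perfect, p_trivial)
```

Perfect agreement among three rankings of six features is extremely unlikely by chance, so its p-value is close to 0, whereas an observed W of 0 is matched or exceeded by every permutation.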

computeKendallsWScores()[source]

Computes Kendall’s correlation coefficients (W) and their corresponding p-values for the top 50, 500, 5,000 and all (code: 0) ranked features of existing rankings. Conducts a permutation test for every score to obtain its p-value. Writes the correlation coefficients and their corresponding p-values for the different ranking lengths to a file.

drawBoxPlot(data, labels, prefix)[source]

Draws a box plot from the given data with the given labels on the x axis and the given prefix in the headlines.

Parameters:
  • data (List of lists of floats) – Data to plot; a list containing lists of values.
  • labels (List of str) – List of method names.
  • prefix (str) – Prefix to use for file name and title.
computeFoldChangeDiffs()[source]

Computes median and mean fold changes for all selected features per approach. Writes fold changes to file and creates corresponding box plots.

evaluate()[source]

Runs evaluations on feature rankings based on what is specified in the config file. Currently, can compute feature overlaps, Kendall’s correlation coefficient (W), and box plots for mean and median fold changes of selected features.

class evaluation.CrossEvaluator(input, rankingsDir, output, methodColors)[source]

Bases: evaluation.Evaluator

Runs the evaluation across a second data set. Takes the top k ranked features and removes all other features from that second data set. Then runs a ClassificationEvaluator on that second data set with the selected features.

Parameters:
  • input (str) – absolute path to the directory where the second data set for cross-validation is located.
  • rankingsDir (str) – absolute path to the directory containing all rankings.
  • output (str) – absolute path to the directory to which to write all classification results.
  • methodColors (dict) – Dictionary that assigns every (ranking) method a unique color (used for drawing subsequent plots).
  • javaConfig (str) – configuration parameters for java code (as specified in the config file).
  • rConfig (str) – configuration parameters for R code (as specified in the config file).
  • evalConfig (str) – configuration parameters for evaluation, e.g. how many features to select (as specified in the config file).
  • classificationConfig (str) – configuration parameters for classification, e.g. which classifiers to use (as specified in the config file).
evaluate()[source]

Runs the cross-classification: takes the features selected on the original data set and uses them to classify a second (cross-validation) data set.
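
The reduction of the second data set to the top k ranked features can be sketched as below. This is a simplified illustration under assumptions: the data set is represented as a list of dicts, the label column name classLabel is hypothetical, and the function names are not taken from Comprior. The subsequent classification step is omitted.

```python
def top_k_features(ranking, k):
    """ranking: list of feature names, best-ranked first."""
    return ranking[:k]


def reduce_dataset(rows, keep_features, label_column="classLabel"):
    """Keep only the label column and the selected features in every sample.

    rows: list of dicts, one dict per sample (column name -> value).
    """
    keep = set(keep_features) | {label_column}
    return [
        {col: val for col, val in row.items() if col in keep}
        for row in rows
    ]
```

Features that appear in the ranking but not in the second data set are simply absent from the reduced rows in this sketch.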

class evaluation.AnnotationEvaluator(input, output, methodColors)[source]

Bases: evaluation.Evaluator

Annotates and enriches feature rankings with EnrichR (https://maayanlab.cloud/Enrichr/). The library to use for annotation must be specified in the config file. Can compute annotation and enrichment overlaps for different feature rankings. Annotation means annotating features with terms/information; enrichment means checking which terms are enriched in a feature ranking and are related to multiple features. The overlaps can then show a) whether feature rankings represent the same underlying processes via annotation (possibly despite having selected different features), or b) whether the underlying processes are equally strongly represented, by checking the enrichment (possibly despite having selected different features).

Parameters:
  • input (str) – absolute path to the directory where the second data set for cross-validation is located.
  • output (str) – absolute path to the directory to which to write all classification results.
  • methodColors (dict) – dictionary that assigns every (ranking) method a unique color (used for drawing subsequent plots).
  • metrics (List of str) – list of metrics to compute, to be configured in config.
  • dataConfig (dict) – config parameters for the input dataset.
  • javaConfig (str) – configuration parameters for java code (as specified in the config file).
  • rConfig (str) – configuration parameters for R code (as specified in the config file).
  • evalConfig (str) – configuration parameters for evaluation, e.g. how many features to select (as specified in the config file).
  • classificationConfig (str) – configuration parameters for classification, e.g. which classifiers to use (as specified in the config file).
countAnnotationPercentages(featureLists, inputDir)[source]

Count the number of features (=number of lines in annotation file) that have been annotated and compute percentages. Write output to a file “annotationsPercentages.csv” in self.outputDir.

Parameters:
  • featureLists (dict) – dictionary of lists of features per selection method.
  • inputDir (str) – absolute path to directory containing annotation files.
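
A minimal sketch of the percentage computation described for countAnnotationPercentages is shown below. It assumes the number of annotated features per method is already known (in Comprior it is derived from the number of lines in each annotation file); the function names, the column headers, and the in-memory CSV output are assumptions for illustration only.

```python
import csv
import io


def annotation_percentages(feature_lists, annotated_counts):
    """Percentage of annotated features per selection method.

    feature_lists:    {method: [feature, ...]}
    annotated_counts: {method: number of annotated features}
    """
    result = {}
    for method, features in feature_lists.items():
        total = len(features)
        annotated = annotated_counts.get(method, 0)
        # Guard against empty feature lists to avoid a division by zero.
        result[method] = 100.0 * annotated / total if total else 0.0
    return result


def to_csv(percentages):
    """Serialize the percentages as CSV text (hypothetical column names)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["method", "annotatedPercentage"])
    for method, pct in percentages.items():
        writer.writerow([method, f"{pct:.1f}"])
    return buffer.getvalue()
```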
loadAnnotationFiles(inputDir, inputFiles)[source]

Loads files with feature annotations.

Parameters:
  • inputDir (str) – absolute path to directory containing annotation files.
  • inputFiles (List of str) – list of annotation file names to load.
Returns:

dictionary of annotation sets per selection method.

Return type:

dict

computeOverlap(inputDir, fileSuffix)[source]

Creates overlap plots for the available annotations/enrichments. For up to two rankings, use Python’s matplotlib to create Venn diagrams. For three rankings and above, create UpsetR (https://github.com/hms-dbmi/UpSetR) diagrams via R.

Parameters:
  • inputDir (str) – absolute path to directory containing files for which to compute overlap.
  • fileSuffix (str) – suffix in filename to recognize the right files.
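
The set logic behind such overlap plots can be sketched as follows. The example only computes the pairwise intersections and picks a plot type according to the rule stated above; the actual drawing (matplotlib or UpsetR) is left out, and the function names are assumptions, not part of Comprior.

```python
from itertools import combinations


def pairwise_overlaps(annotation_sets):
    """Intersection of annotation terms for every pair of rankings.

    annotation_sets: {method: set of annotation terms}
    """
    overlaps = {}
    for a, b in combinations(sorted(annotation_sets), 2):
        overlaps[(a, b)] = annotation_sets[a] & annotation_sets[b]
    return overlaps


def choose_plot_type(annotation_sets):
    """Venn diagram for up to two rankings, UpSet-style plot for three or more."""
    return "venn" if len(annotation_sets) <= 2 else "upset"
```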
evaluate()[source]

Runs the annotation/enrichment evaluation on the rankings. Depending on what was specified in the config file, annotates and/or enriches feature rankings and computes overlaps or percentages. The overlaps can then show a) whether feature rankings represent the same underlying processes via annotation (possibly despite having selected different features), or b) whether the underlying processes are equally strongly represented, by checking the enrichment (possibly despite having selected different features).

class evaluation.DatasetEvaluator(input, output, separator, options)[source]

Bases: evaluation.Evaluator

Creates plots regarding data set quality, currently: MDS, density, and box plots. Wrapper class because the actual evaluation and plot creation is done in an R script.

Parameters:
  • input (str) – absolute path to the directory where the input data set is located (for which to create the plots).
  • output (str) – absolute path to the directory to which to save plots.
  • separator (str) – separator character in data set to read it correctly.
  • options (list of str) – what plots to create, a list of method names that must be specified in the config file.
  • javaConfig (str) – configuration parameters for java code (as specified in the config file).
  • rConfig (str) – configuration parameters for R code (as specified in the config file).
  • evalConfig (str) – configuration parameters for evaluation, e.g. how many features to select (as specified in the config file).
  • classificationConfig (str) – configuration parameters for classification, e.g. which classifiers to use (as specified in the config file).
evaluate()[source]

Triggers the actual evaluation/plot generation in R. If a second data set for cross-validation was provided, also run the corresponding R script on that data set.

class evaluation.KnowledgeBaseEvaluator(output, knowledgebases, searchterms)[source]

Bases: evaluation.Evaluator

Creates plots to evaluate knowledge base coverage. Queries the knowledge bases with the given search terms and checks how many genes or pathways are found.

Parameters:
  • output (str) – absolute path to the directory to which to save plots.
  • knowledgebases (list of str) – a list of knowledgebases to test.
  • searchterms (list of str) – list of search terms for which to check knowledge base coverage.
  • javaConfig (str) – configuration parameters for java code (as specified in the config file).
  • rConfig (str) – configuration parameters for R code (as specified in the config file).
  • evalConfig (str) – configuration parameters for evaluation, e.g. how many features to select (as specified in the config file).
  • classificationConfig (str) – configuration parameters for classification, e.g. which classifiers to use (as specified in the config file).
drawCombinedPlot(stats, colIndex, filename, title, ylabel1, ylabel2, colors)[source]

Creates combined plot of box and bar plot from a data set.

Parameters:
  • stats (pandas.DataFrame) – statistics to plot.
  • colIndex (int) – column index to use as column.
  • filename (str) – filename for the plot.
  • title (str) – title for the plot.
  • ylabel1 (str) – label of y axis (left side/box plot).
  • ylabel2 (str) – label of y axis (right side/bar plot).
  • colors (List of str) – List of colors to use for the different search terms.
createKnowledgeBases(knowledgebaseList)[source]

Creates knowledge base objects from a given list.

Parameters:knowledgebaseList (List of str) – List of knowledge base names to create.
Returns:List of knowledge base objects
Return type:List of KnowledgeBase or inheriting classes
checkCoverage(kb, colors, useIDs)[source]

Checks the coverage for a given knowledge base and creates corresponding plots.

Parameters:
  • kb (knowledgebases.KnowledgeBase or inheriting class) – knowledge base object for which to check coverage.
  • colors (List of str) – List of colors to use for plots.
checkPathwayCoverage(kb, colors, useIDs)[source]

Checks the pathway coverage for a given knowledge base and creates corresponding plots.

Parameters:
  • kb (knowledgebases.KnowledgeBase or inheriting class) – knowledge base object for which to check pathway coverage.
  • colors (List of str) – List of colors to use for plots.
evaluate()[source]

Evaluates every given knowledge base by checking how many genes and pathways it contains for the given search terms, and how large those pathways are. Creates corresponding plots.
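
The coverage statistics that underlie such plots can be sketched as below. The knowledge base is replaced by plain dictionaries here, which is purely an assumption for illustration; real knowledgebases.KnowledgeBase objects are queried differently, and the function names are hypothetical.

```python
def gene_coverage(genes_by_term, search_terms):
    """Number of genes found per search term.

    genes_by_term: {search term: set of gene names} (in-memory stand-in
    for a knowledge base query result).
    """
    return {term: len(genes_by_term.get(term, set())) for term in search_terms}


def pathway_coverage(pathways_by_term, search_terms):
    """Number of pathways found per search term and their sizes.

    pathways_by_term: {search term: {pathway name: set of gene names}}
    """
    stats = {}
    for term in search_terms:
        found = pathways_by_term.get(term, {})
        stats[term] = {
            "count": len(found),
            "sizes": sorted(len(genes) for genes in found.values()),
        }
    return stats
```

Search terms without any hit yield a count of zero in this sketch, which makes gaps in knowledge base coverage directly visible.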