Java Code Documentation

Feature Selection

public class WEKA_FeatureSelector
package de.hpi.bmg;

import weka.core.Instances;

import java.util.ArrayList;
import java.util.List;

/**
 * Entry point class for running a feature selector on a data set.
 * Invoke the jar of this Java file to carry out the feature selection procedure (see :class:`featureselection.InfoGainSelector` for an example of how to do that).
 */
public class WEKA_FeatureSelector {

    /**
     * Loads the input data set and creates selector objects based on the provided list of feature selector names.
     * Invokes feature selection procedures for all selectors and writes the results to the output directory, one file per selector.
     *
     * @param args the parameters provided when invoking the jar. Provide the following parameters:
     *             - absolute path to the input data set.
     *             - absolute path to the output directory (where to write the feature rankings).
     *             - the names of the feature selectors to apply, one per argument (e.g. InfoGain ReliefF).
     */
    public static void main(String[] args) {

        DataLoader dl = new DataLoader(args[0], ",");

        Instances data = dl.getData();

        //delete sample column
        data.deleteAttributeAt(0);
        //set classLabel column to classIndex column
        data.setClassIndex(0);

        List<String> attributeSelectionMethods = new ArrayList<String>();

        for (int i=2; i < args.length; i++) {
            attributeSelectionMethods.add(args[i]);
        }

        for (String asMethod : attributeSelectionMethods) {

            AttributeSelector as = new AttributeSelector(data, asMethod);

            as.selectAttributes();

            as.saveSelectedAttributes(args[1]);
        }

    }

}

Entry point class for running a feature selector on a data set. Invoke the jar of this Java file to carry out the feature selection procedure. Is invoked during feature selection by featureselection.InfoGainSelector.

public static void main(String[] args)

Loads the input data set and creates selector objects based on the provided list of feature selector names. Invokes feature selection procedures for all selectors and writes the results to the output directory, one file per selector.

Parameters:
  • args – The parameters provided when invoking the jar. Provide the following parameters: a) the absolute path to the input data set, b) the absolute path to the output directory (where to write the feature rankings), and c) the names of the feature selectors to run, passed as individual arguments (e.g. InfoGain ReliefF).
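The argument handling in main can be sketched in plain Java (the class name and example paths below are hypothetical): everything from args[2] onward is collected as one selector name by the loop in main.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of WEKA_FeatureSelector's argument layout; class name and
// example values are made up for illustration.
public class ArgsSketch {

    // Mirrors the loop in main: every argument after the first two is one selector name.
    static List<String> selectorsFrom(String[] args) {
        List<String> attributeSelectionMethods = new ArrayList<>();
        for (int i = 2; i < args.length; i++) {
            attributeSelectionMethods.add(args[i]);
        }
        return attributeSelectionMethods;
    }

    public static void main(String[] args) {
        String[] example = {"/data/input.csv", "/data/rankings", "InfoGain", "ReliefF"};
        System.out.println(selectorsFrom(example)); // the selector names from index 2 onward
    }
}
```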
public class DataLoader
package de.hpi.bmg;

import weka.core.Instances;
import weka.core.converters.CSVLoader;

import java.io.File;
import java.io.IOException;

/**
 * Class for loading a data set from a file.
 * Used by classes WEKA_Evaluator and WEKA_FeatureSelector.
 */
public class DataLoader {


    String sourceFile;

    Instances data;

    /**
     * Constructor method.
     * Loads the data from the specified source file and stores it in the data class attribute.
     *
     * @param sourceFile absolute path of the input file from which to load the data.
     * @param separator  separator to use for file reading, e.g. a comma.
     */
    public DataLoader(String sourceFile, String separator) {
        this.sourceFile = sourceFile;
        loadData(separator);
    }

    /**
     * Returns the loaded data set.
     *
     * @return the data set.
     */
    public Instances getData() {
        return data;
    }

    /**
     * Carries out the actual data loading.
     * Stores the loaded data set in the data class attribute.
     *
     * @param separator separator to use for file reading, e.g. a comma.
     */
    private void loadData(String separator) {

        CSVLoader loader = new CSVLoader();
        loader.setFieldSeparator(separator);
        try {
            loader.setSource(new File(this.sourceFile));
            this.data = loader.getDataSet();
        } catch (IOException e) {
            e.printStackTrace();
            // see https://opensource.apple.com/source/Libc/Libc-320/include/sysexits.h
            System.exit(66);
        }
    }

}

Class for loading a data set from a file. Used by classes WEKA_FeatureSelector and WEKA_Evaluator.

Instances data
String sourceFile
public DataLoader(String sourceFile, String separator)

Constructor method. Loads the data from the specified source file and stores it in the data class attribute.

Parameters:
  • sourceFile – absolute path of the input file from which to load the data.
  • separator – separator to use for file reading, e.g. a comma.
public Instances getData()

Returns the loaded data set.

Returns: the data set.
private void loadData(String separator)

Carries out the actual data loading. Stores the loaded data set in the data class attribute.

Parameters:
  • separator – separator to use for file reading, e.g. a comma.
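Conceptually, loadData delegates the parsing to WEKA's CSVLoader, which also infers attribute types; stripped of WEKA, the reading step amounts to something like the following sketch (class name hypothetical):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for DataLoader.loadData: read a separator-delimited
// file into string rows. WEKA's CSVLoader additionally builds typed
// Instances; this sketch only shows the splitting.
public class SimpleCsvSketch {

    static List<String[]> load(String sourceFile, String separator) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(sourceFile))) {
            String line;
            while ((line = reader.readLine()) != null) {
                rows.add(line.split(separator, -1)); // -1 keeps trailing empty fields
            }
        }
        return rows;
    }
}
```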
public class AttributeSelector
package de.hpi.bmg;

import com.opencsv.CSVWriter;
import weka.attributeSelection.*;
import weka.core.Instances;

import java.io.*;
import java.util.logging.Logger;

/**
 * Selector class that carries out the actual feature selection procedure.
 * Invoked by WEKA_FeatureSelector.
 */
public class AttributeSelector {

    private final static Logger LOGGER = Logger.getLogger(AttributeSelector.class.getName());

    private String selectionMethod;
    private Instances data;
    private AttributeSelection attributeSelection;

    /**
     * Constructor method.
     *
     * @param data the input data set from which to select the features.
     * @param selectionMethod name of the feature selector to apply.
     */
    public AttributeSelector(Instances data, String selectionMethod){

        this.data = data;
        this.selectionMethod = selectionMethod;

    }

    /**
     * Do the actual feature selection.
     * Based on the selector name, create corresponding instances of classes provided by WEKA and generate a feature ranking.
     */
    public void selectAttributes() {
        ASEvaluation eval;

        switch (this.selectionMethod) {
            case "SVMpRFE":
                //the default kernel for WEKA's SVMAttributeEval is a poly kernel (as defined in class SMO)
                eval = new SVMAttributeEval();

                ((SVMAttributeEval) eval).setPercentThreshold(10);

                ((SVMAttributeEval) eval).setPercentToEliminatePerIteration(10);

                break;

            case "GainRatio":

                eval = new GainRatioAttributeEval();

                break;

            case "ReliefF":

                eval = new ReliefFAttributeEval();

                break;

            default:

                eval = new InfoGainAttributeEval();

        }

        Ranker ranker = new Ranker();

        this.attributeSelection = new AttributeSelection();

        this.attributeSelection.setEvaluator(eval);
        this.attributeSelection.setSearch(ranker);

        // perform attribute selection

        long begin = System.currentTimeMillis();

        try {
            this.attributeSelection.SelectAttributes(data);
        } catch (Exception e) {
            e.printStackTrace();
        }

        long end = System.currentTimeMillis();

        long dt = end - begin;

        LOGGER.info("" + dt + "," + this.selectionMethod);
        System.out.println("" + dt + "," + this.selectionMethod);
    }

    /**
     * Creates a feature ranking list and stores it in the specified file.
     *
     * @param saveLocation absolute path to the output directory in which to store the ranking file.
     */
    public void saveSelectedAttributes(String saveLocation) {

        try {


            CSVWriter writer = new CSVWriter(new FileWriter(saveLocation + "/" + this.selectionMethod + ".csv"), '\t');

            String[] header = {"attributeName","score"};

            writer.writeNext(header);

            double[][] rankedAttributes = this.attributeSelection.rankedAttributes();

            for (int i = 0; i < rankedAttributes.length; i++) {

                String attributeName = data.attribute((int) rankedAttributes[i][0]).name();

                String score = "" + rankedAttributes[i][1];

                String[] entry = {attributeName, score};

                writer.writeNext(entry);
            }

            writer.close();

        } catch (Exception e) {
            e.printStackTrace();
        }

    }
}

Selector class that carries out the actual feature selection procedure. Used by WEKA_FeatureSelector.

public AttributeSelector(Instances data, String selectionMethod)

Constructor method.

Parameters:
  • data – the input data set from which to select the features.
  • selectionMethod – name of the feature selector to apply.
public void saveSelectedAttributes(String saveLocation)

Creates a feature ranking list and stores it in the specified file.

Parameters:
  • saveLocation – absolute path to the output directory in which to store the ranking file.
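The output layout of saveSelectedAttributes can be reproduced without opencsv as follows (hypothetical stand-in; the real CSVWriter additionally quotes each field by default):

```java
import java.io.IOException;
import java.io.PrintWriter;

// Hypothetical stand-in for saveSelectedAttributes: writes one
// "attributeName<TAB>score" line per ranked attribute to
// <saveLocation>/<selectionMethod>.csv.
public class RankingWriterSketch {

    static void save(String saveLocation, String selectionMethod,
                     String[] names, double[] scores) throws IOException {
        try (PrintWriter out = new PrintWriter(saveLocation + "/" + selectionMethod + ".csv")) {
            out.println("attributeName\tscore");
            for (int i = 0; i < names.length; i++) {
                out.println(names[i] + "\t" + scores[i]);
            }
        }
    }
}
```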
public void selectAttributes()

Do the actual feature selection. Based on the selector name, create corresponding instances of classes provided by WEKA and generate a feature ranking.
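The mapping from selector name to WEKA evaluator in selectAttributes is a plain switch. The stand-in below (returning class names instead of instantiating evaluators; class name hypothetical) makes the fallback behaviour explicit: any unrecognized name silently falls back to InfoGain.

```java
// Sketch of the dispatch in AttributeSelector.selectAttributes; returns the
// WEKA evaluator class name instead of an evaluator instance.
public class DispatchSketch {

    static String evaluatorFor(String selectionMethod) {
        switch (selectionMethod) {
            case "SVMpRFE":
                return "SVMAttributeEval";   // poly-kernel SVM-RFE, 10% eliminated per iteration
            case "GainRatio":
                return "GainRatioAttributeEval";
            case "ReliefF":
                return "ReliefFAttributeEval";
            default:
                return "InfoGainAttributeEval"; // fallback for any other name
        }
    }
}
```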

Evaluation

public class WEKA_Evaluator
package de.hpi.bmg;

import com.opencsv.CSVWriter;
import org.apache.commons.lang3.ArrayUtils;
import weka.classifiers.AbstractClassifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;

import java.lang.reflect.Array;
import java.util.Arrays;
import java.util.HashMap;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.logging.Logger;

/**
 * Entry point class for running classification on a data set using only the top 1 up to k features (one classification round per k).
 * Invoke the jar of this Java file to start the classification procedure (see :class:`evaluation.ClassificationEvaluator` for an example of how to do that).
 * Uses :class:`DataLoader` to load the input data set and :class:`Analyzer` to run the actual classification procedure (and compute evaluation metrics).
 * Summarizes results from all classifiers and all input data sets (depending on how many features were used) and writes them into output files.
 */
public class WEKA_Evaluator {

    private final static Logger LOGGER = Logger.getLogger(WEKA_Evaluator.class.getName());

    /**
     * Processes some of the command line parameters (for classifiers, metrics, and input data set locations).
     * Invokes classification procedure for every subdirectory (=selection method) that is contained in the input directory.
     *
     * @param args the parameters provided when invoking the jar. Provide the following parameters:
     *             - absolute path of the directory containing the reduced input data set files (one subdirectory per selection approach).
     *             - absolute path of the output directory (where to write all evaluation results).
     *             - minimum number of features to use for classification.
     *             - maximum number of features to use for classification.
     *             - k param for k-fold cross validation.
     *             - a string of classifiers, separated by a comma (e.g. "SVM,KNN3,KNN5").
     *             - a string of metrics to compute, separated by a comma (e.g. "accuracy,specificity,precision").
     */
    public static void main(String[] args) {

        //get reduced data set locations
        File folder = new File(args[0]);
        //the input path should only contain directories - one for each method
        File[] listOfDirs = folder.listFiles();

        //get classifiers from input
        String classifierParams = args[5];
        String[] classifiers = classifierParams.split(",");
        //get metrics to use from input
        String metricParams = args[6];
        String[] metrics = metricParams.split(",");

        for (File methodDir : listOfDirs) {

            classifyAndEvaluate(methodDir.getName(), methodDir.getAbsolutePath(), new File(args[1], methodDir.getName()).getAbsolutePath(),
                        Integer.parseInt(args[2]), Integer.parseInt(args[3]),  Integer.parseInt(args[4]), classifiers, metrics);
        }
    }

    /**
     * Runs the overall classification procedure for all feature set sizes of a particular selection approach.
     * Creates the specified classifier objects and file writers for the results.
     * For every feature set size from topKmin to topKmax, invokes an instance of :class:`Analyzer` to carry out the actual classification and compute the metrics.
     *
     * @param selectionMethod the name of the feature selection method that generated the feature sets to evaluate.
     * @param reducedDatasetLocation absolute path to the directory containing the reduced input files (with increasing feature set sizes) for classification.
     * @param resultLocation absolute path prefix for the output files (one file per metric, named <prefix>_<metric>.csv).
     * @param topKmin minimum number of features to use.
     * @param topKmax maximum number of features to use.
     * @param numFolds k parameter for k-fold cross validation.
     * @param classifiers a list of classifier names to use for classification.
     * @param evalMetrics a list of metric names to compute for the classification results.
     */
    private static void classifyAndEvaluate(String selectionMethod, String reducedDatasetLocation, String resultLocation, int topKmin, int topKmax, int numFolds, String[] classifiers, String[] evalMetrics) {
        System.out.println(Integer.toString(Array.getLength(evalMetrics)));
        System.out.println(selectionMethod);
        System.out.println(reducedDatasetLocation);
        System.out.println(resultLocation);

        HashMap<String, CSVWriter> writers = new HashMap<String, CSVWriter>();
        try {
            AbstractClassifier[] classifierObjects = null;
            AbstractClassifier analyzer = null;
            String[] classifierNames = null;
            //create desired classifiers
            for (String method : classifiers) {
                switch (method) {
                    case "SMO":
                        analyzer = new SMO();
                        classifierObjects = (AbstractClassifier[]) ArrayUtils.addAll(classifierObjects, analyzer);
                        classifierNames = (String[]) ArrayUtils.addAll(classifierNames, "SMO");
                        break;
                    case "LR":
                        analyzer = new Logistic();
                        classifierObjects = (AbstractClassifier[]) ArrayUtils.addAll(classifierObjects, analyzer);
                        classifierNames = (String[]) ArrayUtils.addAll(classifierNames, "LR");
                        break;
                    case "KNN3":
                        analyzer = new IBk();
                        ((IBk) analyzer).setKNN(3);
                        classifierObjects = (AbstractClassifier[]) ArrayUtils.addAll(classifierObjects, analyzer);
                        classifierNames = (String[]) ArrayUtils.addAll(classifierNames, "KNN3");
                        break;
                    case "KNN5":
                        analyzer = new IBk();
                        ((IBk) analyzer).setKNN(5);
                        classifierObjects = (AbstractClassifier[]) ArrayUtils.addAll(classifierObjects, analyzer);
                        classifierNames = (String[]) ArrayUtils.addAll(classifierNames, "KNN5");
                        break;
                    case "NB":
                        analyzer = new NaiveBayes();
                        classifierObjects = (AbstractClassifier[]) ArrayUtils.addAll(classifierObjects, analyzer);
                        classifierNames = (String[]) ArrayUtils.addAll(classifierNames, "NB");
                        break;
                    case "C4.5":
                        analyzer = new J48();
                        classifierObjects = (AbstractClassifier[]) ArrayUtils.addAll(classifierObjects, analyzer);
                        classifierNames = (String[]) ArrayUtils.addAll(classifierNames, "C4.5");
                        break;
                    case "RF":
                        analyzer = new RandomForest();
                        classifierObjects = (AbstractClassifier[]) ArrayUtils.addAll(classifierObjects, analyzer);
                        classifierNames = (String[]) ArrayUtils.addAll(classifierNames, "RF");
                        break;
                    default:
                        System.out.println(method + " is not a valid classifier/analysis module. Skipping.");
                        continue;
                }

            }

            for (String metric : evalMetrics){
                String filePath = resultLocation + "_" + metric + ".csv";
                writers.put(metric, new CSVWriter(new FileWriter(filePath), '\t', CSVWriter.NO_QUOTE_CHARACTER,
                                    CSVWriter.DEFAULT_ESCAPE_CHARACTER,
                                    CSVWriter.DEFAULT_LINE_END));

                String [] attributes = {"#ofAttributes"};
                String [] average = {"average"};
                String[] headerstart = (String[]) ArrayUtils.addAll(attributes, classifierNames);
                String[] header = (String[]) ArrayUtils.addAll(headerstart, average);
                writers.get(metric).writeNext(header);
            }


            for (int k = topKmin; k <= topKmax; k++) {
                String datasetFile = "";
                //get the file with k in its name and load its content
                datasetFile = reducedDatasetLocation + "/top" + String.valueOf(k) + "features_" + selectionMethod + ".csv";
                System.out.println("###################################");
                System.out.println(datasetFile);
                if (new File(datasetFile).isFile()){
                    DataLoader dl = new DataLoader(datasetFile, "\t");
                    Instances data = dl.getData();
                    data.deleteAttributeAt(0);
                    data.setClassIndex(0);
                    Analyzer ce = new Analyzer(data);

                    System.out.println(": Starting classification evaluation with models " + Arrays.toString(classifierNames) + " with k of " + k + " [" + datasetFile + "]");

                    HashMap<String, String> results = ce.trainAndEvaluateWithTopKAttributes(k, numFolds, classifierObjects, evalMetrics);
                    for (String metric : writers.keySet()) {
                        CSVWriter writer = writers.get(metric);
                        String resultLine = results.get(metric);
                        String[] line = resultLine.split("\t");
                        writer.writeNext(line);
                    }
                }
                else {
                    System.out.println("No rankings found for k = " + Integer.toString(k) + ". Stop classification for " + selectionMethod + ".");
                    break;
                }
                System.out.println(": Finished classification evaluation with models " + Arrays.toString(classifiers) + " with k of " + k + " [" + datasetFile + "]");
            }

            //close all open file writers
            for (String metric : evalMetrics) {
                writers.get(metric).close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

    }
}

Entry point class for running classification on a data set using only the top 1 up to k features (one classification round per k). Is invoked by evaluation.ClassificationEvaluator to start the classification procedure. Uses DataLoader to load input data set and Analyzer to run the actual classification procedure (and compute evaluation metrics). Summarizes results from all classifiers and all input data sets (depending on how many features were used) and writes them into output files.

public static void main(String[] args)

Processes some of the command line parameters (for classifiers, metrics, and input data set locations). Invokes the classification procedure for every subdirectory (= selection method) contained in the input directory.

Parameters:
  • args – the parameters provided when invoking the jar. Provide the following parameters: a) the absolute path of the directory containing the reduced input data set files (one subdirectory per selection approach), b) the absolute path of the output directory (where to write all evaluation results), c) the minimum number of features to use for classification, d) the maximum number of features to use for classification, e) number of folds for cross validation, f) a string of classifiers, separated by a comma (e.g. “SVM,KNN3,KNN5”), and g) a string of metrics to compute, separated by a comma (e.g. “accuracy,specificity,precision”).
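Only the last two arguments are comma-separated lists; main splits them with String.split. A sketch with hypothetical argument values:

```java
import java.util.Arrays;

// Hypothetical argument values for WEKA_Evaluator, showing which positions
// are split on commas (args[5] and args[6]).
public class EvaluatorArgsSketch {

    static String[] splitParam(String commaSeparated) {
        return commaSeparated.split(",");
    }

    public static void main(String[] args) {
        String[] example = {
            "/data/reduced",  // args[0]: one subdirectory per selection approach
            "/data/results",  // args[1]: output directory
            "1",              // args[2]: minimum number of features
            "50",             // args[3]: maximum number of features
            "10",             // args[4]: folds for cross validation
            "SMO,KNN3,KNN5",  // args[5]: classifiers
            "accuracy,F1"     // args[6]: metrics
        };
        System.out.println(Arrays.toString(splitParam(example[5])));
        System.out.println(Arrays.toString(splitParam(example[6])));
    }
}
```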
private static void classifyAndEvaluate(String selectionMethod, String reducedDatasetLocation, String resultLocation, int topKmin, int topKmax, int numFolds, String[] classifiers, String[] evalMetrics)

Runs the overall classification procedure for all feature set sizes of a particular selection approach. Creates the specified classifier objects and file writers for the results. For every feature set size from topKmin to topKmax, invokes an instance of Analyzer to carry out the actual classification and compute the metrics.

Parameters:
  • selectionMethod – the name of the feature selection method that generated the feature sets to evaluate.
  • reducedDatasetLocation – absolute path to the directory containing the reduced input files (with increasing feature set sizes) for classification.
  • resultLocation – absolute path prefix for the output files (one file per metric, named <prefix>_<metric>.csv).
  • topKmin – minimum number of features to use.
  • topKmax – maximum number of features to use.
  • numFolds – k parameter for k-fold cross validation.
  • classifiers – a list of classifier names to use for classification.
  • evalMetrics – a list of metric names to compute for the classification results.
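classifyAndEvaluate expects one reduced input file per feature set size k inside reducedDatasetLocation; the naming pattern it probes (and stops on when missing) can be captured by a small helper (class name hypothetical):

```java
// Reproduces the per-k file name classifyAndEvaluate looks up; classification
// stops at the first k for which no such file exists.
public class DatasetNameSketch {

    static String datasetFile(String reducedDatasetLocation, int k, String selectionMethod) {
        return reducedDatasetLocation + "/top" + k + "features_" + selectionMethod + ".csv";
    }
}
```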
public class Analyzer
package de.hpi.bmg;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import java.io.IOException;
import java.util.logging.Logger;
import weka.classifiers.AbstractClassifier;
import java.util.HashMap;

import java.util.Random;

/**
 * Carries out the actual k-fold cross validation on the specified classifiers.
 * Computes the desired evaluation metrics.
 * Uses WEKA.
 */
public class Analyzer {
    private Instances data;

    private final static Logger LOGGER = Logger.getLogger(Analyzer.class.getName());

    /**
     * Constructor method.
     *
     * @param data the data set to use for classification.
     */
    public Analyzer(Instances data) {
        this.data = data;
    }

    /**
     * Runs the actual classification procedure.
     * Uses WEKA to run multiple classifiers (originally specified in config file) in a k-fold cross validation manner.
     * Computes standard evaluation metrics as required afterwards.
     *
     * @param numberOfAttributesRetained the number of top-ranked attributes retained in the data set (used to label the result rows).
     * @param numFolds number of folds for cross validation.
     * @param classifiers a list of classifier objects to use for classification.
     * @param metrics a list of names of evaluation metrics to compute for the results.
     * @return the evaluation results as a HashMap with the metric name as key and the metric results (per classifier, plus their average) as tab-separated values.
     */
    public HashMap<String, String> trainAndEvaluateWithTopKAttributes(int numberOfAttributesRetained, int numFolds, AbstractClassifier[] classifiers, String[] metrics) {


        HashMap<String, String> returnStrings = new HashMap<String, String>();
        HashMap<String, Double> sums = new HashMap<String, Double>();
        Evaluation eval = null;

        //initialize maps for return strings and average computations
        for (String metric : metrics){
            String startString = Integer.toString(numberOfAttributesRetained);
            returnStrings.put(metric, startString);
            sums.put(metric, 0.0);
        }

        try {
            eval = new Evaluation(this.data);

            double sum = 0.0d;

            for (AbstractClassifier analyzer : classifiers){

                //run the analysis
                eval.crossValidateModel(analyzer, this.data, numFolds, new Random(1));
                for (String metric : metrics) {
                    String returnString = returnStrings.get(metric);
                    double metricVal = 0.0;
                    switch (metric) {
                        case "accuracy":
                            metricVal = eval.pctCorrect();
                            break;
                        case "kappa":
                            metricVal = eval.kappa();
                            break;
                        case "AUROC":
                            metricVal = eval.weightedAreaUnderROC();
                            break;
                        case "sensitivity":
                            metricVal = eval.weightedTruePositiveRate();
                            break;
                        case "specificity":
                            metricVal = eval.weightedTrueNegativeRate();
                            break;
                        case "F1":
                            metricVal = eval.weightedFMeasure();
                            break;
                        case "matthewcoef":
                            metricVal = eval.weightedMatthewsCorrelation();
                            break;
                        case "precision":
                            metricVal = eval.weightedPrecision();
                            break;
                    }
                    returnString += "\t" + String.valueOf(metricVal);
                    returnStrings.put(metric, returnString);
                    //update overall sum for average computation
                    sum = sums.get(metric);
                    sums.put(metric, sum + metricVal);
                }
            }

            for (String metric : metrics){
                String returnString = returnStrings.get(metric);
                returnString += "\t" + (sums.get(metric) / classifiers.length);
                returnStrings.put(metric, returnString);
            }

        } catch (Exception e) {
            e.printStackTrace();
        }
        return returnStrings;
    }
}

Carries out the actual k-fold cross validation on the specified classifiers. Computes the desired evaluation metrics. Uses WEKA. Invoked by WEKA_Evaluator.
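Analyzer delegates the cross validation itself to WEKA's Evaluation.crossValidateModel. The underlying idea: partition the n instances into numFolds disjoint test folds, training on the rest each round. A conceptual sketch of the fold assignment (WEKA's implementation additionally shuffles with the supplied Random; this sketch does not):

```java
// Conceptual fold assignment for k-fold cross validation: instance i is
// tested in fold i % numFolds and used for training in all other folds.
public class FoldSketch {

    static int[] foldOf(int n, int numFolds) {
        int[] fold = new int[n];
        for (int i = 0; i < n; i++) {
            fold[i] = i % numFolds;
        }
        return fold;
    }
}
```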

public Analyzer(Instances data)

Constructor method.

Parameters:
  • data – the data set to use for classification.

public HashMap<String, String> trainAndEvaluateWithTopKAttributes(int numberOfAttributesRetained, int numFolds, AbstractClassifier[] classifiers, String[] metrics)

Runs the actual classification procedure. Uses WEKA to run multiple classifiers (originally specified in config file) in a k-fold cross validation manner. Computes standard evaluation metrics as required afterwards.

Parameters:
  • numberOfAttributesRetained – the number of top-ranked attributes retained in the data set (used to label the result rows).
  • numFolds – number of folds for cross validation.
  • classifiers – a list of classifier objects to use for classification.
  • metrics – a list of names of evaluation metrics to compute for the results.
Returns:

the evaluation results as a HashMap with the metric name as key and the metric results (per classifier, plus their average) as tab-separated values.
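The tab-separated rows that trainAndEvaluateWithTopKAttributes returns per metric (and that WEKA_Evaluator then splits and writes out) have the shape: k, one value per classifier, then the average. A stand-alone sketch with made-up metric values:

```java
// Builds one result row the way trainAndEvaluateWithTopKAttributes does:
// the retained-attribute count, one metric value per classifier, then the
// average across classifiers, all tab-separated.
public class ResultRowSketch {

    static String row(int numberOfAttributesRetained, double[] perClassifierValues) {
        StringBuilder row = new StringBuilder(Integer.toString(numberOfAttributesRetained));
        double sum = 0.0;
        for (double value : perClassifierValues) {
            row.append("\t").append(value);
            sum += value;
        }
        row.append("\t").append(sum / perClassifierValues.length);
        return row.toString();
    }

    public static void main(String[] args) {
        System.out.println(row(5, new double[]{90.0, 80.0})); // k = 5, two accuracies, their average
    }
}
```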