Computational Biology Exam: Feature Selection, Clustering, and Classification
Feature Selection
- Understanding concept, facts, and implementation
- Example using real datasets in Python & WEKA
- Screenshots (showing how you implemented it)
- Solving problems
- Other
- Data information – attributes and instances
- Parameter tuning/setting – in ‘Select Attribute’
- The output from ‘Select Attribute’
What is feature selection?
Reducing the feature space by removing non-relevant features. Also known as:
- variable selection
- feature reduction
- attribute selection
- variable subset selection
Feature selection methods aim to eliminate noisy, redundant, or irrelevant features that can degrade classification performance. In machine learning applications to large biomedical datasets, feature selection plays an important role in disease classification.
Why use feature selection?
- It is cheaper to measure fewer variables.
- Prediction accuracy may improve.
- It mitigates the curse of dimensionality.
Feature Selection Methods
1. Filter
Based on an evaluation criterion that quantifies how well features (or feature subsets) discriminate between the classes. The selection relies on data-related measures, such as separability or crowding, without involving a learning algorithm (a Python sketch follows after this list).
2. Wrapper
The wrapper method requires a machine learning algorithm and uses its performance as the evaluation criterion. It searches for the feature subset best suited to that algorithm, aiming to improve mining performance. Features are evaluated by predictive accuracy for classification tasks and by goodness of clustering for clustering tasks (a sketch follows after the dataset note below).
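A minimal Python sketch of the filter approach, using scikit-learn's SelectKBest; the ANOVA F-test scoring and k = 10 are illustrative assumptions, not prescribed settings.

```python
# Filter method sketch: score each feature independently (ANOVA F-test)
# and keep the k highest-scoring ones. No classifier is involved, which
# is what makes this a filter rather than a wrapper.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)   # 569 instances, 30 features

selector = SelectKBest(score_func=f_classif, k=10)  # k = 10 is illustrative
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)        # (569, 30) -> (569, 10)
```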
DATASET
Breast Cancer Wisconsin (Diagnostic) dataset.
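This dataset ships with scikit-learn as load_breast_cancer, so a hedged sketch of the wrapper approach can use recursive feature elimination (RFE); logistic regression as the wrapped learner and 10 selected features are illustrative assumptions.

```python
# Wrapper method sketch: RFE repeatedly trains the wrapped learner and
# drops the weakest features (smallest coefficients) until 10 remain,
# so the learner's own performance drives the selection.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target

# Scale first so logistic regression converges and its coefficients
# are comparable across features.
X_scaled = StandardScaler().fit_transform(X)

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X_scaled, y)

print("Selected features:", list(data.feature_names[rfe.support_]))
```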
Clustering
- K-means: parameter tuning/setting. Students need to know how to set the parameters of SimpleKMeans in WEKA.
- Need to understand how K-means clustering performance relates to parameter tuning.
- Cluster process vs Classify process in WEKA
Unsupervised learning
- Requires data but no labels
K-means
An iterative clustering algorithm
- Initialize: pick K random points as cluster centers
- Alternate:
- Assign data points to the closest cluster center
- Change the cluster center to the average of its assigned points
- Stop when no point's assignment changes.
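A minimal NumPy sketch of the loop above (initialization and stopping rule exactly as listed; empty clusters are left unhandled, which a production implementation such as Weka's SimpleKMeans would take care of):

```python
import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Initialize: pick k random points as the cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    assignment = np.full(len(X), -1)
    while True:
        # Assign each point to the closest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        # Stop when no point's assignment changes.
        if np.array_equal(new_assignment, assignment):
            return centers, assignment
        assignment = new_assignment
        # Move each center to the average of its assigned points.
        for j in range(k):
            if np.any(assignment == j):   # empty clusters not handled here
                centers[j] = X[assignment == j].mean(axis=0)
```

In Weka's SimpleKMeans, the corresponding settings are numClusters (K) and the seed that controls the random initialization.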
Supervised vs unsupervised
Supervised learning (e.g., classification) trains on labeled data, while unsupervised learning (e.g., clustering) finds structure in data without labels.
Classification
- Support Vector Machines — SMO
- In WEKA: Classify tab – classifiers – functions – SMO
- SVM in WEKA: parameter tuning/setting. Students need to know how to set the important parameters of SMO in WEKA.
- Need to understand how SVM classification performance relates to parameter tuning.
1. Load the ARFF file (a Python counterpart is sketched below).
2. Set the target (class) variable.
3. Open the Classify tab.
4. Choose an algorithm.
5. Select SMO/SVM.
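The steps above refer to the Weka GUI. For reference, an ARFF file can also be read in Python with SciPy; the file name below is a placeholder, not a file provided with these notes.

```python
# Reading an ARFF file outside Weka. Nominal attributes come back as
# byte strings and may need decoding before further processing.
from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff("breast-cancer.arff")  # placeholder path
df = pd.DataFrame(data)
print(meta)        # attribute names and types
print(df.head())   # first few instances
```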
Choose the SVM algorithm:
- Click the “Choose” button and select “SMO” under the “functions” group.
- Click on the name of the algorithm to review the algorithm configuration.
SMO stands for Sequential Minimal Optimization, the efficient optimization algorithm used inside Weka's SVM implementation.
The C parameter, called the complexity parameter in Weka, controls how flexible the process for drawing the line that separates the classes can be: larger values penalize margin violations more heavily (approaching a hard margin), while smaller values tolerate more violations. The default in Weka is 1.
A key parameter in SVM is the type of kernel to use. The simplest kernel is a linear kernel, which separates the data with a straight line or hyperplane. The default in Weka is a polynomial kernel, which separates the classes with a curved or wiggly line; the higher the exponent of the polynomial, the more wiggly the boundary.
A popular and powerful kernel is the RBF Kernel or Radial Basis Function Kernel that is capable of learning closed polygons and complex shapes to separate the classes.
It is a good idea to try a suite of different kernels and C (complexity) values on your problem and see what works best (a scripted example follows at the end of this section).
- Click “OK” to close the algorithm configuration.
Click the “Start” button to run the algorithm on the loaded dataset.
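The kernel and C tuning advice above can also be scripted outside Weka. A hedged scikit-learn sketch on the Breast Cancer Wisconsin dataset, where the parameter grid is an illustrative assumption rather than a recommended setting:

```python
# Try a suite of kernels and C values with cross-validated grid search,
# mirroring the tuning advice above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC())
grid = {
    "svc__kernel": ["linear", "poly", "rbf"],  # linear / polynomial / RBF
    "svc__C": [0.1, 1, 10, 100],               # complexity parameter
}
search = GridSearchCV(pipe, grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy: %.3f" % search.best_score_)
```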