Computational Biology Exam: Feature Selection, Clustering, and Classification
Feature Selection
- Understanding concept, facts, and implementation
- Example using real datasets in Python & WEKA
- Screenshots (showing how you implemented it)
- Solving problems
- Other
- Data information – attributes and instances
- Parameter tuning/setting – in ‘Select Attribute’
- The output from ‘Select Attribute’
What is feature selection?
Reducing the feature space by removing non-relevant features. Also known as:
- variable selection
- feature reduction
- attribute selection
- variable subset selection
Feature selection methods aim to eliminate noisy, redundant, or irrelevant features that can degrade classification performance. In machine learning applications to large biomedical datasets, feature selection plays an important role in disease classification.
Why use feature selection?
- It is cheaper to measure fewer variables.
- Prediction accuracy may improve.
- It mitigates the curse of dimensionality.
Feature Selection Methods
1. Filter
Based on an evaluation criterion that quantifies how well features (or feature subsets) discriminate between the classes. The selection relies on data-related measures, such as separability or crowding, without involving a learning algorithm (a Python sketch follows after this list).
2. Wrapper
The wrapper method requires a machine learning algorithm and uses its performance as the evaluation criterion. It searches for the feature subset best suited to that algorithm, aiming to improve mining performance. Features are evaluated by predictive accuracy for classification tasks and by goodness of clustering for clustering tasks (a sketch follows after the dataset note below).
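A minimal Python sketch of the filter approach, using scikit-learn's SelectKBest; the ANOVA F-test scoring and k = 10 are illustrative assumptions, not prescribed settings.

```python
# Filter method sketch: score each feature independently (ANOVA F-test)
# and keep the k highest-scoring ones. No classifier is involved, which
# is what makes this a filter rather than a wrapper.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)   # 569 instances, 30 features

selector = SelectKBest(score_func=f_classif, k=10)  # k = 10 is illustrative
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)        # (569, 30) -> (569, 10)
```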
DATASET
Breast Cancer Wisconsin (Diagnostic) dataset.
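This dataset ships with scikit-learn as load_breast_cancer, so a hedged sketch of the wrapper approach can use recursive feature elimination (RFE); logistic regression as the wrapped learner and 10 selected features are illustrative assumptions.

```python
# Wrapper method sketch: RFE repeatedly trains the wrapped learner and
# drops the weakest features (smallest coefficients) until 10 remain,
# so the learner's own performance drives the selection.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target

# Scale first so logistic regression converges and its coefficients
# are comparable across features.
X_scaled = StandardScaler().fit_transform(X)

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X_scaled, y)

print("Selected features:", list(data.feature_names[rfe.support_]))
```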
Clustering
- K-means: parameter tuning/setting. Students need to know how to set the parameters of SimpleKMeans in WEKA.
- Need to understand how K-means clustering performance relates to parameter tuning.
- Cluster process vs Classify process in WEKA
Unsupervised learning
- Requires data but no labels
K-means
An iterative clustering algorithm
- Initialize: pick K random points as cluster centers
- Alternate:
- Assign data points to the closest cluster center
- Change the cluster center to the average of its assigned points
- Stop when no point's assignment changes.
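A minimal NumPy sketch of the loop above (initialization and stopping rule exactly as listed; empty clusters are left unhandled, which a production implementation such as Weka's SimpleKMeans would take care of):

```python
import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Initialize: pick k random points as the cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    assignment = np.full(len(X), -1)
    while True:
        # Assign each point to the closest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        # Stop when no point's assignment changes.
        if np.array_equal(new_assignment, assignment):
            return centers, assignment
        assignment = new_assignment
        # Move each center to the average of its assigned points.
        for j in range(k):
            if np.any(assignment == j):   # empty clusters not handled here
                centers[j] = X[assignment == j].mean(axis=0)
```

In Weka's SimpleKMeans, the corresponding settings are numClusters (K) and the seed that controls the random initialization.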
Supervised vs unsupervised
Supervised learning (e.g., classification) trains on labeled data, while unsupervised learning (e.g., clustering) finds structure in data without labels.
Classification
- Support Vector Machines — SMO
- In WEKA: Classify tab – classifiers – functions – SMO
- SVM in WEKA: parameter tuning/setting. Students need to know how to set the important parameters of SMO in WEKA.
- Need to understand how SVM classification performance relates to parameter tuning.
1. Load the ARFF file (a Python counterpart is sketched below).
2. Set the target (class) variable.
3. Open the Classify tab.
4. Choose an algorithm.
5. Select SMO/SVM.
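The steps above refer to the Weka GUI. For reference, an ARFF file can also be read in Python with SciPy; the file name below is a placeholder, not a file provided with these notes.

```python
# Reading an ARFF file outside Weka. Nominal attributes come back as
# byte strings and may need decoding before further processing.
from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff("breast-cancer.arff")  # placeholder path
df = pd.DataFrame(data)
print(meta)        # attribute names and types
print(df.head())   # first few instances
```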
Choose the SVM algorithm:
- Click the “Choose” button and select “SMO” under the “functions” group.
- Click on the name of the algorithm to review the algorithm configuration.
SMO stands for Sequential Minimal Optimization, the efficient optimization algorithm used inside Weka's SVM implementation.
The C parameter, called the complexity parameter in Weka, controls how flexible the process for drawing the line that separates the classes can be: larger values penalize margin violations more heavily (approaching a hard margin), while smaller values tolerate more violations. The default in Weka is 1.
A key parameter in SVM is the type of kernel to use. The simplest kernel is a linear kernel, which separates the data with a straight line or hyperplane. The default in Weka is a polynomial kernel, which separates the classes with a curved or wiggly line; the higher the exponent of the polynomial, the more wiggly the boundary.
A popular and powerful kernel is the RBF Kernel or Radial Basis Function Kernel that is capable of learning closed polygons and complex shapes to separate the classes.
It is a good idea to try a suite of different kernels and C (complexity) values on your problem and see what works best (a scripted example follows at the end of this section).
- Click “OK” to close the algorithm configuration.
Click the “Start” button to run the algorithm on the loaded dataset.
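The kernel and C tuning advice above can also be scripted outside Weka. A hedged scikit-learn sketch on the Breast Cancer Wisconsin dataset, where the parameter grid is an illustrative assumption rather than a recommended setting:

```python
# Try a suite of kernels and C values with cross-validated grid search,
# mirroring the tuning advice above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC())
grid = {
    "svc__kernel": ["linear", "poly", "rbf"],  # linear / polynomial / RBF
    "svc__C": [0.1, 1, 10, 100],               # complexity parameter
}
search = GridSearchCV(pipe, grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy: %.3f" % search.best_score_)
```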