Peptide/Protein Sequence Feature Engineering Using iFeature [Bioinformatics Eps. 4]

Feb 6, 2022

Computational biology and bioinformatics cannot be separated from the development of artificial intelligence and machine learning. Applying machine learning to biological data brings growing benefits to research: it makes patterns and insights easier to find, saves time, and enables more complex computational analyses.

The steps of machine learning in bioinformatics are the same as in any other machine learning workflow: data input, feature extraction, model training and application, and output. Each of these steps is needed to build the model you want.

Figure 1 Machine learning flow

The biological data discussed here is protein sequence data. A protein sequence is a chain of amino acids that determines the protein's function, and each of the 20 standard amino acids is represented by a one-letter symbol. For computation, protein sequences are written in FASTA format.
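As a rough illustration of the FASTA format, the sketch below reads FASTA text into a dictionary that maps each header to its sequence. The function name `parse_fasta` and the two demo records are made up for this example and are not part of iFeature; the sketch only assumes the usual FASTA convention of a `>` header line followed by one or more sequence lines.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} dictionary.

    Hypothetical helper for illustration only, not an iFeature function.
    """
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header line starts a new record; drop the leading ">".
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may wrap, so append them to the current record.
            records[header] += line.upper()
    return records


# Made-up demo records (not real database entries).
example = """>demo_peptide_1
MKTAYIAKQR
QISFVKSHFS
>demo_peptide_2
ACDEFGHIKL
"""

records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq), seq)
```

Wrapped sequence lines are joined, so `demo_peptide_1` ends up as a single 20-residue string. For real projects, an established parser such as the one in Biopython is usually a safer choice than a hand-written one.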

Figure 2 Representation of amino acid symbols in FASTA data

A machine learning model cannot work with this sequence data directly. Each sequence must first be converted into a vector of numerical values, which are the features of the data. There are several ways to transform protein sequences into features, each representing a different condition or property of the data. Examples include:

Amino Acid Composition (AAC)

Dipeptide Pair Composition (DPC)

Figure 3 Dipeptide Pair Composition Example

Position-Specific Scoring Matrix (PSSM)

Figure 4 Position-Specific Scoring Matrix Example
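The first two descriptors in the list above can be sketched in a few lines of Python. This is a minimal sketch, not iFeature's own code: the names `aac`, `dpc`, and `AMINO_ACIDS` are made up for illustration, and it assumes the common definitions in which AAC is each amino acid's count divided by the sequence length, and DPC is each adjacent pair's count divided by the number of adjacent pairs (length minus one). PSSM is left out because it requires an alignment search against a sequence database rather than simple counting.

```python
from itertools import product
from typing import Dict

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence: str) -> Dict[str, float]:
    """Amino Acid Composition: frequency of each of the 20 amino acids.

    Assumed definition: count / sequence length. Non-standard symbols
    count toward the length but get no feature of their own.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return {aa: seq.count(aa) / len(seq) for aa in AMINO_ACIDS}


def dpc(sequence: str) -> Dict[str, float]:
    """Dipeptide composition: frequency of each of the 400 adjacent pairs.

    Assumed definition: pair count / (sequence length - 1).
    """
    seq = sequence.upper()
    total_pairs = len(seq) - 1
    if total_pairs < 1:
        raise ValueError("sequence must contain at least two residues")
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(total_pairs):
        pair = seq[i:i + 2]
        if pair in counts:  # ignore pairs with non-standard symbols
            counts[pair] += 1
    return {pair: n / total_pairs for pair, n in counts.items()}


# Toy sequence chosen so the numbers are easy to check by hand.
aac_vec = aac("ACACD")
dpc_vec = dpc("ACACD")
print(len(aac_vec), aac_vec["A"], aac_vec["D"])
print(len(dpc_vec), dpc_vec["AC"], dpc_vec["CA"], dpc_vec["CD"])
```

For the toy sequence `ACACD`, A and C each make up 2 of 5 residues (0.4) and D makes up 1 of 5 (0.2). The four adjacent pairs are AC, CA, AC, and CD, so AC has frequency 0.5 and the other two have 0.25. Every sequence, whatever its length, becomes a fixed-length vector of 20 (AAC) or 400 (DPC) values, which is what makes these descriptors usable as model input.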

To make this feature extraction easier, there is a tool that performs feature engineering for protein sequence data: iFeature, which is available as both a web server and a Python package.

iFeature can calculate and extract a comprehensive spectrum of 18 major sequence encoding schemes, which together cover 53 different types of feature descriptors. Beyond these, iFeature lets you extract specific amino acid properties from the AAindex database. It also integrates 12 commonly used algorithms for feature clustering, feature selection, and dimensionality reduction. This makes it much easier to build, analyze, train, and compare machine learning prediction models.