Peptide/Protein Sequence Feature Engineering Using iFeature [Bioinformatics Eps. 4]
Computational biology and bioinformatics cannot be separated from the development of artificial intelligence and machine learning. Applying machine learning to biological data brings growing benefits to research: it makes patterns and insights easier to find, saves time, and enables more sophisticated ways of computing on data.
The steps of machine learning in bioinformatics are the same as in any other machine learning workflow: input the data, extract features, train and apply the model, and obtain the output. All of these steps are needed to build the model you want.

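The biological data that we will specifically discuss here is protein sequence data. A protein sequence is a chain of amino acids that determines the protein's function, and it is written using 20 one-letter symbols, one for each standard amino acid. For computation, protein sequences are usually stored in FASTA format, where each record has a header line starting with ">" followed by one or more lines of sequence.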
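To make the FASTA format concrete, here is a minimal sketch of a parser in plain Python. The function name `parse_fasta` and the two toy records are made up for illustration; they are not part of iFeature, and the sequences are not real proteins.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; the header is everything after ">"
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so append them to the current record
            records[header] += line.upper()
    return records


# Hypothetical toy input (sequences invented for illustration only)
example = """>seq1 toy peptide
MKTAYIAK
QRQISFVK
>seq2 another toy peptide
ACDEFGHIK
"""

records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq), seq)
```

For real projects, an established parser such as the one in Biopython is usually a safer choice than a hand-written one.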

A machine learning model cannot work on this protein sequence data directly. Each sequence must first be converted into a vector of numerical values, called features, that represent conditions or properties of the sequence. There are several ways to transform protein sequences into features. Examples include:
- Amino Acid Composition (AAC)

- Dipeptide Composition (DPC)

- Position-Specific Scoring Matrix (PSSM)
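Of the descriptors listed above, AAC and DPC are simple enough to compute by hand, which helps to show what a "feature vector" looks like. The sketch below is my own illustration, not iFeature's code: AAC is the fraction of each of the 20 amino acids in the sequence (20 values), and DPC is the fraction of each possible adjacent pair (400 values). The helper names `aac` and `dpc` and the toy sequence are assumptions. PSSM is not shown because it requires an alignment profile (for example from PSI-BLAST) rather than the sequence alone.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def aac(sequence: str) -> List[float]:
    """Amino acid composition: fraction of each standard residue."""
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    n = len(residues)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(a) / n for a in AMINO_ACIDS]


def dpc(sequence: str) -> List[float]:
    """Dipeptide composition: fraction of each adjacent residue pair."""
    s = sequence.upper()
    # Keep only pairs where both residues are standard amino acids
    pairs = [
        s[i:i + 2]
        for i in range(len(s) - 1)
        if s[i] in AMINO_ACIDS and s[i + 1] in AMINO_ACIDS
    ]
    n = len(pairs)
    if n == 0:
        return [0.0] * len(DIPEPTIDES)
    return [pairs.count(d) / n for d in DIPEPTIDES]


toy = "ACAC"  # hypothetical toy sequence
aac_vec = aac(toy)
dpc_vec = dpc(toy)
print(len(aac_vec), len(dpc_vec))
print(dict(zip(AMINO_ACIDS, aac_vec))["A"])
print(dict(zip(DIPEPTIDES, dpc_vec))["AC"])
```

Whatever the sequence length, AAC always yields 20 numbers and DPC always yields 400, which is what makes sequences of different lengths comparable for a model.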

To make this feature extraction easier, there is a web-based application that performs feature engineering for protein sequence data: iFeature.
iFeature can calculate and extract a comprehensive spectrum of 18 major sequence encoding schemes, which cover 53 different types of feature descriptors. In addition to the default parameters, iFeature lets you extract specific amino acid properties from the AAindex database. iFeature also integrates 12 commonly used algorithms for feature clustering, feature selection, and dimensionality reduction. This makes it much easier to build, analyze, train, and compare machine learning and prediction models.

Features that can be extracted are as follows

As an example, we will run feature engineering on protein sequences obtained from BLAST for the [Canine coronavirus] spike protein.

After preparing the data, we will go to the iFeature website.
1. Input the data

2. Select the features to be generated

3. Select a feature cluster algorithm (optional)

4. Input labels for feature viewers (optional)

After configuring these options, submit the job for processing. The server displays the job ID while the computation is running.

When the computation is complete, a results page appears where we can view the results and download them.

The output of feature engineering is in the form of files for the selected features.

- AAC Features Data

- DPC Features Data

- GAAC Features Data
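To give a feel for what the GAAC output represents, here is a small sketch of Grouped Amino Acid Composition. Instead of counting 20 individual amino acids, the residues are pooled into a few physicochemical groups and the fraction of each group is reported. The five groups below are an assumption on my part based on a commonly used grouping; check the iFeature documentation for the exact definition it uses. The function name `gaac` and the toy sequence are also made up for illustration.

```python
from typing import Dict

# Assumed grouping of the 20 standard amino acids into five classes
GROUPS: Dict[str, str] = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive_charge": "KRH",
    "negative_charge": "DE",
    "uncharged": "STCPNQ",
}


def gaac(sequence: str) -> Dict[str, float]:
    """Grouped amino acid composition: fraction of residues in each group."""
    seq = sequence.upper()
    counts = {
        name: sum(seq.count(residue) for residue in members)
        for name, members in GROUPS.items()
    }
    total = sum(counts.values())  # non-standard symbols are ignored
    if total == 0:
        return {name: 0.0 for name in GROUPS}
    return {name: count / total for name, count in counts.items()}


toy = "GAFKD"  # hypothetical toy sequence
gaac_vec = gaac(toy)
for group, value in gaac_vec.items():
    print(f"{group}\t{value:.2f}")
```

With this grouping, every sequence is reduced to just five numbers, which is far more compact than AAC or DPC but also loses more detail.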
