Peptide/Protein Sequence Feature Engineering Using iFeature [Bioinformatics Eps. 4]

Faris Izzatur Rahman
4 min read · Feb 6, 2022


Computational biology and bioinformatics cannot be separated from the development of artificial intelligence and machine learning. Applying machine learning to biological data brings growing benefits to research: it makes patterns and insights easier to find, saves time, and enables more sophisticated ways of analyzing data.

Machine learning on bioinformatics data follows the same stages as any other machine learning workflow: data input, feature extraction, model training, and output. Each stage must be completed to obtain the desired model.

Figure 1 Machine Learning flow
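
To make the four stages concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the sequences and the labels "basic" and "acidic" are made up, the features are simple amino acid frequencies, and the nearest-centroid model is just one easy choice, not something the article or iFeature prescribes.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def extract_features(sequence):
    """Feature extraction: amino acid frequencies as a 20-value vector."""
    counts = Counter(sequence)
    return [counts[aa] / len(sequence) for aa in AMINO_ACIDS]


def train(examples):
    """Model: average the feature vectors of each class (nearest centroid)."""
    grouped = {}
    for sequence, label in examples:
        grouped.setdefault(label, []).append(extract_features(sequence))
    return {
        label: [sum(column) / len(column) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(model, sequence):
    """Output: the label whose centroid is closest to the sequence."""
    features = extract_features(sequence)

    def squared_distance(centroid):
        return sum((f - c) ** 2 for f, c in zip(features, centroid))

    return min(model, key=lambda label: squared_distance(model[label]))


# Data input: hypothetical toy sequences, not real proteins.
training_data = [
    ("KKRKRKKA", "basic"),
    ("RRKKHKRK", "basic"),
    ("DDEEDEDA", "acidic"),
    ("EEDDEDEE", "acidic"),
]

model = train(training_data)
print(predict(model, "KRKKRKHK"))
```

In a real project the feature extraction stage is the part a tool such as iFeature takes over, and the model would usually come from a machine learning library.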

The biological data we discuss here is protein sequence data. A protein sequence is a chain of amino acids that determines the protein's function, with each of the 20 standard amino acids represented by a one-letter symbol. For computation, protein sequences are written in FASTA format.

Figure 2 Representation of amino acid symbols in FASTA data
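
As a rough illustration of the format, the sketch below reads FASTA text into (header, sequence) pairs. A FASTA record is a header line starting with ">" followed by one or more lines of sequence. The function name `parse_fasta` and the two example records are hypothetical, written only for this example.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; join them into one string.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical records, made up for illustration.
example = """>seq1 hypothetical example record
MKTAYIAKQR
QISFVKSHFS
>seq2 another hypothetical record
MFVFLVLLPL
"""

for name, sequence in parse_fasta(example):
    print(name, len(sequence), sequence)
```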

Computers cannot learn directly from raw protein sequences. Each sequence must first be converted into a vector of numerical values, the features of the data. There are several ways to transform protein sequence data into features, each representing a different property of the sequence. Examples include:

  • Amino Acid Composition (AAC)
Figure 3 Amino Acid Composition Example
  • Dipeptide Composition (DPC)
Figure 3 Dipeptide Composition Example
  • Position-Specific Scoring Matrix (PSSM)
Figure 4 Position-Specific Scoring Matrix Example
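
The first two descriptors are simple enough to sketch by hand. The code below assumes the usual definitions: AAC is the count of each amino acid divided by the sequence length, and DPC is the count of each adjacent amino acid pair divided by the number of pairs. The function names and the test sequence are made up for this example, and the code is not iFeature's own implementation. PSSM is left out because it needs an alignment search against a sequence database.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def aac(sequence):
    """Amino Acid Composition: 20 frequencies, one per amino acid."""
    counts = Counter(sequence)
    return {aa: counts[aa] / len(sequence) for aa in AMINO_ACIDS}


def dpc(sequence):
    """Dipeptide Composition: 400 frequencies, one per ordered pair.

    Assumes the sequence has at least two residues.
    """
    pairs = Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))
    total = len(sequence) - 1
    return {a + b: pairs[a + b] / total
            for a, b in product(AMINO_ACIDS, repeat=2)}


sequence = "MKKLAAK"  # hypothetical peptide, made up for illustration

aac_vector = aac(sequence)
dpc_vector = dpc(sequence)

print(len(aac_vector), len(dpc_vector))
print({aa: round(v, 3) for aa, v in aac_vector.items() if v > 0})
print({pair: round(v, 3) for pair, v in dpc_vector.items() if v > 0})
```

Residues outside the 20 standard letters, such as X, are ignored by this sketch, so the frequencies of a sequence containing them will sum to less than 1.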

To make feature extraction easier, there is a web-based application that computes features for protein sequence data: iFeature.

iFeature can calculate and extract a comprehensive spectrum of 18 major sequence encoding schemes, which together cover 53 different types of feature descriptors. In addition to the default parameters, iFeature lets you extract specific amino acid properties from the AAindex database. It also integrates 12 commonly used algorithms for feature clustering, feature selection, and dimensionality reduction. This makes it much easier to generate, analyze, and compare features when training machine learning prediction models.

Figure 5 iFeature dashboard view

The features that can be extracted are as follows:

Figure 6 List of features that can be extracted by iFeature

As an example, we will do feature engineering on protein data from a BLAST search for the [Canine coronavirus] spike protein.

Figure 7 Preview data

After preparing the data, we will go to the iFeature website.

  1. Input the data
Figure 8 Input the protein sequence

2. Select the features to be generated

Figure 9 Selecting the features to be generated

3. Select a feature clustering algorithm (optional)

Figure 10 Selecting the clustering algorithm

4. Input labels for feature viewers (optional)

Figure 11 Input the label

After configuring the job, submit it for processing. The server displays the job ID while the computation is running.

Figure 12 JobID for the process

When the process is complete, the following page appears, where we can view and download the results.

Figure 13 The result of the process

The output of feature engineering is a set of files for the selected features.

Figure 14 All result files from the feature engineering
  • AAC Features Data
Figure 15 AAC Features Data
  • DPC Features Data
Figure 16 DPC Features Data
  • GAAC Features Data
Figure 17 GAAC Features Data
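
GAAC (Grouped Amino Acid Composition) is the only one of these three outputs not described earlier, so here is a small sketch of the idea. Instead of 20 frequencies it reports one frequency per group of amino acids with similar physicochemical properties. The five groups below follow the grouping as I understand iFeature uses it; treat the exact group membership, the group names, and the test sequence as assumptions made for this example.

```python
# Assumed grouping of the 20 standard amino acids into five classes.
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive_charge": "KRH",
    "negative_charge": "DE",
    "uncharged": "STCPNQ",
}


def gaac(sequence):
    """Grouped Amino Acid Composition: one frequency per group."""
    return {
        name: sum(sequence.count(aa) for aa in members) / len(sequence)
        for name, members in GROUPS.items()
    }


sequence = "MKKLAAK"  # hypothetical peptide, made up for illustration

gaac_vector = gaac(sequence)
for name, value in gaac_vector.items():
    print(name, round(value, 3))
```

Because every standard amino acid belongs to exactly one group, the five values sum to 1 for a sequence made only of standard residues.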
