Exploring DNA Sequence Data Using BLAST — [Bioinformatics Eps.2]

Faris Izzatur Rahman
4 min read · Jul 26, 2021

DNA is the molecule that carries the hereditary information behind the traits of all living things. Inside the cell, DNA is packed into chromosomes.

In this episode, we will explore DNA sequence data. We will look at how the genome sequence of an organism is laid out, and then search for the sequences that are most similar to it.

BLAST, which stands for Basic Local Alignment Search Tool, is an algorithm and program for comparing primary biological sequence information, such as protein amino acid sequences, DNA nucleotide sequences, and RNA sequences.
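To make the idea of a "local alignment" more concrete, here is a minimal sketch in Python. It is not BLAST itself: BLAST uses a faster seed-and-extend heuristic, while this sketch uses the classic Smith-Waterman dynamic programming method, which finds the best-scoring matching region between two sequences. The scoring values (match, mismatch, gap) are illustrative assumptions, not settings taken from this tutorial.

```python
def local_alignment_score(a, b, match=2, mismatch=-3, gap=-5):
    """Best local alignment score between a and b.

    Smith-Waterman with a simple linear gap penalty. The scoring values
    are illustrative assumptions, not BLAST's configured parameters.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            up = prev[j] + gap        # gap in sequence b
            left = curr[j - 1] + gap  # gap in sequence a
            # A local alignment may restart anywhere, so scores never drop below 0.
            cell = max(0, diag, up, left)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# The two toy sequences share the region "ACGT" (4 matches x 2 points).
print(local_alignment_score("TTTACGTAAA", "GGACGTCC"))
```

Only the score is computed here. A full implementation would also trace back through the matrix to recover the aligned region.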

The NCBI database has the information we will use.

First, we choose the data we will use on the NCBI website. Go to the home page and click on the search section at the top of the page.

Figure 1 NCBI Home view (Personal Documentation)

In the list of NCBI databases, select the Nucleotide section.

Figure 2 select the nucleotide menu

In the search field, enter the name of the nucleotide data you want to use. In this case, we use the term “sars-cov-2” to get the genome of the coronavirus. The result we get back is “Severe acute respiratory syndrome coronavirus 2 reference genome”. (SARS-CoV-2 is an RNA virus; NCBI stores its genome in DNA letters, with T in place of U.)

Figure 3 Input ‘sars-cov-2’ in the search bar
Figure 4 The result view of the keyword

Select the genomic data.

Figure 5 The genomic data of SARS-CoV-2 Virus in NCBI

Select the FASTA format to observe the DNA sequence.

Figure 6 the FASTA format view of the SARS-CoV-2 virus genome
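The FASTA format is simple enough to read with a few lines of code: a header line starting with ">" is followed by one or more lines of sequence. The sketch below parses FASTA text and computes the length and GC content of a sequence. The sample record is a made-up toy, not real genome data. The helper `efetch_fasta_url` shows how the same FASTA view could be requested from NCBI's E-utilities service; the accession NC_045512.2 used in the usage note is an assumption about which record this tutorial opened, so check it against the page before relying on it.

```python
from collections import Counter
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession):
    """Build an E-utilities EFetch URL that returns a record as FASTA text."""
    params = {"db": "nuccore", "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH_URL + "?" + urlencode(params)


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    counts = Counter(seq)
    return (counts["G"] + counts["C"]) / len(seq) if seq else 0.0


# Made-up toy record for illustration only; not real genome data.
SAMPLE = """>toy_record made-up example
ACGTAC
GGTTAA
"""

records = parse_fasta(SAMPLE)
header, sequence = records[0]
print(header, len(sequence), round(gc_content(sequence), 3))
```

To try it on the real genome, the text downloaded from `efetch_fasta_url("NC_045512.2")` (for example with `urllib.request.urlopen`) could be passed to `parse_fasta` in place of `SAMPLE`.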

To compare this sequence with others, we use the BLAST tool provided on the page. Select the ‘Run BLAST’ option in the ‘Analyze this sequence’ section of the right-hand bar.

Figure 6 Select ‘Run BLAST’ in the right bar section

In the BLAST program, the submitted sequence is already filled in as the query. Click the BLAST button to start the search.

Figure 7 BLAST Process
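The same search can, in principle, be started from a script through NCBI's BLAST URL API instead of the web form. The sketch below only builds the request parameters and shows how the request ID (RID) could be read from the server's reply; it does not contact NCBI. The choice of `blastn` and the `nt` database is an assumption, since the tutorial does not state which settings the web form used, and the sample reply is a made-up placeholder that imitates the layout of the reply.

```python
import re
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_submit_params(query, program="blastn", database="nt"):
    """Parameters for submitting a search (CMD=Put).

    `query` can be an accession or a FASTA sequence. The program and
    database defaults are assumptions, not values from the tutorial.
    """
    return {"CMD": "Put", "PROGRAM": program, "DATABASE": database, "QUERY": query}


def build_result_params(rid, format_type="Text"):
    """Parameters for fetching the results of a submitted search (CMD=Get)."""
    return {"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type}


def extract_rid(response_text):
    """Pull the request ID out of the reply to a CMD=Put request, or None."""
    found = re.search(r"^[ \t]*RID[ \t]*=[ \t]*(\S+)", response_text, flags=re.MULTILINE)
    return found.group(1) if found else None


# Made-up placeholder reply, imitating the block that carries the RID.
SAMPLE_REPLY = """<!--QBlastInfoBegin
    RID = ABC123XYZ
    RTOE = 30
QBlastInfoEnd
-->"""

print(BLAST_URL + "?" + urlencode(build_submit_params("NC_045512.2")))
print(extract_rid(SAMPLE_REPLY))
```

A real run would send the first set of parameters, wait, and then poll with the second set until the results are ready. NCBI publishes usage guidelines for this API, so check them before scripting requests.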

After BLAST has run, the results page lists the database sequences that match the query genome we chose earlier.

Figure 8 BLAST Result
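BLAST results can also be saved as a plain hit table and read with a script. The sketch below assumes the 12-column, tab-separated layout used by the BLAST+ command-line option `-outfmt 6`; a file downloaded from the web page may need its delimiter or columns adjusted. The two rows are made-up placeholders, not results from the search in this tutorial.

```python
import csv
import io

# Column order of the standard 12-column tabular layout (assumed here).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_FIELDS = ("pident", "evalue", "bitscore")


def parse_hit_table(text):
    """Parse tab-separated BLAST hits into a list of dictionaries."""
    hits = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        # Skip blank lines, comment lines, and rows that are too short.
        if not row or row[0].startswith("#") or len(row) < len(COLUMNS):
            continue
        hit = dict(zip(COLUMNS, row))
        for name in INT_FIELDS:
            hit[name] = int(hit[name])
        for name in FLOAT_FIELDS:
            hit[name] = float(hit[name])
        hits.append(hit)
    return hits


# Made-up rows for illustration only.
SAMPLE = (
    "# made-up rows for illustration only\n"
    "query1\thitA\t100.000\t120\t0\t0\t1\t120\t1\t120\t1e-60\t222\n"
    "query1\thitB\t95.000\t100\t5\t0\t11\t110\t201\t300\t2e-40\t150\n"
)

hits = parse_hit_table(SAMPLE)
best = max(hits, key=lambda h: h["bitscore"])
print(best["sseqid"], best["pident"], best["evalue"])
```

Once the hits are dictionaries, sorting by E-value or filtering by percent identity takes one line each.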

The results are shown in a table with several columns. The table shows how similar each database sequence is to the genome we submitted (Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome). For instance:

  • ‘Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/KAZ/KZ_Almaty/2020, complete genome’ is coronavirus genome data obtained from samples in Kazakhstan.
  • ‘Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/UT-UPHL-2102921286/2020, complete genome’ is data obtained from samples in the USA, and so on.

The result table has several columns. Let’s discuss them one by one.

  1. Description. The title of the matching sequence record in the database.
  2. Scientific Name. The scientific name of the organism the sequence came from.
  3. Max Score. The highest alignment score found among the aligned segments between the query and this database sequence. Matching positions add to the score, while mismatches and gaps subtract from it.
  4. Total Score. The sum of the alignment scores of all aligned segments from the same database sequence. The more similar the aligned segments are to the query, the higher the score. When there is only one aligned segment, the total score equals the max score.
  5. Query Cover. The percentage of the query sequence’s length that is included in the aligned segments.
  6. E-value. The number of hits we would “expect” to find by chance when searching a database of a given size. It decreases exponentially as the score (S) of the match increases. In effect, the E-value describes the random background noise: the lower it is, or the closer to zero, the more “significant” the match.
  7. Percent Identity. The percentage of positions in the alignment where the query and the database sequence have the same base.
  8. Acc Length. The length of the database sequence behind this hit.
  9. Accession. The identifier of the sequence record in the NCBI database. The accession stays the same when a record is updated (a version suffix such as “.2” tracks changes to the sequence), and it is the primary way to look up a sequence in the database.
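The columns above can be illustrated on a toy alignment. The sketch below computes a raw score, percent identity, and query cover for one pair of aligned strings, and an E-value from a bit score using the relation E = m · n · 2^(−S′), where m is the query length, n the database length, and S′ the bit score. It makes several simplifying assumptions: the scoring values are illustrative, gaps cost the same at every position (BLAST charges separately for opening and extending a gap), and plain lengths are used where BLAST uses adjusted "effective" lengths. The aligned strings are made up.

```python
def alignment_stats(query_aln, subject_aln, query_length,
                    match=2, mismatch=-3, gap=-5):
    """Score a gapped pairwise alignment ('-' marks a gap).

    Scoring values are illustrative assumptions, with a flat per-position
    gap cost rather than BLAST's gap open/extend scheme.
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned strings must have the same length")
    identical = mismatches = gaps = query_bases = 0
    for q, s in zip(query_aln, subject_aln):
        if q != "-":
            query_bases += 1  # this column covers one base of the query
        if q == "-" or s == "-":
            gaps += 1
        elif q == s:
            identical += 1
        else:
            mismatches += 1
    return {
        "raw_score": identical * match + mismatches * mismatch + gaps * gap,
        "percent_identity": 100 * identical / len(query_aln),
        "query_cover": 100 * query_bases / query_length,
    }


def expected_hits(bit_score, query_length, database_length):
    """E-value from a bit score: E = m * n * 2^(-S')."""
    return query_length * database_length * 2 ** (-bit_score)


# Made-up alignment: 8 identical columns, 1 mismatch, 1 gap.
stats = alignment_stats("ACGTACGT-A", "ACGTTCGTGA", query_length=12)
print(stats)

# A higher bit score gives a smaller (more significant) E-value.
print(expected_hits(40, 100, 1_000_000), expected_hits(50, 100, 1_000_000))
```

In the toy alignment, 9 of the 12 query bases take part in the alignment, which is why the query cover is below 100% even though most aligned positions match.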

There is much more data to explore in the primary databases at NCBI.

That’s all for this episode. Happy learning!

Written by Faris Izzatur Rahman

Computer science fresh graduate who loves genomics