1. |
DNA sequences introduction. |
|
Deoxyribonucleic acid (DNA) is a polymer built from four kinds of molecules, called
nucleobases, nitrogenous bases, nucleic acid bases, or simply bases.
Their names and one-letter abbreviations are adenine (A), cytosine (C), guanine (G), and thymine (T). The bases are joined end to end to form
a single strand of DNA (single-stranded DNA). In the cell, DNA usually appears in a double-stranded
form, with two strands wrapped around each other in a double helix.
The two strands of the double helix have matching bases, known as base
pairs. In the DNA double helix, an A on one strand is always opposite a T on the other strand, and a G is always paired with a
C.
|
If you add a sugar (2-deoxyribose in DNA) to the bases, you get the corresponding nucleosides: adenosine, guanosine, cytidine, thymidine (for DNA, more correctly: deoxyadenosine...). You can further add a phosphate
and get the corresponding nucleotide (or nucleoside monophosphate): adenosine
monophosphate (adenylic acid), guanosine monophosphate (guanylic acid), cytidine monophosphate (cytidylic acid), thymidine monophosphate (thymidylic acid).
The polymer of these nucleotides forms a deoxyribonucleic acid (DNA). Have a look at a biochemistry book for details.
|
There is also an orientation (or directionality) to the strands.
One end of a nucleotide is called the 5' (five prime) end, and the other is called the 3' (three prime) end.
When nucleotides join to make a single strand of DNA, they always connect the 5' end of one to the 3' end of the other. Furthermore, when the cell uses the DNA,
as in transcribing it to RNA, the new molecule is built base by base from its 5' end to its 3' end. Thus, DNA is written left to right on
the page, corresponding to the 5' to 3' orientation of the bases. When two strands are joined in a double helix, they have opposite orientations. That
is, the 5' to 3' orientation of one strand runs in the opposite direction to the 5' to 3' orientation of the other strand: at each end of the double helix, one
strand has a 3' end; the other has a 5' end. Because the base pairs are always matched A-T and C-G and the orientations of the strands are the reverse of each
other, the expression reverse complement describes the
relationship of the bases of the two strands. It's "reverse" because the orientations are reversed, and "complement" because the bases always pair with their
complementary bases, A with T and C with G.
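As a small illustration of the reverse complement, here is a minimal Perl sketch. The helper name reverse_complement is my own choice for this example, and it only handles the four plain bases A, C, G and T (uppercase or lowercase).

```perl
use strict;
use warnings;

# Hypothetical helper: reverse complement of a plain A/C/G/T sequence.
sub reverse_complement {
    my ($dna) = @_;
    # Complement each base (A<->T, C<->G), preserving case.
    (my $comp = $dna) =~ tr/ACGTacgt/TGCAtgca/;
    # Reverse the string so the result is again written 5' to 3'.
    return scalar reverse $comp;
}

print reverse_complement('ATGCCG'), "\n";   # prints CGGCAT
```

Complementing first and reversing afterwards (or the other way round) gives the same result, as the two operations are independent of each other.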
|
As DNA is essentially a polymer made from 4 building blocks, the nucleotides, attached end to end, it's possible to summarize the
structure of a DNA molecule by simply giving the sequence of the nucleotides (sequence of the bases). Thus, a DNA sequence
may be represented by a string (sequence of characters) composed of the letters A, C, G, and T, representing the 4 DNA bases (normally these
codes are written as uppercase; however, lots of bioinformatics applications also accept lowercase base codes).
|
As not all bases of a sequence are always known, there are also single-letter codes for all possible groups of two, three, or four bases. Here's
the complete table of standard IUB/IUPAC nucleic acid codes (note that U = uracil is a base present in RNA sequences,
corresponding to T in DNA).
Code | Nucleobase |
A | Adenine |
C | Cytosine |
G | Guanine |
T | Thymine |
U | Uracil |
|
|
Code | Nucleobases |
M | A or C (amino) |
R | A or G (purine) |
W | A or T (weak) |
S | C or G (strong) |
Y | C or T (pyrimidine) |
K | G or T (keto) |
|
|
Code | Nucleobases |
V | A or C or G |
H | A or C or T |
D | A or G or T |
B | C or G or T |
N | A or C or G or T (any) |
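To show how these codes can be used in a program, here is a minimal Perl sketch that turns a sequence containing extended codes into a regular expression. The hash %iupac simply restates the tables above; the helper name iupac_to_regex is hypothetical and not part of my calculator script.

```perl
use strict;
use warnings;

# IUPAC nucleotide codes and the DNA bases each one stands for.
my %iupac = (
    A => 'A', C => 'C', G => 'G', T => 'T',
    M => 'AC', R => 'AG', W => 'AT', S => 'CG', Y => 'CT', K => 'GT',
    V => 'ACG', H => 'ACT', D => 'AGT', B => 'CGT', N => 'ACGT',
);

# Hypothetical helper: turn an IUPAC pattern into a regular expression string.
sub iupac_to_regex {
    my ($pattern) = @_;
    my $regex = '';
    for my $code (split //, uc $pattern) {
        die "Unknown IUPAC code '$code'\n" unless exists $iupac{$code};
        my $set = $iupac{$code};
        # A single base is used as is; a group becomes a character class.
        $regex .= length($set) == 1 ? $set : "[$set]";
    }
    return $regex;
}

my $re = iupac_to_regex('GAYTN');
print "$re\n";                              # prints GA[CT]T[ACGT]
print "match\n" if 'GACTA' =~ /\A$re\z/;    # prints match
```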
|
|
Sequence formatting: The FASTA format.
|
The simplest way to create a DNA sequence file is to use raw data, i.e. to save it as a text file containing one or several lines of
base codes. However, apart from the bases it contains, this tells us nothing about the sequence; in particular, there is no indication
of what molecule it is and where it comes from. Over time, biologists and programmers have invented several ways to format sequence data in computer files,
in order to create standard representations of sequences that include all kinds of information about the sequence. There are lots of such formats; the most
widely used are FASTA and GenBank.
|
FASTA format is basically just lines of sequence data with newlines at the end, so it can easily be printed on a page or
displayed on a computer screen. The length of the lines isn't specified, but for compatibility, it's best to limit them to 80 characters. There
is also a FASTA header: one (or several) line(s) at the beginning of the file, starting with the greater-than (>)
character. The FASTA header can contain any text whatsoever (or no text). Typically, a header line contains the name of the DNA or the gene it comes from,
often followed, after a vertical bar (|), by additional information about the sequence, the experiment that produced it, or other non-sequence information of
that nature. Most FASTA-aware software insists that there must be only one header line. The addition of comments (starting with a # character) is not
officially supported.
|
If you add several FASTA-formatted sequences to the same file, you get a multiple sequence FASTA file. These are not part of
the FASTA format, but several bioinformatics applications and web-based user-interfaces accept such files.
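To make the format more concrete, here is a minimal Perl sketch that splits FASTA text (single or multiple sequence) into records. The helper name parse_fasta and the record layout (a hash with header and sequence keys) are assumptions made for this example; they are not taken from my calculator script.

```perl
use strict;
use warnings;

# Hypothetical helper: split FASTA text into records of header and sequence.
sub parse_fasta {
    my ($text) = @_;
    my @records;
    for my $line (split /\r?\n/, $text) {
        if ($line =~ /^>(.*)/) {
            # A header line starts a new record.
            push @records, { header => $1, sequence => '' };
        }
        elsif ($line =~ /\S/) {
            # Sequence data before any header is treated as raw data.
            push @records, { header => '', sequence => '' } unless @records;
            (my $data = $line) =~ s/\s+//g;
            $records[-1]{sequence} .= $data;
        }
    }
    return @records;
}

my $fasta = ">seq1|example\nACGT\nTTGA\n>seq2\nGGCC\n";
for my $rec (parse_fasta($fasta)) {
    print "$rec->{header}: $rec->{sequence}\n";
}
# prints:
# seq1|example: ACGTTTGA
# seq2: GGCC
```

Note that this sketch treats every line starting with > as the beginning of a new sequence, so it does not support headers that span several lines.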
|
|
2. |
DNA molecular weight online calculator. |
|
"DNA molecular weight online calculator" is a web application that may be used to calculate the molecular weight of one or more DNA
sequences. The base codes may be uppercase or lowercase (lowercase codes are converted to uppercase). U (uracil) is rejected as an invalid DNA
base. Spaces, end-of-line and tab characters are removed before the sequence is validated. The sequences may be entered either as
raw data or in FASTA format. A FASTA header must be a single line; comments are not permitted. Multiple-sequence FASTA is supported.
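The clean-up and validation rules just described can be sketched in a few lines of Perl. This is only an illustration under my own naming (normalize_dna is hypothetical); the routine actually used by the script, validDna, is shown further down.

```perl
use strict;
use warnings;

# Hypothetical helper: clean up one raw sequence and check it.
# Returns the normalized sequence, or undef if it is empty or
# contains characters that are not DNA base codes.
sub normalize_dna {
    my ($input) = @_;
    # Convert to uppercase, then drop spaces, tabs and end-of-line characters.
    (my $dna = uc $input) =~ s/\s+//g;
    # Only standard and extended DNA codes pass; U (uracil) is rejected.
    return unless $dna =~ /\A[ACGTMRWSYKVHDBN]+\z/;
    return $dna;
}

for my $input ("acg t\nNN", "ACGU") {
    my $dna = normalize_dna($input);
    print defined $dna ? "valid: $dna\n" : "invalid input\n";
}
# prints:
# valid: ACGTNN
# invalid input
```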
|
You can enter the sequence manually (for example, using Copy/Paste) or upload a file stored on your
computer. To use a file, select the corresponding checkbox. Please note that the file size is limited to 25 kB and that filenames may contain only letters,
numbers, spaces, underscores and hyphens.
|
You can choose whether to do the molecular weight calculation for single-stranded DNA (the sequence that you entered) or for
double-stranded DNA (the entered sequence plus its reverse complement). You can also choose the DNA sequence
topology: linear or circular. Linear sequences are assumed to have a 5' phosphate.
|
Molecular weight is calculated by adding the molecular weights of the sequence's bases (for double-stranded DNA, those of the sequence and of its
reverse complement): A = 313.21; C = 289.18; G = 329.21; T = 304.2. For linear sequences, 17.01, corresponding to 1 supplementary oxygen and 1
supplementary hydrogen atom as parts of the "free" 5' phosphate, is added.
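As a worked example of this arithmetic, here is a minimal Perl sketch for the sequence ACGT, using the weights quoted above. The helper name strand_weight is hypothetical, and for double-stranded DNA I assume that the 17.01 correction is added once per strand, which is what two separate calls of the script's weight routine would give.

```perl
use strict;
use warnings;

# Per-base weights and the linear-strand correction quoted in the text.
my %weight = (A => 313.21, C => 289.18, G => 329.21, T => 304.20);
my $oh     = 17.01;

# Hypothetical helper: weight of one strand made of plain A/C/G/T bases.
sub strand_weight {
    my ($dna, $topology) = @_;
    my $mw = 0;
    $mw += $weight{$_} for split //, $dna;
    $mw += $oh if $topology eq 'linear';
    return $mw;
}

my $seq = 'ACGT';
# Reverse complement (here identical to the sequence itself).
(my $rc = scalar reverse $seq) =~ tr/ACGT/TGCA/;

printf "single, linear:   %.2f\n", strand_weight($seq, 'linear');     # 1252.81
printf "single, circular: %.2f\n", strand_weight($seq, 'circular');   # 1235.80
printf "double, linear:   %.2f\n",
    strand_weight($seq, 'linear') + strand_weight($rc, 'linear');     # 2505.62
```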
|
If the sequence contains extended base codes, two molecular weights are calculated: the minimum molecular weight,
where every uncertain base is taken to be the candidate with the smallest molecular weight, and the maximum molecular
weight, where every uncertain base is taken to be the candidate with the highest molecular weight. For example,
for the base code D (which may be A, G, or T), the minimum molecular weight is 304.2 (that of T) and the maximum molecular weight is 329.21 (that
of G).
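The minimum/maximum idea can be sketched as follows. Note that this is not the way my script does it (its routine molecularWeightBase, shown further down, relies on pre-sorted strings); the sketch instead takes the smallest and largest candidate weight with List::Util, and the helper name weight_range is hypothetical. It assumes a linear single strand of valid uppercase codes.

```perl
use strict;
use warnings;
use List::Util qw(min max);

my %weight = (A => 313.21, C => 289.18, G => 329.21, T => 304.20);
my %iupac = (
    A => 'A', C => 'C', G => 'G', T => 'T',
    M => 'AC', R => 'AG', W => 'AT', S => 'CG', Y => 'CT', K => 'GT',
    V => 'ACG', H => 'ACT', D => 'AGT', B => 'CGT', N => 'ACGT',
);

# Hypothetical helper: (minimum, maximum) weight of a linear single strand.
sub weight_range {
    my ($dna) = @_;
    my ($min, $max) = (17.01, 17.01);   # correction for the free 5' phosphate
    for my $code (split //, $dna) {
        # Weights of all bases this code may stand for.
        my @candidates = map { $weight{$_} } split //, $iupac{$code};
        $min += min(@candidates);
        $max += max(@candidates);
    }
    return ($min, $max);
}

printf "min %.2f, max %.2f\n", weight_range('AD');   # prints min 634.42, max 659.43
```

For a sequence without extended codes, both values are the same.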
|
Use the following link to start the online application.
|
|
3. |
DNA molecular weight calculator Perl script. |
|
Click the following link to download the DNA molecular weight calculator
Perl script and all other files needed to run this application on your web server. Have a look at the ReadMe.txt file, included in the download archive, for
details about the different files, and where to place them on the server.
|
The Perl script is rather long, so I do not display the source code here. Just some remarks concerning how the application works:
- The script first checks from where it should read the DNA sequence.
- If the Load DNA sequence from local file checkbox is selected, it assumes that the user wants to
upload a file containing the sequence. Thus, the user has to browse for the file before they push the Calculate button, allowing
the script to get the filename in order to create a file handle (an error message is displayed if no file was chosen). File upload is always a potential danger that could be used to hack the webserver and even the operating system. If you are not sure how to handle this safely, you might
want to have a look at my Uploading files using CGI and Perl tutorial (the tutorial example
is my DNA molecular weight calculator application). If the filename is given and valid, the script creates a file
handle and reads the file content into a string variable (thus, no file is saved on the server here).
- If the Load DNA sequence from local file checkbox is not selected, the sequence is read from the text
box.
- The application web page is generated by reading a template file and replacing all custom
tags (template lines containing a tag start with '#' and all tags are placed between '#' symbols) with the corresponding actual values; in particular, the
tag '#table#' is replaced with an HTML table showing the minimum and maximum molecular weights for the sequence(s) analyzed.
- The main molecular weight calculation routine (called calculateMolecularWeights)
parses the sequence data, filters out the FASTA headers, if there are any, and for each individual sequence (remember that the DNA input may be in multiple-sequence FASTA format) calls the sub molecularWeightSeq (which itself calls molecularWeightBase for each base) to calculate the molecular weight of a single strand of DNA, depending on its topology. If the DNA is
double-stranded, a second call is made to the sub, in order to calculate the weight of the reverse complement of the sequence.
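The template mechanism mentioned in the list above can be illustrated with a minimal Perl sketch. This is not the code of my script: the helper name fill_template is hypothetical, tags are simply assumed to have the form #name#, and the rule that tag lines start with '#' is not modeled here.

```perl
use strict;
use warnings;

# Hypothetical helper: replace #name# tags in template text with actual values.
sub fill_template {
    my ($template, %values) = @_;
    # Tags without a matching value are left untouched.
    $template =~ s{#(\w+)#}{ exists $values{$1} ? $values{$1} : "#$1#" }ge;
    return $template;
}

my $page = fill_template(
    "<h1>Results</h1>\n#table#\n",
    table => '<table><tr><td>1252.81</td></tr></table>',
);
print $page;
# prints:
# <h1>Results</h1>
# <table><tr><td>1252.81</td></tr></table>
```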
|
As I said, the script is too long to publish here. This is not the case for some generally usable DNA-related routines, which follow.
|
Calculate molecular weight(s) of a DNA sequence.
|
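# Returns the (minimum, maximum) molecular weight of one DNA strand;
# (0, 0) is returned if the sequence is not valid.
sub molecularWeightSeq {
    my ($dna, $topology) = @_;
    my ($mw1, $mw2) = (0, 0);
    my $oh = 17.01;    # O + H of the "free" 5' phosphate (linear DNA only)
    if (validDna($dna)) {
        # Sum the minimum and maximum weight of each base
        for (my $i = 0; $i < length($dna); $i++) {
            my ($bmw1, $bmw2) = molecularWeightBase(substr($dna, $i, 1));
            $mw1 += $bmw1;
            $mw2 += $bmw2;
        }
        if ($topology eq 'linear') {
            $mw1 += $oh;
            $mw2 += $oh;
        }
    }
    return ($mw1, $mw2);
}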
|
Calculate molecular weight(s) of a DNA base.
|
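# Returns the (minimum, maximum) molecular weight of one base code.
# The base is expected to be a valid uppercase code (see validDna).
sub molecularWeightBase {
    my ($base) = @_;
    my ($mw1, $mw2) = (0, 0);
    # Possible bases for each extended code, ordered by increasing weight:
    # the first one is the lightest, the last one the heaviest
    my %dna_extended = (
        'M' => 'CA', 'R' => 'AG', 'W' => 'TA', 'S' => 'CG', 'Y' => 'CT', 'K' => 'TG',
        'V' => 'CAG', 'H' => 'CTA', 'D' => 'TAG', 'B' => 'CTG', 'N' => 'CTAG'
    );
    my %baseWeights = (
        'A' => 313.21, 'C' => 289.18, 'G' => 329.21, 'T' => 304.20
    );
    if ($base =~ /^([ACGT])$/) {
        # Known base: minimum and maximum are the same
        $mw1 = $baseWeights{$base};
        $mw2 = $mw1;
    }
    else {
        # Extended code: lightest (first) and heaviest (last) candidate
        my $bases_extended = $dna_extended{$base};
        $mw1 = $baseWeights{substr($bases_extended, 0, 1)};
        $mw2 = $baseWeights{substr($bases_extended, -1, 1)};
    }
    return ($mw1, $mw2);
}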
|
Calculate reverse complement of a DNA sequence.
|
sub reverseComp {
    my ($dna) = @_;
    # Replace each base by its complement ...
    for (my $i = 0; $i < length($dna); $i++) {
        substr($dna, $i, 1, baseComp(substr($dna, $i, 1)));
    }
    # ... then reverse the whole sequence
    $dna = reverse($dna);
    return $dna;
}
|
Calculate complement of a DNA base.
|
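sub baseComp {
    my ($base) = @_;
    # Standard bases and extended codes are mapped to their complements
    # (W, S and N are their own complements)
    $base =~ tr/ATCGMRWSYKVHDBN/TAGCKYWSRMBDHVN/;
    return $base;
}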
|
Check DNA sequence validity.
|
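# Returns 1 if the sequence contains only standard or extended DNA base
# codes (uppercase), 0 otherwise.
sub validDna {
    my ($dna) = @_;
    my $valid = 0;
    my $bases = 'ACGTMRWSYKVHDBN';
    if ($dna =~ /^([$bases]+)$/) {
        $valid = 1;
    }
    return $valid;
}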
|
|
If you want to place a link to the application on some other page, include the following in that page's HTML:
<a href="/cgi-bin/dna_molweight.pl">DNA molecular weight calculator</a>
|
|
4. |
Related stuff on this site. |
|
If you need a more general application to determine the molecular weight of chemical molecules, you may like my Lazarus/Free Pascal GUI application
MolWeight, which can determine the weight of molecules entered by their chemical formula, as well as the weight of
DNA, RNA and protein sequences. Click the following link to view the description of the "MolWeight" PC
application.
|
|