|
Proteins are polymers composed of monomers called amino acids. In other words, a protein is a chain of amino acids. Because the amino
acids in the chain are joined by peptide bonds, the chain is also referred to as a polypeptide. All amino acids have an amino end (-NH2) and a carboxyl end (-COOH). They also
have a side chain, called the R group. Of the many hundreds of described amino acids, 20 (22 if selenocysteine and pyrrolysine are counted) are proteinogenic
("protein-building"). It is these 20 compounds that combine to give the vast array of peptides and proteins assembled by ribosomes.
|
In proteins, amino acids are linked together by covalent bonds called peptide bonds. A peptide bond is formed by the linkage of the
-NH2 end of one amino acid with the -COOH end of another amino acid. Because of the way in which the peptide bond forms, a polypeptide chain always has an amino
end and a carboxyl end.
|
The primary structure of a protein (its amino acid sequence) is sufficient to uniquely identify it. This means that a protein may be described by a sequence
of letters, where each letter codes for a given amino acid. These 1-letter codes are convenient for computer programs; to make
a protein sequence more readable for humans, there are also 3-letter codes.
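As a small illustration of the two notations, the following Perl sketch converts a 1-letter sequence into its 3-letter form. The helper name to_three_letter and the use of Xaa for any unrecognized letter are assumptions made for this example; they are not part of the calculator described on this page.

```perl
use strict;
use warnings;

# Standard 1-letter to 3-letter codes of the 20 standard proteinogenic amino acids
my %three_letter = (
    A => 'Ala', C => 'Cys', D => 'Asp', E => 'Glu', F => 'Phe',
    G => 'Gly', H => 'His', I => 'Ile', K => 'Lys', L => 'Leu',
    M => 'Met', N => 'Asn', P => 'Pro', Q => 'Gln', R => 'Arg',
    S => 'Ser', T => 'Thr', V => 'Val', W => 'Trp', Y => 'Tyr',
);

# Hypothetical helper: rewrite a 1-letter sequence in hyphen-separated 3-letter notation
sub to_three_letter {
    my ($protein) = @_;
    # Letters outside the table are shown as 'Xaa' (an assumption for this sketch)
    return join('-', map { $three_letter{$_} // 'Xaa' } split(//, uc $protein));
}

print to_three_letter('MKV'), "\n";    # prints Met-Lys-Val
```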
|
There are various ways to classify amino acids: by their volume (molecule size); by their side chain structure; by their polarity; by their charge; by their
hydropathy; by their nutritive requirement (for humans); by their chemical function (metabolism).
|
For further details, please have a look at my tutorial A general overview of the protein structure.
|
|
|
"Quantitative analysis of proteins online calculator" is a web application that may be used to determine the composition (amino acid
counts and percentages) of a protein sequence. The sequence must be in 1-letter coding; only the codes of the 20 standard proteinogenic
amino acids are accepted (the presence of the "uncertain" or "unknown" codes B, Z, and X, as well as O = pyrrolysine and U = selenocysteine, will result in an
"Invalid protein sequence" error). The sequence may be entered either as raw data or in FASTA format (cf. my tutorial
Biological sequences and genetic code primer). The FASTA header must be a single line; comments are not permitted.
Multiple-sequence FASTA is not supported.
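To make these input rules concrete, here is a minimal sketch of how raw or single-sequence FASTA input could be reduced to the bare sequence. It is not the application's actual code: the helper name extract_sequence, the removal of whitespace, and the choice to return undef for multi-sequence input are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: return the bare sequence from raw or single-sequence FASTA input.
# Returns undef if a second FASTA header is found (multi-sequence FASTA).
sub extract_sequence {
    my ($input) = @_;
    my @lines = grep { /\S/ } split(/\r?\n/, $input);   # drop empty lines
    return '' unless @lines;
    shift @lines if $lines[0] =~ /^>/;                  # remove the single header line
    return undef if grep { /^>/ } @lines;               # another header: not supported
    my $sequence = join('', @lines);
    $sequence =~ s/\s+//g;                              # assumption: whitespace is ignored
    return $sequence;
}

my $sequence = extract_sequence(">example header\nMKV\nLLA\n");
print "$sequence\n";    # prints MKVLLA
```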
|
You can enter the sequence manually (for example, using Copy/Paste) or upload a file stored on your computer.
To use a file, select the corresponding checkbox. Please note that the file size is limited to 25 kB and that filenames may contain only letters, numbers, spaces,
underscores, or hyphens.
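The two upload restrictions could be checked along the following lines. This is only a sketch under stated assumptions: the helper names valid_filename and valid_size are hypothetical, 25 kB is read as 25 * 1024 bytes, and the character rule is applied to the base name while one optional extension (such as .txt) is tolerated. The real script may apply the rules differently.

```perl
use strict;
use warnings;

# Assumption: the 25 kB limit means 25 * 1024 bytes
my $max_upload_bytes = 25 * 1024;

# Hypothetical helper: letters, digits, spaces, underscores and hyphens only,
# optionally followed by a single extension (the extension is an assumption)
sub valid_filename {
    my ($filename) = @_;
    return $filename =~ /\A[A-Za-z0-9 _-]+(?:\.[A-Za-z0-9]+)?\z/ ? 1 : 0;
}

# Hypothetical helper: check the size of the uploaded content
sub valid_size {
    my ($content) = @_;
    return length($content) <= $max_upload_bytes ? 1 : 0;
}

for my $name ('my protein_1.txt', '../etc/passwd') {
    printf "%-20s %s\n", $name, valid_filename($name) ? 'accepted' : 'rejected';
}
```

Rejecting anything outside a small whitelist of characters is what keeps path components such as "../" out of the filename.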
|
With a valid sequence entered, pushing the Calculate button displays a table with the counts and
percentages of the 20 amino acids. To display the counts and percentages for the different classification categories (molecule size, side chain, etc.), click the
corresponding link.
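The figures in the table are simple to derive: the count of each code, and that count as a share of the sequence length. A minimal sketch, assuming an already validated sequence (the helper name composition is hypothetical and is not taken from the application):

```perl
use strict;
use warnings;

# Hypothetical helper: count each 1-letter code and compute its percentage
# of the total sequence length. The sequence is assumed to be valid already.
sub composition {
    my ($protein) = @_;
    my %count = map { ($_ => 0) } split(//, 'ACDEFGHIKLMNPQRSTVWY');
    $count{$_}++ for split(//, $protein);
    my $total = length($protein);
    my %percent = map { ($_ => $total ? 100 * $count{$_} / $total : 0) } keys %count;
    return (\%count, \%percent);
}

my ($count, $percent) = composition('MKVLLA');
for my $aa (sort keys %$count) {
    next unless $count->{$aa};    # list only the amino acids that occur
    printf "%s  %2d  %6.2f%%\n", $aa, $count->{$aa}, $percent->{$aa};
}
```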
|
Use the following link to start the online application.
|
|
|
Click the following link to download the Quantitative analysis of proteins
Perl script and all other files needed to run this application on your web server. Have a look at the ReadMe.txt file, included in the
download archive, for details about the different files, and where to place them on the server.
|
The Perl script is rather long and I do not display the entire source code here. Some remarks concerning how the application works:
- The script first checks where it should read the protein sequence from.
  - If the Load protein sequence from local file checkbox is selected, it assumes that the user wants to
    upload a file containing the sequence. Thus, the user has to browse for the file before they push the Calculate button, allowing
    the script to get the filename in order to create a file handle (an error message is displayed if this is not the case). File upload is always a potential security risk that could be exploited to attack the web server and even the operating system. If you are not sure about this, you might
    want to have a look at my Uploading files using CGI and Perl tutorial (the tutorial example
    is my DNA molecular weight calculator application). If the filename is given and is valid, the script creates a file
    handle and reads the file content into a string variable (thus, no file is saved on the server here).
  - If the Load protein sequence from local file checkbox is not selected, the sequence is read from the text
    box.
- The application web page is generated by reading a template file and replacing all
custom tags (template lines containing a tag start with '#' and all tags are placed between '#' symbols) by the corresponding actual
values. In particular, the tag '#counts#' is replaced with an HTML table showing the counts and percentages of the different amino acids present in the
sequence; the tag '#classes#' is replaced by the links that show the corresponding category counts tables. These tables are actually part of the template file;
the Perl script only replaces the category name and the count and percentage values.
- The protein analysis routine iterates through the sequence (the FASTA header has been
removed before) and, for each amino acid, increments its counter as well as the counters of the category groups it belongs to. Before the protein is
analyzed, another subroutine checks that the sequence consists of valid amino acid codes only.
- Each category table sits in a row of an outer table; these rows are defined with a numbered id and the property
style="visibility:collapse". Clicking one of the category links calls a JavaScript function that makes the corresponding row of the
outer table (i.e. the corresponding category table) visible and hides all others. For details about how this can be implemented, you might want to have a look
at my tutorial Using Javascript to hide/show given text paragraphs.
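
The template mechanism mentioned in the remarks above can be sketched as follows. This is not the script's actual code; it assumes that only lines beginning with '#' are scanned, that a tag has the form #name#, and that a tag without a value is replaced by an empty string. The helper name fill_template is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: replace #name# tags on template lines that start with '#'
sub fill_template {
    my ($template, $values) = @_;
    my @out;
    for my $line (split(/\n/, $template)) {
        if ($line =~ /^#/) {
            # Assumption: tags without a value are replaced by an empty string
            $line =~ s{#(\w+)#}{ $values->{$1} // '' }ge;
        }
        push @out, $line;
    }
    return join("\n", @out);
}

my $template = "<h1>Result</h1>\n#counts#\n<p>Done</p>";
print fill_template($template, { counts => '<table>...</table>' }), "\n";
```

Restricting the substitution to lines that start with '#' means that a stray '#' elsewhere in the HTML (in a color value, for instance) is never mistaken for a tag.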
|
As I said above, the Perl script is too long to display the entire source code here. This is not the case for the code that deals with the protein sequence itself. Below
is the code of the protein validation and the protein
analysis subroutines.
|
Protein validation subroutine.
|
sub valid_protein {
    # Returns 1 if the sequence consists only of the 20 standard 1-letter codes, otherwise 0
    my ($protein) = @_;
    my $valid = 0;
    my $aa = 'ACDEFGHIKLMNPQRSTVWY';
    if ($protein =~ /^([$aa]+)$/) {
        $valid = 1;
    }
    return $valid;
}
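
The subroutine only reports whether the sequence is valid. If a more specific error message were wanted, a variant along the following lines could point to the first offending character. This is a sketch, not part of the application; the helper name first_invalid_code is hypothetical. Unlike valid_protein, it does not treat an empty sequence as an error.

```perl
use strict;
use warnings;

# Hypothetical helper: return (position, code) of the first character that is not
# one of the 20 standard 1-letter codes, or an empty list if there is none
sub first_invalid_code {
    my ($protein) = @_;
    if ($protein =~ /([^ACDEFGHIKLMNPQRSTVWY])/) {
        return ($-[1] + 1, $1);    # $-[1] is the 0-based offset of the match
    }
    return;
}

my ($pos, $code) = first_invalid_code('MKVBLA');
print "Invalid code '$code' at position $pos\n" if defined $pos;
```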
|
|
Protein analysis subroutine.
|
sub analyze {
    # Counts the amino acids of an already validated sequence, overall and per classification category
    my ($protein, $ref_amino_acids, $ref_aa_counts) = @_;
    my %amino_acids = %$ref_amino_acids;
    my %aa_counts = %$ref_aa_counts;
    my @size_counts = (0, 0, 0, 0, 0);
    my @chain_counts = (0, 0, 0, 0, 0, 0, 0);
    my @polarity_counts = (0, 0);
    my @charge_counts = (0, 0, 0);
    my @hydro_counts = (0, 0, 0);
    my @requirement_counts = (0, 0, 0);
    my @function_counts = (0, 0, 0);
    # Each string lists the 1-letter codes of one category; every classification covers all 20 codes
    my @sizes = ( 'AGS', 'CDNPT', 'EHQV', 'IKLMR', 'FWY' );
    my @chains = ( 'AIGLV', 'FHWY', 'P', 'CM', 'DE', 'KR', 'NQST' );
    my @polarities = ( 'DEHKNQRSTY', 'ACFGILMPVW' );
    my @charges = ( 'HKR', 'DE', 'ACFGILMNPSTQVWY' );
    my @hydros = ('ACFILMVW', 'DEKNQR', 'GHPSTY');
    my @requirements = ('FIKLMTVW', 'HR', 'ACDEGNPQSY');
    my @functions = ('KL', 'FITWY', 'ACDEGHMNPQRSV');
    for (my $i = 0; $i < length($protein); $i++) {
        my $aa = substr($protein, $i, 1);
        # Count of the amino acid itself (the hash is keyed by amino acid name)
        $aa_counts{$amino_acids{$aa}{'name'}}++;
        # Molecule size (5 categories)
        for (my $j = 0; $j <= 4; $j++) {
            if ($aa =~ /[$sizes[$j]]/) {
                $size_counts[$j]++;
            }
        }
        # Side chain structure (7 categories)
        for (my $j = 0; $j <= 6; $j++) {
            if ($aa =~ /[$chains[$j]]/) {
                $chain_counts[$j]++;
            }
        }
        # Polarity (2 categories)
        for (my $j = 0; $j <= 1; $j++) {
            if ($aa =~ /[$polarities[$j]]/) {
                $polarity_counts[$j]++;
            }
        }
        # Charge, hydropathy, nutritive requirement and chemical function (3 categories each)
        for (my $j = 0; $j <= 2; $j++) {
            if ($aa =~ /[$charges[$j]]/) {
                $charge_counts[$j]++;
            }
            if ($aa =~ /[$hydros[$j]]/) {
                $hydro_counts[$j]++;
            }
            if ($aa =~ /[$requirements[$j]]/) {
                $requirement_counts[$j]++;
            }
            if ($aa =~ /[$functions[$j]]/) {
                $function_counts[$j]++;
            }
        }
    }
    return (\%aa_counts, \@size_counts, \@chain_counts, \@polarity_counts, \@charge_counts, \@hydro_counts, \@requirement_counts, \@function_counts);
}
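
Because analyze increments a category counter only when a code appears in one of the category strings, a code that is missing from a list silently goes uncounted, and a code listed twice is counted twice. A small self-check such as the following could catch this. It is a sketch; the helper name partition_problems is hypothetical and is not part of the downloadable script.

```perl
use strict;
use warnings;

# Hypothetical helper: report which of the 20 codes a classification misses or repeats
sub partition_problems {
    my (@groups) = @_;
    my %seen;
    $seen{$_}++ for map { split(//, $_) } @groups;
    my @missing  = grep { !$seen{$_} } split(//, 'ACDEFGHIKLMNPQRSTVWY');
    my @repeated = grep { $seen{$_} > 1 } sort keys %seen;
    return (\@missing, \@repeated);
}

# Checking the molecule size classification used in analyze
my ($missing, $repeated) = partition_problems('AGS', 'CDNPT', 'EHQV', 'IKLMR', 'FWY');
print "missing: @$missing; repeated: @$repeated\n";    # both lists are empty here
```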
|
|
If you want to place a link to the application on some other page, include the following in that page's HTML:
<a href="/cgi-bin/proteins.pl">Quantitative analysis of proteins</a>
|
|