Monday, August 10, 2015

Protein scrabble

I have been playing around with displaying protein sequences in 2D.  I find it interesting that many protein sequences line up vertically when you draw them as a two-dimensional grid with a specific number of amino acids in each row.
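As a rough illustration of the grid idea, here is a minimal Python sketch that cuts a one-letter amino acid string into rows of a fixed width. The function name `wrap_sequence` and the demo sequence are made up for this example; this is not the tool that produced the pictures in this post.

```python
def wrap_sequence(seq, width):
    """Split a one-letter amino acid string into rows of `width` residues."""
    if width <= 0:
        raise ValueError("width must be positive")
    return [seq[i:i + width] for i in range(0, len(seq), width)]


# Hypothetical demo sequence with a repeat every 10 residues.
demo = "MKTAYIAKQRMKTAYLAKQRMKSAYIAKQR"
for row in wrap_sequence(demo, 10):
    print(row)
```

When the row width matches the repeat length, similar residues stack on top of each other in the printed rows; with any other width the columns drift apart.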

The colours represent the chemical properties of the amino acids.  Each letter represents one of the 20 standard amino acids found in proteins:

The red blocks indicate splice junctions, where the exons were spliced together to form the protein. The parts between the red markers are the sections which COULD be alternatively spliced to form different proteins using different combinations of included or excluded exons.

Pink blocks represent Hydrophobic amino acids, which normally get folded to the inside of a protein.
Polar amino acids normally fold to the outside, because they like to interact with water molecules.
Blue blocks represent Positively Charged Polar amino acids
Cyan blocks represent Uncharged Polar amino acids
Green blocks represent Negatively charged Polar amino acids
Yellow blocks represent the STOP codons in the protein
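
The colour legend above can be sketched as a simple lookup. The grouping of residues into classes below is an assumption (textbooks differ on borderline cases such as G, C, Y and H), and `colour_for` is a hypothetical helper name; the exact grouping used for the pictures may differ.

```python
# Assumed grouping of the 20 standard amino acids (one common convention).
HYDROPHOBIC = set("AVLIMFWPG")
POSITIVE = set("KRH")
NEGATIVE = set("DE")
UNCHARGED_POLAR = set("STNQCY")


def colour_for(residue):
    """Return the legend colour for a one-letter residue code."""
    r = residue.upper()
    if r == "*":  # '*' is the usual symbol for a translated STOP codon
        return "yellow"
    if r in HYDROPHOBIC:
        return "pink"
    if r in POSITIVE:
        return "blue"
    if r in UNCHARGED_POLAR:
        return "cyan"
    if r in NEGATIVE:
        return "green"
    return "unknown"


print([colour_for(r) for r in "MKDS*"])
```

Splice junctions (red) are not residues, so they would have to be marked separately from positions in the sequence rather than through this lookup.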

This is a view of the LTBP1 Protein primary structure with a width of 41
(found on chromosome 2)

This is a view of the KIDINS220 protein primary structure with a width of 33 
(found on chromosome 2)
What was interesting to me is that the exons, and the amino acids with similar properties, neatly align vertically like letters in a crossword puzzle.


The following is the TITIN protein
(found on chromosome 2)
(This is one of the biggest proteins in the human body at 33423 amino acids in length)
It is built from 312 exon sequences which are spliced together.
Here it is shown with a width of 93 amino acids in each row.  You can see how the exon splice positions line up in the view.
(The colours also show how amino acids with similar properties line up every 93 amino acids.)


Here is another snippet of the same protein - but with a width of 94 amino acids. See how the exon positions align.



The following is the Nebulin (NEB) protein
(found on chromosome 2)
Here it is shown with a width of 35 amino acids in each row.  



The following is the Obscurin (OBSCN) protein which interacts with the TITIN protein in the cytoskeleton
(found on chromosome 1)
Here it is shown with a width of 88 amino acids in each row.



The following is the growth factor produced by the MEGF6 gene
(found on chromosome 1)
Here it is shown with a width of 44 amino acids in each row.



The following is the Collagen type 11 (from the COL11A2 gene)
(found on chromosome 6)
Here it is shown with a width of 18 amino acids in each row. (The reason there are so many G's at the start of exons is that the consensus sequence AG at intron ends is often followed by another G, which falls inside the next exon. A codon starting with that G may code for G, D, V, E, or A.)
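
The claim about codons that start with G can be checked against the standard genetic code. The sketch below builds the standard codon table from its usual compact form and collects every amino acid whose codon begins with G. It assumes the exon starts exactly at a codon boundary, so that the G after the intron is the first base of a codon.

```python
from itertools import product

# Standard genetic code in compact form: bases ordered T, C, A, G,
# with the third base varying fastest.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

# Amino acids encoded by any codon whose first base is G.
g_first = sorted({aa for codon, aa in CODON_TABLE.items() if codon[0] == "G"})
print(g_first)  # ['A', 'D', 'E', 'G', 'V']
```

Four of the sixteen G-first codons (GGT, GGC, GGA, GGG) code for glycine, and the rest code for D, E, V or A.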



Another protein that forms exon patterns when shown at a row width of 18 is Collagen, type 5 (coded by the COL5A3 gene)
(found on chromosome 19)
Here it is shown with a width of 18 amino acids in each row.



Another is the Collagen protein from the COL28A1 gene
(found on chromosome 7)
Here it is shown with a width of 23 amino acids in each row. 



The following is Relaxin/insulin-like family peptide receptor 1 (RXFP1)
(found on chromosome 4)
Here it is shown with a width of 24 amino acids in each row.



I realized that the reason for these patterns is exons with similar lengths repeating in the same protein.  But what is puzzling to me is that the exons may be separated by many bases of intron on the chromosome, but still end up making these patterns when spliced together.
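
One way to look for a good row width automatically is to check, for each candidate width, how many splice junctions land in the same column. The sketch below is only an illustration under simplifying assumptions: exon lengths are given in whole amino acids (real exon boundaries often fall in the middle of a codon), the function names are made up, and the example lengths are invented rather than taken from any of the proteins above.

```python
from collections import Counter
from itertools import accumulate


def junction_columns(exon_lengths, width):
    """Column (0-based) of each exon-exon junction when rows hold `width` residues."""
    positions = list(accumulate(exon_lengths))[:-1]  # drop the end of the last exon
    return [p % width for p in positions]


def best_widths(exon_lengths, min_width=10, max_width=100, top=3):
    """Rank widths by the fraction of junctions that share the most common column."""
    scores = []
    for width in range(min_width, max_width + 1):
        cols = junction_columns(exon_lengths, width)
        if not cols:
            continue
        most_common = Counter(cols).most_common(1)[0][1]
        scores.append((most_common / len(cols), width))
    scores.sort(key=lambda s: (-s[0], s[1]))  # best score first, then smaller width
    return scores[:top]


# Invented example: six exons of 18 residues each.
print(best_widths([18] * 6))
```

If many exons share a length, or lengths that are multiples of a common value, the score peaks at that width, which matches the explanation above.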
