Thursday, August 27, 2015

Hemicentin 1 (HMCN1)

Human Hemicentin protein is encoded by the HMCN1 gene on chromosome 1 at position 
chr1:186024525-186024806 and consists of 107 exons encoded on the + strand which are spliced together to code for a protein of length 5636 amino acids. 
Mutations in this gene may be associated with age-related macular degeneration in the eye.


I have arranged the amino acids based in a row length of 93 amino acids, which gives me the following ordered pattern of amino acids....
Notice how the exon splice junctions (marked with red blocks) nicely line up.  One can also observe that there are clearly visible hydrophobic (pink), polar uncharged (blue) bands.  The negatively charged (green) and positively charged (cyan) amino acids are scattered at the edges of the hydrophobic regions.


Here is the same protein using a heat map.



When one looks at the amino acid fasta letters the same pattern is observable.


For those interested here is the FASTA sequence:

>HMCN1 ; Homo sapiens hemicentin 1 (HMCN1), mRNA. ; Protein length = 5636 (uc001grq.1 Full gene position = chr1:185703683-186160085)
MISWEVVHTVFLFALLYSSLAQDASPQSEIRAEEIPEGASTLAFVFDVTGSMYDDLVQVIEGASKILETSLKRPKRPLFNFALVPFHDPEIGPV
TITTDPKKFQYELRELYVQGGGDCPEMSIGAIKIALEISLPGSFIYVFTDARSKDYRLTHEVLQLIQQKQSQVVFVLTGDCDDRTHIGYKVYEE
IASTSSGQVFHLDKKQVNEVLKWVEEAVQASKVHLLSTDHLEQAVNTWRIPFDPSLKEVTVSLSGPSPMIEIRNPLGKLIKKGFGLHELLNIHN
SAKVVNVKEPEAGMWTVKTSSSGRHSVRITGLSTIDFRAGFSRKPTLDFKKTVSRPVQGIPTYVLLNTSGISTPARIDLLELLSISGSSLKTIP
VKYYPHRKPYGIWNISDFVPPNEAFFLKVTGYDKDDYLFQRVSSVSFSSIVPDAPKVTMPEKTPGYYLQPGQIPCSVDSLLPFTLSFVRNGVTL
GVDQYLKESASVNLDIAKVTLSDEGFYECIAVSSAGTGRAQTFFDVSEPPPVIQVPNNVTVTPGERAVLTCLIISAVDYNLTWQRNDRDVRLAE
PARIRTLANLSLELKSVKFNDAGEYHCMVSSEGGSSAASVFLTVQEPPKVTVMPKNQSFTGGSEVSIMCSATGYPKPKIAWTVNDMFIVGSHRY
RMTSDGTLFIKNAAPKDAGIYGCLASNSAGTDKQNSTLRYIEAPKLMVVQSELLVALGDITVMECKTSGIPPPQVKWFKGDLELRPSTFLIIDP
LLGLLKIQETQDLDAGDYTCVAINEAGRATGKITLDVGSPPVFIQEPADVSMEIGSNVTLPCYVQGYPEPTIKWRRLDNMPIFSRPFSVSSISQ
LRTGALFILNLWASDKGTYICEAENQFGKIQSETTVTVTGLVAPLIGISPSVANVIEGQQLTLPCTLLAGNPIPERRWIKNSAMLLQNPYITVR
SDGSLHIERVQLQDGGEYTCVASNVAGTNNKTTSVVVHVLPTIQHGQQILSTIEGIPVTLPCKASGNPKPSVIWSKKGELISTSSAKFSAGADG
SLYVVSPGGEESGEYVCTATNTAGYAKRKVQLTVYVRPRVFGDQRGLSQDKPVEISVLAGEEVTLPCEVKSLPPPIITWAKETQLISPFSPRHT
FLPSGSMKITETRTSDSGMYLCVATNIAGNVTQAVKLNVHVPPKIQRGPKHLKVQVGQRVDIPCNAQGTPLPVITWSKGGSTMLVDGEHHVSNP
DGTLSIDQATPSDAGIYTCVATNIAGTDETEITLHVQEPPTVEDLEPPYNTTFQERVANQRIEFPCPAKGTPKPTIKWLHNGRELTGREPGISI
LEDGTLLVIASVTPYDNGEYICVAVNEAGTTERKYNLKVHVPPVIKDKEQVTNVSVLLNQLTNLFCEVEGTPSPIIMWYKDNVQVTESSTIQTV
NNGKILKLFRATPEDAGRYSCKAINIAGTSQKYFNIDVLVPPTIIGTNFPNEVSVVLNRDVALECQVKGTPFPDIHWFKDGKPLFLGDPNVELL
DRGQVLHLKNARRNDKGRYQCTVSNAAGKQAKDIKLTIYIPPSIKGGNVTTDISVLINSLIKLECETRGLPMPAITWYKDGQPIMSSSQALYID
KGQYLHIPRAQVSDSATYTCHVANVAGTAEKSFHVDVYVPPMIEGNLATPLNKQVVIAHSLTLECKAAGNPSPILTWLKDGVPVKANDNIRIEA
GGKKLEIMSAQEIDRGQYICVATSVAGEKEIKYEVDVLVPPAIEGGDETSYFIVMVNNLLELDCHVTGSPPPTIMWLKDGQLIDERDGFKILLN
GRKLVIAQAQVSNTGLYRCMAANTAGDHKKEFEVTVHVPPTIKSSGLSERVVVKYKPVALQCIANGIPNPSITWLKDDQPVNTAQGNLKIQSSG
RVLQIAKTLLEDAGRYTCVATNAAGETQQHIQLHVHEPPSLEDAGKMLNETVLVSNPVQLECKAAGNPVPVITWYKDNRLLSGSTSMTFLNRGQ
IIDIESAQISDAGIYKCVAINSAGATELFYSLQVHVAPSISGSNNMVAVVVNNPVRLECEARGIPAPSLTWLKDGSPVSSFSNGLQVLSGGRIL
ALTSAQISDTGRYTCVAVNAAGEKQRDIDLRVYVPPNIMGEEQNVSVLISQAVELLCQSDAIPPPTLTWLKDGHPLLKKPGLSISENRSVLKIE
DAQVQDTGRYTCEATNVAGKTEKNYNVNIWVPPNIGGSDELTQLTVIEGNLISLLCESSGIPPPNLIWKKKGSPVLTDSMGRVRILSGGRQLQI
SIAEKSDAALYSCVASNVAGTAKKEYNLQVYIRPTITNSGSHPTEIIVTRGKSISLECEVQGIPPPTVTWMKDGHPLIKAKGVEILDEGHILQL
KNIHVSDTGRYVCVAVNVAGMTDKKYDLSVHAPPSIIGNHRSPENISVVEKNSVSLTCEASGIPLPSITWFKDGWPVSLSNSVRILSGGRMLRL
MQTTMEDAGQYTCVVRNAAGEERKIFGLSVLVPPHIVGENTLEDVKVKEKQSVTLTCEVTGNPVPEITWHKDGQPLQEDEAHHIISGGRFLQIT
NVQVPHTGRYTCLASSPAGHKSRSFSLNVFVSPTIAGVGSDGNPEDVTVILNSPTSLVCEAYSYPPATITWFKDGTPLESNRNIRILPGGRTLQ
ILNAQEDNAGRYSCVATNEAGEMIKHYEVKVYIPPIINKGDLWGPGLSPKEVKIKVNNTLTLECEAYAIPSASLSWYKDGQPLKSDDHVNIAAN
GHTLQIKEAQISDTGRYTCVASNIAGEDELDFDVNIQVPPSFQKLWEIGNMLDTGRNGEAKDVIINNPISLYCETNAAPPPTLTWYKDGHPLTS
SDKVLILPGGRVLQIPRAKVEDAGRYTCVAVNEAGEDSLQYDVRVLVPPIIKGANSDLPEEVTVLVNKSALIECLSSGSPAPRNSWQKDGQPLL
EDDHHKFLSNGRILQILNTQITDIGRYVCVAENTAGSAKKYFNLNVHVPPSVIGPKSENLTVVVNNFISLTCEVSGFPPPDLSWLKNEQPIKLN
TNTLIVPGGRTLQIIRAKVSDGGEYTCIAINQAGESKKKFSLTVYVPPSIKDHDSESLSVVNVREGTSVSLECESNAVPPPVITWYKNGRMITE
STHVEILADGQMLHIKKAEVSDTGQYVCRAINVAGRDDKNFHLNVYVPPSIEGPEREVIVETISNPVTLTCDATGIPPPTIAWLKNHKRIENSD
SLEVRILSGGSKLQIARSQHSDSGNYTCIASNMEGKAQKYYFLSIQVPPSVAGAEIPSDVSVLLGENVELVCNANGIPTPLIQWLKDGKPIASG
ETERIRVSANGSTLNIYGALTSDTGKYTCVATNPAGEEDRIFNLNVYVTPTIRGNKDEAEKLMTLVDTSINIECRATGTPPPQINWLKNGLPLP
LSSHIRLLAAGQVIRIVRAQVSDVAVYTCVASNRAGVDNKHYNLQVFAPPNMDNSMGTEEITVLKGSSTSMACITDGTPAPSMAWLRDGQPLGL
DAHLTVSTHGMVLQLLKAETEDSGKYTCIASNEAGEVSKHFILKVLEPPHINGSEEHEEISVIVNNPLELTCIASGIPAPKMTWMKDGRPLPQT
DQVQTLGGGEVLRISTAQVEDTGRYTCLASSPAGDDDKEYLVRVHVPPNIAGTDEPRDITVLRNRQVTLECKSDAVPPPVITWLRNGERLQATP
RVRILSGGRYLQINNADLGDTANYTCVASNIAGKTTREFILTVNVPPNIKGGPQSLVILLNKSTVLECIAEGVPTPRITWRKDGAVLAGNHARY
SILENGFLHIQSAHVTDTGRYLCMATNAAGTDRRRIDLQVHVPPSIAPGPTNMTVIVNVQTTLACEATGIPKPSINWRKNGHLLNVDQNQNSYR
LLSSGSLVIISPSVDDTATYECTVTNGAGDDKRTVDLTVQVPPSIADEPTDFLVTKHAPAVITCTASGVPFPSIHWTKNGIRLLPRGDGYRILS
SGAIEILATQLNHAGRYTCVARNAAGSAHRHVTLHVHEPPVIQPQPSELHVILNNPILLPCEATGTPSPFITWQKEGINVNTSGRNHAVLPSGG
LQISRAVREDAGTYMCVAQNPAGTALGKIKLNVQVPPVISPHLKEYVIAVDKPITLSCEADGLPPPDITWHKDGRAIVESIRQRVLSSGSLQIA
FVQPGDAGHYTCMAANVAGSSSTSTKLTVHVPPRIRSTEGHYTVNENSQAILPCVADGIPTPAINWKKDNVLLANLLGKYTAEPYGELILENVV
LEDSGFYTCVANNAAGEDTHTVSLTVHVLPTFTELPGDVSLNKGEQLRLSCKATGIPLPKLTWTFNNNIIPAHFDSVNGHSELVIERVSKEDSG
TYVCTAENSVGFVKAIGFVYVKEPPVFKGDYPSNWIEPLGGNAILNCEVKGDPTPTIQWNRKGVDIEISHRIRQLGNGSLAIYGTVNEDAGDYT
CVATNEAGVVERSMSLTLQSPPIITLEPVETVINAGGKIILNCQATGEPQPTITWSRQGHSISWDDRVNVLSNNSLYIADAQKEDTSEFECVAR
NLMGSVLVRVPVIVQVHGGFSQWSAWRACSVTCGKGIQKRSRLCNQPLPANGGKPCQGSDLEMRNCQNKPCPVDGSWSEWSLWEECTRSCGRGN
QTRTRTCNNPSVQHGGRPCEGNAVEIIMCNIRPCPVHGAWSAWQPWGTCSESCGKGTQTRARLCNNPPPAFGGSYCDGAETQMQVCNERNCPIH
GKWATWASWSACSVSCGGGARQRTRGCSDPVPQYGGRKCEGSDVQSDFCNSDPCPTHGNWSPWSGWGTCSRTCNGGQMRRYRTCDNPPPSNGGR
ACGGPDSQIQRCNTDMCPVDGSWGSWHSWSQCSASCGGGEKTRKRLCDHPVPVKGGRPCPGDTTQVTRCNVQACPGGPQRARGSVIGNINDVEF
GIAFLNATITDSPNSDTRIIRAKITNVPRSLGSAMRKIVSILNPIYWTTAKEIGEAVNGFTLTNAVFKRETQVEFATGEILQMSHIARGLDSDG
SLLLDIVVSGYVLQLQSPAEVTVKDYTEDYIQTGPGQLYAYSTRLFTIDGISIPYTWNHTVFYDQAQGRMPFLVETLHASSVESDYNQIEETLG
FKIHASISKGDRSNQCPSGFTLDSVGPFCADEDECAAGNPCSHSCHNAMGTYYCSCPKGLTIAADGRTCQDIDECALGRHTCHAGQDCDNTIGS
YRCVVRCGSGFRRTSDGLSCQDINECQESSPCHQRCFNAIGSFHCGCEPGYQLKGRKCMDVNECRQNVCRPDQHCKNTRGGYKCIDLCPNGMTK
AENGTCIDIDECKDGTHQCRYNQICENTRGSYRCVCPRGYRSQGVGRPCMDINECEQVPKPCAHQCSNTPGSFKCICPPGQHLLGDGKSCAGLE
RLPNYGTQYSSYNLARFSPVRNNYQPQQHYRQYSHLYSSYSEYRNSRTSLSRTRRTIRKTCPEGSEASHDTCVDIDECENTDACQHECKNTFGS
YQCICPPGYQLTHNGKTCQDIDECLEQNVHCGPNRMCFNMRGSYQCIDTPCPPNYQRDPVSGFCLKNCPPNDLECALSPYALEYKLVSLPFGIA
TNQDLIRLVAYTQDGVMHPRTTFLMVDEEQTVPFALRDENLKGVVYTTRPLREAETYRMRVRASSYSANGTIEYQTTFIVYIAVSAYPY*

Monday, August 10, 2015

Protein scrabble

I have been playing around displaying protein sequences in 2D.  I find it interesting that many protein sequences line up vertically when you draw the protein sequences as a 2 dimensional grid with a specific number of amino acids in each row.

The colours represent the chemical properties of the amino acids.  Each letter represent one of the 20 amino acids in nature:

The Red blocks indicates (splice junctions) where the exons were spliced together to form the proteins - the parts between the red markers are the sections which COULD be alternatively spliced to form different proteins using different combination of included or excluded exons.

Pink blocks represent Hydrophobic amino acids which normally gets folded to the inside of a protein.
Polar amino acids normally automatically fold to be on the outside, because it likes to interact with water molecules.
Blue blocks represent Positively Charged Polar amino acids
Cyan blocks represent Uncharged Polar amino acids
Green blocks represent Negatively charged Polar amino acids
Yellow blocks represent the STOP codons in the protein

This is a view of the LTBP1 Protein primary structure with a width of 41
(found on chromosome 2)

This is a view of the KIDINS220 protein primary structure with a width of 33 
(found on chromosome 2)
What was interesting to me is that all of the exons and amino acids with similar properties neatly aligns vertically like letters in a crossword puzzle. 


The following is the TITIN protein
(found on chromosome 2)
(This is one of the biggest proteins in the human body at 33423 amino acids in length)
It is built from 312 exon sequences which are spliced together.
Here it is shown with a width of 93 amino acids in each row.  You can see how the exon splice positions lines up in the view. 
(The colours also shows how amino acids with similar properties line up every 93 amino acids


Here is another snippet of the same protein - but with a width of 94 amino acids. See how the exon positions align.



The following is the Nebulin (NEB) protein
(found on chromosome 2)
Here it is shown with a width of 35 amino acids in each row.  



The following is the Obscurin (OBSCN) protein which interacts with the TITIN protein in the cytoskeleton
(found on chromosome 1)
Here it is shown with a width of 88 amino acids in each row.



The following is the growth factor produced by the MEGF6 gene
(found on chromosome 1)
Here it is shown with a width of 44 amino acids in each row.



The following is the Collagen type 11 (from the COL11A2 gene)
(found on chromosome 6)
Here it is shown with a width of 18 amino acids in each row. (The reason that there are so many G's at the start of exons, is due to the fact that the consensus sequence AG at intron ends many times are followed by another G which falls inside the next exon - and that may code for G,D,V,E, or A) 



Another protein that forms exon patters when shown at row width of 18 is Collagen, type 5,  (coded by the COL5A3 gene)
(found on chromosome 19)
Here it is shown with a width of 18 amino acids in each row.



Another is the Collagen protein from gene COL28A1)
(found on chromosome 7)
Here it is shown with a width of 23 amino acids in each row. 



The following is Relaxin/insulin-like family peptide receptor 1 (RXFP1)
(found on chromosome 4)
Here it is shown with a width of 24 amino acids in each row.



I realized that the reason for these patters are exons with similar lengths repeating in the same protein.  But what is puzzling to me is that the exons may be located many bases of introns apart on the chromosome, but still end up making these patterns when spliced together.