Splice Junction - The Visual Genome Browser: August 2015

Human Hemicentin protein is encoded by the HMCN1 gene on chromosome 1 at position

chr1:186024525-186024806 and consists of 107 exons encoded on the + strand which are spliced together to code for a protein of length 5636 amino acids.

Mutations in this gene may be associated with age-related macular degeneration in the eye.

I have arranged the amino acids based in a row length of 93 amino acids, which gives me the following ordered pattern of amino acids....

Notice how the exon splice junctions (marked with red blocks) nicely line up. One can also observe that there are clearly visible hydrophobic (pink), polar uncharged (blue) bands. The negatively charged (green) and positively charged (cyan) amino acids are scattered at the edges of the hydrophobic regions.

Here is the same protein using a heat map.

When one looks at the amino acid fasta letters the same pattern is observable.

For those interested here is the FASTA sequence:

>HMCN1 ; Homo sapiens hemicentin 1 (HMCN1), mRNA. ; Protein length = 5636 (uc001grq.1 Full gene position = chr1:185703683-186160085)

MISWEVVHTVFLFALLYSSLAQDASPQSEIRAEEIPEGASTLAFVFDVTGSMYDDLVQVIEGASKILETSLKRPKRPLFNFALVPFHDPEIGPV

TITTDPKKFQYELRELYVQGGGDCPEMSIGAIKIALEISLPGSFIYVFTDARSKDYRLTHEVLQLIQQKQSQVVFVLTGDCDDRTHIGYKVYEE

IASTSSGQVFHLDKKQVNEVLKWVEEAVQASKVHLLSTDHLEQAVNTWRIPFDPSLKEVTVSLSGPSPMIEIRNPLGKLIKKGFGLHELLNIHN

SAKVVNVKEPEAGMWTVKTSSSGRHSVRITGLSTIDFRAGFSRKPTLDFKKTVSRPVQGIPTYVLLNTSGISTPARIDLLELLSISGSSLKTIP

VKYYPHRKPYGIWNISDFVPPNEAFFLKVTGYDKDDYLFQRVSSVSFSSIVPDAPKVTMPEKTPGYYLQPGQIPCSVDSLLPFTLSFVRNGVTL

GVDQYLKESASVNLDIAKVTLSDEGFYECIAVSSAGTGRAQTFFDVSEPPPVIQVPNNVTVTPGERAVLTCLIISAVDYNLTWQRNDRDVRLAE

PARIRTLANLSLELKSVKFNDAGEYHCMVSSEGGSSAASVFLTVQEPPKVTVMPKNQSFTGGSEVSIMCSATGYPKPKIAWTVNDMFIVGSHRY

RMTSDGTLFIKNAAPKDAGIYGCLASNSAGTDKQNSTLRYIEAPKLMVVQSELLVALGDITVMECKTSGIPPPQVKWFKGDLELRPSTFLIIDP

LLGLLKIQETQDLDAGDYTCVAINEAGRATGKITLDVGSPPVFIQEPADVSMEIGSNVTLPCYVQGYPEPTIKWRRLDNMPIFSRPFSVSSISQ

LRTGALFILNLWASDKGTYICEAENQFGKIQSETTVTVTGLVAPLIGISPSVANVIEGQQLTLPCTLLAGNPIPERRWIKNSAMLLQNPYITVR

SDGSLHIERVQLQDGGEYTCVASNVAGTNNKTTSVVVHVLPTIQHGQQILSTIEGIPVTLPCKASGNPKPSVIWSKKGELISTSSAKFSAGADG

SLYVVSPGGEESGEYVCTATNTAGYAKRKVQLTVYVRPRVFGDQRGLSQDKPVEISVLAGEEVTLPCEVKSLPPPIITWAKETQLISPFSPRHT

FLPSGSMKITETRTSDSGMYLCVATNIAGNVTQAVKLNVHVPPKIQRGPKHLKVQVGQRVDIPCNAQGTPLPVITWSKGGSTMLVDGEHHVSNP

DGTLSIDQATPSDAGIYTCVATNIAGTDETEITLHVQEPPTVEDLEPPYNTTFQERVANQRIEFPCPAKGTPKPTIKWLHNGRELTGREPGISI

LEDGTLLVIASVTPYDNGEYICVAVNEAGTTERKYNLKVHVPPVIKDKEQVTNVSVLLNQLTNLFCEVEGTPSPIIMWYKDNVQVTESSTIQTV

NNGKILKLFRATPEDAGRYSCKAINIAGTSQKYFNIDVLVPPTIIGTNFPNEVSVVLNRDVALECQVKGTPFPDIHWFKDGKPLFLGDPNVELL

DRGQVLHLKNARRNDKGRYQCTVSNAAGKQAKDIKLTIYIPPSIKGGNVTTDISVLINSLIKLECETRGLPMPAITWYKDGQPIMSSSQALYID

KGQYLHIPRAQVSDSATYTCHVANVAGTAEKSFHVDVYVPPMIEGNLATPLNKQVVIAHSLTLECKAAGNPSPILTWLKDGVPVKANDNIRIEA

GGKKLEIMSAQEIDRGQYICVATSVAGEKEIKYEVDVLVPPAIEGGDETSYFIVMVNNLLELDCHVTGSPPPTIMWLKDGQLIDERDGFKILLN

GRKLVIAQAQVSNTGLYRCMAANTAGDHKKEFEVTVHVPPTIKSSGLSERVVVKYKPVALQCIANGIPNPSITWLKDDQPVNTAQGNLKIQSSG

RVLQIAKTLLEDAGRYTCVATNAAGETQQHIQLHVHEPPSLEDAGKMLNETVLVSNPVQLECKAAGNPVPVITWYKDNRLLSGSTSMTFLNRGQ

IIDIESAQISDAGIYKCVAINSAGATELFYSLQVHVAPSISGSNNMVAVVVNNPVRLECEARGIPAPSLTWLKDGSPVSSFSNGLQVLSGGRIL

ALTSAQISDTGRYTCVAVNAAGEKQRDIDLRVYVPPNIMGEEQNVSVLISQAVELLCQSDAIPPPTLTWLKDGHPLLKKPGLSISENRSVLKIE

DAQVQDTGRYTCEATNVAGKTEKNYNVNIWVPPNIGGSDELTQLTVIEGNLISLLCESSGIPPPNLIWKKKGSPVLTDSMGRVRILSGGRQLQI

SIAEKSDAALYSCVASNVAGTAKKEYNLQVYIRPTITNSGSHPTEIIVTRGKSISLECEVQGIPPPTVTWMKDGHPLIKAKGVEILDEGHILQL

KNIHVSDTGRYVCVAVNVAGMTDKKYDLSVHAPPSIIGNHRSPENISVVEKNSVSLTCEASGIPLPSITWFKDGWPVSLSNSVRILSGGRMLRL

MQTTMEDAGQYTCVVRNAAGEERKIFGLSVLVPPHIVGENTLEDVKVKEKQSVTLTCEVTGNPVPEITWHKDGQPLQEDEAHHIISGGRFLQIT

NVQVPHTGRYTCLASSPAGHKSRSFSLNVFVSPTIAGVGSDGNPEDVTVILNSPTSLVCEAYSYPPATITWFKDGTPLESNRNIRILPGGRTLQ

ILNAQEDNAGRYSCVATNEAGEMIKHYEVKVYIPPIINKGDLWGPGLSPKEVKIKVNNTLTLECEAYAIPSASLSWYKDGQPLKSDDHVNIAAN

GHTLQIKEAQISDTGRYTCVASNIAGEDELDFDVNIQVPPSFQKLWEIGNMLDTGRNGEAKDVIINNPISLYCETNAAPPPTLTWYKDGHPLTS

SDKVLILPGGRVLQIPRAKVEDAGRYTCVAVNEAGEDSLQYDVRVLVPPIIKGANSDLPEEVTVLVNKSALIECLSSGSPAPRNSWQKDGQPLL

EDDHHKFLSNGRILQILNTQITDIGRYVCVAENTAGSAKKYFNLNVHVPPSVIGPKSENLTVVVNNFISLTCEVSGFPPPDLSWLKNEQPIKLN

TNTLIVPGGRTLQIIRAKVSDGGEYTCIAINQAGESKKKFSLTVYVPPSIKDHDSESLSVVNVREGTSVSLECESNAVPPPVITWYKNGRMITE

STHVEILADGQMLHIKKAEVSDTGQYVCRAINVAGRDDKNFHLNVYVPPSIEGPEREVIVETISNPVTLTCDATGIPPPTIAWLKNHKRIENSD

SLEVRILSGGSKLQIARSQHSDSGNYTCIASNMEGKAQKYYFLSIQVPPSVAGAEIPSDVSVLLGENVELVCNANGIPTPLIQWLKDGKPIASG

ETERIRVSANGSTLNIYGALTSDTGKYTCVATNPAGEEDRIFNLNVYVTPTIRGNKDEAEKLMTLVDTSINIECRATGTPPPQINWLKNGLPLP

LSSHIRLLAAGQVIRIVRAQVSDVAVYTCVASNRAGVDNKHYNLQVFAPPNMDNSMGTEEITVLKGSSTSMACITDGTPAPSMAWLRDGQPLGL

DAHLTVSTHGMVLQLLKAETEDSGKYTCIASNEAGEVSKHFILKVLEPPHINGSEEHEEISVIVNNPLELTCIASGIPAPKMTWMKDGRPLPQT

DQVQTLGGGEVLRISTAQVEDTGRYTCLASSPAGDDDKEYLVRVHVPPNIAGTDEPRDITVLRNRQVTLECKSDAVPPPVITWLRNGERLQATP

RVRILSGGRYLQINNADLGDTANYTCVASNIAGKTTREFILTVNVPPNIKGGPQSLVILLNKSTVLECIAEGVPTPRITWRKDGAVLAGNHARY

SILENGFLHIQSAHVTDTGRYLCMATNAAGTDRRRIDLQVHVPPSIAPGPTNMTVIVNVQTTLACEATGIPKPSINWRKNGHLLNVDQNQNSYR

LLSSGSLVIISPSVDDTATYECTVTNGAGDDKRTVDLTVQVPPSIADEPTDFLVTKHAPAVITCTASGVPFPSIHWTKNGIRLLPRGDGYRILS

SGAIEILATQLNHAGRYTCVARNAAGSAHRHVTLHVHEPPVIQPQPSELHVILNNPILLPCEATGTPSPFITWQKEGINVNTSGRNHAVLPSGG

LQISRAVREDAGTYMCVAQNPAGTALGKIKLNVQVPPVISPHLKEYVIAVDKPITLSCEADGLPPPDITWHKDGRAIVESIRQRVLSSGSLQIA

FVQPGDAGHYTCMAANVAGSSSTSTKLTVHVPPRIRSTEGHYTVNENSQAILPCVADGIPTPAINWKKDNVLLANLLGKYTAEPYGELILENVV

LEDSGFYTCVANNAAGEDTHTVSLTVHVLPTFTELPGDVSLNKGEQLRLSCKATGIPLPKLTWTFNNNIIPAHFDSVNGHSELVIERVSKEDSG

TYVCTAENSVGFVKAIGFVYVKEPPVFKGDYPSNWIEPLGGNAILNCEVKGDPTPTIQWNRKGVDIEISHRIRQLGNGSLAIYGTVNEDAGDYT

CVATNEAGVVERSMSLTLQSPPIITLEPVETVINAGGKIILNCQATGEPQPTITWSRQGHSISWDDRVNVLSNNSLYIADAQKEDTSEFECVAR

NLMGSVLVRVPVIVQVHGGFSQWSAWRACSVTCGKGIQKRSRLCNQPLPANGGKPCQGSDLEMRNCQNKPCPVDGSWSEWSLWEECTRSCGRGN

QTRTRTCNNPSVQHGGRPCEGNAVEIIMCNIRPCPVHGAWSAWQPWGTCSESCGKGTQTRARLCNNPPPAFGGSYCDGAETQMQVCNERNCPIH

GKWATWASWSACSVSCGGGARQRTRGCSDPVPQYGGRKCEGSDVQSDFCNSDPCPTHGNWSPWSGWGTCSRTCNGGQMRRYRTCDNPPPSNGGR

ACGGPDSQIQRCNTDMCPVDGSWGSWHSWSQCSASCGGGEKTRKRLCDHPVPVKGGRPCPGDTTQVTRCNVQACPGGPQRARGSVIGNINDVEF

GIAFLNATITDSPNSDTRIIRAKITNVPRSLGSAMRKIVSILNPIYWTTAKEIGEAVNGFTLTNAVFKRETQVEFATGEILQMSHIARGLDSDG

SLLLDIVVSGYVLQLQSPAEVTVKDYTEDYIQTGPGQLYAYSTRLFTIDGISIPYTWNHTVFYDQAQGRMPFLVETLHASSVESDYNQIEETLG

FKIHASISKGDRSNQCPSGFTLDSVGPFCADEDECAAGNPCSHSCHNAMGTYYCSCPKGLTIAADGRTCQDIDECALGRHTCHAGQDCDNTIGS

YRCVVRCGSGFRRTSDGLSCQDINECQESSPCHQRCFNAIGSFHCGCEPGYQLKGRKCMDVNECRQNVCRPDQHCKNTRGGYKCIDLCPNGMTK

AENGTCIDIDECKDGTHQCRYNQICENTRGSYRCVCPRGYRSQGVGRPCMDINECEQVPKPCAHQCSNTPGSFKCICPPGQHLLGDGKSCAGLE

RLPNYGTQYSSYNLARFSPVRNNYQPQQHYRQYSHLYSSYSEYRNSRTSLSRTRRTIRKTCPEGSEASHDTCVDIDECENTDACQHECKNTFGS

YQCICPPGYQLTHNGKTCQDIDECLEQNVHCGPNRMCFNMRGSYQCIDTPCPPNYQRDPVSGFCLKNCPPNDLECALSPYALEYKLVSLPFGIA

TNQDLIRLVAYTQDGVMHPRTTFLMVDEEQTVPFALRDENLKGVVYTTRPLREAETYRMRVRASSYSANGTIEYQTTFIVYIAVSAYPY*

I have been playing around displaying protein sequences in 2D. I find it interesting that many protein sequences line up vertically when you draw the protein sequences as a 2 dimensional grid with a specific number of amino acids in each row.

The colours represent the chemical properties of the amino acids. Each letter represent one of the 20 amino acids in nature:

Letters used for amino acids in proteins

The Red blocks indicates (splice junctions) where the exons were spliced together to form the proteins - the parts between the red markers are the sections which COULD be alternatively spliced to form different proteins using different combination of included or excluded exons.

Pink blocks represent Hydrophobic amino acids which normally gets folded to the inside of a protein.

Polar amino acids normally automatically fold to be on the outside, because it likes to interact with water molecules.

Blue blocks represent Positively Charged Polar amino acids

Cyan blocks represent Uncharged Polar amino acids

Green blocks represent Negatively charged Polar amino acids

Yellow blocks represent the STOP codons in the protein

This is a view of the LTBP1 Protein primary structure with a width of 41

(found on chromosome 2)

This is a view of the KIDINS220 protein primary structure with a width of 33

(found on chromosome 2)

What was interesting to me is that all of the exons and amino acids with similar properties neatly aligns vertically like letters in a crossword puzzle.

The following is the TITIN protein

(found on chromosome 2)

(This is one of the biggest proteins in the human body at 33423 amino acids in length)

It is built from 312 exon sequences which are spliced together.

Here it is shown with a width of 93 amino acids in each row. You can see how the exon splice positions lines up in the view.

(The colours also shows how amino acids with similar properties line up every 93 amino acids

Here is another snippet of the same protein - but with a width of 94 amino acids. See how the exon positions align.

The following is the Nebulin (NEB) protein

(found on chromosome 2)

Here it is shown with a width of 35 amino acids in each row.

The following is the Obscurin (OBSCN) protein which interacts with the TITIN protein in the cytoskeleton

(found on chromosome 1)

Here it is shown with a width of 88 amino acids in each row.

The following is the growth factor produced by the MEGF6 gene

(found on chromosome 1)

Here it is shown with a width of 44 amino acids in each row.

The following is the Collagen type 11 (from the COL11A2 gene)

(found on chromosome 6)

Here it is shown with a width of 18 amino acids in each row. (The reason that there are so many G's at the start of exons, is due to the fact that the consensus sequence AG at intron ends many times are followed by another G which falls inside the next exon - and that may code for G,D,V,E, or A)

Another protein that forms exon patters when shown at row width of 18 is Collagen, type 5, (coded by the COL5A3 gene)

(found on chromosome 19)

Here it is shown with a width of 18 amino acids in each row.

Another is the Collagen protein from gene COL28A1)

(found on chromosome 7)

Here it is shown with a width of 23 amino acids in each row.

The following is Relaxin/insulin-like family peptide receptor 1 (RXFP1)

(found on chromosome 4)

Here it is shown with a width of 24 amino acids in each row.

I realized that the reason for these patters are exons with similar lengths repeating in the same protein. But what is puzzling to me is that the exons may be located many bases of introns apart on the chromosome, but still end up making these patterns when spliced together.

Splice Junction - The Visual Genome Browser

Interesting links

Thursday, August 27, 2015

Hemicentin 1 (HMCN1)

Monday, August 10, 2015

Protein scrabble

Total Pageviews

About Me