Thursday, January 2, 2025

How to build SAMTOOLS on a Windows PC

I have recently started to add functionality in the Visual genome Browser to be able to display the aligned reads from a SAM/BAM/CRAM file as a graphical representation similar to IGV. In order o be able to do that, I needed to compile a 64-bit EXE version of SAMTOOLS for Windows PCs. 

 These are the steps that I followed:

I read through the build instructions at:

In the section that says: "Windows MSYS2/MINGW64" I then realized I need a Compilation environment which is called MSYS, which contains a compiler: mingw64.exe


I then proceeded to download the file: msys2-x86_64-20241208.exe



I started the installation on my Windows PC:




Then you RUN: MSYS2 MINGW64
This will open the build environment.

First thing to do is to check if the MSYS is up to date by typing: pacman -Syuu


And respond with Y


Then you have to run it again after it has been upgraded: MSYS2 MINGW64

Also check that $MSYSTEM equals MINGW64 by typing: echo $MSYSTEM




Then update more components by typing the following one after the other and confirming by pressing enter.

pacman -S base-devel mingw-w64-x86_64-toolchain
pacman -S mingw-w64-x86_64-libdeflate mingw-w64-x86_64-zlib mingw-w64-x86_64-bzip2
pacman -S mingw-w64-x86_64-xz mingw-w64-x86_64-curl mingw-w64-x86_64-autotools
pacman -S mingw-w64-x86_64-tools-git






Now download the C language source code base for SAMTOOLS from 

These contain full source repositories which has the required libraries inside them:

Then I unzipped it. Probably a good idea to put it in a folder without spaces in the path, otherwise you will have to use "SAM\ TOOLS" with a backslash escape character to escape the spaces in the path.




Next step is to ensure that the folder where the compiler resides is in your WINDOWS Path variable:
(Depending on where you installed it initially)

E:\msys64\mingw64\bin

This is also where the SAMTOOLS will be put when it is compiled and installed.






Now you change to the folder where you have put the SAMTOOLS source code using the UNIX syntax in the console window of MINGW64
The following procedure is explained in : https://github.com/samtools/samtools/blob/develop/INSTALL

Then you have to change the current directory to where it was unzipped.

cd /F/Home/Development/SAMTOOLS/samtools-1.21

You can type  LS to get the directory listing

You then run autoreconf to create the configure file if it does not exist

Next step is to type: ./configure



Next is to compile and link the exe:
just type: make



This will use the Makefile to build the EXE



At the end of this build process you will end up with the SAMTOOLS.EXE



You can now "install" this to the E:\msys64\mingw64\bin folder which is in your path by typing:
make install

This will result in the SAMTOOLS.exe being put in the folder together with all of its dependencies:


To test if it works you can open a normal windows command line window and change to the folder:
F:\Home\Development\SAMTOOLS\samtools-1.21\test\quickcheck

Then you simply check if it can open a BAM file :
samtools view 3.quickcheck.ok.bam



That is what is needed to compile SAMTOOLS.EXE

If you want to re-run the make for some reason, you will have to remove the config.mk file first otherwise it will not work. (rm = remove)
rm config.mk
rm /e/msys64/mingw64/bin/samtools.exe
./configure

and then proceed to run
make 
and 
make install 


You can follow a similar procedure to build BCFTOOLS as obtained from:
https://www.htslib.org/download/

When you want to view a SAM file:
samtools view "F:\Home\Development\samtools-1.21\htslib-1.21\test\ce#1000.bam" 

When you want to view a CRAM file:
samtools view "F:\Home\Development\samtools-1.21\htslib-1.21\test\ce#1000.cram"

When you want to create an index to allow you to search within ranges:
samtools index "F:\Home\Development\samtools-1.21\htslib-1.21\test\ce#1000.cram"

When you only want to view a specific range:
samtools view "F:\Home\Development\samtools-1.21\htslib-1.21\test\ce#1000.cram" CHROMOSOME_I:3-3


















































Sunday, February 18, 2024

Using the Visual Genome Browser to compare the genome of the Tasmanian Devil (Sarcophilus harrisii) with that of the extinct Tasmanian Tiger/Wolf (Thylacine or Thylacinus cynocephalus)

It was early Spring of September 7, 1936 and Benjamin, the last remaining Tasmanian Tiger/Wolf (scientific name: Thylacinus cynocephalus) was breathing his last breath in an enclosure of the Beaumaris Zoo in Hobart of the Australian island state Tasmania.  It was a sad day for Australia, so much so that 7 September is still commemorated as National Threatened Species Day every year.  

Below are two photos of the Thylacine:



You can also watch a Colourised video of the Tasmanian Tiger Youtube video of it over here.

That was almost 88 years ago, but the Thylacine's voice is not completely silent.  With the power of DNA sequencing one can still discover its secrets by comparing its genome sequence against the genome of other living Australian mammals like the Tasmanian Devil (Sarcophilus harrisii).

Here is an image of a Tasmanian Devil


Both of these animals are carnivorous, both are marsupial mammals.  The gestation period of the Thylacine was 28 days while that of the Tasmanian Devil is 21 days.  When they are born in a relatively immature state they have to crawl into the pouch of the mother where they have to find one of the available teats where they will complete their further development . Both animals has/had a rear facing pouch and both had four teats.

Looking at the images one would be excused for thinking that the Thylacine/Tasmanian Tiger is more closely related to the Dog or the Wolf, but the Thylacine and the Tasmanian Devil only have 7 pairs of chromosomes (12 autosomal and 2 sex chromosomes) while the Dog and the Wolf have 39 pairs of chromosomes (76 autosomal and 2 sex chromosomes) which are considerably different from that of the Thylacine and the Tasmanian Devil.

You can read more on the relatedness of the Thylacine with Australian animals in the following articles:

Genome sequence expands on the story of the extinct Tasmanian tiger
We’ve decoded the numbat genome – and it could bring the thylacine’s resurrection a step closer
https://wwwThe mitochondrial genome sequence of the Tasmanian tiger (Thylacinus cynocephalus).ncbi.nlm.nih.gov/pmc/articles/PMC2652203/



What I really want to get to is to show you how to do this comparison for yourself with the use of my Visual Genome Browser.

The first step in doing the comparison is to install the genomes of both of these marsupial mammals in the browser.

The first step is to click on the following button which will take you to the NCBI Genbank repository:


For the Tasmanian Devil you will be downloading the genome at NCBI.







For the Thylacine you will download its genome from the University of Melbourne's Thylacine genome repository here.


Both these files need to be unzipped into a folder such as :
FastaThylacine or FastaTasmanianTiger or FastaTasmanianWolf below where your genomes are in the data folder.

The next step is to ask the Genome Browser to refresh its detected folders:



This action will rebuild the 2bit genome file from the fasta/fsa file and the GFF annotation file.  On my PC it looks as follows after this step:





When you have completed these steps you will have the 2 animals' genomes available for selection in the folder combo box:


When you select the genome of the Tasmanian Wolf and also click on the Load genes button, the chromosomes will be shown as scaffolds and you will see a view like this:


Now you need to open the Control panel by clicking on the following button:



Next step is to set the Thylacine genome as the "Other genome" which you will be comparing by double clicking on the bottom Overlay text box.


It should set the position as well as turn pink.


Next step is to load the Tasmanian Devil genome by selecting it from the Folder combo box.

Then, after it is loaded and you have clicked the "Load Genes" button, you double click in the top overlay text box next to the red "This".

Now both genomes will be loaded and ready for further comparison.

You should see something like this:



 Now that both genomes are ready, we are going to use the KmerDb tool to do a k-mer comparison between all the chromosomes of the Thylacine and all the chromosomes of the Tasmanian Devil to determine which chromosome from each animal corresponds to the same chromosome in the other animal.  It will essentially calculate the Jaccard similarity between each chromosome and then provide the results in an Overlay table.


This might take a considerable time, and when it is finished it will show you the table with the corresponding chromosomes of this genome (Tasmanian Devil) in the Rows and the Other genome (Thylacine) in the Columns.



The next time you click on the "Create or Read table" button it should take much less tie as it saved the table in a CSV file for later use.

Next step is to create a DOT Plot display using Minimap2 for all the likely matching chromosomes between the 2 animals.  This can be done in 2 ways:

If you want to do it manually, you can simply double click on the matching column and then click on Full Align which will proceed to use Minimap2 to create a PAF file that is used to create a DOT plot for the full alignment between the 2 sequences.  Make sure you have enough RAM, otherwise you will need to select the "Low mem" option which runs much longer but uses less memory.




When it is finished building a dot plot, you will see it in the Dot Plot tab of the bottom panel




You can move the mouse over the DOT Plot to see which genes in both genomes are found at that position as well as get a local alignment of the DNA at that position.  

The Current (This) genome runs along the Y axis and the Other genome sequence runs along the X-axis.

Green is used to indicate a match in the forward direction and purple indicates a match in the reverse complement.  



When you double click in the DOT Plot at the start of a region and CTRL + double click at the end of the region it will display the genes from both genome sequences found in that region.




The dot plot display also allows you to "ALIGN" the sequences from the 2 genomes based on the CIGAR strings for the alignments found in the PAF file generated by the Minimap2 process.

You can do this by holding in the SHIFT key and double clicking at a specific position in the DOT plot.

This action will correctly set the positions in both genomes in order that the Zoom DNA view is at the spot where the 2 sequences can be exactly overlaid on top of each other.  You can switch the Zoom DNA view between 3 modes: Overlay, No Overlay and Aligned Overlay (where local alignment is used to display the 2 genome sequences with colour coded letters). But the gene annotations will only be displayed in the Overlay mode.  



When you are in the Overlay mode of the Zoom DNA view, you can even right click to access the genes in both genomes at that position.



If you now select the "Copy Protein to clipboard + Comparisons" option for both "Zinc finger proteins" in the 2 genomes it will copy the protein sequences to the Comparisons so you can do further analysis on them.

A very powerful feature after you have calculated the Minimap2 Full alignment Dot Plot between the 2 sequences is to Switch to the Comparisons Tab and then 


Next select Fast Global Alignment


It is important to select this Fast option, as it uses optimised code to align sequences at speeds which are orders of magnitude faster than the other options.  The other options are only used to align individual sequences against each other when you want to obtain colour coded output or want to use PAM substitution matrices with protein sequences. 

Then check the "Use full pairwise" checkbox and after that click the "Alignments" button to find the genes from both genomes from the positions with best alignment.


The Identity and Coverage text boxes can be used to set the stringency of the matches. (Coverage is only used when you selected Fast Local Alignment in the previous selection instead of Fast Global Alignment)

The Alignments button is usually used when you have  copied several Comparison entries against each other (when you have unchecked Use Layers) or when you want to Align the currently selected Comparison entries against the genes in the currently displayed genome sequence (when you have checked Use Layers). This normally does an ALL to ALL comparison between them all to find the best alignments across the entire sequence's annotated genesThe Use full pairwise option attempts to speed up this search by only searching within the Minimap2 alignment output where we have already determined that there is a high degree of similarity.

You will get the following popup:


The search process will take several minutes, but still much quicker than if you had to do a full brute force search between all of the annotated genes of the Tasmanian Devil and all of the comparisons loaded from the Thylacine.

The result is a list of 1240 genes from the Tasmanian Devil with high similarity with genes in the Thylacine.  The output is displayed in the Gene Search Results tree.

You can zoom the tree by holding down the CTRL key and rolling the mouse wheel.  The results are ordered in descending order from highest score to lowest score. (The score is a measure of both the alignment length and the identity %)


When you double click on the main tree node, it will position the Zoom DNA view at the position of the gene in the current Tasmanian Devil's genome. 


You can also right click on the entries to get various menu options:


And most importantly, when you double-click on the entry containing the =>98.4%, the Browser will take you to a pairwise alignment between the 2 protein sequences in the Comparisons Tab.
This is the output of the fast method.



This is now where you can select one of the slower alignment methods such as Blosum90Global which will relax the matching criteria to take into account that amino acids can normally be substituted in nature for others and still result in a similar folding pattern in the protein.

The BLOSUM90 matrix assigns scores to amino acid substitutions based on their observed frequencies in related protein sequences. It provides higher scores for more similar amino acids and lower scores for less similar ones. Substitution matrices come in various forms, with the most common ones being BLOSUM (BLOcks SUbstitution Matrix) and PAM (Point Accepted Mutation) matrices.

When you want to re-run the alignment, you can click on the Edit Distance button or simply double click on the =>98.4% node again.

Now you get a colour coded pairwise alignment where Hydrophobic amino acids are pink, Polar (Uncharged) ones are blue, Polar (+) ones are a Cyan colour and Polar (-) ones are green. It will sometimes also show the start of the Exons in Red. It helps you to see when similar amino acids were substituted for others.


Another quick way to get an alignment is to simply start typing the gene name in the Gene Search box and the Genome Browser will match the corresponding genes in both genomes.




You can now just press CTRL + ENTER to do the same alignment with different genes.

After you have navigated to any gene position you can instantly overlay their sequences by clicking on the 'A' button, which will use the CIGAR strings in the Minimap2 alignment to position the genomes so that the overlay in the Zoom DNA view.


When you now switch to the "Align" view, you can see the DNA regions on the corresponding chromosomes aligned in the Zoom DNA view.


Getting back to the DOT Plot images for the Minimap2 full pairwise alignments, they can be found in the following temporary folder for the Tasmanian devil in case you want to use them elsewhere:

Genomes\FastaTasmanianDevil\Temp_FastaTasmanianDevil\Minimap


 

When you need to create all of the DOT PLOT images for all of the likely matching sequences, you can use a batch mode which will take some time but will eventually pre-calculate them all for you.



This concludes what I want to explain on how to use the Visual Genome Browser to compare the genomes of the Tasmanian Tiger/Wolf with that of the Tasmanian Devil.

You can download the Visual Genome Browser at this link.

To read more about this go to this link about the University of Melbourne's project to de-extinct the Thylacine.