============================================================
Satsuma version 3.1.0 (June 2014)
Software for analysis of large genomic data sets
Satsuma copyright (c) Manfred Grabherr, Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Sweden
FFTReal copyright (c) Laurent de Soras
============================================================
Licensing
Satsuma is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.
1. Contents
IMPORTANT: the executables provided with the package require the gcc 4.6.0 runtime libraries. For all other gcc versions, you need to cleanly re-compile all executables on your system via
> make clean
> make
2. Supported Platforms
Satsuma runs exclusively on 64-bit Linux and has been tested on the SUSE and Ubuntu distributions. While not actively supported or tested, the code also compiles and runs on Mac OS X 10.4.11 (Intel) with gcc 4.0.1 when built with ‘make clean UNSUPPORTED=yes’ followed by ‘make UNSUPPORTED=yes’.
NOTE: the make file system requires csh to be installed.
3. Modules
- Satsuma: high-sensitivity alignments through cross-correlation.
- SatsumaSynteny: Satsuma in a battleship-style search framework.
4. References and credits
For Satsuma and SatsumaSynteny, please reference:
Grabherr MG, Russell P, Meyer M, Mauceli E, Alfoldi J, Di Palma F, Lindblad-Toh K. Genome-wide synteny through highly sensitive sequence alignment: Satsuma. Bioinformatics. 2010 May 1;26(9):1145-51. Epub 2010 Mar 5.
5. Satsuma
Satsuma aligns two fasta sequences exhaustively. For a small example, see the script ./test_Satsuma, which runs on small sequences provided with the distribution for testing purposes.
Command line arguments (and defaults):
-q<string> : query fasta sequence
-t<string> : target fasta sequence
-o<string> : output directory
-l<int> : minimum alignment length (def=0)
-t_chunk<int> : target chunk size (def=4096)
-q_chunk<int> : query chunk size (def=4096)
-n<int> : number of blocks (def=1)
-lsf<bool> : submit jobs to LSF (def=0)
-nosubmit<bool> : do not run jobs (def=0)
-nowait<bool> : do not wait for jobs (def=0)
-chain_only<bool> : only chain the matches (def=0)
-refine_only<bool> : only refine the matches (def=0)
-min_prob<double> : minimum probability to keep match (def=0.99999)
-proteins<bool> : align in protein space (def=0)
-cutoff<double> : signal cutoff (def=1.8)
-same_only<bool> : only align sequences that have the same name. (def=0)
-self<bool> : ignore self-matches. (def=0)
Note that Satsuma calls other executables (HomologyByXCorr, MergeXCorrMatches), and thus has to be invoked by either supplying the full path of the executable, or “./Satsuma” (see test_Satsuma).
Notes:
If the output directory is not empty, Satsuma will not overwrite any files but exit with an error message.
The option “-n” specifies the number of processes, each of which takes chunks of the target sequence of size -t_chunk * ¾. If there are more processes than the target sequence can supply chunks for, the number of processes is adjusted down.
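The invocation rules above can be sketched as a small Python wrapper. This is only an illustration, not part of the distribution: the helper names (build_satsuma_command, output_dir_is_usable, run_satsuma) are hypothetical, and the sketch assumes that each option is passed as a separate flag and value, as in ‘-n 4’.

```python
import os
import subprocess


def build_satsuma_command(satsuma_path, query, target, outdir, n_processes=1):
    """Assemble a Satsuma command line from the options listed above."""
    # Satsuma calls other executables, so it is invoked via its full path.
    exe = os.path.abspath(satsuma_path)
    return [exe, "-q", query, "-t", target, "-o", outdir, "-n", str(n_processes)]


def output_dir_is_usable(outdir):
    """Return True if outdir is not an existing, non-empty directory."""
    return not os.path.isdir(outdir) or not os.listdir(outdir)


def run_satsuma(satsuma_path, query, target, outdir, n_processes=1):
    """Hypothetical wrapper: check the output directory, then start Satsuma."""
    if not output_dir_is_usable(outdir):
        # Satsuma itself would exit with an error message here; fail early instead.
        raise RuntimeError("output directory is not empty: " + outdir)
    cmd = build_satsuma_command(satsuma_path, query, target, outdir, n_processes)
    return subprocess.run(cmd, check=True)
```

Checking the output directory before launching avoids queuing a run that Satsuma would refuse to start.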
6. SatsumaSynteny
SatsumaSynteny aligns two fasta sequences syntenically, using a battleship-style search. For a small example, see the script ./test_SatsumaSynteny, which runs on sequences provided with the distribution for testing purposes.
Command line arguments (and defaults):
-q<string> : query fasta sequence
-t<string> : target fasta sequence
-o<string> : output directory
-l<int> : minimum alignment length (def=0)
-t_chunk<int> : target chunk size (def=4096)
-q_chunk<int> : query chunk size (def=4096)
-t_chunk_seed<int> : target chunk size (seed) (def=8192)
-q_chunk_seed<int> : query chunk size (seed) (def=8192)
-n<int> : number of blocks (def=1)
-ni<int> : number of initial search blocks (def=-1)
-lsf<bool> : submit jobs to LSF (def=0)
-nosubmit<bool> : do not run jobs (def=0)
-nowait<bool> : do not wait for jobs (def=0)
-chain_only<bool> : only chain the matches (def=0)
-refine_only<bool> : only refine the matches (def=0)
-min_prob<double> : minimum probability to keep match (def=0.99999)
-proteins<bool> : align in protein space (def=0)
-cutoff<double> : signal cutoff (def=1.8)
-cutoff<double> : signal cutoff (seed) (def=3)
-m<int> : number of jobs per block (def=8)
-resume<string> : resumes w/ the output of a previous run (xcorr*data) (def=)
-seed<string> : loads seeds and runs from there (xcorr*data) (def=)
-pixel<int> : number of blocks per pixel (def=24)
-nofilter<bool> : do not pre-filter seeds (slower runtime) (def=0)
-dups<bool> : allow for duplications in the query sequence (def=0)
Note that SatsumaSynteny calls other executables (FilterGridSeeds, HomologyByXCorr, HomologyByXCorrSlave, MergeXCorrMatches), and thus has to be invoked by either supplying the full path of the executable, or “./SatsumaSynteny” (see test_SatsumaSynteny).
Notes:
If the output directory is not empty, SatsumaSynteny will not overwrite any files but exit with an error message.
Idling processes self-terminate after two minutes. The overall alignments will still complete, but using fewer processes.
If alignment runs locally but not on the server farm, check whether processes on the farm can communicate via TCP/IP.
Currently, each process loads the entire sequences into RAM. For comparisons of large genomes, we strongly recommend making sure that each CPU has enough RAM available (roughly the size of both genomes in bytes).
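As a rough planning aid, the memory rule above can be turned into a quick estimate. The sketch below is an assumption-laden approximation: it uses the fasta file sizes as a stand-in for genome size in bytes (headers and line breaks make this a slight overestimate), and it assumes that processes on the same machine do not share the loaded sequences. The function name estimate_ram_bytes is hypothetical.

```python
import os


def estimate_ram_bytes(query_fasta, target_fasta, n_processes):
    """Return (bytes per process, bytes for all processes) as a rough estimate."""
    # Each process loads both sequences, so one process needs about the
    # combined size of the two fasta files.
    per_process = os.path.getsize(query_fasta) + os.path.getsize(target_fasta)
    # Assumption: nothing is shared between processes on the same machine.
    return per_process, per_process * n_processes
```

Comparing the second value against the RAM of a node gives a first indication of how many processes (-n) that node can host.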
The default parameters should work well for most genomes.
SatsumaSynteny runs most efficiently on either multi-processor machines or on clusters that are tightly coupled (fast access to files shared by the control process and the slaves).
Especially for larger genomes, we recommend leaving one CPU dedicated to the control process SatsumaSynteny.
For larger genomes (>1.5 Gb), we recommend using one chromosome of one genome as the target sequence and the entire other genome as the query sequence, and processing the alignments one chromosome at a time. We tested this strategy successfully on a mammalian genome pair.
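One way to prepare per-chromosome input files for this strategy is to split a multi-sequence fasta file into one file per sequence. The following Python sketch is not part of the distribution; the function name split_fasta and the output naming scheme (<name>.fasta) are assumptions chosen for illustration.

```python
import os


def split_fasta(fasta_path, outdir):
    """Write each sequence of a fasta file to its own file; return the paths."""
    os.makedirs(outdir, exist_ok=True)
    written = []
    out = None
    with open(fasta_path) as fin:
        for line in fin:
            if line.startswith(">"):
                # A header line starts a new sequence: close the previous file.
                if out is not None:
                    out.close()
                fields = line[1:].split()
                name = fields[0] if fields else "unnamed"
                # Keep the name usable as a file name.
                safe = name.replace("/", "_")
                path = os.path.join(outdir, safe + ".fasta")
                out = open(path, "w")
                written.append(path)
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written
```

Each resulting file can then be passed in turn as the single-chromosome input. Note that sequences with identical names would overwrite each other in this sketch.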
To include large-scale duplications in the query sequence (in addition to the target sequence), use the option -dups.
If using the option -nofilter, the number of initial searches (-ni) should be higher than the number of processes (-n) to ensure that subsequent processes have sufficient seeds. Note that initial searches will be queued to a number of processes specified by -n.
When many processes search a tight space, the number of pixels per CPU (-m) should be small (e.g. ‘-m 1’ as in the sample script/data set) to avoid unbalanced load (i.e. some processes get all the pixels while others are starved, since they overlap). However, a small value for -m increases inter-process communication, which should be a consideration when deploying hundreds of processes.
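The -nofilter recommendation above lends itself to a simple pre-flight check before submitting a large run. The Python sketch below is a hypothetical helper, not something SatsumaSynteny provides. It only checks explicitly set values of -ni, because the meaning of the default (-1) is not described here.

```python
def check_synteny_options(n, ni=-1, nofilter=False):
    """Return a list of notes about option combinations discouraged above."""
    notes = []
    # With -nofilter, -ni should be higher than -n so that subsequent
    # processes have sufficient seeds. The default ni=-1 is left alone.
    if nofilter and 0 < ni <= n:
        notes.append(
            "-nofilter is set: -ni (%d) should be higher than -n (%d)" % (ni, n)
        )
    return notes
```

For example, check_synteny_options(8, ni=4, nofilter=True) returns one note, while the same call with ni=16 returns an empty list.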
7. Output files
Alignment coordinates:
<outdir>/satsuma_summary.out: all alignment coordinates (Satsuma only)
<outdir>/satsuma_summary.refined.out: final coordinates (Satsuma and SatsumaSynteny)
Contents:
Target sequence name (provided by fasta)
First target base
Last target base
Query sequence name (provided by fasta)
First query base
Last query base
Identity
Orientation
EXAMPLE:
chrX 5947 6164 chrX 9153 9360 0.626728 +
chrX 6270 6452 chrX 9472 9654 0.576923 +
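For downstream processing, the eight columns listed above can be read into named fields. The following Python sketch assumes that the columns are separated by whitespace and that sequence names contain no whitespace. The names SatsumaMatch and parse_summary_line are hypothetical.

```python
from typing import NamedTuple


class SatsumaMatch(NamedTuple):
    target: str
    target_first: int
    target_last: int
    query: str
    query_first: int
    query_last: int
    identity: float
    orientation: str


def parse_summary_line(line):
    """Parse one line of a satsuma_summary file into a SatsumaMatch."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError("expected 8 columns, got %d" % len(fields))
    return SatsumaMatch(
        fields[0], int(fields[1]), int(fields[2]),
        fields[3], int(fields[4]), int(fields[5]),
        float(fields[6]), fields[7],
    )
```

Applied to the first example line above, this yields target "chrX", first target base 5947, last target base 6164, and orientation "+".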
Other output:
<outdir>/MergeXCorrMatches.out: readable alignments (Satsuma only)
<outdir>/MergeXCorrMatches.refined.out: final readable alignments (Satsuma and SatsumaSynteny)
Use ./MicroSyntenyPlot -i <satsuma_summary.txt> to create a postscript dot plot (color coded by target chromosomes).
Use ./ChromosomePaint to create a postscript file that colors chromosomes by color.