Manual

============================================================

Satsuma version 3.1.0 (June 2014)

Software for analysis of large genomic data sets

Satsuma copyright (c) Manfred Grabherr, Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Sweden

FFTReal copyright (c) Laurent de Soras

============================================================

Licensing

Spines is free software: you can redistribute it and/or modify it under the terms of the Lesser GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Lesser GNU General Public License for more details.

You should have received a copy of the Lesser GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

1. Contents

IMPORTANT: the executables provided with the package require the gcc 4.6.0 runtime libraries. For all other gcc versions, you need to cleanly re-compile all executables on your system via

> make clean
> make

2. Supported Platforms

Satsuma runs exclusively on 64-bit Linux and has been tested on the SUSE and Ubuntu distributions. Note: although not actively supported or tested, the code also compiles and runs on Mac OS X 10.4.11 (Intel) with gcc 4.0.1, when built with ‘make clean UNSUPPORTED=yes’ followed by ‘make UNSUPPORTED=yes’.

NOTE: the makefile system requires csh to be installed.

3. Modules

- Satsuma: high-sensitivity alignments through cross-correlation.

- SatsumaSynteny: Satsuma in a battleship-style search framework.

4. References and credits
For Satsuma and SatsumaSynteny, please reference:
Grabherr MG, Russell P, Meyer M, Mauceli E, Alfoldi J, Di Palma F, Lindblad-Toh K. Genome-wide synteny through highly sensitive sequence alignment: Satsuma. Bioinformatics. 2010 May 1;26(9):1145-51. Epub 2010 Mar 5.

5. Satsuma

Satsuma aligns two fasta sequences exhaustively. For a small example, see the script ./test_Satsuma which runs on small sequences provided with the distribution for testing purposes.

Command line arguments (and defaults):

-q<string> : query fasta sequence

-t<string> : target fasta sequence

-o<string> : output directory

-l<int> : minimum alignment length (def=0)

-t_chunk<int> : target chunk size (def=4096)

-q_chunk<int> : query chunk size (def=4096)

-n<int> : number of blocks (def=1)

-lsf<bool> : submit jobs to LSF (def=0)

-nosubmit<bool> : do not run jobs (def=0)

-nowait<bool> : do not wait for jobs (def=0)

-chain_only<bool> : only chain the matches (def=0)

-refine_only<bool> : only refine the matches (def=0)

-min_prob<double> : minimum probability to keep match (def=0.99999)

-proteins<bool> : align in protein space (def=0)

-cutoff<double> : signal cutoff (def=1.8)

-same_only<bool> : only align sequences that have the same name. (def=0)

-self<bool> : ignore self-matches. (def=0)

Note that Satsuma calls other executables (HomologyByXCorr, MergeXCorrMatches), and thus has to be invoked by either supplying the full path of the executable, or “./Satsuma” (see test_Satsuma).
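
The invocation rule above can be sketched in Python as follows. This is only an illustration: the helper name build_satsuma_command, the install directory and the file names are hypothetical, and the sketch assumes that each option value follows its flag as a separate argument (check test_Satsuma for the exact form used by the distribution).

```python
import os

def build_satsuma_command(install_dir, query, target, outdir, n_blocks=1):
    """Return an argument list that invokes Satsuma by its full path.

    Hypothetical helper: assumes option values follow their flags as
    separate arguments (e.g. "-q query.fasta").
    """
    # Satsuma calls other executables (HomologyByXCorr, MergeXCorrMatches),
    # so it is invoked here via an absolute path rather than a bare name.
    exe = os.path.join(os.path.abspath(install_dir), "Satsuma")
    return [exe, "-q", query, "-t", target, "-o", outdir, "-n", str(n_blocks)]

# Example with made-up paths; pass the list to subprocess.run() to execute it.
cmd = build_satsuma_command("/opt/satsuma", "query.fasta", "target.fasta", "out_dir", n_blocks=4)
print(" ".join(cmd))
```

Building the command as a list (rather than a single string) avoids shell quoting problems when file names contain spaces.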

Notes:

If the output directory is not empty, Satsuma will not overwrite any files but exit with an error message.
The option “-n” specifies the number of processes, each of which takes chunks of the target sequence of size -t_chunk * ¾. If there are more processes than the target sequence can be divided among, the number of processes is adjusted down.
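
To make the chunk arithmetic concrete, here is a small Python sketch. It is not Satsuma's actual code: the function name effective_processes is hypothetical, and the sketch assumes that the target is covered by ceil(length / (t_chunk * 3/4)) pieces and that the process count is simply capped at that number.

```python
import math

def effective_processes(target_length, t_chunk=4096, requested=1):
    """Estimate how many processes can be kept busy (assumed model)."""
    # Each process takes target pieces of size t_chunk * 3/4 (see the note above).
    step = t_chunk * 3 // 4
    # Assumption: the target is covered by ceil(length / step) such pieces.
    pieces = max(math.ceil(target_length / step), 1)
    # Assumption: the process count is capped at the number of pieces.
    return min(requested, pieces)

# A 10,000-base target with the default chunk size yields 4 pieces of 3072 bases,
# so a request for 8 processes would be reduced to 4 under this model.
print(effective_processes(10_000, requested=8))  # prints 4
```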

6. SatsumaSynteny

SatsumaSynteny aligns two fasta sequences syntenically, using a battleship-style search. For a small example, see the script ./test_SatsumaSynteny, which runs on sequences provided with the distribution for testing purposes.

Command line arguments (and defaults):

-q<string> : query fasta sequence

-t<string> : target fasta sequence

-o<string> : output directory

-l<int> : minimum alignment length (def=0)

-t_chunk<int> : target chunk size (def=4096)

-q_chunk<int> : query chunk size (def=4096)

-t_chunk_seed<int> : target chunk size (seed) (def=8192)

-q_chunk_seed<int> : query chunk size (seed) (def=8192)

-n<int> : number of blocks (def=1)

-ni<int> : number of initial search blocks (def=-1)

-lsf<bool> : submit jobs to LSF (def=0)

-nosubmit<bool> : do not run jobs (def=0)

-nowait<bool> : do not wait for jobs (def=0)

-chain_only<bool> : only chain the matches (def=0)

-refine_only<bool> : only refine the matches (def=0)

-min_prob<double> : minimum probability to keep match (def=0.99999)

-proteins<bool> : align in protein space (def=0)

-cutoff<double> : signal cutoff (def=1.8)

-cutoff_seed<double> : signal cutoff (seed) (def=3)

-m<int> : number of jobs per block (def=8)

-resume<string> : resumes w/ the output of a previous run (xcorr*data) (def=)

-seed<string> : loads seeds and runs from there (xcorr*data) (def=)

-pixel<int> : number of blocks per pixel (def=24)

-nofilter<bool> : do not pre-filter seeds (slower runtime) (def=0)

-dups<bool> : allow for duplications in the query sequence (def=0)

Note that SatsumaSynteny calls other executables (FilterGridSeeds, HomologyByXCorr, HomologyByXCorrSlave, MergeXCorrMatches), and thus has to be invoked by either supplying the full path of the executable, or “./SatsumaSynteny” (see test_SatsumaSynteny).

Notes:

If the output directory is not empty, SatsumaSynteny will not overwrite any files but exit with an error message.
Idling processes self-terminate after two minutes. The overall alignments will still complete, but using fewer processes.
If alignment runs locally but not on the server farm, check whether processes on the farm can communicate via TCP/IP.
Currently, each process loads the entire sequences into RAM. For comparisons of large genomes, we strongly recommend making sure that enough RAM is available to every process (approximately the size of both genomes in bytes).
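
The memory rule of thumb above can be turned into a quick estimate. The following Python sketch is only a back-of-the-envelope calculation: the function name estimate_ram_bytes and the genome sizes are made up for illustration, and it assumes that processes running on the same machine do not share the loaded sequences.

```python
def estimate_ram_bytes(query_bases, target_bases, n_processes):
    """Rough RAM estimate for processes running on one machine."""
    # Rule of thumb from the note above: each process needs roughly
    # the size of both genomes in bytes.
    per_process = query_bases + target_bases
    # Assumption: no memory is shared between processes on the same machine.
    return per_process * n_processes

# Hypothetical genome sizes (2.4 Gb and 3.1 Gb) with 8 processes on one machine.
total = estimate_ram_bytes(2_400_000_000, 3_100_000_000, 8)
print(f"{total / 1e9:.1f} GB")  # prints 44.0 GB
```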

Parameter choice, execution and data preparation

The default parameters should work well for most genomes.
SatsumaSynteny runs most efficiently on either multi-processor machines or on clusters that are tightly coupled (fast access to files shared by the control process and the slaves).
Especially for larger genomes, we recommend leaving one CPU dedicated to the control process SatsumaSynteny.
For larger genomes (>1.5 Gb), we recommend using one chromosome of one genome as the target sequence and the entire other genome as the query sequence, and processing alignments one query chromosome at a time. We tested this strategy successfully on a mammalian genome pair.
To include large-scale duplications in the query sequence (in addition to the target sequence), use the option -dups.
If using the option -nofilter, the number of initial searches (-ni) should be higher than the number of processes (-n) to ensure that subsequent processes have sufficient seeds. Note that initial searches will be queued to a number of processes specified by -n.
When many processes search a tight space, the number of pixels per CPU (-m) should be small (e.g. ‘-m 1’ as in the sample script/data set) to avoid unbalanced load (i.e. some processes get all the pixels while others are starved, since they overlap). However, a small value for -m increases inter-process communication, which should be a consideration when deploying hundreds of processes.
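
Two of the recommendations above lend themselves to a simple pre-flight check, sketched here in Python. The helper check_synteny_settings is hypothetical and not part of the distribution. It assumes a single multi-processor machine, and it only checks -ni when a value is given explicitly, since the meaning of the default (-1) is not described in this manual.

```python
def check_synteny_settings(cpu_count, n, ni=None, nofilter=False):
    """Return a list of warnings for a planned SatsumaSynteny run."""
    warnings = []
    # Recommendation: leave one CPU dedicated to the control process.
    if n > cpu_count - 1:
        suggested = max(cpu_count - 1, 1)
        warnings.append(f"-n {n} leaves no CPU for the control process; consider -n {suggested}")
    # Recommendation: with -nofilter, -ni should be higher than -n.
    if nofilter and ni is not None and ni <= n:
        warnings.append(f"-ni {ni} should be higher than -n {n} when using -nofilter")
    return warnings

# Example: 8 CPUs, 8 processes, 4 initial searches, pre-filtering disabled.
for message in check_synteny_settings(8, 8, ni=4, nofilter=True):
    print("WARNING:", message)
```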

7. Output files

Alignment coordinates:

<outdir>/satsuma_summary.out: all alignment coordinates (Satsuma only)

<outdir>/satsuma_summary.refined.out: final coordinates (Satsuma and SatsumaSynteny)

Contents:

Target sequence name (provided by fasta)

First target base

Last target base

Query sequence name (provided by fasta)

First query base

Last query base

Identity

Orientation

EXAMPLE:

chrX 5947 6164 chrX 9153 9360 0.626728 +

chrX 6270 6452 chrX 9472 9654 0.576923 +

Note: ‘space’ in fasta names is permissible for alignment, but all spaces will be replaced with “_” in the output files.
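
For downstream processing, the eight columns listed above can be read with a few lines of Python. This is a minimal sketch rather than an official parser: the names Match and parse_summary_line are hypothetical, and it assumes that columns are separated by whitespace (sequence names contain no spaces in the output, as noted above).

```python
from typing import NamedTuple

class Match(NamedTuple):
    target: str
    target_first: int
    target_last: int
    query: str
    query_first: int
    query_last: int
    identity: float
    orientation: str

def parse_summary_line(line):
    """Parse one line of a satsuma_summary file into a Match record."""
    # Assumption: columns are whitespace-separated, in the order listed above.
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, found {len(fields)}: {line!r}")
    return Match(fields[0], int(fields[1]), int(fields[2]),
                 fields[3], int(fields[4]), int(fields[5]),
                 float(fields[6]), fields[7])

# First line of the example above.
m = parse_summary_line("chrX 5947 6164 chrX 9153 9360 0.626728 +")
print(m.target, m.target_first, m.target_last, m.identity, m.orientation)
```

A record type with named fields keeps later filtering readable, for example selecting matches by identity or orientation.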

Other output:

<outdir>/MergeXCorrMatches.out: readable alignments (Satsuma only)

<outdir>/MergeXCorrMatches.refined.out: final readable alignments (Satsuma and SatsumaSynteny)

8. Visualization

Use ./MicroSyntenyPlot -i <satsuma_summary.txt> to create a postscript dot plot (color coded by target chromosomes).

Use ./ChromosomePaint to create a postscript file that colors chromosomes by color.

Chromosomes

Use ./BlockDisplaySatsuma to create a file that can be shown in the interactive multi-level synteny browser MizBee.