S-MIG++ is a sampling-based, memory- and runtime-efficient algorithm for whole-genome recognition of haplotype blocks based on linkage disequilibrium (LD). It uses the haplotype block definition proposed by Gabriel et al. (2002), which is the most commonly used definition and is implemented in software such as Haploview (Barrett et al. 2005) and PLINK (Purcell et al. 2007).
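As an illustration of the block definition, the sketch below classifies SNP pairs from confidence bounds on |D'| and tests whether a region qualifies as a block. It is not taken from the S-MIG++ source: the thresholds are the values commonly quoted for the Gabriel et al. definition (and used as Haploview defaults), the handling of values exactly at a threshold is an assumption, and all function names are hypothetical. Computing the confidence bounds themselves is out of scope here.

```python
from typing import List, Tuple

# Thresholds commonly quoted for the Gabriel et al. (2002) definition.
# They are assumptions in this sketch, not values read from S-MIG++.
STRONG_LD_LOWER = 0.70      # minimum lower confidence bound on |D'|
STRONG_LD_UPPER = 0.98      # minimum upper confidence bound on |D'|
RECOMB_UPPER = 0.90         # upper bound below this suggests recombination
MIN_STRONG_FRACTION = 0.95  # required share of strong-LD pairs in a block


def classify_pair(ci_lower: float, ci_upper: float) -> str:
    """Label one SNP pair from the confidence bounds on |D'|."""
    if ci_lower >= STRONG_LD_LOWER and ci_upper >= STRONG_LD_UPPER:
        return "strong_ld"
    if ci_upper < RECOMB_UPPER:
        return "recombination"
    return "uninformative"


def is_block(pair_cis: List[Tuple[float, float]]) -> bool:
    """Check the block criterion over all SNP pairs of a candidate region."""
    labels = [classify_pair(lo, hi) for lo, hi in pair_cis]
    strong = labels.count("strong_ld")
    informative = strong + labels.count("recombination")
    # Uninformative pairs are ignored; a region without informative
    # pairs is not called a block in this sketch.
    return informative > 0 and strong / informative >= MIN_STRONG_FRACTION
```

For example, a region with 20 strong-LD pairs and one pair showing evidence of recombination passes the test (20/21 is above 0.95), while 18 strong-LD pairs and two recombination pairs do not.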
The S-MIG++ algorithm is significantly faster than its predecessor MIG++, which was implemented in the LDExplorer R package. It is specifically designed to process huge datasets with millions of SNPs and/or thousands of samples.
The integrated support for distributed computations in S-MIG++ makes this algorithm especially scalable.
The runtime efficiency and scalability of S-MIG++ are achieved with a two-step approach:
- sample a small proportion of SNP pairs within a chromosome and estimate upper limits for haplotype block boundaries (Figure 1);
- refine the exact haplotype block boundaries within their estimated upper limits (Figure 2).
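The first step can be pictured as drawing a random subset of SNP pairs. The sketch below assumes uniform sampling without replacement over all index pairs; the actual sampling scheme used by S-MIG++ is not described on this page, and the function name and parameters are hypothetical. It enumerates every pair, so it is only suitable for small toy inputs.

```python
import random
from itertools import combinations
from typing import List, Tuple


def sample_snp_pairs(n_snps: int, fraction: float,
                     seed: int = 0) -> List[Tuple[int, int]]:
    """Return a uniformly random subset of SNP index pairs (i, j), i < j."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Enumerate all pairs; fine for a toy example, infeasible for
    # chromosomes with hundreds of thousands of SNPs.
    all_pairs = list(combinations(range(n_snps), 2))
    if not all_pairs:
        return []
    k = max(1, round(fraction * len(all_pairs)))
    rng = random.Random(seed)  # fixed seed for reproducible output
    return sorted(rng.sample(all_pairs, k))
```

Calling `sample_snp_pairs(101, 0.02)` selects 101 of the 5050 possible pairs. LD would then be computed only for the selected pairs, and the second step would examine pairs only inside the estimated boundaries.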
Figure 1. Sampled SNP pairs and estimated haplotype block boundaries (gray line). Red, green, and blue colors reflect strong, moderate, and low LD between SNPs, respectively.
Figure 2. Refined exact haplotype block boundaries (black line). Red, green, and blue colors reflect strong, moderate, and low LD between SNPs, respectively.
Our experiments showed that it is sufficient to sample only 1%-5% of all SNP pairs within a chromosome. The probability of an estimation error is proven to be no greater than 0.01 and in practice is very close to 0.
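To give a sense of scale, the short calculation below counts the SNP pairs on a chromosome and the size of a 1% and a 5% sample. The figure of 100,000 SNPs is purely illustrative and not taken from the experiments, and treating all n(n-1)/2 pairs as candidates is an assumption that gives an upper bound.

```python
def count_snp_pairs(n_snps: int) -> int:
    """Number of unordered SNP pairs among n_snps SNPs: n(n-1)/2."""
    return n_snps * (n_snps - 1) // 2


n_snps = 100_000                    # illustrative chromosome size (assumption)
total = count_snp_pairs(n_snps)     # 4,999,950,000 pairs
sample_low = total // 100           # 1% sample: 49,999,500 pairs
sample_high = total * 5 // 100      # 5% sample: 249,997,500 pairs

print(f"all pairs: {total:,}")
print(f"1% sample: {sample_low:,}")
print(f"5% sample: {sample_high:,}")
```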
The source code of the S-MIG++ algorithm is available below:
To compile the S-MIG++ algorithm:
1) decompress SMIGPP_X.Y.Z.tar.gz (or SMIGPP_X.Y.Z_MPI.tar.gz);
2) execute the make command.
The software accepts both HapMap II and VCF format files with phased genotypes.
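To show what "phased genotypes" means for VCF input, the sketch below splits the genotype fields of one VCF data line into two haplotype allele lists. It is not the S-MIG++ parser: it assumes diploid samples, no missing alleles, and a FORMAT column whose first key is GT, and the function names are hypothetical. In VCF, `|` separates phased alleles and `/` separates unphased ones.

```python
from typing import List, Tuple


def parse_phased_gt(sample_field: str) -> Tuple[int, int]:
    """Parse a phased diploid genotype such as '0|1' or '0|1:35'."""
    gt = sample_field.split(":", 1)[0]  # GT is the first FORMAT key
    if "|" not in gt:
        raise ValueError(f"genotype is not phased: {gt!r}")
    first, second = gt.split("|")       # fails if not diploid
    return int(first), int(second)      # fails on a missing allele ('.')


def haplotypes_from_vcf_line(line: str) -> Tuple[List[int], List[int]]:
    """Return the alleles on the first and second haplotype of each sample."""
    fields = line.rstrip("\n").split("\t")
    first_hap, second_hap = [], []
    # Columns 0-8 are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT;
    # sample columns start at index 9.
    for sample_field in fields[9:]:
        a, b = parse_phased_gt(sample_field)
        first_hap.append(a)
        second_hap.append(b)
    return first_hap, second_hap
```

A line with the sample genotypes `0|1`, `1|1`, and `0|0` yields the allele lists `[0, 1, 0]` and `[1, 1, 0]`; an unphased genotype such as `0/1` is rejected with a `ValueError`.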
The detailed description of the command line arguments and output format can be obtained by executing:
The requirements for compiling and using S-MIG++ are listed below:
- Linux operating system.
- C++ compiler with C++11 support, preferably GNU Compiler Collection (GCC) version 4.9.1 or higher.
- GNU Scientific Library (GSL).
- zlib compression library.
- Open MPI (for distributed computations only).
Copyright © 2014 by Daniel Taliun, Johann Gamper and Cristian Pattaro. All rights reserved.
S-MIG++ is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
S-MIG++ is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with S-MIG++. If not, see http://www.gnu.org/licenses/.
For questions, comments, or any other help regarding S-MIG++, please contact us by email at firstname.lastname@example.org.