MSA-PAD: High accuracy DNA Multiple Sequence Alignments Guided by Protein Domains

Multiple sequence alignment (MSA) is one of the critical challenges of bioinformatics, as it is a key step in sequence analysis applications such as phylogeny inference, population genetics and comparative genomics. These applications are of paramount importance in molecular biodiversity studies, mainly because their results are directly linked to ecosystem analysis and monitoring. Understanding population structures based on genetic markers helps establish clear correlations between populations and their surrounding environment or habitat. In addition, the phylogenetic relationships among individuals of the same population, or between different populations, give a relevant indication of species interactions within a given environment, and of their history and ability to adapt.

In this context, reaching appropriate conclusions from the above-mentioned applications requires a highly accurate multiple sequence alignment. This can be achieved by exploiting the protein domain information embedded in the respective DNA coding sequences. MSA-PAD uses either a public protein resource (PFAM) or user-defined protein domains to guide the construction of DNA multiple sequence alignments. Moreover, MSA-PAD has two alignment strategies: (i) Gene and (ii) Genome. Gene mode respects the order of domains from 5′ to 3′ and resolves the alignment of repetitive domains, even when they are in tandem. Genome mode provides a supergene-like alignment that ignores the domain order constraint, thereby accounting for genomic rearrangements (e.g. in mitochondrial or viral genomes). MSA-PAD is freely available as a web application running in a cloud environment at the ReCaS data center (https://recasgateway.cloud.ba.infn.it) and through the LifeWatch Catalogue of Services.
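
The general idea of letting a protein-level alignment guide a DNA alignment can be illustrated with a minimal Python sketch. It is not MSA-PAD's code: the function name thread_codons, the toy sequences and the assumption that each coding sequence is in frame and exactly three times the length of its ungapped protein row are all hypothetical, chosen only for illustration. Each residue in an aligned protein row is replaced by its codon, and each protein gap by a three-nucleotide gap.

```python
def thread_codons(aligned_protein: str, coding_dna: str, gap: str = "-") -> str:
    """Map an ungapped, in-frame coding sequence onto a gapped protein row.

    Hypothetical helper for illustration; not part of MSA-PAD.
    """
    residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(coding_dna) != 3 * residues:
        raise ValueError("coding DNA must be exactly 3x the number of residues")
    # Split the coding sequence into consecutive codons.
    codons = (coding_dna[i:i + 3] for i in range(0, len(coding_dna), 3))
    # A protein gap becomes a codon-sized gap; a residue consumes the next codon.
    return "".join(gap * 3 if aa == gap else next(codons) for aa in aligned_protein)


# Toy input: two protein rows already aligned (assumed), plus their coding DNA.
protein_rows = {"seq1": "MK-L", "seq2": "MKAL"}
coding_dna = {"seq1": "ATGAAACTG", "seq2": "ATGAAGGCTCTG"}

dna_alignment = {
    name: thread_codons(row, coding_dna[name]) for name, row in protein_rows.items()
}
for name, row in dna_alignment.items():
    print(name, row)
```

Because gaps are always inserted in multiples of three, the resulting DNA alignment preserves the reading frame, which is the main benefit of aligning coding sequences at the protein level first.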

Citation: Balech, B., et al. (2015) MSA-PAD: DNA multiple sequence alignment framework based on PFAM accessed domain information, Bioinformatics, 31, 2571-2573