Home:ALL Converter>Adjusting the MAFFT command line algorithm to better account for gaps

Adjusting the MAFFT command line algorithm to better account for gaps

Ask Time:2017-07-27T01:00:32         Author:Ghoti

Json Formatter

I've been attempting to use the MAFFT command line tool as a means to identify coding regions within a genome. My general process is to align the amino acid consensus sequence of a gene to a translated reading frame of a target sequence. My method has been largely successful. However, I've noticed some peculiar alignments which will unfortunately impede my annotation method. The following is one such example (Note - I've also included a pairwise alignment from the Pairwise2 Biopython module to demonstrate my desired output. Unfortunately, the computation time for Pairwise2 is nearly 20 times slower than MAFFT command line):

from time import *
from Bio.SubsMat import MatrixInfo as matlist
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
from Bio.Align.Applications import MafftCommandline

startTime = time()

sample_tList = [['>Frame 1', 'RIGVGSIPRHLYCQELPLAQPKTCCAETPFRDSPLQGRLGVCPHLASGVALLYGLSTPLTMSGILDRCTCTPNARVFMAEGQVYCTRCLSARSLLPLNLQVPELGVLGLFYRPEEPLRWTLPRAFPTVECSPAGACWLSAIFPIARMTSGNLNFQQRMVRVAAEIYRAGQLTPAVLKVLQVYERGCRWYPIVGPVPGVGVYANSLHVSDKPFPGATHVLTNLPLPQRPKPEDFCPFECAMADVYDIGHGAVMFVAGGKVSWAPRGGDEVRFETVPEELKLIANRLHISFPPHHLVDMSKFAFIVPGSGVSLRVEHQHGCLPADIVPKGNCWWCLFDLLPPGVQNREIRYANQFGYQTKHGVSGKYLQRRLQINGLRAVTDTHGPIVVQYFSVKESWIRHFRLAGEPSLPGFEDLLRIRVESNTSPLADKDEKIFRFGSHKWYGAGKRARKARSGATTTVAHRASSARETRQAKKHEGVDANNAAHLEHYSPPAEGNCGWHCISAIVNRMVNSNFETTLPERVRPSDDWATDEDFVNTIQILRLPAALDRNGACKSAKYVLKLEGEHWTVSVAPGMSPSLLPLECVQGCCEHKGGLGSPDAVEVSGFDPTCLDRLAEVMHLPSSVIPAALAEMSNNSDRPASLVNTAWTVSQFYARHTGGNHRDQVRLGKIISLCQVIEECCCHQNKTNRATPEEVAAKIDQYLRGATSLEECLIKLERVSPPSAADTSFDWNVVLPGVEAAGPTTEQPHANQCCAPVPVVTQEPLDKDSVPLTAFSLSNCYYPAQGDEVRHRERLNSVLSKLEEVVLEEYGLMPTGLGPRPVLPSGLDELKDQMEEDLLKLANAQATSEMMALAAEQVDLKAWVKSYPRWIPPPPPPKVQPRRMKPVKSLPENKPVPAPRRKVRSDPGKSILAVGGPLNFSTPSELVTPLGEPVLMPASQHVSRPVTPLSEPAPVPAPRRIVSRPMTPLSEPTFVFAPWRKSQQVEEANPAAATLTCQDEPLDLSASSQTEYEAYPLAPLENIGVLEAGGQEAEEVLSGISDILDNTNPAPVSSSSSLSSVKITRPKYSAQAIIDSGGPCSGHLQKEKEACLRIMREACDAARLGDPATQEWLSHMWDRVDVLTWRNTSVYQAFRTLDGRFGFLPKMILETPPPYPCGFVMLPHTPTPSVSAESDLTIGSVATEDVPRILGKTENTGNVLNQKPLALFEEEPVCDQPAKDSRTLSRESGDSTTAPPVGTGGAGLPTDLPPLDGVDADGGGLLRTAKGKAERFFDQLSRQVFNIVSHLPVFFSHLFKSDSGYSPGDWGFAAFTLFCLFLCYSYPFFGFAPLLGVFSGSSRRVRMGVFGCWLAFAVGLFKPVSDPVGAACEFDSPECRNILHSFELLKPWDPVRSLVVGPVGLGLAILGRLLGGARYIWHFLLRLGIVADCILAGAYVLSQGRCKKCWGSCIRTAPNEIAFNVFPFTRATRSSLIDLCDRFCAPKGMDPIFLATGWRGCWTGQSPIEQPSEKPIAFAQLDEKRITARTVVSQPYDPNQAVKCLRVLQAGGAMVAEAVPKVVKVSAIPFRAPFFPTGVKVDPECRIVVDPDTFTTALRSGYSTTNLVLGVGDFAQLNGLKIRQISKPSGGGPHLIAALHVACSMVLHMLAGVYVTAVGSCGTGTSDPWCANPFAVPGYGPGSLCTSRLCISQHGLTLPLTALVAGFGLQEIALVVLIFVSIGGMAHRLSCKADMLCILLAIASYVWVPLTWLLCVFPCWLRWFSLHPLTILWLVFFLISVNMPSGILAVVLLVSLWLLGRYTNIAGLVTPYDIHHYTSGPRGVAALATAPDGTYLAAVRRAALTGRTMLFTPSQLGSLLEGAFRTRKPSLNTVNVVGSSMGSGGVFTIDGRIKCVTAAHVLTGNSARVSGVGFNQMLDFDVKGDFAIADCPNWQGVAPKTQFCGDGWTGRAYWLTSSGVEPGVIGDGFAFCFTACGDSGSPVITEAGELVGVHTGSNKQGGGIVTRPSGQFCNVTPIKLSELSEFFAGPKVPLGDVKVGSHIIKDTSEVPSDLCALLAAKPELEGGLSTVQLLCVFFLLWRMMGHAWTPLVAVGFFILNEVLPAVLVRSVFSFGMFALSWLTPWSAQVLMIRLLTAALNRNRVSLIFYSLGAVTGFVADLATTQGHPLQAVMNLSTYAFLPRMMVVTSPVPAIACGVVHLLAIILYLFKYRCLHHVLVGDGAFSAAFFLRYFAEGKLREGVSQSCGMSHESLTGALAIKLSDEDLDFLTKWTDFKCFVSASNMRNAAGQFIEAAYAKALRIELAQLVQVDKVRGTLAKLEAFADTVAPQLSPGDIVVALGHTPVGSIFDLKVGSTKHTLQAIETRVLAGSKMTVARVVDPTPAPPPAPVPIPLPPKVLENGPNAWGGEDRLNKRKRRRMEAVGIFVMDGKKYQKFWDKNSGDVFYEEVHNSTDEWECLRAGDPADFDPETGIQCGHVTIEDKVYNVFTSPSGRRFLVPANPENRRIQWEAARLSVEQALGMMNVDGELTAKELEKLKRIIDKLQGLTKEQCLNCPPVAPAVVAAAWLLLRQRKNFTTGPSPDLTKWPVRLSRTRSSTTNIRLPNRLMVVLCSCAPLFLRLMSSPALMHLLSYLPATGRETLGLMARFGILRPRPPKRKSHLVRKYRLVTLGAVTHLKLVSLISCTLLGATLSGKEFYRIQGLETYLTEPPVTLEAQCMRLPASRPMLLRLMGVPSWPQPCPPVLSCMYRPFQRPSLIILILGLTALNSQSTVVRMLLGTSPNTICPPKALFCLEFFALCGSTCLPMWVSARPFIGLPLTLPRILWLEMGTDFQPRIFRASLKSTFCAHRLCEKTGKLLLLVPSRSSIVGRRRLGQYLALITLRWPTGQRVVLPRASKRHSTRPSPSEKTNLRNYILQFAGALKLILHPAIDPHLQLSAGSLPIFFMNSPVLKSIYRRTCLTAVTTYWLRSPARLREAACRLATRLPPCQTPFTAYMHSTWCSVTLKVVTLMAFCFCKTSSLRTCSRFNPSSIQTTSCCMPSLPPCQITTGGLNITLCVSKRTQRRQPQTRHHFVAGMGVSSLTVTGFLRPSPTIRQAMSLNTTPRRLQYLWTAVLVSMILSGLKSSWLVRSAPARTVTASQARRSSCPCGKNSGPIMKGRSPECAGTAEPRLRTPLPVASTSVLTTPISTSIVLSSGVATRRVLALVVSVNLPWEKAQVLWMRCNKSRISLRGLSCMWSRVSPLLTQVDTKLAADSPLGVASGETKLTCQTVIMPVPPCSPLVKRSTWSLSPPTCCAAGSSSVPPALGKHTGSSNRSRMVMSFTRQLTRPCLTLGLWGCAGSTSQRVRRCNSLPPLVPARGFASWPAVGVLVRIPFWTKQRIAITLMSGFLAKPPLPAEISNNSTRWVLTLIAMFLTSCLRPNRPSGDSDRISVMPSNQITGTNLCPWSTQPVPRWTNLSGMGKSSPPTTGTERTAPSLSTPVKVPHLMWLHCICPLKIHSTGNEPLLLSPGQDMQSSCMTHTGNCRACLIFLRKAHPSTSQCSVTSSSYIEITKNARLLRLAMEINSGLQTSALILSAPFVQIWKGRAPRSPKLHITWGSISHLIHSLLNSQQNSHPTGPWQPRTMKSGLIGWLPAFAPSINIAARALVQAIWWAPRCFAPQGLCHTTSQNLLGARLKCFLRQSSAPAELRIAGSTSMIGSEKLLSPSHMPSLATSKALPVGDVITSPPDTFRASFLRNQLRSGFLAPEKLQRQFAHQMCTSQILKRTSTQRPSPSAGKCWILEKSDWSGKTRRPIFNLKAAISPGINLQATPHTSEFLLILQCIWTPAWALPFATGGLLGPPIGELTSRSPLMITVPKSFCLVHTMVKCLQGTKFWRARSSRLTTQGTNTLGDLNRIQRICTSLLGMVRTGRIIMKRFGRARKGKFIRLLPPASFIFPRALSLNQLATEMKWGLCRASLTKLVNFLWMLSRNFWCPLLISSYFWPFCLASPSPAGWWSFASDWFAPRYSVRALPFTLSNYRRSYEAFLSQCQVDIPTWGVKHPLGILWHHKVSTLIDEMVSRRMYRIMEKAGQAAWKQVVSEATLSRISNLDVVAHFQHLAAIEAETYKYLASRLPMLHNLRMTGSNVTIVYNSTLNQVFAIFPTSGSRPRLHDSQQWLIAVHSSIFSSVVASCTLFVVLWLRIPMLRSVFGFRWLGAIFLLNSRITRCVRLASPGRPLLRSMNPVGLFGAGGMTDAVRTTMTNGSWFRLASAKATPVFTPGWRSCHSATRPSSIPRYLGGTVKFMLTSRTNSFAPSTTGRTPPCLAMTTFQPYFRPTTNIRSTAVIGFTNGCAPSFPLGWFMFRGFSGVRLQAMFQFKSFRHQDQHYRSIRLCCPPGHQLPVWRLAPSDGSQELSVPHGDRDTRVHHHHSQCHRELFTFFSPHAFLLPFLCFDEKGIQSGIWQCVRHRGCVCLYQLRPTCQGVHPTLLGSRSCATASFHDTDHEVGNRFSLSFCHPTGNLNVQVCWGNAPRAVTRNCFLCGVSCRSVLLCSSTPAATAALIFSFITRYVSMAQIGWQKDLTGQWRLLSFFLCLTLFPMEHSPPAIFLTRLVSLCPPPGSITGGMSVVSMRSVLWLRFASSLGLRRTACPGATLVLDTPTSFWTLRADSIVGGRPLLRKGVRLKSRVTSTSKELCLMVPWQPLPEFQRNNGVVSRRLLPHGSTKGAFGVFHYLYASDDICSKGKSRPTARASAPFDLPELCFYLRVHDIRALSEHKGRAHYGGSSCTSLGGVLSHRNLEIHHLQMPFVLARPQVHSGPCPPRRKCRGLSSDCGKPRICRPASRLHYGRHIGARVEKPRVGWQKSCTGSGKPCQICQITTASSKRERRGTASQSISCARCWVRSSPNKTSPEARDRGRKIIREARRSPIFLRLKKMSGTTSPLVSGNCVCRRSRLPLTRAPGHVPCQIQGGVTLWSLVCRRIILCASASQHHPQHDELAFFGHLGVMIGRMCGEWHLTLCLVTYSIRATVWGSLIGENHAAAIKKKKKKKK'], ['>ORF2_GP2', 'MKWGLCKASLTKLANFLWMLSRSFWCPLLISSYFWPFCLASQSPVGWWSFASDWFAPRYSVRALPFTLSNYRRSYEAFLSQCQVDIPTWGVKHPLGVLWHHKVSTLIDEMVSRRMYRIMEKAGQAAWKQVVSEATLSRISGLDVVAHFQHLAAIEAETCKYLASRLPMLHNLRLTGSNVTIVYNSTLDQVFAIFPTPGSRPKLHDFQQWLIAVHSSIFSSVAASCTLFVVLWLRIPMLRSVFGFRWLGATFLLNSW']]

ex_file = open("newTempFile112233.fasta", "w")

for items in sample_tList:
    ex_file.write(items[0] + "\n")
    ex_file.write(items[1] + "\n")

ex_file.close()

in_file = '.../msa_example.fasta'
mafft_exe = '/usr/local/bin/mafft'

mafft_cline = MafftCommandline(mafft_exe, input=in_file) #have to change file path
#mafft_cline = MafftCommandline(mafft_exe, input=in_file, localpair=True, lexp=-1.5, lop=0.5)

stdout, stderr = mafft_cline()
print(stdout)
test_align = AlignIO.read(io.StringIO(stdout), "fasta")
#print(test_align)

os.remove("newTempFile112233.fasta")

print('Total time = ' + str(time() - startTime))
startTime = time()

matrix = matlist.blosum62 

pWise_align = pairwise2.align.localds(sample_tList[0][1], sample_tList[1][1], matrix, -6, -1)
print(format_alignment(*pWise_align[0]))

print('Total time = ' + str(time() - startTime))

I've attempted to change the MAFFT command line alignment algorithm by referencing the help document (http://mafft.cbrc.jp/alignment/software/manual/manual.html). I don't get any error messages, but the alignment output does not change. I'm unsure what adjustments need to be made. I believe that by increasing the gap extension penalty (which is zero by default), the alignment will be improved. I haven't been able to find many documentation examples where custom variables are used when using MAFFT command line on this forum or through Google search. Help is much appreciated. For reference, documentation on the Pairwise2 alignment parameters can be found here: http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html

Author:Ghoti,eproduced under the CC 4.0 BY-SA copyright license with a link to the original source and this disclaimer.
Link to original article:https://stackoverflow.com/questions/45333216/adjusting-the-mafft-command-line-algorithm-to-better-account-for-gaps
yy