A very simple pipeline to perform a phylogenetic analysis of your gene of interest

Well, it’s been a while since I decided to change my blog from a personal diary to a one with science-related content. I realized that writing is not made for me. Still, I think I should not give up on writing things from time to time (and will actually make an effort to write more often).

I am by no means a phylogenetic analysis expert! But I am able to do my own small analyses. Until recently, a typical evo-devo study consisted of finding the orthologous gene of an important developmental gene of vertebrates (or let’s better say mmamals) in your laboratory model (name it amphioxus, lamprey, shark or your favourite pet), analyze its expression patterns and, more recently, its function if possible, and compare with its known properties in vertebrates. And thus the first question to resolve in this kind of studies was whether the gene that you have found is actually the orthologous of the gene of interest. Solving this requires building a phylogenetic tree with genes from different species as well as related families, and if orthologous, your gene should fall within the group formed by the family of interest. Of course, sometimes is not that simple, and different phenomena can affect the interpretation of a tree, such as hidden paralogy (see work by Shigehiro Kuraku: REF).

Since I’ve got asked several times how I do to make my small phylogenetic trees, I decided on making my  simple “protocol” available here. The first thing that you should do is to download a bunch of sequences of your gene of interest and related genes from different species and format them into a multifasta file. For example, if you want to know if your Hox gene is a Tal-1 gene, you should download genes from the family (Tal-1, Tal-2 and Lyl-1) as well as other bHLH genes, such as MyoD or NeuroD (that’s exactly what I did here).

After you have prepared your file, you can just follow the following steps: (NOTE: this step-by-step turorial is mostly based on the book “Phylogenetic Trees Made Easy”, by Barry G. Hall -which I highly recommend- and conversations with Jordi Paps).ñalsjdfañslkdfjañslfkdja

Multiple Sequence Alignment

I use MUSCLE (Edgar, 2004) as implemented in MEGA v6 (Tamura et al., 2013).

  1. In the main window of MEGA, Click Align → Edit/Build Alignment → Retrieve sequences from a file and open fasta file with the sequences to be aligned.
  2. Align them with MUSCLE, by codon (default parameters).
  3. Save mas
  4. Optional: Inspect, and edit if necessary, the alignment in aminoacidic sequences (for editing: select a position(s) and press ⌘ + arrows to move). This manual editing is actually not recommended, since it is very subjective to the perception that you might have of what a reliable aligned sequence is, and thus impel reproducibility
  5. Save alignment in edited.mas (always save a .mas file after editing).
  6. Save the nucleotide alignment in edited.meg
  7. Export alignment in edited.fas
  8. Save the aminoacidic alignment in edited.meg
  9. Export alignment in edited.fas. Then, replace stop codons (*), by a dash (-), for example using MS Word’s Replace option.

Eliminate duplicate sequences:

  1. In the main MEGA window File → Open A File/Session… and open the edited.meg
  2. Click Distances à Compute Pairwise Distances… In the Analysis Preferences window, select Model/Method à N of differences and press Compute. Check that there are no 0.00 values.

Trimming alignments

  1. To trim the alignment (discard spurious sequence alignments and increment the phylogenetic signal to noise ratio), I follow two strategies: Gblocks (Castresana 2000) tool to trim nucleotide alignments by codon, and trimAl (Capella-Gutierrez et al., 2009) for amino acids alignment, using automated1 option

a) Nucleotides

Gblocks namefile_nts_muscle.edited.fas \

-t=c \        # Type of sequence. p: protein; d: dna; c: codon.

-b2=<50%+1> \         # Minimum Number Of Sequences For A Flank Position

-b4=5 \       # Minimum Length Of A Block. Any integer ≥ 2

-b5=h         # Allowed Gap Positions. n: none; h: with half; a: all

 

NOTE: according to the Gblocks server (http://molevol.cmima.csic.es/castresana/Gblocks_server.html), a less stringent selection consists of: -b2 should be 50% of sequences + 1, -b4=5 and –b5=h.

b) Proteins

trimal -in namefile_prot_muscle.edited.fas \

-out namefile_prot_muscle.edited.trimal.fas \

-htmlout namefile_prot_muscle.edited.trimal.html -automated1

 

  1. Format trimmed fasta alignments into nexus file for MrBayes, using readAl, from the trimAl package

 

readal -in namefile_nts_muscle.edited.fas-gb –out namefile_nts_muscle.edited.gb.nxs -nexus

 

PHYLOGENETIC TREE CONSTRUCTION

Nucleotides

 

[begin trees;]

[Newick tree file: “Hox4_nts_withLamprey_muscle.edited.gb.nwk”, obtained using ML in MEGA6 without bootstrapping]

[tree user = (tree);]

 

[end;]

 

begin mrbayes;

log start replace filename = Hox1_nts_withLamprey_muscle.gb.bay.log;

set autoclose = no nowarn=no;

charset 1st_pos = 1-.\3;

charset 2nd_pos = 2-.\3;

charset 3rd_pos = 3-.\3;

partition by_codon = 3:1st_pos,2nd_pos,3rd_pos;

set partition = by_codon;

lset applyto = (all) nst=6 rates = invgamma;

prset applyto = (all); [For JC add option statefreqpr = fixed(equal]

unlink revmat=(all) shape=(all) pinvar=(all) statefreq=(all) tratio= (all);

showmodel;

[startvals tau = user V = user;]

mcmc ngen=4000000 printfreq=1000 samplefreq=100 nchains=4 temp=0.2 checkfreq = 50000 diagnfreq = 1000;

sumt relburnin = yes burninfrac = 0.25 contype = halfcompat conformat = simple;

sump relburnin = yes burninfrac = 0.25;

log stop;

end;

 

Capella-Gutierrez S, Silla-Martinez JM, and Gabaldon T. 2009. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25:1972-1973.

Castresana, J. (2000). Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Molecular Biology and Evolution 17, 540-552.

Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792-1797.

Tamura K, Stecher G, Peterson D, Filipski A, and Kumar S (2013) MEGA6: Molecular Evolutionary Genetics Analysis Version 6.0. Molecular Biology and Evolution 30: 2725-2729.

 

 

Compiling and working with MPI version of MrBayes 3.2.2

A couple of days ago I was able, after solving some troubles (namely lack of dependencies…), to compile from source a multithreaded version of MrBayes in my Ubuntu server (an old 12-core Mac Pro running with Ubuntu 12.04 LTS, with 64 GB of RAM). That means that now I can run MrBayes with more than 1 CPU, with the subsequent gain in analysis speed. And I have just finished, in about 10 minutes, a phylogenetic analysis of Hox4 genes including lamprey and hagfish orthologues. A couple of months ago, the same analysis took almost 15 hours to finish in a Windows 7-based computer using one core. Here’s what I did to set it up:
(the parallel version of MrBayes 3.2.2 has a bug that produces almost completely poytomic trees; check the edit at the end of the post to solve this problem)

First, download the source files:

cd ~/Software       # I have a Software folder in my $HOME path where I keep all the Bioinformatics software
wget http://sourceforge.net/projects/mrbayes/files/mrbayes/3.2.2/mrbayes-3.2.2.tar.gz
tar zxvf mrbayes-3.2.2.tar.gz
cd mrbayes-3.2.2

Whithin the mrbayes-3.2.2 folder an ls command shows you the contents of the folder: documentation, examples and src subfolders. src stands for “source”, and is where all the instructions and other necessary objects to build the program are. But before compiling, I checked in the MrBayes manual the possibilities of configuration that the developers included in the package, where I found out about the option of configuring the MPI version, which can be compiled only under a Unix/Linux system (including Mac computers). But surprise! The chapter 5, where how to compile the MPI version should be explained is lacking from the manual, which is a draft version since 2011… time to update it I’d say. I found also an on-line version in the wiki page of the MrBayes project here, but that was last modified in 2010, before the pdf I linked to before. Anyway, I found a CompileInstructions.txt file within the src folder where the instructions about compiling the MPI version are, which I considered the most updated. There, you can see that we would need to install certain dependencies, among them, the autotools, the BEAGLE library and the some MPI compiler, such as OpenMPI (the one I actually installed). None of these dependencies were installed in my server.

A simple Google search shows you how to install the BEAGLE library, which I finally did as indicated here

sudo apt-get install libtool subversion # I did not have these packages installed, and we need them
cd ~/Downloads          # I usually get some temporal files into the Downloads folder. Once installed, I just delete them.
svn checkout http://beagle-lib.googlecode.com/svn/trunk/ beagle-lib
cd beagle-lib
./autogen.sh
./configure --prefix=$HOME
make install
cd ..
rm -rf beagle-lib       # Once is installed, we don't need the downloaded folder anymore

Edit: In Ubuntu, for some reason I had some problems compiling the BEAGLE library. I got an error that you can see reported also here by me. The workaround, if you see that, is to compile as follow: after ./configure --prefix=$HOME, type the following:

export HB_BUILD_CONTRIB_DYN="no"
export HB_USER_CFLAGS="-fPIC" 
make clean 
make install

We need to set the environment variable for LD_LIBRARY_PATH. Adding an environment variable to your account is easy. Just go to your home directory, and open either the .bashrc or .profile file, and add the following line at the end:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/lib

You need to source this file in order to set the new value for the variable

cd ~
. .profile         # or . .bashrc

And that’s it.

Next, I installed the OpenMPI compiler. Credit to Longhao Zou’s reply to this thread of a Google group

sudo apt-get install openmpi-bin openmpi-doc libopenmpi-dev

There was no problem installing OpenMPI. Great so far (actually, I had some problems with the Kernel’s graphics that precluded the installation of some of the dependencies, but that’s beyond the scope of this post).

So, I went for the MrBayes compilation per sé according to the abovementioned CompileInstructions.txt file.

cd ~/Software/mrbayes-3.2.2/src
autoconf

Ups! Command not found. I forgot to install autoconf. I got it doing:

sudo apt-get install autoconf

And there we go. We are now finally ready to compile MrBayes. We want to compile the MPI version, so we use the --enable-mpi option:

# Be sure that you are in ~/Software/mrbayes-3.2.2/src
autoconf     
./configure --enable-mpi=yes
make

And now you have already compiled MrBayes. You can type an ls -l command and check that now you have an executable file called mb. What I usually do, when not using make install, is to create a bin folder within the software’s parent folder or within my main Software folder at $HOME, and copy the executable there. Eventually, I add the path to the executable to my .profile file

cd ..       # We are leaving src folder, just one level up
mkdir bin
cp src/mb bin/

Add this line into your .profile:

export PATH=$PATH:~/Software/mrbayes-3.2.2/bin

And source it to take effect:

cd ~
. .profile

You could have also done sudo make install instead of just make, and MrBayes would have been installed in /usr/local/bin, so it would be accessible to other users of the server. But you can’t do that if you don’t have administrative permissions on the server. And even so, in my opinion is better to keep locally all things you install, that helps you copy all your stuff to another server if you need, and to keep control of what you have there. Plus, playing with root permissions is always dangerous…

Now, you are ready to run MrBayes, using either one thread (just typing mb on your command terminal), or all the ones you set. In my case, I usually do 2 runs with 4 chains each, and according the MrBayes wiki “the optimal number of processors is therefore equal to the total number of chains”. So, I will run MrBayes with 8 processors. To do so, you need to type

mpirun -np 8 mb       # Where np stands for number of processors, I guess...

And you should get something like this:


                            MrBayes v3.2.2 x64

                      (Bayesian Analysis of Phylogeny)

                             (Parallel version)
                         (8 processors available)

              Distributed under the GNU General Public License


               Type "help" or "help " for information
                     on the commands that are available.

                   Type "about" for authorship and general
                       information about the program.


MrBayes > 

Now, we are ready to use MrBayes at super-speed! How to use MrBayes is a completely different story, which I learned from my good friend Nacho Maeso and the book “Phylogenetic Trees Made Easy”, which I recommend for beginners (I’m still a beginner, actually 🙂 )

Enjoy your trees!!


Edited on October 4th, 2014.

The parallel version of MrBayes 3.2.2 had a bug precluding the resolution of the phylogenetic trees in comparison with the serial version. This was reported almost a year ago in the sourceforge.net page of MrBayes. Although there is reported in a case where the BEAGLE library was not installed, I had the same problem with the BEAGLE library. So, to solve this we need to install the development version of MrBayes.

To do that, delete the folder with the old installation of MrBayes

cd ~/Software/
rm -rf mrbayes-3.2.2

And download and install the dev version:

svn checkout svn://svn.code.sf.net/p/mrbayes/code/trunk mrbayes-code
cd mrbayes-code/src
autoconf
./configure --enable-mpi=yes
make

Now, you have compiled again an executable file called mb, and you can proceed as before to make a bin folder, copying the executable into it and adding the path to your .bashrc or .profile file. Then, when you run it as above, you should see the new version of MrBayes:


$ mpirun -np 8 mb


                        MrBayes v3.2.3-svn(r883) x64

                      (Bayesian Analysis of Phylogeny)

                             (Parallel version)
                         (8 processors available)

              Distributed under the GNU General Public License


               Type "help" or "help " for information
                     on the commands that are available.

                   Type "about" for authorship and general
                       information about the program.


MrBayes > 

.

Renewing the blog

Well, I think it’s time to resume my blog, don’t you think? It has been a while since I last opened it, and although recently I have been (lightly) thinking on several ways to continue the blog, somehow in a more or less professional way, I never got the sufficient will to do it. Today, as my thought-finished PCR was actually 24 min away to end, I just got the impulse to revise my blog and restart writing… and as you can see, in English. Continue reading to find out the reasons.

I’ve reached to the conclusion that I want to open my blog from my very personal experience of my life in Japan to a wider, more professionally-focused readership. I mean, science colleagues. That’s why I am changing from Spanish to English. This does not mean I won’t write in Spanish in the future, I will. I would like also to start writing posts in simple Japanese, as my language skills are significantly improving. I have been reading great blogs during the last year, and although most of them are written by bioinformatics scientists (meaning that you suppose them to be seated in front of the computer almost all the time, developing wonderful arteriosclerosis in their leg arteries and writing one post after another while the high-memory server or the cluster of clusters do the tough work for them), others are managed by palaeontologists, molecular biologists, etc., people who are not so often sitting down in front of a computer. So, why me, a mixture of a bad, illiterate, newbie bioinformatician and a not-so-bad-but-not-so-good molecular biologist, working in Evo-Devo (or so I think), cannot maintain a blog on what I do as a scientist, where I express my humble opinion about current topics and debates, try to help people letting the world know how I got to solve (most of the time simple) problems, and why not, also write about my life in Japan so my friends and others can follow my experiences here (as a Facebook-hater you can imagine I closed mine long ago…)? Well, I think there’s no reason not do it.

So, I have some ideas to format my blog into a more attractive scheme for everyone, scientist and non-scientist friends. First, I need a new title: a title that is original, fresh and attractive (any ideas?). Second, I will reorganise the blog, for example dividing the posts into Personal, Science and Opinion sections or something like that. And third, I need to advertise it somehow. A blog without readers is like a gene without polymerase and TFs working on it… It’s laying there, lost in a web of information for nothing (although I could give it to the ENCODE team and surely they’d find out something…). For those who do not know, I have a Twitter account and I found that Twitter is a wonderful means to stay updated on several issues, like new software releases for your assemblies and mapping, new publications on evolution, interesting opinions from lesser and greater scientists, and more importantly, it’s a great tool to meet people. I will share my new posts, and also comments and opinions that I get from you on the blog via Twitter. You can follow me @PascualAnaya (I know what you think, but I wanted people to remember me by my real name, the one I use in publications, and not Champi…).

And last but not least, I am drafting my own website. This as well will be professionally focused. I will upload my CV, write about my research topics and interests, and a list of publications (the few of them…). This is just an idea right now that I have to develop. I will anyway keep you updated. Just stay connected to my blog 🙂

That’s all for now! I hope you’re looking forward to the next post!