Well, it’s been a while since I decided to change my blog from a personal diary to one with science-related content. I have realized that writing does not come naturally to me. Still, I think I should not give up on writing things from time to time (and I will actually make an effort to write more often).
I am by no means a phylogenetic analysis expert! But I am able to do my own small analyses. Until recently, a typical evo-devo study consisted of finding the orthologue of an important vertebrate (or, better said, mammalian) developmental gene in your laboratory model (name it amphioxus, lamprey, shark or your favourite pet), analyzing its expression patterns and, more recently, its function if possible, and comparing these with its known properties in vertebrates. Thus the first question to resolve in this kind of study was whether the gene that you have found is actually the orthologue of the gene of interest. Solving this requires building a phylogenetic tree with genes from different species as well as from related families; if it is orthologous, your gene should fall within the group formed by the family of interest. Of course, sometimes it is not that simple, and different phenomena can affect the interpretation of a tree, such as hidden paralogy (see work by Shigehiro Kuraku: REF).
Since I’ve been asked several times how I make my small phylogenetic trees, I decided to make my simple “protocol” available here. The first thing you should do is download a bunch of sequences of your gene of interest and of related genes from different species, and format them into a multifasta file. For example, if you want to know if your Hox gene is a Tal-1 gene, you should download genes from the family (Tal-1, Tal-2 and Lyl-1) as well as other bHLH genes, such as MyoD or NeuroD (that’s exactly what I did here).
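If the downloaded sequences end up in several FASTA files, merging them by hand gets tedious. The following is a minimal Python sketch of one way to do it, not part of my original protocol: the function names `read_fasta` and `to_multifasta` are made up for this post, and it assumes plain FASTA text in which the first word of each header is the sequence name.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def to_multifasta(records, width=60):
    """Format records as multifasta text, refusing duplicated names."""
    seen = set()
    out = []
    for header, seq in records:
        words = header.split()
        name = words[0] if words else ""
        if name in seen:
            raise ValueError(f"duplicated sequence name: {name!r}")
        seen.add(name)
        out.append(f">{header}")
        # Wrap the sequence at a fixed line width.
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"
```

To use it, read each downloaded file as text, pass the contents through `read_fasta`, concatenate the resulting lists, and write the output of `to_multifasta` to a single file. Refusing duplicated names is a design choice: tree-building programs usually expect every sequence to have a unique name.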
Once your file is ready, follow the steps below. (NOTE: this step-by-step tutorial is mostly based on the book “Phylogenetic Trees Made Easy”, by Barry G. Hall, which I highly recommend, and on conversations with Jordi Paps.)
Multiple Sequence Alignment
I use MUSCLE (Edgar, 2004) as implemented in MEGA v6 (Tamura et al., 2013).
- In the main window of MEGA, click Align → Edit/Build Alignment → Retrieve sequences from a file, and open the FASTA file with the sequences to be aligned.
- Align them with MUSCLE, by codon (default parameters).
- Save the alignment as a .mas file.
- Optional: inspect, and edit if necessary, the alignment in the translated amino acid sequences (for editing: select one or more positions and press ⌘ + arrows to move them). This manual editing is actually not recommended, since it depends heavily on your own perception of what a reliably aligned sequence is, and thus impairs reproducibility.
- Save the alignment in edited.mas (always save a .mas file after editing).
- Save the nucleotide alignment in edited.meg
- Export the alignment in edited.fas
- Save the amino acid alignment in edited.meg
- Export the alignment in edited.fas. Then replace stop codons (*) with a dash (-), for example using MS Word’s Replace option.
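As an alternative to a word processor, the stop codon replacement in the last step can be scripted. This is a minimal Python sketch under the assumption that the exported file is plain FASTA text; the function name `mask_stop_codons` is made up for this post. Unlike a global search and replace, it leaves header lines untouched, in case a sequence name contains an asterisk.

```python
def mask_stop_codons(fasta_text):
    """Replace '*' with '-' in sequence lines only, keeping headers intact."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(line)  # header line: do not modify
        else:
            out.append(line.replace("*", "-"))
    return "\n".join(out) + "\n"
```

Read the exported amino acid alignment as text, pass it through the function, and write the result back to a new file so that the original export is kept.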
Eliminate duplicate sequences:
- In the main MEGA window, click File → Open A File/Session… and open edited.meg
- Click Distances → Compute Pairwise Distances… In the Analysis Preferences window, select Model/Method → No. of differences and press Compute. Check that there are no 0.00 values: a value of 0 means that two sequences are identical, and one of them should be removed.
- To trim the alignment (discard spuriously aligned regions and increase the phylogenetic signal-to-noise ratio), I follow two strategies: the Gblocks tool (Castresana, 2000) to trim nucleotide alignments by codon, and trimAl (Capella-Gutierrez et al., 2009) for amino acid alignments, using the -automated1 option:
# -t=c: type of sequence. p: protein; d: DNA; c: codon.
# -b2=<50%+1>: minimum number of sequences for a flank position. Replace the placeholder with an integer (50% of the number of sequences + 1).
# -b4=5: minimum length of a block. Any integer ≥ 2.
# -b5=h: allowed gap positions. n: none; h: with half; a: all.
Gblocks namefile_nts_muscle.edited.fas -t=c -b2=<50%+1> -b4=5 -b5=h
NOTE: according to the Gblocks server (http://molevol.cmima.csic.es/castresana/Gblocks_server.html), a less stringent selection consists of setting -b2 to 50% of the number of sequences + 1, -b4=5 and -b5=h.
trimal -in namefile_prot_muscle.edited.fas \
-out namefile_prot_muscle.edited.trimal.fas \
-htmlout namefile_prot_muscle.edited.trimal.html -automated1
- Format the trimmed FASTA alignments into a nexus file for MrBayes, using readAl, from the trimAl package:
readal -in namefile_nts_muscle.edited.fas-gb -out namefile_nts_muscle.edited.gb.nxs -nexus
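Two of the steps above lend themselves to a quick script: checking for duplicated sequences and working out the integer to use for -b2 in Gblocks. The following is a minimal Python sketch, not a replacement for the MEGA distance matrix. The function names are made up for this post, and the duplicate check only catches sequences that are exactly identical (ignoring case), which may not match every 0.00 that MEGA reports, depending on how gaps are treated.

```python
from collections import defaultdict


def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def find_identical(records):
    """Return groups of names whose aligned sequences are exactly identical."""
    groups = defaultdict(list)
    for name, seq in records:
        groups[seq.upper()].append(name)
    return [names for names in groups.values() if len(names) > 1]


def gblocks_b2(n_sequences):
    """50% of the number of sequences + 1, rounded down to an integer."""
    return n_sequences // 2 + 1
```

For an alignment with 10 sequences, `gblocks_b2(10)` gives 6, so the option would be -b2=6. Any group returned by `find_identical` points to sequences of which only one needs to be kept.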
Phylogenetic Tree Construction
[Newick tree file: “Hox4_nts_withLamprey_muscle.edited.gb.nwk”, obtained using ML in MEGA6 without bootstrapping]
[tree user = (tree);]
log start replace filename = Hox1_nts_withLamprey_muscle.gb.bay.log;
set autoclose = no nowarn=no;
charset 1st_pos = 1-.\3;
charset 2nd_pos = 2-.\3;
charset 3rd_pos = 3-.\3;
partition by_codon = 3:1st_pos,2nd_pos,3rd_pos;
set partition = by_codon;
lset applyto = (all) nst=6 rates = invgamma;
prset applyto = (all); [For JC add option statefreqpr = fixed(equal)]
unlink revmat=(all) shape=(all) pinvar=(all) statefreq=(all) tratio= (all);
[startvals tau = user V = user;]
mcmc ngen=4000000 printfreq=1000 samplefreq=100 nchains=4 temp=0.2 checkfreq = 50000 diagnfreq = 1000;
sumt relburnin = yes burninfrac = 0.25 contype = halfcompat conformat = simple;
sump relburnin = yes burninfrac = 0.25;
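To make the settings above a bit more concrete, here is a small Python sketch of the arithmetic behind two of them: which alignment columns each codon-position charset selects, and how many samples the mcmc and burn-in settings imply. The function names are made up for this post, and the counts are approximate: MrBayes may also record the starting state, so the real numbers can differ by one.

```python
def codon_position_columns(position, alignment_length):
    """1-based columns selected by a charset such as 1-.\\3 (position = 1, 2 or 3)."""
    # Start at the given codon position and take every third column.
    return list(range(position, alignment_length + 1, 3))


def mcmc_sample_counts(ngen, samplefreq, burninfrac):
    """Return (sampled, discarded, kept) sample counts per chain run."""
    sampled = ngen // samplefreq
    discarded = int(sampled * burninfrac)
    return sampled, discarded, sampled - discarded
```

With the values used above, `mcmc_sample_counts(4000000, 100, 0.25)` gives about 40000 samples per run, of which about 10000 are discarded as burn-in and 30000 are kept for sumt and sump. For a 9-column alignment, `codon_position_columns(3, 9)` selects columns 3, 6 and 9.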
References
Capella-Gutierrez S, Silla-Martinez JM, and Gabaldon T. 2009. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25:1972-1973.
Castresana J. 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Molecular Biology and Evolution 17:540-552.
Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792-1797.
Tamura K, Stecher G, Peterson D, Filipski A, and Kumar S. 2013. MEGA6: Molecular Evolutionary Genetics Analysis Version 6.0. Molecular Biology and Evolution 30:2725-2729.