OMA standalone is a piece of software which makes it possible to run the OMA algorithm for inferring homology information on your custom data. This includes generating pairwise orthologs, Hierarchical Orthologous Groups, as well as OMA Groups. It takes as input the coding sequences of genomes or transcriptomes, in FASTA format. The recommended input type is amino acid sequences, but OMA also supports nucleotide sequences. With amino acid sequences, users can combine their own data with publicly available genomes from the OMA database, including pre-computed all-against-all sequence comparisons (the first and computationally most intensive step), using the export function on the OMA website.
In this exercise, we will run OMA standalone to obtain gene families and other orthology information for a few bacterial species. We will download four genomes from the OMA browser, then add our own custom genome as an example. For more information on OMA standalone, please see the accompanying blog post and the extensive documentation.
tar -zxvf AllAll-...
Note: when using your own genomes, or adding a genome to the exported all-against-all data, the name of the proteome file is used as the name of the genome throughout the rest of the analysis.
3. Add the following dummy bacterial genome to your dataset: my_bacterial_genome.fa
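Once the file is in place, one way to check how many sequences it contains is to count its FASTA header lines. The sketch below is only an illustration: the function name `count_fasta_records` and the tiny demo file are made up for this example and are not part of OMA standalone.

```shell
# Count FASTA records by counting header lines (those starting with ">").
# count_fasta_records is a hypothetical helper name used for illustration.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Demo on a tiny made-up FASTA file (not the tutorial genome).
demo_fasta=$(mktemp)
printf '>seq1 demo\nMKT\n>seq2 demo\nMLA\nGGS\n' > "$demo_fasta"
count_fasta_records "$demo_fasta"
```

On the real data, the same function would be called as `count_fasta_records my_bacterial_genome.fa`.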
Hint: use grep (see the grep doc) to select the header lines (starting with ">") and count them.
Hint: in Python, use Bio.SeqIO or similar.
OMA standalone is configured through the parameters.drw file. This file is located in the main OMA directory and should be edited by the user. There are many options that can be tweaked, but two deserve particular attention: SpeciesTree and OutgroupSpecies.
Note: here, we shall not edit the SpeciesTree parameter. Instead, we shall let OMA estimate it. For future reference, this estimation should be used with extreme caution and the resulting EstimatedSpeciesTree.nwk file should be examined.
5. Edit the parameters.drw file and set the outgroup species to Magnetococcus marinus.
The OMA algorithm runs in three main steps: 1) Quality and consistency checks of the genomes that will be used to run OMA Standalone; 2) All-against-all alignments of every protein sequence to all other protein sequences; and 3) Orthology inference, in the form of: pairwise orthologs, OMA Groups, and Hierarchical Orthologous Groups (HOGs). For more information on these types of orthologs output by OMA, see OMA: A Primer (Zahn-Zabal et al. 2020). The all-against-all step is the most computationally intensive and takes the longest amount of time. This is why it is beneficial to export the precomputed all-against-all for genomes in the OMA browser.
Cache/AllAll
`bin/oma`
Now that the OMA standalone run is complete, the Output folder should have been created; have a look at its contents. Note: Familiarity with command line scripting is preferable to complete this section.
wc -l *
cat STRZN-BACAA.txt | cut -f 5 | sort | uniq -c
HOGFasta folder and loop through each file to count the number of genes.
ls -1 | grep ".fa" | sed "s/.*/grep -c \">\" &/" | bash | awk '{ total += $1 } END { print total/NR }'
OrthologousGroups.txt. In this file, each line is an OMA group and each tab-separated column holds the whole gene description (from the FASTA header).
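Because each line is one group and each tab-separated column one member, group sizes can be read off by counting columns. The following is a minimal sketch under two assumptions that are not stated above: that comment lines start with "#", and that the first column is a group identifier rather than a gene. The function name `oma_group_sizes` and the demo file contents are invented for illustration.

```shell
# Print "<group id><TAB><number of members>" for each OMA group.
# Assumptions (check against your own file): lines starting with "#"
# are comments, and the first tab-separated column is the group identifier.
oma_group_sizes() {
    awk -F'\t' '!/^#/ && NF > 1 { print $1 "\t" (NF - 1) }' "$1"
}

# Demo on a small made-up file in the assumed layout.
demo_groups=$(mktemp)
printf '# demo header\nOMA00001\tAAA:g1 desc\tBBB:g2 desc\nOMA00002\tAAA:g3\tBBB:g4\tCCC:g5\n' > "$demo_groups"
oma_group_sizes "$demo_groups"
```

On the real output, this would be run as `oma_group_sizes OrthologousGroups.txt`, and piping the result through `sort -k2,2n` would order the groups by size.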