Before installation, make sure that the required programs are properly installed. Links for installing each requirement are listed on the download page.
Java Runtime Environment with a version higher than 8
Maven with a version higher than 3.9.4
AUGUSTUS and MMseqs2 are required for running the profile module.
Requirements - profile
To run the profile module, the following binaries should run properly on your system: augustus, fastBlockSearch, mmseqs
You can either add the paths of these binaries to $PATH or provide them as arguments to the ufcg program.
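As a quick pre-flight check, the sketch below looks up each required binary on $PATH. It is a minimal illustration and not part of UFCG; the helper names `find_binaries` and `missing_binaries` are hypothetical, and the binary names are the ones listed above.

```python
import shutil

# Binary names as listed above for the profile module.
PROFILE_BINARIES = ["augustus", "fastBlockSearch", "mmseqs"]


def find_binaries(names):
    """Map each binary name to its resolved path on $PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}


def missing_binaries(names):
    """Return the names that could not be resolved on $PATH."""
    return [name for name, path in find_binaries(names).items() if path is None]


if __name__ == "__main__":
    for name, path in find_binaries(PROFILE_BINARIES).items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

The same helper can be called with `["mafft", "iqtree"]` to check the binaries needed by the tree module. A binary reported as missing can still be supplied to the pipeline by path instead.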
MAFFT and IQ-TREE are required for running the tree module.
Requirements - tree
To run the tree module, the following binaries should run properly on your system: mafft, iqtree
Record the paths of these binaries in a plain text file named tree.cfg, such as:
You can also use raxml or fasttree as the tree builder; if you do, specify the paths of these programs as well.
Contents
Config files
seq/ contains reference protein/nucleotide sequences of pre-defined genetic markers.
model/ contains hidden Markov models, defined by protein sequences in seq/.
Detailed information about the genes is provided here.
ppx.cfg is a config file used with AUGUSTUS in the gene prediction stage.
tree.cfg is a config file for the tree module, containing the paths of the dependent binaries.
Samples
Sample input and output files are provided in the sample/ directory. Visit the tutorial page to learn how to use them.
meta_*.tsv is a sample of TSV-formatted metadata, which can be provided to the pipeline for annotating genomes.
More information about metadata is given in the Input files - Metadata section below.
Pipeline
ufcg or ufcg.jar is an executable of the pipeline.
Input files
Genome assembly
UFCG profile extracts the core gene profile from a FASTA-formatted genome assembly. Either a single assembly or a directory containing multiple assemblies is accepted as input.
Metadata
You can provide metadata for your input genome assemblies by converting it to the proper format. The UFCG pipeline receives seven entries that represent the taxonomic label of the genome.
To provide the metadata to the pipeline, create a TSV (tab-separated values) file containing the data and specify it as an argument. The TSV file must include a header line naming the entries, followed by the metadata of each genome in the same order, one genome per line.
Sample metadata TSV files are provided for guidance. Visit the tutorial page to find out how to use them.
sample/meta_full.tsv contains fully constructed metadata for the genomes in the sample/seq/ directory.
sample/meta_simple.tsv contains the minimum data for the genomes.
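Before handing a metadata file to the pipeline, it can be useful to confirm that it has a header line and that every genome row has the same number of tab-separated fields as the header. The sketch below does only that structural check. It does not know the seven entry names UFCG expects, so the function name `check_metadata_tsv` and the column names in the usage note are hypothetical placeholders.

```python
import csv
import io


def check_metadata_tsv(text):
    """Return a list of structural problems found in TSV metadata text (empty if none)."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return ["file is empty: a header line is required"]
    header = rows[0]
    problems = []
    # Data rows start on line 2, right after the header line.
    for lineno, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"line {lineno}: expected {len(header)} fields, found {len(row)}"
            )
    return problems
```

For example, `check_metadata_tsv("col_a\tcol_b\nx\ty\n")` returns an empty list, while a row with a missing field is reported with its line number. To check a file on disk, read it first, for instance with `open("sample/meta_simple.tsv", encoding="utf-8").read()`.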
Running profile module
Interactive mode
Run the following command in your terminal to run the UFCG pipeline interactively. Interactive mode guides you through the options that the pipeline requires and automatically builds the command to run it.
Single line command
You can also run the pipeline with a classic one-liner with options and arguments.
Required options
-i <DIR> : Path to the input file/directory containing fungal genome(s)
-o <DIR> : Path to the output directory in which to store result files
Additional options
The following options are not mandatory, but may be useful for configuring your run.
-m <FILE> : Path to the TSV file containing metadata
-t <INT> : Number of CPU thread(s) to use
-f <BOOL> : Force overwriting of result files in the output directory
-v : Make the program verbose
To see all available options, run the pipeline with the -h option.
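When the pipeline is driven from a script, the options above can be assembled into an argument list programmatically. The sketch below only builds the list and does not execute anything. It assumes the module is invoked as `ufcg profile`, which is an assumption rather than something stated in this section; the function name `build_profile_command` and the example paths are hypothetical. The -f option is left out because the accepted values for <BOOL> are not specified here.

```python
import shlex


def build_profile_command(input_path, output_dir, metadata=None, threads=None, verbose=False):
    """Assemble an argument list for the profile module from the documented options."""
    # ASSUMPTION: the module is invoked as `ufcg profile`; adjust to your installation.
    cmd = ["ufcg", "profile", "-i", str(input_path), "-o", str(output_dir)]
    if metadata is not None:
        cmd += ["-m", str(metadata)]       # TSV file containing metadata
    if threads is not None:
        cmd += ["-t", str(int(threads))]   # number of CPU threads
    if verbose:
        cmd.append("-v")                   # verbose output
    return cmd


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    command = build_profile_command("genomes/", "profiles/", threads=8, verbose=True)
    print(shlex.join(command))
```

Keeping the command as a list, rather than a single string, avoids quoting problems with paths that contain spaces if the list is later passed to `subprocess.run`.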
Results
The pipeline will extract the core gene profiles of given genome assemblies and store them as .ucg files.
.ucg file format
Files with the extension .ucg are JSON-formatted profiles containing extracted sequences of core genes, along with the metadata of the genome. These files can also be read and edited via any text editor.
Note the path to the output directory, or copy its contents into another directory, to use it as input for the tree module.
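Because .ucg files are JSON, they can be inspected with any JSON library. The sketch below lists the profiles in an output directory and reports the top-level keys of one profile. It makes no assumption about the schema inside the file; the helper names `list_profiles` and `top_level_keys` are hypothetical.

```python
import json
from pathlib import Path


def list_profiles(directory):
    """Return the .ucg files found directly inside a directory, sorted by name."""
    return sorted(Path(directory).glob("*.ucg"))


def top_level_keys(path):
    """Load one .ucg profile and return its top-level keys, sorted.

    Returns an empty list if the top-level JSON value is not an object.
    """
    with open(path, encoding="utf-8") as handle:
        profile = json.load(handle)
    return sorted(profile) if isinstance(profile, dict) else []
```

Calling `top_level_keys` on each path returned by `list_profiles` gives a quick overview of what a run produced before the directory is passed on to the tree module.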
Running time
With 32 CPU threads, profile module requires about 55 seconds to extract the UFCG marker genes from a fungal whole genome assembly.
Running tree module
Align genes and infer tree
Run the following command in your terminal to align the genes and infer a phylogenetic tree with the UFCG pipeline.
Required options
-i <DIR> : Path to the directory of input .ucg profiles to align and infer the tree from
-l <LIST> : Name the leaves of the phylogenetic tree using the metadata
-l option
Argument list
Select at least one of the following options and join them with commas:
uid : Include unique integer ID
acc : Include accession number
label : Include full label
taxon : Include taxon name
strain : Include strain name
taxonomy : Include taxonomic relationship
Note that every option given must be present as metadata in the .ucg profiles.
Examples
-l uid : Include unique IDs only
-l acc,label,taxon : Include accession, label and taxon names
-l uid,acc,label,taxon,strain,taxonomy : Include all metadata
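The rules for the -l argument (at least one entry, only the six names listed above, joined with commas) can be sketched as a small validator. This is an illustrative helper, not part of UFCG; the name `build_leaf_argument` is hypothetical.

```python
# The six leaf options documented above.
LEAF_OPTIONS = ("uid", "acc", "label", "taxon", "strain", "taxonomy")


def build_leaf_argument(selected):
    """Validate the chosen leaf options and join them with commas for -l."""
    if not selected:
        raise ValueError("select at least one leaf option")
    unknown = [name for name in selected if name not in LEAF_OPTIONS]
    if unknown:
        raise ValueError(f"unknown leaf option(s): {', '.join(unknown)}")
    return ",".join(selected)


if __name__ == "__main__":
    print(build_leaf_argument(["acc", "label", "taxon"]))
```

The validator checks only the spelling of the option names. Whether the corresponding metadata is actually present in the .ucg profiles still has to be verified separately.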
Additional options
-o <DIR> : Path to the directory for results
-n <STR> : Runtime name for this analysis
-a <STR> : Select sequence to align
nucleotide : Use given nucleotide sequence (default)
protein : Use given amino acid sequence
codon : Use codons encoding given amino acid sequence
codon12 : Use codons without third bases
-t <INT> : Number of CPU threads to use
-p <BINARY> : Use a different tree-building program
To see all available options, run the module with the -h option.
Results
The tree module produces the following result files. You can analyze the trees further with any phylogenetic tool that handles Newick files (MEGA, ETE, ape, etc.).
Type | Name | Description
-----|------|------------
Tree | concatenated.nwk | UFCG tree inferred from concatenated alignment
Tree | concatenated_gsi_[N].nwk | UFCG tree with GSI values computed with [N] genes, replacing bootstrap values
Tree | tree/[GENE].nwk | Gene tree inferred from each single gene
Align | aligned_concatenated.fasta | Concatenated sequences of entire core gene alignments
Align | align/aligned_[GENE].fasta | Aligned sequences of each single gene
Misc | [run_id].log | Log file containing information about this run
Misc | [run_id].trm | JSON file containing entire trees and their metadata
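To confirm that the leaves of a resulting Newick file carry the labels you asked for, the leaf names can be pulled out with a short script. The sketch below uses a regular expression instead of a full Newick parser, so it assumes unquoted labels and no bracketed comments; the function name `leaf_names` is hypothetical. For anything beyond a quick check, a dedicated tool such as ETE or ape is the safer choice.

```python
import re


def leaf_names(newick):
    """Extract leaf labels from a Newick string.

    A leaf label is taken to be the text that directly follows "(" or ","
    up to the next structural character. Internal node labels (such as
    support values) follow ")" and are therefore not matched.
    """
    return [name.strip() for name in re.findall(r"[(,]\s*([^(),:;]+)", newick)]


if __name__ == "__main__":
    example = "((A:0.1,B:0.2)90:0.3,C:0.4);"
    print(leaf_names(example))  # ['A', 'B', 'C']
```

To apply it to a result file, read the file into a string first, for example the concatenated.nwk file in your results directory.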
Running time
With 32 CPU threads, tree module requires about 413 seconds to produce the trees from 30 UFCG profiles.
Replace labels of tree
Run the following command to replace the leaf names with a different format.
Required options
-i <TRM> : Path to the .trm file from your run
-g <STR> : Name of the gene tree to relabel
Put UFCG to relabel the UFCG tree.
Put the name of a gene (e.g., RPB2) to relabel the corresponding single gene tree.
-l : Leaf option, identical to that of the tree module
Examples
To relabel the UFCG tree from the run myRun, using taxon names and strain names as leaves, execute:
To relabel the RPB2 gene tree from the run Lorem_ipsum, using accessions and taxonomic relationships as leaves, execute:
Troubleshooting
Q. UFCG emits error "AUGUSTUS_CONFIG_PATH undefined or improperly defined". How can I solve it?
You have to define the environment variable $AUGUSTUS_CONFIG_PATH to run AUGUSTUS properly. Run the following code to define the variable.
If you are using bash, run this to add the variable to your system semi-permanently.
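Before rerunning the pipeline, it can help to confirm that the variable is set and points to an existing directory. The Python sketch below is a diagnostic only and does not set anything; the function name `check_augustus_config` is hypothetical. It checks that the directory exists but does not verify that it is a valid AUGUSTUS config directory.

```python
import os


def check_augustus_config(env=None):
    """Return a problem description for AUGUSTUS_CONFIG_PATH, or None if it looks fine."""
    env = os.environ if env is None else env
    value = env.get("AUGUSTUS_CONFIG_PATH")
    if not value:
        return "AUGUSTUS_CONFIG_PATH is not set"
    if not os.path.isdir(value):
        return f"AUGUSTUS_CONFIG_PATH points to a missing directory: {value}"
    return None


if __name__ == "__main__":
    print(check_augustus_config() or "AUGUSTUS_CONFIG_PATH looks fine")
```

Run it from the same shell session in which you start UFCG, since a variable defined in one terminal is not automatically visible in another.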