UFCG Pipeline - Manual

Quick start


Conda

conda install -y -c bioconda ufcg
ufcg download -t minimum
ufcg -h

Docker

docker pull endix1029/ufcg:latest
docker run -it endix1029/ufcg:latest
ufcg -h

Manual installation

git clone "https://github.com/steineggerlab/ufcg.git"
cd ufcg
mvn clean package appassembler:assemble
./target/bin/ufcg download -t minimum
./target/bin/ufcg -h

Requirements (manual installation)

Before installing, make sure that the required programs are properly installed. Links for installing them are listed on the download page.

  • Java Runtime Environment with a version higher than 8
  • Maven with a version higher than 3.9.4
  • AUGUSTUS and MMseqs2 are required for running the profile module.
  • MAFFT and IQ-TREE are required for running the tree module.
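
Before building, it can help to check which of the required programs are reachable from your shell. The following is a minimal sketch, not part of UFCG itself. The binary names (java, mvn, augustus, mmseqs, mafft, iqtree) are assumptions about how these tools are commonly installed and may differ on your system; IQ-TREE 2, for example, is often installed as iqtree2.

```shell
#!/bin/sh
# Print the tools from the argument list that cannot be found on PATH.
# The binary names used below are assumptions; adjust them to your installation.
missing_tools() {
    missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    # Strip the leading space before printing.
    echo "${missing# }"
}

MISSING=$(missing_tools java mvn augustus mmseqs mafft iqtree)
if [ -n "$MISSING" ]; then
    echo "Not found on PATH: $MISSING"
else
    echo "All listed tools were found."
fi
```

A tool reported as missing may still be installed under another name or outside PATH, so treat the output as a hint rather than a verdict.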

Contents


Config files

  • seq/ contains reference protein/nucleotide sequences of pre-defined genetic markers.
  • model/ contains hidden Markov models built from the protein sequences in seq/.
    • Detailed information about the genes is provided here.
  • ppx.cfg is a config file used with AUGUSTUS in the gene prediction stage.
  • tree.cfg is a config file for the tree module, containing the paths of the dependent binaries.

Samples

  • Sample input and output files are provided in the sample/ directory. Visit the tutorial page to learn how to use them.
  • meta_*.tsv is a sample of TSV-formatted metadata, which can be provided to the pipeline for annotating genomes.
    • More information about metadata is given in the Input files - Metadata section below.

Pipeline

  • ufcg or ufcg.jar is the executable of the pipeline.

Input files


Genome assembly  

The UFCG profile module extracts the core gene profile from a FASTA-formatted genome assembly.
Either a single assembly or a directory containing multiple assemblies is accepted as input.

Metadata

You can provide metadata for your input genome assemblies by putting it in the proper format. The UFCG pipeline accepts seven entries that together describe the taxonomic label of a genome.

Entry        Description                             Example
Filename     Name of the file                        bakers_yeast.fasta
Label        Full label of the genome                Saccharomyces_cerevisiae_S288C
Accession    Accession code of the assembly (NCBI)   GCA_000146405.2
Taxon name   Name of the species                     Saccharomyces cerevisiae
NCBI name    Name of the assembly provided in NCBI   sacCer3
Strain name  Name of the strain                      S288C
Taxonomy     Full taxonomy of the species            Fungi;Ascomycota;Saccharomycetes;Saccharomycetales;Saccharomycetaceae;Saccharomyces;Saccharomyces cerevisiae
: Required entry

To provide the metadata to the pipeline, create a TSV (tab-separated values) file containing the data and specify it as an argument. The TSV file must begin with a header line naming the entries, followed by one line per genome with the values in the same order as the header.

Sample metadata TSV files are provided for guidance. Visit the tutorial page to find out how to use them.

  • sample/meta_full.tsv contains fully constructed metadata of genomes in sample/seq/ directory.
  • sample/meta_simple.tsv contains the minimum data of the genomes.
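
As an illustration of the file layout, the sketch below writes a one-genome metadata file using the example values from the table above and then checks that every line has seven tab-separated fields. The file name my_meta.tsv is hypothetical, and the header names are placeholders chosen for this example, not confirmed UFCG keywords: copy the exact header line from sample/meta_full.tsv before using such a file with the pipeline.

```shell
#!/bin/sh
# Hypothetical output file name.
META=my_meta.tsv

# Header line. NOTE: these header names are placeholders (an assumption);
# take the real header from sample/meta_full.tsv.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    "filename" "label" "accession" "taxon_name" "ncbi_name" "strain_name" "taxonomy" \
    > "$META"

# One line per genome, with values in the same order as the header.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    "bakers_yeast.fasta" "Saccharomyces_cerevisiae_S288C" "GCA_000146405.2" \
    "Saccharomyces cerevisiae" "sacCer3" "S288C" \
    "Fungi;Ascomycota;Saccharomycetes;Saccharomycetales;Saccharomycetaceae;Saccharomyces;Saccharomyces cerevisiae" \
    >> "$META"

# Sanity check: count the lines that do not have exactly seven fields.
BAD_LINES=$(awk -F'\t' 'NF != 7 { n++ } END { print n + 0 }' "$META")
echo "lines with a wrong field count: $BAD_LINES"
```

Using printf with explicit \t separators avoids the common problem of an editor silently replacing tabs with spaces.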

Running profile module


Interactive mode

Run the following command in your terminal to run the UFCG pipeline interactively. Interactive mode guides you through the options that the pipeline requires and automatically builds the command to run it.

$ ufcg profile -u

Single line command

You can also run the pipeline as a single command line with options and arguments.

$ ufcg profile -i <PATH> -o <PATH> [options]

Required options

  • -i <DIR> : Path to the input file or directory containing fungal genome(s)
  • -o <DIR> : Path to the output directory in which to store result files

Additional options

The following options are not mandatory but may be useful for configuring your run.

  • -m <FILE> : Path to the TSV file containing metadata
  • -t <INT> : Number of CPU threads to use
  • -f <BOOL> : Force overwriting of the result files in the output directory
  • -v : Make the program verbose

To see all available options, run the pipeline with the -h option.
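
As an illustration of how the options above combine, the sketch below assembles a profile command for the bundled sample data and runs it only if ufcg is installed. The output directory name and the thread count are hypothetical values chosen for this example.

```shell
#!/bin/sh
# Inputs taken from the sample/ directory described earlier.
GENOMES=sample/seq
META=sample/meta_full.tsv
# Hypothetical output directory and thread count; replace with your own.
OUT=profiles
THREADS=8

# Assemble the command from the documented options (-i, -o, -m, -t).
CMD="ufcg profile -i $GENOMES -o $OUT -m $META -t $THREADS"
echo "$CMD"

# Run it only if ufcg is on PATH. $CMD is left unquoted on purpose so the
# shell splits it into words; this breaks if a path contains spaces.
if command -v ufcg >/dev/null 2>&1; then
    $CMD
else
    echo "ufcg not found on PATH; command was not run"
fi
```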

Results

The pipeline extracts the core gene profiles of the given genome assemblies and stores them as .ucg files.

  • Note the path to the output directory, or copy its contents into another directory, so that you can use it as the input of the tree module.
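
Before passing a directory to the tree module, you may want to confirm that it actually contains profiles. The following sketch counts the .ucg files under a directory; the directory name profiles is a hypothetical value standing in for whatever you passed to -o.

```shell
#!/bin/sh
# Count the .ucg files under a directory (prints 0 if the directory is missing).
count_profiles() {
    find "$1" -type f -name '*.ucg' 2>/dev/null | wc -l | tr -d ' '
}

OUT=profiles   # hypothetical output directory of the profile run
echo "$(count_profiles "$OUT") profile(s) found in $OUT"
```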

Running time

With 32 CPU threads, the profile module requires about 55 seconds to extract the UFCG marker genes from a fungal whole-genome assembly.

Running tree module


Align genes and infer tree

Run the following command in your terminal to align the genes and infer a phylogenetic tree with the UFCG pipeline.

$ ufcg tree -i <DIR> -l <LIST> [options]

Required options

  • -i <DIR> : Path to the input .ucg profiles to align and infer a tree from
  • -l <LIST> : Metadata entries used to name the leaves of the phylogenetic tree

Additional options

  • -o <DIR> : Path to the directory for results
  • -n <STR> : Run name for this analysis
  • -a <STR> : Type of sequence to align
    • nucleotide : Use the given nucleotide sequences (default)
    • protein : Use the given amino acid sequences
    • codon : Use the codons encoding the given amino acid sequences
    • codon12 : Use the codons without their third bases
  • -t <INT> : Number of CPU threads to use
  • -p <BINARY> : Use a different tree-building program

To see all available options, run the module with the -h option.
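
As an illustration, the sketch below assembles a tree command that aligns protein sequences and names the leaves by taxon and strain, then runs it only if ufcg is installed. The directory names, the run name, and the thread count are hypothetical values chosen for this example; the leaf list taxon,strain is the one used in the prune examples later in this manual.

```shell
#!/bin/sh
# Hypothetical paths, run name, and thread count; replace with your own.
PROFILES=profiles
OUT=tree_out
RUN=myRun
THREADS=8

# Assemble the command from the documented options (-i, -l, -o, -n, -a, -t).
CMD="ufcg tree -i $PROFILES -l taxon,strain -o $OUT -n $RUN -a protein -t $THREADS"
echo "$CMD"

# Run it only if ufcg is on PATH. $CMD is left unquoted on purpose so the
# shell splits it into words; this breaks if a path contains spaces.
if command -v ufcg >/dev/null 2>&1; then
    $CMD
else
    echo "ufcg not found on PATH; command was not run"
fi
```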

Results

The tree module produces the following result files. You can analyze the trees further with any phylogenetic tool that handles Newick files (MEGA, ETE, ape, etc.).

Type   Name                         Description
Tree   concatenated.nwk             UFCG tree inferred from concatenated alignment
Tree   concatenated_gsi_[N].nwk     UFCG tree with GSI values computed with [N] genes, replacing bootstrap values
Tree   tree/[GENE].nwk              Gene tree inferred from each single gene
Align  aligned_concatenated.fasta   Concatenated sequences of entire core gene alignments
Align  align/aligned_[GENE].fasta   Aligned sequences of each single gene
Misc   [run_id].log                 Log file containing information about this run
Misc   [run_id].trm                 JSON file containing entire trees and their metadata

Running time

With 32 CPU threads, the tree module requires about 413 seconds to produce the trees from 30 UFCG profiles.

Replace labels of tree

Run the following command to relabel the leaves of a tree in a different format.

$ ufcg prune -i <TRM> -g <STR> -l <LIST>

Required options

  • -i <TRM> : Path to the .trm file from your run
  • -g <STR> : Name of the gene whose tree should be relabeled
    • Use UFCG to relabel the UFCG tree.
    • Use the name of a gene (e.g. RPB2) to relabel the corresponding single-gene tree.
  • -l <LIST> : Leaf option, identical to that of the tree module

Examples

To relabel the UFCG tree from the run myRun with taxon names and strain names as leaves, execute:

$ ufcg prune -i myRun.trm -g UFCG -l taxon,strain

To relabel the RPB2 gene tree from the run Lorem_ipsum with accessions and taxonomic relationships as leaves, execute:

$ ufcg prune -i Lorem_ipsum.trm -g RPB2 -l acc,taxonomy
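
If you need the same leaf format for several trees of one run, the command can be repeated in a loop. The sketch below does this for the UFCG tree and the RPB2 gene tree of the run myRun; it only prints the commands unless ufcg is installed and the .trm file exists. The choice of trees and leaf format is just an example.

```shell
#!/bin/sh
TRM=myRun.trm          # .trm file of the run, as in the first example above
LEAVES=taxon,strain    # leaf format, as in the first example above

for GENE in UFCG RPB2; do
    CMD="ufcg prune -i $TRM -g $GENE -l $LEAVES"
    echo "$CMD"
    # Run only when ufcg is on PATH and the .trm file is present.
    if command -v ufcg >/dev/null 2>&1 && [ -f "$TRM" ]; then
        $CMD
    fi
done
```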

Troubleshooting


Q. UFCG emits the error "AUGUSTUS_CONFIG_PATH undefined or improperly defined". How can I solve it?

You have to define the environment variable $AUGUSTUS_CONFIG_PATH for AUGUSTUS to run properly. Run the following command to define it.

$ export AUGUSTUS_CONFIG_PATH=/path/to/augustus/config/

If you are using bash, run the following to add the variable to your profile so that it persists across sessions.

$ echo "export AUGUSTUS_CONFIG_PATH=/path/to/augustus/config/" >> ~/.bash_profile
$ source ~/.bash_profile
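
To verify the setting, you can check that the variable is defined and points to a plausible directory. The following is a minimal sketch; the test for a species subdirectory is an assumption based on the usual layout of an AUGUSTUS config directory, not a check performed by UFCG.

```shell
#!/bin/sh
# Report whether AUGUSTUS_CONFIG_PATH looks usable.
check_augustus_config() {
    if [ -z "${AUGUSTUS_CONFIG_PATH:-}" ]; then
        echo "unset"
    elif [ ! -d "$AUGUSTUS_CONFIG_PATH" ]; then
        echo "not a directory"
    elif [ ! -d "$AUGUSTUS_CONFIG_PATH/species" ]; then
        # Assumption: a standard AUGUSTUS config directory contains species/.
        echo "no species subdirectory"
    else
        echo "ok"
    fi
}

echo "AUGUSTUS_CONFIG_PATH: $(check_augustus_config)"
```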

About us


    __  __ _____ _____ _____
   / / / // ___// ___// ___/
  / / / // /_  / /   / / __
 / /_/ // __/ / /___/ /_/ /
 \____//_/    \____/\____/

The UFCG pipeline was developed by Daniel Dongwook Kim.
The UFCG project is conducted under the supervision of Professors Martin Steinegger and Jon Jongsik Chun.

© All rights reserved to Steinegger Lab, Seoul National University.