UFCG Pipeline - Manual

Current version: v1.0.5

Quick start

Conda

conda install -y -c bioconda ufcg
ufcg download -t minimum
ufcg -h

Docker

docker pull endix1029/ufcg:latest
docker run -it endix1029/ufcg:latest
ufcg -h

Manual installation

git clone "https://github.com/steineggerlab/ufcg.git"
cd ufcg
mvn clean package appassembler:assemble
./target/bin/ufcg download -t minimum
./target/bin/ufcg -h

Requirements (manual installation)

Before installation, please make sure that the required programs are properly installed. Links for the installation of the requirements are listed in the download page.

Java Runtime Environment with a version higher than 8
Maven with a version higher than 3.9.4
AUGUSTUS and MMseqs2 are required for running the profile module.

Requirements - profile

To run the profile module, the following binaries must run properly on your system: augustus, fastBlockSearch, mmseqs

You can either add the paths of these binaries to $PATH, or provide them as arguments to the ufcg program.

MAFFT and IQ-TREE are required for running the tree module.

Requirements - tree

To run the tree module, the following binaries must run properly on your system: mafft, iqtree

Record the paths of these binaries in a plain text file named tree.cfg, for example:

mafft=/path/to/mafft
iqtree=/path/to/iqtree

You can also use raxml or fasttree as the tree builder, but you must then specify the paths of these programs as well.
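The dependency checks and the tree.cfg layout described above can be sketched as two small shell helpers. This is a minimal sketch and not part of UFCG: the function names missing_bins and write_tree_cfg are hypothetical, and because this manual does not state where UFCG expects tree.cfg to live, the output path is passed as an argument rather than assumed.

```shell
# Hypothetical helpers (not shipped with UFCG).

# Print every name from the argument list that cannot be found on $PATH.
missing_bins() {
    for bin in "$@"; do
        command -v "$bin" >/dev/null 2>&1 || printf '%s\n' "$bin"
    done
}

# Write "name=/path/to/name" lines, in the tree.cfg format shown above,
# for each given binary that is found on $PATH. Missing ones are skipped.
write_tree_cfg() {
    cfg=$1
    shift
    : > "$cfg"
    for bin in "$@"; do
        path=$(command -v "$bin" 2>/dev/null) || continue
        printf '%s=%s\n' "$bin" "$path" >> "$cfg"
    done
}
```

For example, missing_bins augustus fastBlockSearch mmseqs mafft iqtree lists the binaries that still need to be installed, and write_tree_cfg tree.cfg mafft iqtree writes the two entries shown above using the paths found on the current system.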

Contents

Config files

seq/ contains reference protein/nucleotide sequences of pre-defined genetic markers.
model/ contains hidden Markov models, defined by protein sequences in seq/.
- Detailed information on the genes is provided here.
ppx.cfg is a config file used in the gene prediction stage, together with AUGUSTUS.
tree.cfg is a config file for the tree module, containing path information of dependent binaries.

Samples

Sample input and output files are provided in the sample/ directory. Visit the tutorial page to learn how to use them.
meta_*.tsv files are samples of TSV-formatted metadata, which can be provided to the pipeline to annotate genomes.
- More information about metadata is given in the Input files - Metadata section below.

Pipeline

ufcg or ufcg.jar is an executable of the pipeline.

Input files

Genome assembly

UFCG profile extracts the core gene profile from a FASTA-formatted genome assembly.
Either a single assembly or a directory containing multiple assemblies is accepted as input.

Metadata

You can provide metadata for your input genome assemblies by converting it into the proper format. The UFCG pipeline receives seven entries that represent the taxonomic label of the genome.

: Required entry
Entry	Description	Example
Filename	Name of the file	bakers_yeast.fasta
Label	Full label of the genome	Saccharomyces_cerevisiae_S288C
Accession	Accession code of the assembly (NCBI)	GCA_000146405.2
Taxon name	Name of the species	Saccharomyces cerevisiae
NCBI name	Name of the assembly provided in NCBI	sacCer3
Strain name	Name of the strain	S288C
Taxonomy	Full taxonomy of the species	Fungi;Ascomycota;Saccharomycetes;Saccharomycetales;Saccharomycetaceae;Saccharomyces;Saccharomyces cerevisiae

To provide the metadata to the pipeline, create a TSV (tab-separated values) file containing the data and specify it as an argument. The TSV file must include a header line naming the entries, followed by one line per genome with the values in the same order as the header.

Sample metadata TSV files are provided for your guidance. Visit the tutorial page to find out how to use them.

sample/meta_full.tsv contains fully constructed metadata of the genomes in the sample/seq/ directory.
sample/meta_simple.tsv contains the minimum data of the genomes.
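As an illustration of the layout described above, the following sketch writes a one-genome metadata file from the example values in the table and checks that every row has as many tab-separated fields as the header. The function names are hypothetical, and the header spelling used here is an assumption; copy the exact header line from sample/meta_full.tsv before passing a file to the pipeline.

```shell
# Hypothetical helpers (not shipped with UFCG).

# Write a header plus one row built from the example values in the table.
# ASSUMPTION: the header spelling is illustrative; take the exact header
# from sample/meta_full.tsv.
write_example_meta() {
    {
        printf 'filename\tlabel\taccession\ttaxon_name\tncbi_name\tstrain_name\ttaxonomy\n'
        printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
            'bakers_yeast.fasta' \
            'Saccharomyces_cerevisiae_S288C' \
            'GCA_000146405.2' \
            'Saccharomyces cerevisiae' \
            'sacCer3' \
            'S288C' \
            'Fungi;Ascomycota;Saccharomycetes;Saccharomycetales;Saccharomycetaceae;Saccharomyces;Saccharomyces cerevisiae'
    } > "$1"
}

# Succeed only if every non-empty row has the same number of
# tab-separated fields as the header line; report the rows that do not.
check_meta_columns() {
    awk -F '\t' '
        NR == 1 { n = NF; next }
        NF > 0 && NF != n {
            printf "line %d: %d fields, expected %d\n", NR, NF, n
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Running check_meta_columns on a hand-edited file before starting a long run catches the most common mistake, a row with a missing or extra tab.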

Running profile module

Interactive mode

Run the following command in your terminal to run the UFCG pipeline interactively. Interactive mode guides you through the options that the pipeline requires and automatically builds the command to run it.

$ ufcg profile -u

Single line command

You can also run the pipeline as a classic one-liner with options and arguments.

$ ufcg profile -i <PATH> -o <PATH> [options]

Required options

-i <DIR> : Locate the path of the input file/directory with fungal genome(s)
-o <DIR> : Locate the path of the output directory to store result files

Additional options

The following options are not mandatory, but may be useful for configuring your run.

-m <FILE> : Locate the path to the TSV file containing metadata
-t <INT> : Number of CPU thread(s) to use
-f <BOOL> : Force to overwrite the result files in output directory
-v : Make program verbose

To see all available options, run the pipeline with the -h option.
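The single-line invocation above can be wrapped in a small shell function for repeated use. This is a minimal sketch: the function name run_profile and the DRY_RUN switch are hypothetical conveniences and not part of UFCG, and only the -i, -o, -m and -t options listed above are used.

```shell
# Hypothetical wrapper (not part of UFCG) around "ufcg profile".
# Usage: run_profile <input> <output_dir> [metadata.tsv] [threads]
# Set DRY_RUN=1 to print the command instead of executing it.
run_profile() {
    input=$1
    output=$2
    meta=${3:-}
    threads=${4:-}

    # Start from the two required options.
    set -- ufcg profile -i "$input" -o "$output"
    # Append the optional ones only when a value was supplied.
    if [ -n "$meta" ]; then set -- "$@" -m "$meta"; fi
    if [ -n "$threads" ]; then set -- "$@" -t "$threads"; fi

    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

For example, with DRY_RUN=1 set, run_profile sample/seq/ out/ sample/meta_full.tsv 8 prints the command that would be executed without starting the pipeline.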

Results

The pipeline extracts the core gene profiles of the given genome assemblies and stores them as .ucg files.

To use the profiles as input for the tree module, note the path to the output directory, or copy its contents into another directory.
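Counting the profiles and gathering them into one directory for the tree module can be sketched as follows. Both function names are hypothetical and not part of UFCG; the only assumption taken from this manual is that profiles carry the .ucg extension.

```shell
# Hypothetical helpers (not part of UFCG) for handling profile output.

# Print the number of .ucg files found directly inside a directory.
count_ucg() {
    n=0
    for f in "$1"/*.ucg; do
        # The glob stays literal when nothing matches, so test each entry.
        if [ -f "$f" ]; then n=$((n + 1)); fi
    done
    printf '%s\n' "$n"
}

# Copy all .ucg files from one directory into another, creating it if needed.
collect_ucg() {
    mkdir -p "$2"
    for f in "$1"/*.ucg; do
        if [ -f "$f" ]; then cp "$f" "$2"/; fi
    done
}
```

Comparing the result of count_ucg with the number of input assemblies is a quick way to notice genomes for which no profile was written.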

Running time

With 32 CPU threads, profile module requires about 55 seconds to extract the UFCG marker genes from a fungal whole genome assembly.

Running tree module

Align genes and infer tree

Run the following command in your terminal to align the genes and infer a phylogenetic tree with the UFCG pipeline.

$ ufcg tree -i <DIR> -l <LIST> [options]

Required options

-i <DIR> : Locate the path of the input .ucg profiles to align and infer tree
-l <LIST> : Name the leaves of the phylogenetic tree from the metadata

Additional options

-o <DIR> : Locate the path of the directory for results
-n <STR> : Runtime name for this analysis
-a <STR> : Select sequence to align
- nucleotide : Use given nucleotide sequence (default)
- protein : Use given amino acid sequence
- codon : Use codons encoding given amino acid sequence
- codon12 : Use codons without third bases
-t <INT> : Number of CPU threads to use
-p <BINARY> : Use different tree building program

To see all available options, run the module with the -h option.
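To compare the four alignment types listed under -a, the tree module can be run once per type. The following is a minimal sketch under stated assumptions: the function name tree_per_alignment, the DRY_RUN switch, the run names passed to -n and the one-subdirectory-per-type layout are all hypothetical choices, not UFCG conventions.

```shell
# Hypothetical helper (not part of UFCG): run "ufcg tree" once per alignment type.
# Usage: tree_per_alignment <profile_dir> <leaf_list> <output_dir> [threads]
# Set DRY_RUN=1 to print the commands instead of executing them.
tree_per_alignment() {
    profiles=$1
    leaves=$2
    outdir=$3
    threads=${4:-}
    for aln in nucleotide protein codon codon12; do
        # One output subdirectory and run name per alignment type.
        set -- ufcg tree -i "$profiles" -l "$leaves" \
            -o "$outdir/$aln" -n "run_$aln" -a "$aln"
        if [ -n "$threads" ]; then set -- "$@" -t "$threads"; fi
        if [ "${DRY_RUN:-0}" = "1" ]; then
            printf '%s\n' "$*"
        else
            mkdir -p "$outdir/$aln"
            "$@" || return 1
        fi
    done
}
```

With DRY_RUN=1 set, tree_per_alignment out/ taxon,strain trees/ 8 prints four commands, one for each alignment type, so they can be inspected before anything is run.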

Results

The tree module produces the following result files. You can further analyze the trees with phylogenetic tools that handle Newick files (MEGA, ETE, ape, etc.).

Type	Name	Description
Tree	concatenated.nwk	UFCG tree inferred from concatenated alignment
Tree	concatenated_gsi_[N].nwk	UFCG tree with GSI values computed with [N] genes, replacing bootstrap values
Tree	tree/[GENE].nwk	Gene tree inferred from each single gene
Align	aligned_concatenated.fasta	Concatenated sequences of entire core gene alignments
Align	align/aligned_[GENE].fasta	Aligned sequences of each single gene
Misc	[run_id].log	Log file containing information about this run
Misc	[run_id].trm	JSON file containing entire trees and their metadata
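A quick sanity check on a resulting Newick file is to count its leaves and compare the number with the number of input profiles. The sketch below relies on a general property of Newick trees, not on anything specific to UFCG: a single tree with k leaves contains exactly k - 1 commas. It assumes one tree per file and no commas inside leaf labels; the function name is hypothetical.

```shell
# Hypothetical helper (not part of UFCG).
# Print the number of leaves in a file holding a single Newick tree.
# ASSUMPTION: leaf labels contain no commas, so leaves = commas + 1.
count_newick_leaves() {
    commas=$(tr -cd ',' < "$1" | wc -c | tr -d ' ')
    printf '%s\n' "$((commas + 1))"
}
```

Applied to concatenated.nwk from a run on 30 profiles, the function should report 30 if every genome made it into the tree.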

Running time

With 32 CPU threads, the tree module requires about 413 seconds to produce the trees from 30 UFCG profiles.

Replace labels of tree

Run the following command to replace the names of the leaves with a different format.

$ ufcg prune -i <TRM> -g <STR> -l <LIST>

Required options

-i <TRM> : Locate the path of the .trm file from your run
-g <STR> : Specify a gene name to replace
- Put UFCG to replace UFCG tree.
- Put the name of gene (ex. RPB2) to replace the corresponding single gene tree.
-l <LIST> : Leaf option, identical to that of the tree module

Examples

To replace the UFCG tree by using taxon names and strain names as leaves from the run myRun, execute:

$ ufcg prune -i myRun.trm -g UFCG -l taxon,strain

To replace the RPB2 gene tree by using accessions and taxonomic relationships as leaves from the run Lorem_ipsum, execute:

$ ufcg prune -i Lorem_ipsum.trm -g RPB2 -l acc,taxonomy
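When several trees from the same run need the same leaf format, the command can be repeated in a loop. This is a minimal sketch: the function name relabel_trees and the DRY_RUN switch are hypothetical, and only the -i, -g and -l options described above are used.

```shell
# Hypothetical helper (not part of UFCG): relabel several trees of one run.
# Usage: relabel_trees <run.trm> <leaf_list> <gene> [<gene> ...]
# Pass UFCG as a gene name to relabel the UFCG tree itself.
# Set DRY_RUN=1 to print the commands instead of executing them.
relabel_trees() {
    trm=$1
    leaves=$2
    shift 2
    for gene in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            printf 'ufcg prune -i %s -g %s -l %s\n' "$trm" "$gene" "$leaves"
        else
            ufcg prune -i "$trm" -g "$gene" -l "$leaves" || return 1
        fi
    done
}
```

For example, relabel_trees myRun.trm taxon,strain UFCG RPB2 relabels both the UFCG tree and the RPB2 gene tree of the run myRun.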

Troubleshooting

Q. UFCG emits error "AUGUSTUS_CONFIG_PATH undefined or improperly defined". How can I solve it?

You have to define the environment variable $AUGUSTUS_CONFIG_PATH to run AUGUSTUS properly. Run the following command to define it.

$ export AUGUSTUS_CONFIG_PATH=/path/to/augustus/config/

If you are using bash, run the following commands to add the variable semi-permanently to your system.

$ echo "export AUGUSTUS_CONFIG_PATH=/path/to/augustus/config/" >> ~/.bash_profile

$ source ~/.bash_profile
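Before starting a long run, the variable can be checked with a short shell function. This is a sketch, not part of UFCG: the function name is hypothetical, and the test for a species/ subdirectory reflects the usual layout of an AUGUSTUS config directory rather than anything UFCG itself verifies.

```shell
# Hypothetical helper (not part of UFCG).
# Succeed only if AUGUSTUS_CONFIG_PATH is set and looks like an
# AUGUSTUS config directory (one that contains a species/ subdirectory).
check_augustus_config() {
    dir=${AUGUSTUS_CONFIG_PATH:-}
    if [ -z "$dir" ]; then
        echo "AUGUSTUS_CONFIG_PATH is not set" >&2
        return 1
    fi
    if [ ! -d "$dir/species" ]; then
        echo "no species/ directory under $dir" >&2
        return 1
    fi
    return 0
}
```

Calling check_augustus_config in the same shell session that will run ufcg confirms that the export took effect there.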

About us

    __  __ _____ _____ _____
   / / / // ___// ___// ___/
  / / / // /_  / /   / / __
/ /_/ // __/ / /___/ /_/ /
\____//_/    \____/\____/

The UFCG pipeline was developed by Daniel Dongwook Kim.
The UFCG project is supervised by Professors Martin Steinegger and Jon Jongsik Chun.
