2 Identification of Secondary Metabolite Gene Clusters


Genome Mining for Secondary Metabolites


Fig. 2 Example antiSMASH *.html output for A. orientalis HCCB10007. (a) Section from the gene cluster overview page. (b) Vancomycin biosynthesis gene cluster detected by antiSMASH. antiSMASH indicates glycopeptide biosynthesis clusters as NRPS-PKS-III hybrids. While the glycopeptide backbone is assembled via an NRPS mechanism, one of the incorporated amino acids (dihydroxyphenylglycine) is synthesized by the action of a type-III PKS

biosynthesis mechanism is linear to the order of the functional domains in a module. Since many exceptions to this rule are known [35–37], the predicted core structure should be considered only one possibility for the assembly.

3.2.1 Alternative Tools for Secondary Metabolite Gene Cluster Prediction

A recently published tool for the prediction of NRPS, type-I-, and type-II-PKS clusters is PRISM (Prediction Informatics for Secondary Metabolomes) [38]. Like antiSMASH, PRISM analyzes open reading frames against a large HMM library and groups them into clusters. The major difference compared to antiSMASH is that for the final structure prediction in PRISM the functions of trans-acting AT domains, deoxy sugar combinations,


Martina Adamek et al.

tailoring, and cyclization reactions are taken into account. For the identification of known compounds, an approach inspired by multilocus sequence typing, which generates scaffold structures, is implemented, and a database containing about 50,000 known secondary metabolites is available for comparison [38]. PRISM can be used as an alternative or in addition to antiSMASH for the prediction of PKS and NRPS clusters. It should be noted that, in contrast to antiSMASH, PRISM usually underestimates the cluster size; it is therefore recommended to inspect the regions flanking the predicted clusters. The most powerful implementation of the PRISM algorithm is provided in the GNP (Genomes to Natural Products) application, which offers the unique possibility of connecting genome sequence data with mass spectrometry data [39]. Other available programs with more specific applications are NP.searcher [40], ClustScan [41], and SBSPKS [42] for PKS and NRPS clusters, and BAGEL [43] for bacteriocins.

3.2.2 Alternative Approaches for Discovery of Novel Secondary Metabolite Gene Clusters

High-confidence prediction tools are limited to the detection of already known, well-characterized gene cluster classes. ClusterFinder is an algorithm that aims to identify gene clusters of both known and unknown classes [44]. ClusterFinder is included as an optional plug-in in the most recent version of antiSMASH [33] and should be enabled if the detection of novel pathways or unknown mechanisms is desired. The tool is based on the assumption that even novel biosynthetic pathways, which are very different from known ones, use the same broad enzyme families to catalyze key reactions. ClusterFinder detects certain PFAM domains located outside a comprehensive set of known biosynthetic gene cluster classes and thereby predicts putative novel clusters. This search increases the number of candidate secondary metabolite gene clusters (and the runtime) at the cost of lower confidence for some of the predicted clusters. For example, the number of clusters detected in A. orientalis HCCB10007 increased from 31 to 108 when the ClusterFinder algorithm was enabled in addition to antiSMASH. If the ClusterFinder algorithm is used, the results should be evaluated critically.

A novel idea for the detection of unknown biosynthetic pathway classes is used by the INBEKT (Identification of Natural compound Biosynthesis pathways by Exploiting Knowledge of Transcriptional regulation) approach [45]. The basic concept of INBEKT is to detect novel secondary metabolite genes by using knowledge of gene regulation instead of detecting biosynthetic enzymes. It follows the idea that global, environmental signal-sensing regulators control the production of certain secondary metabolites. Such regulators promote or repress gene transcription by binding to specific DNA motifs


upstream of their target genes. These regulators may sense environmental signals such as nutrient starvation, oxidative stress, or the presence of competing organisms, which can trigger the production of secondary metabolites. Computational screening of genome sequences for known DNA-binding sequences of such regulators provides a number of candidate gene regions. The list of candidate gene regions can be shortened by excluding hits inside a comprehensive set of known biosynthetic gene clusters and hits that are associated with primary metabolism. The remaining gene regions may direct the synthesis of secondary metabolites by new pathway classes.
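The exclusion of hits that fall inside known clusters can be sketched in a few lines of Python. This is only an illustration and not part of the INBEKT publication: the function names, the inclusive coordinate convention, and the example coordinates are assumptions made here. Hits associated with primary metabolism would still have to be removed by inspecting the annotation.

```python
from typing import List, Tuple

# (start, end) in bp on one replicon; both ends inclusive (assumed convention).
Interval = Tuple[int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two closed intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def filter_hits(hits: List[Interval], known_clusters: List[Interval]) -> List[Interval]:
    """Keep only motif hits that do not overlap any known cluster region."""
    return [h for h in hits if not any(overlaps(h, c) for c in known_clusters)]


# Hypothetical coordinates, for illustration only.
hits = [(1_000, 1_020), (50_500, 50_520), (90_000, 90_020)]
known_clusters = [(50_000, 75_000)]
remaining = filter_hits(hits, known_clusters)
print(remaining)
```

In practice the known cluster coordinates would come from a cluster prediction run, and the hit coordinates from the motif scan.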

The preliminary step in the INBEKT workflow is the generation of 5′ upstream regions (5′UTRs) of annotated open reading frames for the subsequent screening for regulator binding sequences. GetFeature (Table 2) is a web-based application that can be used to generate the 5′UTRs of the genome of interest. Sequence files have to be uploaded as EMBL or GenBank files with annotated open reading frames. Before submitting data, 5′UTR and “all” locus tags need to be selected (Fig. 3). The GetFeature output should be saved in FASTA format (see Note 3).
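What extracting an upstream region involves can be illustrated with a small Python sketch. It does not reproduce GetFeature; the function names, the 0-based half-open coordinates, and the default region length of 300 bp are assumptions chosen for illustration, and circular replicons are not handled.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    complement = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(complement)[::-1]


def upstream_region(genome: str, start: int, end: int, strand: int,
                    length: int = 300) -> str:
    """Return the region upstream of a gene, read in the gene's orientation.

    start/end are 0-based, half-open coordinates on the forward strand;
    strand is +1 or -1. The region is clipped at the sequence ends.
    """
    if strand == 1:
        return genome[max(0, start - length):start]
    if strand == -1:
        # For reverse-strand genes the upstream region lies after `end`
        # on the forward strand and must be reverse complemented.
        return reverse_complement(genome[end:end + length])
    raise ValueError("strand must be +1 or -1")


# Toy sequence, for illustration only.
genome = "ATGCATTTGGCC"
print(upstream_region(genome, 6, 9, 1, length=3))   # forward-strand gene
print(upstream_region(genome, 3, 6, -1, length=3))  # reverse-strand gene
```

Each extracted region would then be written as one FASTA record whose header carries the locus tag, so that hits can be traced back to the gene (see Note 2).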

Scanning these nucleotide data for the presence of the provided regulator binding sequences delivers candidate gene regions, which have to be critically sorted and evaluated. Some published consensus sequences can easily be accessed from databases such as PRODORIC® (PROcaryotIC Database Of Gene Regulation) [46], a database that organizes information about gene regulation and gene expression in prokaryotes, or CollecTF, a database of transcription factor binding sites in bacteria [47]. A more comprehensive list of databases is provided at the MEME-Suite web portal [48]. In general, it is advisable to use a regulator binding sequence from an organism that is closely related to the organism of interest. For example, when screening for iron-repressed genes, it has to be considered that gram-negative and low-GC gram-positive bacteria use the ferric uptake regulator (Fur) as the iron-responsive repressor, while high-GC gram-positive bacteria use its functional counterpart belonging to the DtxR protein family.

Scanning the nucleotide data for rationally chosen, user-provided motifs can be performed with tools such as FIMO (Find Individual Motif Occurrences) [49] or PatScanUI [50].

Here, we present an exemplary INBEKT workflow in which the A. orientalis HCCB10007 genome was screened for the presence of zinc uptake regulator (Zur) binding sequences. Zur is the major bacterial regulator sensing zinc concentrations. It represses the transcription of genes encoding zinc uptake and zinc mobilization functions by binding to palindromic A/T-rich sequences found in the promoters of its DNA targets [51]. All 8121 A. orientalis HCCB10007 5′UTRs were uploaded to PatScanUI as a FASTA file.


Fig. 3 Overview of the INBEKT workflow

The described Zur binding sequence (TCATGAAAATCATTTTCANNA) of Streptomyces coelicolor [52] was chosen as the motif to screen for zinc-repressed genes. The maximum number of allowed mismatches was set to 5 (Fig. 3). The optional settings for mismatches, insertions, and deletions should be chosen empirically, so that the range is wide enough to detect a suitable number of genes but narrow enough to exclude most false positives. To estimate whether the total number of detected genes is plausible, it can be compared to the number of genes known to belong to corresponding regulons. For example, the Zur regulons characterized so far usually comprise 10–30 genes.
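The principle of a mismatch-tolerant motif scan can be sketched as follows. This is not how PatScanUI or FIMO are implemented. It is a simplified stand-in that scans only the given strand, allows substitutions but no insertions or deletions, and treats "N" in the motif as a wildcard. The function names and the test sequence are hypothetical.

```python
from typing import List, Tuple

# Zur binding sequence of S. coelicolor as given in the text; N matches any base.
ZUR_MOTIF = "TCATGAAAATCATTTTCANNA"


def count_mismatches(motif: str, window: str) -> int:
    """Count positions where the window differs from the motif (N is a wildcard)."""
    return sum(1 for m, w in zip(motif, window) if m != "N" and m != w)


def scan_for_motif(sequence: str, motif: str,
                   max_mismatches: int) -> List[Tuple[int, str, int]]:
    """Return (position, matched window, mismatches) for every accepted window."""
    sequence, motif = sequence.upper(), motif.upper()
    hits = []
    for i in range(len(sequence) - len(motif) + 1):
        window = sequence[i:i + len(motif)]
        mismatches = count_mismatches(motif, window)
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits


# Hypothetical upstream region containing the motif with one substitution.
upstream = "GGCTCATGAAAACCATTTTCAGGACC"
for position, window, mismatches in scan_for_motif(upstream, ZUR_MOTIF, max_mismatches=5):
    print(position, window, mismatches)
```

Running the scan with several values of `max_mismatches` and comparing the number of hits to the expected regulon size is one way to choose the setting empirically.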


The Zur screening in A. orientalis HCCB10007 revealed a set of 11 genes (Fig. 3) that are putatively zinc regulated. The predicted functions of the candidate genes were assigned to known pathways when possible. Hit number 9 (AORI_6197), which is detected neither by the latest version of antiSMASH nor by the ClusterFinder algorithm, encodes a protein proposed to be involved in the synthesis of a nonproteinogenic amino acid. Such nonproteinogenic amino acids are common building blocks of various secondary metabolites. AORI_6197 is highly similar to AesA of Amycolatopsis japonica [45], which was shown to be essential for the synthesis of an unusual chelating agent whose production is zinc responsive. To date, A. orientalis HCCB10007 has not been described to produce such a compound.

3.3 Determining the Boundaries of a Gene Cluster

Clusters predicted by antiSMASH most probably do not display the correct boundaries, because the antiSMASH pipeline is designed to set the cluster boundaries automatically at 5, 10, or 20 kb on each side of the last signature gene, depending on the type of gene. Without experimental validation, the real cluster boundaries cannot be predicted exactly, but they can be estimated by comparing secondary metabolite gene clusters (SMGCs) in different genomes of related bacterial species. In the following sections, we present a set of tools to help with the estimation of gene cluster boundaries.

3.3.1 Comparison of antiSMASH Results with MIBiG

If the cluster of interest is similar to a known SMGC, cluster boundaries can be deduced from the additional antiSMASH output data. Here, the “find homologous gene clusters” and “find known homologous gene clusters” views in antiSMASH may be helpful. A comparison of A. orientalis HCCB10007 cluster 3 with the respective MIBiG entry for vancomycin reveals high similarity in the modular structure as well as in the set of flanking genes and therefore allows the gene cluster boundaries to be estimated by simple comparison. Although antiSMASH usually overestimates the gene cluster size, the known cluster is sometimes even larger than the cluster determined by antiSMASH. In this case, the raw sequence should be carefully inspected.

Most of the SMGCs predicted by antiSMASH will not have high similarity to clusters from the MIBiG database. In the following sections, we describe tools that can be used to estimate the SMGC boundaries by comparing the region of interest with similar regions in other, closely related bacterial strains.

3.3.2 JGI-IMG/ER: Comparative Genomics for Genomes Published in the JGI Database

An easy-to-use web tool is the JGI-IMG/ER (Integrated Microbial Genomes Expert Review) genome viewer on the JGI webpage (Table 2) [53]. JGI-IMG/ER offers different genomics applications, including a genome viewer, metabolic pathway identification, annotation tools, and phylogenetic clustering programs. It is necessary to register on the JGI webpage to get access to the JGI-IMG/ER applications. One helpful tool is the gene neighborhood view, which can be difficult to find among the diverse applications. First, several genomes of interest should be loaded into the genome cart. Using the “Find Genes” function, it is possible to search for specific genes by name, function, or locus tag. For each gene, an overview page with general annotation information, such as gene families, clusters of orthologous groups (COGs), or protein family (PFAM) domains, is given. Further down the page, the gene is shown in its direct neighborhood. Choosing the option “Show neighborhood with this gene’s bidirectional best hits” opens a view of all selected genomes from the cart that share orthologs of roughly the same size in the region of interest. By comparing the strain of interest with other strains that share the same or similar biosynthetic genes, in particular by comparing the regions flanking the gene cluster, the putative boundaries of the cluster can be determined (Fig. 4).

Fig. 4 JGI IMG/ER “Ortholog Neighborhoods” view. Part of the vancomycin gene cluster of A. orientalis HCCB10007 (GCA_000400635.2) compared to the homologous regions from A. decaplanina DSM 44595 (GCA_000342005.1), A. alba DSM 44262 (GCA_000384215.1), and A. balhimycina DSM 44591 (GCA_000384295.1) clusters (top down). The cluster is highlighted in gray to indicate the boundaries. Genes conserved only in closely related strains are highlighted in orange

3.3.3 Genome Comparison with MultiGeneBlast

MultiGeneBlast is an open source tool for the identification and comparison of multigene regions, such as gene clusters (Table 2) [54]. First, a database comprising the genome sequences of several closely related species must be built. This can be done either from GenBank entries on the NCBI server or from *.gbk or *.fasta files stored on a local hard drive (see Note 7). As an example, the GenBank file of the vancomycin cluster (cluster 3) downloaded from the antiSMASH results for A. orientalis HCCB10007 can be used as an input file. To create an example database, the “whole genome assembly” files for Amycolatopsis balhimycina FH1894, Amycolatopsis decaplanina DSM 44594, A. orientalis B-37, Amycolatopsis alba DSM 44242, Amycolatopsis mediterranei S699, and A. mediterranei U32 can be downloaded from the NCBI server by choosing “Database” → “Create from online GenBank entries.” If “Make raw nucleotide database for tblastn-search” is chosen, FASTA files are used instead of GenBank files. Using the “File” tab, it is possible to select the created database and open an input file. For a standard approach, the default settings are sufficient to determine the gene cluster boundaries. The results are displayed in *.xhtml format. Gene clusters similar to the vancomycin cluster are present in the A. decaplanina, A. alba, and A. balhimycina genomes. The minimum set of homologous genes shared by all strains harboring the SMGC can be seen best in A. balhimycina. Based on this information, the boundaries can be estimated (Fig. 5). Conversely, it is sometimes possible to define a cluster by excluding cluster parts that are present in all genomes, not only in those harboring the SMGC of interest (see Note 8).

Fig. 5 MultiGeneBlast example *.xhtml output for the vancomycin biosynthesis gene cluster of A. orientalis HCCB10007 (GCA_000400635.2) compared to the homologous gene clusters from A. decaplanina DSM 44595 (GCA_000342005.1), A. alba DSM 44262 (GCA_000384215.1), and A. balhimycina DSM 44591 (GCA_000384295.1) clusters. The cluster is highlighted in gray to indicate the boundaries. Genes conserved only in closely related strains are highlighted in orange

3.4 Working with Draft Genomes

To date, most published bacterial genomes are not complete but are split into short contigs or larger scaffolds. Contigs are DNA sequences assembled with high confidence from overlapping next-generation sequencing reads; they can vary in length from a few hundred bases up to several Mb. As a simple measure, genome quality can be estimated from the number of contigs (fewer contigs = better assembly) and from the N50 value, the contig length at which contigs of that length or longer contain at least half of the assembled bases, so that longer contigs carry greater weight (higher N50 = better assembly). Scaffolds are assemblies of contigs for which the relative position, but not the connecting sequence, is known. These “gaps” are represented as stretches of “N”s in the genome sequence. A more detailed description of genome assembly methods is given in [55, 56].
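The N50 value can be computed from a list of contig lengths with a short Python function. The sketch below uses the common definition: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The function name and the example lengths are chosen here for illustration.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of an assembly given its contig lengths in bp."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length
    return lengths[-1]  # not reached for positive lengths


# Hypothetical assembly: total 300 bp, half is 150 bp.
# 100 bp alone is below 150 bp; 100 + 80 = 180 bp reaches it, so N50 is 80.
print(n50([100, 80, 50, 30, 20, 20]))
```

Contig lengths can be obtained by summing the sequence lines of each record in the assembly FASTA file.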

Unfortunately for the natural product researcher, contig ends are often located in the middle of a secondary metabolite gene cluster, notably in the highly repetitive, large type-I-PKS or NRPS clusters. When working with draft genomes, one should be suspicious when either the antiSMASH gene cluster ends abruptly with a type-I-PKS or NRPS-like structure, or some flanking enzymes present in related SMGCs are missing. The position of the cluster on the contig should be taken into account to ensure that the observed cluster does not represent only a part of the cluster while the other part is located on a different contig.
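A simple automated check for clusters that touch a contig end could look like the sketch below. The function name, the coordinate convention, and the 1000 bp margin are assumptions made for illustration. A flagged cluster is only a candidate for closer inspection, not proof of truncation.

```python
from typing import List


def truncated_sides(cluster_start: int, cluster_end: int,
                    contig_length: int, margin: int = 1000) -> List[str]:
    """Report on which side(s) a predicted cluster lies close to a contig end.

    Coordinates are positions in bp along the contig, with
    0 <= cluster_start < cluster_end <= contig_length.
    """
    sides = []
    if cluster_start <= margin:
        sides.append("left")
    if contig_length - cluster_end <= margin:
        sides.append("right")
    return sides


# Hypothetical clusters on a 155 kb contig.
print(truncated_sides(120_000, 154_800, 155_000))  # ends 200 bp before the contig end
print(truncated_sides(40_000, 90_000, 155_000))    # well inside the contig
```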


Fig. 6 (a) JGI-IMG/ER “Ortholog neighborhoods” view of A. orientalis DSM 40040 and the homologous regions in the A. azurea DSM 43854 genome which are located on two different contigs. (b) MultiGeneBlast comparison of ortholog gene clusters. A. orientalis HCCB10007 serves as a query. The respective region in the A. azurea DSM 43854 genome is located on two different contigs

The easiest method to connect gene clusters on different contigs is to map the complete draft genome to a reference genome using mapping software such as CONTIGuator (Table 2) [57]. The drawback of this approach is that it only works for highly similar SMGCs in highly similar genomes, e.g., different strains of the same bacterial species. The risk is that the mapping program misassembles the highly similar modules of type-I-PKS or NRPS clusters, so that the resulting mapped cluster is much shorter than the real cluster. Mapped clusters should therefore always be carefully validated.

The previously described JGI-IMG/ER and MultiGeneBlast applications can be used to assign partial clusters on different contigs to one gene cluster. With the JGI-IMG/ER viewer “show neighborhood” option, contigs with sequence homology to a complete reference sequence are automatically mapped (Fig. 6a). If a draft genome is included in the MultiGeneBlast database, the separated SMGCs will be displayed as a match with a complete reference SMGC (Fig. 6b). As an example, the draft genome sequence of A. azurea DSM 43854 has been used. When the A. azurea DSM 43854 genome was analyzed with antiSMASH, gene cluster 1 appeared suspicious, because both cluster and contig ended right within an NRPS gene. Comparison of the gene cluster with similar clusters in JGI-IMG/ER and with MultiGeneBlast showed that the SMGC is spread over two distinct contigs (Fig. 6).

Even without a complete reference gene cluster for comparison, it is possible to figure out which contigs should be merged into one cluster. Within type-I-PKS and NRPS clusters, KS-domains or


C-domains of the same type usually share 85–90 % sequence identity. This information can be used to deduce which KS-/C-domains belong to the same cluster and, when comparing similar clusters in different genomes, which gene clusters probably encode highly similar products. A useful tool to accomplish this is NaPDoS, the Natural Product Domain Seeker (Table 2). NaPDoS is a web-based bioinformatic tool that can be used to identify and extract KS- and C-domains by a BLAST search against a database of different PKS, NRPS, and hybrid clusters [58]. Furthermore, NaPDoS can construct a phylogenetic tree based on the pairwise sequence similarity of KS- or C-domains. Another program for the construction of phylogenetic trees, which allows a choice between different tree-finding algorithms and is more flexible in the choice of parameters, is MEGA 6.0 [59]. For the assignment of SMGCs according to their phylogeny, this means that KS- or C-domains belonging to the same cluster, but present on different contigs, are likely to fall into the same phylogenetic clade.

With the information gained from all three approaches, it is feasible to merge the contigs. If the sequences on both contigs overlap, the contigs can be assembled manually; otherwise, it is common to indicate a gap of unknown length as a stretch of “N”s in the sequence.
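The two cases for joining contigs, overlap or gap, can be illustrated with a short Python sketch. It is deliberately simplified: only exact end-to-start overlaps are detected, both contigs are assumed to be in the same orientation, and the function name, the minimum overlap, and the gap length of 100 "N"s are placeholders chosen here. Real overlaps may contain sequencing errors and should be checked in an alignment.

```python
def merge_contigs(left: str, right: str, min_overlap: int = 20,
                  gap_length: int = 100) -> str:
    """Join two contigs that are adjacent and in the same orientation.

    If the end of `left` exactly matches the start of `right` over at least
    `min_overlap` bases, the overlap is collapsed (longest overlap wins).
    Otherwise the contigs are separated by a run of "N"s that marks a gap
    of unknown length.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    longest_possible = min(len(left), len(right))
    for k in range(longest_possible, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    return left + "N" * gap_length + right


# Toy sequences, for illustration only.
print(merge_contigs("AAACCCGGG", "CCGGGTTT", min_overlap=3))     # overlap "CCGGG"
print(merge_contigs("AAAA", "CCCC", min_overlap=3, gap_length=5))  # no overlap
```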

3.5 Prioritization of Gene Clusters

Finally, when all desired gene clusters have been identified and their boundaries estimated, one of the main remaining questions is: Which of the secondary metabolite gene clusters are worth investigating? The procedure is much the same whether one is searching for variants of already known compounds or for completely new compounds. It is necessary to compare the new clusters to clusters encoding already known compounds, to identify which known SMGCs show a similar gene composition and are therefore likely to produce similar compounds. Based on this information, it is possible to classify gene cluster families. In the next sections, the secondary metabolite databases and recent approaches for the classification of SMGC families are listed.

3.5.1 Dereplication by Comparison of Genes and Predicted Products with Secondary Metabolite Databases

To start, a comparison with the continuously growing MIBiG database [34] is a good option. A link to MIBiG is directly included in the antiSMASH output data in the “known homologous gene clusters” view. antiSMASH furthermore provides a prediction of the substrate specificity for the modules of PKSs and NRPSs. A comparison of the predicted monomers can indicate whether the compounds produced by two SMGCs are the same, slightly different, or completely different. A final similarity check can be performed by BLAST analysis.

A different approach for dereplication, which is based on the phylogeny of typical SMGC domains, is used in PRISM. In this approach,


possible variants for each product are predicted, and each variant is subsequently compared to a large natural product library [38]. Further natural product databases are ClustScan [41], DoBISCUIT [60], ClusterMine360 [61], and IMG-ABC [62].

3.5.2 Classification of Gene Cluster Families

The currently discussed approach for the prioritization of biosynthetic gene clusters is the classification of all secondary metabolite gene clusters discovered so far into gene cluster families. An SMGC classification should help to avoid the rediscovery of known compounds and to predict the structure and function of novel secondary metabolites by comparison with related clusters. This process is still in its early stages. Nevertheless, we would like to briefly introduce the different approaches toward solving this task.

1. A first approach was based on the definition of operational biosynthetic units (OBUs) for PKS and NRPS biosynthetic gene clusters. OBUs were classified according to similar gene content and organization. Clusters with an amino acid sequence identity of at least 90 % for KS-domains and 85 % for C-domains were grouped [63]. However, this approach is limited to PKS and NRPS clusters and has so far only been applied to Salinispora spp. Our recent experiences (data not published) show that these thresholds are not applicable to some other bacterial genera.

2. Another approach was based on the combination of three different similarity metrics: (a) the number of orthologous genes shared by two biosynthetic gene clusters, (b) the proportion of each cluster shared in a PROmer alignment, and (c) simplified, the number of corresponding signature genes in two clusters expressed as percentage values. A similarity matrix that gives different weights to the different similarity indices (a: 25 %, b: 25 %, c: 50 %) allowed these data to be represented in a distance network that clearly visualizes distinct gene cluster families [64]. The drawback of this method is that clusters producing only precursor molecules or smaller subunits can fall into the same groups as clusters for larger and more complex secondary metabolites.

3. The third approach combined two similarity metrics: (a) the Jaccard index to measure the similarity (presence or absence) of PFAM domains from all vs. all SMGCs (weighted: 36 %) and (b) the domain duplication index to measure similarities in the numbers of PFAM domains (weighted: 64 %) [44]. Cytoscape was used for the graphical visualization of the similarity values [65]. However, this method is not well suited for the comparison of highly repetitive multimodular clusters, such as type-I-PKS and NRPS clusters.
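How such weighted similarity scores are combined can be illustrated with a short Python sketch. It computes the Jaccard index for two sets of PFAM domains and forms a weighted sum with the weights quoted above. The function names, the example domain sets, and all example score values are assumptions for illustration. In particular, the domain duplication index is not computed here but passed in as a given number, because its definition is not reproduced in this chapter.

```python
from typing import Sequence, Set


def jaccard_index(a: Set[str], b: Set[str]) -> float:
    """Shared elements divided by all distinct elements of two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def weighted_similarity(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of similarity scores (each in 0..1); weights must sum to 1."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(s * w for s, w in zip(scores, weights))


# Illustrative PFAM domain sets of two clusters: 3 shared, 5 distinct -> 0.6.
cluster_a = {"PF00109", "PF02801", "PF00698", "PF00550"}
cluster_b = {"PF00109", "PF02801", "PF00550", "PF00501"}
jaccard = jaccard_index(cluster_a, cluster_b)

# Third approach: 36 % Jaccard index, 64 % domain duplication index
# (0.5 is a made-up value for the latter).
print(weighted_similarity([jaccard, 0.5], [0.36, 0.64]))

# Second approach: weights a: 25 %, b: 25 %, c: 50 % with made-up scores.
print(weighted_similarity([0.8, 0.6, 1.0], [0.25, 0.25, 0.50]))
```

Computing this score for all pairs of clusters gives the similarity matrix that is then visualized as a network.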

Finally, deciding which clusters are worth investigating depends strongly on the goals of the user. If the detection of a completely new


compound is desired, clusters not belonging to any known SMGC family should be prioritized. If the objective is to find a structural and functional variant of a known compound, members of a specific SMGC family might be of interest.

For future approaches, it is necessary to define generally applicable standard rules for the classification of secondary metabolite cluster families. Furthermore, specialized algorithms for the different types of gene clusters are needed.

4 Notes

1. Genome mining is not limited to genomic sequence data but can also be performed on sequenced cosmid libraries and metagenomic data. However, the programs presented in this chapter are not recommended for raw sequence data or very short assembled sequences, as they need sequences of a certain length as input data. For genome mining of raw metagenomic data, eSNaPD [66] is recommended.

2. When converting files, using annotation tools, etc., the same headers/identifiers should be used, so that it is always possible to identify the respective region of interest within a larger dataset. This is of particular interest when the genome is split into multiple contigs. Headers should be kept short, because some programs have problems with headers longer than 20 characters.

3. Filename extensions can make a difference. Several filename extensions are in common use for FASTA (*.fa, *.fas, *.fasta, *.fna for nucleotide, *.faa for amino acid) and GenBank (*.gbk, *.gb, *.gbf) files. For example, if a program does not accept FASTA data with a *.fas extension as an input file, changing the extension to *.fa or *.fasta could help.

4. Poor genome annotation influences the antiSMASH results. When antiSMASH is used with FASTA sequences, the detected SMGCs are additionally annotated using an up-to-date annotation pipeline. Therefore, it is recommended to run antiSMASH with both GenBank and FASTA sequences and to compare the results for missing genes or entirely missing clusters.

5. antiSMASH often merges neighboring clusters into one large cluster. These false hybrid clusters can be distinguished from real hybrid clusters by comparing and determining the cluster boundaries with MultiGeneBlast and JGI-IMG/ER.

6. Sometimes, MbtH-like structures (MbtH proteins bind to NRPS proteins to stimulate adenylation reactions) are wrongly classified as lantipeptide core peptides in antiSMASH. This is often seen in NRPS clusters. To distinguish real lantipeptide
