Genome Mining for Secondary Metabolites
Fig. 2 Example antiSMASH *.html output for A. orientalis HCCB10007. (a) Section from the gene cluster overview page. (b) Vancomycin biosynthesis gene cluster detected by antiSMASH. antiSMASH indicates glycopeptide biosynthesis clusters as NRPS-PKS-III hybrids. While the glycopeptide backbone is assembled via an NRPS
mechanism, one of the incorporated amino acids (dihydroxyphenylglycine) is synthesized by the action of a
type-III PKS
biosynthesis mechanism is linear to the order of the functional
domains in a module. Since plenty of exceptions from this rule are
known [35–37], the predicted core structure should be considered
as only one possibility for the assembly.
3.2.1 Alternative Tools for Secondary Metabolite Gene Cluster Prediction
A recently published tool for the prediction of NRPS, type-I-,
and type-II-PKS clusters is PRISM (Prediction Informatics for
Secondary Metabolomes) [38]. Similar to antiSMASH, PRISM
analyses open reading frames based on a large HMM library and
groups them into clusters. The major difference compared to antiSMASH is that for the final structure prediction in PRISM the
functions of trans-acting AT domains, deoxy sugar combinations,
Martina Adamek et al.
tailoring, and cyclization reactions are taken into account. For the
identification of known compounds, a multilocus sequence typing
inspired approach, generating scaffold structures, is implemented
and a database containing about 50,000 known secondary metabolites is available for comparison [38]. PRISM can be used as an
alternative or in addition to antiSMASH for the prediction of PKS
and NRPS clusters. It should be noted that, in contrast to antiSMASH, PRISM usually
underestimates the cluster size; it is therefore recommended to inspect
the regions flanking the determined clusters. The most powerful
implementation of the PRISM algorithm is provided in the GNP
(Genomes to Natural Products) application, which offers the
unique possibility of connecting genome sequence data with mass
spectrometry data [39]. Other programs available, with more
specific applications, are: NP.searcher [40], ClustScan [41] and
SBSPKS [42] for PKS and NRPS clusters, or BAGEL [43] for
bacteriocins.
3.2.2 Alternative Approaches for Discovery of Novel Secondary Metabolite Gene Clusters
The high-confidence predictive tools are limited to the detection
of already known, well-characterized gene cluster classes.
ClusterFinder is an algorithm that aims to identify gene clusters of both
known and unknown classes [44]. ClusterFinder is included as an
optional plug-in in the most recent version of antiSMASH [33]
and should be enabled if the detection of novel pathways or
unknown mechanisms is desired. This tool is based on the assumption
that even novel biosynthetic pathways, which are very different
from known ones, utilize the same broad enzyme families for
the catalysis of key reactions. ClusterFinder detects certain PFAM
domains that are located outside of a comprehensive set of
known biosynthetic gene cluster classes and thereby predicts putative
novel clusters. This search increases the number of candidate
secondary metabolite gene clusters, at the cost of a longer runtime
and a lower confidence for some of the predicted clusters.
For example, the number of detected clusters in A. orientalis
HCCB10007 increased from 31 to 108 when the ClusterFinder
algorithm was enabled in addition to antiSMASH. If the ClusterFinder
algorithm is used, the results should be evaluated critically.
A novel idea for the detection of unknown biosynthetic
pathway classes is utilized by the INBEKT (Identification of
Natural compound Biosynthesis pathways by Exploiting Knowledge
of Transcriptional regulation) approach [45]. The basic concept of
INBEKT is to detect novel secondary metabolite genes by utilizing
knowledge of gene regulation instead of detecting biosynthetic
enzymes. It follows the idea that global, environmental
signal-sensing regulators control the production of
certain secondary metabolites. Such regulators promote or repress
gene transcription by binding to specific DNA motifs
upstream of their target genes. These regulators may sense
environmental signals such as nutrient starvation, oxidative stress, or
the presence of competing organisms, which can trigger the
production of secondary metabolites. Computational screening
of genome sequences for known DNA-binding sequences of
such regulators provides a number of candidate gene regions.
The list of candidate gene regions can be shortened by excluding
hits inside a comprehensive set of known biosynthetic gene clusters
and hits that are associated with primary metabolism. The remaining
gene regions possibly direct the synthesis of secondary
metabolites by new pathway classes.
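The exclusion step can be illustrated with a minimal Python sketch. It assumes that motif hits and known cluster regions are already available as coordinate intervals; the `Region` type, the function names, and all coordinates are hypothetical and are not part of the published INBEKT workflow.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A half-open interval [start, end) on one contig (illustrative type)."""
    contig: str
    start: int
    end: int


def overlaps(a: Region, b: Region) -> bool:
    """True if both regions lie on the same contig and share at least one base."""
    return a.contig == b.contig and a.start < b.end and b.start < a.end


def filter_candidates(hits: list[Region], excluded: list[Region]) -> list[Region]:
    """Keep only hits that do not overlap any excluded region."""
    return [h for h in hits if not any(overlaps(h, e) for e in excluded)]


# Hypothetical coordinates, for illustration only.
known_clusters = [Region("contig1", 10_000, 60_000)]
motif_hits = [
    Region("contig1", 5_000, 5_021),    # outside the known cluster
    Region("contig1", 30_000, 30_021),  # inside the known cluster
    Region("contig2", 30_000, 30_021),  # same coordinates, different contig
]
novel_candidates = filter_candidates(motif_hits, known_clusters)
```

Hits associated with primary metabolism could be removed in the same way by adding their regions to the excluded list; deciding which genes belong to primary metabolism remains a manual curation step.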
The preliminary step in the INBEKT workflow is the generation of
5′ upstream regions (UTRs) of annotated open reading
frames for the subsequent screening for regulator binding sequences.
GetFeature (Table 2) is a web-based application that can be
used to generate the 5′UTRs of a genome of interest. Sequence
files have to be uploaded as EMBL or GenBank files with annotated
open reading frames. Before submitting data, a 5′UTR and
“all” locus tags need to be selected (Fig. 3). The GetFeature output
data should be saved in FASTA format (see Note 3).
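For readers who prefer to script this step, the following Python sketch shows the general idea of cutting out upstream regions. It is not the GetFeature implementation: the function names, the 0-based half-open coordinates, the default length of 300 bp, and the toy sequence are all assumptions, and contigs are treated as linear.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def upstream_region(genome: str, start: int, end: int, strand: int,
                    length: int = 300) -> str:
    """Return up to `length` bases upstream of a gene, read 5'->3' on the gene's strand.

    The gene occupies genome[start:end]; strand is +1 or -1.
    Regions are truncated at the contig ends (no circular wrap-around).
    """
    if strand == 1:
        return genome[max(0, start - length):start]
    # Minus strand: the upstream region lies to the right of `end`.
    return reverse_complement(genome[end:end + length])


def to_fasta(records: dict[str, str]) -> str:
    """Format {identifier: sequence} pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())


# Toy example with a 16-bp "genome" and one gene at positions 6..9.
toy_genome = "AAACCCGGGTTTACGT"
plus_upstream = upstream_region(toy_genome, 6, 9, strand=1, length=3)    # "CCC"
minus_upstream = upstream_region(toy_genome, 6, 9, strand=-1, length=3)  # "AAA"
```

Gene coordinates and strands would normally be read from the annotated GenBank or EMBL file; using locus tags as FASTA identifiers keeps the hits traceable (see Note 2).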
Scanning these nucleotide data for the presence of selected
regulator binding sequences will deliver candidate gene regions,
which have to be critically sorted and evaluated. Some published
consensus sequences can be accessed easily from databases like
PRODORIC® (PROcaryotIC Database Of Gene Regulation) [46],
a database that organizes information about gene regulation and
gene expression in prokaryotes, or CollecTF, a database for
transcription factor binding sites in bacteria [47]. A more
comprehensive list of databases is provided at the MEME-Suite web portal
[48]. In general, it is advisable to use a known regulator binding
sequence from an organism that is closely related to the organism
of interest. For example, when screening for iron-repressed genes,
it has to be considered that Gram-negative and low-GC
Gram-positive bacteria use the ferric uptake regulator (Fur) as
iron-responsive repressor, while high-GC Gram-positive bacteria use its
functional ortholog belonging to the DtxR protein family.
Scanning the nucleotide data for the presence of rationally chosen
and user-provided motifs can be performed by using tools like FIMO
(Find Individual Motif Occurrences) [49] or PatScanUI [50].
Here, we present an exemplary INBEKT workflow, where the
A. orientalis HCCB10007 genome was screened for the presence
of zinc uptake regulator (Zur) binding sequences. Zur is the major
bacterial regulator sensing zinc concentrations. It represses the
transcription of genes encoding zinc uptake and zinc mobilization
functions by binding to palindromic A/T-rich sequences found in
the promoters of its DNA targets [51]. All 8121 A. orientalis
HCCB10007 5′UTRs were uploaded to PatScanUI as a FASTA file.
Fig. 3 Overview of the INBEKT workflow
The described Zur binding sequence (TCATGAAAATCATTTTCANNA)
of Streptomyces coelicolor [52] was chosen as a motif
to screen for zinc-repressed genes. The maximum number of allowed
mismatches was set to 5 (Fig. 3). The optional settings for mismatches,
insertions, and deletions should be chosen empirically, so that the
search is permissive enough to detect a suitable number of genes but
strict enough to exclude most false positives. To estimate whether
the total number of detected genes is plausible, it can be compared
to the number of genes known to belong to corresponding regulons.
For example, the Zur regulons characterized so far usually comprise
10–30 genes.
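The effect of the mismatch setting can be explored with a naive scanner such as the Python sketch below. It is a simple sliding-window comparison and does not reproduce how PatScanUI or FIMO work internally: insertions and deletions are not handled, only "N" is supported as a wildcard, and the function names and the test sequence are assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_mismatches(motif, window):
    """Count differing positions; an 'N' in the motif matches any base."""
    return sum(1 for m, w in zip(motif, window) if m != "N" and m != w)


def scan(sequence, motif, max_mismatches):
    """Yield (position, strand, mismatches) for each window within the limit.

    Both strands are searched by also scanning with the reverse
    complement of the motif; positions are 0-based on the input sequence.
    """
    sequence = sequence.upper()
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        for i in range(len(sequence) - len(pattern) + 1):
            mismatches = count_mismatches(pattern, sequence[i:i + len(pattern)])
            if mismatches <= max_mismatches:
                yield i, strand, mismatches


ZUR_MOTIF = "TCATGAAAATCATTTTCANNA"  # S. coelicolor Zur motif quoted in the text
# Hypothetical upstream sequence containing one perfect motif copy.
upstream = "GG" + "TCATGAAAATCATTTTCAGCA" + "CC"
exact_hits = list(scan(upstream, ZUR_MOTIF, max_mismatches=0))
relaxed_hits = list(scan(upstream, ZUR_MOTIF, max_mismatches=5))
```

Because the motif is nearly palindromic, relaxing the threshold reports the same site on both strands, which is one reason why hit lists grow quickly as more mismatches are allowed.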
The Zur screening in A. orientalis HCCB10007 revealed a set
of 11 genes (Fig. 3), which are putatively zinc regulated. The predicted functions of the candidate genes were assigned to known
pathways when possible. Hit number 9 (AORI_6197), which is
detected neither by the latest version of antiSMASH nor by the
ClusterFinder algorithm, encodes a protein proposed to be
involved in the synthesis of a nonproteinogenic amino acid. Such
nonproteinogenic amino acids are common building blocks of various
secondary metabolites. The identified AORI_6197 is highly
similar to AesA of Amycolatopsis japonica [45], which was shown to
be essential for the synthesis of an unusual chelating agent that is
produced in a zinc-responsive manner. To date, A. orientalis
HCCB10007 has not been described to produce such a compound.
3.3 Determining the Boundaries of a Gene Cluster
Clusters predicted by antiSMASH most probably do not display
the correct boundaries, because the antiSMASH pipeline is
designed to set the cluster boundaries automatically at 5, 10, or
20 kb on each side of the last signature gene, depending on the
type of gene. Without experimental validation, the real cluster
boundaries cannot be predicted exactly, but they can be estimated
by comparing SMGCs in different genomes of related bacterial
species. In the following section, we present a set of tools to help
with the estimation of gene cluster boundaries.
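The fixed-extension rule can be written down in a few lines of Python to show why the predicted boundaries are only provisional. This is a sketch of the rule as described above, not antiSMASH code; the function name and coordinates are hypothetical, and the extension is passed in as a parameter because the mapping of gene type to 5, 10, or 20 kb is not specified here.

```python
def provisional_boundaries(signature_genes, extension, contig_length):
    """Extend the span of the signature genes by `extension` bp on both sides.

    `signature_genes` is a list of (start, end) tuples on one contig.
    The result is clipped to the contig, so it may be shorter than intended.
    """
    left = min(start for start, _ in signature_genes)
    right = max(end for _, end in signature_genes)
    return max(0, left - extension), min(contig_length, right + extension)


# Hypothetical signature genes on a 50-kb contig, extended by 20 kb.
bounds = provisional_boundaries([(25_000, 30_000), (31_000, 40_000)],
                                extension=20_000, contig_length=50_000)
```

The resulting interval depends only on the positions of the signature genes and the chosen extension, not on the actual gene content of the flanking regions, which is why a comparison with related genomes is needed.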
3.3.1 Comparison of antiSMASH Results with MIBiG
If the cluster of interest is similar to a known SMGC, cluster
boundaries can be deduced from the additional antiSMASH output
data. For this purpose, the “find homologous gene clusters” and “find
known homologous gene clusters” views in antiSMASH may be
helpful. A comparison of A. orientalis HCCB10007 cluster 3 with
the respective MIBiG entry for vancomycin reveals high similarity
in the modular structure as well as in the set of
flanking genes and therefore allows the gene cluster
boundaries to be estimated by simple comparison. Although antiSMASH
usually overestimates the gene cluster size, the known cluster is
sometimes even larger than the cluster determined by antiSMASH. In this
case, the raw sequence should be carefully inspected.
Most of the SMGCs predicted by antiSMASH will not have a
high similarity with clusters from the MIBiG database. In the following section, we describe the different tools that can be used to
detect the SMGC boundaries by comparing the region of interest
with similar regions in other, closely related bacterial strains.
3.3.2 JGI-IMG/ER: Comparative Genomics for Genomes Published in the JGI Database
An easy-to-use web tool is the JGI-IMG/ER (Integrated Microbial
Genomes Expert Review) genome viewer on the JGI webpage
(Table 2) [53]. JGI-IMG/ER offers different genomics applications,
including a genome viewer, metabolic pathway identification,
annotation tools, and phylogenetic clustering programs. It is
necessary to register on the JGI webpage to get access to the
JGI-IMG/ER applications. One helpful tool is the gene neighborhood
view, which can be difficult to find among the diverse applications.
First, several genomes of interest should be loaded into the
genome cart. Using the “Find Genes” function, it is possible to
search for specific genes by name, function, or locus tag. For each
gene, an overview page with some general annotation information,
such as gene families, clusters of orthologous groups (COGs), or
protein family (PFAM) domains, is given. Scrolling down, the
gene is shown in its direct neighborhood. Choosing the option
“Show neighborhood with this gene’s bidirectional best hits” will
open a view of all selected genomes from the cart that share
orthologs of roughly the same size in the region of interest. By comparing
the strain of interest with other strains that share the same or
similar biosynthetic genes, in particular by comparing the gene cluster
flanking regions, the putative boundaries of the cluster can be
determined (Fig. 4).
Fig. 4 JGI IMG/ER “Ortholog Neighborhoods” view. Part of the vancomycin gene cluster of A. orientalis
HCCB10007 (GCA_000400635.2) compared to the homologous regions from A. decaplanina DSM 44594
(GCA_000342005.1), A. alba DSM 44262 (GCA_000384215.1), and A. balhimycina DSM 44591
(GCA_000384295.1) clusters (top down). The cluster is highlighted in gray to indicate the boundaries. Genes
conserved only in closely related strains are highlighted in orange
3.3.3 Genome Comparison with MultiGeneBlast
MultiGeneBlast is an open source tool for the identification and
comparison of multigene regions, such as gene clusters (Table 2)
[54]. First, a database comprising the genome sequences of several
closely related species must be built. This can be done either from
GenBank entries on the NCBI server or from *.gbk or *.fasta files
stored on a personal hard drive (see Note 7). As an example, the
GenBank file of the vancomycin cluster (cluster 3) downloaded
from the antiSMASH results for A. orientalis HCCB10007 can be
used as an input file. To create an example database, the “whole
genome assembly” files from the NCBI server for Amycolatopsis
balhimycina FH1894, Amycolatopsis decaplanina DSM 44594,
A. orientalis B-37, Amycolatopsis alba DSM 44262, Amycolatopsis
mediterranei S699, and A. mediterranei U32 can be downloaded
by choosing “Database” → “Create from online GenBank entries.”
By choosing “Make raw nucleotide database for tblastn-search,”
FASTA instead of GenBank files are used. Using the option tab
“File,” it is possible to select the created database and open an input
file. For a standard approach, the default settings are sufficient to
determine the gene cluster boundaries. The results are displayed in
*.xhtml format. Gene clusters similar to the vancomycin cluster
are present in the A. decaplanina, A. alba, and A. balhimycina
genomes. The minimum set of homologous genes shared by all
strains harboring the SMGC can be seen best in A. balhimycina.
Based on this information the boundaries can be estimated (Fig. 5).
Conversely, it is sometimes possible to define a cluster by the
exclusion of cluster parts that are present in all genomes, not only those
harboring the SMGC of interest (see Note 8).
Fig. 5 MultiGeneBlast example *.xhtml output for the vancomycin biosynthesis gene cluster of A. orientalis
HCCB10007 (GCA_000400635.2) compared to the homologous gene clusters from A. decaplanina DSM 44594
(GCA_000342005.1), A. alba DSM 44262 (GCA_000384215.1), and A. balhimycina DSM 44591
(GCA_000384295.1) clusters. The cluster is highlighted in gray to indicate the boundaries. Genes conserved only in
closely related strains are highlighted in orange
3.4 Working with Draft Genomes
To date, most published bacterial genomes are not complete
but are split into short contigs or larger scaffolds. Contigs
are overlapping next-generation sequencing reads assembled into
DNA sequences of high confidence, which can vary in length from
a few hundred bases up to several Mb. As a simple measure, the
genome quality can be estimated from the number of
contigs (fewer contigs = better assembly) and from the N50 value,
a length-weighted median contig length: contigs of this length or
longer contain at least half of the assembled bases
(higher N50 = better assembly). Scaffolds are
assemblies of contigs for which the relative position, but not the
connecting sequence, is known. These “gaps” are expressed as
longer stretches of “N”s in the genome sequence. A more detailed
description of genome assembly methods for the interested
reader is given in [55, 56].
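As a worked example, the N50 value can be computed from a list of contig lengths as sketched below in Python. The function name and the example lengths are illustrative; assembly tools usually report this value directly.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    Contigs are sorted from longest to shortest and their lengths are
    summed until at least half of all assembled bases are covered; the
    length of the contig that reaches this point is the N50.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical assembly of six contigs (300 bases in total).
example_n50 = n50([100, 80, 50, 30, 20, 20])
```

In this example the two longest contigs (100 + 80 bases) already cover more than half of the 300 assembled bases, so the N50 is 80, whereas the arithmetic mean contig length is only 50.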
Unfortunately for the natural product researcher, contig ends
are often located in the middle of a secondary metabolite gene
cluster, notably in the highly repetitive, large type-I-PKS or NRPS
clusters. When working with draft genomes, one should be suspicious
when either the antiSMASH gene cluster ends abruptly with
a type-I-PKS or NRPS-like structure, or some flanking enzymes
present in related SMGCs are missing. The cluster position on the
contig should be taken into account to rule out that the observed
cluster represents only a part of the cluster, while the other
part is located on a different contig.
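A simple positional check of this kind could look like the following Python sketch. The function name and the margin of 1000 bp are assumptions chosen for illustration; the check only flags candidates for closer manual inspection.

```python
def touches_contig_edge(cluster_start, cluster_end, contig_length, margin=1000):
    """True if the cluster begins or ends within `margin` bp of a contig end."""
    return cluster_start <= margin or contig_length - cluster_end <= margin


# Hypothetical clusters: one running into the contig end, one well inside.
truncated = touches_contig_edge(0, 42_000, 42_500)
internal = touches_contig_edge(20_000, 60_000, 100_000)
```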
Fig. 6 (a) JGI-IMG/ER “Ortholog neighborhoods” view of A. orientalis DSM 40040 and the homologous regions
in the A. azurea DSM 43854 genome which are located on two different contigs. (b) MultiGeneBlast comparison of ortholog gene clusters. A. orientalis HCCB10007 serves as a query. The respective region in the A.
azurea DSM 43854 genome is located on two different contigs
The easiest method to connect gene clusters on different contigs is to map the complete draft genome to a reference genome
using mapping software such as CONTIGuator (Table 2) [57].
The drawback of this approach is that it only works for highly similar SMGCs on highly similar genomes, e.g., different strains from
the same bacterial species. The risk is that the mapping program
misassembles the highly similar modules of type-I-PKS or NRPS genes,
in which case the resulting mapped cluster will be much shorter than the
real cluster. Mapped clusters should therefore always be carefully
validated.
The previously described JGI-IMG/ER and MultiGeneBlast
applications can be of use to assign partial clusters on different
contigs to one gene cluster. With the JGI-IMG/ER viewer “show
neighborhood” option, contigs with sequence homology to a
complete reference sequence are automatically mapped (Fig. 6a).
If a draft genome is included in the MultiGeneBlast database, the
separated SMGCs will be displayed as a match with a complete
reference SMGC (Fig. 6b). As an example, the draft genome
sequence of A. azurea DSM 43854 has been used. When analyzing the A. azurea DSM 43854 genome with antiSMASH, gene
cluster 1 appeared suspicious, because cluster and contig ended
right within an NRPS gene. When comparing the gene cluster with
similar clusters in the JGI-IMG/ER and with MultiGeneBlast,
it appeared that the SMGC is spread over two distinct contigs
(Fig. 6).
Even without a complete reference gene cluster for comparison,
it is possible to determine which contigs should be merged into
one cluster. Within type-I-PKS and NRPS clusters, KS-domains or
C-domains of the same type usually share 85–90 % sequence identity. This
information can be used to deduce which KS-/C-domains belong
to the same cluster and, when comparing similar clusters on
different genomes, which gene clusters probably encode highly
similar products. A useful tool to accomplish this is NaPDoS, the
Natural Product Domain Seeker (Table 2). NaPDoS is a web-based
bioinformatic tool that can be used to identify and extract KS- and
C-domains according to a BLAST search against a database of
different PKS, NRPS, and hybrid clusters [58]. Furthermore, with
NaPDoS it is possible to construct a phylogenetic tree based on the
pairwise sequence similarity of KS- or C-domains. Another program
for the construction of phylogenetic trees, which allows choosing
between different tree-finding algorithms and is more flexible in
the choice of parameters, is MEGA 6.0 [59]. For the assignment of
SMGCs according to their phylogeny, this means that KS- or
C-domains belonging to the same cluster, but present on different
contigs, are likely to fall into the same phylogenetic clade.
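The identity-based reasoning can be sketched in Python as follows. This is neither the NaPDoS nor the MEGA procedure: it assumes that the domain sequences have already been aligned to equal length (gaps written as "-"), uses simple single-linkage grouping instead of a phylogenetic tree, and all names, toy sequences, and the 85 % cutoff are illustrative.

```python
def percent_identity(a, b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def group_domains(aligned, threshold):
    """Single-linkage grouping of domains.

    `aligned` maps domain names to aligned sequences. Two domains end up
    in the same group if they are connected by a chain of pairs whose
    identity is at or above `threshold` (in percent).
    """
    groups = []
    for name in aligned:
        linked = [g for g in groups
                  if any(percent_identity(aligned[name], aligned[m]) >= threshold
                         for m in g)]
        merged = {name}.union(*linked)
        groups = [g for g in groups if g not in linked] + [merged]
    return groups


# Toy 10-residue "KS-domain" alignments from three hypothetical contigs.
toy_alignment = {
    "contig1_KS1": "MKVAIVGMAC",
    "contig7_KS1": "MKVAIVGMSC",   # 90 % identical to contig1_KS1
    "contig3_KS1": "MRLTVVGAGC",   # 40 % identical to both others
}
domain_groups = group_domains(toy_alignment, threshold=85.0)
```

Here the domains from contig 1 and contig 7 fall into one group, which would make these two contigs candidates for belonging to the same cluster; real domains are several hundred residues long and should be aligned with a dedicated alignment program first.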
With the information gained from all three approaches, it is
feasible to merge the contigs. If the sequences on both contigs
overlap, the contigs can be assembled manually; otherwise, it is
common to indicate a gap of unknown length as a stretch of “N”s
in the sequence.
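Both outcomes can be illustrated with a short Python sketch. It assumes that the two contigs are already in the correct order and orientation and that an overlap, if present, is an exact match; the function name, the minimum overlap, and the gap length of 100 “N”s are assumptions for illustration.

```python
def merge_contigs(left, right, min_overlap=20, gap_length=100):
    """Join two contigs that are adjacent in the order left -> right.

    If the end of `left` exactly matches the start of `right` over at
    least `min_overlap` bases, the longest such overlap is collapsed.
    Otherwise a run of "N" marks a gap of unknown length.
    """
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return left + "N" * gap_length + right


# Toy sequences: the first pair overlaps by six bases, the second does not.
joined = merge_contigs("AAAACCCCGGGG", "CCGGGGTTTT", min_overlap=4)
gapped = merge_contigs("ACGT", "TTGA", min_overlap=2, gap_length=5)
```

Within repetitive type-I-PKS or NRPS genes, short exact overlaps can occur by chance between different modules, so a merged sequence should be checked against the comparative evidence before it is used.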
3.5 Prioritization of Gene Clusters
Finally, when all desired gene clusters have been identified and their
boundaries estimated, the main remaining question is:
Which of the secondary metabolite gene clusters are worth investigating?
Whether one searches for variations of already known compounds or
for completely new compounds, the procedure is essentially the same.
It is necessary to compare the new clusters to clusters encoding
already known compounds, to identify which known SMGCs show
similar gene composition and therefore are likely to produce similar compounds. Based on this information, it is possible to classify
gene cluster families. In the next section, the secondary metabolite
databases and recent approaches for the classification of SMGC
families are listed.
3.5.1 Dereplication by Comparison of Genes and Predicted Products with Secondary Metabolite Databases
To start, a comparison with the continuously growing MIBiG
database [34] is a good option. A link to MIBiG is directly included
in the antiSMASH output data in the “known homologous gene
clusters” view. Furthermore, antiSMASH provides a prediction of
the substrate specificity for the modular components of PKSs and
NRPSs. A comparison of the predicted monomers can indicate whether
the products of two SMGCs are the same, slightly
different, or completely different. A final similarity check
can be performed by BLAST analysis.
A different approach for dereplication, which is based on the
phylogeny of typical SMGC domains, is used in PRISM. Here,
possible variants for each product are predicted and subsequently
each variant is compared to a large natural product library [38].
Further natural product databases are ClustScan [41], DoBISCUIT
[60], ClusterMine360 [61], and IMG-ABC [62].
3.5.2 Classification of Gene Cluster Families
A currently discussed approach for the prioritization of biosynthetic
gene clusters is the classification of all secondary metabolite gene
clusters discovered so far into gene cluster families. An SMGC
classification should help to avoid the rediscovery of known compounds
and to predict the structure and function of novel secondary metabolites
by comparison with related clusters. This process is still in its early stages.
Nevertheless, we would like to briefly introduce the different
approaches toward solving this task.
1. A first approach was based on the definition of operational
biosynthetic units (OBUs) for PKS and NRPS biosynthetic
gene clusters. OBUs were classified according to a similar gene
content and organization. Clusters with an amino
acid sequence identity of at least 90 % for KS-domains and 85 % for
C-domains were grouped [63]. However, this approach is limited
to PKS and NRPS clusters and has so far only been applied to
Salinispora spp. Our recent experience (data not published)
shows that these thresholds are not applicable to some other
bacterial genera.
2. Another approach was based on the combination of three different similarity metrics: (a) the number of orthologous genes
shared by two biosynthetic gene clusters, (b) the amount of
each cluster shared in a PROmer alignment, and (c) in simplified terms,
the number of corresponding signature genes in two clusters,
expressed as percentage values. A similarity matrix
that weights the three indices differently
(a: 25 %, b: 25 %, c: 50 %) allowed these data to be
represented as a distance network, clearly visualizing distinct gene
cluster families [64]. The drawback of this method is the possibility that clusters producing only precursor molecules or
smaller subunits can cluster in the same groups as larger and
more complex secondary metabolites.
3. The third approach combined two similarity metrics: (a) the
Jaccard index to measure the similarity (presence or absence)
of PFAM domains from all vs. all SMGCs (weighted: 36 %) and
(b) the domain duplication index to measure similarities in
the numbers of PFAM domains (weighted: 64 %) [44]. Cytoscape
was used for the graphical visualization of the similarity
values [65]. However, this method is not well suited for the
comparison of highly repetitive multimodular clusters, such as
type-I-PKS and NRPS clusters.
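To make the weighting schemes concrete, the following Python sketch computes a Jaccard index for two sets of PFAM domains and combines individual similarity scores into one weighted value. It is not the implementation used in [44] or [64]: the function names and the example domain lists are illustrative, and the domain duplication score is passed in as a precomputed number because its formula is not given here.

```python
def jaccard_index(domains_a, domains_b):
    """Shared domain types divided by all domain types found in either cluster."""
    a, b = set(domains_a), set(domains_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def weighted_similarity(scores, weights):
    """Weighted sum of similarity scores (each 0..1); weights must sum to 1."""
    if len(scores) != len(weights):
        raise ValueError("one weight per score is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(s * w for s, w in zip(scores, weights))


# Illustrative PFAM domain content of two hypothetical clusters.
cluster_a = ["PF00109", "PF02801", "PF00550", "PF00698"]
cluster_b = ["PF00109", "PF02801", "PF00550", "PF00975"]
jaccard = jaccard_index(cluster_a, cluster_b)  # 3 shared of 5 total = 0.6

# Third approach: Jaccard index (36 %) plus a domain duplication score (64 %).
# The value 0.5 for the duplication score is a made-up placeholder.
combined = weighted_similarity([jaccard, 0.5], [0.36, 0.64])
```

The same helper covers the second approach by passing three scores with the weights 0.25, 0.25, and 0.5. Computing the combined value for all cluster pairs yields the similarity matrix that is then visualized as a network.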
Finally, deciding which clusters are worth investigating depends
strongly on the user's objective. If the detection of a completely new
compound is desired, clusters not belonging to any known SMGC
family should be prioritized. If the objective is to find a structural
and functional variant of a known compound, members of a
specific SMGC family might be of interest.
For future approaches, it is necessary to define generally applicable
standard rules for the classification of secondary metabolite
cluster families. Furthermore, specialized algorithms for the different
types of gene clusters are needed.
4 Notes
1. Genome mining is not limited to genomic sequence data
but can also be performed on sequenced cosmid libraries and
metagenomic data. However, the programs presented in this
chapter are not recommended for raw sequence data or very
short assembled sequences, as they need sequences of a certain
length as input data. For genome mining of raw metagenomic
data, eSNaPD [66] is recommended.
2. When converting files, using annotation tools, etc., the same
headers/identifiers should be used throughout, so that it is always possible
to identify the respective region of interest within a larger
dataset. This is particularly important when the genome is split
across multiple contigs. Headers should be kept short, because
some programs have problems with headers longer than 20
characters.
3. Filename extensions can make a difference. There are several
common filename extensions for FASTA (*.fa, *.fas, *.fasta,
*.fna for nucleotide, *.faa for amino acid) or GenBank (*.gbk,
*.gb, *.gbf) files in use. For example, if a program does not
accept FASTA data with a *.fas extension as input file, changing the extension to *.fa or *.fasta could help.
4. Poor genome annotation influences the antiSMASH results.
When using antiSMASH with FASTA sequences the detected
SMGCs are additionally annotated using an up-to-date annotation pipeline. Therefore, it is recommended to run antiSMASH
with GenBank and FASTA sequences and to compare the
results concerning missing genes or entire missing clusters.
5. antiSMASH often merges neighboring clusters into one large
cluster. These false hybrid clusters can be distinguished from
real hybrid clusters by comparison and detection of the cluster
boundaries with MultiGeneBlast and JGI-IMG/ER.
6. Sometimes, MbtH-like structures (MbtH proteins bind to
NRPS proteins to stimulate adenylation reactions) are wrongly
classified as lantipeptide core peptides in antiSMASH. This is
often seen in NRPS clusters. To distinguish real lantipeptide