
Because of its breadth, MUMmer can, at first glance, be an overwhelming sea of scripts and subroutines. This document walks the user through some of the more useful modules of the package, and provides example data and expected outputs to ensure the correct and productive operation of MUMmer. All example data is real DNA sequence from various eukaryotic and prokaryotic organisms, and can be found in its entirety in the data directory. Although the input sequences are only subsections of their respective genomes, they have been carefully selected to permit speedy and informative walk-throughs. It is not necessary to download all of the data at once, as each subsection will have separate links to the relevant files.
For further information regarding any of the MUMmer programs or their output formats, please refer to the online MUMmer manual.
MapView is a utility script for displaying sequence alignments as provided
by NUCmer or PROmer. It takes the output from show-coords and converts
it to a FIG, PDF or PS image file. By default, it produces FIG files which can
be viewed with the common system utility xfig or converted to PDF
or PS with the fig2dev utility (neither program is included with
MUMmer). mapview is useful for mapping multiple query contigs (e.g.
from a draft sequencing project) against an annotated reference sequence. Exons
and other features can also be plotted with the NUCmer or PROmer alignments,
aiding in exon refinement and analysis. Individual MUMmer hits are plotted according
to their percent identity, making regions of high or low similarity easily distinguishable.
In the following sections, a short example is given that demonstrates how to
use mapview. Since nucmer and promer
have nearly identical user interfaces, the alignments for this example will be
generated using promer. This example aligns a few query sequences
to a single reference sequence using promer, and then uses mapview
to plot the resulting areas of conservation and the reference sequence annotation.
D_melanogaster_2Rslice.cds
D_melanogaster_2Rslice.fasta
D_melanogaster_2Rslice.utr
D_pseudoobscura_contigs.fasta
Please complete the PROmer walk-through in order to generate
the alignment between the Drosophila melanogaster chromosome 2R segment
and the 2 contigs from Drosophila pseudoobscura. The PROmer walk-through
will generate the .coords file that is necessary to continue with
the rest of this tutorial. If already familiar with the promer
alignment script, simply continue this tutorial using the supplied promer.coords
file. Note that when generating the .coords file with show-coords
it is important to use the -l -r options (and optionally the -k
option) in order to generate the proper input format for mapview.
The output of show-coords is then used by MapView to create a
FIG, PDF or PS file.
mapview -n 1 -p mapview promer.coords
The -n option is used to set the number of output files to 1.
By default, MapView partitions its output among 10 files in order to keep the
figures for large comparisons small. Since we are only comparing a small slice
of the actual chromosome, only 1 file will be needed. The output of this command
will be a single file named mapview_0.fig. A more informative plot
can be generated by supplying a UTR and CDS coordinate file in GFF
format. These files contain annotation information that will be plotted
alongside the PROmer alignments, thus making it possible to compare the conserved
regions with annotated exon positions.
mapview -n 1 -p mapview promer.coords D_melanogaster_2Rslice.utr D_melanogaster_2Rslice.cds
This will generate a single file, mapview_0.fig, that will have
the annotation information displayed above the blue reference rectangle. Below,
you can see this file displayed with the xfig viewer. The only difference between
this file and the file produced without the UTR and CDS files is the set of annotation
rectangles above the blue rectangle at the very top of the figure.
To generate PDF output, use the same command plus the -f pdf
option.
mapview -n 1 -f pdf -p mapview promer.coords D_melanogaster_2Rslice.utr D_melanogaster_2Rslice.cds
This will generate the same image, mapview_0.pdf, but in PDF format.
The above MapView FIG shows a 220 kbp slice of D. melanogaster chromosome 2R and its alignment to D. pseudoobscura. The alignment, generated by PROmer, shows all regions of conserved amino acid sequence. The blue rectangle spanning the figure represents the reference (D. melanogaster), with annotated genes shown above it and the PROmer alignments shown below it. Alternative splice variants of the same gene are stacked vertically. Exons are shown as boxes, with intervening introns connecting them. The 5' and 3' UTRs are colored pink and blue to indicate the gene's direction of translation. PROmer matches are shown twice: once just below the reference genome, where all matches are collapsed into red boxes, and again in a larger display showing the separate matches within each contig, where the contigs are colored differently to indicate contig boundaries. The vertical position of the matches indicates their percent identity, ranging from 50% at the bottom of the display to 100% just below the red rectangles. Percent identity is that of the amino acid translations used by PROmer. Matches from the same query sequence are connected by lines of the same color.
mummer is a suffix tree algorithm designed to find maximal exact
matches of some minimum length between two input sequences. The match lists
produced by mummer can be used alone to generate alignment dot
plots, or can be passed on to the clustering algorithms for the identification
of longer non-exact regions of conservation. These match lists have great versatility
because they contain huge amounts of information and can be passed forward to
other interpretation programs for clustering, analysis, searching, etc.
In the following sections, a short example is given that demonstrates how to
use mummer. This example compares a single query sequence to a
single reference sequence using mummer, and then uses mummerplot
to generate a dot plot representation of the comparison.
mummer can handle multiple reference and multiple query sequences;
however, a dotplot of more than two sequences can be confusing, so for
this example we will be dealing with a single reference and a single query
sequence.
mummer -mum -b -c H_pylori26695_Eslice.fasta H_pyloriJ99_Eslice.fasta > mummer.mums
This command will find all maximal unique matches (-mum) between
the reference and query on both the forward and reverse strands (-b)
and report all the match positions relative to the forward strand (-c).
Output is to stdout, so we will redirect it into a file named mummer.mums.
This file lists all of the MUMs of the default length or greater between the
two input sequences.
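The match list can be inspected with standard text tools before plotting. The following is a minimal sketch, not part of MUMmer, that counts the forward and reverse matches in such a file and sums their lengths. It assumes the usual mummer 3 layout, in which each query starts with a header line beginning with ">" (with the word Reverse appended for reverse-strand matches) and the last column of every match line is the match length. The function name mum_summary is hypothetical.

```shell
# mum_summary FILE
# Count matches and total matched bases per strand in a mummer match list.
# Assumption: header lines start with ">" and reverse-strand headers end with
# the word "Reverse"; the last column of every match line is the match length.
mum_summary() {
    awk '
        /^>/    { strand = ($NF == "Reverse") ? "reverse" : "forward"; next }
        NF >= 3 { n[strand]++; bp[strand] += $NF }
        END {
            printf "forward %d %d\n", n["forward"], bp["forward"]
            printf "reverse %d %d\n", n["reverse"], bp["reverse"]
        }
    ' "$1"
}
```

Calling mum_summary mummer.mums prints one line per strand containing the strand name, the number of matches and their combined length.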
A dotplot of all the MUMs between two sequences can reveal their macroscopic similarity.
mummerplot -x "[0,275287]" -y "[0,265111]" -postscript -p mummer mummer.mums
This command will plot all of the MUMs in the mummer.mums file
in postscript format (-postscript) between the given ranges for
the X and Y axes. When plotting mummer output, it is necessary
to use the lengths of the input sequences to set the plot ranges, otherwise
the plot will be automatically scaled around the minimum and maximum data points.
The four output files are prefixed by the string specified with the -p
option. The plot files contain the data points, mummer.gp
is a gnuplot script for plotting the data points in the plot files,
and mummer.ps is the postscript plot generated by the gnuplot script.
Below, you can see the mummer.ps file displayed with ghostview.
Note that with newer versions of mummerplot the color and thickness
of the plot lines may be different.
Most image manipulation programs can edit the postscript output, or it can
be sent directly to a printer with the lpr command. If you would
rather use the default terminal for gnuplot, simply remove the -postscript
option from the mummerplot call.
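The -x and -y ranges given to mummerplot above are set from the lengths of the input sequences. If those lengths are not already known, they can be computed from the FASTA files. The helper below is a small sketch under the assumption that each file holds a single sequence, as in this example; the name fasta_len is hypothetical and is not a MUMmer utility.

```shell
# fasta_len FILE
# Print the number of sequence characters in FILE, ignoring header lines
# and whitespace. Intended for single-sequence FASTA files only.
fasta_len() {
    grep -v '^>' "$1" | tr -d ' \t\r\n' | wc -c | tr -d ' '
}

# Example (not run here):
#   mummerplot -x "[0,$(fasta_len H_pylori26695_Eslice.fasta)]" \
#              -y "[0,$(fasta_len H_pyloriJ99_Eslice.fasta)]" \
#              -postscript -p mummer mummer.mums
```

For a multi-sequence FASTA file the function would report the combined length of all records, which is not what the plot ranges need.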
The above postscript plot represents the set of all MUMs between the two input sequences used in this example. Forward MUMs are plotted as red lines/dots while reverse MUMs are plotted as green lines/dots (blue may be used for reverse matches in newer versions). A line of dots with slope == 1 represents an undisturbed segment of conservation between the two sequences, while a line of slope == -1 represents an inverted segment of conservation between the two sequences. The green segment in the upper left quadrant of the graph shows both an inversion and translocation, as it is of negative slope and inconsistently located relative to the rest of the plot which falls on a line approximated by f(x) = x. However the green segment in the upper right quadrant of the graph shows only an inversion, as it is of negative slope but is consistent in location with the rest of the plot. Generally, the closer a plot is to an imaginary line f(x) = x (or -x) the fewer macroscopic differences exist between the two sequences.
nucmer is MUMmer's most user-friendly alignment script for
standard DNA sequence alignment. It is a robust pipeline that allows for multiple
reference and multiple query sequences to be aligned in a many vs. many fashion.
For instance, a very common use for nucmer is to determine the
position and orientation of a set of sequence contigs in relation to a finished
sequence; however, it can be just as effective in comparing two finished sequences
to one another.
In the following sections, a short example is given that demonstrates how to
use nucmer. This example aligns a set of draft sequence contigs
to a finished sequence using nucmer; displays the alignment coordinates
using show-coords; and tiles them across the reference using show-tiling.
Like mummer, nucmer can handle multiple reference
and query sequences, however it is most commonly used to map a set of query
sequences to a single reference sequence. This example will demonstrate that
functionality, as a number of B. anthracis draft contigs will be mapped
to the final assembly.
nucmer -maxmatch -c 100 -p nucmer B_anthracis_Mslice.fasta B_anthracis_contigs.fasta
To ensure all contigs were mapped, all maximal matches were used as alignment
anchors (-maxmatch), and because of the sequence similarity the
minimum cluster size was raised to 100 (-c 100). The two output
files are prefixed by the string specified with the -p option.
nucmer.delta is an
encoded file that represents the alignment between the two inputs. At this stage,
the alignment of the two inputs is complete, however it is necessary to parse
the nucmer.delta file with the provided utilities in order to extract
useful information from the comparison.
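Although the provided utilities are the intended way to read a delta file, a quick look at which sequence pairs it contains can be a useful sanity check. The sketch below assumes the delta layout in which each reference/query pair is introduced by a header line of the form ">refID qryID refLength qryLength" and each alignment within that pair starts with a line of seven integers. The function name delta_pairs is hypothetical.

```shell
# delta_pairs FILE
# For each reference/query pair in a delta file, print:
#   refID qryID refLength qryLength numberOfAlignments
# Assumption: pair headers start with ">" and every alignment begins with a
# line of exactly seven whitespace-separated fields.
delta_pairs() {
    awk '
        /^>/ {
            if (pair != "") print pair, count
            pair = substr($1, 2) " " $2 " " $3 " " $4
            count = 0
            next
        }
        NF == 7 { count++ }
        END { if (pair != "") print pair, count }
    ' "$1"
}
```

Running delta_pairs nucmer.delta lists every contig that produced at least one alignment, which makes it easy to spot contigs that are absent from the output.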
To view a summary of all the alignments produced by NUCmer, we need to run
the nucmer.delta file through the show-coords utility.
show-coords -r -c -l nucmer.delta > nucmer.coords
This command will list the coordinates, percent identities and other useful
statistics of each alignment in a table. Each line of the table represents an
individual pairwise alignment, and each line is sorted by its starting reference
coordinate (-r). Additional information, like alignment coverage
(-c) and sequence length (-l) can be added to the
table with the appropriate options. Output is to stdout, so we
have redirected it into the file, nucmer.coords.
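To check how many alignments each contig contributed, the table can be summarized with a short script. This is a sketch only: it assumes the default show-coords layout, in which a line made up entirely of "=" characters separates the column headers from the alignment rows and the query sequence ID is the last whitespace-separated field on each row. The function name coords_query_counts is hypothetical.

```shell
# coords_query_counts FILE
# Print "queryID numberOfAlignments" for every query in a show-coords table.
# Assumption: rows follow a separator line consisting only of "=" characters,
# and the query ID is the last field of each row.
coords_query_counts() {
    awk '
        in_table && NF > 0 { n[$NF]++ }
        /^=+$/             { in_table = 1 }
        END                { for (q in n) print q, n[q] }
    ' "$1" | sort
}
```

Calling coords_query_counts nucmer.coords gives one line per contig; a contig with many rows may be fragmented or repetitive relative to the reference.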
To view a summary of all the SNPs and indels between the two sequence sets,
we need to run the nucmer.delta file through the show-snps
utility.
show-snps -C nucmer.delta > nucmer.snps
This will generate a report of all the SNPs internal to the alignments contained
in the nucmer.delta file. Each line of the table represents a single
mismatch in the pairwise alignment. With the -C option, only SNPs
from uniquely aligned regions will be reported. Additional information can be
added or removed with the command line switches described in the manual. Output
is to stdout, so we have redirected it into the file, nucmer.snps.
To produce a minimal tiling of contigs across the reference sequence, we need
to run the nucmer.delta file through the show-tiling
utility.
show-tiling nucmer.delta > nucmer.tiling
This command will list the contigs and positions that generate the maximal alignment coverage across the reference sequence using the fewest contigs possible. This output can aid the closure of a draft genome when a closely related organism has already been finished.
nucmer and show-tiling output can both be viewed
with mummerplot; however, these plots would offer little additional information
for this example. mapview can also be used to display
the output of show-coords, as is shown in the mapview
walkthrough.
promer is a close relative to the NUCmer script. It follows the
exact same steps as NUCmer and even uses most of the same programs in its pipeline,
with one exception - all matching and alignment routines are performed on the
six frame amino acid translation of the DNA input sequence. This provides promer
with a much higher sensitivity than nucmer because protein sequences
tend to diverge much more slowly than their underlying DNA sequence. Therefore,
on the same input sequences, promer may find many conserved regions
that nucmer will not, simply because the DNA sequence is not as
highly conserved as the amino acid translation.
In the following sections, a short example is given that demonstrates how to
use promer. This example aligns a few query sequences to a single
reference sequence using promer; displays the alignment coordinates
using show-coords; and prints a pairwise alignment of one of the
contigs using show-aligns.
Like mummer, promer can handle multiple reference
and query sequences, however it is most commonly used to map a set of query
sequences to a single reference sequence. This example will demonstrate that
functionality, as two D. pseudoobscura draft contigs will be mapped
to the final D. melanogaster assembly.
promer -p promer D_melanogaster_2Rslice.fasta D_pseudoobscura_contigs.fasta
Default parameters were used to align the two inputs, however if the alignment
is too sensitive or not sensitive enough the minimum match length and cluster
sizes can be adjusted accordingly. The two output files are prefixed by the
string specified with the -p option. promer.delta is an encoded file that represents
the alignment between the two inputs. At this stage, the alignment of the two
inputs is complete, however it is necessary to parse the promer.delta
file with the provided utilities in order to extract useful information from
the comparison.
To view a summary of all the alignments produced by PROmer, we need to run
the promer.delta file through the show-coords utility.
show-coords -r -c -l -L 100 -I 50 promer.delta > promer.coords
This command will list the coordinates, percent identities and other useful
statistics of each alignment in a table. Each line of the table represents an
individual pairwise alignment, and each line is sorted by its starting reference
coordinate (-r). Additional information, like alignment coverage
(-c) and sequence length (-l) can be added to the
table with the appropriate options. Minimum length (-L) and
minimum percent identity (-I) cutoffs can also be specified to filter out
poor alignments. Output is to stdout, so we have redirected it
into the file, promer.coords. If this file is planned for input
to mapview, it is important to always use the -r -c
-l options.
To view all the pairwise alignments between two of the input sequences, we
need to run the promer.delta file through the show-aligns
utility.
show-aligns promer.delta "D_melanogaster_2Rslice" "3214968" > promer.aligns
This command will print all of the pairwise alignments stored in the promer.delta
file for the sequences "D_melanogaster_2Rslice" and "3214968".
Output is to stdout, so we have redirected it into the file, promer.aligns.
If the alignments do not fit within your screen width, or you would like them
to be printed on longer lines, the screen width can be adjusted with the -w
option. Since show-aligns only displays the alignments between
two sequences, it will have to be run separately for each desired pair of sequences.
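When several query sequences are of interest, the separate show-aligns calls can be generated rather than typed by hand. The sketch below only prints the commands, one per record in the query FASTA file, so that they can be reviewed before being run. The function names fasta_ids and print_show_aligns and the per-contig output file names are hypothetical; the show-aligns argument order follows the command shown above.

```shell
# fasta_ids FILE
# Print the ID (first word after ">") of every record in a FASTA file.
fasta_ids() {
    awk '/^>/ { print substr($1, 2) }' "$1"
}

# print_show_aligns DELTA REF_ID QUERY_FASTA
# Print (do not run) one show-aligns command per query sequence.
# The "<queryID>.aligns" output names are an illustrative choice.
print_show_aligns() {
    fasta_ids "$3" | while read -r qry; do
        printf 'show-aligns %s "%s" "%s" > %s.aligns\n' "$1" "$2" "$qry" "$qry"
    done
}
```

For example, print_show_aligns promer.delta D_melanogaster_2Rslice D_pseudoobscura_contigs.fasta prints one command line for each contig in the query file.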
promer and show-tiling output can both be viewed
with mummerplot; however, these plots would offer little additional information
for this example. mapview can also be used to display
the output of show-coords, as is shown in the mapview
walkthrough, which uses the promer.coords file generated in
this example to generate a plot of the alignment.
run-mummer1 is a legacy script from the original MUMmer1.0 release.
It has been updated to utilize the new suffix tree code of version 3.0, however
all other programs called from this script are identical to the original MUMmer
release back in 1999. Even though it is an outdated program, it still has some
advantages over the newer alignment scripts (nucmer, promer,
run-mummer3). Like all of the alignment scripts, run-mummer1
is a three step process - matching, clustering and extension. However, unlike
the newer alignment scripts, run-mummer1 uses the gaps
program for its clustering step. Unlike mgaps, the gaps program does not allow
for rearrangements; instead, it finds the single longest
increasing subset of matches across the full length of both sequences. This
makes it well suited for SNP and small indel identification between small (<
10 Mbp), very similar sequences with few to no rearrangements.
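The idea of a single longest increasing subset can be illustrated with a small sketch. This is not the gaps implementation: it ignores match lengths, overlaps and any scoring, and simply reports the size of the largest set of matches whose query positions increase together with their reference positions. The input is assumed to be a plain text file with one match per line, reference position first and query position second; the function name lis_length is hypothetical.

```shell
# lis_length FILE
# FILE holds one match per line: reference_position query_position.
# Prints the size of the longest set of matches that is strictly increasing
# in query position once the matches are sorted by reference position.
# Simple O(n^2) dynamic programming, for illustration only.
lis_length() {
    sort -k1,1n "$1" | awk '
        { q[NR] = $2 + 0 }
        END {
            best = 0
            for (i = 1; i <= NR; i++) {
                len[i] = 1
                for (j = 1; j < i; j++)
                    if (q[j] < q[i] && len[j] + 1 > len[i])
                        len[i] = len[j] + 1
                if (len[i] > best) best = len[i]
            }
            print best
        }
    '
}
```

Matches that fall outside this increasing chain, such as those produced by an inversion or translocation, are simply left out, which is why this style of clustering suits sequences with few or no rearrangements.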
In the following sections, a short example is given that demonstrates how to
use run-mummer1. This example aligns a single query sequence to
a single reference sequence using run-mummer1.
run-mummer1 is only suited for a single reference and query sequence
that have few to zero inversions or translocations. This example aligns two
such sequences.
run-mummer1 H_pylori26695_Bslice.fasta H_pyloriJ99_Bslice.fasta mummer1
To adjust the minimum match length for the comparison, the user must manually
edit the run-mummer1 script. Output files are prefixed by the string
specified at the end of the command line call. mummer1.align displays
the alignments of each gap between adjacent MUMs, mummer1.errorsgaps
lists each MUM and the number of errors between it and the previous MUM, mummer1.gaps
lists the ordered set of MUMs and the gap distance to the previous MUM, and
mummer1.out simply lists all of the MUMs greater than or equal
to the minimum match length.
There are no visualization tools designed for run-mummer1 output.
To view a MUM dotplot, run mummer by itself on the two individual sequences
as demonstrated in the mummer walkthrough.
run-mummer3 is the simplest pipeline of the latest MUMmer3.0 programs.
It runs the same matching and clustering algorithm as nucmer and
promer, however it uses a different extension technique and does
not perform the important pre- and post-processing steps of NUC/PROmer. Because
of its simplistic form, run-mummer3 can only handle a single reference
sequence, but like run-mummer1 its error-focused output makes it
a handy tool for detecting SNPs and other small errors. The only major difference
between run-mummer3 and run-mummer1 is the new version's
ability to handle multiple query sequences and its tolerance of large rearrangements.
This makes run-mummer3 well suited for error detection between
highly similar sequences that may have large rearrangements, inversions etc.
In the following sections, a short example is given that demonstrates how to
use run-mummer3. This example aligns a single query sequence to
a single reference sequence using run-mummer3.
run-mummer3 can only handle a single reference sequence, but it
is capable of dealing with multiple query sequences. However, this example aligns
a single query sequence to a single reference sequence. Unlike run-mummer1,
run-mummer3 can handle inversions and translocations, but not with
the same grace as nucmer.
run-mummer3 H_pylori26695_Bslice.fasta H_pyloriJ99_Bslice.fasta mummer3
To adjust any of the alignment parameters, the user must manually edit the run-mummer3
script. Do not, however, add the -c option to the mummer
invocation, as it will confuse the next steps in the pipeline. It may be easier
to reverse complement the sequence yourself and run the script twice (once for
forward, second for reverse) with the -b option removed. Try adding
the -D option to the combineMUMs command line in the
script to output a format that is easier to parse for SNPs and small indels.
Output files are prefixed by the string specified at the end of the command
line call. mummer3.align displays the alignments of each gap between
adjacent MUMs, mummer3.errorsgaps lists each MUM and the number
of errors between it and the previous MUM, mummer3.gaps lists the
ordered set of MUMs and the gap distance to the previous MUM, and mummer3.out
simply lists all of the MUMs greater than or equal to the minimum match length.
The mummer3.out file is identical to the output of mummer
on a 1 vs many search, so it may be plotted as demonstrated in the mummer
walkthrough.
Please address questions and bug reports via Email to:
VERSION 3.17 - May 2005