Royal Academy of Sciences New Zealand Open Science
Open Science

MiFish, a set of universal PCR primers for metabarcoding environmental DNA from fishes: detection of more than 230 subtropical marine species

Published:

We developed a set of universal PCR primers (MiFish-U/E) for metabarcoding environmental DNA (eDNA) from fishes. Primers were designed using aligned whole mitochondrial genome (mitogenome) sequences from 880 species, supplemented by partial mitogenome sequences from 160 elasmobranchs (sharks and rays). The primers target a hypervariable region of the 12S rRNA gene (163–185 bp), which contains sufficient information to identify fishes to taxonomic family, genus and species except for some closely related congeners. To test versatility of the primers across a diverse range of fishes, we sampled eDNA from four tanks in the Okinawa Churaumi Aquarium with known species compositions, prepared dual-indexed libraries and performed paired-end sequencing of the region using high-throughput next-generation sequencing technologies. Out of the 180 marine fish species contained in the four tanks with reference sequences in a custom database, we detected 168 species (93.3%) distributed across 59 families and 123 genera. These fishes are not only taxonomically diverse, ranging from sharks and rays to higher teleosts, but are also greatly varied in their ecology, including both pelagic and benthic species living in shallow coastal to deep waters. We also sampled natural seawaters around coral reefs near the aquarium and detected 93 fish species using this approach. Of the 93 species, 64 were not detected in the four aquarium tanks, rendering the total number of species detected to 232 (from 70 families and 152 genera). The metabarcoding approach presented here is non-invasive, more efficient, more cost-effective and more sensitive than the traditional survey methods. It has the potential to serve as an alternative (or complementary) tool for biodiversity monitoring that revolutionizes natural resource management and ecological studies of fish communities on larger spatial and temporal scales.

1. Introduction

Environmental DNA (eDNA) in aquatic environments refers to genetic material found in the water column. In the case of multicellular organisms, eDNA originates from various sources, such as metabolic waste, damaged tissue or sloughed skin cells [1]. Ficetola et al. [2] was the first study demonstrating the use of eDNA for detecting an aquatic vertebrate species (invasive American bullfrog) from controlled environments and natural wetland, published in 2008. Subsequently, eDNA from fishes has been detected from various aquatic environments, including ponds [35], streams [6], rivers [710] and seawater [11,12]. Such ubiquitous presence of eDNA from fishes in the water column has led to the increasing use of this technique as a tool for detections of invasive [3,79], rare or threatened species [5,6], investigations of local fauna [10,13], or in a larger mesocosm [12] with known species composition. These pioneering studies have shown the use of eDNA to be appropriate as a non-invasive genetic monitoring tool in various fields of fish biology.

For monitoring the occurrence of a single or few fish species, short species-specific eDNA fragments (72–312 bp) have been used [3,59], with earlier studies detecting those species based on the presence/absence of PCR products by visually inspecting the products on an agarose gel stained with ethidium bromide [79]. More recently, quantitative PCR (qPCR) using probe-based chemistries has been employed for the detection of target species [36] owing to the method's sensitivity, specificity and potential to quantify the target DNA [6]. For example, Takahara et al. [4] estimated the biomass of common carp (Cyprinus carpio) in a natural freshwater lagoon, using the qPCR approach (real-time PCR), based on the positive relationships between eDNA concentrations and biomass in aquaria and experimental ponds.

For monitoring fish assemblages with broader taxonomic scopes, Minamoto et al. [10] designed degenerate PCR primers to amplify a short fragment of the mitochondrial cyt b gene (285 bp) with reference to those sequences from the local freshwater fish fauna. Based on PCR amplification of the fragment and subsequent subcloning and sequencing of the product, they successfully detected multiple species in eDNA from the controlled aquaria (one to five spp.) and three stations in the Yura River, central Japan (two to four spp.) [10]. Thomsen et al. [11] developed two generic and four species-specific PCR primer sets for amplifying short fragments of the cyt b gene (32–51 bp), in order to detect marine fish species from three sampling sites at a coastal zone in Denmark. Using a next-generation sequencing (NGS) platform (Roche 454 GS FLX), they detected 15 species in the amplicons, including both important commercial fishes as well as some species rarely recorded by conventional monitoring methods [11]. More recently, Kelly et al. [12] attempted to estimate the fish fauna in a large tank at the Monterey Bay Aquarium with known species composition by sequencing PCR amplicons from eDNA using an NGS platform (Illumina MiSeq). They used a set of published universal PCR primers to amplify a 106 bp fragment of the mitochondrial 12S rRNA gene [14] for metabarcoding fish species in the tank. Although they detected seven of the eight species of bony fishes present, they were able to identify those species only to taxonomic family or genus owing to the limited sequence variability within the amplicons. In addition, they failed to detect all three elasmobranchs (sharks and rays) contained in the tank [12].

These earlier studies on eDNA metabarcoding (high-throughput multispecies identification using degraded DNA extracted from an environmental sample [15]) have shown both potential and limitations. They are non-invasive and are demonstrably more efficient and cost-effective than the traditional monitoring methods, such as visual surveys, trawls and seines [11,12]. The former two studies [10,11], however, required development of PCR primers specifically designed with reference to DNA sequences from the known local fish fauna and those primers are of limited uses in future studies with little prior knowledge on the faunal composition. The latter study [12] employed PCR primers that have been developed using the computer software ‘ecoPrimers’ [14] and that are supposedly universal among vertebrates. Despite the use of universal primers, the successful detection in the aquarium tank was dependent on the taxonomic groups (e.g. no detection for ocean sunfish and all elasmobranchs), and the amplified products, if any, exhibited little sequence variability to correctly assign fish species in the same family or genus [12].

The primary objective of this study was to circumvent these problems associated with PCR primers. To achieve this goal, we: (i) developed universal primers for fish eDNA that amplify a short fragment (less than 200 bp) containing sufficient sequence variation to correctly assign fish species; (ii) tested versatility of the primers across a taxonomically and ecologically diverse range of fishes using eDNA from aquarium tanks with known species compositions; and (iii) preliminarily examined the use of the primers for detecting eDNA from fishes inhabiting natural seawater environments with unknown species composition and abundances in an open ecosystem.

The development of the universal primers (MiFish-U/E) was based on the aligned whole mitochondrial genome (mitogenome) sequences from 880 fish species, which was supplemented by partial mitogenome sequences from 160 elasmobranchs. The primers are targeted to amplify a hypervariable region of the 12S rRNA gene (163–185 bp), which contains sufficient information to unambiguously identify fishes we tested to taxonomic family, genus and species, with one exception (closely related congeners of Thunnus). We tested the versatility of those PCR primers using eDNA from four tanks in the Okinawa Churaumi Aquarium and from natural seawaters near the aquarium in the subtropical western North Pacific. Using a high-throughput Illumina MiSeq platform, we detected eDNA from 232 fish species from those seawaters, which are taxonomically diverse and are distributed across 70 families and 152 genera. In addition to eDNA, this metabarcoding approach is applicable to bulk samples (total DNA), such as those from net collections containing a diverse range of fish eggs, larvae, juveniles or damaged specimens with few diagnostic characters present for species identification.

2. Material and methods

2.1 Primer development

2.1.1 Selection of genetic marker

Mitochondrial DNA (mtDNA) was chosen as the genetic marker because copy number of mtDNA is greater than that of nuclear DNA per cell, and detection rate therefore is expected to be higher in the former, even where DNA is present at a low concentration and/or is degraded [16]. In order to select a suitable region in the mitogenome for species identification based on eDNA, 1044 whole mitogenome sequences were batch downloaded from the database MitoFish v. 2.80 [17] in a FASTA format as of 20 April 2013. After removing problematic sequences involving large-scale gene rearrangements [18], the remaining 880 sequences (electronic supplementary material, table S1) were subjected to multiple alignment using MAFFT v. 6.956 [19] with a default set of parameters. The aligned sequences were imported into Mesquite v. 2.75 [20] for visual inspection of the conservative and hypervariable regions. The search for a short hypervariable region (up to 200 bp for paired-end sequencing using the Illumina MiSeq) flanked by two conservative regions (ca 20–30 bp) across 880 species was performed on the entire set of aligned mitogenomes. The conservative and hypervariable regions were highlighted by a ‘Select’ function in Mesquite (a submenu ‘Variable among taxa’ in ‘Select Characters’) [20].

2.1.2 Primer design

To facilitate primer design based on comparisons of diverse sequences from 880 fish species, a base composition for a selected position in the conservative region was shown using a ‘Show Selection Summary Strip’ function in Mesquite[20]. The base compositions in selected characters were manually recorded in a spreadsheet for the primer design. In the primer design process, we considered a number of technical tips that enhance the primer annealing to the template without the uses of degenerate bases [21]: primers include some G/C at the 3′-ends to strengthen primer–template annealing at this position, but a string of either Gs or Cs at the 3′-end should be avoided; considering the unconventional base pairing in the T/G bond, the designed primers use G rather than A when the template is variably C or T, and T rather than C when the template is A or G; G/C contents of the primers fall between 40 and 60% with an almost identical melting temperature (Tm). Tm was calculated using a nearest-neighbour thermodynamic model implemented in OligoCalc [22].

The first universal primers for eDNA were designed on the 12S rRNA gene (for details, see Results and Discussion) and were named MiFish-U-F/R (with overhang adapter sequences for library preparation; U, F and R represent universal, forward and reverse, respectively). In addition, we had to design MiFish-E-F/R to accommodate sequence variations in the priming sites of elasmobranchs (E), with the primer designs based on newly assembled partial mitogenome sequences from 160 species (electronic supplementary material, table S2). For more accurate species assignments within closely related congeners, we also designed genus-specific primers that amplify a different mitogenomic gene (ND5) with significant variations across constituent species (e.g. MiFish-tuna).

2.1.3 Primer testing with extracted DNA

In order to test whether these newly designed PCR primers were universal or not, we first tested MiFish-U-F/R (no adapter sequences) using extracted DNA from 96 species representing all the four major lineages of fishes (Agnatha, Chondrichthyes, Actinopterygii and Sarcopterygii) placed in 47 orders and 96 different families (table 1). Double-stranded DNA concentrations from those fishes were measured with a NanoDrop Lite spectrophotometer (Thermo Fisher Scientific, Wilmington, DE, USA) and the extracted DNA was diluted to 15 ng μl−1using Milli-Q water. PCR was carried out with 30 cycles of a 15 μl reaction volume containing 8.3 μl sterile distilled H2O, 1.5 μl 10×PCR buffer (Takara, Otsu, Japan), 1.2 μl dNTPs (4 mM), 1.5 μl of each primer (5 μM), 0.07 μl Taq polymerase (Z Taq; Takara) and 1.0 μl template. The thermal cycle profile after an initial 2 min denaturation at 94°C was as follows: denaturation at 98°C for 5 s; annealing at 50°C for 10 s; and extension at 72°C for 10 s with the final extension at the same temperature for 5 min.

Table 1.

A list of fish species for testing MiFish-U primers (without adapter sequences) using extracted DNA diluted to 15 ng μl−1, subsequently sequenced with a Sanger method.

Double-stranded PCR products were purified using Exo SAP-IT (USB, Cleveland, OH, USA) to remove redundant dNTPs and oligonucleotides from primers. Direct cycle sequencing was performed with dye-labelled terminators (BigDyeterminator v. 1.1; Applied Biosystems, Foster City, CA, USA) following the manufacturer's protocol and the purified PCR products were sequenced for both strands on the ABI 3130xl Genetic Analyzer (Life Technologies, Carlsbad, CA, USA). The DNA sequences were edited and assembled using GENETYX-MAC v. 17 (Genetyx, Tokyo, Japan) and deposited in DDBJ/EMBL/GenBank databases.

2.1.4 In silico evaluation of interspecific variation

Interspecific differences within the amplified DNA sequences are required for accurate assignments of taxonomic categories. To computationally evaluate levels of interspecific variation in the target region (hereafter called ‘MiFish sequence’) across different taxonomic groups of fishes, 1361 whole mitogenome sequences were batch downloaded from MitoFish v. 2.89 [17] as of 3 September 2014. After removing duplicate sequences (e.g. multiple sequences from subspecies), uncertain taxonomic status (e.g. hybrids) and possible erroneous sequences (e.g. unable to annotate using MitoAnnotator [17]), the MiFish sequences were extracted from the remaining 1324 sequences using custom Ruby scripts (available from: http://dx.doi.org/10.5061/dryad.54v2q) and they were subjected to calculation of pairwise edit distances. The edit distance quantifies dissimilarity of sequences in bioinformatics [23] and is defined as the minimum number of single-nucleotide substitutions, insertions or deletions that are required to transform one sequence into the other. For comparisons, metabarcode sequences amplified by 12S-V5 primers [14] (forward: 5′-ACTGGGATTAGATACCCC-3′; and reverse: 5′-TAGAACAGGCTCCTCTAG-3′) (hereafter called ‘ecoPrimer sequences’) were also extracted from the 1324 sequences and their interspecific variation was evaluated as described for MiFish sequences. The ecoPrimer pair amplifies the same gene (mitochondrial 12S rRNA gene) as that of the MiFish-U/E primers, but the two primer pairs are designed to amplify two different regions adjacent to each other (12S-V5-Fprimer is located within MiFish-U-R primer). The ecoPrimer pair was used in a metabarcoding study of fishes by Kelly et al. [12] who attempted to estimate an artificial fish fauna using eDNA in the large tank at the Monterey Bay Aquarium.

2.2 Primer testing with environmental DNA

2.2.1 Sampling sites

In order to test the versatility of the newly designed primers for metabarcoding eDNA from fishes, we sampled seawater from four tanks in the Okinawa Churaumi Aquarium, Okinawa, Japan (26°41′39′′ N, 127°52′41′′ E; figure 1). The aquarium was chosen because of the remarkable taxonomic diversity of fishes contained in a variety of tanks that resemble surrounding environments in the subtropical western North Pacific. The four selected tanks; Kuroshio (water volume =7500 m3), tropical fish (700 m3), deep-sea (230 m3) and mangrove (35.6 m3) tanks (figure 1ad) harbour diverse groups of fishes (ca 250 species) from elasmobranchs (sharks and rays) to higher teleosts that vary greatly in their ecology, including both pelagic and benthic species living in shallow coastal to deep waters. In addition to these four aquarium tanks, we also sampled seawaters from coral reefs nearby the aquarium (26°42′35′′ N, 127°52′48′′ E; figure 1e,f) to preliminarily examine the use of the primers for metabarcoding eDNA from natural environments with unknown fish composition and abundances in an open ecosystem.

Figure 1.

Figure 1. (ad) Four tanks used for water sampling in the Okinawa Churaumi Aquarium and (e,f) a sampling site in the coral reefs near the aquarium: (a) Kuroshio (water volume =7500 m3); (b) tropical fish (700 m3); (c) deep-sea (230 m3); and (d) mangrove (35.6 m3) tanks; (e,f) sampling site in Bise (arrow; 26°42′35′′ N, 127°52′48′′ E) and the Okinawa Churaumi Aquarium (star; 26°41′39′′ N, 127°52′41′′ E).

2.2.2 Water sampling and DNA extraction

All sampling and filtering equipment was exposed to a 10% bleach solution for at least 30 min before use. For water samplings in the aquarium, approximately 10 l of seawater was collected from the surface using multiple casts of an 8 l polyethylene bucket fastened to a 10 m rope. The bucket was thoroughly prewashed with tank water. The sampling was conducted between 10.00 and 13.00 before daily feeding on two consecutive days (2 and 3 June 2014). The sampled water was stored in a valve-equipped 10 l book bottle and immediately brought to the laboratory before subsequent filtering. For water samples from the coral reefs near the aquarium, 10 l of seawater was collected in a similar manner on 4 June and 7 November 2014.

One to three 2 l lots of seawater from the 10 l samples were vacuum-filtered onto 47 mm diameter glass-fibre filters (nominal pore size, 0.7 μm; Whatman, Maidstone, UK). Each filter was wrapped in commercial aluminium foil and stored in −20°C before eDNA extraction. Two litres of Milli-Q water was used as the negative control and treated identically to the eDNA samples, to monitor contamination during the filtering and subsequent DNA extraction.

DNA was extracted from the filters using the DNeasy Blood and Tissue Kit (Qiagen, Hilden, Germany) in combination with a spin column (EZ-10; Bio Basic, Markham, Ontario, Canada). After removing the attached membrane from the spin column (EZ-10), the filter was tightly folded into a small cylindrical shape and placed in the spin column. The spin column was centrifuged at 6000g for 1 min to remove redundant seawater for DNA extraction. The column was then placed in a new 2 ml tube and subjected to lysis using proteinase K. Before lysis, Milli-Q water (400 μl), proteinase K (20 μl) and buffer AL (180 μl) were mixed and the mixed solution was gently pipetted onto the folded filter in the spin column. The column was then placed on a 56°C preheated aluminium heat block and incubated for 30 min. The spin columns were covered with commercial aluminium foil and a clean blanket for effective incubation at the specified temperature. After the incubation, the spin column was centrifuged at 6000g for 1 min to collect the DNA. In order to increase DNA yields from the filter, 300 μl of sterilized TE buffer was gently pipetted onto the folded filter and the spin column was again centrifuged at 6000g for 1 min. The collected DNA solution (ca 900 μl) was purified using the DNeasy Blood and Tissue Kit following the manufacture's protocol.

2.2.3 Paired-end library preparation and MiSeq sequencing

Two to five eDNA samples from each of the four aquarium tanks (total 14 samples; figure 1ad) and four eDNA samples from the coral reefs (figure 1e,f) were used for multiplex PCR using two universal primer pairs (MiFish-U/E). Of these 18 eDNA samples, five samples from the Kuroshio tank were additionally used for multiplex PCR using two universal plus one genus-specific primer pairs (MiFish-U/E/tuna) for correct assignments of Thunnus species.

Prior to library preparation, work-space and equipment were sterilized, filtered pipet tips were used and separation of pre- and post-PCR was carried out to safeguard against contamination. We also employed controls to monitor contamination including PCR blanks for each experiment.

Massively parallel paired-end sequencing on the MiSeq platform (Illumina, San Diego, CA, USA) requires PCR amplicons to be flanked by: (i) primer-binding sites for sequencing; (ii) dual-index (i.e. barcode) sequences; and (iii) adapter sequences for binding to the flowcells of the MiSeq. We employed a two-step tailed PCR approach to construct the paired-end libraries (figure 2).

Figure 2.

Figure 2. Schematic representation of the paired-end library preparation using a two-step tailed PCR. The workflow is derived from a document ‘16S metagenomic sequencing library preparation: preparing 16S ribosomal gene amplicons for the Illumina MiSeq system’ distributed by Illumina (part no. 15044223 Rev. B) and the figure was drawn with reference to a website of the Genomics and Sequencing Center at the University of Rhode Island (http://web.uri.edu/gsc/next-generation-sequencing/).

The first-round PCR (first PCR; figure 2) amplified the target region using primers 5′-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNNN + MiFish gene-specific sequences-3′ (forward) and 5′-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNN + MiFish gene-specific sequences-3′ (reverse). The first 33 and 34 nucleotides (nt) are partially used for primer-binding sites for sequencing and the following six random hexamers (N) are used to enhance cluster separation on the flowcells during initial base call calibrations on the MiSeq platform.

The first PCR was carried out with 35 cycles of a 12 μl reaction volume containing 6.0 μl 2×KAPA HiFi HotStart ReadyMix (including DNA polymerase, reaction buffer, dNTPs and MgCl2 (at a final concentration of 2.5 mM)) (KAPA Biosystems, Wilmington, MA, USA), 0.7 μl of each primer (5 μM), 2.6 μl sterile distilled H2O and 2.0 μl template. When the first PCR was multiplexed (simultaneous use of multiple primer pairs), the final concentration of each primer was 0.3 μM and sterile distilled H2O was added up to the total reaction volume of 12.0 μl. The thermal cycle profile after an initial 3 min denaturation at 95°C was as follows: denaturation at 98°C for 20 s; annealing at 65°C for 15 s; and extension at 72°C for 15 s with the final extension at the same temperature for 5 min.

The second-round PCR (second PCR; figure 2) used the first PCR products as a template and amplified the region using primers 5′-AATGATACGGCGACCACCGAGATCTACAXXXXXXXXACACTCTTTCCC TACACGACGCTCTTCCGATCT-3′ (forward) and 5′-CAAGCAGAAGACGGCATACGAGATXXXXXX XXGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT-3′ (reverse). The octo-X segments represent dual-index sequences (40 unique indices in total; A501–508, A701–712 and D501–508, D701–D712; Illumina); the 5′-end sequences are adapters that allow the final product to bind or hybridize to short oligos on the surface of the Illumina flowcell; and the 3′-end sequences are priming sites for the MiSeq sequencing.

The first PCR product was diluted 10 times using Milli-Q water and used as a template for the second PCR. The second PCR was carried out with 12 cycles of a 12 μl reaction volume containing 6.0 μl 2× KAPA HiFi HotStart ReadyMix, 0.7 μl each primer (5 μM), 3.6 μl sterile distilled H2O and 1.0 μl template. Different combinations of indices (chosen from A/D501–508 for forward primers and A/D701–712 for reverse primers) were used for different templates for a massively parallel sequencing using the MiSeq platform. The thermal cycle profile after an initial 3 min denaturation at 95°C was as follows: denaturation at 98°C for 20 s; annealing and extension combined at 72°C (shuttle PCR) for 15 s with the final extension at the same temperature for 5 min.

The indexed second PCR products were pooled in equal volumes and the pooled libraries (total 100 μl) were subjected to agarose gel electrophoresis using 2% L03 (Takara). A target size of the libraries (ca 370 bp) was excised from the gel and purified using a MinElute Gel Extraction kit (Qiagen) with an elution volume of 12 μl. The library concentration was estimated using a Qubit dsDNA HS assay kit and a Qubit fluorometer (Life Technologies). Double-stranded DNA concentration of the pooled library was adjusted to 4 nM (assuming 1 bp equals 660 g mol−1) using Milli-Q water and 5 μl of the 4 nM library was denatured with 5 μl of fresh 0.1 N NaOH. Including HT1 buffer (provided by the Illumina MiSeq v. 2 Reagent kit for 2×150 bp PE), the denatured library (10 μl; 2 nM) was diluted to the final concentration of 12 pM for sequencing on the MiSeq platform. A 30 μlof PhiX DNA spike-in control (12 pM) was added to improve data quality of low diversity samples such as single PCR amplicons used in this study.

2.2.4 Data pre-processing

An overall quality of the MiSeq reads was evaluated by the programs FastQC (available from http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and SUGAR [24]. After confirming a lack of technical errors in the MiSeq sequencing, low-quality tails were trimmed from each read using DynamicTrim.pl from the SolexaQA software package [25] with a cut-off threshold set at a Phred score of 10 (=10−1 error rate) [26]. The tail-trimmed paired-end reads (reads 1 and 2) were assembled using the software FLASH [27] with a minimum overlap of 10 bp. The assembled reads were further filtered by custom Perl scripts in order to remove reads with either ambiguous sites (Ns) or those showing unusual lengths with reference to the expected size of the PCR amplicons (297 ± 25 bp). Finally, the software TagCleaner [28] was used to remove primer sequences with a maximum of three-base mismatches and to transform the FASTQ [29] format into FASTA.

2.2.5 Taxonomic assignment

The pre-processed reads from the above custom pipeline were dereplicated using a ‘derep_fulllength’ command in UCLUST [30], with the number of identical reads added to the header line of the FASTA formatted data file. Those sequences represented by more than or equal to 10 identical reads were subjected to the downstream analyses and the remaining under-represented sequences (with less than 10 identical reads) were subjected to pairwise alignment using a ‘usearch_global’ command in UCLUST. If the latter sequences observed from less than 10 reads showed more than or equal to 99% identity with one of the former reads (one or two nucleotide differences), they were operationally considered as identical (owing to sequencing or PCR errors and/or actual nucleotide variations in the populations) and they were added to the more than or equal to 10 reads.

The processed reads were subjected to local BLASTN searches [31] against a custom-made database. The latter was generated by downloading all whole and partial fish mitogenome sequences deposited in MitoFish [17] and whole mitogenome sequences from tetrapods deposited in NCBI Organelle Genome Resources (http://www.ncbi.nlm.nih.gov/genomes/OrganelleResource.cgitaxid=32523) to cover those tetrapods occurring in aquatic environments. In addition, the custom database was supplemented by assembling new sequences in M.M.'s laboratory (electronic supplementary material, table S3). As of 4 October 2014, the database covers approximately 4230 fish species distributed across 457 families and 1827 genera. According to the latest edition of ‘Fishes of the World’ [32], fishes comprise 515 families, 1827 genera and 27 977 species with our custom-made database covering 88.7% of the families, 40.6% of the genera and 15.1% of the species.

The top BLAST hit with a sequence identity of more than or equal to 97% and E-value threshold of 10−5 was applied to species assignments of each representative sequence. We found that this cut-off value maximally recovered the species composition from each tank, while avoiding erroneous taxonomic assignment. Reliability of the species assignments were evaluated based on a ratio of total alignment length and number of mismatch bases between the query and reference sequences. For example, if a query sequence was aligned to the top BLAST hit sequence with an alignment length of 150 bp with one mismatch present, the ratio was calculated as 150/(1+1). Value one is added to the denominator to avoid zero-divisors. This ratio was calculated for the top and second BLAST hit species, and a log of odds ratio (LOD) score between these ratios was used as the comparable indicator of the species assignment. Results from the BLAST searches were automatically tabulated, with scientific names, common names, total number of the reads and representative sequences noted in an HTML format. Moreover, biological information for each detected species is available from the hyperlink in the table, such as that of FishBase (http://fishbase.sinica.edu.tw), Barcode of Life (http://www.boldsystems.org), GBIF (http://data.gbif.org), MitoFish (http://mitofish.aori.u-tokyo.ac.jp) and NCBI (http://www.ncbi.nlm.nih.gov) for quick evaluation and credibility of the bioinformatic identification.

The above bioinformatic pipeline from data pre-processing through taxonomic assignment (including Perl scripts) is available from http://dx.doi.org/10.5061/dryad.n245j and the function will be publicly available in MitoFish (http://mitofish.aori.u-tokyo.ac.jp).

3. Results and discussion

3.1 Primer development

3.1.1 MiFish-U

We visually inspected the aligned sequences throughout the entire mitogenomes across the 880 species (electronic supplementary material, table S1) by highlighting variable and invariable sites using Mesquite [20]. After repeated inspections, we found a short hypervariable region (ca 170 bp) within the 12S rRNA gene, which was flanked by highly conservative regions (ca 20–30 bp) across the 880 species (table 2). Note that we were unable to find such a region within the barcoding region of the aligned COI gene sequences, which have been frequently used as the marker of choice also in fishes [33]. This observation is consistent with a recent argument against the use of the COI gene as a genetic marker for metabarcoding studies [34].

Table 2.

Nucleotide sequences of the universal primers (MiFish-U) and base compositions in the selected 880 fish species (see electronic supplementary material, table S1). (This forward (F) and reversal (R) primer pair amplifies the mid region of the mitochondrial 12S rRNA gene with a mean length of 172 bp (163–185 bp).)

The hypervariable region in the 12S rRNA gene includes multiple segments that are forming big loops in a proposed secondary structure of the molecule [35,36]. In particular, four segments of the loops were so variable in length (involving multiple insertions/deletions) that they were considered unalignable even among closely related gobioid fishes in a previous study [37]. The two highly conservative regions, on the other hand, exhibit no length variations among the 880 species and were located on the two stem regions (stem nos. 15/16 and 24/25 in [35,36]), which undergo secondary structural constraints through strong Watson–Crick base pairings [35]. Following these empirical and theoretical observations, we decided to design a new primer pair located on the two conservative regions, thereby amplifying the highly taxonomic informative hypervariable region in between.

In the initial stage of this study, we designed degenerate PCR primers to accommodate sequence variations among taxa, but found that such degenerate primers did not amplify the target eDNA when they were used with long adapter sequences in the tailed PCR (figure 2). We redesigned a new set of primers without degenerate sites (MiFish-U) using various technical methods related to construction of adequate primers (see Material and methods). The new forward (MiFish-U-F) and reverse (MiFish-U-R) primers consist of 21 and 27 bases (table 2) with G/C contents of 57% and 44% and Tm of 56.6°C and 56.5°C, respectively.

With the redesigned MiFish-U primers (without adapter sequences), we confirmed successful amplifications of the hypervariable regions using extracted DNA from 96 species representing all of the four major lineages of fishes (Agnatha, Chondrichthyes, Actinopterygii and Sarcopterygii) distributed across 47 orders and 96 different families (table 1). With these PCR products, we successfully determined their nucleotide sequences using the conventional Sanger sequencing method. All the sequence data are available from DDBJ/EMBL/GenBank databases with accession numbers shown in table 1.

3.1.2 MiFish-E

During the preliminary experiments using eDNA from the aquarium tanks, we found that only a few assembled reads from the MiSeq sequencing represented elasmobranchs (sharks and rays). The lack of elasmobranch sequences was totally unexpected, because we included a number of elasmobranchs while designing the universal primers (13 spp.; see the electronic supplementary material, table S1) and more than 100 large-sized individuals of various elasmobranchs (mostly more than 1 m in total lengths; figure 1a) were present and active in the Kuroshio tank. We suspected that absence of the elasmobranch sequences resulted from PCR bias derived from primer–template mismatches. Inspection of the newly downloaded 160 elasmobranch sequences found only a few such mismatches (table 3), with significant ones being restricted to two sites near the 5′-end of the forward primer and in a single site near the 3′-end of the reverse primer. The newly designed primers for the elasmobranchs based on these mismatches were proved effective for amplification of the region, with all the species with reference sequences being detected by the MiSeq sequencing (see below). The new forward (MiFish-E-F) and reverse (MiFish-E-R) primers were designed in an identical region to that of the universal primers, consisting of 21 and 27 bases (table 3) with G/C contents of 52% and 41% and Tm of 54.1°C and 55.2°C, respectively, and were used with MiFish-U in multiplex PCR.

Table 3.

Nucleotide sequences of the universal primers more specifically designed for the elasmobranchs (sharks and rays; MiFish-E) and base compositions in the selected 160 species (electronic supplementary material, table S2). (Nucleotide differences from MitoFish-U are highlighted with underline in bold. This forward (F) and reverse (R) primer pair amplifies the mid region of the mitochondrial 12S rRNA gene with a mean length of 182 bp (170–185 bp).)