Royal Academy of Sciences New Zealand Open Science
Open Science

Explaining large mitochondrial sequence differences within a population sample

Published:

Mitochondrial DNA sequence is frequently used to infer species' boundaries, as divergence is relatively rapid when populations are reproductively isolated. However, the shared history of a non-recombining gene naturally leads to correlation of pairwise differences, resulting in mtDNA clusters that might be mistaken for evidence of multiple species. There are four distinct processes that can explain high levels of mtDNA sequence difference within a single sample. Here, we examine one case in detail as an exemplar to distinguish among competing hypotheses. Within our sample of tree wētā (Hemideina crassidens; Orthoptera), we found multiple mtDNA haplotypes for a protein-coding region (cytb/ND1) that differed by a maximum of 7.9%. From sequencing the whole mitochondrial genome of two representative individuals, we found evidence of constraining selection. Heterozygotes were as common as expected under random mating at five nuclear loci. Morphological traits and nuclear markers did not resolve the mtDNA groupings of individuals. We concluded that the large differences found among our sample of mtDNA sequences were simply owing to a large population size over an extended period of time allowing an equilibrium between mutation and drift to retain a great deal of genetic diversity within a single species.

1. Introduction

High levels of mitochondrial sequence divergence can be misconstrued as indicating the presence of more than one species if DNA barcoding approaches are applied naively. Although the acquisition of mitochondrial sequence data has been promoted as a tool for identification of specimens [1,2], the application of mtDNA data frequently raises questions rather than discriminating among hypotheses [3]. However, the capacity for assigning specimens to previously identified, classified and sequenced maternal lineages is valuable for many aspects of biology which require this kind of identification. Issues relating to biosecurity, forensics, conservation and animal diet where extensive reference collections have been made with the necessary investment in mtDNA or cpDNA sequencing benefit from this approach, e.g. [48]. Unfortunately, 15 years since its introduction the fundamental flaws of DNA barcoding as an approach for taxonomic discovery [911] are still often ignored. MtDNA ‘barcode’ sequence variation above approximately 3% coupled with spatial structure or sympatry tends to lead to inferences of morphologically cryptic taxonomic diversity, even though theory provides other explanations. The presence within a single interbreeding population of multiple mtDNA lineages or haplogroups that differ by more than 3% sequence divergence is, for example, likely if populations have been large for many generations. Here, we demonstrate how hypotheses that explain this pattern can be distinguished by using a New Zealand orthopteran as a case study.

Four biological processes might lead to high genetic diversity at a locus within a single population sample. In a stable population, neutral genetic variants arise at a predictable rate and are lost stochastically (drift) at a rate determined by population size [12,13]. In large populations, the presence of many individuals increases the potential for new mutations each generation and the rate of allele loss via drift is relatively slow. Typically, therefore, invertebrate populations with relatively small individuals and large population size contain much more genetic diversity than large vertebrate species with their small effective populations, for example, compare mammalian and avian host with insect parasite diversity levels, e.g. [1416]. High genetic diversity within a sample might therefore be owing to persistence of a large population at equilibrium between mutation and drift [13]. Large constant population sizes result in a distribution of pairwise differences that deviate from the expected exponential distribution owing to their common history [17], which can mislead species delimitation tools that rely on a single non-recombining locus and so result in taxonomic over-splitting [18,19]. By contrast, non-neutral alleles are lost or go to fixation more rapidly at a rate proportional to their selection coefficient and so are less likely to reveal high diversity. High genetic diversity within a single population sample could potentially be explained by balancing selection if fitness was determined by allele frequency in the population. In the case of mtDNA sequences, balancing selection is possible if the variation involves expressed differences. Comparing the number of synonymous changes with the number of non-synonymous changes allows inference of the likely selection process [20]. Neutral genetic variation can be explained by population size in the absence of evidence of other processes.

The third and fourth processes leading to a putative single population sample containing relatively high genetic diversity are cryptic species [21] and recent introgression [22]. Elegant population genetics tools are available to differentiate between these possible sources of genetic diversity if information is available from more than one marker [2328].

Morphological variation derived from genotypic variation can provide the information needed to determine whether high genetic diversity results from sampling more than one species [21]. Concordance of characters and the absence of intermediates provide evidence for multiple genotypic clusters (species; [29]). Independent populations would show linkage of alleles at independent loci and a lack of heterozygote individuals. Infection with the parasitic bacteria Wolbachia could result in reproductive incompatibility between individuals infected and uninfected or between individuals infected with different bacteria stains [30]. In the long term, this would be expected to result in concordance of mtDNA haplotypes and nuclear alleles and a deficiency of heterozygote individuals; however, recent invasion of Wolbachia strains might resemble introgression. Maternally transmitted parasites are genetically linked to the mitochondria and if they cause cytoplasmic incompatibility will result in a loss of mtDNA diversity [31]. The presence of cryptic species provides no expectation of geographical limits, but hybridization is usually geographically constrained [32,33], producing a restricted region or zone of high diversity. Within such regions, bimodal distributions of genetic and/or phenotypic characters are indicative of selection against hybrids [25]. In the case where high mtDNA diversity in a putative population arises from hybridization, an association of unlinked markers would be expected.

In New Zealand, many endemic invertebrates have been subjects of phylogeographical investigations, with their mtDNA genetic diversity used to infer historical range changes [34,35]. Interpretation of notably high levels of mtDNA sequence variation within samples varies (reviewed in Trewick et al. [35]). For example, mtDNA differences have been ascribed to an accelerated rate of mtDNA evolution [36], or ancient (Pliocene) subdivision of populations and subsequent hybridization [37]. The contrast between Southern and Northern hemisphere study systems is striking. Low diversity in many invertebrate species in Europe and North America is concordant with the recent absence of suitable habitat owing to southward extension of polar ice sheets during the last glacial maximum (approx. 32 000–28 000 years ago; [3841]). Despite large modern species' ranges, low diversity in these areas is inferred as resulting from recent range expansion [23]. By contrast, although smaller, New Zealand lost disproportionally less terrestrial habitat to snow and ice, and increased land area from lower sea level plus relatively benign climate owing to maritime amelioration ensured widespread survival of forest/scrub habitat [4244].

One endemic insect, the tree wētā Hemideina crassidens (Orthoptera; Ensifera; Anostostomatidae), was probably widespread in North Island New Zealand during the last glacial maximum [44]. This arboreal wētā species contains three distinct lineages of mtDNA haplotypes (figure 1) with average genetic distances among haplogroups of 0.07–0.09 (GTR model with variable base frequencies and symmetrical substitution matrix) and much higher levels than typically accepted within species [1]. One H. crassidens mtDNA haplogroup is restricted to wētā samples from the southern part of the species’ range, concordant with a distinct 19-chromosome race [45]. Two others (haplogroups 2 and 3) have overlapping occurrence in South and North Island H. crassidens (figure 1). Both mtDNA haplogroups 2 and 3 occur within the 15-chromosome race of this insect [44]. In a separate study, we sought evidence of Wolbachia infections in a range of New Zealand insect species. We did not detect Wolbachia-like DNA sequences in any sampled population of Hemideina species, including individuals from both mtDNA haplogroups 2 and 3 of H. crassidens [46].

Figure 1.

Figure 1. Spatial distribution of mtDNA haplotype diversity within the New Zealand tree wētēt H. crassidens (yellow). A network of all mtDNA haplotypes observed within a single population sample (Rangiwahia) illustrates difference between DNA sequences and their relative frequency (circle area is scaled to sample size).

Our aim was to determine what biological process most likely led to the high mtDNA diversity within the 15-chromosome race of H. crassidens. We looked for spatial structure in the mtDNA diversity over the full landscape of the species for evidence of partitions in the distribution of genetic variation that would be a signature of past isolation followed by hybridization. We then focused on a sample from a single location where both lineages are present. By examining nuclear markers, we sought evidence of partitioning of nuclear genetic variation concordant with mtDNA variation that would support a hypothesis of two species. We compared the rate of silent substitutions to expressed substitutions using the whole mtDNA genome to detect any signature of selection.

2. Material and methods

2.1. Spatial structure of mtDNA diversity within Hemideina crassidens

The distribution of mitochondrial sequence variation within the natural range of H. crassidens was explored to identify potential spatial structure that might arise from historical processes or selection (data published in Bulgarella et al. [44]). We examined the distribution of the three distinct mitochondrial haplogroups in relation to both latitude and elevation. We visualized genetic distance patterns across the complete range of H. crassidens using three-dimensional surface plots known as ‘genetic landscape shapes' using Alleles In Space (AIS, http://www.marksgeneticsoftware.net/, [47]). To do this, a connectivity network of sampled individuals was constructed and genetic distances (Zi, calculated as the average proportion of nucleotide differences between individuals from different sampling areas) assigned to landscape coordinates at midpoints (Xi, Yi) of the connectivity edges, as described in Miller et al. [48]. Genetic distances were then inferred using an interpolation procedure on a uniformly spaced grid overlaid on the sampled landscape. Finally, a three-dimensional surface plot where X and Y coordinates correspond to geographical locations (latitude and longitude) on a rectangular grid and surface plot heights (Z) reflecting genetic distances was generated [49,50], We selected a 50 × 50 grid and a distance weighting parameter (a) of 1. Additional analyses were performed using a variety of grid sizes (50 × 50, 60 × 60, 100 × 100), distance weighting parameters (a = 0.5–2) and raw and residual genetic distances to make sure that interpretations were not sensitive to these parameters, as recommended.

High nucleotide diversity (owing to the coexistence of haplotypes from two distinct mitochondrial haplogroups) was recorded in a tree wētā sample from Rangiwahia Forest Park (39°53′38.51 S; 176°0′10.80 E) in central North Island New Zealand [44]. Additional specimens were collected at this location for the present analysis. A random sample of wētā with respect to age and sex was extracted from their daytime refuge holes in tree branches. Authority to collect came from the New Zealand Department of Conservation (TW-32116-FAU). All individuals were identified as H. crassidens using a combination of colour and spine traits that differentiates this species from other Hemideina species [51].

2.2. Morphological variation

Two morphological traits were recorded for each individual: total number of stridulatory ridges and total number of hind tibia dorsal spines. Neither of these traits correlates with wētā instar, and therefore, wētā at any age can be included in the analysis. Tree wētā use their abdominal stridulatory ridges to make rasping sounds when rubbed against pegs on the retrolateral surface of the hind femur, probably for intraspecific communication [52]. Ridge numbers vary within and among tree wētā species [53]. The number of ridges on the left and right sides of the abdomen was combined. The number of prolateral hind tibia spines differentiates H. crassidens and Hemideina thoracica at some locations suggesting interspecific selection. We therefore recorded all dorsal hind tibia spines (left and right) for each individual at Rangiwahia. Tubercles with rounded ends rather than points (as seen in spines) were recorded as half spines (0.5) if they occurred in the normal position for a dorsal tibia spine.

2.3. mtDNA analysis

Genomic DNA was extracted from each individual and an approximately 800 bp mtDNA fragment amplified, as previously described [44]. The mtDNA fragment spanning part of cytochrome b (cytb), tRNA serine and part of NADH dehydrogenase 1 (ND1) sequences were checked, edited and aligned in Geneious v. 8.1.4 (http://www.geneious.com, [54]). We calculated average nucleotide site diversity (π) for the population sample using the method of Tajima [55], with an alignment of mtDNA sequence and tools provided in Geneious. We estimated population size using the equation N = π/4u [13]. Three mutation rate estimates from published work were used to calculate population size: Drosophila rate of 6.2 × 10−8 [56]; Daphnia 1.5 × 10−7 [57] and Nematode 7.6 × 10−8 [58]. The tRNASer sequence within the mtDNA fragment was removed from the dataset before networks were inferred. Deletions and insertions (INDELS) in this tRNA region differentiate the haplotypes, but this variation is not readily incorporated into network analysis nor estimates of genetic distance. A minimum spanning network [59] was inferred with PopArt [60] using 29 sequences, trimmed to 519 bp. Aligned mtDNA sequences are available at http://evolves.massey.ac.nz/Data/Rangiwahia%20weta.zip and http://dx.doi.org/10.5061/dryad.rg15p [61].

2.4. Nuclear markers

One coding, putatively neutral nuclear locus (Sperm flagallar protein; Sflag; [62]) was amplified and sequenced, and coded as genotype data for analysis in combination with microsatellite loci. The 400 bp fragment of the nuclear gene (Sflag) was amplified and sequenced using the primers TWnucSflagF (5′-TCGCCAGTTCAGACCTAGGATGAGG-3′) and TWnucSflagR (5′-TGGCTCTGTACAAGGCTGGGA-3′) [62]. Only two Sflag alleles were detected and as some wētā were homozygous at this locus, heterozygote individuals could be identified and genotyped as they contained two nucleotides at each of the three sites that differentiate the two alleles.

Four microsatellite loci (HM04, HR12, HR35 and HR43) were amplified using primers and PCR conditions developed for other species within the genus Hemideina [63,64]. One primer from each pair was fluorescently labelled with either 6-FAM, PET or VIC dyes (Applied Biosystems). Fluorescent DNA fragments including a size standard were detected using an ABI 3130xl genetic analyser (Applied Biosystems) and sized with the Geneious microsat plugin and individuals manually genotyped (data available at http://evolves.massey.ac.nz/Data/Rangiwahia%20weta.zip and http://dx.doi.org/10.5061/dryad.rg15p) [61].

Evidence of linkage disequilibrium and deviations from the Hardy–Weinberg equilibrium were examined using Genepop v. 4.2 [65,66], with the following Markov chain parameters: 10 000 dememorization, 100 batches and 10 000 iterations per batch. Evidence of linkage disequilibrium used probability tests with the Bonferroni correction for multiple tests (n = 10; p < 0.005). Where two species are sympatric, but rarely or never interbreeding, alleles at independent loci are expected to show linkage disequilibrium as heterozygote individuals will be scarce or absent. With our wētā sample treated as a single population sample, detection of heterozygote deficiency would indicate non-random mating.

With the data partitioned between wētā from mtDNA haplogroup 2 (n = 23) and wētā haplogroup 3 (n = 6), we looked for evidence for these being non-interbreeding sympatric populations (e.g. different species). If distinct, differentiation at nuclear loci would result in significant departure of FST from zero. Pairwise FST was estimated using genotypes at five nuclear loci with Genepop v. 4.2.

2.5. Whole mtDNA selection

Two individuals, representing two mtDNA lineages (haplogroups 2 and 3), were used to calculate the KA/KS ratio in the coding regions of the mtDNA using both maximum-likelihood and approximate methods with KaKs_Calculator [67]. The ratio of non-synonymous substitutions (KA) to synonymous substitutions (KS) identifies the selective pressures occurring in protein-coding genes; a ratio of 1 implies neutral selection, greater than 1 implies positive selection and less than 1 implies constraining (purifying) selection. Constraining selection is predicted for coding DNA in large populations. Total genomic DNA from two individuals (H. crassidens haplogroup 2 (North) and H. crassidens haplogroup 3 (South)) were separately processed through massive parallel, high-throughput sequencing (Illumina HiSeq 2000) to obtain whole mitochondrial sequences for a separate phylogenetic study [68]. Genomic DNA was fragmented, prepared using the ThruPLEX® DNA-seq Kit (Rubicon Genomics) and used to generate 100 bp paired-end sequence. Adapter sequence barcodes and poor-quality data were removed from Illumina sequence reads using Cutadapt [69]. The quality of the data was assessed using SolexaQA [70] before de novo assembly with Velvet v. 1.1 [71] and Abyss v. 2.3.3 [72] using a k-mer length of 49. Contig files were compared to annotated Orthoptera mitochondrial sequences available on GenBank (NCBI). We found evidence of nuclear copies of some regions of the mtDNA and were able to remove paralogs using relative copy number and protein translation. The resulting assembled mitochondrial genomes were 16 535 and 15 905 bp in length (total 11 199 bp of coding regions) with the same gene order previously identified in Ensifera [73], but with the inclusion of a repetitive unit of several hundred bp in length between tRNAQ and cox1. Fasta files of the two H. crassidens mitochondrial genome sequences can be downloaded from http://evolves.massey.ac.nz/Data/Rangiwahia%20weta.zip and http://dx.doi.org/10.5061/dryad.rg15p [61].

2.6. Population genetic analysis with principal components analysis

Principal components analysis (PCA) was performed on the morphological and genetic data combined (excluding mtDNA). Stridulatory ridge counts and tibia spine counts were each treated as a single trait (range 9–15 ridges; 18.5–22 spines). For nuclear loci, we treated each allele as a character trait and each individual was given a character state per allele: 0 (if absent), 1 or 2 (if homozygous), as recommended by Odong et al. [74]. Alleles within a locus are not independent; thus, a PCA is an appropriate dimensionality reduction tool. PCA was performed with standardization using the function prcomp in R package ‘STATS’ (R Development Core Team, 2008) (filename, centre = FALSE, scale = TRUE). The first five principal components (cumulative proportion of variance was 0.664) were used in a model-based clustering approach using the Bayesian information criterion to select the number of genotypic clusters in the R package ‘Mclust’ [75,76]. Model-based clustering was used to classify individual wētā without prior information about the number of clusters or specimen identity [75]. The Mclust algorithm is built from a general model, where the total dataset is considered as a mixture of multivariate normal datasets, with a selection of covariance structures and vectors of expectation [77]. The Mclust analysis examined 90 models (10 different models with various combinations of parametrization and one to nine clusters/components). The optimal model was selected based on the maximized log likelihood, with a penalty for the number of parameters in the model [75,77]. We compared the concordance of clusters resolved from mitochondrial haplogroups with nuclear data (phenotype and microsatellite genotypes) using Cohen's κ. Cohen's κ coefficient measures agreement of assignment to categories, where values between 0.75 and 1 indicate non-random agreement, but where there is no agreement among the methods other than what would be expected by chance κ ≤ 0 (online statistical calculator: https://www.niwa.co.nz/node/104318/kappa).

3. Results

High nucleotide diversity owing to the coexistence of three distinct mitochondrial haplotype lineages has been described within the natural range of H. crassidens(figure 1a; Appendix S1 of [35]; jbi12224-sup-0001-Appendix S1–S3.docx). Examination of spatial structure in this mtDNA diversity revealed some evidence of geographical structure. Haplogroup 1 is concordant with chromosome race 19 and found only south of −41o5′, whereas no evidence of latitudinal or elevation partitioning was observed among haplotypes in haplogroups 2 and 3. This absence of spatial structure through the range of H. crassidens was well illustrated by genetic landscape shape interpolation (figure 2).

Figure 2.

Figure 2. Spatial distribution of mtDNA diversity within the New Zealand tree wētā H. crassidens across its natural range (yellow shading). Genetic landscape X and Y axes correspond to geographical locations of population samples within the overall physical landscape examined for the whole range of H. crassidens. Surface plot heights reflect relative genetic distances, data from Bulgarella et al. [44].

‘Peaks’ indicating greatest genetic distances among population samples were scattered through the species range. Qualitatively similar results were obtained using various grid sizes and a range of distance weighting parameters (a = 0.5–2). Thus, we obtained no evidence of the spatial structure that would be expected from secondary contact between previously isolated populations, nor evidence of barriers or discontinuities within the range of mtDNA haplogroups 2 and 3.

We increased sampling of H. crassidens at a location where haplotypes from both haplogroups 2 and 3 had been recorded (Rangiwahia, figure 2). Our 702 bp mtDNA fragment resolved 10 haplotypes from 29 individuals (data available at: http://evolves.massey.ac.nz/Data/Rangiwahia%20weta.zip and http://dx.doi.org/10.5061/dryad.rg15p; [61]). Average nucleotide site diversity (π) for this population sample was high (0.1035). Using this nucleotide site diversity, estimates of population size ranged from 172 500 to 417 338 depending on mutation rate used. The only available insect example (Drosophila mtDNA) has the lowest mutation rate and thus gave the largest population size estimate.

For further analyses, we trimmed the sequences to include only protein-coding sections, which removed tRNA INDELS. Resulting mtDNA sequences resolved seven haplotypes attributable to haplogroup 2 or 3 of the previously reported H. crassidens diversity [44] (figure 1). These haplotypes differed by a maximum of 7.9% (uncorrected; 41 substitutions out of 519 bp; figure 1), with an average uncorrected difference of 2.7% (14.2 substitutions). The frequency distribution of pairwise genetic differences within the Rangiwahia haplotype sample was bimodal as observed for the whole species dataset (figure 3a). When translated, nucleotide substitutions resulted in three amino acid replacements that could be subject to natural selection. However, when we calculated the Ka/Ks ratio for all coding regions of the whole mitochondrial genome (representing the two mtDNA clades; (http://evolves.massey.ac.nz/Data/Rangiwahia%20weta.zip and http://dx.doi.org/10.5061/dryad.rg15p [61])), the result was less than 1 (0.03), implying constraining selection not positive balancing selection was occurring in the mtDNA.

Figure 3.

Figure 3. The non-recombining mtDNA has a bimodal frequency distribution of pairwise differences generated by the coalescent process in a large population, for whole H. crassidens species dataset (dotted red line), and Rangiwahia population sample (black fill) (a). Unimodal frequency distributions of morphological traits suggest that hybridization is not the cause of high mtDNA variation within a single population sample of the New Zealand wētā H. crassidens. Variation in numbers of stridulatory ridges (b) and spines on hind tibia (c) in Rangiwahia population sample are not concordant with mtDNA lineages of specimens; haplogroup 2 (blue) and haplogroup 3 (pink).

Each wētā from the Rangiwahia population sample was genotyped at five polymorphic nuclear loci. The number of alleles per locus ranged from 2 to 12 (table 1). None of the five nuclear loci showed evidence of linkage to any other locus (p-values ranged between 1 and 0.021). Hardy–Weinberg expectations were met with no evidence of homozygote or heterozygote excess (multi-locus exact tests: heterozygote deficit p = 0.9321; heterozygote excess p = 0.0679). The two wētā haplogroups showed no evidence of genetic differentiation at nuclear loci. Population pairwise FST = −0.01777; genotypic differentiation was 0 (p = 0.895965 exact G test; p = 0.897884 Fisher's method).

Table 1.

Genetic diversity within a single population sample of the tree wētā H. crassidens for five nuclear markers. (Sample is divided using the two distinct mtDNA haplogroups observed. The number of alleles per locus is indicated along with (number) of alleles restricted to each mtDNA haplogroup. n = sample size.)

No differences between phenotypic traits were observed when compared for the two haplogroups. Stridulatory ridge counts averaged 11.2 and 11.3, and spine counts on hind tibia averaged 20.02 and 20.20 (t-test p > 0.05). Frequency distributions of these traits were unimodal (figure 3). Combining phenotypic traits and nuclear markers in PCA, principal components 1 and 2 (45.5% of variation) showed no concordance with mtDNA haplogroup (table 2). Although two genotypic clusters were naively resolved using Bayesian modelling in Mclust with the first five principal components (66.4% of variation), these clusters were not concordant with mtDNA haplogroups (Cohen's κ estimate was −0.050). Furthermore, the clusters inferred for just PC1 and PC2 did not coincide with the clusters resolved using PC1–5 (Cohen's κ −0.088).

Table 2.

Assignment of wētā individuals to clusters based on mtDNA haplogroup is not concordant with cluster assignment based on Bayesian modelling using principal components of phenotypic and genotypic variation. (Alternative assignments indicated by italics.)

4. Discussion

We confirmed that the population sample of tree wētā from Rangiwahia contains high mtDNA diversity, resulting from the presence of individuals at the same location with very different haplotypes. We focused on this site because sympatry of the mtDNA haplogroups allowed for detection of separate species (should they exist) using genotypic information [29]. We considered four biological processes that might result in the high mtDNA diversity within our sampling of H. crassidens tree wētā. Secondary contact and hybridization was not supported by the spatial distribution of mtDNA lineages 2 and 3 throughout the species range. A lack of genetic linkage, presence of heterozygotes and lack of genetic differentiation (FST = 0) are contrary to a hypothesis of cryptic species. Mito-nuclear discordance owing to intercellular bacteria can be excluded as an explanation as parasitic Wolbachia is not found in this genus [46]. Using a combination of morphological characters and nuclear markers, we observed no concordance among clusters inferred from different datasets (mtDNA, nuclear). Although one of the two mtDNA haplogroups at Rangiwahia was represented by just six specimens, this does not appear to have prevented us from detecting population structure. For example, our estimate of genotypic differentiation (FST) based on five loci was negative, which does not suggest a type II error, and our tests for significant deviations incorporated sample size. With the same sample size, the best model to explain the phenotypic and genotypic variation (using the Bayesian cluster-based approach) resolved more than one group, suggesting that our sample was sufficient to detect variation, had it existed.

Although some mtDNA sequence differences were expressed, we detected constraining selection within the mitochondrial genome as a whole, rather than diversifying selection that would be more likely if balancing selection was maintaining the high mtDNA diversity [78,79]. We studied the whole mtDNA genome so can exclude the possibility of diversifying selection in other regions of this (non-recombining) molecule. Therefore, retention of a large population is the most likely explanation for the genetic diversity observed here within H. crassidens. In a large population, a non-recombining gene without selection will tend to yield a distribution of pairwise differences that deviates substantially from the geometric distribution expected in a population with constant size. Pairwise differences of mtDNA sequence frequently show a bimodal or trimodal distribution simply because the history of coalescent events imposes a substantial correlation [17]. Simulation studies of mtDNA species delimitation tools find that over-splitting is a problem with medium-sized and large populations [18] owing to this natural correlation using a non-recombining gene sequence. When more complex population processes such as recent population decline are incorporated into simulations, the excess of recent coalescent events also creates mtDNA clusters within species. These clusters have pairwise differences with bimodal frequency distributions, such ‘barcoding gaps’ could be misidentified as evidence of separate taxa as observed by Fujisawa & Barraclough [19]. Thus, the apparent ‘barcoding gap’ observed in H. crassidensmtDNA could be the result of a large population accentuated by recent decline resulting from destruction of forest habitat, or range expansion with leading edge effects.

Our estimates of population size for this tree wētā of 170–420 thousand are plausible for a herbivorous ectotherm. Hemideina crassidens was probably abundant throughout the Pleistocene (2.58 Myr). Evidence from niche modelling and patterns of genetic diversity suggest that H. crassidens retained large population sizes in New Zealand during the climate cycling of the last 40 K years [44]. These most recent cycles overwrote the signature for earlier cycles, but climate modelling suggests that the species would not have been significantly reduced in range size during any of the Milankovitch cycles. Competitive interactions are likely to have had an influence on realized niche, but these probably did not limit population size substantially [39].

Sequence divergence observed between the three distinct mtDNA lineages within H. crassidens could be used to infer that the mtDNA lineages shared a recent common ancestor approximately 3–4 Ma into the Pliocene. Although the shape of New Zealand during the Pliocene was considerably different from that of the current islands [42], this has little relevance for the current distribution of mtDNA haplotypes observed today. However, we cannot exclude the possibility that past New Zealand islands drove divergence by vicariance, followed by secondary contact and mixing so that only the distinct mtDNA haplotypes are retained from the parental populations. It is tempting to speculate that during range expansion (either northwards during cool phases or south during warming), smaller effective population size might have resulted in coalescent events that generated the appearance of a ‘barcoding gap’. MtDNA haplogroup 1 is restricted to recently colonized southern sites—suggesting that leading edge effects yielded shallow coalescence of this mtDNA cluster within the species. As both haplogroups 2 and 3 are widespread, the contrast between old and recent coalescence is not associated with the most recent range shifts. However, it is not necessary to evoke vicariance or changes in population size as simulation studies demonstrate that a common history for a non-recombining gene results in distributions exactly as observed here [17].

We have shown how hypotheses that arise from large mtDNA sequence differences can be tested using phenotypic and nuclear genetic data. As there is no evidence of balancing selection or that these mtDNA differences represent biological lineages with separate histories, we suggest that large population size explains the retention of high diversity. The origin of the divergent mtDNA haplogroups might result from complex biogeographical scenarios or they might simply represent normal, stochastic processes of mutation and extinction of a non-recombining locus within a large population as demonstrated by simulation studies [17,18]. We urge others to consider population size rather than cryptic species as a null hypothesis when a lack of nuclear or phenotypic data prevent testing of alternative explanations.