To associate specimens identified by molecular characters with other biological knowledge, we need reference sequences annotated by Linnaean taxonomy. In this paper, we 1) report the creation of a comprehensive reference library of DNA barcodes for the arthropods of an entire country (Finland), 2) publish this library, and 3) deliver a new identification tool based on this resource. The reference library contains mtDNA COI barcodes for 11,275 (43%) of 26,437 arthropod species known from Finland, including 10,811 (45%) of 23,956 insect species. To quantify the improvement in identification accuracy enabled by the current reference library, we ran 1,000 Finnish insect and spider species through the Barcode of Life Data system (BOLD) identification engine. Of these, 91% were correctly assigned to a unique species when compared to the new reference library alone, 85% were correctly identified when compared to BOLD with the new material included, and 75% with the new material excluded. To capitalize on this resource, we used the new reference material to train a probabilistic taxonomic assignment tool, FinPROTAX, with high success. For the full-length barcode region, the accuracy of taxonomic assignments at the level of classes, orders, families, subfamilies, tribes, genera, and species reached 99.9%, 99.9%, 99.8%, 99.7%, 99.4%, 96.8%, and 88.5%, respectively. The FinBOL arthropod reference library and FinPROTAX are available through the Finnish Biodiversity Information Facility (www.laji.fi). Overall, the FinBOL investment represents a massive capacity-transfer from the taxonomic community of Finland to all sectors of society.
Ark shells are commercially important clam species that inhabit the muddy sediments of shallow coasts in East Asia. For a long time, the lack of genome resources has hindered scientific research on ark shells. Here, we report a high-quality chromosome-level genome assembly of Scapharca kagoshimensis, with the aim of unraveling the molecular basis of heme biosynthesis and developing genomic resources for genetic breeding and population genetics in ark shells. Nineteen scaffolds corresponding to 19 chromosomes were constructed from 938 contigs (contig N50 = 2.01 Mb) to produce a final high-quality assembly with a total length of 1.11 Gb and a scaffold N50 of approximately 60.64 Mb. The genome assembly shows 93.4% completeness, as assessed by matching against 303 core conserved eukaryotic genes. A total of 24,908 protein-coding genes were predicted, of which 24,551 (98.56%) were functionally annotated. Enrichment analyses suggested that genes in heme biosynthesis pathways were expanded, and positive selection on the hemoglobin genes was also found in the genome of S. kagoshimensis, which gives important insights into the molecular mechanisms and evolution of heme biosynthesis in Mollusca. This genome assembly provides a solid foundation for investigating the molecular mechanisms that underlie the diverse biological functions and evolutionary adaptations of S. kagoshimensis.
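The contig and scaffold N50 values quoted in assembly abstracts such as the one above follow a simple definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The sketch below illustrates that definition only. The function name `n50` and the toy contig lengths are assumptions made for illustration; they are not taken from any of the assemblies described here.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy contig lengths (arbitrary units), not real data.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 80 + 70 = 150 reaches half of 300, so N50 is 70
```

Because N50 depends only on the length distribution, it measures contiguity and says nothing about correctness; completeness and accuracy need separate checks.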
We present the chromosome-level genome assembly of Dysdera silvatica Schmidt, 1981, a nocturnal ground-dwelling spider endemic to the Canary Islands. The genus Dysdera has undergone a remarkable diversification in this archipelago, mostly associated with shifts in the level of trophic specialization, making it an excellent model to study the genomic drivers of adaptive radiations. The new assembly (1.37 Gb; scaffold N50 of 174.2 Mb), produced using the chromosome conformation capture scaffolding technique, represents a continuity improvement of more than 4,500 times with respect to the previous version. The seven largest scaffolds or pseudochromosomes cover 87% of the total assembly size and match consistently with the seven chromosomes of the karyotype of this species, including the characteristic large X chromosome. To illustrate the value of this new resource, we performed a comprehensive analysis of the two major arthropod chemoreceptor gene families (i.e., gustatory and ionotropic receptors). We identified 545 chemoreceptor sequences distributed across all pseudochromosomes, with a notable underrepresentation in the X chromosome. At least 54% of them localize in 83 genomic clusters with significantly lower evolutionary distances between them than the average of the family, suggesting a recent origin of many of them. This chromosome-level assembly is the first high-quality genome representative of the Synspermiata clade, and just the third among spiders, representing a new valuable resource to gain insights into the structure and organization of chelicerate genomes, including the role that structural variants, repetitive elements and large gene families played in the extraordinary biology of spiders.
Phylogenetic trees have been extensively used in community ecology. However, how the phylogenetic reconstruction affects ecological inferences is poorly understood. In this study, we reconstructed three different types of phylogenetic trees (a synthetic-tree generated using VPhylomaker, a barcode-tree generated using rbcL+matK+trnH-psbA and a genome-tree generated from plastid genomes) that represented an increasing level of phylogenetic resolution among 580 woody plant species from six dynamic plots in subtropical evergreen broadleaved forests of China. We then evaluated the performance of each phylogeny in estimations of community phylogenetic structure, turnover and phylogenetic signal in functional traits. As expected, the genome-tree was the most resolved and the best supported for relationships among species. For local phylogenetic structure, the three trees showed consistent results with Faith’s PD and MPD; however, only the synthetic-tree produced significant clustering patterns using MNTD for some plots. For phylogenetic turnover, contrasting results between the molecular trees and the synthetic-tree occurred only with nearest neighbor distance. The barcode-tree agreed more with the genome-tree than with the synthetic-tree for both phylogenetic structure and turnover. For functional traits, both the barcode-tree and genome-tree detected phylogenetic signal in maximum height, but only the genome-tree detected signal in leaf width. This is the first study that uses plastid genomes in large-scale community phylogenetics. Our results show that genome-trees outperform barcode-trees and synthetic-trees in the analyses studied here. Our results also point to the possibility of Type I and II errors in estimation of phylogenetic structure and turnover and detection of phylogenetic signal when using synthetic-trees.
A high-quality reference genome is necessary to determine the molecular mechanisms underlying important biological phenomena; therefore, in the present study, a chromosome-level genome assembly of the Chinese shrimp Fenneropenaeus chinensis was generated. Muscle of a male shrimp was sequenced using the PacBio platform, and the genome was assembled with the aid of Hi-C technology. The assembled F. chinensis genome was 1,465.32 Mb with a contig N50 of 472.84 Kb, including 57.73% repetitive sequences, and was anchored to 43 pseudochromosomes, with a scaffold N50 of 36.87 Mb. In total, 25,026 protein-coding genes were predicted. The genome size of F. chinensis showed significant contraction in comparison with that of other penaeid species, which is likely related to the migration observed in this species. However, the F. chinensis genome included several expanded gene families related to cellular processes and metabolic processes, and the contracted gene families were associated with the virus infection process. The findings signify the adaptation of F. chinensis to the selection pressure of migration and cold environment. Furthermore, the selection signature analysis identified genes associated with metabolism, phototransduction, and the nervous system in cultured shrimps when compared with the wild population, indicating targeted, artificial selection on growth, vision, and behavior during domestication. The construction of the genome of F. chinensis provides valuable information for further analysis of the genetic mechanisms of important biological processes, and will facilitate research on genetic changes during evolution.
The promotion of responsible and sustainable trade in biological resources is widely proposed as one solution to mitigate currently high levels of global biodiversity loss. Various molecular identification methods have been proposed as appropriate tools for monitoring global supply chains of commercialized animals and plants. We demonstrate the efficacy of target capture genomic barcoding in identifying and establishing the geographic origin of samples traded as Anacyclus pyrethrum, a medicinal plant assessed as globally vulnerable in the IUCN Red List. Samples collected from national and international supply chains were identified through target capture sequencing of 443 low-copy nuclear markers and compared to results derived from genome skimming of the plastome, standard plastid barcoding regions and ITS. Both target capture and genome skimming provided approximately 3.4 million reads per sample, but target capture largely outperformed standard plant DNA barcodes and entire plastid genome sequences. Despite the difficulty of distinguishing among closely related species and infraspecific taxa of Anacyclus using conventional taxonomic methods, we succeeded in identifying 89 of 110 analysed samples to subspecies level without ambiguity through target capture. Furthermore, we were able to discern the geographical origin of Anacyclus samples collected in Moroccan, Indian and Sri Lankan markets, differentiating between plant materials originally harvested from diverse populations in Algeria and Morocco. With a recent drop in the cost of analysing samples, target capture offers the potential to routinely identify commercialized plant species and determine their geographic origin. It promises to play an important role in monitoring and regulation of plant species in trade, supporting biodiversity conservation efforts, and in ensuring that plant products are unadulterated, contributing to consumer protection.
Until recently, many historical museum specimens were largely inaccessible to genomic inquiry, but high-throughput sequencing (HTS) approaches have allowed researchers to successfully sequence genomic DNA from dried and fluid-preserved museum specimens. In addition to preserved specimens, many museums contain large series of allozyme supernatant samples, but the amenability of these samples to HTS has not yet been assessed. Here, we compared the performance of a target-capture approach using alternative sources of genomic DNA from ten specimens of spring salamanders (Plethodontidae: Gyrinophilus porphyriticus) collected 1985–1990: allozyme supernatants, allozyme homogenate pellets, and formalin-fixed tissues. We designed capture probes based on loci derived from double-digest restriction-site associated DNA sequencing (RADseq) of seven of the specimens and assessed the success and consistency of capture and RADseq technical replicates. This study design enabled direct comparisons of data quality and potential biases among the different datasets for phylogenomic and population genomic analyses. We found that in phylogenetic analyses, all replicates for a given specimen clustered together, but in principal component space, RADseq replicates did not cluster with corresponding capture-based replicates. SNP calls were on average 18.3% different between technical replicates, but these discrepancies were primarily due to differences in heterozygous/homozygous SNP calls. We demonstrate that both allozyme supernatant and formalin-fixed samples can be successfully used for population genomic analyses and we discuss ways to identify and reduce biases associated with combining capture and RADseq data.
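The replicate comparison reported above (the share of SNP calls that differ between technical replicates, and how many of those differences are heterozygous versus homozygous calls) can be sketched as follows. This is a minimal illustration, not the authors' pipeline. It assumes genotypes are encoded as alternate-allele counts (0, 1, 2) with `None` for missing calls; the function name `replicate_discordance` and the toy genotype vectors are hypothetical.

```python
def replicate_discordance(rep_a, rep_b):
    """Compare two genotype vectors from technical replicates.

    Genotypes are assumed to be alternate-allele counts:
    0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate,
    None = missing call (site skipped).
    Returns (discordance rate, het/hom mismatches, total mismatches).
    """
    compared = mismatched = het_hom = 0
    for ga, gb in zip(rep_a, rep_b):
        if ga is None or gb is None:
            continue  # only sites called in both replicates are compared
        compared += 1
        if ga != gb:
            mismatched += 1
            # exactly one replicate called the site heterozygous
            if (ga == 1) != (gb == 1):
                het_hom += 1
    if compared == 0:
        raise ValueError("no sites called in both replicates")
    return mismatched / compared, het_hom, mismatched


# Hypothetical toy genotypes, invented for illustration only.
rep_a = [0, 1, 2, 1, None, 0, 2, 1, 0, 2]
rep_b = [0, 1, 2, 2, 1, 0, 0, 1, 1, 2]
rate, het_hom, total = replicate_discordance(rep_a, rep_b)
print(f"{rate:.1%} discordant; {het_hom} of {total} are het/hom differences")
```

Separating het/hom mismatches from opposite-homozygote mismatches matters because the former are the typical signature of allelic dropout or low coverage, whereas the latter more often point to contamination or sample mix-ups.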
The soybean cyst nematode (Heterodera glycines) is a sedentary plant parasite that causes more than a billion dollars in yield losses annually. It has spread across the soybean-producing world, emerging as the primary pathogen of soybeans. This problem is exacerbated by H. glycines populations overcoming the limited sources of natural resistance in soybean and by the lack of effective and safe alternative treatments. Although there are genetic determinants that render soybean plants resistant to certain nematode genotypes, resistant soybean cultivars are increasingly ineffective because their multi-year usage has selected for virulent H. glycines populations. Successful H. glycines infection relies on the comprehensive re-engineering of soybean root cells into a syncytium, as well as the long-term suppression of host defenses to ensure syncytial viability. At the forefront of these complex molecular interactions are effectors, the proteins secreted by H. glycines into host root tissues. Understanding the mechanisms that control genomic effector acquisition, diversification, and selection is needed for the development of novel control strategies. As a foundation for obtaining this understanding, we developed a nine-scaffold, 158 Mb pseudomolecule assembly of the H. glycines genome using PacBio, Chicago, and Hi-C sequencing. An annotation of 22,465 genes was predicted using a Mikado pipeline informed by published short- and long-read expression data. Here we present results from our assembly and annotation of the H. glycines genome.
The mitochondrial gene cytochrome-c-oxidase subunit 1 (COI) is useful in many taxa for phylogenetics, population genetics, metabarcoding, and rapid species identifications. However, the phylum Ctenophora (comb jellies) has historically been difficult to study due to divergent mitochondrial sequences and the corresponding inability to amplify COI with degenerate and standard COI ‘barcoding’ primers. As a result, there are very few COI sequences available for ctenophores, despite over 200 described species in the phylum. Here, we designed new primers and amplified the COI fragment from members of all major groups of ctenophores, including many undescribed species. Phylogenetic analyses of the resulting COI sequences revealed high diversity within many groups that was not evident from more conserved 18S rDNA sequences, in particular among the Lobata. The COI phylogenetic results also revealed unexpected community structure within the genus Bolinopsis, suggested new species within the genus Bathocyroe, and supported the ecological and morphological differences of some species such as Lampocteis cruentiventer and similar lobates (Lampocteis sp. ‘V’ stratified by depth, and ‘A’ differentiated by color). The newly described primers reported herein provide important tools to enable researchers to illuminate the diversity of ctenophores worldwide via quick molecular identifications, improve the ability to analyze environmental DNA by improving reference libraries and amplifications, and enable a new breadth of population genetic studies.
Identifying local adaptation in bottlenecked species is essential for conservation management. Selection detection methods have an important role in species management plans, assessments of adaptive capacity, and the search for responses to climate change. Yet, the allele frequency changes exploited in selection detection methods are similar to those caused by the strong neutral genetic drift expected during a bottleneck. Consequently, it is often unclear what accuracy selection detection methods have across bottlenecked populations. In this study, simulations were used to explore whether signals of selection could be confidently distinguished from genetic drift across 23 bottlenecked and reintroduced populations of Alpine ibex (Capra ibex). The meticulously recorded demographic history of the Alpine ibex was used to generate comprehensive simulated SNP data. The simulated SNPs were then used to benchmark the confidence we could place in outliers identified in empirical Alpine ibex SNP data. Within the simulated dataset, the false positive rates were high for all selection detection methods but fell substantially when two or more methods were combined. True positive rates were consistently low and became negligible with increased stringency. Despite finding many outlier loci in the empirical Alpine ibex SNPs, none could be distinguished from genetic drift-driven false positives. Unfortunately, the low true positive rate also prevents the exclusion of recent local adaptation within the Alpine ibex. The baselines and stringent approach outlined here should be applied to other bottlenecked species to ensure that the risk of false positive, or false negative, signals of selection is accounted for in conservation management plans.
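The benchmarking logic described above, in which simulated loci with known selection status are used to score outlier calls, and calls from several methods are intersected to reduce false positives, can be sketched as below. This is an illustrative toy under stated assumptions, not the study's workflow. The function name `rates`, the labels `method_a` and `method_b`, and all locus sets are hypothetical, and the numbers are invented rather than taken from the paper.

```python
def rates(flagged, selected, all_loci):
    """Return (false positive rate, true positive rate) for a set of
    flagged outlier loci, given the truly selected loci in a simulation."""
    neutral = all_loci - selected
    fpr = len(flagged & neutral) / len(neutral)
    tpr = len(flagged & selected) / len(selected)
    return fpr, tpr


# Hypothetical simulated dataset: 100 loci, 5 truly under selection.
all_loci = set(range(100))
selected = {0, 1, 2, 3, 4}

# Hypothetical outlier calls from two different detection methods.
method_a = {0, 1, 10, 11, 12, 13, 14, 15}
method_b = {1, 2, 14, 15, 16, 17, 18, 19, 20}

for name, calls in [("A", method_a), ("B", method_b),
                    ("A and B", method_a & method_b)]:
    fpr, tpr = rates(calls, selected, all_loci)
    print(f"{name}: FPR={fpr:.3f} TPR={tpr:.3f}")
```

In this toy case the intersection lowers the false positive rate but also the true positive rate, which mirrors the trade-off the abstract reports: added stringency removes drift-driven outliers at the cost of power to detect real selection.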
Mapping the genes underlying ecologically relevant traits in natural populations is fundamental to developing a molecular understanding of species adaptation. Current sequencing technologies enable the characterisation of a species' genetic diversity across the landscape or even over its whole range. Adequate capture of the genetic diversity across the landscape is critical for successful genetic mapping of traits, yet there are no clear guidelines on how to achieve optimal sampling and which sequencing strategy to implement. Here we determine, through simulation, the sampling scheme that maximises the power to map the genetic basis of a complex trait in an outbreeding species across an idealised landscape and to draw genomic predictions for the trait, comparing individual and pool sequencing strategies. Our results show that QTL detection power and prediction accuracy are higher when more populations over the landscape are sampled, and that this is done more cost-effectively with pool sequencing than with individual sequencing. Additionally, we recommend sampling populations from areas of high genetic diversity. As progress in sequencing enables the integration of trait-based functional ecology into landscape genomics studies, these findings will guide study designs allowing direct measures of genetic effects in natural populations across the environment.
DNA metabarcoding is an important tool for molecular ecology. However, its effectiveness hinges on the quality of reference sequence databases and classification parameters employed. Here we evaluate the performance of MiFish 12S taxonomic assignments using a case study of California Current Large Marine Ecosystem fishes to determine best practices for metabarcoding. Specifically, we use a taxonomy cross-validation by identity framework to compare classification performance between a global database comprised of all available sequences and a curated database that only includes sequences of fishes from the California Current Large Marine Ecosystem. We demonstrate that the curated, regional database provides higher assignment accuracy than the comprehensive global database. We also document a tradeoff between accuracy and misclassification across a range of taxonomic cutoff scores, highlighting the importance of parameter selection for taxonomic classification. Furthermore, we compared assignment accuracy with and without the inclusion of additionally generated reference sequences. To this end, we sequenced tissue from 605 species using the MiFish 12S primers, adding 253 species to GenBank’s existing 550 California Current Large Marine Ecosystem fish sequences. We then compared species and reads identified from seawater environmental DNA samples using global databases with and without our generated references, and the regional database. The addition of new references allowed for the identification of 16 native taxa and 17.0% of total reads from eDNA samples, including species with vast ecological and economic value. Together these results demonstrate the importance of comprehensive and curated reference databases for effective metabarcoding and the need for locus-specific validation efforts.
Current knowledge on environmental distribution and taxon richness of free-living bacteria is mainly based on cultivation-independent investigations employing 16S rRNA gene sequencing methods. Yet, 16S rRNA genes are evolutionarily rather conserved, resulting in limited taxonomic and ecological resolution provided by this marker. We used a faster evolving protein-encoding marker to reveal ecological patterns hidden within a single OTU defined by >99% 16S rRNA sequence similarity. The studied taxon, subcluster PnecC of the genus Polynucleobacter, represents a ubiquitous group of planktonic freshwater bacteria with cosmopolitan distribution, which is very frequently detected by diversity surveys of freshwater systems. Based on genome taxonomy and a large set of genome sequences, a sequence similarity threshold for delineation of species-like taxa could be established. In total, 600 species-like taxa were detected in 99 freshwater habitats scattered across three regions representing a latitudinal range of 3400 km (42°N to 71°N) and a pH gradient of 4.2 to 8.6. Besides the unexpectedly high richness, the increased taxonomic resolution revealed structuring of Polynucleobacter communities by a couple of macroecological trends, which had previously been demonstrated only for phylogenetically much broader groups of bacteria. An unexpected pattern was the almost complete compositional separation of Polynucleobacter communities of Ca2+-rich and Ca2+-poor habitats, which strongly resembled the vicariance of plant species on silicate and limestone soils. The presented new cultivation-independent approach opened a window to an incredible, previously unseen diversity, and enables investigations aiming at a deeper understanding of how environmental conditions shape bacterial communities and drive evolution of free-living bacteria.
Scale insects are hemimetabolous, showing “incomplete” metamorphosis and no true pupal stage. Ericerus pela, commonly known as the white wax scale insect (hereafter, WWS), is a wax-producing insect found in Asia and Europe. WWS displays dramatic sexual dimorphism, with notably different metamorphic fates in males and females. Males develop into winged adults, while females are neotenic, maintain a nymph-like appearance, and are flightless and stationary. Here we report the de novo assembly of the WWS genome, with a size of 638.30 Mb (scaffold N50 of 69.68 Mb), by PacBio sequencing and Hi-C. From these data, we conducted a robust phylogenetic analysis of 24,923 gene families from 16 representative insect genomes, which indicates that Holometabola evolved from insects with incomplete metamorphosis in the Late Carboniferous, about 50 million years earlier than previously thought. To study the distinct development of males and females, we analyzed the methylome landscape in both sexes. Surprisingly, WWS displayed high levels of methylation (4.42% for males) when compared to other insects. We observed differential methylation patterns for genes involved in steroid and sesquiterpenoid production as well as related fatty acid metabolism pathways. We show here that males and females exhibit distinct titer profiles for ecdysone, the principal insect steroid hormone, and juvenile hormone (a sesquiterpenoid), suggesting that these hormones are the primary drivers of sexually dimorphic features. Our results provide a comprehensive genomic and epigenomic resource for scale insects that provides new insights into the evolution of metamorphosis and sexual dimorphism in insects.
Managing endangered species in fragmented landscapes requires estimating dispersal rates between populations over contemporary timescales. Here we develop a new method for quantifying recent dispersal using genetic pedigree data for close and distant kin. Specifically, we describe an approach that infers missing shared ancestors between pairs of kin in habitat patches across a fragmented landscape. We then apply a stepping-stone model to assign unsampled individuals in the pedigree to probable locations based on minimizing the number of movements required to produce the observed locations in sampled kin pairs. Finally, we use all pairs of reconstructed parent-offspring sets to estimate dispersal rates between habitat patches under a Bayesian model. Our approach measures connectivity over the timescale represented by the small number of generations contained within the pedigree and so is appropriate for estimating the impacts of recent habitat changes due to human activity. We used our method to estimate recent movement between newly discovered populations of threatened Eastern Massasauga Rattlesnakes (Sistrurus catenatus) using data from 2996 RAD-based genetic loci. Our pedigree analyses found no evidence for contemporary connectivity between five genetic groups, but, as validation of our approach, showed high dispersal rates between sample sites within a single genetic cluster. We conclude that these five genetic clusters of Eastern Massasauga Rattlesnakes have small numbers of resident snakes and are demographically isolated conservation units. More broadly, our methodology can be widely applied to determine contemporary connectivity rates, independent of bias from shared genetic similarity due to ancestry that impacts other approaches.
The hyper-diverse order Coleoptera comprises a staggering ~25% of known species on Earth. Despite recent breakthroughs in next generation sequencing, there remains a limited representation of beetle diversity in assembled genomes. Most notably, the ground beetle family Carabidae, comprising more than 40,000 described species, has not been studied in a comparative genomics framework using whole genome data. Here we generate a high-quality genome assembly for Nebria riversi, to examine sources of novelty in the genome evolution of beetles, as well as genetic changes associated with specialization to high elevation alpine habitats. In particular, this genome resource provides a foundation for expanding comparative molecular research into mechanisms of insect cold adaptation. Comparison to other beetles shows a strong signature of genome compaction, with N. riversi possessing a relatively small genome (~147 Mb) compared to other beetles, with associated reductions in repeat element content and intron length. Small genome size is not, however, associated with fewer protein-coding genes, and an analysis of gene family diversity shows significant expansions of genes associated with cellular membranes and membrane transport, as well as protein phosphorylation and muscle filament structure. Finally, our genomic analyses show that these high elevation beetles have endosymbiotic Spiroplasma, with several metabolic pathways (e.g. propanoate biosynthesis) that might complement N. riversi, although its role as a beneficial symbiont or as a reproductive parasite remains equivocal.
We used long read sequencing data generated from Knightia excelsa R.Br., a nectar-producing Proteaceae tree endemic to Aotearoa New Zealand, to explore how sequencing data type, volume and workflows can impact final assembly accuracy and chromosome construction. Establishing a high-quality genome for this species has specific cultural importance to Māori, the indigenous people, as well as commercial importance to honey producers in Aotearoa New Zealand. Assemblies were produced by five long read assemblers using data subsampled based on read lengths, two polishing strategies, and two Hi-C mapping methods. Our results from subsampling the data by read length showed that each assembler tested performed differently depending on the coverage and the read length of the data. Assemblies that used longer read lengths (>30 kb) and lower coverage were the most contiguous and the most k-mer and gene complete. The final genome assembly was constructed into pseudo-chromosomes using all available data assembled with FLYE, polished using Racon/Medaka/Pilon combined, scaffolded using SALSA2 and AllHiC, curated using Juicebox, and validated by synteny with Macadamia. We highlight the importance of developing assembly workflows based on the volume and type of sequencing data and establishing a set of robust quality metrics for generating high-quality assemblies. Scaffolding analyses highlighted that problems found in the initial assemblies could not be resolved accurately by utilizing Hi-C data and that scaffolded assemblies were more accurate when the underlying contig assembly was of higher accuracy. These findings provide insight into what is required for future high-quality de novo assemblies of non-model organisms.
The bean bug (Riptortus pedestris) causes great economic losses in soybean by piercing and sucking pods and seeds. Although R. pedestris has become the focus of numerous studies associated with insect–microbe interactions, plant–insect interactions, and pesticide resistance, a lack of genomic resources has limited deeper insights. In this study, we report the first R. pedestris genome at the chromosomal level using PacBio, Illumina, and Hi-C technologies. The assembled genome was 1.193 Gb in size with a contig N50 of 13.97 Mb. More than 95.7% of the total genome bases were successfully anchored to 6 unique chromosomes, with the scaffold N50 reaching 181.34 Mb. Genome resequencing of male and female individuals and chromosome staining demonstrated that the sex chromosome system of R. pedestris is XO, and the shortest chromosome is the X chromosome. In total, 21,562 protein-coding genes were predicted, 21,320 of which were validated as being expressed in different tissues or different developmental stages. Evolutionary analysis demonstrated that R. pedestris and Oncopeltus fasciatus formed a sister group and split ∼35 million years ago. Additionally, a 5.04 Mb complete genome of symbiotic Serratia marcescens Rip1 was assembled, and the virulence factors that account for successful colonization in the host midgut were identified. The high-quality R. pedestris genome provides a valuable resource for further research, as well as for the management of bug pests.