Here I describe the novel R package SNPfiltR and demonstrate its functionalities as the backbone of a customizable, reproducible SNP filtering pipeline implemented exclusively via the widely adopted R programming language. SNPfiltR extends existing SNP filtering functionalities by automating the visualization of key parameters such as depth, quality, and missing data, then allowing users to set filters based on optimized thresholds, all within a single, cohesive working environment. All SNPfiltR functions require a vcfR object as input, which can be easily generated by reading a SNP dataset stored as a standard vcf file into an R working environment using the function read.vcfR() from the R package vcfR. Performance benchmarking reveals that for moderately sized SNP datasets (up to 50M genotypes with associated quality information), SNPfiltR performs filtering with comparable efficiency to current state of the art command-line-based programs. These benchmarking results indicate that for most reduced-representation genomic datasets, SNPfiltR is an ideal choice for investigating, visualizing, and filtering SNPs as part of a cohesive and easily documentable bioinformatic pipeline. The SNPfiltR package can be downloaded from CRAN with the command [install.packages(“SNPfiltR”)], and a development version is available from GitHub at: (github.com/DevonDeRaad/SNPfiltR). Additionally, thorough documentation for SNPfiltR, including multiple comprehensive vignettes, is available at the website: (devonderaad.github.io/SNPfiltR/).
High-throughput sequencing for analysis of environmental microbial diversity has evolved vastly over the last decade. Currently the go-to method for microbial eukaryotes is short-read metabarcoding of variable regions of the 18S rRNA gene with <500 bp amplicons. However, there is a growing interest in long-read sequencing of amplicons covering the rRNA operon for improving taxonomic resolution. For both methods, the choice of primers is crucial. It determines if community members are covered, if they can be identified at a satisfactory taxonomic level, and if the obtained community profile is representative. Here, we designed new primers targeting 18S and 28S rRNA based on 177,934 and 21,072 database sequences, respectively. The primers were evaluated in silico along with published primers on reference sequence databases and marine metagenomics datasets. We further evaluated a subset of the primers for short- and long-read sequencing on environmental samples in vitro and compared the obtained community profile with primer-unbiased metagenomic sequencing. Of the short-read pairs, a new V6-V8 pair and the V4_Balzano pair used with a simplified PCR protocol provided good results in silico and in vitro. Fewer differences were observed between the long-read primer pairs. The long-read amplicons and ITS1 alone provided higher taxonomic resolution than V4. Together, our results represent a reference and guide for selection of robust primers for research on and environmental monitoring of microbial eukaryotes.
The use of NGS datasets has increased dramatically over the last decade, however, there have been few systematic analyses quantifying the accuracy of the commonly used variant caller programs. Here we used a familial design consisting of diploid tissue from a single Pinus contorta parent and the maternally derived haploid tissue from 106 full-sibling offspring, where mismatches could only arise due to mutation or bioinformatic error. Given the rarity of mutation, we used the rate of mismatches between parent and offspring genotype calls to infer the SNP genotyping error rates of FreeBayes, HaplotypeCaller, SAMtools, UnifiedGenotyper, and VarScan. With baseline filtering HaplotypeCaller and UnifiedGenotyper yielded one to two orders of magnitude larger numbers of SNPs and error rates, whereas FreeBayes, SAMtools and VarScan yielded lower numbers of SNPs and more modest error rates. To facilitate comparison between variant callers we standardized each SNP set to the same number of SNPs using additional filtering, where UnifiedGenotyper consistently produced the smallest proportion of genotype errors, followed by HaplotypeCaller, VarScan, SAMtools, and FreeBayes. Additionally, we found that error rates were minimized for SNPs called by more than one variant caller. Finally, we evaluated the performance of various commonly used filtering metrics on SNP calling. Our analysis provides a quantitative assessment of the accuracy of five widely used variant calling programs and offers valuable insights into both the choice of variant caller program and the choice of filtering metrics, especially for researchers using non-model study systems.
The molecular characterisation of complex behaviours is a challenging task as a range of different factors are often involved to produce the observed phenotype. An established approach is to look at the overall levels of expression of brain genes – known as ‘neurogenomics’ – to select the best candidates that associate with patterns of interest. This approach has relied so far on a set of powerful statistical tools capable to provide a snapshot of the expression of many thousands of genes that are present in an organism’s genome. However, traditional neurogenomic analyses have some well-known limitations; above all, the limited number of biological replicates compared to the number of genes tested – often referred to as “curse of dimensionality”. Here we implemented a new Machine Learning (ML) approach that can be used as a complement to established methods of transcriptomic analyses. We tested three types of ML models for their performance in the identification of genes associated with honeybee waggle dance. We then intersected the results of these analyses with traditional outputs of differential gene expression analyses and identified two promising candidates for the neural regulation of the waggle dance: the G-protein coupled receptor boss and hnRNP A1, a gene involved in alternative splicing. Overall, our study demonstrates the application of Machine Learning to analyse transcriptomics data and identify genes underlying social behaviour. This approach has great potential for application to a wide range of different scenarios in evolutionary ecology, when investigating the genomic basis for complex phenotypic traits.
Because of their challenging taxonomy, arthropods are traditionally underrepresented in biological inventories and monitoring programs. However, arthropods are the largest component of biodiversity, and no assessment can be considered informative without including them. Arthropod immature stages are often discarded during sorting, despite frequently representing more than half of the collected individuals. To date, little effort has been devoted to characterising the impact of discarding non-adult specimens on our diversity estimates. Here, we use a metabarcoding approach to analyse spiders from white oak communities in the Iberian Peninsula collected with standardised protocols, to assess (1) the contribution of juvenile stages to local diversity estimates, and (2) their effect on the diversity patterns inferred across communities. We further investigate the ability of metabarcoding to inform on abundance. We obtained 363 and 331 species as adults and juveniles, respectively. Species represented only by juveniles represented an increase of 35% with respect to those identified from adults in the whole sampling. Differences in composition between communities were greatly reduced when immature stages were taken considered, especially across latitudes. Moreover, our results revealed that metabarcoding data are to a certain extent quantitative, but some sort of taxonomic conversion factor may be necessary to provide accurate informative estimates. Although our findings do not question the relevance of the information provided by adult-based inventories, they also reveal that juveniles provide a novel and relevant layer of knowledge that, especially in areas with marked seasonality, may influence our interpretations, providing more accurate information from standardised biological inventories.
Over the past few decades, the rapid democratization of high-throughput sequencing and the growing emphasis on open science practices have resulted in an explosion in the amount of publicly available sequencing data. This opens new opportunities for combining datasets to achieve unprecedented sample sizes, spatial coverage, or temporal replication in population genomic studies. However, a common concern is that non-biological differences between datasets may generate batch effects that can confound real biological patterns. Despite general awareness about the risk of batch effects, few studies have examined empirically how they manifest in real datasets, and it remains unclear what factors cause batch effects and how to best detect and mitigate their impact bioinformatically. In this paper, we compare two batches of low-coverage whole genome sequencing (lcWGS) data generated from the same populations of Atlantic cod (Gadus morhua). First, we show that with a “batch-effect-naive” bioinformatic pipeline, batch effects severely biased our genetic diversity estimates, population structure inference, and selection scan. We then demonstrate that these batch effects resulted from multiple technical differences between our datasets, including the sequencing instrument model/chemistry, read type, read length, DNA degradation level, and sequencing depth, but their impact can be detected and substantially mitigated with simple bioinformatic approaches. We conclude that combining datasets remains a powerful approach as long as batch effects are explicitly accounted for. We focus on lcWGS data in this paper, which may be particularly vulnerable to certain causes of batch effects, but many of our conclusions also apply to other sequencing strategies.
The Heteroduplex mobility assay (HMA) has proven to be a robust tool for the detection of genetic variation. Here, we describe a simple and rapid application of the HMA by microfluidic capillary electrophoresis, for phylogenetics and population genetic analyses (pgHMA). We show how commonly applied techniques in phylogenetics and population genetics have equivalents with pgHMA: phylogenetic reconstruction with bootstrapping, skyline plots, and mismatch distribution analysis. We assess the performance and accuracy of pgHMA by comparing the results obtained against those obtained using standard methods of analyses applied to sequencing data. The resulting comparisons demonstrate that: (1) there is a significant linear relationship (R = 0.992) between heteroduplex mobility and genetic distance; (2) phylogenetic trees obtained by HMA and nucleotide sequences present nearly identical topologies; (3) clades with high pgHMA parametric bootstrap support also have high bootstrap support on nucleotide phylogenies; (4) skyline plots estimated from the UPGMA trees of HMA and Bayesian trees of nucleotide data reveal similar trends, especially for the median trend estimate of effective population size; and (5) optimized mismatch distributions of HMA are closely fitted to the mismatch distributions of nucleotide sequences. In summary, pgHMA is an easily-applied method for approximating phylogenetic diversity and population trends. KEYWORDS: bootstrap, heteroduplex mobility assay, mismatch distribution, phylogenetics, skyline plot
Sex-specific ecology has management implications, but rapid sex-chromosome turnover in fishes hinders development of markers to sex monomorphic species. Here, we use annotated genomes and reduced-representation sequencing data for two Australian percichthyids, the Macquarie perch Macquaria australasica and the golden perch M. ambigua, and whole genome resequencing data for 50 Macquarie perch of each sex, to detect sex-linked loci, identify a candidate sex-determining gene and develop an affordable sexing assay. In-silico pool-seq tests of 1,492,004 Macquarie perch SNP loci revealed that a 275-Kb scaffold, containing the transcription factor SOX1b gene, was enriched for gametologous loci. Within this scaffold, 22 loci were sex-linked in a predominantly XY system, with females being homozygous at all 22, and males being heterozygous at two or more. Seven XY-gametologous loci were within a 146-bp region. Being ~38 Kb upstream of SOX1b, it might act as an enhancer controlling SOX1b transcription in the bipotential gonad that drives gonad differentiation. A PCR-RFLP sexing assay, targeting one of the Y-linked SNPs, tested in 66 known-sex Macquarie perch and two individuals of each sex of three confamilial species, and amplicon sequencing of 400 bp encompassing the 146-bp region, revealed that the few sex-linked positions differ between species and between Macquarie perch populations. This indicates sex-chromosome lability in Percichthyidae, also supported by non-homologous scaffolds containing sex-linked loci for Macquarie- and golden perches. The resources developed here will facilitate genomic research in Percichthyidae. Sex-linked markers will be useful for determining genetic sex in some populations and studying sex chromosome turnover.
Targeted sequencing is an increasingly popular Next Generation Sequencing (NGS) approach for studying populations, through focusing sequencing efforts on specific parts of the genome of a species of interest. Methodologies and tools for designing targeted baits are scarce but in high demand. Here, we present specific guidelines and considerations for designing capture sequencing experiments for population genetics for both neutral genomic regions and regions subject to selection. We describe the bait design process for three diverse fish species: Atlantic salmon, Atlantic cod and tiger shark, which was carried out in our research group, and provide an evaluation of the performance of our approach across both historical and modern samples. The workflow used for designing these three bait sets has been implemented in the R-package supeRbaits, which encompass our considerations and guidelines for bait design to benefit researchers and practitioners. The supeRbaits R package is user‐friendly and versatile. It is written in C++ and implemented in R. supeRbaits and its manual are available from Github: https://github.com/BelenJM/supeRbaits
Environmental (e)DNA methods have enabled rapid, sensitive, and specific inferences of taxa presence throughout diverse fields of ecological study. However, use of eDNA results for decision-making has been impeded by uncertainties associated with false positive tests putatively caused by contamination. Sporadic contamination is a process that is inconsistent across samples and systemic contamination occurs consistently over a group of samples. Here, we used empirical data and lab experiments to (1) estimate the sporadic contamination rate for each stage of a common, targeted eDNA workflow employing best practice quality control measures under simulated conditions of rare and common target DNA presence, (2) determine the rate at which negative controls (i.e., “blanks”) detect varying concentrations of systemic contamination, (3) estimate the effort that would be required to consistently detect sporadic and systemic contamination. Sporadic contamination rates were very low across all eDNA workflow steps, and, therefore, an intractably high number of negative controls (>100) would be required to determine occurrence of sporadic contamination with any certainty. Contrarily, detection of intentionally introduced systemic contamination was more consistent; therefore, very few negative controls (<5) would be needed to consistently alert to systemic contamination. These results have considerable implications to eDNA study design when resources for sample analyses are constrained.
Genetic monitoring using non-invasive samples provides a complement or alternative to traditional population monitoring methods. However, Next Generation Sequencing approaches to monitoring typically require high quality DNA and the use of non-invasive samples (e.g. scat) is often challenged by poor DNA quality and contamination by non-target species. One promising solution is a highly multiplexed sequencing approach called Genotyping-in-thousands by sequencing (GT-seq), which can enable cost-efficient genomics-based monitoring for populations based on non-invasively collected samples. Here, we develop and validate a GT-seq panel of 324 single nucleotide polymorphisms (SNPs) optimized for genotyping of polar bears based on DNA from non-invasively collected fecal samples. We demonstrate 1) successful GT-seq genotyping of DNA from a range of sample sources, including successful genotyping of 85.7% of non-invasively collected fecal samples determined to contain polar bear DNA, and 2) that we can reliably differentiate individuals, ascertain sex, assess relatedness, and resolve population structure of Canadian polar bear subpopulations based on a GT-seq panel of 324 SNPs. Our GT-seq data reveal similar spatial-genetic patterns as previous polar bear studies but at lesser cost per sample and using non-invasively collected samples, indicating the potential of this approach for population monitoring. This GT-seq panel provides the foundation for a non-invasive toolkit for polar bear monitoring and contribute to community-based programs – a framework which may serve as a model for wildlife management and contribute to conservation and policy for species worldwide.
Helminth diseases have long been a threat to the health of humans and animals. Roundworms are important organisms for studying parasitic mechanisms, disease transmission and prevention. The study of parasites in the giant panda is of importance for understanding how roundworms adapt to the host. Here, we report a high-quality chromosome-scale genome of Baylisascaris schroederi with a genome size of 253.60 Mb and 19,262 predicted protein-coding genes. We found that gene families related to epidermal chitin synthesis and environmental information processes in the roundworm genome have expanded significantly. Furthermore, we demonstrated unique genes involved in essential amino acid metabolism in the B. schroederi genome, inferred to be essential for the adaptation to the giant panda-specific diet. In addition, under different deworming pressures, we found that four resistance-related genes (glc-1, nrf-6, bre-4 and ced-7) were under strong positive selection in a captive population. Finally, 23 known drug targets and 47 potential drug target proteins (essential homologues linked to lethal phenotypes) were identified. The genome provides a unique reference for inferring the early evolution of roundworms and their adaptation to the host. Population genetic analysis and drug sensitivity prediction provide insights revealing the impact of deworming history on population genetic structure of importance for disease prevention.
With the rapid growth of the number of sequenced ancient genomes, there has been increasing interest in using this new information to study past and present adaptation. Such an additional temporal component has the promise of providing improved power for the estimation of natural selection. Over the last decade, statistical approaches for detection and quantification of natural selection from ancient DNA (aDNA) data have been developed. However, most of the existing methods do not allow us to estimate the timing of natural selection along with its strength, which is key to understanding the evolution and persistence of organismal diversity. Additionally, most methods ignore the fact that natural populations are almost always structured, which can result in overestimation of the effect of natural selection. To address these issues, we propose a novel Bayesian framework for the inference of natural selection and gene migration from aDNA data with Markov chain Monte Carlo techniques, co-estimating both timing and strength of natural selection and gene migration. Such an advance enables us to infer drivers of natural selection and gene migration by correlating genetic evolution with potential causes such as the changes in the ecological context in which an organism has evolved. The performance of our procedure is evaluated through extensive simulations, with its utility shown with an application to ancient chicken samples.
Genome sequencing methods and assembly tools have improved dramatically since the 2013 publication of draft genome assemblies for the mountain pine beetle, Dendroctonus ponderosae Hopkins (Coleoptera: Curculionidae). We conducted proximity ligation library sequencing and scaffolding to improve contiguity, and then used linkage mapping and recent bioinformatic tools for correction and further improvement. The new assemblies have dramatically improved contiguity and gaps compared to the originals: N50 values increased 26- to 36-fold, and the number of gaps were reduced by half. Ninety percent of the content of the assemblies is now contained in 12 and 11 scaffolds for the female and male assemblies, respectively. Based on linkage mapping information, the 12 largest scaffolds in both assemblies represent all 11 autosomal chromosomes and the neo-X chromosome. These assemblies now have nearly chromosome-sized scaffolds and will be instrumental for studying genomic architecture, chromosome evolution, population genomics, functional genomics, and adaptation in this and other pest insects. We also identified regions in two chromosomes, including the ancestral-X portion of the neo-X chromosome, with elevated differentiation between northern and southern Canadian populations.
The 15 species of small carnivorous marsupials that comprise the genus Antechinus exhibit semelparity, a rare life-history strategy where death occurs after one breeding season. Antechinus males, but not females, age rapidly (demonstrate organismal senescence) during the breeding season and show promise as new animal models of ageing. Some antechinus species are also threatened or endangered. Here, we report chromosome-level genomes of the yellow-footed antechinus Antechinus flavipes. The genome assembly has a total length of 3.2 Gb with a contig N50 of 51.8 Mb and a scaffold N50 of 636.7 Mb. We anchored and oriented 99.7% of the assembly on seven pseudochromosomes and found that repetitive DNA sequences occupy 51.8% of the genome. Draft genome assemblies of three related species in the subfamily Phascogalinae, two additional antechinus species (A. argentus and A. arktos) and the iteroparous sister species Murexia melanurus were also generated. Preliminary demographic analysis supports the hypothesis that climate change during the Pleistocene isolated species in Phascogalinae and shaped their population size. A transcriptomic profile across the A. flavipes breeding season allowed us to identify genes associated with aspects of the male die-off. The chromosome-level A. flavipes genome provides a steppingstone to understanding an enigmatic life-history strategy and a resource to assist the conservation of antechinuses.
Populus has a wide ecogeographical range spanning the Northern Hemisphere, and exhibits abundant distinct species and hybrids globally. Populus tomentosa Carr. is widely distributed and cultivated in the eastern region of Asia, where it plays multiple important roles in forestry, agriculture, conservation, and urban horticulture. Reference genomes are available for several Populus species, however, our goals were to produce a very high quality de novo, chromosome-level genome assembly in P. tomentosa genome that could serve as a reference for evolutionary and ecological studies of hybrid speciation. Here, combining long-read sequencing and Hi-C scaffolding, we present a high-quality, haplotype-resolved genome assembly. The genome size was 740.2 Mb, with a contig N50 size of 5.47 Mb and a scaffold N50 size of 46.68 Mb, consisting of 38 chromosomes, as expected with the known diploid chromosome number (2n=2x=38). A total of 59,124 protein-coding genes were identified. Phylogenomic analyses revealed that P. tomentosa is comprised of two distinct subgenomes, which we deomonstrate is likely to have resulted from hybridization between Populus adenopoda as the female parent and Populus alba var. pyramidalis as the male parent, approximately 3.93 Mya. Although highly colinear, significant structural variation was also found between the two subgenomes. Our study provides a valuable resource for ecological genetics and forest biotechnology.
Microbial diversity and community function are related, and can be highly specialized in different gut regions. The cloacal microbiome of Sceloporus virgatus provides antifungal protection to eggshells during oviposition – a specialized function that suggests a specialized microbial composition. Here, we describe the S. virgatus cloacal microbiome from tissue and swab samples, and compare it to tissue samples from the gastrointestinal (GI) tract and oviduct, adding to the growing body of evidence of microbiome localization in reptiles. We further assessed whether common methods of microbial sampling – cloacal swabs and feces – provide accurate representations of these microbial communities and whether feces might “seed” the cloacal microbiome or impact the accuracy of cloacal swab sampling. We found that different regions of the gut had unique microbial community structures. The cloacal community, in particular, showed extreme specialization averaging 99% Proteobacteria (Phylum) and 83% Enterobacteriacaea (Family). Cloacal swabs recovered communities similar to that of lower intestine and cloacal tissues, but fecal samples had much higher diversity and a distinct composition (62% Firmicutes and 39% Lachnospiraceae) relative to all gut regions. Finally, we found that feces and cloacal swabs recover different communities, but cloacal swabs may be contaminated with fecal matter if taken immediately after defecation. These results serve as a caution against the assumption that fecal samples provide an accurate representation of the gut, and that although cloacal swabs can reflect a portion of the lower GI tract microbiome, they may also result in a mixed community of gut and fecal microbes.
To associate specimens identified by molecular characters to other biological knowledge, we need reference sequences annotated by Linnaean taxonomy. In this paper, we 1) report the creation of a comprehensive reference library of DNA barcodes for the arthropods of an entire country (Finland), 2) publish this library, and 3) deliver a new identification tool based on this resource. The reference library contains mtDNA COI barcodes for 11,275 (43%) of 26,437 arthropod species known from Finland, including 10,811 (45%) of 23,956 insect species. To quantify the improvement in identification accuracy enabled by the current reference library, we ran 1,000 Finnish insect and spider species through the Barcode of Life Data system (BOLD) identification engine. Of these, 91% were correctly assigned to a unique species when compared to the new reference library alone, 85% were correctly identified when compared to BOLD with the new material included, and 75% with the new material excluded. To capitalize on this resource, we used the new reference material to train a probabilistic taxonomic assignment tool, FinPROTAX, scoring high success. For the full-length barcode region, the accuracy of taxonomic assignments at the level of classes, orders, families, subfamilies, tribes, genera, and species reached 99.9%, 99.9%, 99.8%, 99.7%, 99.4%, 96.8%, and 88.5%, respectively. The FinBOL arthropod reference library and FinPROTAX are available through the Finnish Biodiversity Information Facility (www.laji.fi). Overall, the FinBOL investment represents a massive capacity-transfer from the taxonomic community of Finland to all sectors of society.