Emily joined the lab in early 2020 to work on a project investigating calf health and gut microbes, but very soon afterward, the SARS-CoV-2 pandemic emerged and changed the way we were able to interact on campus. Without missing a beat, Emily shifted her efforts from helping me wrangle the lab renovations and sorting out our inventory, to helping me improve my teaching materials, to diving deep into previous literature to dig up protocols for her experiment in 2021: “Ideal Conditions for Cryptosporidium Attachment and Infection.”
We’ll be performing the experiment itself over the winter break, and then using the spring to analyze the data and write them up. As part of the CUGR award, Emily will be presenting her work at the 2021 Student Symposium in April, which will be held virtually this year. You’ll have to wait till then to get more details!
Mice have arrived for a collaborative project on diet, gut microbes, and health in conjunction with researchers at Husson University! This is the first mouse project for the Ishaq Lab, and also my first hands-on mouse project (in my previous publications with mice, I received datasets but the mouse work was performed solely by my collaborators).
This is one of my first new collaborations at the University of Maine, which began in September 2019 as I was just finding my way around campus. An established researcher at Husson University, Dr. Yanyan Li, reached out to welcome me and talk about overlap between our work. Yanyan, her husband Dr. Tao Zhang, also a researcher at Husson University, and collaborator Dr. Grace Chen at Michigan State University, had been working on beneficial compounds found in broccoli using mice as an experimental model for Inflammatory Bowel Disease (IBD). Over the past year, in consultation with IBD experts Drs. Gary Mawe and Peter Moses (who I worked with previously while at UVM!), we have written several proposals for funding to expand the project.
Johanna Holman worked for several years with Yanyan and Tao, as an undergraduate researcher and then as a research assistant. She joined the Ishaq Lab this fall to continue her work as a graduate student and add gut microbiology to her skill repertoire. This experiment will form the base of her graduate thesis, and Johanna is taking a lead role in managing the project as well as several undergraduate researchers, including Dorien Baudewyns, assisting with the mice and lab work. As an early career researcher, and new to mice, I’m extremely lucky to be able to learn from an experienced team of researchers!
As a new assistant professor at the University of Maine, 50% of my appointment is research. To establish my research, I started with curating a space to fulfill the needs of my work: “professional nesting”, if you will. I was allotted two adjacent rooms for my lab work, one as a microbial culturing space, and one for genomics work. I asked for and was granted separate spaces to reduce the likelihood of contamination sourced from my culturing space.
Prior to my arrival at the University of Maine, both lab spaces were set up to perform different research from what I do. This may not seem like it would interfere with my work, but the type of research you do will influence the machinery you need, each of which may have space or utilities requirements, as well as the flow of traffic through the room. To reduce the amount of time you spend moving around the room in search of elusive supplies, it’s best to curate work stations within the room. To that end, the Ishaq lab team spent several days re-arranging the large machinery and the table-top equipment, and then moving the supplies to the cabinets in corresponding locations. This change was most evident in the genomics room, which was previously used for human cell culture and biochemistry, shown below. At this time, I’m still working on updating the microbial culture room, which is larger and contained many more bits and pieces to organize.
Most research labs use extremely specialized equipment and machinery. Some of this was made available to me immediately; when research labs are discontinued, ownership of equipment and consumable materials reverts back to the researcher’s home department. I needed to purchase some of the more research-specific equipment, using some of the funds allotted to me for this purpose. Buying equipment can be stressful, because it can be incredibly expensive, and you want to be sure you selected the machine brand and range of capabilities for what you might want to do over the next 5 – 10 years, at least.
Finally, you need to stock your lab with reagents and researchers, but both of these have been temporarily put on hold as of March 2020, as we do our part to reduce the transmission of the Covid-19 virus. Whenever it is safe to do so, I look forward to completing the updates to my spaces and opening them up for collaborative work.
Now that I’m an assistant professor, a significant amount of my time is spent writing grant proposals to fund projects I’d like to do in the future.
Many large federal or foundational grants take up to a year from submission to funds distribution, and the success rate, especially for newly-established researchers, can be quite low. It’s prudent to start writing well in advance of the due date, and to start small, with “pilot projects”.
To that end, I’m pleased to announce that Dr. Lily Calderwood and I just received word that the Wild Blueberry Commission of Maine is funding a pilot project of ours: “Exploration of Soil Microbiota in Wild Blueberry Soils”. We’ll be recruiting 1 – 2 UMaine students for summer/fall 2020 to participate in the research for their Capstone senior research projects.
Dr. Calderwood is an Extension Wild Blueberry Specialist, and Assistant Professor of Horticulture in the School of Food and Agriculture at UMaine. She and I developed this project when meeting for the first time, over coffee. We realized we’d both been at the University of Vermont doing our PhDs concurrently, and in neighboring buildings! We got to chatting about my work in wheat soil microbial communities, and her work on blueberry production, and the untapped research potential between the two.
This pilot will generate some preliminary data to help us get a first look at the soil microbiota associated with blueberries, and in response to management practices and environmental conditions. From this seed funding, Lily and I hope to cultivate fruitful research projects for years to come!
In summer 2019, I developed and taught a course on ‘Microbes and Social Equity’ to the Clark Honors College at the University of Oregon. The course assignments were literature review essays on various topics, which were compiled into a single manuscript as the group-based final project for the course. This large version is available as a preprint; however, the published version is more focused.
Suzanne L. Ishaq1,2*, Maurisa Rapp2,3, Risa Byerly2,3, Loretta S. McClellan2, Maya R. O’Boyle2, Anika Nykanen2, Patrick J. Fuller2,4, Calvin Aas2, Jude M. Stone2, Sean Killpatrick2,4, Manami M. Uptegrove2, Alex Vischer2, Hannah Wolf2, Fiona Smallman2, Houston Eymann2,5, Simon Narode2, Ellee Stapleton6, Camille C. Cioffi7, Hannah Tavalire8
1. Biology and the Built Environment Center, University of Oregon
2. Robert D. Clark Honors College, University of Oregon
3. Department of Human Physiology, University of Oregon
4. Charles H. Lundquist College of Business, University of Oregon
5. School of Journalism and Communication, University of Oregon
6. Department of Landscape Architecture, University of Oregon
7. Counseling Psychology and Human Services, College of Education, University of Oregon
8. Institute of Ecology and Evolution, University of Oregon
What do ‘microbes’ have to do with social equity? On the surface, very little. But these little organisms are integral to our health, the health of our natural environment, and even impact the ‘health’ of the environments we have built. Early life and the maturation of the immune system, our diet and lifestyle, and the quality of our surrounding environment can all impact our health. Similarly, the loss, gain, and retention of microorganisms (namely their flow from humans to the environment and back) can greatly impact our health and well-being. It is well-known that inequalities in access to perinatal care, healthy foods and fiber, a safe and clean home, and to the natural environment can create and arise from social inequality. Here, we focus on the argument that access to microorganisms is a facet of public health, and argue that health inequality may be compounded by inequitable microbial exposure.
After several years of bouncing through internal and external review, I’m pleased to announce that the first microbes paper out of the Montana State University Fort Ellis project has been published in Geoderma! The Fort Ellis research has encompassed multiple labs, projects, and many personnel, as it was a large collaboration looking at the effect of different farming systems on biodiversity at the macro (plant), mini (insect), and micro (-be) levels. Spanning multiple years, this project has been a massive undertaking that I briefly participated in but anticipate getting four publications out of (two more are in preparation).
Despite knowledge that management practices, seasonality, and plant phenology impact soil microbiota, farming system effects on soil microbiota are not often evaluated across the growing season. We assessed the bacterial diversity in soil around wheat roots through the spring and summer of 2016 in winter wheat (Triticum aestivum L.) in Montana, USA, from three contrasting farming systems: a chemically-managed no-tillage system, and two USDA-certified organic systems in their fourth year, one including tillage and one where sheep grazing partially offsets tillage frequency. Bacterial richness (range 605 – 1174 OTUs) and evenness (range 0.80 – 0.92) peaked in early June and dropped by late July (range 92 – 1190, 0.62 – 0.92, respectively), but were not different by farming system. Organic tilled plots contained more putative nitrogen-fixing bacterial genera than the other two systems. Bacterial community similarities were significantly altered by sampling date, minimum and maximum temperature at sampling, bacterial abundance at date of sampling, total weed richness, and coverage of Taraxacum officinale, Lamium amplexicaule, and Thlaspi arvense. This study highlights that weed diversity, season, and farming management system all influence soil microbial communities. Local environmental conditions will strongly condition any practical applications aimed at improving soil diversity, especially in semi-arid regions where abiotic stress and seasonal variability in temperature and water availability drive primary production. Thus, it is critical to incorporate or address seasonality in soil sampling for microbial diversity.
The picture is just one instant in an event involving hundreds or thousands of organisms that were all doing a lot of different things, sometimes for just a few seconds. How would you describe it?
Maybe using the number of members present in this community? Or a list of names of attendees? The 16S rRNA gene for prokaryotes, or the 18S rRNA or ITS genes for eukaryotes, for example, would tell us that. Those genes are found in all types of those organisms, and are a pretty effective means of basic identification. But, it’s only as good as how often that gene is found in the organisms you are looking for. There is no one gene that’s found exactly the same in all organisms, so you might need to target multiple different identification genes to look at all the different types of microorganisms, such as bacteria, fungi, protozoa, or archaea. Viruses don’t share a common gene across types, so to look at viruses you’d need something else.
From our identification genes we could identify all the organisms wearing yellow; ex. phylogenetic Family = Ducks. That wouldn’t tell us if they were always found in this ecosystem (native Eugene population) or just passing through (transient population), but we could figure that out if we looked at every home game of the season and found certain community members there time and again.
But knowing they are Ducks doesn’t tell us anything else about that community member. What will they do if it starts raining? Are they able to go mountain biking? Perhaps we could identify their potential for activity by looking at the objects they are carrying? That would be akin to metagenomics, identifying all the DNA present from all the organisms, which tells us what genes are present, but not if they are currently or ever used. It can be challenging to interpret: think of sequencing data from one organism’s genome as one 1,000,000-piece puzzle and all the genomes in a community as 1,000 1,000,000-piece puzzles all dumped in a pile. In the crowd, metagenomics would tell us who had a credit card that was specifically used to buy umbrellas, but not whether they’d actually use the umbrella if it rains (ex. Eugeneans would not).
We could describe what everyone is doing at this moment. That would be transcriptomics, identifying all the RNA to determine which genes were actively being transcribed into RNA and used to make proteins for some cellular function. If we see someone in the crowd using that credit card for an umbrella (DNA), the receipt would be the RNA. RNA is a working copy you make of the DNA to take to another part of the cell and use as a blueprint to make a protein. You don’t want your entire genome moving around, or need it to make one protein, so you make a small piece of RNA that will only hang around for a short period before degrading (i.e. you crumpling that RNA receipt and throwing it away because who keeps receipts anymore).
Using transcriptomics, we’d see you were activating your money to get that umbrella, but we wouldn’t see the umbrella itself. For that, we’d need metabolomics, which uses chemistry and physics instead of genomics, in order to identify chemicals (most often small molecules called metabolites). Think of metabolomics as describing this crowd by all the trash and crumbs and miscellaneous items they left behind. It’s one way to know what biological processes occurred (popcorn consumption and digestion).
From a technical standpoint, researching a microbiome might mean looking at all the DNA from all the organisms present to know who they are and of what they are capable. It might also mean looking at all the RNA present, which would tell you what genes were being used by “everyone” for whatever they were doing at a particular moment. Or you might also add metabolomics to identify all the chemical metabolites, which would be all the end products of what those cells were doing, and which are more stable than RNA so they could give you data about a longer frame of time. Collectively, -omics are technologies that look at all of a certain biological substance to help you understand a dynamic community. However, it’s important to remember that each technology gives a particular view of the community and comes with its own limitations.
Last year, one of my former research groups at Montana State University was awarded a USDA NIFA Foundational program grant, and I am a sub-award PI on that grant. We’ll be working together to investigate the effect of diversified farming systems – such as those that use cover crops, rotations, or integrate livestock grazing into field management – on crop production and soil bacterial communities: “Diversifying cropping systems through cover crops and targeted grazing: impacts on plant-microbe-insect interactions, yield and economic returns.”
The first soil samples were collected in Montana this summer, and I have been processing them for the past few weeks. I am using the opportunity to train a master’s student on microbiology and molecular genetics lab work.
Tindall Ouverson started this fall as a master’s student at MSU, working with Fabian Menalled and Tim Seipel in Bozeman, MT. She’s an environmental and soil scientist, and this is her first time working with microbes. She was here in Eugene for just a few days to learn everything needed for sequencing: DNA extraction, polymerase chain reaction, gel electrophoresis and visualization, DNA cleanup using magnetic beads, quantification, and pooling. Despite not having experience in microbiology or molecular biology, Tindall showed a real aptitude and picked up the techniques faster than I expected!
Once the sequences are generated, I’ll be (remotely) training Tindall on DNA sequence analysis. I’ll also be serving as one of her thesis committee members! Tindall will be the first of (hopefully) many cross-trained graduate students between myself and collaborators at MSU.
Sequence data contamination from biological or digital sources can obscure true results and falsely raise one’s hopes. Contamination is a persistent issue in microbial ecology, and each experiment faces unique challenges from a myriad of sources, which I have previously discussed. In microbiology, those microscopic stowaways and spurious sequencing errors can be difficult to identify as non-sample contaminants, and collectively they can create large-scale changes to what you think a microbial community looks like.
Samples from large studies are often processed in batches based on how many samples can be processed by certain laboratory equipment, and if these span multiple bottles of reagents, or water-filtration systems, each batch might end up with a unique contamination profile. If your samples are not randomized between batches, and each batch ends up representing a specific time point or a treatment from your experiment, these batch effects can be mistaken for a treatment effect (a.k.a. a false positive).
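One way to guard against this is to shuffle samples before cutting them into processing batches. The following is only a minimal sketch in Python, not the workflow used in this study; the function name, sample names, and batch size are all hypothetical.

```python
import random

def assign_batches(sample_ids, batch_size, seed=0):
    """Shuffle samples, then cut the shuffled list into batches of at most batch_size."""
    rng = random.Random(seed)          # fixed seed so the assignment can be reproduced
    shuffled = list(sample_ids)        # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size]
            for i in range(0, len(shuffled), batch_size)]

# Hypothetical sample names: 12 "before" and 12 "after" samples
samples = [f"before_{i}" for i in range(12)] + [f"after_{i}" for i in range(12)]
batches = assign_batches(samples, batch_size=8, seed=42)
```

Because the order is shuffled first, a batch is unlikely to line up exactly with one time point or treatment, although a stratified design would be needed to guarantee balance.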
“The times were statistically greater than prior time periods, while simultaneously being statistically lesser to prior times, according to longitudinal analysis.”
Over the past year, I analyzed a particularly complex bacterial 16S rRNA gene sequence data set, comprising nearly 600 home dust samples, and about 90 controls. Samples were collected from three climate regions in Oregon, over a span of one year, in which homes were sampled before and approximately six weeks after a home-specific weatherization improvement (treatment homes) or simply six weeks later in (comparison) homes which were eligible for weatherization but did not receive it. As these samples were collected over a span of a year, they were extracted with two different sequencing kits and multiple DNA extraction batches, although all within a short time after collection. The extracted DNA was spread across two sequence runs to allow for data processing to begin on cohort 1, while we waited for cohort 2 homes to be weatherized. Thus, there were a lot of opportunities to introduce technical error or biological contamination that could be conflated with treatment effects.
On top of this, each home was unique, with its own human and animal occupants, architectural and interior design, plants, compost, and quirks, and we didn’t ask homeowners to modify their behavior in any way. This was important, as it meant each of the homes – and their microbiomes – was somewhat unique. Therefore I didn’t want to remove sequences which might be contaminants on the basis of low abundance and risk removing microbial community members which were specific to that home. After the typical quality assurance steps to curate and process the data, which can be found on GitHub as an R script of a DADA2 package workflow, I needed to decide what to do with the negative controls.
Because sequencing is expensive, most of the time there is only one negative control included in sequencing library preparation, if that. The negative control is a blank sample – just water, or an unused swab – which does not intentionally contain cells or nucleic acids. Thus anything you find there will have come from contamination. The negative control can be used to normalize the relative abundance numbers – if you find 1,000 sequences in the negative control, which is supposed to have no DNA in it, then you might only continue looking at samples with a certain amount higher than 1,000 sequences. This risks throwing out valid sequences that happen to be rare. Alternatively, you can try to identify the contaminants and remove whole taxa from your data set, risking the complete removal of valid taxa.
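As a rough illustration of the thresholding idea described above, here is a minimal Python sketch. The `fold` margin and the sample names are my assumptions for the example; the text only says "a certain amount higher" than the negative control.

```python
def samples_above_blank(sample_totals, blank_total, fold=2.0):
    """Keep only samples whose total read count exceeds fold * blank_total.

    `fold` is a hypothetical margin chosen for this example.
    """
    cutoff = blank_total * fold
    return {name: total for name, total in sample_totals.items() if total > cutoff}

# Hypothetical read totals per sample, and 1,000 reads in the negative control
sample_totals = {"home_A": 15000, "home_B": 1800, "home_C": 2500}
kept = samples_above_blank(sample_totals, blank_total=1000, fold=2.0)
```

Here `home_B` falls below the cutoff of 2,000 reads and would be dropped, which is exactly the trade-off noted above: a real but low-yield sample is discarded along with the noise.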
I had three types of negative controls: sterile DNA swabs which were processed to check for biological contamination in collection materials, kit controls where a blank extraction was run for each batch of extractions to test for biological contamination in extraction reagents, and PCR negative controls to check for DNA contamination of PCR reagents. In total, 90 control samples were sequenced, giving me unprecedented resolution to deal with contamination. Looking at the total number of sequences before and after my quality-analysis processing, I can see that the number of sequences in my negative controls reduces dramatically; they were low-quality in some way and might be sequencing artifacts. But, an unsatisfactory number remain after QA filtering; these are high-quality and likely come from microbial contamination.
I wasn’t sure how I wanted to deal with each type of control. I came up with three approaches, and then looked at unweighted, non-rarefied ordination plots (PCoA) to watch how my axes changed based on important components (factors). What follows is a narrative summary of what I did, but I included the R script of my phyloseq package workflow and workaround on GitHub.
“In microbial ecology, preprints are posted on late November nights. The foreboding atmosphere of conflated factors makes everyone uneasy.”
Ordination plots visualize lots of complex communities together. In both ordination figures below, each point on the graph represents a dust sample from one house. They are clustered by community distance: those closer together on the plot have a more similar community than points which are further away from each other. The points are shaped by the location of the samples, including Bend, Eugene, Portland, along with a few pilot samples labeled “Out”, and negative controls which have no location (not pictured but listed as NA). The points are colored by DNA extraction batch.
In Figure 1, the primary axis (axis 1) shows a clear clustering of samples by DNA extraction batch, but this is also mixed with geographic location, and as it turns out – date of collection and sequencing run. We know from other studies that geographic location, date of collection, and sequencing batch can all affect the microbial community.
Approach 1: Subtraction + outright removal
This approach subsets my data into DNA extraction batches, and then uses the number of sequences found in the negative controls to subtract out sequences from my dust samples. This assumes that if a particular sequence showed up 10 times in my negative control, but 50 times in my dust samples, that only 40 of those in my dust sample were real. For each of my DNA extraction batch negative control samples, I obtained the sum of each potential contaminant that I found there, and then subtracted those sums from the same sequence columns in my dust samples.
Approach 1 was alright, but there was still an effect of DNA extraction batch (indicated by color scale) that was stronger than location or treatment (not included on this graph). This approach is also more pertinent for working with OTUs, or situations where you wouldn’t want to remove the whole OTU, just subtract out a certain number of sequences from specific columns. There is currently no way to do that just from phyloseq, so I made a work-around (see the GitHub page). However, using DADA2 gives you Sequence Variants, which are more precise, and I found it’s better to remove them with approach 3.
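The subtraction logic of approach 1 can be sketched in a few lines of Python. This is not the R/phyloseq workaround on GitHub, just an illustration with hypothetical names; flooring at zero when a control has more reads than the sample is my assumption.

```python
def subtract_controls(sample_counts, control_counts):
    """Subtract summed negative-control counts from each sample, per sequence.

    sample_counts: {sample_name: {sequence: count}} for one extraction batch
    control_counts: list of {sequence: count}, one dict per negative control
    Counts are floored at zero (an assumption made for this sketch).
    """
    # Sum each potential contaminant across all controls in this batch
    control_sums = {}
    for counts in control_counts:
        for seq, n in counts.items():
            control_sums[seq] = control_sums.get(seq, 0) + n

    # Subtract those sums from the matching sequence in every sample
    return {
        sample: {seq: max(n - control_sums.get(seq, 0), 0) for seq, n in counts.items()}
        for sample, counts in sample_counts.items()
    }

# Hypothetical batch: seqA appears 50 times in dust and 10 times in the control
batch1_samples = {"dust_1": {"seqA": 50, "seqB": 5, "seqC": 20}}
batch1_controls = [{"seqA": 10, "seqB": 8}]
cleaned = subtract_controls(batch1_samples, batch1_controls)
```

In this example `seqA` drops from 50 to 40, `seqB` is floored at 0, and `seqC`, which never appeared in a control, is untouched.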
Approach 2: Total Removal
This approach removes any contaminant sequence that is found in ANY of the negative controls from ALL the house samples, regardless of which negative control was for which extraction batch. This approach assumes that if a sequence was found as a contaminant in a negative control somewhere, then it is a contaminant everywhere.
Once again, approach 2 was alright, and what was the primary axis (axis 1) of potential batch effect is now my secondary axis; so there is still an effect of DNA extraction batch (indicated by color scale) but it is weaker. When I recolor by different variables, there is much more clustering by Treatment than by any batch effects. However, that second axis is also one of my time variables, so I don’t want to get rid of all of the variation on that axis. But, since my negative kit controls showed a lot of variation in number and types of taxa, I don’t want to remove everything found there from all samples indiscriminately.
Additionally, I don’t favor throwing sequences out just because they were a contaminant somewhere, particularly for dust samples. Contamination can be situational, particularly if a microbe is found in the local air or water supply and would be legitimately found in house dust but would have also accidentally gotten into the extraction process.
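For comparison, approach 2 amounts to building one pooled contaminant list and dropping it everywhere. A minimal Python sketch follows, with hypothetical names and data, and again not the R script itself.

```python
def remove_all_control_sequences(sample_counts, control_counts):
    """Drop every sequence seen in ANY negative control from ALL samples.

    sample_counts: {sample_name: {sequence: count}}
    control_counts: {control_name: {sequence: count}}
    """
    # Pool contaminants across every control, ignoring which batch they came from
    contaminants = set()
    for counts in control_counts.values():
        contaminants.update(seq for seq, n in counts.items() if n > 0)

    return {
        sample: {seq: n for seq, n in counts.items() if seq not in contaminants}
        for sample, counts in sample_counts.items()
    }

# Hypothetical data: seqB only showed up in the batch 2 kit control,
# but it is still removed from the batch 1 sample
dust = {"dust_batch1": {"seqA": 50, "seqB": 30}, "dust_batch2": {"seqB": 4, "seqC": 9}}
controls = {"kit_batch1": {"seqA": 10}, "kit_batch2": {"seqB": 2}}
stripped = remove_all_control_sequences(dust, controls)
```

The example shows the cost described above: 30 reads of `seqB` vanish from a sample whose own batch control never contained it.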
Approach 3: “To each its own”
This approach removes all the sequences from PCR and swab contaminant SVs fully from each cohort, respectively, and removes extraction kit contaminants fully from each DNA extraction batch, respectively. I took all the sequences of the SVs found in my dust samples and made them into a vector (list), and then I took all the sequences of the SVs found in my controls and made them into a different vector. I effectively subtracted out the contaminant SVs by name, by asking to find the sequences which were different between my two lists (thus returning the sequences which were in my dust samples but not in my control samples). I did this respective to each sequencing cohort and batch, so that I only removed the pertinent sequences (ex. using kit control 1 to subtract from DNA extraction batch 1).
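The "difference between two lists" step can be sketched with Python sets. This is only an illustration of the batch-matched idea for kit controls (the same pattern would apply to PCR and swab controls per cohort); all names and data are hypothetical, and the actual workflow is the R script on GitHub.

```python
def remove_matched_contaminants(sample_counts, sample_batch, kit_controls_by_batch):
    """Remove kit-control SVs only from samples in the same extraction batch.

    sample_counts: {sample_name: {sv_sequence: count}}
    sample_batch: {sample_name: batch_id}
    kit_controls_by_batch: {batch_id: iterable of contaminant SV sequences}
    """
    cleaned = {}
    for sample, counts in sample_counts.items():
        contaminants = set(kit_controls_by_batch.get(sample_batch[sample], ()))
        # Set difference: SVs in the dust sample but not in its own batch control
        keep = set(counts) - contaminants
        cleaned[sample] = {sv: n for sv, n in counts.items() if sv in keep}
    return cleaned

# Hypothetical data: svB is a contaminant in batch 2 only
dust = {"dust_1": {"svA": 50, "svB": 30}, "dust_2": {"svB": 4, "svC": 9}}
batch_of = {"dust_1": 1, "dust_2": 2}
kit_controls = {1: ["svA"], 2: ["svB"]}
cleaned = remove_matched_contaminants(dust, batch_of, kit_controls)
```

Unlike total removal, `svB` survives in `dust_1` because the batch 1 kit control never contained it.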
In Figure 4, potential batch effect is solidly my secondary axis and not the primary driving force behind clustering. The primary axis (axis 1) shows a clear separation by climate zone, or location of homes, once the batch contamination has been removed. When I recolor by different variables, there is much more clustering by Treatment and almost none by batch effects. I say almost none, because some of my DNA extraction batches also happen to be Treatment batches, as they represent a subset of samples from a different location. Thus, I can’t tell if those samples cluster separately solely because of location or also because of batch effect. However, I am satisfied with the results and ready to move on.
Unlike its namesake, this tale has a happier ending.
To study DNA or RNA, there are a number of “wet-lab” (laboratory) and “dry-lab” (analysis) steps which are required to access the genetic code from inside cells, polish it to a high-sheen such that the delicate technology we rely on can use it, and then make sense of it all. Destructive enzymes must be removed, one strand of DNA must be turned into millions of strands so that collectively they create a measurable signal for sequencing, and contamination must be removed. Yet, what constitutes contamination, and when or how to deal with it, remains an actively debated topic in science. Major contamination sources include human handlers, non-sterile laboratory materials, other samples during processing, and artificial generation due to technological quirks.
Contamination from human handlers
This one is easiest to understand; we constantly shed microorganisms and our own cells, and these aerosolized cells may fall into samples during collection or processing. This might be of minimal concern working with feces, where the sheer number of microbial cells in a single teaspoon swamps the number that you might have shed into it, or it may be of vital concern when investigating house dust which not only has comparatively few cells and little diversity, but is also expected to have a large amount of human-associated microorganisms present. To combat this, researchers wear personal protective equipment (PPE) which protects you from your samples and your samples from you, and work in biosafety cabinets which use laminar air flow to prevent your microbial cloud from floating onto your workstation and samples.
Fun fact, many photos in laboratories are staged, including this one, of me as a grad student. I’m just pretending to work. Reflective surfaces, lighting, cramped spaces, busy scenes, and difficulty in positioning oneself make “action shots” difficult. That’s why many lab photos are staged, and often lack PPE.
Photo Credit: Kristina Drobny
Contamination from laboratory materials
Microbiology or molecular biology laboratory materials are sterilized before and between uses, perhaps using chemicals (ex. 70% ethanol), an ultraviolet lamp, or autoclaving which combines heat and pressure to destroy, and which can be used to sterilize liquids, biological material, clothing, metal, some plastics, etc. However, microorganisms can be tough – really tough, and can sometimes survive the harsh cleaning protocols we use. Or, their DNA can survive, and get picked up by sequencing techniques that don’t discriminate between live and dead cellular DNA.
In addition to careful adherence to protocols, some of this biologically-sourced contamination can be handled in analysis. A survey of human cell RNA sequence libraries found widespread contamination by bacterial RNA, which was attributed to environmental contamination. The paper includes an interesting discussion on how to correct this bioinformatically, as well as a perspective on contamination. Likewise, you can simply remove sequences belonging to certain taxa during quality control steps in sequence processing. There are a number of hardy bacteria that have been commonly found in laboratory reagents and are considered contaminants. The trouble is that many of these are also found in the environment, and in certain cases may be real community members. Should one throw the Bradyrhizobium out with the laboratory water bath?
Like the mythical creatures these are named for, sequence chimeras are DNA (or cDNA) strands which are accidentally created when two other DNA strands merge. Chimeric sequences can be made up of more than two DNA strand parents, but the probability of that is much lower. Chimeras occur during PCR, which takes one strand of genetic code and makes thousands to millions of copies, and is a process used in nearly all sequencing workflows at some point. If there is an uneven voltage supplied to the machine, the amplification process can hiccup, producing partial DNA strands which can concatenate and produce a new strand, which might be confused for a new species. These can be removed during analysis by comparing the first and second half of each of your sequences to a reference database of sequences. If each half matches to a different “parent”, it is deemed chimeric and removed.
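The half-and-half comparison can be illustrated with a toy Python sketch. Real chimera checkers use sequence alignment and scoring; the exact substring matching below, along with the function names and reference sequences, is a deliberate simplification I am assuming for the example.

```python
def parents(fragment, references):
    """Return the names of all reference sequences that contain the fragment exactly."""
    return {name for name, seq in references.items() if fragment in seq}

def is_chimera(read, references):
    """Flag a read whose two halves each match a reference, but never the same one."""
    mid = len(read) // 2
    left = parents(read[:mid], references)
    right = parents(read[mid:], references)
    return bool(left) and bool(right) and not (left & right)

# Hypothetical reference database with two "parent" sequences
references = {"parent_A": "AACCGGTTAACC", "parent_B": "TTGGCCAATTGG"}

genuine = "AACCGGTTAACC"           # both halves come from parent_A
chimeric = "AACCGG" + "AATTGG"     # first half from parent_A, second from parent_B

flags = {"genuine": is_chimera(genuine, references),
         "chimeric": is_chimera(chimeric, references)}
```

A read whose halves match nothing at all is not flagged here, since it may simply be a novel sequence rather than a chimera.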
Cross-sample contamination
During DNA or RNA extraction, genetic code can be flicked from one sample to another during any number of wash or shaking steps, or if droplets are flicked from fast moving pipettes. This can be mitigated by properly sealing all sample containers or plates, moving slowly and carefully controlling your technique, or using precision robots which have been programmed with exacting detail, down to the curvature of the tube used, the amount and viscosity of the liquid, and how fast you want the pipette to move, so that the computer can calculate the pressure needed to perform each task. Sequencing machines are extremely expensive, and many labs are moving towards shared facilities or third-party service providers, both of which may use proprietary protocols. This makes it more difficult to track possible contamination, as was the case in a recent study using RNA; the researchers found that much of the sample-sample contamination occurred at the facility or in shipping, and that this negatively affected their ability to properly analyze trends in the data.
Sample-sample contamination during sequencing
Sample-sample contamination during sequencing, however, is much more difficult to control. Each sequencing technology was designed with a different research goal in mind. For example, some generate an immense number of short reads to get high resolution on specific areas, while others aim to sequence the longest continuous piece of DNA possible before the reaction fails or becomes unreliable. They each come with their own quirks and potential for quality control failures.
Due to the high cost of sequencing, and the practicality that most microbiome studies don’t require more than 10,000 reads per sample, it is very common to pool samples during a run. During wet-lab processing to prepare your biological samples into a “sequencing library”, a unique piece of artificial “DNA” called a barcode, tag, or index, is added to all the pieces of genetic code in a single sample (in reality, this is not DNA but a single strand of nucleotides without any of DNA’s bells and whistles). Each of your samples gets a different barcode, and then all your samples can be mixed together in a “pool”. After sequencing the pool, your computer program can sort the sequences back into their respective samples using those barcodes.
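Sorting a pool back into samples can be illustrated with a short Python sketch. It assumes, purely for simplicity, that the barcode sits at the very start of each read and is only four bases long; the barcodes, sample names, and reads are all made up. On many platforms the index is read separately from the insert, so treat this as a picture of the idea rather than of any real demultiplexer.

```python
from collections import defaultdict

BARCODE_LENGTH = 4  # shortened for the toy example

# Hypothetical barcode-to-sample sheet.
barcode_to_sample = {"ACGT": "sample_1", "TGCA": "sample_2"}


def demultiplex(reads, barcode_to_sample, barcode_length=BARCODE_LENGTH):
    """Group reads by their leading barcode and strip the barcode off."""
    bins = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_length], read[barcode_length:]
        # Reads whose barcode is not on the sheet are kept aside, not guessed.
        sample = barcode_to_sample.get(barcode, "unassigned")
        bins[sample].append(insert)
    return dict(bins)


reads = ["ACGTGGGTTT", "TGCACCCAAA", "ACGTGGGTTA", "NNNNGGGTTT"]
bins = demultiplex(reads, barcode_to_sample)
for sample, inserts in bins.items():
    print(sample, inserts)
```

Notice that the sorting trusts the barcode completely: a read carrying the wrong barcode is filed under the wrong sample without complaint.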
While this technique has made sequencing significantly cheaper, it adds other complications. For example, Illumina MiSeq machines generate a certain number of sequence reads (about 200 million right now) which are divided up among the samples in that run (like a pie). The samples are added to a sequencing plate or flow cell (for things like Illumina MiSeq). The flow cells have multiple lanes where samples can be added; if you add a smaller number of samples to each lane, the machine will generate more sequences per sample, and if you add a larger number of samples, each one has fewer sequences at the end of the run.
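The pie analogy can be put into rough numbers with a bit of Python. The run output used here is a placeholder I picked to keep the arithmetic easy, not a specification for any machine, and the even split is an idealization: in practice some samples take a bigger slice than others.

```python
def expected_reads_per_sample(total_reads, n_samples):
    """Idealized even split of a run's reads across pooled samples."""
    return total_reads // n_samples


def max_samples_for_depth(total_reads, min_reads_per_sample):
    """Largest pool size that still gives every sample the minimum depth."""
    return total_reads // min_reads_per_sample


total_reads = 1_000_000  # hypothetical run output, for illustration only
per_sample = expected_reads_per_sample(total_reads, 96)
pool_limit = max_samples_for_depth(total_reads, 10_000)
print(per_sample, pool_limit)
```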
Cross-contamination can happen on a flow cell when the sample pool wasn’t thoroughly cleaned of adapters or primers, and there are great explanations of this here and here. To generate many copies of genetic code from a single strand, you mimic DNA replication in the lab by providing all the basic ingredients (process described here). To do that, you need to add a primer (just like with painting) which can attach to your sample DNA at a specific site and act as scaffolding for your enzyme to attach to the sample DNA and start adding bases to form a complementary strand. Adapters are just primers with barcodes and the sequencing primer already attached. Primers and adapters are small strands, roughly 10 to 50 nucleotides long, and are much shorter than your DNA of interest, which is generally 100 to 1000 nucleotides long. There are a number of methods to remove them, but if they hang around and make it to the sequencing run, they can be incorporated incorrectly and make it seem like a sequence belongs to a different sample.
This may sound easy to fix, but sequencing library preparation already goes through a lot of stringent cleaning procedures to remove everything but the DNA (or RNA) strands you want to work with. It’s so stringent that the problem of barcode swapping, also known as tag switching or index hopping, was not immediately apparent. Even when it is noted, it typically affects a small number of the total sequences. This may not be an issue if you are working with rumen samples and are only interested in sequences which represent >1% of your total abundance. But it can really be an issue in low biomass samples, such as air or dust, particularly in hospitals or clean rooms. If you were trying to determine whether healthy adults were carrying but not infected by the pathogen C. difficile in their GI tract, you would be very interested in the presence of even one C. difficile sequence and would want to be extremely sure of which sample it came from. Tag switching can be made worse by combining samples from very different sample types or genetic code targets on the same run.
There are a number of articles proposing methods of dealing with tag switching: using double tags to reduce confusion or other primer design techniques, computational correction or variance stabilization of the sequence data, identification and removal of contaminant sequences, or utilizing synthetic mock controls. Mock controls are microbial communities which have been created in the lab by mixing a few dozen microbial cultures together, and are used as a positive control to ensure your procedures are working. Because you are adding the cells to the sample yourself, you can control the relative concentrations of each species, which can act as a standard to estimate the number of cells that might be in your biological samples. Synthetic mock controls don’t use real organisms; they instead use synthetically created DNA to act as artificial “organisms”. If you find these in a biological sample, you know you have contamination. One drawback to this is that positive controls always sequence really well, much better than your low-biomass biological samples, which can mean that your samples do not generate many sequences during a run, or that tag switching is encouraged from your high-biomass samples to your low-biomass samples.
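Checking for synthetic sequences where they don’t belong is simple to sketch in Python. The sequence IDs, sample names, and counts below are invented for illustration, and the table layout (sample name mapped to per-sequence read counts) is just an assumption about how your data might be organized.

```python
# Hypothetical IDs for the synthetic sequences in the mock control.
SYNTHETIC_IDS = {"synthetic_A", "synthetic_B"}


def synthetic_reads_in_samples(counts, control_name, synthetic_ids=SYNTHETIC_IDS):
    """Return {sample: synthetic read count} for non-control samples.

    counts maps sample name -> {sequence ID -> read count}.
    Only samples with at least one synthetic read are reported.
    """
    leaks = {}
    for sample, seq_counts in counts.items():
        if sample == control_name:
            continue  # synthetic reads are expected in the control itself
        n = sum(c for seq_id, c in seq_counts.items() if seq_id in synthetic_ids)
        if n > 0:
            leaks[sample] = n
    return leaks


# Made-up read counts for one run.
counts = {
    "mock_control": {"synthetic_A": 5000, "synthetic_B": 4800},
    "dust_1": {"Staphylococcus": 120, "synthetic_A": 3},
    "dust_2": {"Corynebacterium": 95},
}
leaks = synthetic_reads_in_samples(counts, "mock_control")
print(leaks)
```

Here three synthetic reads turn up in a dust sample, which tells you some sequences crossed over from the control during the run or its preparation.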
Incorrect base calls
Cross-contamination during sequencing can also be a solely bioinformatic problem. Since many of the barcodes are only a few nucleotides long (10 or 12 being the most commonly used), if the computer misinterprets the bases it thinks were just added, it can read the barcode as a different one and attribute that sequence to the wrong sample. This may not be a problem if there aren’t many incorrect sequences generated and it falls below the threshold of what is “important because it is abundant”, but again, it can be a problem if you are looking for the presence of perhaps just a few hundred cells.
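One way to picture the barcode problem is to count how many bases differ between what the machine read and each barcode you actually used. The Python sketch below does that with made-up 10-base barcodes. The one-mismatch allowance and the rule of refusing ambiguous matches are my own illustrative choices, not settings from a specific tool.

```python
def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must be the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(observed, barcodes, max_mismatches=1):
    """Return the sample for an observed barcode, or None if unclear.

    A sample is returned only when exactly one known barcode lies within
    max_mismatches of the observed one.
    """
    hits = [
        sample
        for barcode, sample in barcodes.items()
        if hamming(observed, barcode) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical barcode sheet; these two barcodes differ at 6 positions.
barcodes = {"ACGTACGTAC": "sample_1", "TTGCATGCAA": "sample_2"}

one_error = assign_barcode("ACGTACGTAA", barcodes)   # last base miscalled
unreadable = assign_barcode("GGGGGGGGGG", barcodes)  # matches nothing closely
print(one_error, unreadable)
```

The further apart your barcodes are from one another, the more miscalled bases it takes before one sample’s barcode can pass for another’s.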
When researching environments that have very low biomass, such as air, dust, and hospital or cleanroom surfaces, there are very few microbial cells to begin with. Adding even a few dozen or several hundred cells can make a dramatic impact on what that microbial community looks like, and can confound findings.
Collectively, contamination issues can lead to batch effects, where all the samples that were processed together have similar contamination. This can be confused with an actual treatment effect if you aren’t careful in how you process your samples. For example, if all your samples from timepoint 1 were extracted, amplified, and sequenced together, and all your samples from timepoint 2 were extracted, amplified, and sequenced together later, you might find that timepoint 1 and 2 have significantly different bacterial communities. If a large number of low-abundance species were responsible for that change, you wouldn’t really know whether the community had changed subtly or whether you were seeing the collective effect of low-level contamination.
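A quick sanity check on your sample sheet can catch the timepoint example above before anything is extracted. The Python sketch below uses a made-up sample sheet and a function name of my own. It only asks whether every processing batch contains a single group, which is the worst-case layout where batch and treatment cannot be told apart.

```python
from collections import defaultdict


def batch_confounded_with_group(samples):
    """True if there are several groups and every batch holds only one of them.

    samples is a list of (sample_id, group, batch) tuples.
    """
    groups_in_batch = defaultdict(set)
    all_groups = set()
    for _sample_id, group, batch in samples:
        groups_in_batch[batch].add(group)
        all_groups.add(group)
    return len(all_groups) > 1 and all(
        len(groups) == 1 for groups in groups_in_batch.values()
    )


# Each timepoint processed in its own run: batch and timepoint are inseparable.
by_timepoint = [
    ("s1", "timepoint_1", "run_A"),
    ("s2", "timepoint_1", "run_A"),
    ("s3", "timepoint_2", "run_B"),
    ("s4", "timepoint_2", "run_B"),
]
# Timepoints mixed across runs: a batch effect would hit both groups alike.
mixed = [
    ("s1", "timepoint_1", "run_A"),
    ("s2", "timepoint_2", "run_A"),
    ("s3", "timepoint_1", "run_B"),
    ("s4", "timepoint_2", "run_B"),
]
confounded_flag = batch_confounded_with_group(by_timepoint)
mixed_flag = batch_confounded_with_group(mixed)
print(confounded_flag, mixed_flag)
```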
Stay tuned for a piece on batch effects in sequencing!