(Reblog) A Tale of Two Cohorts

Original posting on BioBE.

Sequence data contamination from biological or digital sources can obscure true results and falsely raise one’s hopes.  Contamination is a persistent issue in microbial ecology, and each experiment faces unique challenges from a myriad of sources, which I have previously discussed.  In microbiology, those microscopic stowaways and spurious sequencing errors can be difficult to identify as non-sample contaminants, and collectively they can create large-scale changes to what you think a microbial community looks like.

Samples from large studies are often processed in batches, based on how many samples certain laboratory equipment can handle at once, and if these batches span multiple bottles of reagents or water-filtration systems, each batch might end up with a unique contamination profile.  If your samples are not randomized between batches, and each batch ends up representing a specific time point or treatment from your experiment, these batch effects can be mistaken for a treatment effect (a.k.a. a false positive).

Due to the high cost of sequencing, and the technical and analytical artistry required for contamination identification and removal, batch effects have long plagued molecular biology and genetics.  Only recently have the pathologies of batch effects been revealed in a harsher light, thanks to more sophisticated analysis techniques (examples here and here and here) and projects dedicated to tracking contamination through a laboratory pipeline.  To further complicate the issue, the sources of and practical responses to contamination in fungal data sets are quite different from those in bacterial data sets.

Chapter 1

“The times were statistically greater than prior time periods, while simultaneously being statistically lesser to prior times, according to longitudinal analysis.”

Over the past year, I analyzed a particularly complex bacterial 16S rRNA gene sequence data set, comprising nearly 600 home dust samples and about 90 controls.  Samples were collected from three climate regions in Oregon over a span of one year, in which homes were sampled before and approximately six weeks after a home-specific weatherization improvement (treatment homes), or simply six weeks apart in (comparison) homes which were eligible for weatherization but did not receive it.  As these samples were collected over a span of a year, they were extracted with two different kits and across multiple DNA extraction batches, although all within a short time after collection.  The extracted DNA was spread across two sequencing runs to allow data processing to begin on cohort 1 while we waited for cohort 2 homes to be weatherized.  Thus, there were many opportunities to introduce technical error or biological contamination that could be conflated with treatment effects.

On top of this, each home was unique, with its own human and animal occupants, architectural and interior design, plants, compost, and quirks, and we didn’t ask homeowners to modify their behavior in any way.  This was important, as it meant each of the homes – and their microbiomes – was somewhat unique.  Therefore, I didn’t want to remove sequences which might be contaminants on the basis of low abundance and risk removing microbial community members which were specific to that home.  After the typical quality assurance steps to curate and process the data, which can be found on GitHub as an R script of a DADA2 package workflow, I needed to decide what to do with the negative controls.

Because sequencing is expensive, most of the time there is only one negative control included in sequencing library preparation, if that.  The negative control is a blank sample – just water, or an unused swab – which does not intentionally contain cells or nucleic acids, so anything you find there will have come from contamination.  The negative control can be used to set a baseline for read counts – if you find 1,000 sequences in the negative control, which is supposed to have no DNA in it, then you might only continue looking at samples with substantially more than 1,000 sequences.  This risks throwing out valid sequences that happen to be rare.  Alternatively, you can try to identify the contaminants and remove whole taxa from your data set, risking the complete removal of valid taxa.
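As a rough illustration of that thresholding idea (not the workflow used in this study), here is a minimal R sketch using phyloseq; the object name ps, the metadata column sample_type, and the safety factor are all hypothetical placeholders.

library(phyloseq)

# Deepest negative control sets the floor; 'ps' and 'sample_type' are assumed names.
neg        <- prune_samples(sample_data(ps)$sample_type == "negative", ps)
neg_depth  <- max(sample_sums(neg))
keep_depth <- 10 * neg_depth   # arbitrary safety factor, tune to your data

# Keep only samples sequenced well above that floor.
ps_filtered <- prune_samples(sample_sums(ps) > keep_depth, ps)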

I had three types of negative controls: sterile DNA swabs which were processed to check for biological contamination in collection materials, kit controls where a blank extraction was run for each batch of extractions to test for biological contamination in extraction reagents, and PCR negative controls to check for DNA contamination of PCR reagents.  In total, 90 control samples were sequenced, giving me unprecedented resolution to deal with contamination.  Looking at the total number of sequences before and after my quality-assurance processing, I can see that the number of sequences in my negative controls drops dramatically; those reads were low-quality in some way and might be sequencing artifacts.  But an unsatisfying number remain after QA filtering; these are high-quality and likely come from microbial contamination.


I wasn’t sure how I wanted to deal with each type of control.  I came up with three approaches, and then looked at unweighted, non-rarefied ordination plots (PCoA) to watch how my axes changed based on important components (factors).  What follows is a narrative summary of what I did, but I included the R script of my phyloseq package workflow and workaround on GitHub.
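For readers who want to see what that looks like in code, here is a minimal sketch of a PCoA ordination in phyloseq (not the exact script, which lives on GitHub); the object ps and the metadata columns Location and ExtractionBatch are assumed names, and the distance metric is a stand-in for whichever one the analysis actually used.

library(phyloseq)
library(ggplot2)

# 'ps' is a hypothetical phyloseq object holding the processed SV table.
dist_bin <- phyloseq::distance(ps, method = "jaccard", binary = TRUE)
ord      <- ordinate(ps, method = "PCoA", distance = dist_bin)

# Shape points by sampling location, color them by DNA extraction batch.
plot_ordination(ps, ord, shape = "Location", color = "ExtractionBatch") +
  geom_point(size = 3)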

Chapter 2

“In microbial ecology, preprints are posted on late November nights. The foreboding atmosphere of conflated factors makes everyone uneasy.”

Ordination plots visualize lots of complex communities together.  In both ordination figures below, each point on the graph represents a dust sample from one house.  They are clustered by community distance: those closer together on the plot have a more similar community than points which are further away from each other.  The points are shaped by the location of the samples, including Bend, Eugene, and Portland, along with a few pilot samples labeled “Out”, and negative controls which have no location (not pictured but listed as NA).  The points are colored by DNA extraction batch.

Figure 1 Ordination (PCoA) of cohort 1 home samples prior to removing contaminants found in negative controls.

In Figure 1, the primary axis (axis 1) shows a clear clustering of samples by DNA extraction batch, but this is also mixed with geographic location, and as it turns out – date of collection and sequencing run.  We know from other studies that geographic location, date of collection, and sequencing batch can all affect the microbial community.

Approach 1: Subtraction + outright removal

This approach subsets my data into DNA extraction batches, and then uses the number of sequences found in the negative controls to subtract out sequences from my dust samples.  This assumes that if a particular sequence showed up 10 times in my negative control, but 50 times in my dust samples, that only 40 of those in my dust sample were real. For each of my DNA extraction batch negative control samples, I obtained the sum of each potential contaminant that I found there, and then subtracted those sums from the same sequence columns in my dust samples.
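As a sketch of what this subtraction looks like (my actual workaround is on GitHub; this is a simplified stand-in), the snippet below works on a plain counts matrix with samples in rows and SVs in columns; the objects counts, batch, and is_neg are hypothetical names for the count table, the extraction-batch labels, and a flag marking negative controls.

# Approach 1, simplified: per batch, subtract the negative-control counts
# from every dust sample, flooring at zero so no count goes negative.
cleaned <- counts
for (b in unique(batch)) {
  in_batch <- batch == b
  neg_sums <- colSums(counts[in_batch & is_neg, , drop = FALSE])
  dust_idx <- in_batch & !is_neg
  cleaned[dust_idx, ] <-
    pmax(sweep(counts[dust_idx, , drop = FALSE], 2, neg_sums, "-"), 0)
}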

Figure 2 Ordination of home samples after removing contaminants found in negative controls, particular to each batch, using approach 1.

Approach 1 was alright, but there was still an effect of DNA extraction batch (indicated by the color scale) that was stronger than location or treatment (not included on this graph).  This approach is also more pertinent when working with OTUs, or in situations where you wouldn’t want to remove the whole OTU, just subtract out a certain number of sequences from specific columns.  There is currently no way to do that directly in phyloseq, so I made a work-around (see the GitHub page).  However, DADA2 gives you Sequence Variants, which are more precise, and I found it better to remove those outright with approach 3.

Approach 2: Total Removal

This approach removes any contaminant sequence that is found in ANY of the negative controls from ALL the house samples, regardless of which negative control went with which extraction batch.  This approach assumes that if a sequence was found as a contaminant in a negative control somewhere, then it is a contaminant everywhere.
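In phyloseq terms, approach 2 boils down to a few lines; this is a hedged sketch rather than my exact script, with ps and the sample_type column as assumed names.

library(phyloseq)

# Every SV with at least one read in any negative control is treated as a contaminant.
neg        <- prune_samples(sample_data(ps)$sample_type == "negative", ps)
contam_svs <- taxa_names(prune_taxa(taxa_sums(neg) > 0, neg))

# Drop those SVs from the whole data set, then drop the controls themselves.
ps_clean <- prune_taxa(setdiff(taxa_names(ps), contam_svs), ps)
ps_clean <- prune_samples(sample_data(ps_clean)$sample_type != "negative", ps_clean)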

Figure 3 Ordination of home samples after removing contaminants found in any negative control, using approach 2.

Once again, approach 2 was alright, and the potential batch effect that was my primary axis (axis 1) is now my secondary axis; there is still an effect of DNA extraction batch (indicated by the color scale), but it is weaker.  When I recolor by different variables, there is much more clustering by Treatment than by any batch effects.  However, that second axis is also one of my time variables, so I don’t want to get rid of all of the variation on that axis.  Moreover, since my negative kit controls showed a lot of variation in the number and types of taxa, I don’t want to remove everything found there from all samples indiscriminately.

Additionally, I don’t favor throwing sequences out just because they were a contaminant somewhere, particularly for dust samples. Contamination can be situational, particularly if a microbe is found in the local air or water supply and would be legitimately found in house dust but would have also accidentally gotten into the extraction process.

Approach 3: “To each its own”

This approach removes PCR and swab contaminant SVs fully from each cohort, respectively, and removes extraction kit contaminants fully from each DNA extraction batch, respectively.  I took all the sequences of the SVs found in my dust samples and made them into a vector (list), and then I took all the sequences of the SVs found in my controls and made them into a different vector.  I effectively subtracted out the contaminant SVs by name, by asking for the sequences which differed between my two lists (thus returning the sequences which were in my dust samples but not in my control samples).  I did this respective to each sequencing cohort and batch, so that I only removed the pertinent sequences (e.g. using kit control 1 to subtract from DNA extraction batch 1).
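A simplified sketch of that per-batch bookkeeping is below (the full version, including the per-cohort PCR and swab controls, is in the GitHub script); ps, ExtractionBatch, and sample_type are assumed object and column names, and the same pattern would be repeated per cohort for the other control types.

library(phyloseq)

# Remove the SVs found in one batch's kit control only from that batch's samples.
clean_batch <- function(ps_all, b) {
  ps_b   <- prune_samples(sample_data(ps_all)$ExtractionBatch == b, ps_all)
  is_neg <- sample_data(ps_b)$sample_type == "kit_control"
  neg    <- prune_samples(is_neg, ps_b)
  contam <- taxa_names(prune_taxa(taxa_sums(neg) > 0, neg))
  ps_b   <- prune_taxa(setdiff(taxa_names(ps_b), contam), ps_b)
  prune_samples(!is_neg, ps_b)   # keep only this batch's dust samples
}

batches  <- unique(sample_data(ps)$ExtractionBatch)
ps_clean <- do.call(merge_phyloseq, lapply(batches, function(b) clean_batch(ps, b)))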

Figure 4 Ordination (PCoA) of cohort 1 home samples after removing contaminants found in negative controls, particular to each batch, using approach 3.

In Figure 4, potential batch effect is solidly my secondary axis and not the primary driving force behind clustering. The primary axis (axis 1) shows a clear separation by climate zone, or location of homes, once the batch contamination has been removed.  When I recolor by different variables, there is much more clustering by Treatment and almost none by batch effects. I say almost none, because some of my DNA extraction batches also happen to be Treatment batches, as they represent a subset of samples from a different location. Thus, I can’t tell if those samples cluster separately solely because of location or also because of batch effect. However, I am satisfied with the results and ready to move on.

Unlike its namesake, this tale has a happier ending.

The Fine Art of Finding Scientific Information

Not a day goes by that I don’t search for information, and whether that information is a movie showtime or the mechanism by which a bacterial species is resistant to zinc toxicity, I need that information to be accurate. In the era of real fake-news and fake real-news, mockumentaries, and misinformation campaigns, the ability to find accurate and unbiased information is more important than ever.

Yet, assessing the validity of information and verifying sources is an under-appreciated and under-taught skill.  There are some great resources available for determining the reliability (whether the same results are achieved each time) and validity (whether it is a real effect) of a dataset, as well as of the authors.  Even with fact-evaluation resources available through the National Center for Complementary and Integrative Health (NCCIH), the University of Edinburgh, the Georgetown University Library, or Michigan State University, like any skill, finding information takes practice.

Where do I go for Science Information?

Thanks to the massive shift towards digital archiving and open-access online journals, nearly all of my information hunting is done online (and an excellent reason why Net Neutrality is vital to researchers). Most of the time, this information is in the form of  scientific journal articles or books online, and finding this information can be accomplished by using regular search engines. In particular, Google has really pushed to improve its ability to index scientific publications (critical to Google Scholar and Paperpile).

However, it takes skill to compose your search request to find accurate results. I nearly always add “journal article” or “scientific study” to the end of my query because I need the original sources of information, not popular media reports on it. This cuts out A LOT of inaccuracy in search results. If I’m looking for more general information, I might add “review” to find scientific papers which broadly summarize the results of dozens to hundreds of smaller studies on a particular topic. If I have no idea where to begin and need basic information on what I’m trying to look for, I will try my luck with a general search online or even Wikipedia (scientists have made a concerted effort to improve many science-related entries). This can help me figure out the right terminology to phrase my question.

How do I know if it’s accurate?

One of the things I’m searching for when looking for accurate sources is peer-review.  Typically, scientific manuscripts submitted to reputable journals are reviewed by 1 – 3 other authorities in that field, more if the paper goes through several journal submissions. The reviewers may know who the authors are, but the authors don’t know their reviewers until at least after publication, and sometimes never. This single-blind (or double-blind if the reviewers can’t see the authors’ names) process allows for manuscripts to be reviewed, edited, and challenged before they are published. Note that perspective or opinion pieces in journals are typically not peer-reviewed, as they don’t contain new data, just interpretation. The demand for rapid publishing rates and the rise of predatory journals has led some outlets to publish without peer-review, and I avoid those sources. The reason is that scientists might not see the flaws or errors in their own study, and having a third party question your results improves your ability to communicate those results accurately.

Image credit: Kriegeskorte, 2012

Another way to assess the validity of an article is the inclusion of correct control groups.  The control group acts as a baseline against which you can measure your treatment effects: its subjects go through the same experimental parameters except that they don’t receive an active treatment.  Instead, the group receives a placebo, because you want to make sure that the acts of experimentation and observation themselves do not lead to a reaction – The Placebo Effect.  The Placebo Effect is a very real thing and can really throw off your results when working with humans.

Similarly, one study does not a scientific law make. Scientific results can be situational, or particular to the parameters in that study, and might not be generalizable (applicable to a broader audience or circumstances). It often takes dozens if not a hundred studies to get at the underlying mechanisms of an experimental effect, or to show that the effect is reliably recreated across experiments.

Data or it didn’t happen.  I can’t stress this one enough.  Making a claim, statement, or conclusion is hollow until you have supplied observations to prove it.  This is a really common problem in internet-based arguments, as people put forth references as fact when they are actually opinionated speeches or videos that don’t list their sources.  These opinionated speeches have their place; I post a lot of them myself.  They often say what I want to say in a much more eloquent manner.  Unfortunately, they are not data and can’t prove your point.

The other reason you need data to match your statements is that in almost all scientific articles, the authors include speculation and working theories in the Discussion section.  This is meant to provide context to the study, ponder the broader meaning, or identify things which need to be verified in future studies.  But often these statements are repeated in other articles as if they were facts which were evaluated in the first article, and the ideas get perpetuated as proven facts instead of as theories to be tested.  This often happens when the Discussion section of an article is hidden behind a paywall and you end up taking the second paper’s word for what happened in the first paper.  It’s only when the claim is traced all the way back to the original article that you find that someone mistook thought supposition for data exposition.

The “Echo Chamber Effect” is also prominent when it comes to translating scientific articles into news publications, a great example of which is discussed by 538.  Researchers mapped the genomes of about 30 transgender individuals – roughly half male-to-female and half female-to-male – to get an idea of whether gender identity could be described with a nuanced genetic fingerprint rather than a binary category.  This is an extremely small sample group, and the paper was more about testing the idea and suggesting some genes which could be used for the fingerprint.  In the mix-up, comments about the research were attributed to a journalist at 538 – comments that the journalist had not made – and this error was perpetuated when further news organizations used other news publications as the source instead of conducting their own interview or referencing the publication.  In addition, the findings and impact of the study were wrongly reported – it was stated that researchers had identified 7 genes as your gender fingerprint, which is a gross exaggeration of what the original research article was really about.  When possible, try to trace information back to its origin, and get comments straight from the source.

How do I know if it’s unbiased?

This can be tricky, as there are a number of ways someone can have a conflict of interest.  One giveaway is tone, as scientific texts are supposed to remain neutral.  You can also check the author affiliations (who they are and what institution they are at), the conflict of interest section, and the disclosure of funding sources or acknowledgements sections, all of which are common inclusions in scientific papers.  “Following the money” is a particularly good way of determining whether there is bias involved, depending on the reputation of the publisher.

When in doubt, try asking a librarian

There are a lot of resources online and in-person to help you find accurate information, and public libraries and databases are free to use!

Figure 7: Guadamillas Gómez, 2017.

 

Introduction to Mammalian Microbiomes

Since the end of September, I’ve been teaching a course for the UO Clark Honors College: Introduction to Mammalian Microbiomes.  And in a novel challenge for me – I’m teaching the idea of complex, dynamic microbial ecosystems and their interaction with animal hosts … to non-majors.  My undergraduate students almost entirely hail from the humanities and liberal arts, and I couldn’t be more pleased.  So far, it’s been a wonderful opportunity for me to pilot a newly developed course, improve my teaching skills, and flex my creativity, both in how I explain concepts and how I design course objectives.

I enthusiastically support efforts towards science communication, especially in making science more accessible to a wider audience.  My students likely won’t be scientific researchers themselves, but some will be reporting on science publications, or considering funding bills, and all of them are exposed to information about human-associated microbial communities from a variety of sources. To navigate the complicated and occasionally conflicting deluge of information online about the human microbiome, my students will need to build skills in scientific article reading comprehension, critical thinking, and discussion.  To that end, many of my assignments are designed to engage students in these skills.

I feel that it’s important to teach not only what we know about the microbial communities living in the mouth or on the skin, but also the technologies that provide that knowledge, and how that technology has informed our working theories and understanding of microbiology over centuries.  Importantly, I hope to teach them that science, and the health sciences, are not static fields; we are learning new things every day.  I don’t just teach about what science has done right, but I try to put our accomplishments in the context of the years and personnel it took to achieve those publications, or the counter-theories that were posited and disproved along the way.

And most of all, I want the course to be engaging, interesting, and thought-provoking.  I encourage class discussions and student questions as they puzzle through complex theories, and I’ve included a few surprise additions to the syllabus along the way.  Yesterday, University of Oregon physics Ph.D. student Deepika Sundarraman taught us about her research in Dr. Parthasarathy’s lab, using light sheet fluorescence microscopy to visualize bacterial communities in the digestive tract of larval zebrafish!  Stay tuned for more fun in #IntroMammalianMicrobiomes!

 

I am now an Academic Editor at PLOS ONE!

I am pleased to announce I have joined the PLOS ONE journal editorial team as an Academic Editor! I am particularly pleased to have joined with PLOS ONE, as the journal is dedicated to open-access publication and is readily accessible to the general public.

As I discussed previously, the effort of many individuals goes into a scientific manuscript, including ad-hoc reviewers and editorial staff at journals.  As an Academic Editor, I will now interface between scientific reviewers and higher-level editorial staff to manage the peer-review process, including evaluating manuscript submissions for applicability to the journal, selecting reviewers, and assessing reviewer comments to make editorial decisions on publishing.

My subject areas include: Biology and life sciences, Agriculture, Animal products, Agricultural production, Animal management, Ecology, Microbial ecology, Microbiology, Applied microbiology, Bacteriology, Molecular biology, Molecular biology techniques, Molecular genetics.

 

Where Art Meets Science

The Nature Lab at Rhode Island School of Design is presenting an exhibition on the interface of biology and art; Biodesign: From Inspiration to Integration.  Curated in collaboration with William Myers, the show is part of their 80th anniversary celebrations. The exhibition runs from Aug 25—Sept 27 at the Woods Gerry Gallery, and will feature photos and sampling equipment from the Biology and the Built Environment Center.

 

Biodesign: From Inspiration to Integration

OPENING RECEPTION

An exhibition curated by William Myers and the RISD Nature Lab, this show features the following works:

Hy-Fi and Bio-processing Software—David Benjamin / The Living
Mycelium architecture, made in collaboration with Ecovative and 3M.

Zoa—Natalia Krasnodebska / Modern Meadow
Leather grown using yeasts that secrete collagen, and grown completely without animal derivatives.

The Built Environment Microbiome—BioBE Center / Jessica Green, Sue Ishaq and Kevin Van Den Wymelenberg
The BioBE conducts research into the built environment microbiome, mapping the indoor microbiome, with an eye towards pro-biotic architecture.

Zea Mays / Cultivar Series—Uli Westphal
Newly commissioned corn study, this project highlights maize’s evolution through interaction with humans.

Harvest / Interwoven—Diana Scherer
Artist coaxes plant root systems into patterns.

Fifty Sisters & Morphogenesis—Jon McCormack
Artist algorithmically generates images that mimic evolutionary growth, but tweaks them to include aesthetics of the logos of global petroleum producing corporations.

Organ on a Chip—Wyss Institute
Wyss Institute creates microchips that recapitulate the functions of living human organs, offering a potential alternative to animal testing.

AgroDerivatives: Permutations on Generative Citizenship—Mae-Ling Lokko
This project proposes labor, production criteria and circulation of capital within agrowaste/bioadhesive upcycling ecosystems.

New Experiments in Mycelium—Ecovative
Ecovative makes prototypes of mycelium items such as insulation, soundproofing tiles, surfboards, lampshades.

Bistro in Vitro—Next Nature Network
Performance with speculative future foods samples. The installation will include video screens and a cookbook on a table display.

Raw Earth Construction—Miguel Ferreira Mendes
This project highlights an ancient technique that uses soil, focusing on how soil is living.

Burial Globes: Rat Models—Kathy High
This project presents glass globes that hold the ashes of the five HLA-B27 transgenic rats, each one named and remembered: Echo, Flowers, Tara, Matilda, Star.

To Flavour Our Tears—Center for Genomic Gastronomy
Set up as an experimental restaurant, this project places humans back into the foodchain — investigating the human body as a food source for other species.

Blood Related—Basse Stittgen
A series of compressed blood objects—inspired by Hemacite objects made from blood/sawdust compressed in a process invented in the late 19th century—highlights bloodwaste in the slaughterhouse industry.

Silk Poems—Jen Bervin
A poem made from a six-character chain represents the DNA structure of silk; it refers to the silkworm’s construction of a cocoon, and addresses the ability of silk to be used as a biosensor implanted under people’s skin.

Zoe: A Living Sea Sculpture—Colleen Flanigan
Zoe is an underwater structure, part of coral restoration research, that regenerates corals in areas highly impacted by hurricanes, human activity and pollution.

Aquatic Life Forms—Mikhail Mansion
Computationally generated lifeforms animated using motion-based data captured from Aurelia aurita.

Algae Powered Digital Clock—Fabienne Felder
By turning electrons produced during photosynthesis and bacterial digestion into electricity, algae will be used to power a small digital clock.

A Place for Plastics—Megan Valanidas
This designer presents a new process of making bioplastics that are bio-based, biodegradable, AND compostable.

Data Veins & Flesh Voxels—Ani Liu
This project explores how technology influences our notion of being human from different points of view, with a focus on exploring the relationship between our bodies as matter and as data.

Pink Chicken Project—Studio (Non)human (Non)sense/ Leo Fidjeland & Linnea Våglund
By changing the color of chickens to pink, this project rejects the current violence inflicted upon the non-human world and poses questions of the impact and power of synthetic biology.

August 24
5:30 pm – 7:30 pm
Woods Gerry Gallery, 62 Prospect Street, Providence, RI 02903

(Reblog) A perspective on tackling contamination in microbial ecology

Original posting from BioBE.

To study DNA or RNA, there are a number of “wet-lab” (laboratory) and “dry-lab” (analysis) steps which are required to access the genetic code from inside cells, polish it to a high-sheen such that the delicate technology we rely on can use it, and then make sense of it all.  Destructive enzymes must be removed, one strand of DNA must be turned into millions of strands so that collectively they create a measurable signal for sequencing, and contamination must be removed.  Yet, what constitutes contamination, and when or how to deal with it, remains an actively debated topic in science. Major contamination sources include human handlers, non-sterile laboratory materials, other samples during processing, and artificial generation due to technological quirks.

Contamination from human handlers

This one is the easiest to understand: we constantly shed microorganisms and our own cells, and these aerosolized cells may fall into samples during collection or processing.  This might be of minimal concern when working with feces, where the sheer number of microbial cells in a single teaspoon swamps the number that you might have shed into it, or it may be of vital concern when investigating house dust, which not only has comparatively few cells and little diversity, but is also expected to have a large amount of human-associated microorganisms present.  To combat this, researchers wear personal protective equipment (PPE), which protects you from your samples and your samples from you, and work in biosafety cabinets, which use laminar air flow to prevent your microbial cloud from floating onto your workstation and samples.

Fun fact: many photos in laboratories are staged, including this one of me as a grad student.  I’m just pretending to work.  Reflective surfaces, lighting, cramped spaces, busy scenes, and the difficulty of positioning oneself make “action shots” difficult.  That’s why many lab photos are staged, and often lack PPE.

Photo credit: Kristina Drobny

Contamination from laboratory materials

Microbiology or molecular biology laboratory materials are sterilized before and between uses, perhaps using chemicals (ex. 70% ethanol), an ultraviolet lamp, or autoclaving, which combines heat and pressure to destroy microorganisms and can be used to sterilize liquids, biological material, clothing, metal, some plastics, etc.  However, microorganisms can be tough – really tough – and can sometimes survive the harsh cleaning protocols we use.  Or their DNA can survive, and get picked up by sequencing techniques that don’t discriminate between DNA from live and dead cells.

In addition to careful adherence to protocols, some of this biologically-sourced contamination can be handled in analysis.  A survey of human cell RNA sequence libraries found widespread contamination by bacterial RNA, which was attributed to environmental contamination.  The paper includes an interesting discussion on how to correct this bioinformatically, as well as a perspective on contamination.  Likewise, you can simply remove sequences belonging to certain taxa during quality control steps in sequence processing.  There are a number of hardy bacteria that have been commonly found in laboratory reagents and are considered contaminants; the trouble is that many of these are also found in the environment, and in certain cases may be real community members.  Should one throw the Bradyrhizobium out with the laboratory water bath?
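One hedged illustration of that kind of taxon-level removal, using phyloseq and assuming genus-level taxonomy has been assigned; the object ps and the example genus list are placeholders, not a recommendation of what to remove from any particular data set.

library(phyloseq)

# Genera frequently reported as reagent contaminants; treat this list as an example only.
reagent_genera <- c("Bradyrhizobium", "Ralstonia", "Burkholderia")

# Drop SVs assigned to those genera; whether that is appropriate depends on the study.
ps_clean <- subset_taxa(ps, !Genus %in% reagent_genera)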

Chimeras

Like the mythical creatures they are named for, sequence chimeras are DNA (or cDNA) strands which are accidentally created when two other DNA strands merge.  Chimeric sequences can be made up of more than two parent DNA strands, but the probability of that is much lower.  Chimeras occur during PCR, which takes one strand of genetic code and makes thousands to millions of copies, a process used in nearly all sequencing workflows at some point.  If the amplification process hiccups, for example because of an uneven voltage supply to the machine, it can produce partial DNA strands which can concatenate into a new strand that might be confused for a new species.  These can be removed during analysis by comparing the first and second half of each of your sequences to a reference database of sequences.  If each half matches to a different “parent”, the sequence is deemed chimeric and removed.
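In a DADA2 workflow like the one mentioned earlier in this post, the analogous step is a single function call; note that removeBimeraDenovo screens each sequence variant against more abundant “parents” within your own data rather than an external reference database, and seqtab is an assumed name for the sequence table.

library(dada2)

# Flag and drop SVs whose left and right portions exactly match two different,
# more abundant parent sequences in the same data set.
seqtab_nochim <- removeBimeraDenovo(seqtab, method = "consensus",
                                    multithread = TRUE, verbose = TRUE)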

Chimeric DNA (splicing by overlap extension PCR)

Cross-sample contamination

During DNA or RNA extraction, genetic material can be flicked from one sample to another during any number of wash or shaking steps, or if droplets are flicked from fast-moving pipettes.  This can be mitigated by properly sealing all sample containers or plates, moving slowly and carefully controlling your technique, or using precision robots which have been programmed in exacting detail – down to the curvature of the tube used, the amount and viscosity of the liquid, and how fast you want the pipette to move, so that the computer can calculate the pressure needed to perform each task.  Sequencing machines are extremely expensive, and many labs are moving towards shared facilities or third-party service providers, both of which may use proprietary protocols.  This makes it more difficult to track possible contamination, as was the case in a recent study using RNA; the researchers found that much of the sample-to-sample contamination occurred at the facility or in shipping, and that this negatively affected their ability to properly analyze trends in the data.

Sample-sample contamination during sequencing

Sample-sample contamination during sequencing, however, is much more difficult to control.  Each sequencing technology was designed with a different research goal in mind; for example, some generate an immense number of short reads to get high resolution on specific regions, while others aim to get the longest continuous piece of DNA sequenced as possible before the reaction fails or becomes unreliable.  They each come with their own quirks and potential for quality control failures.

Due to the high cost of sequencing, and the practicality that most microbiome studies don’t require more than 10,000 reads per sample, it is very common to pool samples during a run.  During wet-lab processing to prepare your biological samples into a “sequencing library”, a unique piece of artificial “DNA” called a barcode, tag, or index, is added to all the pieces of genetic code in a single sample (in reality, this is not DNA but a single strand of nucleotides without any of DNA’s bells and whistles).  Each of your samples gets a different barcode, and then all your samples can be mixed together in a “pool”.  After sequencing the pool, your computer program can sort the sequences back into their respective samples using those barcodes.
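As a toy illustration of that demultiplexing step (real pipelines do this on the instrument or with dedicated tools, usually with some error tolerance; all names below are made up):

# Look up each read's barcode in a sample map; unmatched barcodes go to an 'undetermined' bin.
barcode_map <- c(ACGTACGT = "sample_1", TGCATGCA = "sample_2")

demultiplex <- function(read_barcodes, map) {
  assigned <- unname(map[read_barcodes])        # exact lookup by barcode
  assigned[is.na(assigned)] <- "undetermined"   # no match -> undetermined bin
  assigned
}

demultiplex(c("ACGTACGT", "TGCATGCA", "AAAAAAAA"), barcode_map)
# [1] "sample_1"     "sample_2"     "undetermined"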

While this technique has made sequencing significantly cheaper, it adds other complications.  For example, Illumina MiSeq machines generate a fixed number of sequence reads per run, which are divided up among the samples in that run (like a pie).  The samples are added to a sequencing plate or flow cell (for things like Illumina MiSeq).  The flow cells have one or more lanes where samples can be added; if you add a smaller number of samples to each lane, the machine will generate more sequences per sample, and if you add a larger number of samples, each one has fewer sequences at the end of the run.

Illumina GAIIx for high-throughput sequencing.

Cross-contamination can happen on a flow cell when the sample pool wasn’t thoroughly cleaned of adapters or primers, and there are great explanations of this here and here.  To generate many copies of genetic code from a single strand, you mimic DNA replication in the lab by providing all the basic ingredients (process described here).  To do that, you need to add a primer (just like with painting) which can attach to your sample DNA at a specific site and act as scaffolding for your enzyme to attach to the sample DNA and start adding bases to form a complementary strand.  Adapters are just primers with barcodes and the sequencing primer already attached.  Primers and adapters are small strands, roughly 10 to 50 nucleotides long, and are much shorter than your DNA of interest, which is generally 100 to 1,000 nucleotides long.  There are a number of methods to remove them, but if they hang around and make it to the sequencing run, they can be incorporated incorrectly and make it seem like a sequence belongs to a different sample.

DNA purification

 

Barcode swapping

This may sound easy to fix, but sequencing library preparation already goes through a lot of stringent cleaning procedures to remove everything but the DNA (or RNA) strands you want to work with.  It’s so stringent that the problem of barcode swapping, also known as tag switching or index hopping, was not immediately apparent.  Even when it is noted, it typically affects a small fraction of the total sequences.  This may not be an issue if you are working with rumen samples and are only interested in sequences which represent >1% of your total abundance.  But it can really be an issue in low-biomass samples, such as air or dust, particularly in hospitals or clean rooms.  If you were trying to determine whether healthy adults were carrying, but not infected by, the pathogen C. difficile in their GI tract, you would be very interested in the presence of even one C. difficile sequence and would want to be extremely sure of which sample it came from.  Tag switching can be made worse by combining samples from very different sample types or genetic code targets on the same run.

There are a number of articles proposing methods of dealing with tag switching, using double tags to reduce confusion or other primer design techniques, computational correction or variance stabilization of the sequence data, identification and removal of contaminant sequences, or utilizing synthetic mock controls.  Mock controls are microbial communities which have been created in the lab by mixing a few dozen microbial cultures together, and are used as a positive control to ensure your procedures are working.  Because you are adding the cells to the sample yourself, you can control the relative concentrations of each species, which can act as a standard to estimate the number of cells that might be in your biological samples.  Synthetic mock controls don’t use real organisms; they instead use synthetically created DNA to act as artificial “organisms”.  If you find these in a biological sample, you know you have contamination.  One drawback to this is that positive controls always sequence really well, much better than your low-biomass biological samples, which can mean that your samples do not generate many sequences during a run, or that tag switching is encouraged from your high-biomass samples to your low-biomass samples.
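If mock controls were sequenced alongside the biological samples, a quick check for that kind of leakage might look like this in phyloseq; ps, sample_type, and the "mock"/"dust" labels are assumed names, and this is a sketch rather than a published method.

library(phyloseq)

# SVs present in the mock control(s).
mock     <- prune_samples(sample_data(ps)$sample_type == "mock", ps)
mock_svs <- taxa_names(prune_taxa(taxa_sums(mock) > 0, mock))

# Count mock-derived reads showing up in dust samples; non-zero totals hint at index hopping.
dust <- prune_samples(sample_data(ps)$sample_type == "dust", ps)
leak <- sample_sums(prune_taxa(mock_svs, dust))
leak[leak > 0]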

Incorrect base calls

Cross-contamination during sequencing can also be a purely bioinformatic problem – since many of the barcodes are only a few nucleotides long (10 or 12 being the most commonly used), if the computer misinterprets the bases it thinks were just added, it can interpret the barcode as a different one and attribute that sequence to a different sample than the one it came from.  This may not be a problem if there aren’t many incorrect sequences generated and they fall below the threshold of what is “important because it is abundant”, but again, it can be a problem if you are looking for the presence of perhaps just a few hundred cells.
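One small way to reason about this risk is to look at how different your barcodes are from each other; here is a base-R sketch (the barcodes are made up) that computes the minimum pairwise Hamming distance, i.e. the fewest miscalled bases that could turn one barcode into another.

# Number of positions at which two equal-length barcodes differ.
hamming <- function(a, b) sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])

barcodes <- c("ACGTACGTAC", "ACGTACGTAG", "TTGCAATCGG")   # made-up example set
dists <- combn(barcodes, 2, function(p) hamming(p[1], p[2]))
min(dists)   # 1 here: a single bad base call could swap the first two barcodes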

Implications

When researching environments that have very low biomass, such as air, dust, and hospital or cleanroom surfaces, there are very few microbial cells to begin with.  Adding even a few dozen or several hundred cells can make a dramatic impact on what that microbial community looks like, and can confound findings.

Collectively, contamination issues can lead to batch effects, where all the samples that were processed together have similar contamination.  This can be confused with an actual treatment effect if you aren’t careful in how you process your samples.  For example, if all your samples from timepoint 1 were extracted, amplified, and sequenced together, and all your samples from timepoint 2 were extracted, amplified, and sequenced together later, you might find that timepoint 1 and 2 have significantly different bacterial communities.  If this was because a large number of low-abundance species were responsible for that change, you wouldn’t really know if that was because the community had changed subtly or if it was because of the collective effect of low-level contamination.

Stay tuned for a piece on batch effects in sequencing!

 

 

500 Women Scientists Eugene featured on local news!

500 Women Scientists Eugene Pod Coordinators Leslie Dietz, Theresa Cheng and I sat down with KMTR reporter Kelsey Christensen today to talk about 500 Women Scientists in Eugene, and the Science Salons we’ve been hosting monthly since March.  You can find the video clip in the link below.

“We’re trying to help change people’s idea of what a scientist looks like.”

You can catch up with the Eugene Pod and find our schedule of events online:

Facebook | Twitter | Website

 


It takes a village to write a scientific paper

Every scientist I know (myself included) underestimates how long it will take to write, edit, and submit a paper.  Despite having 22 publications to date, I still set laughably high expectations for my writing deadlines.  Even though scientists go into a project with a defined hypothesis, objectives, and workflow, by the end of data analysis we often find ourselves surprised.  Perhaps your assumptions were not supported by the actual observations; sometimes what you thought would be insignificant becomes a fascinating result.  Either way, by the time you have finished most of the data analysis and exploration, you face the difficult task of compiling the results into a meaningful paper.  You can’t simply report your data without giving them context and interpretation.  I’ve already discussed the portions of scientific manuscripts and how one is composed, and here I want to focus on the support network that goes into this process, which can help shape the context that you provide to your data.

One of the best ways in which we can promote rigorous, thoughtful science is through peer-review, which can take a number of forms.  It is worth noting that peer-review also allows for professional bullying, and can be swayed by current theories and “common knowledge”.  It is the journal editor’s job to select and referee reviewers (usually 2 – 4), to compile their comments, and to make the final recommendation for the disposition of the manuscript (accept, modify, reject).  Reputation, and personal demographics such as gender, race, or institutional pedigree, can also play a role in the quality and tone of the peer-review you receive.  Nevertheless, getting an outside opinion of your work is critical, and a number of procedural changes to improve transparency and accountability have been proposed and implemented.  For example, many journals now publish reviewers’ names online with the article after it has been accepted, such that the review does not stay blind forever.

Thorough reading and editing of a manuscript takes time.  Yet peer-reviewers for scientific journals almost unanimously do not receive compensation.  It is an expected service of academics, and theoretically if we are all acting as peer-reviewers for each other then there should be no shortage.  Unfortunately, due to the pressures of the publish-or-perish race to be awarded tenure, many non-tenured scientists (graduate students, post-docs, non-tenure track faculty, and pre-tenured tenure-track faculty) are reluctant to spend precious time on any activity which will not land them tenure, particularly reviewing.  Moreover, tenured faculty also tend to find themselves without enough time to review, particularly if they are serving on a large number of committees or in an administrative capacity.  On top of that, you are not allowed to accept a review if you have a conflict of interest, including current or recent collaboration with the authors, personal relationships with authors, a financial stake in the manuscript or results, etc.  The peer-review process commonly gets delayed when editors are unable to find enough reviewers able to accept a manuscript, or when reviewers cannot complete the review in a timely manner (typically 2 – 4 weeks).

I have recently tried to solicit peer-review from friends and colleagues who are not part of the project before I submit to a journal.  If you regularly follow my blog, you’ll probably guess that one of the reasons I do this is to catch spelling and grammatical mistakes, which I pick out of other works with hawk-like vision and miss in my own with mole-like vision.  More importantly, trying to communicate my work to someone who is not already involved in the project is a great way to improve my ability to effectively and specifically communicate my work.  Technical jargon, colloquial phrasing, sentence construction, and writing tone can all affect the information and data interpretation that a reader can glean from your work, and this will be modulated by the knowledge background of the reader.

I’ve learned that I write like an animal microbiologist, and when writing I make assumptions about which information is common knowledge and doesn’t need a citation, or doesn’t need to be included at all because it can be assumed.  However, anyone besides animal microbiologists – readers raised on different field-specific common knowledge – may not be familiar with the abbreviations, techniques, or terms I use.  It may seem self-explanatory to me, but I would rather have to reword my manuscript than have readers confuse the message of my article.  Even better, internal review from colleagues who are not involved with the project or who are in a different field can provide valuable interdisciplinary perspective.  I have been able to apply my knowledge of animal science to my work in the built environment, and insights from my collaborators in plant ecology have helped me broaden my approach towards both animals and buildings.

No scientific article would be published without the help of the journal editorial team, either, who proof the final manuscript, verify certain information, curate figures and tables, and type-set the final version.  But working backwards from submission and journal staff, before peer-review and internal peer-review, there are a lot of people that contribute to a scientific article who aren’t necessarily considered when contemplating the amount of personnel needed to compose a scientific article.  In fact, that one article represents just the tip of the iceberg of people involved in that science in some way; there are database curators, people developing and maintaining open-source software or free analysis programs, laboratory technicians, or equipment and consumables suppliers.  Broadening our definition of science support network further includes human resources personnel, sponsored projects staff who manage grants, building operational personnel who maintain the building services for the laboratory, and administrative staff who handle many of the logistical details to running a lab.  It takes a village to run a research institution, to publish a scientific article, to provide jobs and educational opportunities, and to support the research and development which fuels economic growth.  When it comes time to set federal and state budgets, it bears remembering that that science village requires financial support.

 

Featured Image Credit: Kriegeskorte, 2012

Summer outlook

I’ve got quite a busy summer ahead!  You’ll be able to find me at:

June 22, 2018: The HOMEChem Open House at the UT Austin Test House , University of Texas at Austin’s J.J. Pickle Research Campus.  I’ll be meeting with BioBE collaborators to discuss pilot projects exploring the link between indoor chemistry and indoor microbiology.

July 15 – 20, 2018: The Microbiology of the Built Environment (MoBE) Gordon Research Conference, University of New England in Biddeford, ME.  BioBE’s Dr. Jessica Green is the meeting Vice Chair.

July 22 – 28, 2018: Indoor Air 2018 Conference in Philadelphia, PA.  I’ll be presenting some of the work I’ve been part of, exploring the effect of weatherization on bacteria indoors.

August 12 – 18, 2018: The 17th International Society for Microbial Ecology (ISME17) in Leipzig, Germany.  Here as well, I’ll be presenting some of the work I’ve been part of, exploring the effect of weatherization on bacteria indoors.