A visit from Bozeman

Last year, one of my former research groups at Montana State University was awarded a USDA NIFA Foundational program grant, and I am a sub-award PI on that grant.  We’ll be working together to investigate the effect of diversified farming systems – such as those that use cover crops, rotations, or integrate livestock grazing into field management – on crop production and soil bacterial communities: “Diversifying cropping systems through cover crops and targeted grazing: impacts on plant-microbe-insect interactions, yield and economic returns.”

The first soil samples were collected in Montana this summer, and I have been processing them for the past few weeks. I am using the opportunity to train a master’s student on microbiology and molecular genetics lab work. 

Tindall Ouverson started this fall as a master’s student at MSU, working with Fabian Menalled and Tim Seipel in Bozeman, MT.  She’s an environmental and soil scientist, and this is her first time working with microbes.  She was here in Eugene for just a few days to learn everything needed for sequencing: DNA extraction, polymerase chain reaction, gel electrophoresis and visualization, DNA cleanup using magnetic beads, quantification, and pooling.  Despite not having experience in microbiology or molecular biology, Tindall showed a real aptitude and picked up the techniques faster than I expected!

Once the sequences are generated, I’ll be (remotely) training Tindall on DNA sequence analysis.  I’ll also be serving as one of her thesis committee members! Tindall will be the first of (hopefully) many cross-trained graduate students between myself and collaborators at MSU.

(Reblog) A perspective on tackling contamination in microbial ecology

Original posting from BioBE.

To study DNA or RNA, there are a number of "wet-lab" (laboratory) and "dry-lab" (analysis) steps which are required to access the genetic code from inside cells, polish it to a high sheen such that the delicate technology we rely on can use it, and then make sense of it all.  Destructive enzymes must be removed, one strand of DNA must be turned into millions of strands so that collectively they create a measurable signal for sequencing, and contamination must be removed.  Yet what constitutes contamination, and when or how to deal with it, remains an actively debated topic in science.  Major contamination sources include human handlers, non-sterile laboratory materials, other samples during processing, and artificial generation due to technological quirks.

Contamination from human handlers

This one is easiest to understand: we constantly shed microorganisms and our own cells, and these aerosolized cells may fall into samples during collection or processing.  This might be of minimal concern when working with feces, where the sheer number of microbial cells in a single teaspoon swamps the number you might have shed into it, or it may be of vital concern when investigating house dust, which not only has comparatively few cells and little diversity, but is also expected to have a large amount of human-associated microorganisms present.  To combat this, researchers wear personal protective equipment (PPE), which protects you from your samples and your samples from you, and work in biosafety cabinets, which use laminar air flow to prevent your microbial cloud from floating onto your workstation and samples.

Fun fact: many photos in laboratories are staged, including this one of me as a grad student.  I'm just pretending to work.  Reflective surfaces, lighting, cramped spaces, busy scenes, and difficulty in positioning oneself make "action shots" difficult.  That's why many lab photos are staged, and often lack PPE.

sue_02_small
Photo Credit: Kristina Drobny

Contamination from laboratory materials

Microbiology or molecular biology laboratory materials are sterilized before and between uses, perhaps using chemicals (ex. 70% ethanol), an ultraviolet lamp, or autoclaving, which combines heat and pressure to destroy microorganisms and can be used to sterilize liquids, biological material, clothing, metal, some plastics, etc.  However, microorganisms can be tough – really tough – and can sometimes survive the harsh cleaning protocols we use.  Or their DNA can survive, and get picked up by sequencing techniques that don't discriminate between DNA from live and dead cells.

In addition to careful adherence to protocols, some of this biologically-sourced contamination can be handled in analysis.  A survey of human cell RNA sequence libraries found widespread contamination by bacterial RNA, which was attributed to environmental contamination.  The paper includes an interesting discussion on how to correct this bioinformatically, as well as a perspective on contamination.  Likewise, you can simply remove sequences belonging to certain taxa during quality control steps in sequence processing.  There are a number of hardy bacteria that have been commonly found in laboratory reagents and are considered contaminants; the trouble is that many of these are also found in the environment, and in certain cases may be real community members.  Should one throw the Bradyrhizobium out with the laboratory water bath?
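If you are curious what that taxon-based cleanup looks like in practice, here is a minimal Python sketch; the taxa table and the contaminant blacklist are made-up examples, not a recommended list.

```python
# Minimal sketch: drop counts assigned to suspected reagent-contaminant genera.
# The taxa table and the blacklist below are made-up examples.

contaminant_genera = {"Ralstonia", "Burkholderia", "Bradyrhizobium"}

# per-sample counts of sequences assigned to each genus (hypothetical)
taxa_table = {
    "sample_A": {"Bacillus": 1200, "Ralstonia": 35, "Pseudomonas": 480},
    "sample_B": {"Bradyrhizobium": 60, "Bacillus": 900, "Clostridium": 210},
}

def remove_blacklisted(table, blacklist):
    """Return a copy of the taxa table with blacklisted genera removed."""
    return {
        sample: {genus: n for genus, n in genera.items() if genus not in blacklist}
        for sample, genera in table.items()
    }

print(remove_blacklisted(taxa_table, contaminant_genera))
```

The catch, as above, is deciding whether something like Bradyrhizobium is genuinely a contaminant in your particular samples before it goes on the blacklist.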

Chimeras

Like the mythical creatures they are named for, sequence chimeras are DNA (or cDNA) strands which are accidentally created when two other DNA strands merge.  Chimeric sequences can be made up of more than two DNA strand parents, but the probability of that is much lower.  Chimeras occur during PCR, a process that takes one strand of genetic code and makes thousands to millions of copies, and which is used in nearly all sequencing workflows at some point.  If there is an uneven voltage supplied to the machine, the amplification process can hiccup, producing partial DNA strands which can concatenate into a new strand that might be mistaken for a new species.  These can be removed during analysis by comparing the first and second halves of each of your sequences to a reference database.  If each half matches a different "parent", the sequence is deemed chimeric and removed.

1024px-Splicing_by_Overlap_Extension_PCR.svg
Chimeric DNA
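To make the split-and-match idea concrete, here is a toy Python sketch, not the algorithm any particular chimera checker uses, that compares each half of a read against a couple of invented reference "parents" and flags the read if the halves match different ones.

```python
# Toy sketch of the split-and-match idea: if the two halves of a read best
# match different reference "parents", flag it as a possible chimera.
# The reference sequences here are invented, not a real database.
from difflib import SequenceMatcher

references = {
    "parent_1": "ATGCGTACGTTAGCCTAGGCTAACGGTTAC",
    "parent_2": "TTACGGAACCTTGGAACCGGTTAACCGGTA",
}

def best_match(fragment, refs):
    """Name of the reference most similar to the fragment."""
    return max(refs, key=lambda name: SequenceMatcher(None, fragment, refs[name]).ratio())

def looks_chimeric(read, refs):
    half = len(read) // 2
    return best_match(read[:half], refs) != best_match(read[half:], refs)

# A read whose first half resembles parent_1 and second half parent_2:
suspect = "ATGCGTACGTTAGCC" + "GGTTAACCGGTA"
print(looks_chimeric(suspect, references))  # True -> possible chimera
```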

Cross-sample contamination

During DNA or RNA extraction, genetic code can be flicked from one sample to another during any number of wash or shaking steps, or if droplets are flicked from fast-moving pipettes.  This can be mitigated by properly sealing all sample containers or plates, moving slowly and carefully controlling your technique, or using precision robots which have been programmed with exacting detail — down to the curvature of the tube used, the amount and viscosity of the liquid, and how fast you want the pipette to move, so that the computer can calculate the pressure needed to perform each task.  Sequencing machines are extremely expensive, and many labs are moving towards shared facilities or third-party service providers, both of which may use proprietary protocols.  This makes it more difficult to track possible contamination, as was the case in a recent study using RNA; the researchers found that much of the sample-sample contamination occurred at the facility or in shipping, and that this negatively affected their ability to properly analyze trends in the data.

Sample-sample contamination during sequencing

Sample-sample contamination during sequencing, however, is much more difficult to control.  Each sequencing technology was designed with a different research goal in mind: for example, some generate an immense number of short reads to get high resolution on specific regions, while others aim to sequence the longest continuous piece of DNA possible before the reaction fails or becomes unreliable.  They each come with their own quirks and potential for quality control failures.

Due to the high cost of sequencing, and the practicality that most microbiome studies don’t require more than 10,000 reads per sample, it is very common to pool samples during a run.  During wet-lab processing to prepare your biological samples into a “sequencing library”, a unique piece of artificial “DNA” called a barcode, tag, or index, is added to all the pieces of genetic code in a single sample (in reality, this is not DNA but a single strand of nucleotides without any of DNA’s bells and whistles).  Each of your samples gets a different barcode, and then all your samples can be mixed together in a “pool”.  After sequencing the pool, your computer program can sort the sequences back into their respective samples using those barcodes.
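As a rough illustration, here is a minimal Python sketch of that sorting (demultiplexing) step; the barcodes, sample names, and reads are invented, but the idea is the same: peel the barcode off the front of each read and use it to look up which sample the read belongs to.

```python
# Minimal sketch of demultiplexing: sort reads back into samples using the
# barcode at the start of each read.  Barcodes and reads are made-up examples.

barcodes = {"ACGT": "soil_1", "TTAG": "soil_2", "GGCA": "soil_3"}
barcode_length = 4

reads = ["ACGTATGCGTACG", "TTAGGGCCTAGGA", "GGCAATTCCGGAA", "CCCCATGCATGCA"]

demultiplexed = {sample: [] for sample in barcodes.values()}
unassigned = []

for read in reads:
    tag, insert = read[:barcode_length], read[barcode_length:]
    sample = barcodes.get(tag)
    if sample is None:
        unassigned.append(read)               # barcode not recognized
    else:
        demultiplexed[sample].append(insert)  # keep the read minus its barcode

print(demultiplexed)
print("unassigned:", unassigned)
```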

While this technique has made sequencing significantly cheaper, it adds other complications.  For example, Illumina MiSeq machines generate a certain number of sequence reads (about 200 million right now) which are divided up among the samples in that run (like a pie).   The samples are added to a sequencing plate or flow cell (for things like Illumina MiSeq).  The flow cells have multiple lanes where samples can be added; if you add a smaller number of samples to each lane, the machine will generate more sequences per sample, and if you add a larger number of samples, each one will have fewer sequences at the end of the run.

illumina-gaiix-for-high-throughput-sequencing-15-728
Illumina GAIIx for high-throughput sequencing.

Cross-contamination can happen on a flow cell when the sample pool wasn't thoroughly cleaned of adapters or primers, and there are great explanations of this here and here.  To generate many copies of genetic code from a single strand, you mimic DNA replication in the lab by providing all the basic ingredients (process described here).   To do that, you need to add a primer (just like with painting) which can attach to your sample DNA at a specific site and act as scaffolding for your enzyme to attach to the sample DNA and start adding bases to form a complementary strand.  Adapters are just primers with barcodes and the sequencing primer already attached.   Primers and adapters are small strands, roughly 10 to 50 nucleotides long, and are much shorter than your DNA of interest, which is generally 100 to 1000 nucleotides long.  There are a number of methods to remove them, but if they hang around and make it to the sequencing run, they can be incorporated incorrectly and make it seem like a sequence belongs to a different sample.

MB512
DNA Purification

 

barcode_swap_mechanism.png
Barcode swapping

This may sound easy to fix, but sequencing library preparation already goes through a lot of stringent cleaning procedures to remove everything but the DNA (or RNA) strands you want to work with.  It's so stringent that the problem of barcode swapping, also known as tag switching or index hopping, was not immediately apparent.  Even when it is noted, it typically affects a small number of the total sequences.  This may not be an issue if you are working with rumen samples and are only interested in sequences which represent >1% of your total abundance.  But it can really be an issue in low-biomass samples, such as air or dust, particularly in hospitals or clean rooms.  If you were trying to determine whether healthy adults were carrying, but not infected by, the pathogen C. difficile in their GI tract, you would be very interested in the presence of even one C. difficile sequence and would want to be extremely sure of which sample it came from.  Tag switching can be made worse by combining samples from very different sample types or genetic code targets on the same run.

There are a number of articles proposing methods of dealing with tag switching using double tags to reduce confusion or other primer design techniques, computational correction or variance stabilization of the sequence data, identification and removal of contaminant sequences, or utilizing synthetic mock controls.  Mock controls are microbial communities which have been created in the lab by mixing a few dozen microbial cultures together, and are used as a positive control to ensure your procedures are working.  Because you are adding the cells to the sample yourself, you can control the relative concentrations of each species, which can act as a standard to estimate the number of cells that might be in your biological samples.  Synthetic mock controls don't use real organisms; they instead use synthetically created DNA to act as artificial "organisms".  If you find these in a biological sample, you know you have contamination.  One drawback to this is that positive controls always sequence really well, much better than your low-biomass biological samples, which can mean that your samples do not generate many sequences during a run, or that tag switching is encouraged from your high-biomass samples to your low-biomass samples.
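Here is a small, hypothetical sketch of that last idea: if any of the synthetic spike-in sequences turn up in a biological sample, flag that sample for possible cross-contamination.  The sequences and sample names are invented for illustration.

```python
# Hypothetical sketch: if a synthetic "organism" from the mock control turns up
# in a biological sample, flag that sample for possible cross-contamination.
# The synthetic sequences and sample reads are invented.

synthetic_spikes = {"SYNTH_01": "ATATATGCGCGCATAT", "SYNTH_02": "GCGCTTTTAAAAGCGC"}

sample_reads = {
    "mock_control": ["ATATATGCGCGCATAT", "GCGCTTTTAAAAGCGC"],
    "dust_sample":  ["TTGACCGTAAGGCTTA", "ATATATGCGCGCATAT"],  # a spike leaked in
    "soil_sample":  ["CCGGTTAACCGGTTAA"],
}

for sample, reads in sample_reads.items():
    if sample == "mock_control":
        continue  # the control is supposed to contain the spikes
    leaked = [name for name, seq in synthetic_spikes.items() if seq in reads]
    if leaked:
        print(f"{sample}: possible cross-contamination, found {leaked}")
```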

Incorrect base calls

Cross-contamination during sequencing can also be a solely bioinformatic problem: since many of the barcodes are only a few nucleotides long (10 or 12 being the most commonly used), if the computer misinterprets the bases it thinks were just added, it can interpret the barcode as a different one and attribute that sequence to a different sample than the one it came from.  This may not be a problem if there aren't many incorrect sequences generated and it falls below the threshold of what is "important because it is abundant", but again, it can be a problem if you are looking for the presence of perhaps just a few hundred cells.
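One way to think about this risk: if two barcodes in your set differ by only a single base, one miscalled base is enough to silently move a read into the wrong sample.  A quick, hypothetical Python check of the pairwise distances in a barcode set might look like this.

```python
# Quick, hypothetical check: how many miscalled bases would it take to turn
# one barcode into another?  If any pair differs by a single base, one
# sequencing error can silently reassign a read to the wrong sample.
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    return sum(x != y for x, y in zip(a, b))

barcode_set = ["ACGTACGTAC", "ACGTACGTAA", "TTGGCCAATT"]  # made-up 10-mers

for b1, b2 in combinations(barcode_set, 2):
    distance = hamming(b1, b2)
    if distance <= 1:
        print(f"{b1} and {b2} differ by {distance} base: too easy to confuse")
```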

Implications

When researching environments that have very low biomass, such as air, dust, and hospital or cleanroom surfaces, there are very few microbial cells to begin with.  Adding even a few dozen or several hundred cells can make a dramatic impact on what that microbial community looks like, and can confound findings.

Collectively, contamination issues can lead to batch effects, where all the samples that were processed together have similar contamination.  This can be confused with an actual treatment effect if you aren’t careful in how you process your samples.  For example, if all your samples from timepoint 1 were extracted, amplified, and sequenced together, and all your samples from timepoint 2 were extracted, amplified, and sequenced together later, you might find that timepoint 1 and 2 have significantly different bacterial communities.  If this was because a large number of low-abundance species were responsible for that change, you wouldn’t really know if that was because the community had changed subtly or if it was because of the collective effect of low-level contamination.

Stay tuned for a piece on batch effects in sequencing!

 

 

What I do for a living Part 2: Bioinformatics

Bioinformatics brings statistics, mathematics, and computer programming to biology and other sciences.  In my area, it allows for the analysis of massive amounts of genomic (DNA), transcriptomic (RNA), proteomic (proteins), or metabolomic (metabolites) data.

In recent years, the advances in sequencing have allowed for the large-scale investigation of a variety of microbiomes.  Microbiome refers to the collective genetic material or genomes of all the microorganisms in a specific environment, such as the digestive tract or the elbow.  The term microbiome is often casually thrown around: some people mistakenly use it interchangeably with "microbiota", or use it to describe only the genetic material of a specific type of microorganism (i.e. "microbiome" instead of "bacterial microbiome").  Not only have targeted (amplicon) sequencing techniques improved, but methods that use single or multiple whole genomes have become much more efficient.  In both cases, this has resulted in more sequences being amplified more times.  This creates "sequencing depth", a.k.a. better "coverage": if you can sequence one piece of DNA 10 times instead of just once or twice, then you can determine if changes in the sequence are random errors or really there.  Unfortunately, faster sequencing techniques usually have more spontaneous errors, so your data are "messy" and harder to deal with.  More, and messier, data creates its own problem: how do you handle it all?

coverage.png
The grey lines on the right represent sequence pieces reassembled into a genome, with white showing gaps.  The colored lines represent a nucleotide that is different from the reference genome, usually just a random error in one sequence. The red bar shows where each sequence has a nucleotide different from that of the reference genome, indicating that this bacterial strain really is different there.  This is a single nucleotide polymorphism (SNP).
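Here is a toy example of why that depth matters, with made-up reads: when ten reads cover the same stretch, a base seen in only one read looks like a sequencing error, while a base seen in most reads would look like a real difference (a SNP).

```python
# Toy example of why depth matters: with ten made-up reads covering the same
# stretch, a base seen in one read looks like an error, while a base seen in
# most reads would look like a real difference (a SNP).
from collections import Counter

reads = ["ACGTG"] * 9 + ["ACTTG"]  # one read has a likely error at position 2

for position in range(len(reads[0])):
    calls = Counter(read[position] for read in reads)
    base, count = calls.most_common(1)[0]
    print(f"position {position}: consensus {base}, supported by {count}/{len(reads)} reads")
```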

DNA analysis requires very complex mathematical equations in order to have a standardized way to quantitatively and statistically compare two or two million DNA sequences.  For example, you can use equations for estimating entropy (chaos) and estimate how many sequences you might be missing due to sequencing shortcomings based on how homogeneous (similar) or varied your dataset is.  If you look at your data in chunks of 100 sequences, and 90 of them are different from each other, then sequencing your dataset again will probably turn up something new.  But if 90 are the same, you have likely found nearly all the species in that sample.
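One common way to put a number on that intuition is Good's coverage, which essentially asks what fraction of your sequences were seen more than once.  A minimal sketch, with hypothetical counts:

```python
# Minimal sketch of Good's coverage: 1 - (sequences seen exactly once / total).
# A high value suggests re-sequencing would turn up little that is new.
# The counts per unique sequence below are hypothetical.
from collections import Counter

observed = Counter({
    "seq_A": 55, "seq_B": 30, "seq_C": 10,                        # common
    "seq_D": 1, "seq_E": 1, "seq_F": 1, "seq_G": 1, "seq_H": 1,   # singletons
})

total = sum(observed.values())                            # 100 sequences in all
singletons = sum(1 for n in observed.values() if n == 1)  # 5 singletons
goods_coverage = 1 - singletons / total

print(f"Good's coverage: {goods_coverage:.2f}")  # 0.95 -> most of the community seen
```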

Bioinformatics takes these complex equations and uses computer programs to break them down into many simple pieces and automate them.  However, the more data you have, the more equations the computer will need to do, and the larger your files will be.  Thus, many researchers are limited by how much data they can process.

Mr-DNA-2-300
Mr. DNA, Jurassic Park (1993)

There are several challenges to analyzing any dataset. The first is assembly.

Sequencing technology can only add so many nucleotide bases to a synthesized sequence before it starts introducing more and more errors, or just stops adding altogether.  To combat this increase in errors, DNA or RNA is cut into small fragments, or primers are used to amplify only certain small regions.  These pieces can be sequenced from one end to another, or can be sequenced starting at both ends and working towards the middle to create a region of overlap.  In that case, to assemble, the computer needs to match up both ends and create one contiguous segment (“contig”).  With some platforms, like Illumina, the computer tags each sequence by where on the plate it was, so it knows which forward piece matches which reverse.

assembly-before-and-after-12-638
http://www.slideshare.net/flxlex/assembly-before-and-after
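To give a feel for the overlap-matching step, here is a toy Python sketch that merges a forward and reverse read into one contig by looking for where the reverse-complemented reverse read overlaps the end of the forward read.  Real merging tools use quality-aware alignment and tolerate mismatches; the reads here are invented.

```python
# Toy sketch of merging a forward and reverse read into one contig by finding
# where the reverse-complemented reverse read overlaps the end of the forward
# read.  Real tools use quality-aware alignment; the reads here are invented.

def reverse_complement(seq):
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))

def merge(forward, reverse, min_overlap=5):
    rev = reverse_complement(reverse)
    # try the longest possible overlap first, then shrink
    for size in range(min(len(forward), len(rev)), min_overlap - 1, -1):
        if forward[-size:] == rev[:size]:
            return forward + rev[size:]  # one contiguous segment ("contig")
    return None                          # no usable overlap found

forward_read = "ATGCGTACGTTAGCCTAG"
reverse_read = reverse_complement("TAGCCTAGGGAACCTT")  # as the sequencer reports it

print(merge(forward_read, reverse_read))  # ATGCGTACGTTAGCCTAGGGAACCTT
```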

When sequencing an entire genome (or many), the pieces are enzymatically cut, or sheared by vibrating them at a certain frequency, and all the pieces are sequenced multiple times.  The computer then needs to match the ends up using short pieces of overlap.  This can be very resource-intensive for the computer, depending on how many pieces you need to put back together, and whether you have a reference genome for it to use (like the picture on a puzzle box), or whether you are doing it de novo from scratch (putting together a puzzle without a picture, by trial and error, two pieces at a time).

contig assembly.gif
http://slicetools.sourceforge.net/

Once assembled into their respective consensus sequences, you need to quality-check the data.

This can take a significant amount of time, depending on how you go about it. It also requires good judgement, and a willingness to re-run the steps with different parameters to see what will happen.  An easy and quick way is to have the computer throw out any data below a certain threshold: sequences longer or shorter than your target sequence length, sequences containing ambiguous bases (N) which the computer couldn't call as a primary nucleotide (A, T, C, or G), or sequences where the confidence level (quality score) of the base calls was low.  These scores are generated by the sequencing machine as a relative measure of how "confident" each base call is, and roughly translate to the potential number of base-call errors (ex. marking it an A instead of a T) per 1,000 bases.  You can also cut off low-quality pieces, like the very beginnings or ends of sequences, which tend to sequence poorly and have low quality.  This is a great example of where judgement is needed: if you quality-check and trim off low-quality bases first, and then assemble, you are likely to have cut off the overlapping ends which end up in the middle of a contig and won't be able to put the two halves together.  If you assemble first, you might end up with a sequence that is low-quality in the middle, or very short if you trim off the low-quality portions.  If your run did not sequence well and you have a lot of spontaneous errors, you will have to decide whether to work with a lot of poor-quality data, or a small amount of good-quality data left over after you trim out the rest, or spend the money to try and re-sequence.
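Here is a minimal, hypothetical sketch of those filters (length, ambiguous bases, and average quality score) applied to a handful of made-up reads.  The thresholds are arbitrary examples, not recommendations.

```python
# Minimal, hypothetical sketch of the filters described above: drop reads that
# are the wrong length, contain ambiguous bases (N), or have low average
# quality.  The reads, quality scores, and thresholds are all made up.

reads = [
    # (sequence, per-base Phred-style quality scores)
    ("ACGTACGTACGTACGTACGT", [38, 37, 36, 39, 38, 35, 36, 37, 38, 39,
                              36, 35, 34, 37, 38, 36, 35, 34, 33, 32]),
    ("ACGTNCGTACGTACGTACGT", [30] * 20),  # contains an ambiguous base
    ("ACGTACGT",             [38] * 8),   # too short
    ("ACGTACGTACGTACGTACGT", [12] * 20),  # low quality throughout
]

MIN_LENGTH, MAX_LENGTH, MIN_MEAN_QUALITY = 15, 30, 25

def passes(sequence, qualities):
    if not MIN_LENGTH <= len(sequence) <= MAX_LENGTH:
        return False
    if "N" in sequence:
        return False
    return sum(qualities) / len(qualities) >= MIN_MEAN_QUALITY

kept = [seq for seq, quals in reads if passes(seq, quals)]
print(f"kept {len(kept)} of {len(reads)} reads")  # kept 1 of 4 reads
```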

There are several steps that I like to add, some of which are necessary and some which are technically optional.  One of them is to look for chimeras, which are two sequence pieces that mistakenly got joined together.  This happens during the PCR amplification step, often if there is an inconsistent electrical current or other technical problem with the machine.  While time- and processor-consuming, chimera checking can remove these fake sequences before you accidentally think you’ve discovered a new species.  Your screen might end up looking something like this…

Screen Shot 2016-04-13 at 5.48.21 PM
Actual and common screen-shot… but I am familiar enough with it to be able to interpret it!

 

Eventually, you can taxonomically and statistically assess your data.

Figure 1
Ishaq and Wright, 2014, Microbial Ecology

In order to assign taxonomic identification (ex. genus or species) to a sequence, you need to have a reference database.  This is a list of sequences labelled with their taxonomy (ex. Bacillus licheniformis), so that you can match your sequences to the reference and identify what you have.  There are several pre-made ones publicly available, but in many cases you need to add to or edit these, and several times I have made my own using available data in online databases.
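As a toy illustration of that matching step (real classifiers use k-mer profiles or alignments against large curated databases), here is a sketch that labels a query with the taxonomy of its most similar reference, or "unclassified" if nothing is close enough.  The reference sequences are invented.

```python
# Toy sketch of taxonomic assignment: label a query with the taxonomy of its
# most similar reference, or "unclassified" if nothing is close enough.
# Real classifiers use k-mer profiles or alignments; these sequences are invented.
from difflib import SequenceMatcher

reference_db = {
    "Bacillus licheniformis":  "ATGGGCCTTAACGGAGTTCCAAGGTTCACA",
    "Pseudomonas fluorescens": "TTCCAAGGAACTTGGCATCAGTTAACCGGA",
}

def assign_taxonomy(query, references, min_similarity=0.8):
    scores = {
        name: SequenceMatcher(None, query, seq).ratio()
        for name, seq in references.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_similarity else "unclassified"

query = "ATGGGCCTTAACGGAGTTCCAAGGTTCACT"  # one base off the Bacillus reference
print(assign_taxonomy(query, reference_db))
```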

Figure 4
Ishaq and Wright, 2014, Microbial Ecology

You can also statistically compare your samples.  This can get complicated, but in essence it tries to mathematically compare datasets to determine whether they are actually different, and whether that difference could have happened by chance or not.  You can determine if organically-farmed soil contains more diversity than conventionally-farmed soil, or whether you have enough sequencing coverage or need to go back and do another run.  You can also see trends across the data, for example, whether moose from different geographic locations have similar bacterial diversity to each other (left), or whether certain species or environmental factors have a positive, negative, or no correlation (below).

 

correlogram.jpg
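For a flavor of what that comparison step looks like in code, here is a hedged sketch that computes a Shannon diversity index for a few hypothetical organic and conventional soil samples and runs a simple non-parametric test on the two groups (using scipy).  Every number is made up.

```python
# Hedged sketch: compute a Shannon diversity index for a few hypothetical soil
# samples, then test whether the organic and conventional groups differ more
# than chance would explain.  Requires scipy; every number is made up.
import math
from scipy.stats import mannwhitneyu

def shannon(counts):
    """Shannon diversity index from per-taxon counts in one sample."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)

organic      = [shannon(c) for c in ([120, 80, 60, 40, 30],
                                     [100, 90, 70, 50, 20],
                                     [110, 95, 60, 45, 25])]
conventional = [shannon(c) for c in ([300, 20, 10, 5],
                                     [280, 30, 15, 5],
                                     [310, 25, 10, 8])]

stat, p_value = mannwhitneyu(organic, conventional, alternative="two-sided")
print(f"organic vs conventional diversity: U = {stat}, p = {p_value:.3f}")
```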

Bioinformatics can be complicated and frustrating, especially because computers are very literal machines and need to have things written in very specific ways to get them to accomplish tasks.  They also aren’t very good at telling you what you are doing wrong; sometimes it’s as simple as having a space where it’s not supposed to be.  It takes dedication and patience to go back through code to look for minute errors, or to backtrack in an analysis and figure out at which step several thousand sequences disappeared and why.  Like any skill, computer science and bioinformatics take time and practice to master.  In the end, the interpretation of the data and identifying trends can be really interesting, and it’s really rewarding when you finally manage to get your statistical program to create a particularly complicated graph!

Stay tuned for an in-depth look at my current post-doctoral work with weed management in agriculture and soil microbial diversity!