The picture is just one instant in an event involving hundreds or thousands of organisms that were all doing a lot of different things, sometimes for just a few seconds. How would you describe it?
Maybe using the number of members present in this community? Or a list of names of attendees? The 16S rRNA gene for prokaryotes, or the 18S rRNA or ITS genes for eukaryotes, for example, would tell us that. Those genes are found in all types of those organisms, and are a pretty effective means of basic identification. But, it’s only as good as how often that gene is found in the organisms you are looking for. There is no one gene that’s found exactly the same in all organisms, so you might need to target multiple different identification genes to look at all the different types of microorganisms, such as bacteria, fungi, protozoa, or archaea. Viruses don’t share a common gene across types, so to look at viruses you’d need something else.
From our identification genes we could identify all the organisms wearing yellow; ex. phylogenetic Family = Ducks. That wouldn’t tell us if they were always found in this ecosystem (native Eugene population) or just passing through (transient population), but we could figure that out if we looked at every home game of the season and found certain community members there time and again.
But knowing they are Ducks doesn’t tell us anything else about that community member. What will they do if it starts raining? Are they able to go mountain biking? Perhaps we could identify their potential for activity by looking at the objects they are carrying? That would be akin to metagenomics, identifying all the DNA present from all the organisms, which tells us what genes are present, but not if they are currently or ever used. It can be challenging to interpret: think of sequencing data from one organism’s genome as one 1,000,000-piece puzzle and all the genomes in a community as 1,000 1,000,000-piece puzzles all dumped in a pile. In the crowd, metagenomics would tell us who had a credit card that was specifically used to buy umbrellas, but not whether they’d actually use the umbrella if it rains (ex. Eugeneans would not).
We could describe what everyone is doing at this moment. That would be transcriptomics, identifying all the RNA to determine which genes were actively being transcribed into RNA, and from there made into proteins for use in some cellular function. If we see someone in the crowd using that credit card for an umbrella (DNA), the receipt would be the RNA. RNA is a working copy you make of the DNA to take to another part of the cell and use as a blueprint to make a protein. You don’t want your entire genome moving around, or need it to make one protein, so you make a small piece of RNA that will only hang around for a short period before degrading (i.e. you crumpling that RNA receipt and throwing it away because who keeps receipts anymore).
Using transcriptomics, we’d see you were activating your money to get that umbrella, but we wouldn’t see the umbrella itself. For that, we’d need metabolomics, which uses chemistry and physics instead of genomics, in order to identify chemicals (most often proteins). Think of metabolomics as describing this crowd by all the trash and crumbs and miscellaneous items they left behind. It’s one way to know what biological processes occurred (popcorn consumption and digestion).
From a technical standpoint, researching a microbiome might mean looking at all the DNA from all the organisms present to know who they are and of what they are capable. It might also mean looking at all the RNA present, which would tell you what genes were being used by “everyone” for whatever they were doing at a particular moment. Or you might also add metabolomics to identify all the chemical metabolites, which would be all the end products of what those cells were doing, and which are more stable than RNA so they could give you data about a longer frame of time. Collectively, -omics are technology that looks at all of a certain biological substance to help you understand a dynamic community. However, it’s important to remember that each technology gives a particular view of the community and comes with its own limitations.
To study DNA or RNA, there are a number of “wet-lab” (laboratory) and “dry-lab” (analysis) steps which are required to access the genetic code from inside cells, polish it to a high-sheen such that the delicate technology we rely on can use it, and then make sense of it all. Destructive enzymes must be removed, one strand of DNA must be turned into millions of strands so that collectively they create a measurable signal for sequencing, and contamination must be removed. Yet, what constitutes contamination, and when or how to deal with it, remains an actively debated topic in science. Major contamination sources include human handlers, non-sterile laboratory materials, other samples during processing, and artificial generation due to technological quirks.
Contamination from human handlers
This one is easiest to understand; we constantly shed microorganisms and our own cells and these aerosolized cells may fall into samples during collection or processing. This might be of minimal concern working with feces, where the sheer number of microbial cells in a single teaspoon swamp the number that you might have shed into it, or it may be of vital concern when investigating house dust which not only has comparatively few cells and little diversity, but is also expected to have a large amount of human-associated microorganisms present. To combat this, researchers wear personal protective equipment (PPE) which protects you from your samples and your samples from you, and work in biosafety cabinets which use laminar air flow to prevent your microbial cloud from floating onto your workstation and samples.
Fun fact, many photos in laboratories are staged, including this one, of me as a grad student. I’m just pretending to work. Reflective surfaces, lighting, cramped spaces, busy scenes, and difficulty in positioning oneself makes “action shots” difficult. That’s why many lab photos are staged, and often lack PPE.
Photo Credit: Kristina Drobny
Contamination from laboratory materials
Microbiology or molecular biology laboratory materials are sterilized before and between uses, perhaps using chemicals (ex. 70% ethanol), an ultraviolet lamp, or autoclaving which combines heat and pressure to destroy, and which can be used to sterilize liquids, biological material, clothing, metal, some plastics, etc. However, microorganisms can be tough – really tough, and can sometimes survive the harsh cleaning protocols we use. Or, their DNA can survive, and get picked up by sequencing techniques that don’t discriminate between live and dead cellular DNA.
In addition to careful adherence to protocols, some of this biologically-sourced contamination can be handled in analysis. A survey of human cell RNA sequence libraries found widespread contamination by bacterial RNA, which was attributed to environmental contamination. The paper includes an interesting discussion on how to correct this bioinformatically, as well as a perspective on contamination. Likewise, you can simply remove sequences belonging to certain taxa during quality control steps in sequence processing. There are a number of hardy bacteria that have been commonly found in laboratory reagents and are considered contaminants, the trouble is that many of these are also found in the environment, and in certain cases may be real community members. Should one throw the Bradyrhizobium out with the laboratory water bath?
Like the mythical creatures these are named for, sequence chimeras are DNA (or cDNA) strands which are accidentally created when two other DNA strands merge. Chimeric sequences can be made up of more than two DNA strand parents, but the probability of that is much lower. Chimeras occur during PCR, which takes one strand of genetic code and makes thousands to millions of copies, and is a process used in nearly all sequencing workflows at some point. If there is an uneven voltage supplied to the machine, the amplification process can hiccup, producing partial DNA strands which can concatenate and produce a new strand, which might be confused for a new species. These can be removed during analysis by comparing the first and second half of each of your sequences to a reference database of sequences. If each half matches to a different “parent”, it is deemed chimeric and removed.
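As a rough illustration of that "compare each half to the reference" idea, here is a minimal Python sketch. It assumes exact substring matches against a tiny made-up reference set (the names `species_A` and `species_B` and their sequences are hypothetical); real chimera checkers use alignment and scoring rather than exact matching, so treat this as a toy version of the logic only.

```python
def parents(fragment, references):
    """Return the set of reference names whose sequence contains the fragment."""
    return {name for name, seq in references.items() if fragment in seq}


def is_chimera(read, references):
    """Flag a read whose two halves match different, non-overlapping sets of parents."""
    half = len(read) // 2
    first = parents(read[:half], references)
    second = parents(read[half:], references)
    # Both halves must be identifiable, and no single parent may explain both.
    return bool(first) and bool(second) and first.isdisjoint(second)


# Hypothetical reference database of two "species".
references = {
    "species_A": "AAAACCCCGGGGTTTT",
    "species_B": "TTGGCCAATTGGCCAA",
}

real_read = "AACCCCGGGG"            # comes entirely from species_A
chimeric_read = "AAAACCCCTTGGCCAA"  # first half species_A, second half species_B

print(is_chimera(real_read, references))
print(is_chimera(chimeric_read, references))
```

A read whose halves cannot be matched at all is left alone here, since an unmatched sequence could just as easily be a genuinely new organism as an artifact.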
Cross-sample contamination
During DNA or RNA extraction, genetic code can be flicked from one sample to another during any number of wash or shaking steps, or if droplets are flicked from fast moving pipettes. This can be mitigated by properly sealing all sample containers or plates, moving slowly and carefully controlling your technique, or using precision robots which have been programmed with exacting detail, down to the curvature of the tube used, the amount and viscosity of the liquid, and how fast you want the pipette to move, so that the computer can calculate the pressure needed to perform each task. Sequencing machines are extremely expensive, and many labs are moving towards shared facilities or third-party service providers, both of which may use proprietary protocols. This makes it more difficult to track possible contamination, as was the case in a recent study using RNA; the researchers found that much of the sample-sample contamination occurred at the facility or in shipping, and that this negatively affected their ability to properly analyze trends in the data.
Sample-sample contamination during sequencing
Sample-sample contamination during sequencing, however, is much more difficult to control. Each sequencing technology was designed with a different research goal in mind; for example, some generate an immense number of short reads to get high resolution on specific areas, while others aim to sequence the longest continuous piece of DNA possible before the reaction fails or becomes unreliable. They each come with their own quirks and potential for quality control failures.
Due to the high cost of sequencing, and the practicality that most microbiome studies don’t require more than 10,000 reads per sample, it is very common to pool samples during a run. During wet-lab processing to prepare your biological samples into a “sequencing library”, a unique piece of artificial “DNA” called a barcode, tag, or index, is added to all the pieces of genetic code in a single sample (in reality, this is not DNA but a single strand of nucleotides without any of DNA’s bells and whistles). Each of your samples gets a different barcode, and then all your samples can be mixed together in a “pool”. After sequencing the pool, your computer program can sort the sequences back into their respective samples using those barcodes.
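The sorting step at the end can be sketched in a few lines of Python. This is a simplified illustration, not how any particular pipeline does it: it assumes the barcode sits at the very start of each read, and the barcodes, sample names, and reads are all invented for the example.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_length):
    """Sort pooled reads back into samples using the barcode at the start of each read."""
    samples = defaultdict(list)
    for read in reads:
        barcode = read[:barcode_length]
        insert = read[barcode_length:]  # the biological sequence, barcode removed
        # Reads with an unrecognized barcode are set aside rather than guessed at.
        sample = barcode_to_sample.get(barcode, "unassigned")
        samples[sample].append(insert)
    return dict(samples)


# Hypothetical 4-base barcodes and a tiny pooled run.
barcode_to_sample = {"ACGT": "rumen_1", "TGCA": "dust_1"}
pool = ["ACGTGGGCCC", "TGCAAAATTT", "ACGTTTTAAA", "GGGGCCCCAA"]

for sample, seqs in demultiplex(pool, barcode_to_sample, barcode_length=4).items():
    print(sample, seqs)
```

Keeping an explicit "unassigned" bin makes it easy to see how many reads failed to match any barcode, which is itself a useful quality check.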
While this technique has made sequencing significantly cheaper, it adds other complications. For example, Illumina MiSeq machines generate a certain number of sequence reads (about 200 million right now) which are divided up among the samples in that run (like a pie). The samples are added to a sequencing plate or flow cell (for things like Illumina MiSeq). The flow cells have multiple lanes where samples can be added; if you add a smaller number of samples to each lane, the machine will generate more sequences per sample, and if you add a larger number of samples, each one has fewer sequences at the end of the run.
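The pie analogy is simple division, sketched below in Python using the two figures quoted in the text (about 200 million reads per run, and roughly 10,000 reads needed per sample). It assumes the reads are split perfectly evenly, which is an idealization; in practice some samples take a much bigger slice than others.

```python
TOTAL_READS = 200_000_000  # "about 200 million" reads per run, as quoted in the text
TARGET_DEPTH = 10_000      # reads per sample that most microbiome studies need


def reads_per_sample(total_reads, n_samples):
    """Slice of the pie each sample gets if reads are divided evenly."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return total_reads // n_samples


def max_samples(total_reads, target_depth):
    """Largest pool size that still gives every sample the target depth."""
    if target_depth <= 0:
        raise ValueError("target_depth must be positive")
    return total_reads // target_depth


for n in (96, 384):
    print(n, "samples:", reads_per_sample(TOTAL_READS, n), "reads each")
print("largest even pool:", max_samples(TOTAL_READS, TARGET_DEPTH), "samples")
```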
Cross-contamination can happen on a flow cell when the sample pool wasn’t thoroughly cleaned of adapters or primers, and there are great explanations of this here and here. To generate many copies of genetic code from a single strand, you mimic DNA replication in the lab by providing all the basic ingredients (process described here). To do that, you need to add a primer (just like with painting) which can attach to your sample DNA at a specific site and act as scaffolding for your enzyme to attach to the sample DNA and start adding bases to form a complementary strand. Adapters are just primers with barcodes and the sequencing primer already attached. Primers and adapters are small strands, roughly 10 to 50 nucleotides long, and are much shorter than your DNA of interest, which is generally 100 to 1000 nucleotides long. There are a number of methods to remove them, but if they hang around and make it to the sequencing run, they can be incorporated incorrectly and make it seem like a sequence belongs to a different sample.
This may sound easy to fix, but sequencing library preparation already goes through a lot of stringent cleaning procedures to remove everything but the DNA (or RNA) strands you want to work with. It’s so stringent, that the problem of barcode swapping, also known as tag switching or index hopping, was not immediately apparent. Even when it is noted, it typically affects a small number of the total sequences. This may not be an issue, if you are working with rumen samples and are only interested in sequences which represent >1% of your total abundance. But it can really be an issue in low biomass samples, such as air or dust, particularly in hospitals or clean rooms. If you were trying to determine whether healthy adults were carrying but not infected by the pathogen C. difficile in their GI tract, you would be very interested in the presence of even one C. difficile sequence and would want to be extremely sure of which sample it came from. Tag switching can be made worse by combining samples from very different sample types or genetic code targets on the same run.
There are a number of articles proposing methods of dealing with tag switching using double tags to reduce confusion or other primer design techniques, computational correction or variance stabilization of the sequence data, identification and removal of contaminant sequences, or utilizing synthetic mock controls. Mock controls are microbial communities which have been created in the lab by mixing a few dozen microbial cultures together, and are used as a positive control to ensure your procedures are working. Because you are adding the cells to the sample yourself, you can control the relative concentrations of each species, which can act as a standard to estimate the number of cells that might be in your biological samples. Synthetic mock controls don’t use real organisms; they instead use synthetically created DNA to act as artificial “organisms”. If you find these in a biological sample, you know you have contamination. One drawback to this is that positive controls always sequence really well, much better than your low-biomass biological samples, which can mean that your samples do not generate many sequences during a run or means that tag switching is encouraged from your high-biomass samples to your low-biomass samples.
Incorrect base calls
Cross-contamination during sequencing can also be a solely bioinformatic problem – since many of the barcodes are only a few nucleotides (10 or 12 being the most commonly used), if the computer misinterprets the bases it thinks were just added, it can interpret the barcode as being a different one and attribute that sequence to a different sample than the one it came from. This may not be a problem if there aren’t many incorrect sequences generated and it falls below the threshold of what is “important because it is abundant”, but again, it can be a problem if you are looking for the presence of perhaps just a few hundred cells.
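One way to see why a single miscalled base matters is to count how many positions differ between barcodes (the Hamming distance). The Python sketch below uses three invented 10-base barcodes, two of which differ at only one position; the `max_mismatches` tolerance is a hypothetical parameter for illustration, not a setting from any specific demultiplexing tool.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must be the same length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def assign(observed, barcodes, max_mismatches=1):
    """Return the single barcode within tolerance, or None if none or several match."""
    hits = [b for b in barcodes if hamming(observed, b) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Hypothetical barcode set: the first two differ only in their last base.
barcodes = ["ACGTACGTAC", "ACGTACGTAG", "TTGCATTGCA"]

print(min_pairwise_distance(barcodes))
# One miscalled base turns barcode 1 into a perfect copy of barcode 2:
print(assign("ACGTACGTAG", barcodes, max_mismatches=0))
# Allowing one mismatch makes the same read ambiguous, so it is not assigned:
print(assign("ACGTACGTAG", barcodes, max_mismatches=1))
```

Barcode sets in which every pair differs at several positions leave room to detect, or even correct, a miscalled base instead of silently moving the read to another sample.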
When researching environments that have very low biomass, such as air, dust, and hospital or cleanroom surfaces, there are very few microbial cells to begin with. Adding even a few dozen or several hundred cells can make a dramatic impact on what that microbial community looks like, and can confound findings.
Collectively, contamination issues can lead to batch effects, where all the samples that were processed together have similar contamination. This can be confused with an actual treatment effect if you aren’t careful in how you process your samples. For example, if all your samples from timepoint 1 were extracted, amplified, and sequenced together, and all your samples from timepoint 2 were extracted, amplified, and sequenced together later, you might find that timepoint 1 and 2 have significantly different bacterial communities. If this was because a large number of low-abundance species were responsible for that change, you wouldn’t really know if that was because the community had changed subtly or if it was because of the collective effect of low-level contamination.
Stay tuned for a piece on batch effects in sequencing!
500 Women Scientists Eugene would like to thank the organizations that helped make this event possible. First and foremost, First National Taphouse in Eugene, who shared their wonderful space with us and where we will be putting on future Salons, and donated a keg to the event! We are also extremely grateful to several organizations which contributed raffle items for us to raise additional funds, including Broadway Metro, Sizzle Pie, and the Eugene Science Center. Our beautiful logo was crafted by Cassie Cook, our amazing event posters were designed by Serena Lim, and photographer Danielle Cosme took some incredible event photos. Fertilab generously lent us a sound system, the Biology and the Built Environment Center donated the bacterial culture supplies, and both Theresa Cheng and Jessica Flannery provided materials and support for the interactive portion of the event. And of course, we want to acknowledge the national leadership of 500 Women Scientists, who brought us together, gave us a voice, and who suggested these Science Salons as a way to help CienciaPR, an organization which similarly supports science education and infrastructure.
I’d also like to acknowledge the powerhouse team of women who came together to organize this event, and who turned my silly event title into a reality: Karen Yook, Theresa Cheng, Leslie Dietz, and Hannah Tavalire. 500 Women Scientists was formed in the spirit of cooperation and support, and this team truly took that to heart. I can’t wait to organize the next one with you ladies, and the next one, and the next one, and the next one…
Last night, I gave my first “science stand-up” as part of the Oregon Museum of Science and Industry (OMSI) Science Pub series at Whirled Pies in Eugene, OR. I really enjoy giving public presentations of my work, and while I’ve been on stage with a microphone before, it was the first time I got a stool to put my drink on.
I gave a talk which encompassed much of my previous work on host-associated microbiomes in moose and other ruminants, as well as more current research from others on the human gut. It’s difficult enough to fit the field of host-associated microbiomes into a semester-long class, nevermind an hour (I digress), so I kept it to the highlights: “A crash course on the microbiome of the digestive tract“. You can find the slides here: Ishaq OMSI SciPub 20180208, although there is no video presentation at this time. I was honored to have such a well-attended lecture (about 120 people!) with an engaged audience, who had some really on-track questions about the intersection of microbial diversity and health.
As I’ve discussed here before, academic outreach is a sometimes overlooked, yet nevertheless extremely important, aspect of science. The members of the general public are a large portion of our stakeholder audience, and outreach helps disseminate that research knowledge, facilitate transparency of the research process, and engage people who might benefit from or be interested in our work. As I told the audience last night, scientists do like when people ask us about our work, but “we’re more scared of you than you are of us”. I encourage everyone to add science to their life by getting informed, getting involved, and getting out to vote.
Thanks again to OMSI for inviting me to participate, and to Whirled Pies for hosting!
As a thank you, I received this awesome pint glass!
As the 2016 growing season comes to a close in Montana, here in the lab we aren’t preparing to overwinter just yet. In the last few weeks, I have been setting up my first greenhouse trial to expand upon the work we were doing in the field. My ongoing project is to look at changes in microbial diversity in response to climate change. The greenhouse trial will expand on that by looking at the potential legacy effects of soil diversity following climate change, as well as other agricultural factors.
First, though, we had to prep all of our materials, and since we are looking at microbial diversity, we wanted to minimize the potential for microbial influences. This meant that the entire greenhouse bay needed to be cleaned and decontaminated. To mitigate the environmental impact of our research, we washed and reused nearly 700 plant pots and tags in order to reduce the amount of plastic that will end up in the Bozeman landfill.
Each pot needed to be scrubbed with disinfectant soap and then soaked in bleach.
Lines of pots drying on the rack.
I scrubbed 700 labels clean in order to reuse them.
We also needed to autoclave all our soil before we could use it, to make sure we are starting with only the microorganisms we are intentionally putting in. These came directly from my plots in the field study, and are being used as an inoculum, or probiotic, into soil as we grow a new crop of wheat.
This is trial one of three, each of which has three phases, so by the end of 2016 I’ll have cleaned and put soil into 648 pots with 648 tags; planted, harvested, dried and weighed 11,664 plants; and sampled, extracted DNA from, sequenced, and analyzed 330 soil and environmental samples!
Each pot gets six tiny winter wheat seeds planted.
Trial 1: 216 pots ready to grow!
Stay tuned for more updates and results (eventually) from this and my field study!
Bioinformatics brings statistics, mathematics, and computer programming to biology and other sciences. In my area, it allows for the analysis of massive amounts of genomic (DNA), transcriptomic (RNA), proteomic (proteins), or metabolomic (metabolites) data.
In recent years, the advances in sequencing have allowed for the large-scale investigation of a variety of microbiomes. Microbiome refers to the collective genetic material or genomes of all the microorganisms in a specific environment, such as the digestive tract or the elbow. The term microbiome is often casually thrown around: some people mistakenly use it interchangeably with “microbiota”, or use it to describe only the genetic material of a specific type of microorganism (i.e. “microbiome” instead of “bacterial microbiome”). Not only have targeted, or amplicon, sequencing techniques improved, but methods that use single or multiple whole genomes have become much more efficient. In both cases, this has resulted in more sequences being amplified more times. This creates “sequencing depth”, a.k.a. better “coverage”: if you can sequence one piece of DNA 10 times instead of just once or twice, then you can determine if changes in the sequence are random errors or really there. Unfortunately, faster sequencing techniques usually have more spontaneous errors, so your data are “messy” and harder to deal with. More and messier data creates the problem of handling data.
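The coverage argument can be shown with a toy majority vote in Python. This sketch assumes the reads are already aligned and all the same length, and the sequences are invented; it is only meant to show why ten copies can outvote one random error while two copies cannot.

```python
from collections import Counter


def consensus(aligned_reads):
    """Most common base at each position across equal-length, pre-aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )


def ambiguous_positions(aligned_reads):
    """Positions where no single base is supported by a strict majority of reads."""
    depth = len(aligned_reads)
    positions = []
    for i, column in enumerate(zip(*aligned_reads)):
        top_count = Counter(column).most_common(1)[0][1]
        if top_count * 2 <= depth:
            positions.append(i)
    return positions


# Hypothetical data: the same stretch of DNA, with one read carrying an error.
shallow = ["ACGTAC", "ACGAAC"]              # sequenced twice
deep = ["ACGTAC"] * 9 + ["ACGAAC"]          # sequenced ten times

print(ambiguous_positions(shallow))  # the disagreement cannot be resolved
print(ambiguous_positions(deep))     # the error is outvoted
print(consensus(deep))
```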
DNA analysis requires very complex mathematical equations in order to have a standardized way to quantitatively and statistically compare two or two million DNA sequences. For example, you can use equations for estimating entropy (chaos) and estimate how many sequences you might be missing due to sequencing shortcomings based on how homogeneous (similar) or varied your dataset is. If you look at your data in chunks of 100 sequences, and 90 of them are different from each other, then sequencing your dataset again will probably turn up something new. But if 90 are the same, you have likely found nearly all the species in that sample.
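Two standard equations of this kind are the Shannon entropy, H = -Σ p·ln(p), and Good's coverage, 1 - (number of species seen only once / number of sequences). The Python sketch below applies both to two invented chunks of 100 sequences that mirror the example above; the species labels are hypothetical and these are not necessarily the exact estimators any given pipeline uses.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (natural log) of the species labels in a sample."""
    total = len(labels)
    return -sum(
        (count / total) * math.log(count / total)
        for count in Counter(labels).values()
    )


def goods_coverage(labels):
    """Good's coverage: 1 minus the fraction of sequences that are singletons."""
    counts = Counter(labels)
    singletons = sum(1 for count in counts.values() if count == 1)
    return 1 - singletons / len(labels)


# Hypothetical chunks of 100 sequences each.
varied = [f"species_{i}" for i in range(90)] + ["common"] * 10   # 90 are all different
uniform = ["common"] * 90 + [f"species_{i}" for i in range(10)]  # 90 are the same

for name, chunk in (("varied", varied), ("uniform", uniform)):
    print(name, round(shannon_entropy(chunk), 2), round(goods_coverage(chunk), 2))
```

Low coverage in the varied chunk says that more sequencing would probably turn up something new, while high coverage in the uniform chunk says most of what is there has likely been seen.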
Bioinformatics takes these complex equations and uses computer programs to break them down into many simple pieces and automate them. However, the more data you have, the more equations the computer will need to do, and the larger your files will be. Thus, many researchers are limited by how much data they can process.
There are several challenges to analyzing any dataset. The first is assembly.
Sequencing technology can only add so many nucleotide bases to a synthesized sequence before it starts introducing more and more errors, or just stops adding altogether. To combat this increase in errors, DNA or RNA is cut into small fragments, or primers are used to amplify only certain small regions. These pieces can be sequenced from one end to another, or can be sequenced starting at both ends and working towards the middle to create a region of overlap. In that case, to assemble, the computer needs to match up both ends and create one contiguous segment (“contig”). With some platforms, like Illumina, the computer tags each sequence by where on the plate it was, so it knows which forward piece matches which reverse.
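Matching up the two ends can be sketched in Python as follows. This toy version assumes error-free reads and an exact overlap, and the 30-base fragment is made up; real read mergers tolerate mismatches and use the quality scores to decide which base to keep.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A<->T, C<->G, read backwards)."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(forward, reverse, min_overlap=5):
    """Join a forward read and a reverse read into one contig via their overlap.

    Returns None when no exact overlap of at least min_overlap bases exists.
    """
    rev = reverse_complement(reverse)  # put the reverse read on the forward strand
    # Try the longest possible overlap first, then shorter ones.
    for overlap in range(min(len(forward), len(rev)), min_overlap - 1, -1):
        if forward[-overlap:] == rev[:overlap]:
            return forward + rev[overlap:]
    return None


# Hypothetical 30-base fragment read 20 bases in from each end (10-base overlap).
fragment = "ACGTTGCAAGGCTTACCGATAGCTAGGCTA"
forward_read = fragment[:20]
reverse_read = reverse_complement(fragment[10:])

print(merge_pair(forward_read, reverse_read))
```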
When sequencing an entire genome (or many), the pieces are enzymatically cut, or sheared by vibrating them at a certain frequency, and all the pieces are sequenced multiple times. The computer then needs to match the ends up using short pieces of overlap. This can be very resource-intensive for the computer, depending on how many pieces you need to put back together, and whether you have a reference genome for it to use (like the picture on a puzzle box), or whether you are doing it de novo from scratch (putting together a puzzle without a picture, by trial and error, two pieces at a time).
Once assembled into their respective consensus sequences, you need to quality-check the data.
This can take a significant amount of time, depending on how you go about it. It also requires good judgement, and a willingness to re-run the steps with different parameters to see what will happen. An easy and quick way is to have the computer throw out any data below a certain threshold: longer or shorter than what your target sequence length was, ambiguous bases (N) which the computer couldn’t call as a primary nucleotide (A, T, C, or G), or the confidence level (quality score) of the base call was low. These scores are generated by the sequencing machine as a relative measure of how “confident” the base call is, and this roughly translates to potential number of base call errors (ex. marking it an A instead of a T) per 1,000 bases. You can also cut off low-quality pieces, like the very beginning or ends of sequences which tend to sequence poorly and have low quality. This is a great example of where judgement is needed: if you quality-check and trim off low quality bases first, and then assemble, you are likely to have cut off the overlapping ends which end up in the middle of a contig and won’t be able to put the two halves together. If you assemble first, you might end up with a sequence that is low-quality in the middle, or very short if you trim it on the low quality portions. If your run did not sequence well and you have a lot of spontaneous errors, you will have to decide whether to work with a lot of poor-quality data, or a small amount of good-quality data leftover after you trim out the rest, or spend the money to try and re-sequence.
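Those threshold checks can be sketched in Python. Phred quality scores relate to error probability by P = 10^(-Q/10), so Q30 is about 1 error per 1,000 bases, and FASTQ files commonly store each score as a character with an offset of 33. The cutoffs used below (minimum score of 20 for trimming, mean score of 25 for keeping a read) are arbitrary placeholders, not recommendations.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def decode_quality(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 by default) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def trim_low_quality_ends(seq, quals, min_q=20):
    """Trim bases below min_q from both ends of a read."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


def passes_filter(seq, quals, min_len, max_len, min_mean_q=25):
    """Keep reads of the expected length, with no ambiguous bases and good mean quality."""
    if not seq or not (min_len <= len(seq) <= max_len):
        return False
    if "N" in seq:
        return False
    return sum(quals) / len(quals) >= min_mean_q


# Hypothetical read with poor-quality, ambiguous bases at both ends.
seq, quals = "NACGTACGTN", [2] + [30] * 8 + [2]
trimmed_seq, trimmed_quals = trim_low_quality_ends(seq, quals)
print(trimmed_seq, passes_filter(trimmed_seq, trimmed_quals, min_len=5, max_len=10))
```

Whether to run a trim like this before or after assembling paired reads is exactly the judgement call described above.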
There are several steps that I like to add, some of which are necessary and some which are technically optional. One of them is to look for chimeras, which are two sequence pieces that mistakenly got joined together. This happens during the PCR amplification step, often if there is an inconsistent electrical current or other technical problem with the machine. While time- and processor-consuming, chimera checking can remove these fake sequences before you accidentally think you’ve discovered a new species. Your screen might end up looking something like this…
Eventually, you can taxonomically and statistically assess your data.
In order to assign taxonomic identification (ex. genus or species) to a sequence, you need to have a reference database. This is a list of sequences labelled with their taxonomy (ex. Bacillus licheniformis), so that you can match your sequences to the reference and identify what you have. There are several pre-made ones publicly available, but in many cases you need to add to or edit these, and several times I have made my own using available data in online databases.
You can also statistically compare your samples. This can get complicated, but in essence tries to mathematically compare datasets to determine if they are actually different, and if that difference could have happened by chance or not. You can determine if organically-farmed soil contains more diversity than conventionally-farmed soils. Or whether you have enough sequencing coverage, or need to go back and do another run. You can also see trends across the data, for example, whether moose from different geographic locations have similar bacterial diversity to each other (left). Or whether certain species or environmental factors have a positive/negative/ or no correlation (below).
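One simple way to ask whether a difference "could have happened by chance" is a permutation test: shuffle the sample labels many times and see how often the shuffled groups differ as much as the real ones. The Python sketch below is a generic illustration of that idea, not the specific test used for any analysis described here, and the diversity values are invented.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns the fraction of label shufflings whose difference is at least
    as large as the observed one (with a +1 correction so it is never zero).
    """
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical diversity values for six soil samples per farming system.
organic = [4.1, 4.3, 4.2, 4.4, 4.0, 4.5]
conventional = [3.1, 3.0, 3.3, 2.9, 3.2, 3.4]

print(permutation_test(organic, conventional))
```

A small value suggests the two groups differ by more than random shuffling would produce; it says nothing about why, which is where batch effects and contamination come back into the picture.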
Bioinformatics can be complicated and frustrating, especially because computers are very literal machines and need to have things written in very specific ways to get them to accomplish tasks. They also aren’t very good at telling you what you are doing wrong; sometimes it’s as simple as having a space where it’s not supposed to be. It takes dedication and patience to go back through code to look for minute errors, or to backtrack in an analysis and figure out at which step several thousand sequences disappeared and why. Like any skill, computer science and bioinformatics take time and practice to master. In the end, the interpretation of the data and identifying trends can be really interesting, and it’s really rewarding when you finally manage to get your statistical program to create a particularly complicated graph!
Stay tuned for an in-depth look at my current post-doctoral work with weed management in agriculture and soil microbial diversity!
Microbiome studies do not usually employ culturing techniques, and many microorganisms are too recalcitrant to grow in the laboratory. Instead, presumptive identification is made using gene sequence comparisons to known species. The ribosome is an organelle found in all living cells (they are ubiquitous), and it is responsible for translating RNA into amino acid chains. The genes in DNA which encode the parts of the ribosome are great targets for identification-based sequencing. In particular, the small subunit of the ribosome (SSU rRNA) provides a good platform for current molecular methods, although the gene itself does not provide any information about the phenotypic functionality of the organism.
Prokaryotes, such as bacteria and archaea, have a 16S rRNA gene which is approximately 1,600 nucleotide base pairs in length. Eukaryotes, such as protozoa, fungi, plants, animals, etc., have an 18S rRNA gene which is up to 2,300 base pairs in length, depending on the kingdom. In both cases, the 16 or 18 refers to sedimentation rates, and the S stands for Svedberg Units; altogether it is a relative measure of weight and size. Thus, the 18S is larger than the 16S, and would sink faster in water. In both genes, there exist regions which are conserved (identical or near-identical) across taxa, and nine variable regions (V1-V9). The variable regions are generally found on the exterior of the ribosome, where they are more exposed and prone to higher evolutionary rates. Since the outside of the ribosome is not integral to maintaining its structure, the variable regions are not under functional constraint and may evolve without destroying the ribosome. They provide a means for identification and classification through analysis [2-6]. The conserved areas are targets for primers, as a single primer can bind universally (to all or nearly all) of its target taxa. The conserved regions are all on the internal structure of the ribosome, and too much change in the sequence will cause its 3D (tertiary) structure to change, thus it won’t be able to interact with the many components in the cell. Mutations or changes in the conserved regions often cause a non-functional ribosome and will kill the cell.
In addition to a small subunit, ribosomes also possess a large subunit (LSU rRNA): the 23S rRNA in prokaryotes, and the 28S rRNA in eukaryotes. Eukaryotes have an additional 5.8S rRNA in the large subunit which, like the other rRNAs, is not translated into protein, and all small and large subunit rRNAs have associated proteins which aid in structure and function. Taken together, this gives a combined 70S ribosome in prokaryotes, and a combined 80S ribosome in eukaryotes. Svedberg values are not additive, which is why the combined values are not simply the sums of the parts.
The way to study the rRNA gene is to sequence it. First, you need to extract the DNA from cells, and then you need to make millions of copies of the gene you want using Polymerase Chain Reaction (PCR). PCR and sequencing technology work more or less the same way a cell makes copies of its DNA for cell processes or before cell division. You take template DNA, primers that mark the gene you want, building block nucleotides, and a polymerase enzyme which is responsible for reading the DNA sequence and making an identical copy, and with hours of troubleshooting get a billion copies! Many sequencing machines use nucleotides that have colored (fluorescent) dyes attached. In one common approach, when a nucleotide is added a camera records its color, and the dye is then cut (cleaved) off so that the next nucleotide can be added. The machine records each nucleotide being added to each separate DNA strand, and outputs the sequences for the microorganisms that were in your original sample!
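The "billion copies" from PCR comes from doubling: in the ideal case every cycle copies every strand, so the copy number grows as 2 to the power of the number of cycles. Here is a minimal sketch of that arithmetic. The `efficiency` parameter is an assumption added for illustration (1.0 means perfect doubling, and real reactions fall short of that), and the function names are made up for this example.

```python
def copies_after(cycles, start_copies=1, efficiency=1.0):
    """Idealized copy number after a number of PCR cycles.

    efficiency=1.0 means every template is copied in every cycle (perfect doubling).
    """
    return start_copies * (1.0 + efficiency) ** cycles


def cycles_to_reach(target, start_copies=1, efficiency=1.0):
    """Smallest number of cycles needed to reach at least `target` copies."""
    if efficiency <= 0:
        raise ValueError("efficiency must be greater than 0")
    cycles = 0
    copies = float(start_copies)
    while copies < target:
        copies *= 1.0 + efficiency
        cycles += 1
    return cycles


print(cycles_to_reach(1e9))                  # perfect doubling
print(cycles_to_reach(1e9, efficiency=0.9))  # assumed 90% efficiency
print(f"{copies_after(30):,.0f} copies after 30 ideal cycles")
```

Starting from a single template, about 30 ideal cycles are enough to pass a billion copies, and a less efficient reaction needs a few more. This simple model ignores the plateau that real reactions hit when primers and nucleotides run out.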
The two main challenges facing high-throughput sequencing are choosing a target for amplification, and integrating the generated data into an increased understanding of the microbiome of the environment being studied. High-throughput sequencing can currently generate thousands to millions of reads, with maximum read lengths of roughly 600 to 1,000 bases depending on the platform. Because that is shorter than the full-length rRNA gene, studies have been forced to choose which variable regions to amplify and sequence, and this has opened up an arena for debate on which variable region to choose. And of course, the analysis of all the sequence data you’ve now created is quickly being recognized as the most difficult part, which is what I focused on during my post-doc in the Yeoman Lab. Stay tuned for a blog post on the wonderful world of bioinformatics!
Neefs J-M, Van de Peer Y, Hendriks L, De Wachter R: Compilation of small ribosomal subunit RNA sequences. Nucleic Acids Res 1990, 18:2237–2318.
Kim M, Morrison M, Yu Z: Evaluation of different partial 16S rRNA gene sequence regions for phylogenetic analysis of microbiomes. J Microbiol Methods 2010, 84:81–87.
Doud MS, Light M, Gonzalez G, Narasimhan G, Mathee K: Combination of 16S rRNA variable regions provides a detailed analysis of bacterial community dynamics in the lungs of cystic fibrosis patients. Hum Genomics 2010, 4:147–169.
Yu Z, Morrison M: Comparisons of different hypervariable regions of rrs genes for use in fingerprinting of microbial communities by PCR-denaturing gradient gel electrophoresis. Appl Env Microbiol 2004, 70:4800–4806.
Lane DJ, Pace B, Olsen GJ, Stahl DA, Sogin ML, Pace NR: Rapid determination of 16S ribosomal RNA sequences for phylogenetic analyses.Proc Natl Acad Sci USA 1985, 82:6955–6959.
Yu Z, García-González R, Schanbacher FL, Morrison M: Evaluations of different hypervariable regions of archaeal 16S rRNA genes in profiling of methanogens by archaea-specific PCR and denaturing gradient gel electrophoresis. Appl Env Microbiol 2007, 74:889–893.