People know that DNA sequencing technology has advanced, and I think the common layperson's perception is that we can sequence a whole genome, each chromosome, from end to end. In many cases that's possible, but it's still a monumental effort. Notions of a "$1000 genome" belie the difficulties of full genome sequencing. When you hear in the news that we can sequence your genome – services like "23andMe" – you might think we're getting the whole picture, but we aren't. We can sequence multitudes of short sequences very quickly, and what we get is then mapped to a reference genome (itself the product of one of those painstaking efforts). But a large portion (in my view) of what is sequenced cannot be mapped, and the reads that are mapped can have many inconsistencies, because one person's genome may have a certain number of shuffled portions and subtle differences relative to the reference. You could even have two different cells from your own body possess two distinct genomes.
Then there's metagenomics, where we sequence multiple organisms all in one shot. You take a sample of water or dirt, or a swab of the flora of your mouth, extract the DNA from all the microbes there, and sequence it with no reference to map any of the resulting sequences to. In this torrent of information, we lack certain controls typically used to gauge the quality of a sequence. As with all machinery, there is a margin of error: sometimes a sequence comes out with a typo, an A instead of a T, an extra G, or a missing C. When we're sequencing one organism, we can compare a read against the many other copies of the same "word" in the data, and the error gets out-voted and ignored. It's like having 100 secretaries type up the same document in a foreign language you don't know. If 99 secretaries type the first word as "Que" and 1 of them types "Uqe" or "Quee", we can pretty safely say that the correct word is "Que". But if each secretary is randomly given 1 of 25 different documents to type up, each purposefully slightly different, it's not so easy to dismiss "Uqe" or "Quee".
But if we know that the "e" key is slightly sticky and prone to typing double letters every once in a while, it becomes easier to dismiss an instance of "Quee", and that's what this post is about. But what if there actually is a word "Quee", and we'd be dismissing a real word by assuming every rare double 'e' is a mistake? We can figure this out by using a control to measure how frequently this type of mistake occurs. As long as the occurrences of "Quee" fall at or below that expected frequency, we can reasonably assume it's a typo. If we see "Quee" two or three times as often as we would expect if it were a typo, we might conclude that it's a real word. And that is the basis for my recent paper and related software.
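To make that arithmetic concrete, here's a toy sketch in Python. It is not the paper's actual code, and the counts, the 1% sticky-key rate, and the factor-of-two cutoff are all made-up numbers purely for illustration:

```python
from collections import Counter

# Hypothetical words typed by our secretaries (toy data).
typed = ["Que"] * 970 + ["Quee"] * 25 + ["Uqe"] * 5
counts = Counter(typed)

# Suppose a control measurement told us the sticky "e" key
# doubles the letter about 1% of the time.
doubled_e_rate = 0.01

# Expected number of "Quee" occurrences if every one were a mistyped "Que":
expected_typos = counts["Que"] * doubled_e_rate  # 970 * 0.01 = 9.7

# Seeing "Quee" well above the expected typo frequency
# (here, more than twice the expectation) suggests a real word.
if counts["Quee"] > 2 * expected_typos:
    print("'Quee' is probably a real word")
else:
    print("'Quee' is probably a typo of 'Que'")
```

With these toy numbers, 25 observed "Quee"s against roughly 10 expected typos is enough to keep "Quee" as a real word; drop the count to 8 and it would be dismissed.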
Typically, these sorts of errors are filtered out by first grouping all the most similar "words" together and then selecting the most frequent one as the representative of that group, assuming all the others are errors of it. Our method forgoes the clustering step. Instead, we first measure the frequency of each type of error present in the data, using the cases we are fairly confident are errors. Then, for each pair of similar words, we ask whether the less frequent one could be an error of the other: we compare how often each word is encountered and check whether the rarer word's count falls within the typical error frequency we measured earlier. We call our method "Cluster Free Filtering", or CFF for short.
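The pairwise test above can be sketched in a few lines of Python. This is a deliberately crude stand-in for the real method: the function names are mine, the per-base error rate is a made-up constant rather than one measured from the data, and the error model ignores the combinatorics of which bases get substituted:

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

# Toy error model: assume each base is misread into one specific wrong base
# about 0.5% of the time (in the real method this rate is measured
# from the data, not assumed).
PER_BASE_ERROR = 0.005

def looks_like_error(rare_count, abundant_count, distance):
    """Is the rare sequence's count consistent with it being a
    sequencing error of the abundant sequence?"""
    expected = abundant_count * PER_BASE_ERROR ** distance
    return rare_count <= 2 * expected

# An abundant sequence and a rare neighbor one mismatch away.
abundant, rare = "ACGTACGT", "ACGAACGT"
counts = {abundant: 5000, rare: 20}
d = hamming(abundant, rare)  # 1 mismatch
print(looks_like_error(counts[rare], counts[abundant], d))
```

Here 5000 copies of the abundant sequence are expected to spawn about 25 single-mismatch errors, so a neighbor seen only 20 times is plausibly noise and gets filtered; a neighbor seen 200 times would survive as a real, distinct sequence.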
There's a lot more to it, but that's the basic concept. You can get the nitty-gritty details from the paper, or even try out CFF for yourself if you have some DNA sequences on your computer; it's freely available. Note, though, that this software is specific to one narrow realm of metagenomic analysis: 16S rRNA variable regions, where all the short sequences are very similar at the starting point.