Automator Services for finding coordinates in DNA/AA strings

Do you write code to analyze or modify DNA or proteins?  Do you do your work on a Mac?  If so, I have a few Automator Services, written in AppleScript, that you may find very handy:

  1. Get the sequence (& alignment) length of a selected nucleotide string
  2. Get the sequence length of any selected string (e.g. protein or quality string)
  3. Show where a coordinate is in any selected string (including white-spaces)
  4. Show where a coordinate is in a selected sequence (e.g. protein or quality string)
  5. Show where a nucleotide coordinate is in a selected sequence
  6. Show where an alignment coordinate is in a selected nucleotide sequence
  7. Get the reverse complement of a selected nucleotide sequence
  8. Guess the barcodes present in a FastQ file *NEW

With these, you can highlight a sequence anywhere in any application and either get the selection length or show where a supplied coordinate is in the selected sequence.

Once installed, each service appears system-wide on your Mac, in the Services sub-menu of the contextual menu that shows up anytime you right-click any selected text.


Here are the full details of how to use each service and what it does:

1. Get the sequence (& alignment) length of a selected nucleotide string

Name: Count Nucleotides


This service does a bit more than count nucleotides.  As seen in the example on the right, it reports the number of nucleotides in the selection (sequence length), the length of the selected alignment, the number of discrete & ambiguous nucleotides, and a breakdown of all case-insensitive sequence characters (including gaps and ambiguous nucleotides).

Spaces, tabs, newlines, carriage returns, numbers, or any other non-sequence characters are completely ignored, so you can select any sequence, even if it is formatted & displayed with coordinates.  Only the sequence found inside the highlighted text is considered.  The first selected sequence character is coordinate 1.
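As a rough illustration of the counting described above, here is a minimal Python sketch (not the AppleScript from the gist).  The character sets are assumptions: discrete nucleotides are taken to be A, C, G, T, and U, ambiguous nucleotides are the IUPAC ambiguity codes, and the only gap character is "-".  The function name and the keys of the returned dictionary are made up for this example.

```python
from collections import Counter

# Assumed character classes; the real service may use different sets.
DISCRETE = set("ACGTU")
AMBIGUOUS = set("RYSWKMBDHVN")  # IUPAC ambiguity codes
GAPS = set("-")                 # assumed gap character


def count_nucleotides(selection):
    """Summarize the sequence characters in a selection, ignoring everything else."""
    counts = Counter()
    for ch in selection.upper():  # counting is case-insensitive
        if ch in DISCRETE or ch in AMBIGUOUS or ch in GAPS:
            counts[ch] += 1
    discrete = sum(n for ch, n in counts.items() if ch in DISCRETE)
    ambiguous = sum(n for ch, n in counts.items() if ch in AMBIGUOUS)
    gaps = sum(n for ch, n in counts.items() if ch in GAPS)
    return {
        "sequence_length": discrete + ambiguous,          # nucleotides only
        "alignment_length": discrete + ambiguous + gaps,  # nucleotides plus gaps
        "discrete": discrete,
        "ambiguous": ambiguous,
        "breakdown": dict(counts),
    }
```

Because digits and whitespace never enter the tally, a selection such as "1 ATG-CN" followed by a second line "7 atg" is treated the same as the bare string "ATG-CNATG".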

2. Get the sequence length of any selected string (e.g. protein or quality string)

Name: Count Sequence Characters

This service counts every character selected except for spaces, tabs, newlines, and carriage returns.  It’s good for getting the length of selected unaligned protein sequences (with no formatted coordinates in the selection) or of quality strings.  Note, there is currently no service for aligned amino acid sequences, but if you would like such a service, let me know in the comments and I’ll whip one up.  I work mostly with DNA, thus I haven’t had much need for aligned protein coordinate determinations.
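The rule is simple enough to sketch in a couple of lines of Python.  This is only an illustration of the behavior described above, not the service's AppleScript, and the function name is hypothetical.

```python
def count_sequence_characters(selection):
    """Count every character except spaces, tabs, newlines, and carriage returns."""
    return sum(1 for ch in selection if ch not in " \t\n\r")
```

Note that digits are counted here, which is why a selection that includes formatted coordinates would give an inflated length.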

3. Show where a coordinate is in any selected string (including white-spaces)

Name: Select N Characters

This service works only on “solid sequence” (i.e. sequence with no whitespace, hard returns, or any other non-sequence characters).  See services 4-6 for sequence-specific functions.  This service shows where a coordinate is by changing the length of the selection: after the change, the last selected character sits at the coordinate you supplied.  For example, if you tell it to select 4 characters in this string you selected: ATGCCGTAG, the selection will end up covering only the first 4 characters: ATGC.


There are 2 ways to supply the coordinate.  The default way is to grab the coordinate from the clipboard, so all you have to do is copy the number you want to use to set the selection length.  However, if the content of the clipboard is not a number, a popup window will appear to ask you to enter the desired selection length.


To use this service:

  1. [Optional] Copy a number/length indicating the amount of the sequence you want to select.
  2. Select any length of sequence from the start position (position 1)
  3. Right-click the selection and select Services -> Select N Characters
  4. [If you didn’t do step 1] Enter the length of sequence you want to select

This script alters the selection length of the selected text you right-clicked on, in the window in which you clicked on the sequence, regardless of application.  However, if you are in Terminal, it displays the selection result in a popup window instead of in the Terminal itself.  This is because the modification of the selection length is accomplished by shift-arrow keystroke emulation, and this is not a means by which you can modify a selection in the Terminal app.  This has 1 side effect.  Normally, in any other app, an entered length longer than the selection is not a problem and the selection just expands.  In Terminal’s popup result window, however, all the service has access to is the selected text, so placeholder ‘N’s are appended to show the desired sequence length.
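The Terminal fallback can be pictured with a small Python sketch.  It is an assumption about the behavior described above rather than the actual AppleScript, and the function name is hypothetical.

```python
def terminal_selection_result(selected_text, length):
    """Return the text to show in the popup for a requested selection length."""
    if length <= len(selected_text):
        # The requested coordinate lies inside the selection: just truncate.
        return selected_text[:length]
    # The selection is too short and cannot be expanded, so pad with 'N's.
    return selected_text + "N" * (length - len(selected_text))
```

So asking for 5 characters when only ATG is selected would yield ATGNN, while asking for 4 characters of ATGCCGTAG yields ATGC.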

There are other drawbacks to this Terminal selection-length work-around.  The font is not fixed-width and the width of the popup window is fixed at a fairly narrow size, thus large sequences cannot be displayed very well.

Since the selection modification happens via emulated keystrokes, it takes a little time for the final selection to be made, but you’ll see how fast it goes as you watch the selection being made.  Since it’s not instantaneous, the script will either adjust the selection from the end or select anew from the beginning for efficiency.
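One way to think about that choice is as a comparison of keystroke counts.  The Python sketch below is a guess at the idea, not the script's real logic: it assumes a cost model in which adjusting from the end takes one shift-arrow per character of difference, and starting over takes one keystroke to collapse the selection plus one shift-arrow per selected character.  The function and strategy names are hypothetical.

```python
def plan_keystrokes(current_length, target_length):
    """Pick the cheaper of two ways to reach the target selection length.

    Returns a (strategy, keystroke_count) tuple.
    """
    # Grow or shrink the existing selection from its end.
    adjust_cost = abs(target_length - current_length)
    # Assumed: 1 keystroke to collapse to the start, then 1 per character.
    restart_cost = 1 + target_length
    if adjust_cost <= restart_cost:
        return ("adjust_from_end", adjust_cost)
    return ("reselect_from_start", restart_cost)
```

With 100 characters selected, a target of 95 is reached fastest by trimming 5 characters from the end, while a target of 3 is reached fastest by selecting anew from the beginning.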

4-8. New services

Names: Select N Sequence Characters, Select N Nucleotides, Select N Alignment Characters, Reverse Complement, & Guess Barcodes

The three Select N services operate just like Select N Characters, but take the character type into account.

Select N Sequence Characters doesn’t include whitespace characters such as spaces, tabs, and newlines in the calculation of a coordinate in a string.

Select N Nucleotides doesn’t include non-nucleotide characters such as white spaces, gap characters, numbers, etc. in the calculation of a coordinate in a string.

Select N Alignment Characters behaves just like Select N Nucleotides but includes gap characters in the calculation of a coordinate in a string as the one exception.
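The difference between the three services comes down to which characters count toward the coordinate.  Here is a hedged Python sketch of that idea (not the gist's AppleScript).  The character sets, the mode names, and both function names are assumptions made for this example, and "-" is assumed to be the only gap character.

```python
WHITESPACE = set(" \t\n\r")
NUCLEOTIDES = set("ACGTURYSWKMBDHVN")  # assumed: discrete + IUPAC ambiguity codes
GAPS = set("-")                        # assumed gap character


def counts_as(mode):
    """Return a predicate saying whether a character counts toward the coordinate."""
    if mode == "sequence":     # like Select N Sequence Characters
        return lambda ch: ch not in WHITESPACE
    if mode == "nucleotide":   # like Select N Nucleotides
        return lambda ch: ch.upper() in NUCLEOTIDES
    if mode == "alignment":    # like Select N Alignment Characters
        return lambda ch: ch.upper() in NUCLEOTIDES or ch in GAPS
    raise ValueError("unknown mode: " + mode)


def selection_end(text, coordinate, mode):
    """Return how many raw characters to select so that the selection ends on
    the coordinate-th counted character, or None if there are too few."""
    counted = counts_as(mode)
    seen = 0
    for index, ch in enumerate(text):
        if counted(ch):
            seen += 1
            if seen == coordinate:
                return index + 1
    return None
```

For the text "1 ATG-C GT", coordinate 4 ends after the G in sequence mode (the leading 1 is counted), after the C in nucleotide mode, and after the gap in alignment mode.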

Guess Barcodes allows you to right-click on a file and find out what the likely barcodes are.  It takes a little bit to run and makes some common assumptions, but if you think the results are wrong, you are given the opportunity to tweak the parameters and run again.


To install the services:

  1. Go to the GitHub gist containing the AppleScript code for each service
  2. Copy the code from one of the files in the GitHub gist
  3. Open Automator.app
  4. Select “Service” (gear icon) from the dropdown sheet & click “Choose”
  5. Drag the “Run AppleScript” action into your workflow
  6. Replace the purple code in the Run AppleScript action with the code you copied in step 2
  7. Save the workflow and name it however you would like it to appear in the services contextual menu (E.g. “Count Non-Whitespace Characters.workflow” – the extension will not appear in the menu)
  8. Repeat for the remaining services.

The workflows/services will be saved automatically to your Library/Services directory in your home directory.  If you right-click the file name at the top of the window in Automator, you can select the Services folder to reveal it in the Finder.  You can then copy that file and send it to any other computer you would like to also have that service.

Just try your new services out by right-clicking on selected text anywhere.

And as you can see from the example above, Select N Characters works on any text.

Have fun!

Disclaimers: These services are only intended as a quick and dirty solution to work in any context & any app.  If you have a repeated common use-case, consider other solutions.  Note also that these services will fail in any application which reserves the arrow keys, when the shift key is held down, for some function other than modifying the most recent selection.  Some applications, such as Java applications, modify selections using shift-arrow navigation differently, depending on the direction of the mouse drag during text selection.  This can produce unexpected results.  A work-around for both such issues can be to use the strategy used for Terminal, but this would require modification to the code.  A few of the features in the script rely on some tricks such as statically set delays and command-line calls, necessary to either wait for an application to respond or to control the focus of various windows.  If your computer is very busy or has any configuration issues, the proper functioning of these services may be disrupted.  These services were developed and tested on macOS Sierra, 10.12.6.  They may or may not work in other macOS versions.

Filtering Metagenomic Errors

People know that DNA sequencing technology has advanced and I think that the common lay-person’s perception is that we can sequence a whole genome, each chromosome, from end to end.  In many cases, that’s possible, but it’s still a monumental effort.  Notions of a “$1000 genome” belie the difficulties in full genome sequencing.  When you hear in the news that we can sequence your genome – services like “23andMe”, you think that we’re getting the whole picture, but we aren’t.  We can sequence multitudes of short sequences very quickly and what we get is then mapped to a reference genome (which was one of those painstaking efforts).  But what I would consider a large portion of what is sequenced cannot be mapped, and the sequences that are mapped can have many inconsistencies – because one person’s genome may have a certain number of shuffled portions and subtle differences.  AND you could even have two different cells from your own body possess 2 distinct genomes.

Then there’s metagenomics – where we sequence multiple organisms all in one shot.  You take a sample of water, dirt, or a swab from the flora of your mouth and you extract the DNA from all the microbes there and sequence it without a reference to map any of the resulting sequences to.  In this torrent of information, we lack certain controls typically used to gauge quality of the sequence.  As with all machinery, there is a margin of error.  Sometimes a sequence that comes out has a typo, an A instead of a T or an extra G, or a missing C.  When we’re sequencing one organism, we can compare a piece of DNA with other copies of DNA with the same “word” and the error gets out-voted and ignored. It’s like having 100 secretaries type up the same document in a foreign language that you don’t know.  If 99 secretaries type the first word as “Que” and 1 of them types “Uqe” or “Quee”, we can pretty safely say that the correct word is “Que”.  But if each secretary is randomly given 1 of 25 different documents to type up – each of which is purposefully slightly different, it’s not so easy to dismiss “Uqe” or “Quee”.
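The secretary analogy amounts to a majority vote.  A tiny Python sketch of that vote, with a hypothetical function name, looks like this:

```python
from collections import Counter


def consensus(copies):
    """Return the most frequently typed version and its share of the votes."""
    votes = Counter(copies)
    winner, count = votes.most_common(1)[0]
    return winner, count / len(copies)
```

With 99 copies of "Que" and 1 copy of "Uqe", "Que" wins with 99% of the votes.  The vote only works because every copy is supposed to be the same word, which is exactly the assumption that breaks down in a metagenomic sample.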

But if we know that the “e” key is slightly sticky and prone to typing double letters every once in a while, it becomes easier to dismiss an instance of “Quee”, and that’s what this post is about.  But what if there actually is a word such as “Quee” and we’d be dismissing a real word because we assume all rare occurrences of a double ‘e’ are mistakes?  We can figure this out by using a control to measure how frequently this type of mistake occurs.  As long as the occurrences of “Quee” fall into that general frequency or below, we can reasonably assume that it’s a typo.  If we see “Quee” twice or three times as many times as we would expect if it were a typo, we might conclude that it’s a real word.  And that is the basis for my recent paper and related software.
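That reasoning can be written down as a simple threshold test.  The Python sketch below is only an illustration of the idea, not the method from the paper: the function name, the single error_rate parameter, and the default tolerance of 2 (taken from "twice ... as many times as we would expect") are all assumptions.

```python
def looks_like_typo(variant_count, parent_count, error_rate, tolerance=2.0):
    """Decide whether a rare variant is plausibly a typo of a common word.

    error_rate is the measured fraction of times the common word is expected
    to come out as this particular kind of mistake (from the control).
    """
    expected_typos = parent_count * error_rate
    # At or near the expected frequency: call it a typo.
    # At `tolerance` times the expectation or more: probably a real word.
    return variant_count < tolerance * expected_typos
```

For example, if "Que" appears 1000 times and the sticky "e" key misfires 1% of the time, about 10 copies of "Quee" are expected.  Seeing 8 copies looks like a typo, while seeing 30 suggests that "Quee" is a real word.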

Typically, these sorts of errors are filtered out by first grouping all the most similar “words” together and then selecting the most frequent one as the representative of that group – assuming all others are errors of it.  However, our method forgoes the clustering step and first tries to measure the frequency of each type of error present in the data – the ones we are fairly confident are errors.  Then we look at the most similar words and compare how frequently each one is encountered.  The less frequent word is likely to be an error of the more frequent one if its frequency falls within the typical error rate we measured earlier.  We call our method “Cluster Free Filtering”, or CFF for short.
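To make the contrast with clustering concrete, here is a deliberately simplified Python sketch of a cluster-free pass.  It is not the CFF implementation: it assumes a single, already-measured substitution error rate (the text describes measuring a frequency for each type of error), it treats two sequences as "most similar" only when they differ by exactly one substitution, and the function names and the tolerance parameter are hypothetical.

```python
from collections import Counter


def hamming_one(a, b):
    """True if two equal-length sequences differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def filter_errors(reads, error_rate, tolerance=2.0):
    """Split the distinct sequences into (kept, dismissed) lists.

    A sequence is dismissed when it is one substitution away from an
    already-kept, more abundant sequence and is no more abundant than the
    error rate would predict (within the given tolerance).
    """
    counts = Counter(reads)
    kept, dismissed = [], []
    # Visit sequences from most to least abundant (ties broken alphabetically).
    for seq, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        is_error = any(
            hamming_one(seq, parent)
            and n < tolerance * error_rate * counts[parent]
            for parent in kept
        )
        (dismissed if is_error else kept).append(seq)
    return kept, dismissed
```

With 1000 copies of ACGT, 200 copies of ACCT, 5 copies of ACGA, and an error rate of 1%, ACCT is kept because it is far too abundant to be an error of ACGT, while ACGA is dismissed.  No sequence is ever assigned to a cluster, and each rare sequence is judged only against its more abundant near neighbors.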

There’s a lot more to it, but that’s the basic concept.  You can get the nitty gritty details from the paper or even try out CFF for yourself if you have some DNA on your computer.  It’s freely available.  Note though that this is software specific to 1 narrow realm of metagenomic analysis: analysis of 16S rRNA variable regions where all the short sequences are very similar at the starting point.