Brian Connelly: Articles tagged 'howto' (http://bconnelly.net)

R Phone Home: Notifications with pushoverr
Brian Connelly | Wed, 23 Nov 2016 | http://bconnelly.net/2016/11/R-phone-home/
Tags: r, httr, pushover, howto

There are a lot of times when it would be great if your computer talked to you. pushoverr allows you to send yourself or your group notifications from R. Instead of compulsively checking the status of a running job, you can just have R tell you when it’s done, if it failed, or what the results are.

To get started, download a Pushover client for your mobile device or desktop. Pushover is free for 7 days, but requires a $4.99 license after that. Once you’ve registered, you’ll get a user key, which looks like uQiRzpo4DXghDmr9QzzfQu27cmVRsG. You can find your user key either in the app’s settings or on Pushover’s website after logging in.

Pushover allows you to create “applications”, which are basically just different channels for your messages. You can also receive messages from “public applications” like GitHub (via a service hook), IFTTT, Zapier, and many others.

Creating Your Application

All it takes to create an application dedicated to receiving messages from R is to visit the create a new application page and fill in a few fields. For this demo, I’m going to call it pushoverr (it can be anything you want, though). The other fields are optional, but I’m going to set an icon. The R logo seems like a natural fit (scaled down to 72 pixels wide, download here). Now, whenever a message arrives, I’ll be able to tell at a glance that it came from R.

Creating our Pushover application

Once you’ve agreed to the terms and clicked Create Application, you’ll be taken to the information page for your new application. Here, you can see some plots of the number of messages sent by your app both recently and over a longer period. You can also create subscription codes for your applications, which can allow other people to easily subscribe to notifications for that app. This is really handy for sending notifications to coworkers or even a wider audience. For now, what’s most important is your API Token/Key, which we’ll need to send notifications from this app. For my demo application here, the API token is avgk9kgxjp2e5xqnn3kzutvh43g27k.

pushoverr application information page

Sending Notifications

Now it’s time to start sending some messages. Start up an R session and install pushoverr:

install.packages("pushoverr")

Once that’s completed, we’re ready to let R start talking. Load the pushoverr package, and use the pushover function to craft and send your first message.

library("pushoverr")

pushover(message = "Mr. Watson--come here--I want to see you.")

Since this is your first time, pushover will ask you for your user key and app token. Don’t use mine.

setting pushoverr credentials (screenshot)
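Entering them at the prompt is fine for interactive use. If you’re running a script non-interactively (say, with Rscript), you can store the credentials yourself beforehand. A minimal sketch, assuming the set_pushover_user and set_pushover_app helpers from the package documentation (the keys shown are the throwaway examples from above):

set_pushover_user(user = "uQiRzpo4DXghDmr9QzzfQu27cmVRsG")
set_pushover_app(token = "avgk9kgxjp2e5xqnn3kzutvh43g27k")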

If everything worked, you should see a notification on your device very shortly.

our first pushoverr notification arrives (screenshot)

Using other arguments to pushover, you can adjust other aspects of your message, including its sound, an accompanying link, and its priority.
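For example, something along these lines adds a title, bumps the priority, picks a different sound, and attaches a link. This is just a sketch; the argument names are taken from the package documentation, so check ?pushover for the current set:

pushover(message = "The overnight run finished without errors.",
         title = "R status",
         priority = 1,
         sound = "bike",
         url = "http://bconnelly.net")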

Your user key and app token are saved for the rest of your session, so you can send further messages without having to re-enter them. Pushover offers five different message priorities, ranging from silent to emergency. Let’s send an emergency-priority message, which will be re-sent every 30 seconds until it is acknowledged on your device.

pushover_emergency(message = "The kittens are awake, and they are ANGRY!")

Notification: The kittens are awake, and they are ANGRY! (screenshot)

Displaying Data on Your Wrist

I previously used Numerous to put Lake Union’s current temperature on my watch, which I scraped using R. It was a great way to know how shocking my daily swim would be. Numerous shut down earlier this year, so I was extremely excited to see that Pushover recently added the ability to push data to smaller screens.

pushoverr can now send several different types of data to your watch. For example, we can quickly display the current water temperature.

water_temp <- 52
update_glance(count = water_temp)

Watch displaying temperature (screenshot)

I won’t be swimming today.

We can also send numeric values as a percent, which is good for displaying progress. There’s also support for small bits of text:

update_glance(text = "hello woRld")

Watch displaying 'hello woRld' (screenshot)
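Progress reports work the same way. Here’s a sketch that updates the watch periodically during a long-running loop, assuming the percent argument shown in the package documentation (run_simulation is a stand-in for your actual work):

n_runs <- 200
for (i in seq_len(n_runs)) {
    run_simulation(i)  # placeholder for the real long-running task
    if (i %% 20 == 0) {
        update_glance(percent = 100 * i / n_runs)
    }
}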

Wrapping Up

Enabling R to send you bits of information can be extremely useful. I’ve used Pushover to do this for a few years, but there are several other alternatives. Pushbullet is a similar service that offers a few additional features, but does not yet have watch support. Dirk Eddelbuettel’s RPushbullet package allows you to interface with Pushbullet from R. You can also send notifications from IFTTT. Check out my previous blog post Connecting R to Everything with IFTTT to see how to send notifications and a whole lot more.

Connecting R to Everything with IFTTT
Brian Connelly | Thu, 18 Jun 2015 | http://bconnelly.net/2015/06/connecting-r-to-everything-with-ifttt/
Tags: howto, r, ifttt, httr

IFTTT (“if this then that”) is one of my favorite tools. I use it to keep and share articles, turn on my home’s lights at sundown, alert me when certain keywords are mentioned on Twitter/Reddit/etc., and many other things. Recently, the great people at IFTTT announced the Maker Channel, which allows recipes to make and receive web requests. This second option caught my interest as a nice way to do all kinds of things from R. For example, you could set the temperature with a Nest Thermostat, blink your hue lightbulbs, write some data to a Google Drive document, or do a whole lot of other things.

For the sake of demonstration, I’m going to use the Maker Channel to send notifications to my phone (even though I’m partial to the pushoverr package for that).

If you don’t already have one, you’ll first need to create an IFTTT account. To receive notifications on your phone, you’ll also need to install the IF app for iOS or Android. The instructions that follow can either be done from the IF app or from the IFTTT website. I’m going to be using the website.

Setting Up the Maker Channel

First, we’ll need to visit the Channels page to enable the Maker Channel. In the search bar, type “maker”. Select it by clicking on the fancy M in the results.

Finding the Maker channel

Now activate the Maker Channel by clicking on the Connect button.

Connecting the Maker Channel

This will get the channel ready to go and generate a secret key. You can see in the picture below that my secret key is ci740p2XeuKq35nfHohG9Z. It’s called a secret key for a reason, so don’t share this with anyone, or they can use it to trigger whatever actions you define. If you’ve managed to leak yours like I have here, you can generate a new secret key by pressing the Reconnect Channel button (which I have done following this post, so don’t try to send me your ANOVA results).

Channel is now activated

Now we’re ready to go. If you’d like to see exactly how to trigger events with the Maker Channel, follow the How to Trigger Events link (if you’ve clicked the link here, you’ll need to replace REPLACE_ME with your actual secret key).

Creating a Recipe for Notifications

Now we’re going to create a recipe to send you a notification on your phone whenever you (or R!) connects to the Maker Channel. Go to My Recipes and click on the Create a Recipe button.

If you’re new to IFTTT, the goal is to connect some output from one channel (i.e., the this) with some other channel (i.e., the that).

Select THIS

Click on the this link, and then select the fancy M for the Maker Channel.

Choosing the Maker channel

For the this part of a recipe, we can only choose to receive web requests, so follow that option. We’re now going to pick a name for the event that is triggered whenever a web request is received. For now, let’s call it r_status. When you’re a Maker Channel pro, you can create multiple different events and have them do different things.

Entering the Event name

Now it’s time to choose the that part of the recipe, which is what IFTTT will do whenever it receives r_status events. Click on the that link, and select either Android Notifications or iOS Notifications, depending on which type of device you have. I’ll be going with the iOS option, so the remaining screenshots may look slightly different for you.

Select THAT

No matter which route you go, the next option is to select Send a Notification.

Here, we can get creative with what the notification says. You can include the name of the event ({{EventName}}), when the web request was received ({{OccurredAt}}), and up to three text values that are given in the web request ({{Value1}}-{{Value3}}). So if you’re fitting a linear model with lm, you could notify yourself of the slope, the intercept, and the amount of time it took to run, which we’ll do later.
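For example, a message along these lines would surface all three values (this is just an illustration; word it however you like):

R event {{EventName}} at {{OccurredAt}}: {{Value1}} (slope = {{Value2}}, intercept = {{Value3}})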

Creating the notification message

Once you’ve crafted your message, hit the Create Action button, edit your recipe title (optional), and hit the Create Recipe button.

Giving R a Voice

Now fire up R or RStudio. We’re going to need the httr package to send web requests. If you don’t already have it (or think you might be out of date), run install.packages("httr").

Before we send our first notification, I’m going to save my event name and secret key in variables. If you do the same (but with your secret values), you’ll be able to easily copy and run all of the other code that follows.

my_event <- 'r_status'
my_key <- 'ci740p2XeuKq35nfHohG9Z'

Now let’s send a first message! First, we’ll build the URL where we’ll issue the request, and then we’ll issue the request with httr’s POST:

maker_url <- paste('https://maker.ifttt.com/trigger', my_event, 'with/key',
                   my_key, sep='/')

httr::POST(maker_url)
## Response [https://maker.ifttt.com/trigger/r_status/with/key/ci740p2XeuKq35nfHohG9Z]
##   Date: 2015-06-18 19:01
##   Status: 200
##   Content-Type: text/html; charset=utf-8
##   Size: 48 B

If all went well, your device should be beeping or buzzing. We see in the results that the status was 200, which indicates a success.
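If the request is buried in a longer script, it can be handy to check the response programmatically rather than eyeballing the printed output. A small sketch using httr’s status helpers:

resp <- httr::POST(maker_url)
if (httr::status_code(resp) != 200) {
    warning("IFTTT request failed with status ", httr::status_code(resp))
}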

Our first notification

Sending Values

You may have noticed that your notification didn’t contain any data. We can add data by specifying values for value1, value2, and value3 (note the lowercase) in the body of our message.

httr::POST(maker_url, body=list(value1='hola', value2='mundo', value3=7))
## Response [https://maker.ifttt.com/trigger/r_status/with/key/ci740p2XeuKq35nfHohG9Z]
##   Date: 2015-06-18 19:01
##   Status: 200
##   Content-Type: text/html; charset=utf-8
##   Size: 48 B

Notification from our hola mundo example

Now that ¡Hola, mundo! is out of the way, we can send it some data. I mentioned fitting a linear model earlier, so let’s do that. First, we’ll borrow some example code from lm’s help page and fit a linear model:

# Create some data
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)

# Fit the model
lm.D9 <- lm(weight ~ group)

Now, we’ll send a notification containing the model’s slope and intercept:

httr::POST(maker_url, body=list(value1='lm complete',
           value2=coefficients(lm.D9)[[2]],
           value3=coefficients(lm.D9)[[1]]))
 
## Response [https://maker.ifttt.com/trigger/r_status/with/key/ci740p2XeuKq35nfHohG9Z]
##   Date: 2015-06-18 19:01
##   Status: 200
##   Content-Type: text/html; charset=utf-8
##   Size: 48 B

Here, we’ve sent a little message as value1, the slope as value2, and the intercept as value3.

Notification from our lm example
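To make good on the earlier promise of also reporting how long the fit took, you could wrap the call in system.time and pass the elapsed seconds along as one of the three values. A sketch:

# Time the fit and report the elapsed seconds as value3
elapsed <- system.time(lm.D9 <- lm(weight ~ group))["elapsed"]

httr::POST(maker_url, body=list(value1='lm complete',
                                value2=coefficients(lm.D9)[[2]],
                                value3=elapsed))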

Hopefully, this little example has demonstrated how IFTTT’s Maker Channel can be used to connect R to a whole lot of online services. Have at it!

Creating Reproducible Software Environments with Packrat
Brian Connelly | Tue, 15 Jul 2014 | http://bconnelly.net/2014/07/creating-reproducible-software-environments-with-packrat/
Tags: datapubs, howto, open access, package, r, reproducibility, packrat

Open science has grown tremendously in the past few years. While there’s still a long way to go, the availability of data, software, and other materials is making it possible to re-use these products to expand upon previous work and apply them to new areas. Through responsible conduct of research (RCR) training, journal requirements, changing individual and institutional principles, and open access evangelism, it’s now much more common for researchers to package their work with the intention of sharing it with others. What exactly this entails depends on a lot of things, including the field of research, the type of data, and how the data were processed and analyzed. At a minimum, one would hope for all of the appropriate data, a description of the software used for analysis or the software itself, and some metadata describing the data and how it was processed.

One aspect that has received little or no attention is the software environment that was used. By software environment, I mean all of the relevant software used by the project. This includes not just “R 3.1.0” or “Python 2.7.8”, but also the additional modules and packages and their specific versions. Given the rapid rate at which environments like R and Python are changing, in particular their commonly-used (and extremely powerful) add-on packages like dplyr, data.table, Pandas, and SciPy, it can very easily become the case that scripts run in a specific environment no longer work properly after just a short period of time. That puts a clock on the reproducibility of your project, which can limit its impact and usefulness.

This post describes Packrat, a package management tool for R. Packrat allows you to easily keep track of the software environment used for a particular project. It also makes the process of recreating that environment almost transparent. Because of how painless it makes sharing reproducible projects—either openly with the public, with collaborators, or just between different machines—Packrat is an essential tool for R.

Although this post is specific to R and Packrat, I hope that it helps users of other environments to think about how they can create reproducible software environments for their own work. Python users should definitely check out virtualenv (and A non-magical introduction to Pip and Virtualenv for Python beginners). These tools are very easy to use, so using them to help make your projects reproducible, either from the beginning or at the end, is a minimal investment that can really pay off.

The Big Picture

First, let’s get an overall idea of how Packrat works. You’ll need to have all the data and scripts related to your project in its own directory. How these files are named and organized is up to you. My personal preference is to create an R package that contains all the data, scripts, and documentation for the project, but that’s not necessary.

Once you initialize Packrat on that directory, it will keep track of which version of R is used as well as which packages are used in your scripts, including their version information and source code. This makes your project completely self-contained. Whenever (and wherever) this project is opened, Packrat will make sure that the packages that it uses are available. This means that the project can now be shared, and anyone using it can do so with the same software environment.

Note that because Packrat stores all of the packages used as well as their source code, projects using Packrat will require more disk space.

Installing Packrat

Packrat is still under development, but it can be installed using devtools.

install.packages("devtools")
devtools::install_github("rstudio/packrat")

Initializing Packrat for a Project

Once your project’s scripts are collected in their own directory, we can initialize Packrat to monitor those scripts.

packrat::init()

init will find any packages used by the scripts in your project and download the source code for the versions used. It will also store the version of R and a copy of itself, so Packrat does not need to already be installed on other machines in order for your project to be used. All of this will be stored in the newly-created directory named packrat. It will also create an .Rprofile file in the project directory. This script runs automatically when the project is opened and takes care of the heavy lifting, making sure that each of the packages on which the project depends is available, building them if necessary. The collection of packages managed by Packrat is called the private library. Your R session will only know about packages available in this library.

Installing Packages

If you decide to use additional packages after the project has been initialized, you can add them by installing them. For example, if we decide to rearrange a data set using reshape2, we can install it and its dependencies into the private library.

install.packages("reshape2")

Keeping Synchronized

Typically, the private library changes when packages are installed or removed or when a package is used or no longer used in a script. By default, Packrat will keep a close eye on your scripts, making sure that the private library is kept up to date. This section demonstrates how to manage the private library using the status command.

Installing Packages

We’ve already seen how to install packages to the private library. Continuing with our previous example, if your scripts don’t yet use reshape2, Packrat will detect this and tell you that it and its dependencies are in the private library, but that they are not needed.

packrat::status()
The following packages are installed but not needed:
             _       
    plyr       1.8.1 
    Rcpp       0.11.2
    reshape2   1.4   

Use packrat::clean() to remove them. Or, if they are actually needed
by your project, add `library(packagename)` calls to a .R file
somewhere in your project.

Once you start using reshape2 in your scripts, packrat will report that everything is fine.

packrat::status()
Up to date.

When a Package is Not Available

Since starting with Packrat, my most common mistake is to start using a package in my scripts without first installing that package to the private library. When this happens, status will let you know (and the scripts won’t work). Let’s say I add functions from pushoverr to my scripts without first installing it:

packrat::status()
Error in getPackageRecords(inferredPkgNames, project = project, lib.loc = lib.loc,  : 
  Unable to retrieve package records for the following pacakges:
- 'bitops', 'digest', 'httr', 'pushoverr', 'RCurl'

Although this error message is somewhat cryptic, it shows that the library does not contain pushoverr or its dependencies. The problem can be resolved by installing pushoverr.

install.packages("pushoverr")
packrat::status()
Up to date.

When a Package is No Longer Used

Packrat will also detect the other situation: when you stop using a library in your project’s scripts. For example, if we no longer use pushoverr, status will display the packages included in the private library that are no longer used.

packrat::status()
The following packages are installed but not needed:
              _         
    bitops      1.0-6   
    digest      0.6.4   
    httr        0.3     
    pushoverr   0.1.3   
    RCurl       1.95-4.1

Use packrat::clean() to remove them. Or, if they are actually needed
by your project, add `library(packagename)` calls to a .R file
somewhere in your project.

Packrat makes it pretty clear how to handle this situation. Either the packages can be removed from the private library with clean, or the packages can be used in a script.
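The first option is a one-liner:

packrat::clean()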

If we run status again after clean, we can see that Packrat still keeps track of packages after they’ve been removed from the private library.

packrat::status()
The following packages are tracked by packrat, but are no longer available in the local library nor present in your code:
              _         
    bitops      1.0-6   
    digest      0.6.4   
    httr        0.3     
    pushoverr   0.1.3   
    RCurl       1.95-4.1

You can call packrat::snapshot() to remove these packages from the lockfile, or if you intend to use these packages, use packrat::restore() to restore them to your private library.

Running snapshot will tell Packrat to stop monitoring any packages that are no longer installed. The private library can then be cleaned. Otherwise, if we’ve decided that we do want to use pushoverr, we can re-install it and its dependencies with restore.
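In code, those two paths look roughly like this; run whichever matches your intent:

# Stop tracking the packages that were removed from the private library...
packrat::snapshot()

# ...or reinstall pushoverr and its dependencies instead
# packrat::restore()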

Removing Packages

Packages can be removed from the private library using remove.packages. For example, we can remove the pushoverr package from our private library:

remove.packages("pushoverr")

If pushoverr is still used by a script, Packrat will let you know:

packrat::status()
Error in getPackageRecords(inferredPkgNames, project = project, lib.loc = lib.loc,  : 
  Unable to retrieve package records for the following pacakges:
- 'pushoverr'

Once an uninstalled package is no longer being used in your scripts, Packrat will give you the option to remove that package and its dependencies from your private library by running clean.

packrat::status()
The following packages are tracked by packrat, but are no longer available in the local library nor present in your code:
              _      
    pushoverr   0.1.3

You can call packrat::snapshot() to remove these packages from the lockfile, or if you intend to use these packages, use packrat::restore() to restore them to your private library.

The following packages are installed but not needed:
           _         
    bitops   1.0-6   
    digest   0.6.4   
    httr     0.3     
    RCurl    1.95-4.1

Use packrat::clean() to remove them. Or, if they are actually needed
by your project, add `library(packagename)` calls to a .R file
somewhere in your project.

Bundling Up the Project

Once the project has reached a state where it’s ready to be shared (or moved to your other computer), it can be “bundled” up using Packrat’s bundle function.

packrat::bundle()
The packrat project has been bundled at:
- "/home/alice/testprojet/packrat/bundles/testproject-2014-07-15.tar.gz"

We can now share the resulting file, /home/alice/testprojet/packrat/bundles/testproject-2014-07-15.tar.gz.

Opening a Bundled Project

There are two ways a bundled project can be “unbundled”. If Packrat is installed on the target machine, the unbundle function can be used.

packrat::unbundle(bundle="testproject-2014-07-15.tar.gz", where="/home/bob/projects")

Otherwise, the bundled project can simply be untarred and un-gzipped using most file archiving programs. Mac OS X users can double-click on the bundled file in Finder. Users of Linux or other Unix-like systems (including OS X) can unbundle the project from the command line.

tar -xzf testproject-2014-07-15.tar.gz

This will unbundle the project into the current directory. To mimic the previous unbundle call and extract the project into /home/bob/projects instead, we can add a target directory to the command:

tar -xzf testproject-2014-07-15.tar.gz --directory /home/bob/projects

With either option, the testproject directory will now be available on the target machine. When R is started from that directory, Packrat will do its magic and make sure the software environment is the same as on the source machine.

Wrapping Up

That covers the basics of Packrat! The init, status, clean, bundle, and unbundle commands should be all you need to maintain projects with portable and reproducible software environments. Packrat does offer some additional functions for more directly managing the state of private libraries (see ?packrat::snapshot) as well as capabilities for easily moving in and out of Packrat projects (see ?packrat::packrat_mode for details).

Because Packrat is being developed by the fantastic people at RStudio, it is also nicely integrated into the latest versions of RStudio. If that’s the environment you use, be sure to check out their guide for Using Packrat with RStudio.


Plotting Microtiter Plate Maps
Brian Connelly | Thu, 01 May 2014 | http://bconnelly.net/2014/05/plotting-microtiter-plate-maps/
Tags: analysis, csv, dplyr, ggplot2, howto, r, visualization

I recently wrote about my workflow for Analyzing Microbial Growth with R. Perhaps the most important part of that process is the plate map, which describes the different experimental variables and where they occur. In the example case, the plate map described which strain was growing and in which environment for each of the wells used in a 96-well microtiter plate. Until recently, I’ve always created two plate maps. The first one is hand-drawn using pens and markers and sat on the bench with me when I started an experiment. By marking the wells with different colors, line types, and whatever other hieroglyphics I decide on, I can keep track of where everything is and how to inoculate the wells.

A Plate Map

The second is a CSV file that contains a row for each well that I use and columns describing the values of each of my experimental variables. This file contains all of the information that I had on my hand-drawn plate map, but in a format that I can later merge with my result data to produce a fully-annotated data set. The fully-annotated data set is the perfect format for plotting with tools like ggplot2 or for sharing with others.

 
   Well Strain Environment
 1    B2      A           1
 2    B3      B           1
 3    B4      C           1
 4    B5   <NA>           1
 5    B6      A           2
 6    B7      B           2
 7    B8      C           2
 8    B9   <NA>           2
 9   B10      A           3
 10  B11      B           3
But when talking with Carrie Glenney, whom I’ve been convincing of the awesomeness of the CSV/dplyr/ggplot workflow, I realized that there’s really no need to have two separate plate maps. Since all the information is in the CSV plate map, why bother drawing one out on paper? This post describes how I’ve started using ggplot2 to create a nice plate map image that I can print and take with me to the bench or paste in my lab notebook.

Reading in the Plate Map

First, load your plate map file into R. You may need to change your working directory with setwd first or give read.csv the full path to the plate map file.

 
platemap <- read.csv("platemap.csv")

If you don’t yet have a plate map of your own, you can use this sample plate map.

Extracting Row and Column Numbers

In my plate maps, I refer to each well by its row-column pair, like “C6”. To make things easier to draw, we’re going to be splitting those well IDs into their row and column numbers. So for “C6”, we’ll get row 3 and column 6. This process is easy with dplyr’s mutate function. If you haven’t installed dplyr, you can get it by running install.packages('dplyr').

 
library(dplyr)

platemap <- mutate(platemap,
                   Row=as.numeric(match(toupper(substr(Well, 1, 1)), LETTERS)),
                   Column=as.numeric(substr(Well, 2, 5)))

Once this is done, the platemap data frame will now have two additional columns, Row and Column, which contain the row and column numbers associated with the well in the Well column, respectively.

Drawing the Plate

Microtiter plates are arranged in a grid, so it’s not a big leap to think about a plate as a plot containing the row values along the Y axis and the column values along the X axis. So let’s use ggplot2 to create a scatter plot of all of the wells in the plate map. We’ll also give it a title.

 
library(ggplot2)

ggplot(data=platemap, aes(x=Column, y=Row)) +
    geom_point(size=10) +
    labs(title="Plate Layout for My Experiment")

First Plot

As you can see, this plot doesn’t tell us anything about our experiment other than the wells it uses and their location.

Showing Empty Wells

I often don’t use all 96 wells in my experiments. It is useful, however, to show all of them. This makes it obvious which wells are used and helps orient your eyes when shifting between the plate map and the plate. Because of this, we’ll create some white circles with a light grey border for all 96 wells below the points that we’ve already created. We’ll also change the aspect ratio of the plot so that it better matches the proportions of a 96-well plate.

 
ggplot(data=platemap, aes(x=Column, y=Row)) +
    geom_point(data=expand.grid(seq(1, 12), seq(1, 8)), aes(x=Var1, y=Var2),
               color="grey90", fill="white", shape=21, size=6) +
    geom_point(size=10) +
    coord_fixed(ratio=(13/12)/(9/8), xlim = c(0.5, 12.5), ylim=c(0.5, 8.5)) +
    labs(title="Plate Layout for My Experiment")

Plot of chunk plot2 blank wells and aspect ratio

Flipping the Axis

Now that we are showing all 96 wells, one thing becomes clear—the plot arranges the rows from 1 on the bottom to 8 at the top, which is opposite of how microtiter plates are labeled. Fortunately, we can easily flip the Y axis. While we’re at it, we’ll also tell the Y axis to use letters instead of numbers and to draw these labels for each value. Similarly, we’ll label each column value along the X axis.

 
ggplot(data=platemap, aes(x=Column, y=Row)) +
    geom_point(data=expand.grid(seq(1, 12), seq(1, 8)), aes(x=Var1, y=Var2),
               color="grey90", fill="white", shape=21, size=6) +
    geom_point(size=10) +
    coord_fixed(ratio=(13/12)/(9/8), xlim=c(0.5, 12.5), ylim=c(0.5, 8.5)) +
    scale_y_reverse(breaks=seq(1, 8), labels=LETTERS[1:8]) +
    scale_x_continuous(breaks=seq(1, 12)) +
    labs(title="Plate Layout for My Experiment")

Axes flipped

For those who would like to mimic the look of a microtiter plate even more closely, I have some bad news. It’s not possible to place the X axis labels above the plot. Not without some complicated tricks, at least.

Removing Grids and other Plot Elements

Although the plot is starting to look a lot like a microtiter plate, there’s still some unnecessary “chart junk”, such as grids and tick marks along the axes. To create a more straightforward plate map, we can apply a theme that will strip these elements out. My theme for doing this (theme_bdc_microtiter) is available as part of the ggplot2bdc package. Follow that link for installation instructions. Once installed, we can now apply the theme:

 
library(ggplot2bdc)

ggplot(data=platemap, aes(x=Column, y=Row)) +
    geom_point(data=expand.grid(seq(1, 12), seq(1, 8)), aes(x=Var1, y=Var2),
               color="grey90", fill="white", shape=21, size=6) +
    geom_point(size=10) +
    coord_fixed(ratio=(13/12)/(9/8), xlim=c(0.5, 12.5), ylim=c(0.5, 8.5)) +
    scale_y_reverse(breaks=seq(1, 8), labels=LETTERS[1:8]) +
    scale_x_continuous(breaks=seq(1, 12)) +
    labs(title="Plate Layout for My Experiment") +
    theme_bdc_microtiter()

added plot theme

Highlighting Experimental Variables

Now that our plot is nicely formatted, it’s time to get back to the main point of all of this—displaying the values of the different experimental variables.

You’ll first need to think about how to best encode each of these values. For this, ggplot provides a number of aesthetics, such as color, shape, size, and opacity. There are no one-size-fits-all rules for this. If you’re interested in this topic, Jacques Bertin’s classic Semiology of Graphics has some great information, and Jeff Heer and Mike Bostock’s Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design is very interesting. After a little experimentation you should be able to figure out which encodings best represent your data.

You’ll also need to consider the data types of the experimental variables, because it’s not possible to map a shape or some other discrete property to continuous values.

Here, we’ll show the different environments using shapes, and the different strains using color. When R imported the plate map, it interpreted the Environment variable as continuous (not a crazy assumption, since it has values 1, 2, and 3). We’re first going to be transforming it to a categorical variable (factor in R speak) so that we can map it to a shape. We’ll then pass our encodings to ggplot as the aes argument to geom_point.

 
platemap$Environment <- as.factor(platemap$Environment)

ggplot(data=platemap, aes(x=Column, y=Row)) +
    geom_point(data=expand.grid(seq(1, 12), seq(1, 8)), aes(x=Var1, y=Var2),
               color="grey90", fill="white", shape=21, size=6) +
    geom_point(aes(shape=Environment, colour=Strain), size=10) +
    coord_fixed(ratio=(13/12)/(9/8), xlim=c(0.5, 12.5), ylim=c(0.5, 8.5)) +
    scale_y_reverse(breaks=seq(1, 8), labels=LETTERS[1:8]) +
    scale_x_continuous(breaks=seq(1, 12)) +
    labs(title="Plate Layout for My Experiment") +
    theme_bdc_microtiter()

full plot

Changing Colors, Shapes, Etc.

By default, ggplot uses its own ordering of shapes and colors. If you’d prefer to use a different set, either because they make the data easier to interpret (see Sharon Lin and Jeffrey Heer’s fascinating The Right Colors Make Data Easier To Read) or for some other reason, we can adjust them. I’ll change the colors to blue, red, and black, which I normally associate with these strains. Although these colors aren’t quite as aesthetically pleasing as ggplot2’s defaults, I use them because they are the colors of the markers I have at my bench.

 
ggplot(data=platemap, aes(x=Column, y=Row)) +
    geom_point(data=expand.grid(seq(1, 12), seq(1, 8)), aes(x=Var1, y=Var2),
               color="grey90", fill="white", shape=21, size=6) +
    geom_point(aes(shape=Environment, colour=Strain), size=10) +
    scale_color_manual(values=c("A"="blue", "B"="red", "C"="black")) +
    coord_fixed(ratio=(13/12)/(9/8), xlim=c(0.5, 12.5), ylim=c(0.5, 8.5)) +
    scale_y_reverse(breaks=seq(1, 8), labels=LETTERS[1:8]) +
    scale_x_continuous(breaks=seq(1, 12)) +
    labs(title="Plate Layout for My Experiment") +
    theme_bdc_microtiter()

colors

Wrap-Up

And that’s all it takes! You can now save the plot using ggsave, print it, add it to some slides, or anything else. In the future, I’ll describe similar visualizations that allow exploration of the annotated data set, which contains the plate map information along with the actual data.
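For example, saving the finished map as a PDF sized roughly like a plate takes a single call (the file name and dimensions here are just an example):

ggsave("platemap.pdf", width = 8, height = 6, units = "in")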

Many thanks to Sarah Hammarlund for her comments on a draft of this post!


Analyzing Microbial Growth with R
Brian Connelly | Wed, 09 Apr 2014 | http://bconnelly.net/2014/04/analyzing-microbial-growth-with-r/
Tags: analysis, csv, datapubs, dplyr, evolution, fitness, ggplot2, growth, howto, r, reshape2

In experimental evolution research, few things are more important than growth. Both the rate of growth and the resulting yield can provide direct insights into a strain or species’ fitness. Whether one strain with a trait of interest can outgrow (and outcompete) another that possesses a variation of that trait often depends primarily on the fitnesses of the two strains.

Zach Blount and his mountain of Petri dishes (Photo: Brian Baer)

Because of its importance both for painting the big picture and for properly designing experiments, I spend a very large portion of my time studying the growth of different candidate bacterial strains in different environments. This usually means counting colonies that grow on mountains of Petri dishes or by using a spectrophotometer to measure the absorbance of light as it passes through populations growing in clear microtiter plates. In science, replication is key, so once my eyes have glazed from counting colonies, or once the plate reader dings from across the lab to tell me that it’s done, my job becomes assembling all the data from the replicate populations of each strain and in each environment to try to figure out what’s going on. Do the growth rates show the earth-shattering result that I’m hoping for, or will I have to tweak my experimental design and go through it all again? The latter is almost always the case.

Because analyzing growth is both so fundamental to what I do, and because it is something I repeat ad nauseam, having a solid and easy-to-tweak pipeline for analyzing growth data is a must. Repeating the same analyses over and over again is not only unpleasant, but it also eats up a lot of time.

Here, I’m going to describe my workflow for analyzing growth data. It has evolved quite a bit over the past few years, and I’m sure it will continue to do so. As a concrete example, I’ll use some data from a kinetic read in a spectrophotometer, which means I have information about growth for each well in a 96-well microtiter plate measured periodically over time. I have chosen a more data dense form of analysis to highlight how easy it can be to analyze these data. However, I use the same pipeline for colony counts, single reads in the spec, and even for data from simulations. The following is an overview of the process:

  • Import the raw data, aggregating from multiple sources if necessary
  • Reformat the data to make it “tidy”: each record corresponds to one observation
  • Annotate the data, adding information about experimental variables
  • Group the replicate data
  • Calculate statistics (e.g., mean and variation) for each group
  • Plot the data and statistics

This workflow uses R, which is entirely a matter of preference. Each of these steps can be done in other environments using similar tools. I’ve previously written about grouping data and calculating group summaries in Summarizing Data in Python with Pandas, which covers the most important stages of the pipeline.

If you’d like to work along, I’ve included links for sample data in the section where they’re used. The entire workflow is also listed at the end, so you can easily copy and paste it into your own scripts.

For all of this, you’ll need an up-to-date version of R or RStudio. You’ll also need the dplyr, reshape2, and ggplot2 packages, which can be installed by running the following:

install.packages(c('reshape2', 'dplyr', 'ggplot2'))

The Initial State of the Data

To get started, I’ll put on my TV chef apron and pull some pre-cooked data out of the oven. This unfortunate situation occurs because all of the different software that I’ve used for reading absorbance measurements export data in slightly different formats. Even different versions of the same software can export differently.

So we’ll start with this CSV file from one of my own experiments. Following along with my data will hopefully be informative, but it is no substitute for using your own. So if you’ve got it, I really hope you will use your own data instead. Working this way allows you to see how each of these steps transforms your data, and how the process all comes together. For importing, try playing with the formatting options to read.table. There’s also a rich set of command line tools that make creating and manipulating tabular data quick and easy if that’s more your style. No matter how you get there, I’d recommend saving the data as a CSV (see write.csv) as soon as you’ve dealt with the import so that you never again have to face that unpleasant step.
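As a rough sketch of what that one-time conversion might look like (the file name, separator, and number of lines to skip all depend on what your plate reader’s software exports):

# Import the raw export from the plate reader and keep a clean CSV copy
rawdata <- read.table("spec-export.txt", sep = "\t", header = TRUE, skip = 2)
write.csv(rawdata, "data/growth-raw.csv", row.names = FALSE)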

rawdata <- read.csv("data/growth-raw.csv")

Each row of the example file contains the Time (in seconds), Temperature (in degrees Celsius), and absorbance readings at 600 nm for the 96 wells in a microtiter plate over a 24-hour period. These 96 values each have their own column in the row. To see the layout of this data frame, run summary(rawdata). Because of its large size, I’m not including the output of that command here.

Even microtiter plates can be mountainous, as Peter Conlin's bench shows.

No matter if your software exports as text, XML, or something else, this basic layout is very common. Unfortunately, it’s also very difficult to work with, because there’s no easy way to add more information about what each well represents. In order to be aware of which well corresponds to which treatment, you’ll most likely have to keep referring either to your memory, your lab notes, or something else to remind yourself how the plate was laid out. Not only is this very inconvenient for analysis—your scripts will consist of statements like treatment_1_avg <- (B4 + E6 + H1) / 3, which are incomprehensible in just about all contexts besides perhaps Battleship—but it also almost guarantees a miserable experience when looking back on your data after even just a few days. In the next step, we’ll be re-arranging the data and adding more information about the experiment itself. Not only will this make the analysis much easier, but it’ll also help sharing the data with others or your future self.

Tidying the Data

As we saw before, our original data set contains one row for each point in time, where each row has the absorbance value for each of our 96 wells. We’re now going to follow the principles of Tidy Data and rearrange the data so that each row contains the value for one read of one well. As you will soon see, this means that each point in time will correspond to 96 rows of data.

To do this rearranging, we’re going to be using the melt function from reshape2. With melt, you specify which columns are identity variables, and which columns are measured variables. Identity variables contain information about the measurement, such as what was measured (e.g., which strain or environment), how (e.g., light absorbance at 600 nm), when, and so on. These are kind of like the 5 Ws of Journalism for your experiment. Measured variables contain the actual values that were observed.

library(reshape2)

reshaped <- melt(rawdata, id=c("Time", "Temperature"), variable.name="Well",
                 value.name="OD600")

In the example data, our identity variables are Time and Temperature, while our measured variable is absorbance at 600 nm, which we’ll call OD600. Each of these will be represented as a column in the output. The output, which we’re storing in a data frame named reshaped, will also contain a Well column that contains the well from which data were collected. The Well value for each record will correspond to the name of the column that the data came from in the original data set.

Now that our data are less “wide”, we can take a peek at its structure and its first few records:

summary(reshaped)
 
       Time        Temperature        Well            OD600       
  Min.   :    0   Min.   :28.2   A1     :  4421   Min.   :0.0722  
  1st Qu.:20080   1st Qu.:37.0   A2     :  4421   1st Qu.:0.0810  
  Median :42180   Median :37.0   A3     :  4421   Median :0.0970  
  Mean   :42226   Mean   :37.0   A4     :  4421   Mean   :0.3970  
  3rd Qu.:64280   3rd Qu.:37.0   A5     :  4421   3rd Qu.:0.6343  
  Max.   :86380   Max.   :37.1   A6     :  4421   Max.   :1.6013  
                                 (Other):397890
head(reshaped)
 
   Time Temperature Well  OD600
 1    0        28.2   A1 0.0777
 2   20        28.9   A1 0.0778
 3   40        29.3   A1 0.0779
 4   60        29.8   A1 0.0780
 5   80        30.2   A1 0.0779
 6  100        30.6   A1 0.0780

There’s a good chance that this format will make you a little bit uncomfortable. How are you supposed to do things like see what the average readings across wells B4, E6, and H1 are? Remember, we decided that doing it that way—although perhaps seemingly logical at the time—was not the best way to go because of the pain and suffering that it will cause your future self and anyone else who has to look at your data. What’s so special about B4, E6, and H1 anyway? You may know the answer to this now, but will you in 6 months? 6 days?

Annotating the Data

Based solely on the example data set, you would have no way of knowing that it includes information about three bacterial strains (A, B, and C) grown in three different environments (1, 2, and 3). Now we’re going to take advantage of our newly-rearranged data by annotating it with this information about the experiment.

One of the most important pieces of this pipeline is a plate map, which I create when designing any experiments that use microtiter plates (see my templates here and here). These plate maps describe the experimental variables tested (e.g., strain and environment) and their values in each of the wells. I keep the plate map at my bench and use it to make sure I don’t completely forget what I’m doing while inoculating the wells.

A plate map for our example data

For the analysis, we’ll be using a CSV version of the plate map pictured. This file specifies where the different values of the experimental variables occur on the plate. Its columns describe the wells and each of the experimental variables, and each row contains a well and the values of the experimental variables for that well.

In this sample plate map file, each row contains a well along with the letter of the strain that was in that well and the number of the environment in which it grew. If you look closely at this plate map, you’ll notice that I had four replicate populations for each treatment. In some of the wells, the strain is NA. These are control wells that just contained growth medium. Don’t worry about these; we’ll filter them out later on.

platemap <- read.csv("data/platemap.csv")
head(platemap, n=10)
 
    Well Strain Environment
 1    B2      A           1
 2    B3      B           1
 3    B4      C           1
 4    B5   <NA>           1
 5    B6      A           2
 6    B7      B           2
 7    B8      C           2
 8    B9   <NA>           2
 9   B10      A           3
 10  B11      B           3

We can combine the information in this plate map with the reshaped data by pairing the data by their Well value. In other words, for each row of the reshaped data, we’ll find the row in the plate map that has the same Well. The result will be a data frame in which each row contains the absorbance of a well at a given point in time as well as information about what was actually in that well.

To combine the data with the plate map, we’ll use the inner_join function from dplyr, indicating that Well is the common column. Inner join is a term from databases that means to find the intersection of two data sets.

library(dplyr)

# Combine the reshaped data with the plate map, pairing them by Well value
annotated <- inner_join(reshaped, platemap, by="Well")

# Take a peek at the first few records in annotated
head(annotated)
 
   Time Temperature Well  OD600 Strain Environment
 1    0        28.2   B2 0.6100      A           1
 2   20        28.9   B2 0.5603      A           1
 3   40        29.3   B2 0.1858      A           1
 4   60        29.8   B2 0.1733      A           1
 5   80        30.2   B2 0.1713      A           1
 6  100        30.6   B2 0.1714      A           1

This produces a new table named annotated that contains the combination of our absorbance data with the information from the plate map. The inner join will also drop data for all of the wells in our data set that do not have a corresponding entry in the plate map. So if you don’t use a row or two in the microtiter plate, just don’t include those rows in the plate map (there’s nothing to describe anyway). Since the inner join takes care of matching the well data with its information, an added benefit of the plate map approach is that it makes data from experiments with randomized well locations much more easy to analyze (unfortunately, it doesn’t help with the pipetting portion of those experiments).

Let’s pause right here and save this new annotated data set. Because it contains all information related to the experiment—both the measured variables and the complete set of identity variables—it’s now in an ideal format for analyzing and for sharing.

# Write the annotated data set to a CSV file
write.csv(annotated, "data-annotated.csv")

Grouping the Data

Now that the data set is annotated, we can arrange it into groups based on the different experimental variables. With the example data set, it makes sense to collect the four replicate populations of each treatment at each time point. Using this grouping, we can begin to compare the growth of the different strains in the different environments over time and make observations such as “Strain A grows faster than Strain B in Environment 1, and slower than Strain B in Environment 2.” In other words, we’re ready to start learning what the data have to tell us about our experiment.

For this and the following step, we’re once again going to be using the dplyr package, which contains some really powerful (and fast) functions that allow you to easily filter, group, rearrange, and summarize your data. We’ll group the data by Strain, then by Environment, and then by Time, and store the grouping in grouped. As shown in the Venn diagram, this means that we’ll first separate the data based on the strain. Then we’ll separate the data within each of those piles by the environment. Finally, within these smaller collections, we’ll group the data by time.

grouped <- group_by(annotated, Strain, Environment, Time)

Grouping the data by Strain, Environment, and Time

What this means is that grouped contains all of the growth measurements for Strain A in Environment 1 at each point in Time, then all of the measurements for Strain A in Environment 2 at each point in Time, and so on. We’ll use this grouping in the next step to calculate some statistics about the measurements. For example, we’ll be able to calculate the average absorbance among the four replicates of Strain A in Environment 1 over time and for each of the other treatments.

Calculating Statistics for Each Group

Now that we have our data partitioned into logical groups based on the different experimental variables, we can calculate summary statistics about each of those groups. For this, we’ll use dplyr’s summarise function, which allows you to execute one or more functions on any of the columns in the grouped data set. For example, to count the number of measurements, the average absorbance (from the OD600 column), and the standard deviation of absorbance values:

stats <- summarise(grouped, N=length(OD600), Average=mean(OD600), StDev=sd(OD600))

The resulting stats data set contains a row for each of the different groups. Each row contains the Strain, Environment, and Time that define that group as well as our sample size, average, and standard deviation, which are named N, Average, and StDev, respectively. With summarise, you can apply any function to the group’s data that returns a single value, so we could easily replace the standard deviation with 95% confidence intervals:

# Create a function that calculates 95% confidence intervals for the given
# data vector using a t-distribution
conf_int95 <- function(data) {
    n <- length(data)
    error <- qt(0.975, df=n-1) * sd(data)/sqrt(n)
    return(error)
}

# Create summary for each group containing sample size, average OD600, and
# 95% confidence limits
stats <- summarise(grouped, N=length(OD600), Average=mean(OD600),
                   CI95=conf_int95(OD600))

Combining Grouping, Summarizing, and More

One of the neat things that dplyr provides is the ability to chain multiple operations together using magrittr’s %>% operator. This allows us to combine the grouping and summarizing from the last two steps (and filtering, sorting, etc.) into one line:

stats <- annotated %>%
          group_by(Environment, Strain, Time) %>%
          summarise(N=length(OD600), 
                    Average=mean(OD600),
                    CI95=conf_int95(OD600)) %>%
          filter(!is.na(Strain))

Note that I’ve put the input data set, annotated, at the beginning of the chain of commands and that group_by and summarise no longer receive an input data source. Instead, the data flows from annotated, through group_by and summarise, and finally through filter, just like a pipe. The added filter removes data from the control wells, which had no strain.

Plotting the Results

Now that we have all of our data nicely annotated and summarized, a great way to start exploring it is through plots. For the sample data, we’d like to know how each strain grows in each of the environments tested. Using the ggplot2 package, we can quickly plot the average absorbance over time:

ggplot(data=stats, aes(x=Time/3600, y=Average, color=Strain)) + 
       geom_line() + 
       labs(x="Time (Hours)", y="Absorbance at 600 nm")

single plot

The obvious problem with this plot is that although we can differentiate among the three strains, we can’t see the effect that environment has. This can be fixed easily, but before we do that, let’s quickly dissect what we did.

We’re using the ggplot function to create a plot. As arguments, we say that the data to plot will be coming from the stats data frame. aes allows us to define the aesthetics of our plot, which are basically what ggplot uses to determine various visual aspects of the plot. In this case, the x values will be coming from our Time column, which we divide by 3600 to convert seconds into hours. The corresponding y values will come from the Average column. Finally, we will color things such as lines, points, etc. based on the Strain column.

The ggplot function sets up a plot, but doesn’t actually draw anything until we tell it what to draw. This is part of the philosophy behind ggplot: graphics are built by adding layers of different graphic elements. These elements (and other options) are added using the + operator. In our example, we add a line plot using geom_line. We could instead make a scatter plot with geom_point, but because our data are so dense, the result isn’t quite as nice. We also label the axes using labs.
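
For comparison, here is a minimal sketch of the scatter-plot version mentioned above; it simply swaps geom_line for geom_point:

# Same data drawn with points instead of lines; with measurements this
# dense, the line version above is easier to read
ggplot(data=stats, aes(x=Time/3600, y=Average, color=Strain)) + 
       geom_point() + 
       labs(x="Time (Hours)", y="Absorbance at 600 nm")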

Back to the problem of not being able to differentiate among the environments. While we could use a different line type for each environment (using the linetype aesthetic), a more elegant solution would be to create a trellis chart. In a trellis chart (also called small multiples by Edward Tufte), the data are split up and displayed as individual subplots. Because these subplots use the same scales, it is easy to make comparisons. We can use ggplot’s facet_grid to create subplots based on the environments:

ggplot(data=stats, aes(x=Time/3600, y=Average, color=Strain)) + 
       geom_line() + 
       facet_grid(Environment ~ .) +
       labs(x="Time (Hours)", y="Absorbance at 600 nm")
Trellis plot showing the growth of the strains over time for each environment

Let’s take it one step further and add shaded regions corresponding to the confidence intervals that we calculated. Since ggplot builds plots layer-by-layer, we’ll place the shaded regions below the lines by adding geom_ribbon before using geom_line. The ribbons will choose a fill color based on the Strain and not color the edges. Since growth is exponential, we’ll also plot our data using a log scale with scale_y_log10:

ggplot(data=stats, aes(x=Time/3600, y=Average, color=Strain)) +
       geom_ribbon(aes(ymin=Average-CI95, ymax=Average+CI95, fill=Strain),
                   color=NA, alpha=0.3) + 
       geom_line() +
       scale_y_log10() +
       facet_grid(Environment ~ .) +
       labs(x="Time (Hours)", y="Absorbance at 600 nm")
Our final plot showing the growth of each strain as mean plus 95% confidence intervals for each environment

In Conclusion

And that’s it! We can now clearly see the differences between strains as well as how the environment affects growth, which was the overall goal of the experiment. Whether or not these results match my hypothesis will be left as a mystery. Thanks to a few really powerful packages, all it took was a few lines of code to analyze and plot over 200,000 data points.

I’m planning to post a follow-up in the near future that builds upon what we’ve done here by using grofit to fit growth curves.

Complete Script

library(reshape2)
library(dplyr)
library(ggplot2)

# Read in the raw data and the platemap. You may need to first change your
# working directory with the setwd command.
rawdata <- read.csv("data/growth-raw.csv")
platemap <- read.csv("data/platemap.csv")

# Reshape the data. Instead of rows containing the Time, Temperature,
# and readings for each Well, rows will contain the Time, Temperature, a
# Well ID, and the reading at that Well.
reshaped <- melt(rawdata, id=c("Time", "Temperature"), variable.name="Well", 
                 value.name="OD600")

# Add information about the experiment from the plate map. For each Well
# defined in both the reshaped data and the platemap, each resulting row
# will contain the absorbance measurement as well as the additional columns
# and values from the platemap.
annotated <- inner_join(reshaped, platemap, by="Well")

# Save the annotated data as a CSV for storing, sharing, etc.
write.csv(annotated, "data-annotated.csv")

conf_int95 <- function(data) {
    n <- length(data)
    error <- qt(0.975, df=n-1) * sd(data)/sqrt(n)
    return(error)
}

# Group the data by the different experimental variables and calculate the
# sample size, average OD600, and 95% confidence limits around the mean
# among the replicates. Also remove all records where the Strain is NA.
stats <- annotated %>%
              group_by(Environment, Strain, Time) %>%
              summarise(N=length(OD600),
                        Average=mean(OD600),
                        CI95=conf_int95(OD600)) %>%
              filter(!is.na(Strain))

# Plot the average OD600 over time for each strain in each environment
ggplot(data=stats, aes(x=Time/3600, y=Average, color=Strain)) +
       geom_ribbon(aes(ymin=Average-CI95, ymax=Average+CI95, fill=Strain),
                   color=NA, alpha=0.3) + 
       geom_line() +
       scale_y_log10() +
       facet_grid(Environment ~ .) +
       labs(x="Time (Hours)", y="Absorbance at 600 nm")

Extending to Other Types of Data

I hope it’s also easy to see how this pipeline could be used in other situations. For example, to analyze colony counts or a single read from a plate reader, you could repeat the steps exactly as shown, but without Time as a variable. Otherwise, if there are more experimental variables, the only change needed would be to add a column to the plate map for each of them.
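
For example, here is a rough sketch of how the grouping and summarizing step might look for a single-read data set, assuming a hypothetical Count column in place of OD600 and no Time column:

# Hypothetical single-read version: group only by the experimental
# variables and summarize a Count column instead of OD600 over Time
stats <- annotated %>%
          group_by(Environment, Strain) %>%
          summarise(N=length(Count),
                    Average=mean(Count),
                    CI95=conf_int95(Count)) %>%
          filter(!is.na(Strain))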

Acknowledgments

I’d like to thank Carrie Glenney and Jared Moore for their comments on this post and for test driving the code. Many thanks are also in order for Hadley Wickham, who developed each of the outstanding packages used here (and many others).


]]>
http://bconnelly.net/2014/04/analyzing-microbial-growth-with-r/#comments
Creating Colorblind-Friendly Figures Brian Connelly Wed, 16 Oct 2013 09:31:00 -0700 http://bconnelly.net/2013/10/creating-colorblind-friendly-figures/ http://bconnelly.net/2013/10/creating-colorblind-friendly-figures/ colorggplot2howtoplottingvisualizationr Color is often used to display an extra dimension in plots of scientific data. Unfortunately, not everyone decodes color in exactly the same way. This is especially true for those with color vision deficiency, which affects up to 8 percent of the population in its two most common forms. As a result, it has been estimated that, in a group of three male reviewers, the odds of a given plot reaching at least one reviewer with some form of color vision deficiency are approximately 22%. Hopefully, when we are creating figures, this number alone is compelling enough to always keep these viewers in mind. The truth, however, is that your figures aren't only seen by reviewers: they are seen by a much wider group that includes readers of your paper, members of the audience when you present your work, viewers of your lab's website, and potentially many others. As your audience grows, your choices in color become more and more important for effectively communicating your work.

]]>
Color is often used to display an extra dimension in plots of scientific data. Unfortunately, not everyone decodes color in exactly the same way. This is especially true for those with color vision deficiency, which affects up to 8 percent of the population in its two most common forms. As a result, it has been estimated that, in a group of three male reviewers, the odds of a given plot reaching at least one reviewer with some form of color vision deficiency are approximately 22%. Hopefully, when we are creating figures, this number alone is compelling enough to always keep these viewers in mind. The truth, however, is that your figures aren't only seen by reviewers: they are seen by a much wider group that includes readers of your paper, members of the audience when you present your work, viewers of your lab's website, and potentially many others. As your audience grows, your choices in color become more and more important for effectively communicating your work.

Although there are many outstanding tools for creating beautiful plots, practically all of them have default color palettes that can present decoding challenges for individuals with color vision deficiencies. This is an introduction to creating plots and figures using color palettes that are more accessible. For the examples below, I use the excellent ggplot2 library for R. The same ideas and colors can easily be transferred to your particular tool of choice.

Using Color to Represent Categorical Data

When using color to encode categorical data, such as blood type, gender, or bacterial strain, it is important to choose a color palette that has as many easily differentiable colors as there are categories. The figure below shows one palette that can encode up to 8 values, and simulates how each of its colors is seen by someone with protanopia, deuteranopia, and tritanopia.

Colorblind-Friendly Palette

With ggplot2, the color palette for categorical data can be set using scale_color_manual (for points, lines, and outlines) and scale_fill_manual (for boxes, bars, and ribbons). The argument to either of these commands is a vector of colors, which can be defined by hex RGB triplet or by name. As an example, let’s take a look at the relationship between the weight and the corresponding price of diamonds in ggplot2’s included diamonds data set. We can use color to indicate the quality of the cut. Note that this data set is quite large, so this scatter plot might not be the most informative way to display these data.

ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
    geom_point()

Plot of weight, price, and cut using ggplot2's default color palette

scale_color_manual sets the color of the first category (chosen alphabetically in R unless an ordering is specified) using the first color given, the second category with the second color, and so on. Using the colors from the colorblind-safe palette shown above:

ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
    geom_point() +
    scale_color_manual(values=c("#000000", "#E69F00", "#56B4E9", "#009E73",
                                "#F0E442", "#0072B2", "#D55E00", "#CC79A7"))

Plot of diamond price as a function of weight using the colorblind-friendly palette

Otherwise, if you don’t want to have to remember the ordering of your categories, or if you want to apply specific colors to each category, you can manually define the color of each:

ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
    geom_point() +
    scale_color_manual(values=c("Fair"="#E69F00", "Good"="#56B4E9",
                                "Premium"="#009E73", "Ideal"="#F0E442",
                                "Very Good"="#0072B2"))

Plot of diamond price as a function of weight using the colorblind-friendly palette and assigning colors based on category.

Redundant Encodings

When describing a figure, it is a common tendency to refer to a specific color. Hopefully, you’re at least now convinced that not everyone sees color the same way, especially when using a standard red, green, blue color palette. It is also very common for figures to be printed in black and white, or for your printer to be low on magenta ink. To improve legibility when your figures aren’t reproduced exactly as created, consider using redundant encodings. As an example, we can use both shapes and colors to refer to categories:

ggplot(diamonds, aes(x=carat, y=price, color=cut, shape=cut)) +
    geom_point() +
    scale_color_manual(values=c("#000000", "#E69F00", "#56B4E9", "#009E73",
                                "#F0E442", "#0072B2", "#D55E00", "#CC79A7"))

Cut quality displayed using both color and point shape

The use of redundant encoding can also aid in figure captions, where referring to a category as “the blue squares” is helpful both for those with color vision deficiencies and for those with printer troubles (all of us?). However, if the data can be represented just as well with symbols as with colors, this raises a question that should always be asked: Are colors absolutely necessary?
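
If the answer is no, dropping color entirely is straightforward; here is a minimal sketch that maps cut quality to shape alone:

# Encode cut quality using point shape only, with no color at all
ggplot(diamonds, aes(x=carat, y=price, shape=cut)) +
    geom_point()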

Using ColorBrewer Palettes

No discussion on color palettes would be complete without mentioning Cynthia Brewer’s ColorBrewer 2, an excellent source for color palettes that includes both colorblind-safe and print-friendly palettes.

ColorBrewer

Many graphics packages allow you to easily make use of the ColorBrewer palettes. In ggplot2, this is done with the scale_color_brewer command.

ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
    geom_point() +
    scale_color_brewer(palette="Dark2")

Plot of diamond price as a function of weight using the ColorBrewer's Dark2 palette and assigning colors based on category.

Using Color to Represent Continuous Values

When using color to represent continuous values, special care should be taken to ensure not only that colors chosen are differentiable, but also that viewers interpret changes in value of a given magnitude similarly throughout the spectrum. The rainbow color map, which is the default in many graphics packages, does not do this well. Color palettes that use variations, not only in hue, but also in saturation and lightness, can produce more linear changes in perception.

Greyscale and rainbow color palettes

Of course, color gradients can introduce additional problems for viewers with color vision deficiencies when certain areas of the spectrum are included. For these viewers, colors that vary uniformly in lightness, which is how the greyscale palette is made, are most accessible. Again, always ask yourself if the use of color conveys information that could be encoded in another way.

ggplot2 includes a number of functions for making continuous color scales such as scale_color_gradient, scale_color_continuous, and scale_color_grey. To demonstrate, I’ll switch to the mtcars data set, which contains, among other things, fuel economy for 32 cars manufactured in 1973-1974.

# Example borrowed from the geom_tile documentation
ggplot(mtcars, aes(y=factor(cyl), x=mpg)) +
    stat_density(aes(fill=..density..), geom="tile", position="identity")

Distribution of fuel economies as related to engine size among a sampling of cars

Fortunately, ggplot2 does a nice job of displaying continuous values with color by default. Otherwise, we can use the RColorBrewer package (loaded with library(RColorBrewer)) to fetch palettes from ColorBrewer (the “PuBuGn” palette in this case) and apply them using the scale_fill_gradientn command:

ggplot(mtcars, aes(y=factor(cyl), x=mpg)) +
    stat_density(aes(fill=..density..), geom="tile", position="identity") +
    scale_fill_gradientn(colours=brewer.pal(n=8, name="PuBuGn"))

Distribution of fuel economies
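
Alternatively, following the earlier advice about palettes that vary only in lightness, we could swap in a simple dark-to-light gradient. This is just a sketch; the grey endpoints are one possible choice:

# The same density plot with a fill gradient that varies only in lightness
ggplot(mtcars, aes(y=factor(cyl), x=mpg)) +
    stat_density(aes(fill=..density..), geom="tile", position="identity") +
    scale_fill_gradient(low="grey90", high="grey10")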

Further Reading

]]>
http://bconnelly.net/2013/10/creating-colorblind-friendly-figures/#comments
Working with CSVs on the Command Line Brian Connelly Mon, 23 Sep 2013 12:19:00 -0700 http://bconnelly.net/working-with-csvs-on-the-command-line/ http://bconnelly.net/working-with-csvs-on-the-command-line/ howtocsvsoftware Comma-separated values (CSV), and its close relatives (e.g., Tab-separated values) play a very important role in open access science. CSV is an informally-defined file format that stores tabular data (think spreadsheets) in plain text. Within the file, each row contains a record, and each field in that record is separated by a comma, tab, or some other character. This format offers several significant advantages. Because they are plain text, these files can be easily read and edited without the need for specialized or proprietary software. CSVs are also version-independent, so ten years down the road you won’t have to track down some ancient piece of software in order to revisit your data (or do the same for someone else’s data). Support for CSV files is built into most data analysis software, programming languages, and online services (see Some Useful Resources at the end of this article for links for your software of choice).

]]>
Comma-separated values (CSV), and its close relatives (e.g., Tab-separated values) play a very important role in open access science. CSV is an informally-defined file format that stores tabular data (think spreadsheets) in plain text. Within the file, each row contains a record, and each field in that record is separated by a comma, tab, or some other character. This format offers several significant advantages. Because they are plain text, these files can be easily read and edited without the need for specialized or proprietary software. CSVs are also version-independent, so ten years down the road you won’t have to track down some ancient piece of software in order to revisit your data (or do the same for someone else’s data). Support for CSV files is built into most data analysis software, programming languages, and online services (see Some Useful Resources at the end of this article for links for your software of choice).

To many, the command line (or shell) is a strange and scary place. Those familiar with working in the command line, however, wouldn’t trade its power, speed, and flexibility for anything, no matter how shiny or “user friendly” an alternative drag-and-drop interface might be. This power, speed, and flexibility extends to working with CSV files, and it’s here that I’d like to demonstrate a small slice of what can be done with just a few keystrokes. Even though the Unix command line has existed for decades, it’s still always just a few clicks away, whether you use Linux on some HPC hardware or cloud service, or whether you use Mac OS X on your laptop.

You don’t have to be a command line wizard to use the commands below, but you do need to know the basics of how to get to a command line (hint for OS X users, it’s through the Terminal app) and how to do things like changing directories (cd) and listing files (ls). If these topics are new to you, there’s a pretty good introduction linked below in Some Useful Resources. I’ve also included a brief description of the command line tools used in this cookbook below in A Quick Overview of the Commands Used in this Cookbook. In the next section, I’ll quickly introduce two of the command line’s most powerful features: pipes and output redirection. If you’re already familiar with these, you can jump right into the good stuff.

This guide is intended to be used like a cookbook. Although you can read it from start to finish (it’s not that long), each “recipe” is a self-contained guide for doing a particular task with a CSV file.

Although we’re focusing on comma-separated values, all of the commands below would work for data separated by tabs or any other separator. Just replace the comma character in the commands with your delimiter of choice. Most software that handles CSVs isn’t picky, and allows you to specify the delimiter used.
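
In R, for instance, the separator is just an argument to read.csv; here is a quick sketch using a hypothetical semicolon-delimited file:

# Read a semicolon-delimited file by overriding the default separator
mydata <- read.csv("input-semicolon.csv", sep=";")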

A Quick Introduction to Pipes and Redirection

Most of the commands available on the command line were developed with the same goals: to do a single job and to do it well. The real power of the command line comes from stringing these programs together to perform more complex tasks. By knowing just a handful of commands, just about anything is possible.

Writing Output to a File with Output Redirection

By default, most command line tools give their results by printing them to the screen. This might be ok when doing some calculations, but it’s not so useful if you’re transforming files (like CSVs) and intend to save them, share them, or use the results in some other software.

With output redirection, you can tell the command line to write the output of one or more commands to a file instead of printing it to the screen. To do this, just add the > character and the name of the file where you want the output to be stored to the end of the command. If the output file doesn’t already exist, > will create that file. If that file exists, > will replace its contents with the new output. If you’d rather add the output of your commands to the end of an existing file, you can use >> instead.

For example, the following command (which will be described in more detail in the first recipe) will print all of the lines of the file input.csv that do not begin with a # character:

grep -v "^#" input.csv

Instead, to save those results to a file named output.csv, we could use output redirection and would run:

grep -v "^#" input.csv > output.csv

If output.csv already existed, its contents would be replaced. If we instead wanted to add the output of the current command to output.csv, we’d do:

grep -v "^#" input.csv >> output.csv

Connecting Programs with Pipes

Pipes, denoted with the | character, allow the output of one program to be sent as input to another program. This saves you from having to first save the output of the first program to a file, and then start the second program using that file as input. You can imagine that a workflow with many steps could generate a lot of intermediate files and require a lot of concentration to see things through from start to finish.

Using our previous example once again:

grep -v "^#" input.csv

This prints all of the lines of input.csv that do not begin with the # character. The grep program searches for patterns within files (see A Quick Overview of the Commands Used in this Cookbook). Here, we’re using it in reverse and finding where a pattern does not exist. If we wanted to count the number of non-commented lines (i.e., those that do not begin with #) in input.csv, we could use grep to strip out the comment lines and use a pipe to send the results to the wc program, which can count the number of lines in its input when used with the -l argument:

grep -v "^#" input.csv | wc -l

Pipes are used extensively in the recipes that follow. If you’re still not quite comfortable with how they work, try executing the commands in the recipes one by one to see what the output of each part is. Then try to visualize how they all come together.

Cleaning Up CSV Files

Removing Comments

Comments are additional information stored in the file that are not a part of the actual data set, but that either describe the data set or provide relevant information (metadata). Comments could include things like the name of the person who gathered the data, the date on which measurements were made, the equipment used, or notes about a particular record. Typically, comments occur on their own line and begin with a # sign as demonstrated below:

Temperature,Row,Column,Luminescence
# Luminescence of evolved strain
# Collected by Harvey I. Vibrio - 2012/06/27

26.3,0,0,7444.945
26.3,0,1,4845.375
26.3,0,2,4056.362
# Look at this luminescence!!!
26.3,0,3,4883.137
26.3,0,4,3593.289
26.3,0,5,2645.281
26.3,1,2,10507.588

Despite their common use, comments are not supported by all software and can occasionally cause problems when importing data. Fortunately, comments can quickly be stripped from a file using the grep command. The following command strips out all lines that begin with the # character from a file named input.csv:

grep -v "^#" input.csv

This command will print all the lines of input.csv that do not begin with #. The ^ character means the beginning of a line. To save this output as a file, we can redirect the output of grep to a new file called input-nocomments.csv:

grep -v "^#" input.csv > input-nocomments.csv

Now, input-nocomments.csv can be imported into programs that do not support comments.

Removing Headers

Headers are commonly used to name each column. They appear as a comma-separated list of column names at the top of the file. Sometimes, it is beneficial to remove headers from a CSV file when merging multiple files (see Combining Rows from Two or More CSVs below) or when the software you’re using to import the data is finicky. For example, R expects the header to be on the first line of the file, not preceded by comments. Assuming that the header is on the first line of input.csv, we can remove it with the cat and sed commands:

cat input.csv | sed "1 d"

This uses the Unix pipe (|) symbol to connect the output of the cat program, which just prints the contents of the file, to the sed program, which we use to delete the first line. The 1 in the parameters to sed can be changed to delete a different line, or a range such as "1,3 d" to remove the first three lines. As before, the results of this command are printed. To save these results as a new file named noheader.csv, the output can be redirected:

cat input.csv | sed "1 d" > noheader.csv

If our input file has comments before the header line, this command won’t quite have the effect that we want, because sed blindly removes the first line of the file. Through the power and beauty of the command line, however, we can combine our comment stripping ability with our line chopping ability:

cat input.csv | grep -v "^#" | sed "1 d" > onlydata.csv

This command uses cat to send the contents of the file to grep, which only prints out lines that do not begin with #. The output goes to sed, which removes the first line of the non-commented data. Finally, the results of this string of operations are placed in a new file named onlydata.csv.

Changing Delimiters

You can easily change the delimiter of a file using the tr command. To convert a tab-separated values file named input.tsv to a CSV file named input.csv:

cat input.tsv | tr "\t" "," > input.csv

The two arguments given in this command indicate the character to change, "\t" (a tab), and the character to change it to, "," (a comma). Similarly, we can convert a CSV file to a TSV file by reversing the arguments:

cat input.csv | tr "," "\t" > input.tsv

Getting Information about the Data Set

Before working with a data set, you’d sometimes like to get a sense of how big it is and the types of values that it stores. This section provides a few ways of doing this.

Taking a Peek at the Data Set

In the previous recipe, we removed comments and the header from a CSV file. We did so because we knew beforehand that they existed in the data set. If we’re not familiar with the file, we can usually determine whether or not there are headers or comments by looking at the first few lines of the file. Instead of opening the file, which could be quite large, with a text editor, we can quickly use the head command:

head input.csv

This command will display the first 10 lines of input.csv. To see more or fewer lines, we can specify the number using the -n argument:

head -n 7 input.csv

This should give us an idea whether or not there’s a header in the file and if there are comments within the first few lines. If we want to search the whole file for comments, we can use grep:

grep "#" input.csv

This prints each line in input.csv that contains the # character. If we instead want to know the number of comments that occur in the file, we can send the output to the wc command, which can count the number of lines in the output of grep when used with the -l argument:

grep "#" input.csv | wc -l

If the output from this command is 0, then we know that input.csv contains no comments.

Determining the Number of Rows

The wc command lists the number of lines in a file when the -l argument is given.

wc -l input.csv

It does so indiscriminately, so its count will include headers and comments. Using previous recipes, though, we can strip these things from the file before counting the number of lines.

cat input.csv | grep -v "^#" | sed "1 d" | wc -l

This command prints the input.csv file and sends it to grep, which strips out lines beginning with #. The output of this is sent to sed, which strips out the header on the first line (assuming we have one; see the previous recipe). The result of all this is then sent to wc, which counts and prints the number of lines, each of which is presumably a data record.

Determining the Number of Columns

Similarly, we can determine the number of columns in the data set by looking at the non-comment lines:

cat input.csv | grep -v "^#" | awk "{print NF}" FS=,

This command uses grep to strip out commented rows, and sends the remaining rows to awk, which reports the number of comma-separated fields (the FS argument) in each line. We can send these results to the uniq command to only output the unique numbers of columns present in the file:

cat input.csv | grep -v "^#" | awk "{print NF}" FS=, | uniq

Typically, all rows in a CSV file contain the same number of fields. In that case, the command will return a single number. If more than one number is returned, there is some variation in the number of fields per line in the file, which could cause other problems.

Combining CSVs

The following recipes deal with combining data from two or more CSV files.

Combining Rows from Two or More CSVs

If some of our records are stored in one file, say input1.csv, and other records are stored in one or more others, we can combine these files (vertically) into one large data file called combined.csv using cat:

cat input1.csv input2.csv input3.csv > combined.csv

It is important to note that each of the files should contain the same number of columns and have those columns occurring in the same order. Recipes for extracting specific columns in order to rearrange them are included later.

If each of the input files has a header, each of those header lines will be copied into combined.csv. The easiest way to deal with this is to strip the header line from each file as we combine them:

cat input1.csv > combined.csv
cat input2.csv | sed "1 d" >> combined.csv
cat input3.csv | sed "1 d" >> combined.csv

You may have noticed that we copied input1.csv as-is into combined.csv. By doing this, we’re using the header from the first file as the header for our new combined file. The remaining lines use sed to filter out the first line before adding the contents of input2.csv and input3.csv to combined.csv (note the use of >>, which appends the additional data to the output file). The file that results from these commands will contain the header and data from input1.csv and just the data from input2.csv and input3.csv.

Combining Columns from Multiple CSVs

If the fields of your data are spread across multiple files, these can be combined (horizontally) using paste:

paste -d , input1.csv input2.csv > combined.csv

Here, the -d , argument says that the contents from input1.csv and input2.csv will be separated (delimited) by a comma. By default, paste will separate columns using a tab.

It is important to note here that doing this assumes that the order of the data in each input file is the same and that the record(s) in row 8 of input1.csv correspond to the record(s) in row 8 of input2.csv.

If you’re interested in extracting specific columns from your input file(s), see the recipe for Extracting Columns.

Extracting Data Subsets

Extracting Specific Row Numbers from CSVs

Data entries in CSV files can be extracted either by specific line numbers or by matching some criteria. The former is more straightforward and can be done in a number of ways depending on which lines you’re interested in.

If we’re interested in the first few rows of data, we can use the head command. For example, to extract the first 13 lines from input.csv and save the results to first13.csv, we could execute:

head -n 13 input.csv > first13.csv

To make sure we’re extracting only data and not headers and comments, we could first strip these following the Removing Headers recipe:

cat input.csv | grep -v "^#" | sed "1 d" | head -n 13 > first13.csv

Similarly, if we’re interested in extracting the last 8 rows from the input file, we could use the tail command:

tail -n 8 input.csv > last8.csv

As with the previous example, we may want to first strip out any comments. Because we’re taking rows from the end, we also won’t get the header. One easy way to keep it would be to first extract the header from the input file using head, and then append the last 8 rows:

head -n 1 input.csv > last8.csv
tail -n 8 input.csv >> last8.csv

As we’ve seen before, the > creates a new file (or overwrites an existing one) named last8.csv, and writes the output of head to it. The >> then appends the results of tail to this file. If you’d like to do these two steps on one line (that still contains two commands), they can be joined with a semicolon (;):

head -n 1 input.csv > last8.csv ; tail -n 8 input.csv >> last8.csv

If for some reason you know the specific line numbers of interest, you can carve out those lines using sed. For example, to extract lines 10 through 14:

sed -n "10,14 p" input.csv

Extracting Rows Based on Some Value

The other case is when we’re interested in selecting rows by a certain value. For example, we may have a phone book stored as a CSV, where each row contains the name of a person or business and their number. If we wanted to create a CSV that contains all of the places (or people!) with Pizza in their name, we could use grep to find all rows in the file containing that word and save the results to a new phone book of pizza places named pizza.csv:

grep "pizza" input.csv > pizza.csv

As with the previous example, pizza.csv will not contain a header, so we can first copy it over using head:

head -n 1 input.csv > pizza.csv
grep "pizza" input.csv >> pizza.csv

Note that pizza.csv will also contain any comments that include the word pizza, so you may want to first strip comments from your input. Also note that grep is case sensitive, so it will match records containing pizza but not ones containing Pizza. We can make grep case insensitive by adding the -i argument:

grep -i "pizza" input.csv >> pizza.csv

If you’re interested in extracting rows based on criteria other than matching some specified text, such as rows containing some value greater than a number of interest, I’d recommend moving away from command line tools and creating a script in something like Python or R.
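
As a rough sketch of what that might look like in R (the column name and threshold here are hypothetical):

# Read the CSV, keep only the rows where the Value column exceeds 100,
# and write the result to a new file without row numbers
mydata <- read.csv("input.csv")
filtered <- subset(mydata, Value > 100)
write.csv(filtered, "filtered.csv", row.names=FALSE)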

Extracting Columns

Specific columns can also be easily extracted from CSVs. For example, if we wanted to extract columns 2, 4, 5, and 6 from input.csv:

cut -d , -f 2,4-6 input.csv

Here, the -d , tells cut that columns are separated by commas, and -f 2,4-6 tells it to extract column 2 and columns 4-6. The -f argument can take a single column number or a comma-separated list of numbers and ranges.

Working with Compressed CSV Files

If our data set gets very large, we can compress the CSV file using a file compression tool such as bzip2 or gzip to shrink the file size (sometimes dramatically). With slight modification, any of the commands shown in this cookbook can be used with compressed CSV files. Additionally, tools such as Python and R have bzip2 and gzip support, so they can read and write these compressed CSV files as well.
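
For example, here is a quick sketch of reading a compressed CSV directly in R:

# read.csv can read from a compressed connection; gzfile() works the same
# way for gzip-compressed files
mydata <- read.csv(bzfile("input.csv.bz2"))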

Compressing CSV Files with bzip2

bzip2 is an open source compression tool available on most Linux and OS X machines. To compress your input file, run:

bzip2 input.csv

This will produce a file named input.csv.bz2. If you should ever need to uncompress this file, use the --decompress or -d arguments:

bzip2 -d input.csv.bz2

Working with bzip2-Compressed CSV Files

If you look at input.csv.bz2, it will look like an unintelligible chunk of characters. This is exactly how the file will look to any of the other utilities we’ve used in this cookbook. These utilities expect to work with plain text, so they usually can’t handle compressed data. Fortunately, we can easily give these tools the uncompressed data without actually uncompressing the file. To do so, we’ll use the bzcat command, which prints out the uncompressed contents of the file, and then use pipes to send that output to the right tools. For example, if we want to extract the third column of a compressed CSV file (see Extracting Columns) and save the results to col3.csv, we could run:

bzcat input.csv.bz2 | cut -d , -f 3 > col3.csv

Since we’re all about dealing with compressed CSVs, we could also save the results to a bzipped file after passing the results through the bzip2 command:

bzcat input.csv.bz2 | cut -d , -f 3 | bzip2 > col3.csv.bz2

Compressing CSV Files with gzip

Some older machines will not have bzip2 installed. These machines likely have another open source compression tool called gzip installed. Similarly to bzip2, files can be compressed by passing them to the gzip command:

gzip input.csv

which produces input.csv.gz. To uncompress this file, just use the gunzip command:

gunzip input.csv.gz

Working with gzip-Compressed CSV Files

Working with gzip-compressed CSV files is the same as described in Working with bzip2-Compressed CSV Files with the exception of using the gzcat command instead of bzcat. If you cannot find gzcat on the machine, try zcat.

A Quick Overview of the Commands Used in this Cookbook

This section provides a quick description of each of the commands used and a link to their man page (manual).

  • awk: AWK is a programming language (and command) that is great for processing text (man page)
  • cat: concatenates and prints files (man page)
  • cut: removes specific bytes, characters, or fields from files (man page)
  • grep: finds lines matching (or not matching) a given pattern (man page)
  • head: outputs the first part of a file (man page)
  • sed: performs simple manipulations of text in a file (man page)
  • tail: outputs the end of a file (man page)
  • tr: changes (translates) characters within a file (man page)
  • wc: counts the number of characters, words, and lines in a file (man page)

Some Useful Resources

CSV Format

Software Support for CSV

Introduction to Unix (also Linux, OS X, etc.) and The Command Line


]]>
http://bconnelly.net/working-with-csvs-on-the-command-line/#comments