In experimental evolution research, few things are more important than growth. Both the rate of growth and the resulting yield can provide direct insights into a strain or species’ fitness. Whether one strain with a trait of interest can outgrow (and outcompete) another that possesses a variation of that trait often depends primarily on the fitnesses of the two strains.

Zach Blount and his mountain of Petri dishes (Photo: Brian Baer)

Because of its importance both for painting the big picture and for properly designing experiments, I spend a very large portion of my time studying the growth of different candidate bacterial strains in different environments. This usually means counting colonies that grow on mountains of Petri dishes or by using a spectrophotometer to measure the absorbance of light as it passes through populations growing in clear microtiter plates. In science, replication is key, so once my eyes have glazed from counting colonies, or once the plate reader dings from across the lab to tell me that it’s done, my job becomes assembling all the data from the replicate populations of each strain and in each environment to try to figure out what’s going on. Do the growth rates show the earth-shattering result that I’m hoping for, or will I have to tweak my experimental design and go through it all again? The latter is almost always the case.

Because analyzing growth is both so fundamental to what I do, and because it is something I repeat ad nauseam, having a solid and easy-to-tweak pipeline for analyzing growth data is a must. Repeating the same analyses over and over again is not only unpleasant, but it also eats up a lot of time.

Here, I’m going to describe my workflow for analyzing growth data. It has evolved quite a bit over the past few years, and I’m sure it will continue to do so. As a concrete example, I’ll use some data from a kinetic read in a spectrophotometer, which means I have information about growth for each well in a 96-well microtiter plate measured periodically over time. I have chosen a more data dense form of analysis to highlight how easy it can be to analyze these data. However, I use the same pipeline for colony counts, single reads in the spec, and even for data from simulations. The following is an overview of the process:

• Import the raw data, aggregating from multiple sources if necessary
• Reformat the data to make it “tidy”: each record corresponds to one observation
• Group the replicate data
• Calculate statistics (e.g., mean and variation) for each group
• Plot the data and statistics

This workflow uses R, which is entirely a matter of preference. Each of these steps can be done in other environments using similar tools. I’ve previously written about grouping data and calculating group summaries in Summarizing Data in Python with Pandas, which covers the most important stages of the pipeline.

If you’d like to work along, I’ve included links for sample data in the section where they’re used. The entire workflow is also listed at the end, so you can easily copy and paste it into your own scripts.

For all of this, you’ll need an up-to-date version of R or RStudio. You’ll also need the dplyr, reshape2, and ggplot2 packages, which can be installed by running the following:

install.packages(c('reshape2', 'dplyr', 'ggplot2'))


## The Initial State of the Data

To get started, I’ll put on my TV chef apron and pull some pre-cooked data out of the oven. This unfortunate situation occurs because all of the different software that I’ve used for reading absorbance measurements export data in slightly different formats. Even different versions of the same software can export differently.

So we’ll start with this CSV file from one of my own experiments. Following along with my data will hopefully be informative, but it is no substitute for using your own. So if you’ve got it, I really hope you will use your own data instead. Working this way allows you to see how each of these steps transforms your data, and how the process all comes together. For importing, try playing with the formatting options to read.table. There’s also a rich set of command line tools that make creating and manipulating tabular data quick and easy if that’s more your style. No matter how you get there, I’d recommend saving the data as a CSV (see write.csv) as soon as you’ve dealt with the import so that you never again have to face that unpleasant step.

rawdata <- read.csv("data/growth-raw.csv")


Each row of the example file contains the Time (in seconds), Temperature (in degrees Celsius), and absorbance readings at 600 nm for the 96 wells in a microtiter plate over a 24-hour period. These 96 values each have their own column in the row. To see the layout of this data frame, run summary(rawdata). Because of its large size, I’m not including the output of that command here.

Even microtiter plates can be mountainous, as Peter Conlin's bench shows.

No matter if your software exports as text, XML, or something else, this basic layout is very common. Unfortunately, it’s also very difficult to work with, because there’s no easy way to add more information about what each well represents. In order to be aware of which well corresponds to which treatment, you’ll most likely have to keep referring either to your memory, your lab notes, or something else to remind yourself how the plate was laid out. Not only is this very inconvenient for analysis—your scripts will consist of statements like treatment_1_avg <- (B4 + E6 + H1) / 3, which are incomprehensible in just about all contexts besides perhaps Battleship—but it also almost guarantees a miserable experience when looking back on your data after even just a few days. In the next step, we’ll be re-arranging the data and adding more information about the experiment itself. Not only will this make the analysis much easier, but it’ll also help sharing the data with others or your future self.

## Tidying the Data

As we saw before, our original data set contains one row for each point in time, where each row has the absorbance value for each of our 96 wells. We’re now going to follow the principles of Tidy Data and rearrange the data so that each row contains the value for one read of one well. As you will soon see, this means that each point in time will correspond to 96 rows of data.

To do this rearranging, we’re going to be using the melt function from reshape2. With melt, you specify which columns are identity variables, and which columns are measured variables. Identity variables contain information about the measurement, such as what was measured (e.g., which strain or environment), how (e.g., light absorbance at 600 nm), when, and so on. These are kind of like the 5 Ws of Journalism for your experiment. Measured variables contain the actual values that were observed.

library(reshape2)

reshaped <- melt(rawdata, id=c("Time", "Temperature"), variable.name="Well",
value.name="OD600")


In the example data, our identity variables are Time and Temperature, while our measured variable is absorbance at 600 nm, which we’ll call OD600. Each of these will be represented as a column in the output. The output, which we’re storing in a data frame named reshaped, will also contain a Well column that contains the well from which data were collected. The Well value for each record will correspond to the name of the column that the data came from in the original data set.

Now that our data are less “wide”, we can take a peek at its structure and its first few records:

summary(reshaped)


Time        Temperature        Well            OD600
Min.   :    0   Min.   :28.2   A1     :  4421   Min.   :0.0722
1st Qu.:20080   1st Qu.:37.0   A2     :  4421   1st Qu.:0.0810
Median :42180   Median :37.0   A3     :  4421   Median :0.0970
Mean   :42226   Mean   :37.0   A4     :  4421   Mean   :0.3970
3rd Qu.:64280   3rd Qu.:37.0   A5     :  4421   3rd Qu.:0.6343
Max.   :86380   Max.   :37.1   A6     :  4421   Max.   :1.6013
(Other):397890

head(reshaped)


Time Temperature Well  OD600
1    0        28.2   A1 0.0777
2   20        28.9   A1 0.0778
3   40        29.3   A1 0.0779
4   60        29.8   A1 0.0780
5   80        30.2   A1 0.0779
6  100        30.6   A1 0.0780


There’s a good chance that this format will make you a little bit uncomfortable. How are you supposed to do things like see what the average readings across wells B4, E6, and H1 are? Remember, we decided that doing it that way—although perhaps seemingly logical at the time—was not the best way to go because of the pain and suffering that it will cause your future self and anyone else who has to look at your data. What’s so special about B4, E6, and H1 anyway? You may know the answer to this now, but will you in 6 months? 6 days?

## Annotating the Data

Based solely on the example data set, you would have no way of knowing that it includes information about three bacterial strains (A, B, and C) grown in three different environments (1, 2, and 3). Now we’re going to take advantage of our newly-rearranged data by annotating it with this information about the experiment.

One of the most important pieces of this pipeline is a plate map, which I create when designing any experiments that use microtiter plates (see my templates here and here. These plate maps describe the experimental variables tested (e.g., strain and environment) and what their values are in each of the wells. I keep the plate map at my bench and use it to make sure I don’t completely forget what I’m doing while inoculating the wells.

For the analysis, we’ll be using a CSV version of the plate map pictured. This file specifies where the different values of the experimental variables occur on the plate. Its columns describe the wells and each of the experimental variables, and each row contains a well and the values of the experimental variables for that well.

In this sample plate map file, each row contains a well along with the letter of the strain that was in that well and the number of the environment in which it grew. If you look closely this plate map, you’ll notice that I had four replicate populations for each treatment. In some of the wells, the strain is NA. These are control wells that just contained growth medium. Don’t worry about these, we’ll filter them out later on.

platemap <- read.csv("data/platemap.csv")


Well Strain Environment
1    B2      A           1
2    B3      B           1
3    B4      C           1
4    B5   <NA>           1
5    B6      A           2
6    B7      B           2
7    B8      C           2
8    B9   <NA>           2
9   B10      A           3
10  B11      B           3


We can combine the information in this plate map with the reshaped data by pairing the data by their Well value. In other words, for each row of the reshaped data, we’ll find the row in the plate map that has the same Well. The result will be a data frame in which each row contains the absorbance of a well at a given point in time as well as information about what was actually in that well.

To combine the data with the plate map, we’ll use the inner_join function from dplyr, indicating that Well is the common column. Inner join is a term from databases that means to find the intersection of two data sets.

library(dplyr)

# Combine the reshaped data with the plate map, pairing them by Well value
annotated <- inner_join(reshaped, platemap, by="Well")

# Take a peek at the first few records in annotated


Time Temperature Well  OD600 Strain Environment
1    0        28.2   B2 0.6100      A           1
2   20        28.9   B2 0.5603      A           1
3   40        29.3   B2 0.1858      A           1
4   60        29.8   B2 0.1733      A           1
5   80        30.2   B2 0.1713      A           1
6  100        30.6   B2 0.1714      A           1


This produces a new table named annotated that contains the combination of our absorbance data with the information from the plate map. The inner join will also drop data for all of the wells in our data set that do not have a corresponding entry in the plate map. So if you don’t use a row or two in the microtiter plate, just don’t include those rows in the plate map (there’s nothing to describe anyway). Since the inner join takes care of matching the well data with its information, an added benefit of the plate map approach is that it makes data from experiments with randomized well locations much more easy to analyze (unfortunately, it doesn’t help with the pipetting portion of those experiments).

Let’s pause right here and save this new annotated data set. Because it contains all information related to the experiment—both the measured variables and the complete set of identity variables—it’s now in an ideal format for analyzing and for sharing.

# Write the annotated data set to a CSV file
write.csv(annotated, "data-annotated.csv")


## Grouping the Data

Now that the data set is annotated, we can arrange it into groups based on the different experimental variables. With the example data set, it makes sense to collect the four replicate populations of each treatment at each time point. Using this grouping, we can begin to compare the growth of the different strains in the different environments over time and make observations such as “Strain A grows faster than Strain B in Environment 1, and slower than Strain B in Environment 2. In other words, we’re ready to start learning what the data have to tell us about our experiment.

For this and the following step, we’re once again going to be using the dplyr package, which contains some really powerful (and fast) functions that allow you to easily filter, group, rearrange, and summarize your data. We’ll group the data by Strain, then by Environment, and then by Time, and store the grouping in grouped. As shown in the Venn diagram, this means that we’ll first separate the data based on the strain. Then we’ll separate the data within each of those piles by the environment. Finally, within these smaller collections, we’ll group the data by time.

grouped <- group_by(annotated, Strain, Environment, Time)

Grouping the data by Strain, Environment, and Time

What this means is that grouped contains all of the growth measurements for Strain A in Environment 1 at each point in Time, then all of the measurements for Strain A in Environment 2 at each point in Time, and so on. We’ll use this grouping in the next step to calculate some statistics about the measurements. For example, we’ll be able to calculate the average absorbance among the four replicates of Strain A</span> in Environment 1 over time and for each of the other treatments.

## Calculating Statistics for Each Group

Now that we have our data partitioned into logical groups based on the different experimental variables, we can calculate summary statistics about each of those groups. For this, we’ll use dplyr’s summarise function, which allows you to execute one or more functions on any of the columns in the grouped data set. For example, to count the number of measurements, the average absorbance (from the OD600 column), and the standard deviation of absorbance values:

stats <- summarise(grouped, N=length(OD600), Average=mean(OD600), StDev=sd(OD600))


The resulting stats data set contains a row for each of the different groups. Each row contains the Strain, Environment, and Time that define that group as well as our sample size, average, and standard deviation, which are named N, Average, and StDev, respectively. With summarise, you can use apply any function to the group’s data that returns a single value, so we could easily replace the standard deviation with 95% confidence intervals:

# Create a function that calculates 95% confidence intervals for the given
# data vector using a t-distribution
conf_int95 <- function(data) {
n <- length(data)
error <- qt(0.975, df=n-1) * sd(data)/sqrt(n)
return(error)
}

# Create summary for each group containing sample size, average OD600, and
# 95% confidence limits
stats <- summarise(grouped, N=length(OD600), Average=mean(OD600),
CI95=conf_int95(OD600))


### Combining Grouping, Summarizing, and More

One of the neat things that dplyr provides is the ability to chain multiple operations together using magrittr’s %>% operator. This allows us to combine the grouping and summarizing from the last two steps (and filtering, sorting, etc.) into one line:

stats <- annotated %>%
group_by(Environment, Strain, Time) %>%
summarise(N=length(OD600),
Average=mean(OD600),
CI95=conf_int95(OD600)) %>%
filter(!is.na(Strain))


Note that I’ve put the input data set, annotated, at the beginning of the chain of commands and that group_by and summarise no longer receive an input data source. Instead, the data flows from annotated, through summarise, and finally through filter just like a pipe. The added filter removes data from the control wells, which had no strain.

## Plotting the Results

Now that we have all of our data nicely annotated and summarized, a great way to start exploring it is through plots. For the sample data, we’d like to know how each strain grows in each of the environments tested. Using the ggplot2 package, we can quickly plot the average absorbance over time:

ggplot(data=stats, aes(x=Time/3600, y=Average, color=Strain)) +
geom_line() +
labs(x="Time (Hours)", y="Absorbance at 600 nm")


The obvious problem with this plot is that although we can differentiate among the three strains, we can’t see the effect that environment has. This can be fixed easily, but before we do that, let’s quickly dissect what we did.

We’re using the ggplot function to create a plot. As arguments, we say that the data to plot will be coming from the stats data frame. aes allows us to define the aesthetics of our plot, which are basically what ggplot uses to determine various visual aspects of the plot. In this case, the x values will be coming from our Time column, which we divide by 3600 to convert seconds into hours. The corresponding y values will come from the Average column. Finally, we will color things such as lines, points, etc. based on the Strain column.

The ggplot function sets up a plot, but doesn’t actually draw anything until we tell it what to draw. This is part of the philosophy behind ggplot: graphics are built by adding layers of different graphic elements. These elements (and other options) are added using the + operator. In our example, we add a line plot using geom_line. We could instead make a scatter plot with geom_point, but because our data are so dense, the result isn’t quite as nice. We also label the axes using labs.

Back to the problem of not being able to differentiate among the environments. While we could use a different line type for each environment (using the linetype aesthetic), a more elegant solution would be to create a trellis chart. In a trellis chart (also called small multiples by Edward Tufte), the data are split up and displayed as individual subplots. Because these subplots use the same scales, it is easy to make comparisons. We can use ggplot’s facet_grid to create subplots based on the environments:

ggplot(data=stats, aes(x=Time/3600, y=Average, color=Strain)) +
geom_line() +
facet_grid(Environment ~ .) +
labs(x="Time (Hours)", y="Absorbance at 600 nm")

Trellis plot showing the growth of the strains over time for each environment

Let’s take it one step further and add shaded regions corresponding to the confidence intervals that we calculated. Since ggplot builds plots layer-by-layer, we’ll place the shaded regions below the lines by adding geom_ribbon before using geom_line. The ribbons will choose a fill color based on the Strain and not color the edges. Since growth is exponential, we’ll also plot our data using a log scale with scale_y_log10:

ggplot(data=stats, aes(x=Time/3600, y=Average, color=Strain)) +
geom_ribbon(aes(ymin=Average-CI95, ymax=Average+CI95, fill=Strain),
color=NA, alpha=0.3) +
geom_line() +
scale_y_log10() +
facet_grid(Environment ~ .) +
labs(x="Time (Hours)", y="Absorbance at 600 nm")

Our final plot showing the growth of each strain as mean plus 95% confidence intervals for each environment

## In Conclusion

And that’s it! We can now clearly see the differences between strains as well as how the environment affects growth, which was the overall goal of the experiment. Whether or not these results match my hypothesis will be left as a mystery. Thanks to a few really powerful packages, all it took was a few lines of code to analyze and plot over 200,000 data points.

I’m planning to post a follow-up in the near future that builds upon what we’ve done here by using grofit to fit growth curves.

## Complete Script

library(reshape2)
library(dplyr)
library(ggplot2)

# Read in the raw data and the platemap. You may need to first change your
# working directory with the setwd command.

# Reshape the data. Instead of rows containing the Time, Temperature,
# and readings for each Well, rows will contain the Time, Temperature, a
# Well ID, and the reading at that Well.
reshaped <- melt(rawdata, id=c("Time", "Temperature"), variable.name="Well",
value.name="OD600")

# Add information about the experiment from the plate map. For each Well
# defined in both the reshaped data and the platemap, each resulting row
# will contain the absorbance measurement as well as the additional columns
# and values from the platemap.
annotated <- inner_join(reshaped, platemap, by="Well")

# Save the annotated data as a CSV for storing, sharing, etc.
write.csv(annotated, "data-annotated.csv")

conf_int95 <- function(data) {
n <- length(data)
error <- qt(0.975, df=n-1) * sd(data)/sqrt(n)
return(error)
}

# Group the data by the different experimental variables and calculate the
# sample size, average OD600, and 95% confidence limits around the mean
# among the replicates. Also remove all records where the Strain is NA.
stats <- annotated %>%
group_by(Environment, Strain, Time) %>%
summarise(N=length(OD600),
Average=mean(OD600),
CI95=conf_int95(OD600)) %>%
filter(!is.na(Strain))

# Plot the average OD600 over time for each strain in each environment
ggplot(data=stats, aes(x=Time/3600, y=Average, color=Strain)) +
geom_ribbon(aes(ymin=Average-CI95, ymax=Average+CI95, fill=Strain),
color=NA, alpha=0.3) +
geom_line() +
scale_y_log10() +
facet_grid(Environment ~ .) +
labs(x="Time (Hours)", y="Absorbance at 600 nm")


## Extending to Other Types of Data

I hope it’s also easy to see how this pipeline could be used in other situations. For example, to analyze colony counts or a single read from a plate reader, you could repeat the steps exactly as shown, but without Time as a variable. Otherwise, if there are more experimental variables, the only change needed would be to add a column to the plate map for each of them.

## Acknowledgments

I’d like to thank Carrie Glenney and Jared Moore for their comments on this post and for test driving the code. Many thanks are also in order for Hadley Wickham, who developed each of the outstanding packages used here (and many others).