Brian Connelly: Articles tagged 'csv' (http://bconnelly.net)

Plotting Microtiter Plate Maps
Brian Connelly
Thu, 01 May 2014
http://bconnelly.net/2014/05/plotting-microtiter-plate-maps/

I recently wrote about my workflow for Analyzing Microbial Growth with R. Perhaps the most important part of that process is the plate map, which describes the different experimental variables and where they occur. In the example case, the plate map described which strain was growing and in which environment for each of the wells used in a 96-well microtiter plate. Until recently, I’ve always created two plate maps. The first is hand-drawn with pens and markers and sits on the bench with me when I start an experiment. By marking the wells with different colors, line types, and whatever other hieroglyphics I decide on, I can keep track of where everything is and how to inoculate the wells.

[Figure: a hand-drawn plate map]

The second is a CSV file that contains a row for each well that I use and columns describing the values of each of my experimental variables. This file contains all of the information that I had on my hand-drawn plate map, but in a format that I can later merge with my result data to produce a fully-annotated data set. The fully-annotated data set is the perfect format for plotting with tools like ggplot2 or for sharing with others.

 
    Well Strain Environment
 1    B2      A           1
 2    B3      B           1
 3    B4      C           1
 4    B5   <NA>           1
 5    B6      A           2
 6    B7      B           2
 7    B8      C           2
 8    B9   <NA>           2
 9   B10      A           3
 10  B11      B           3
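
In raw form, that plate map is nothing more than a header row followed by one line per well. Based on the preview above, the top of the file would look something like this (exact handling of the empty control wells may differ; this is just a sketch):

Well,Strain,Environment
B2,A,1
B3,B,1
B4,C,1
B5,NA,1
B6,A,2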

But when talking with Carrie Glenney, whom I’ve been convincing of the awesomeness of the CSV/dplyr/ggplot workflow, I realized that there’s really no need to have two separate plate maps. Since all the information is in the CSV plate map, why bother drawing one out on paper? This post describes how I’ve started using ggplot2 to create a nice plate map image that I can print and take with me to the bench or paste in my lab notebook.

Reading in the Plate Map

First, load your plate map file into R. You may need to change your working directory with setwd beforehand or give read.csv the full path of the plate map file.

 
platemap <- read.csv("platemap.csv")

If you don’t yet have a plate map of your own, you can use this sample plate map.

Extracting Row and Column Numbers

In my plate maps, I refer to each well by its row-column pair, like “C6”. To make things easier to draw, we’re going to be splitting those well IDs into their row and column numbers. So for “C6”, we’ll get row 3 and column 6. This process is easy with dplyr’s mutate function. If you haven’t installed dplyr, you can get it by running install.packages('dplyr').

 
library(dplyr)

platemap <- mutate(platemap,
                   Row=as.numeric(match(toupper(substr(Well, 1, 1)), LETTERS)),
                   Column=as.numeric(substr(Well, 2, 5)))

Once this is done, the platemap data frame will now have two additional columns, Row and Column, which contain the row and column numbers associated with the well in the Well column, respectively.
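
As a quick sanity check, you can peek at the first few rows; the values shown as comments below assume the sample plate map above, where well B2 should map to row 2 and column 2:

head(platemap, n=3)
#   Well Strain Environment Row Column
# 1   B2      A           1   2      2
# 2   B3      B           1   2      3
# 3   B4      C           1   2      4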

Drawing the Plate

Microtiter plates are arranged in a grid, so it’s not a big leap to think about a plate as a plot containing the row values along the Y axis and the column values along the X axis. So let’s use ggplot2 to create a scatter plot of all of the wells in the plate map. We’ll also give it a title.

 
library(ggplot2)

ggplot(data=platemap, aes(x=Column, y=Row)) +
    geom_point(size=10) +
    labs(title="Plate Layout for My Experiment")

[Figure: a scatter plot of the wells in the plate map]

As you can see, this plot doesn’t tell us anything about our experiment other than the wells it uses and their location.

Showing Empty Wells

I often don’t use all 96 wells in my experiments. It is useful, however, to show all of them. This makes it obvious which wells are used and helps orient your eyes when shifting between the plate map and the plate. Because of this, we’ll create some white circles with a light grey border for all 96 wells below the points that we’ve already created. We’ll also change the aspect ratio of the plot so that it better matches the proportions of a 96-well plate.

 
ggplot(data=platemap, aes(x=Column, y=Row)) +
    geom_point(data=expand.grid(seq(1, 12), seq(1, 8)), aes(x=Var1, y=Var2),
               color="grey90", fill="white", shape=21, size=6) +
    geom_point(size=10) +
    coord_fixed(ratio=(13/12)/(9/8), xlim = c(0.5, 12.5), ylim=c(0.5, 8.5)) +
    labs(title="Plate Layout for My Experiment")

[Figure: the plate map with blank wells shown and the corrected aspect ratio]

Flipping the Axis

Now that we are showing all 96 wells, one thing becomes clear—the plot arranges the rows from 1 on the bottom to 8 at the top, which is the opposite of how microtiter plates are labeled. Fortunately, we can easily flip the Y axis. While we’re at it, we’ll also tell the Y axis to use letters instead of numbers and to draw these labels for each value. Similarly, we’ll label each column value along the X axis.

 
ggplot(data=platemap, aes(x=Column, y=Row)) +
    geom_point(data=expand.grid(seq(1, 12), seq(1, 8)), aes(x=Var1, y=Var2),
               color="grey90", fill="white", shape=21, size=6) +
    geom_point(size=10) +
    coord_fixed(ratio=(13/12)/(9/8), xlim=c(0.5, 12.5), ylim=c(0.5, 8.5)) +
    scale_y_reverse(breaks=seq(1, 8), labels=LETTERS[1:8]) +
    scale_x_continuous(breaks=seq(1, 12)) +
    labs(title="Plate Layout for My Experiment")

[Figure: the plate map with the Y axis flipped and both axes fully labeled]

For those who would like to mimic the look of a microtiter plate even more closely, I have some bad news. It’s not possible to place the X axis labels above the plot. Not without some complicated tricks, at least.

Removing Grids and other Plot Elements

Although the plot is starting to look a lot like a microtiter plate, there’s still some unnecessary “chart junk”, such as grids and tick marks along the axes. To create a more straightforward plate map, we can apply a theme that will strip these elements out. My theme for doing this (theme_bdc_microtiter) is available as part of the ggplot2bdc package. Follow that link for installation instructions. Once installed, we can now apply the theme:

 
library(ggplot2bdc)

ggplot(data=platemap, aes(x=Column, y=Row)) +
    geom_point(data=expand.grid(seq(1, 12), seq(1, 8)), aes(x=Var1, y=Var2),
               color="grey90", fill="white", shape=21, size=6) +
    geom_point(size=10) +
    coord_fixed(ratio=(13/12)/(9/8), xlim=c(0.5, 12.5), ylim=c(0.5, 8.5)) +
    scale_y_reverse(breaks=seq(1, 8), labels=LETTERS[1:8]) +
    scale_x_continuous(breaks=seq(1, 12)) +
    labs(title="Plate Layout for My Experiment") +
    theme_bdc_microtiter()

[Figure: the plate map after applying theme_bdc_microtiter]

Highlighting Experimental Variables

Now that our plot is nicely formatted, it’s time to get back to the main point of all of this—displaying the values of the different experimental variables.

You’ll first need to think about how to best encode each of these values. For this, ggplot provides a number of aesthetics, such as color, shape, size, and opacity. There are no one-size-fits-all rules for this. If you’re interested in this topic, Jacques Bertin’s classic Semiology of Graphics has some great information, and Jeff Heer and Mike Bostock’s Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design is very interesting. After a little experimentation you should be able to figure out which encodings best represent your data.

You’ll also need to consider the data types of the experimental variables, because it’s not possible to map a shape or some other discrete property to continuous values.

Here, we’ll show the different environments using shapes, and the different strains using color. When R imported the plate map, it interpreted the Environment variable as continuous (not a crazy assumption, since it has values 1, 2, and 3). We’ll first transform it into a categorical variable (factor in R speak) so that we can map it to a shape. We’ll then specify our encodings in the aes mapping given to geom_point.

 
platemap$Environment <- as.factor(platemap$Environment)

ggplot(data=platemap, aes(x=Column, y=Row)) +
    geom_point(data=expand.grid(seq(1, 12), seq(1, 8)), aes(x=Var1, y=Var2),
               color="grey90", fill="white", shape=21, size=6) +
    geom_point(aes(shape=Environment, colour=Strain), size=10) +
    coord_fixed(ratio=(13/12)/(9/8), xlim=c(0.5, 12.5), ylim=c(0.5, 8.5)) +
    scale_y_reverse(breaks=seq(1, 8), labels=LETTERS[1:8]) +
    scale_x_continuous(breaks=seq(1, 12)) +
    labs(title="Plate Layout for My Experiment") +
    theme_bdc_microtiter()

[Figure: the plate map with environments encoded by shape and strains by color]

Changing Colors, Shapes, Etc.

By default, ggplot uses its own ordering of shapes and colors. If you’d prefer to use a different set, either because they make the data easier to interpret (see Sharon Lin and Jeffrey Heer’s fascinating The Right Colors Make Data Easier To Read) or for some other reason, we can adjust them. I’ll change the colors used to blue, red, and black, which I normally associate with these strains. Although these colors aren’t quite as aesthetically pleasing as ggplot2’s defaults, I use them because they are the colors of markers I have at my bench.

 
ggplot(data=platemap, aes(x=Column, y=Row)) +
    geom_point(data=expand.grid(seq(1, 12), seq(1, 8)), aes(x=Var1, y=Var2),
               color="grey90", fill="white", shape=21, size=6) +
    geom_point(aes(shape=Environment, colour=Strain), size=10) +
    scale_color_manual(values=c("A"="blue", "B"="red", "C"="black")) +
    coord_fixed(ratio=(13/12)/(9/8), xlim=c(0.5, 12.5), ylim=c(0.5, 8.5)) +
    scale_y_reverse(breaks=seq(1, 8), labels=LETTERS[1:8]) +
    scale_x_continuous(breaks=seq(1, 12)) +
    labs(title="Plate Layout for My Experiment") +
    theme_bdc_microtiter()

[Figure: the final plate map with custom colors]

Wrap-Up

And that’s all it takes! You can now save the plot using ggsave, print it, add it to some slides, or anything else. In the future, I’ll describe similar visualizations that allow exploration of the annotated data set, which contains the plate map information along with the actual data.
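
For example, saving the final figure is a one-liner with ggsave; the file name and dimensions below are arbitrary choices, so adjust them to taste:

# Save the most recently drawn plot (swap .pdf for .png, .svg, etc. as needed)
ggsave("platemap.pdf", width=10, height=7, units="in")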

Many thanks to Sarah Hammarlund for her comments on a draft of this post!


Analyzing Microbial Growth with R
Brian Connelly
Wed, 09 Apr 2014
http://bconnelly.net/2014/04/analyzing-microbial-growth-with-r/

In experimental evolution research, few things are more important than growth. Both the rate of growth and the resulting yield can provide direct insights into a strain or species’ fitness. Whether one strain with a trait of interest can outgrow (and outcompete) another that possesses a variation of that trait often depends primarily on the fitnesses of the two strains.

[Photo: Zach Blount and his mountain of Petri dishes (Brian Baer)]

Because of its importance both for painting the big picture and for properly designing experiments, I spend a very large portion of my time studying the growth of different candidate bacterial strains in different environments. This usually means counting colonies that grow on mountains of Petri dishes or using a spectrophotometer to measure the absorbance of light as it passes through populations growing in clear microtiter plates. In science, replication is key, so once my eyes have glazed over from counting colonies, or once the plate reader dings from across the lab to tell me that it’s done, my job becomes assembling all the data from the replicate populations of each strain in each environment to try to figure out what’s going on. Do the growth rates show the earth-shattering result that I’m hoping for, or will I have to tweak my experimental design and go through it all again? The latter is almost always the case.

Because analyzing growth is so fundamental to what I do, and because it is something I repeat ad nauseam, having a solid and easy-to-tweak pipeline for analyzing growth data is a must. Repeating the same analyses over and over again is not only unpleasant, but it also eats up a lot of time.

Here, I’m going to describe my workflow for analyzing growth data. It has evolved quite a bit over the past few years, and I’m sure it will continue to do so. As a concrete example, I’ll use some data from a kinetic read in a spectrophotometer, which means I have information about growth for each well in a 96-well microtiter plate measured periodically over time. I have chosen a more data-dense form of analysis to highlight how easy it can be to analyze these data. However, I use the same pipeline for colony counts, single reads in the spec, and even for data from simulations. The following is an overview of the process:

  • Import the raw data, aggregating from multiple sources if necessary
  • Reformat the data to make it “tidy”: each record corresponds to one observation
  • Annotate the data, adding information about experimental variables
  • Group the replicate data
  • Calculate statistics (e.g., mean and variation) for each group
  • Plot the data and statistics

This workflow uses R, which is entirely a matter of preference. Each of these steps can be done in other environments using similar tools. I’ve previously written about grouping data and calculating group summaries in Summarizing Data in Python with Pandas, which covers the most important stages of the pipeline.

If you’d like to work along, I’ve included links for sample data in the section where they’re used. The entire workflow is also listed at the end, so you can easily copy and paste it into your own scripts.

For all of this, you’ll need an up-to-date version of R or RStudio. You’ll also need the dplyr, reshape2, and ggplot2 packages, which can be installed by running the following:

install.packages(c('reshape2', 'dplyr', 'ggplot2'))

The Initial State of the Data

To get started, I’ll put on my TV chef apron and pull some pre-cooked data out of the oven. This unfortunate situation occurs because all of the different software that I’ve used for reading absorbance measurements exports data in slightly different formats. Even different versions of the same software can export differently.

So we’ll start with this CSV file from one of my own experiments. Following along with my data will hopefully be informative, but it is no substitute for working with your own, so if you’ve got data of your own, I really hope you’ll use it instead. Working this way allows you to see how each of these steps transforms your data and how the process comes together. For importing, try playing with the formatting options to read.table. There’s also a rich set of command line tools that make creating and manipulating tabular data quick and easy if that’s more your style. No matter how you get there, I’d recommend saving the data as a CSV (see write.csv) as soon as you’ve dealt with the import so that you never again have to face that unpleasant step.
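
If your reader’s export does need that kind of one-time cleanup, the gist is something like the following sketch. The separator, the number of lines to skip, and the file names are placeholders for whatever your instrument actually produces:

# One-time import of an awkward export, followed by saving a clean CSV
raw <- read.table("reader-export.txt", sep="\t", skip=2, header=TRUE,
                  stringsAsFactors=FALSE)
write.csv(raw, "data/growth-raw.csv", row.names=FALSE)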

rawdata <- read.csv("data/growth-raw.csv")

Each row of the example file contains the Time (in seconds), Temperature (in degrees Celsius), and absorbance readings at 600 nm for the 96 wells in a microtiter plate over a 24-hour period. These 96 values each have their own column in the row. To see the layout of this data frame, run summary(rawdata). Because of its large size, I’m not including the output of that command here.
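
If you’d like a quick peek without the full summary, looking at just the first few columns keeps the output manageable. This assumes the layout just described, with Time and Temperature in the first two columns followed by one column per well:

head(rawdata[, 1:5])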

[Photo: Even microtiter plates can be mountainous, as Peter Conlin's bench shows]

Whether your software exports text, XML, or something else, this basic layout is very common. Unfortunately, it’s also very difficult to work with, because there’s no easy way to add more information about what each well represents. In order to be aware of which well corresponds to which treatment, you’ll most likely have to keep referring to your memory, your lab notes, or something else to remind yourself how the plate was laid out. Not only is this very inconvenient for analysis—your scripts will consist of statements like treatment_1_avg <- (B4 + E6 + H1) / 3, which are incomprehensible in just about all contexts besides perhaps Battleship—but it also almost guarantees a miserable experience when looking back on your data after even just a few days. In the next step, we’ll be re-arranging the data and adding more information about the experiment itself. Not only will this make the analysis much easier, but it’ll also make the data easier to share with others or with your future self.

Tidying the Data

As we saw before, our original data set contains one row for each point in time, where each row has the absorbance value for each of our 96 wells. We’re now going to follow the principles of Tidy Data and rearrange the data so that each row contains the value for one read of one well. As you will soon see, this means that each point in time will correspond to 96 rows of data.

To do this rearranging, we’re going to be using the melt function from reshape2. With melt, you specify which columns are identity variables, and which columns are measured variables. Identity variables contain information about the measurement, such as what was measured (e.g., which strain or environment), how (e.g., light absorbance at 600 nm), when, and so on. These are kind of like the 5 Ws of Journalism for your experiment. Measured variables contain the actual values that were observed.

library(reshape2)

reshaped <- melt(rawdata, id=c("Time", "Temperature"), variable.name="Well",
                 value.name="OD600")

In the example data, our identity variables are Time and Temperature, while our measured variable is absorbance at 600 nm, which we’ll call OD600. Each of these will be represented as a column in the output. The output, which we’re storing in a data frame named reshaped, will also contain a Well column that contains the well from which data were collected. The Well value for each record will correspond to the name of the column that the data came from in the original data set.

Now that our data are less “wide”, we can take a peek at their structure and their first few records:

summary(reshaped)
 
       Time        Temperature        Well            OD600       
  Min.   :    0   Min.   :28.2   A1     :  4421   Min.   :0.0722  
  1st Qu.:20080   1st Qu.:37.0   A2     :  4421   1st Qu.:0.0810  
  Median :42180   Median :37.0   A3     :  4421   Median :0.0970  
  Mean   :42226   Mean   :37.0   A4     :  4421   Mean   :0.3970  
  3rd Qu.:64280   3rd Qu.:37.0   A5     :  4421   3rd Qu.:0.6343  
  Max.   :86380   Max.   :37.1   A6     :  4421   Max.   :1.6013  
                                 (Other):397890
head(reshaped)
 
   Time Temperature Well  OD600
 1    0        28.2   A1 0.0777
 2   20        28.9   A1 0.0778
 3   40        29.3   A1 0.0779
 4   60        29.8   A1 0.0780
 5   80        30.2   A1 0.0779
 6  100        30.6   A1 0.0780

There’s a good chance that this format will make you a little bit uncomfortable. How are you supposed to do things like see what the average readings across wells B4, E6, and H1 are? Remember, we decided that doing it that way—although perhaps seemingly logical at the time—was not the best way to go because of the pain and suffering that it will cause your future self and anyone else who has to look at your data. What’s so special about B4, E6, and H1 anyway? You may know the answer to this now, but will you in 6 months? 6 days?

Annotating the Data

Based solely on the example data set, you would have no way of knowing that it includes information about three bacterial strains (A, B, and C) grown in three different environments (1, 2, and 3). Now we’re going to take advantage of our newly-rearranged data by annotating it with this information about the experiment.

One of the most important pieces of this pipeline is a plate map, which I create when designing any experiments that use microtiter plates (see my templates here and here). These plate maps describe the experimental variables tested (e.g., strain and environment) and what their values are in each of the wells. I keep the plate map at my bench and use it to make sure I don’t completely forget what I’m doing while inoculating the wells.

[Figure: a plate map for our example data]

For the analysis, we’ll be using a CSV version of the plate map pictured. This file specifies where the different values of the experimental variables occur on the plate. Its columns describe the wells and each of the experimental variables, and each row contains a well and the values of the experimental variables for that well.

In this sample plate map file, each row contains a well along with the letter of the strain that was in that well and the number of the environment in which it grew. If you look closely at this plate map, you’ll notice that I had four replicate populations for each treatment. In some of the wells, the strain is NA. These are control wells that just contained growth medium. Don’t worry about these; we’ll filter them out later on.

platemap <- read.csv("data/platemap.csv")
head(platemap, n=10)
 
    Well Strain Environment
 1    B2      A           1
 2    B3      B           1
 3    B4      C           1
 4    B5   <NA>           1
 5    B6      A           2
 6    B7      B           2
 7    B8      C           2
 8    B9   <NA>           2
 9   B10      A           3
 10  B11      B           3

We can combine the information in this plate map with the reshaped data by pairing the data by their Well value. In other words, for each row of the reshaped data, we’ll find the row in the plate map that has the same Well. The result will be a data frame in which each row contains the absorbance of a well at a given point in time as well as information about what was actually in that well.

To combine the data with the plate map, we’ll use the inner_join function from dplyr, indicating that Well is the common column. Inner join is a term from databases that means to find the intersection of two data sets.

library(dplyr)

# Combine the reshaped data with the plate map, pairing them by Well value
annotated <- inner_join(reshaped, platemap, by="Well")

# Take a peek at the first few records in annotated
head(annotated)
 
   Time Temperature Well  OD600 Strain Environment
 1    0        28.2   B2 0.6100      A           1
 2   20        28.9   B2 0.5603      A           1
 3   40        29.3   B2 0.1858      A           1
 4   60        29.8   B2 0.1733      A           1
 5   80        30.2   B2 0.1713      A           1
 6  100        30.6   B2 0.1714      A           1

This produces a new table named annotated that contains the combination of our absorbance data with the information from the plate map. The inner join will also drop data for all of the wells in our data set that do not have a corresponding entry in the plate map. So if you don’t use a row or two in the microtiter plate, just don’t include those rows in the plate map (there’s nothing to describe anyway). Since the inner join takes care of matching the well data with its information, an added benefit of the plate map approach is that it makes data from experiments with randomized well locations much easier to analyze (unfortunately, it doesn’t help with the pipetting portion of those experiments).
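
One quick sanity check before moving on is to list any wells that appear in the readings but not in the plate map; these are the wells that the inner join silently drops. A minimal sketch using only base R:

# Wells present in the reshaped data but missing from the plate map
setdiff(reshaped$Well, platemap$Well)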

Let’s pause right here and save this new annotated data set. Because it contains all information related to the experiment—both the measured variables and the complete set of identity variables—it’s now in an ideal format for analyzing and for sharing.

# Write the annotated data set to a CSV file
write.csv(annotated, "data-annotated.csv")

Grouping the Data

Now that the data set is annotated, we can arrange it into groups based on the different experimental variables. With the example data set, it makes sense to collect the four replicate populations of each treatment at each time point. Using this grouping, we can begin to compare the growth of the different strains in the different environments over time and make observations such as “Strain A grows faster than Strain B in Environment 1, and slower than Strain B in Environment 2.” In other words, we’re ready to start learning what the data have to tell us about our experiment.

For this and the following step, we’re once again going to be using the dplyr package, which contains some really powerful (and fast) functions that allow you to easily filter, group, rearrange, and summarize your data. We’ll group the data by Strain, then by Environment, and then by Time, and store the grouping in grouped. As shown in the Venn diagram, this means that we’ll first separate the data based on the strain. Then we’ll separate the data within each of those piles by the environment. Finally, within these smaller collections, we’ll group the data by time.

grouped <- group_by(annotated, Strain, Environment, Time)

[Figure: Grouping the data by Strain, Environment, and Time]

What this means is that grouped contains all of the growth measurements for Strain A in Environment 1 at each point in Time, then all of the measurements for Strain A in Environment 2 at each point in Time, and so on. We’ll use this grouping in the next step to calculate some statistics about the measurements. For example, we’ll be able to calculate the average absorbance among the four replicates of Strain A in Environment 1 over time, and likewise for each of the other treatments.

Calculating Statistics for Each Group

Now that we have our data partitioned into logical groups based on the different experimental variables, we can calculate summary statistics about each of those groups. For this, we’ll use dplyr’s summarise function, which allows you to execute one or more functions on any of the columns in the grouped data set. For example, to count the number of measurements, the average absorbance (from the OD600 column), and the standard deviation of absorbance values:

stats <- summarise(grouped, N=length(OD600), Average=mean(OD600), StDev=sd(OD600))

The resulting stats data set contains a row for each of the different groups. Each row contains the Strain, Environment, and Time that define that group as well as our sample size, average, and standard deviation, which are named N, Average, and StDev, respectively. With summarise, you can apply any function that returns a single value to the group’s data, so we could easily replace the standard deviation with 95% confidence intervals:

# Create a function that calculates 95% confidence intervals for the given
# data vector using a t-distribution
conf_int95 <- function(data) {
    n <- length(data)
    error <- qt(0.975, df=n-1) * sd(data)/sqrt(n)
    return(error)
}

# Create summary for each group containing sample size, average OD600, and
# 95% confidence limits
stats <- summarise(grouped, N=length(OD600), Average=mean(OD600),
                   CI95=conf_int95(OD600))

Combining Grouping, Summarizing, and More

One of the neat things that dplyr provides is the ability to chain multiple operations together using magrittr’s %>% operator. This allows us to combine the grouping and summarizing from the last two steps (and filtering, sorting, etc.) into one line:

stats <- annotated %>%
          group_by(Environment, Strain, Time) %>%
          summarise(N=length(OD600), 
                    Average=mean(OD600),
                    CI95=conf_int95(OD600)) %>%
          filter(!is.na(Strain))

Note that I’ve put the input data set, annotated, at the beginning of the chain of commands and that group_by and summarise no longer receive an input data source. Instead, the data flow from annotated, through group_by and summarise, and finally through filter, just like water through a pipe. The added filter removes data from the control wells, which had no strain.

Plotting the Results

Now that we have all of our data nicely annotated and summarized, a great way to start exploring it is through plots. For the sample data, we’d like to know how each strain grows in each of the environments tested. Using the ggplot2 package, we can quickly plot the average absorbance over time:

ggplot(data=stats, aes(x=Time/3600, y=Average, color=Strain)) + 
       geom_line() + 
       labs(x="Time (Hours)", y="Absorbance at 600 nm")

[Figure: average absorbance over time, colored by strain]

The obvious problem with this plot is that although we can differentiate among the three strains, we can’t see the effect that environment has. This can be fixed easily, but before we do that, let’s quickly dissect what we did.

We’re using the ggplot function to create a plot. As arguments, we say that the data to plot will be coming from the stats data frame. aes allows us to define the aesthetics of our plot, which are basically what ggplot uses to determine various visual aspects of the plot. In this case, the x values will be coming from our Time column, which we divide by 3600 to convert seconds into hours. The corresponding y values will come from the Average column. Finally, we will color things such as lines, points, etc. based on the Strain column.

The ggplot function sets up a plot, but doesn’t actually draw anything until we tell it what to draw. This is part of the philosophy behind ggplot: graphics are built by adding layers of different graphic elements. These elements (and other options) are added using the + operator. In our example, we add a line plot using geom_line. We could instead make a scatter plot with geom_point, but because our data are so dense, the result isn’t quite as nice. We also label the axes using labs.
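
If you’d like to see the point-based version for yourself, it’s the same call with geom_point swapped in; the small size and transparency here are just guesses to keep the dense data readable:

ggplot(data=stats, aes(x=Time/3600, y=Average, color=Strain)) +
       geom_point(size=0.5, alpha=0.3) +
       labs(x="Time (Hours)", y="Absorbance at 600 nm")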

Back to the problem of not being able to differentiate among the environments. While we could use a different line type for each environment (using the linetype aesthetic), a more elegant solution would be to create a trellis chart. In a trellis chart (also called small multiples by Edward Tufte), the data are split up and displayed as individual subplots. Because these subplots use the same scales, it is easy to make comparisons. We can use ggplot’s facet_grid to create subplots based on the environments:

ggplot(data=stats, aes(x=Time/3600, y=Average, color=Strain)) + 
       geom_line() + 
       facet_grid(Environment ~ .) +
       labs(x="Time (Hours)", y="Absorbance at 600 nm")

[Figure: Trellis plot showing the growth of the strains over time for each environment]

Let’s take it one step further and add shaded regions corresponding to the confidence intervals that we calculated. Since ggplot builds plots layer-by-layer, we’ll place the shaded regions below the lines by adding geom_ribbon before using geom_line. The ribbons will choose a fill color based on the Strain and not color the edges. Since growth is exponential, we’ll also plot our data using a log scale with scale_y_log10:

ggplot(data=stats, aes(x=Time/3600, y=Average, color=Strain)) +
       geom_ribbon(aes(ymin=Average-CI95, ymax=Average+CI95, fill=Strain),
                   color=NA, alpha=0.3) + 
       geom_line() +
       scale_y_log10() +
       facet_grid(Environment ~ .) +
       labs(x="Time (Hours)", y="Absorbance at 600 nm")

[Figure: The final plot showing the growth of each strain as mean plus 95% confidence intervals for each environment]

In Conclusion

And that’s it! We can now clearly see the differences between strains as well as how the environment affects growth, which was the overall goal of the experiment. Whether or not these results match my hypothesis will be left as a mystery. Thanks to a few really powerful packages, all it took was a few lines of code to analyze and plot over 200,000 data points.

I’m planning to post a follow-up in the near future that builds upon what we’ve done here by using grofit to fit growth curves.

Complete Script

library(reshape2)
library(dplyr)
library(ggplot2)

# Read in the raw data and the platemap. You may need to first change your
# working directory with the setwd command.
rawdata <- read.csv("data/growth-raw.csv")
platemap <- read.csv("data/platemap.csv")

# Reshape the data. Instead of rows containing the Time, Temperature,
# and readings for each Well, rows will contain the Time, Temperature, a
# Well ID, and the reading at that Well.
reshaped <- melt(rawdata, id=c("Time", "Temperature"), variable.name="Well", 
                 value.name="OD600")

# Add information about the experiment from the plate map. For each Well
# defined in both the reshaped data and the platemap, each resulting row
# will contain the absorbance measurement as well as the additional columns
# and values from the platemap.
annotated <- inner_join(reshaped, platemap, by="Well")

# Save the annotated data as a CSV for storing, sharing, etc.
write.csv(annotated, "data-annotated.csv")

conf_int95 <- function(data) {
    n <- length(data)
    error <- qt(0.975, df=n-1) * sd(data)/sqrt(n)
    return(error)
}

# Group the data by the different experimental variables and calculate the
# sample size, average OD600, and 95% confidence limits around the mean
# among the replicates. Also remove all records where the Strain is NA.
stats <- annotated %>%
              group_by(Environment, Strain, Time) %>%
              summarise(N=length(OD600),
                        Average=mean(OD600),
                        CI95=conf_int95(OD600)) %>%
              filter(!is.na(Strain))

# Plot the average OD600 over time for each strain in each environment
ggplot(data=stats, aes(x=Time/3600, y=Average, color=Strain)) +
       geom_ribbon(aes(ymin=Average-CI95, ymax=Average+CI95, fill=Strain),
                   color=NA, alpha=0.3) + 
       geom_line() +
       scale_y_log10() +
       facet_grid(Environment ~ .) +
       labs(x="Time (Hours)", y="Absorbance at 600 nm")

Extending to Other Types of Data

I hope it’s also easy to see how this pipeline could be used in other situations. For example, to analyze colony counts or a single read from a plate reader, you could repeat the steps exactly as shown, but without Time as a variable. Otherwise, if there are more experimental variables, the only change needed would be to add a column to the plate map for each of them.
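
For instance, here is a minimal sketch of the grouping and summary step for a single endpoint read, assuming an annotated data set built exactly as above:

# Same pipeline as before, just without Time as a grouping variable
endpoint_stats <- annotated %>%
    group_by(Environment, Strain) %>%
    summarise(N=length(OD600),
              Average=mean(OD600),
              CI95=conf_int95(OD600)) %>%
    filter(!is.na(Strain))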

Acknowledgments

I’d like to thank Carrie Glenney and Jared Moore for their comments on this post and for test driving the code. Many thanks are also in order for Hadley Wickham, who developed each of the outstanding packages used here (and many others).


Summarizing Data in Python with Pandas
Brian Connelly
Tue, 22 Oct 2013
http://bconnelly.net/2013/10/summarizing-data-in-python-with-pandas/

Like many, I often divide my computational work between Python and R. For a while, I’ve primarily done analysis in R. And with the power of data frames and packages that operate on them like reshape, my data manipulation and aggregation has moved more and more into the R world as well. Perhaps my favorite tool of all has been plyr, which allows you to easily split up a data set into subsets based on some criteria, apply a function or set of functions to those pieces, and combine those results back together (a.k.a. “split-apply-combine”). For example, I often use this to split up a data set by treatment, calculate some summary stats for each treatment, and put these statistics back together for comparison. With R and these excellent packages, these steps are about as painless (I actually enjoy them, but that’s probably not normal) as it gets. Because of this, R has long been the choice for doing this kind of work.
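
For reference, that split-apply-combine step looks roughly like this in R with plyr. This is only a sketch; it assumes the same TradeoffData.csv file and RelativeFitness column that appear in the pandas example below:

library(plyr)

tradeoff <- read.csv("TradeoffData.csv")

# Split by Treatment, compute summary statistics, and combine the results
ddply(tradeoff, "Treatment", summarise,
      N=length(RelativeFitness),
      Mean=mean(RelativeFitness),
      SD=sd(RelativeFitness))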

About two years ago, I discovered pandas, a Python library for performing data analysis. The goal of pandas is to provide data structures and functions that make data analysis in Python just as easy as (if not easier than) in R. At the time, it was young and growing very quickly, so although I could see the huge potential, I wasn’t quite ready to make the switch. Still, the premise is extremely exciting for anyone who does at least some work in Python. Because of that, I’ve checked in on pandas now and then to see how it would handle jobs for which I’d instinctively reach for R. On a whim, I decided to try it out yesterday for a split-apply-combine job and was pleasantly surprised both by how easily the job could be done with pandas and by how quickly it produced the results.

The Data

As an example, let’s take a look at some data produced by several groups in Ben Kerr’s Experimental Evolutionary Ecology class. This data set contains the fitness of a flocculated strain of Escherichia coli relative to a non-flocculated strain when grown alone in either spatially-structured (dish) or spatially-unstructured (tube) environments.

import pandas

# Load the data into a DataFrame
data = pandas.read_csv('TradeoffData.csv')

data.head(n=10)
 
  Group Treatment  Replicate  RelativeFitness
0   BKB      Tube          1         0.869963
1   BKB      Tube          2         1.000363
2   BKB      Tube          3         0.982935
3   BAC      Tube          1         0.810392
4   BAC      Tube          2         0.795107
5   JDK      Tube          1         0.849204
6   JDK      Tube          2         0.917637
7   JDK      Tube          3         0.905323
8   ETH      Tube          1         0.930821
9   ETH      Tube          2         0.958183

Splitting the Data

Now that we have the data loaded, let’s divide it into groups. First, we could group the data by Treatment, which allows us to compare the Tube versus the Dish treatments:

bytreatment = data.groupby('Treatment')

This stores the grouping in a pandas DataFrameGroupBy object, which you will see if you try to print it.

Since RelativeFitness is the value we’re interested in with these data, let’s look at information about the distribution of RelativeFitness values within the groups. To get some basic statistics, we can use the describe() method:

bytreatment['RelativeFitness'].describe()
 
Treatment       
Dish       count    32.000000
           mean      1.456359
           std       0.184792
           min       0.955221
           25%       1.429005
           50%       1.510884
           75%       1.581340
           max       1.699276
Tube       count    32.000000
           mean      0.929589
           std       0.050153
           min       0.795107
           25%       0.915050
           50%       0.939089
           75%       0.953505
           max       1.000363
dtype: float64

Based on this, we can clearly see that the mean RelativeFitness values differ by quite a bit between the two treatments.

It is often the case that the data could be grouped by more than one category. For example, we may want to look at the RelativeFitness values both per Treatment and per Group to see if each group saw similar differences between the treatments.

bygroup_treatment = data.groupby(['Group', 'Treatment'])

We can then get similar statistics for this two-level grouping:

bygroup_treatment['RelativeFitness'].describe()
 
Group  Treatment       
BAC    Dish       count    2.000000
                  mean     1.633628
                  std      0.026313
                  min      1.615022
                  25%      1.624325
                  50%      1.633628
                  75%      1.642931
                  max      1.652234
       Tube       count    2.000000
                  mean     0.802749
                  std      0.010808
                  min      0.795107
                  25%      0.798928
                  50%      0.802749
                  75%      0.806570
...
SWI    Dish       mean     1.451796
                  std      0.079652
                  min      1.362556
                  25%      1.419848
                  50%      1.477141
                  75%      1.496417
                  max      1.515692
       Tube       count    3.000000
                  mean     0.918647
                  std      0.009692
                  min      0.909023
                  25%      0.913768
                  50%      0.918514
                  75%      0.923459
                  max      0.928405
Length: 176, dtype: float64

Here, the ... indicates that pandas isn’t printing all of the results (although they do exist and can be iterated over).

Applying Functions to the Subsets

As we saw, the describe() method produces some very useful statistics about the RelativeFitness values for the grouped data. Pandas includes a number of common ones such as mean(), max(), median(), etc.:

bygroup_treatment['RelativeFitness'].mean()
 
Group  Treatment
BAC    Dish         1.633628
       Tube         0.802749
BKB    Dish         1.315682
       Tube         0.951087
DOS    Dish         1.587148
       Tube         0.945595
ECO    Dish         1.561197
       Tube         0.971033
ETH    Dish         1.482941
       Tube         0.934431
FIT    Dish         1.001960
       Tube         0.956410
H2W    Dish         1.525228
       Tube         0.902636
HHE    Dish         1.424773
       Tube         0.939615
JDK    Dish         1.546707
       Tube         0.890721
PPP    Dish         1.547974
       Tube         0.970277
SWI    Dish         1.451796
       Tube         0.918647
Name: RelativeFitness, dtype: float64

Aside from these, arbitrary functions can also be applied to groups of data—including ones you write yourself—using the aggregate method. Here, we’ll use NumPy’s sum function, which will calculate the sum of the RelativeFitness values per group (which doesn’t really mean anything):

import numpy as np

bygroup_treatment['RelativeFitness'].aggregate(np.sum)
 
Group  Treatment
BAC    Dish         3.267256
       Tube         1.605498
BKB    Dish         3.947045
       Tube         2.853261
DOS    Dish         4.761443
       Tube         2.836784
ECO    Dish         4.683590
       Tube         2.913100
ETH    Dish         4.448824
       Tube         2.803292
FIT    Dish         3.005880
       Tube         2.869231
H2W    Dish         4.575683
       Tube         2.707909
HHE    Dish         4.274319
       Tube         2.818846
JDK    Dish         4.640121
       Tube         2.672164
PPP    Dish         4.643923
       Tube         2.910832
SWI    Dish         4.355389
       Tube         2.755942
Name: RelativeFitness, dtype: float64

We can also apply multiple functions to the groups using agg():

bygroup_treatment['RelativeFitness'].agg([np.sum, np.mean, np.std, len])
 
                      sum      mean       std  len
Group Treatment                                   
BAC   Dish       3.267256  1.633628  0.026313    2
      Tube       1.605498  0.802749  0.010808    2
BKB   Dish       3.947045  1.315682  0.179156    3
      Tube       2.853261  0.951087  0.070794    3
DOS   Dish       4.761443  1.587148  0.006740    3
      Tube       2.836784  0.945595  0.005768    3
ECO   Dish       4.683590  1.561197  0.124910    3
      Tube       2.913100  0.971033  0.029120    3
ETH   Dish       4.448824  1.482941  0.124571    3
      Tube       2.803292  0.934431  0.022169    3
FIT   Dish       3.005880  1.001960  0.041853    3
      Tube       2.869231  0.956410  0.038808    3
H2W   Dish       4.575683  1.525228  0.014052    3
      Tube       2.707909  0.902636  0.043792    3
HHE   Dish       4.274319  1.424773  0.085070    3
      Tube       2.818846  0.939615  0.015305    3
JDK   Dish       4.640121  1.546707  0.036218    3
      Tube       2.672164  0.890721  0.036479    3
PPP   Dish       4.643923  1.547974  0.036531    3
      Tube       2.910832  0.970277  0.020897    3
SWI   Dish       4.355389  1.451796  0.079652    3
      Tube       2.755942  0.918647  0.009692    3

With this, the sky’s the limit. Although the example I used was for experimental evolution data using microbes, the same concepts can be used to split-apply-combine any other kind of data that can be grouped into categories.


Working with CSVs on the Command Line
Brian Connelly
Mon, 23 Sep 2013
http://bconnelly.net/working-with-csvs-on-the-command-line/

Comma-separated values (CSV) and its close relatives (e.g., tab-separated values) play a very important role in open access science. CSV is an informally-defined file format that stores tabular data (think spreadsheets) in plain text. Within the file, each row contains a record, and each field in that record is separated by a comma, tab, or some other character. This format offers several significant advantages. Because they are plain text, these files can be easily read and edited without the need for specialized or proprietary software. CSVs are also version-independent, so ten years down the road you won’t have to track down some ancient piece of software in order to revisit your data (or do the same for someone else’s data). Support for CSV files is built into most data analysis software, programming languages, and online services (see Some Useful Resources at the end of this article for links for your software of choice).

To many, the command line (or shell) is a strange and scary place. Those familiar with working in the command line, however, wouldn’t trade its power, speed, and flexibility for anything, no matter how shiny or “user friendly” an alternative drag-and-drop interface might be. This power, speed, and flexibility extends to working with CSV files, and it’s here that I’d like to demonstrate a small slice of what can be done with just a few keystrokes. Even though the Unix command line has existed for decades, it’s still always just a few clicks away, whether you use Linux on some HPC hardware or cloud service, or whether you use Mac OS X on your laptop.

You don’t have to be a command line wizard to use the commands below, but you do need to know the basics of how to get to a command line (hint for OS X users, it’s through the Terminal app) and how to do things like changing directories (cd) and listing files (ls). If these topics are new to you, there’s a pretty good introduction linked below in Some Useful Resources. I’ve also included a brief description of the command line tools used in this cookbook below in A Quick Overview of the Commands Used in this Cookbook. In the next section, I’ll quickly introduce two of the command line’s most powerful features: pipes and output redirection. If you’re already familiar with these, you can jump right into the good stuff.

This guide is intended to be used like a cookbook. Although you can read it from start to finish (it’s not that long), each “recipe” is a self-contained guide for doing a particular task with a CSV file.

Although we’re focusing on comma-separated values, all of the commands below would work for data separated by tabs or any other separator. Just replace the comma character in the commands with your delimiter of choice. Most software that handles CSVs isn’t picky, and allows you to specify the delimiter used.
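
For example, when a file like this eventually makes its way into R, the delimiter is just an argument. The file names here are hypothetical:

# Comma-separated (the default)
results <- read.csv("results.csv")

# Tab-separated
results <- read.delim("results.tsv")

# Any other single-character delimiter
results <- read.table("results.txt", sep=";", header=TRUE)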


A Quick Introduction to Pipes and Redirection

Most of the commands available on the command line were developed with the same goals: to do a single job and to do it well. The real power of the command line comes from stringing these programs together to perform more complex tasks. By knowing just a handful of commands, just about anything is possible.

Writing Output to a File with Output Redirection

By default, most command line tools give their results by printing them to the screen. This might be ok when doing some calculations, but it’s not so useful if you’re transforming files (like CSVs) and intend to save them, share them, or use the results in some other software.

With output redirection, you can tell the command line to write the output of one or more commands to a file instead of printing it to the screen. To do this, just add the > character and the name of the file where you want the output to be stored to the end of the command. If the output file doesn’t already exist, > will create that file. If that file exists, > will replace its contents with the new output. If you’d rather add the output of your commands to the end of an existing file, you can use >> instead.

For example, the following command (which will be described in more detail in the first recipe) will print all of the lines of the file input.csv that do not begin with a # character:

grep -v "^#" input.csv

Instead, to save those results to a file named output.csv, we could use output redirection and run:

grep -v "^#" input.csv > output.csv

If output.csv already existed, its contents would be replaced. If we instead wanted to add the output of the current command to output.csv, we’d do:

grep "^#" input.csv >> output.csv

Connecting Programs with Pipes

Pipes, denoted with the | character, allow the output of one program to be sent as input to another program. This saves you from having to first save the output of the first program to a file, and then start the second program using that file as input. You can imagine that a workflow with many steps could generate a lot of intermediate files and require a lot of concentration to see things through from start to finish.

Using our previous example once again:

grep -v "^#" input.csv

This prints all of the lines of input.csv that do not begin with the # character. The grep program searches for patterns within files (see A Quick Overview of the Commands Used in this Cookbook). Here, we’re using it in reverse and finding where a pattern does not exist. If we wanted to count the number of non-commented lines (i.e., those that do not begin with #) in input.csv, we could use grep to strip out the lines beginning with #, and use a pipe to send the results to the wc program, which can count the number of lines in its input when used with the -l argument:

grep -v "^#" input.csv | wc -l

Pipes are used extensively in the recipes that follow. If you’re still not quite comfortable with how they work, try executing the commands in the recipes one by one to see what the output of each part is. Then try to visualize how they all come together.

Cleaning Up CSV Files

Removing Comments

Comments are additional information stored in the file that are not a part of the actual data set, but that either describe the data set or provide relevant information (metadata). Comments could include things like the name of the person who gathered the data, the date on which measurements were made, the equipment used, or notes about a particular record. Typically, comments occur on their own line and begin with a # sign as demonstrated below:

Temperature,Row,Column,Luminescence
# Luminescence of evolved strain
# Collected by Harvey I. Vibrio - 2012/06/27

26.3,0,0,7444.945
26.3,0,1,4845.375
26.3,0,2,4056.362
# Look at this luminescence!!!
26.3,0,3,4883.137
26.3,0,4,3593.289
26.3,0,5,2645.281
26.3,1,2,10507.588

Despite their common use, comments are not supported by all software and can occasionally cause problems when importing data. Fortunately, comments can quickly be stripped from a file using the grep command. The following command strips out all lines that begin with the # character from a file named input.csv:

grep -v "^#" input.csv

This command will print all the lines of input.csv that do not begin with #. The ^ character means the beginning of a line. To save this output as a file, we can redirect the output of grep to a new file called input-nocomments.csv:

grep -v "^#" input.csv > input-nocomments.csv

Now, input-nocomments.csv can be imported into programs that do not support comments.

Removing Headers

Headers are commonly used to name each column. They appear as a comma-separated list of column names at the top of the file. Sometimes, it is beneficial to remove headers from a CSV file when merging multiple files (see Combining Rows from Two or More CSVs below) or when the software you’re using to import the data is finicky. For example, R expects the header to be the first line of a file and cannot be preceded by comments. Assuming that the header is on the first line of input.csv, we can remove it with the cat and sed commands:

cat input.csv | sed "1 d"

This uses the Unix pipe (|) symbol to connect the output of the cat program, which just prints the contents of the file, to the sed program, which we use to delete the first line. The 1 in the parameters to sed can be changed to skip a different number of lines. As before, the results of this command are printed. To save these results as a new file named noheader.csv, the output can be redirected:

cat input.csv | sed "1 d" > noheader.csv

If our input file has comments before the header line, this command won’t quite have the effect that we want, because sed blindly removes the first line of the file. Through the power and beauty of the command line, however, we can combine our comment stripping ability with our line chopping ability:

cat input.csv | grep -v "^#" | sed "1 d" > onlydata.csv

This command uses cat to send the contents of the file to grep, which only prints out lines that do not begin with #. The output goes to sed, which removes the first line of the non-commented data. Finally, the results of this string of operations are placed in a new file named onlydata.csv.

Changing Delimiters

You can easily change the delimiter of a file using the tr command. To convert a tab-separated values file named input.tsv to a CSV file named input.csv:

cat input.tsv | tr "\t" "," > input.csv

The two arguments given to tr are the character to change, "\t" (a tab), and the character to change it to, "," (a comma). Similarly, we can convert a CSV file to a TSV file by reversing the arguments:

cat input.csv | tr "," "\t" > input.tsv
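
The same approach works for other single-character delimiters. For example, semicolon-separated files are common in some locales; assuming one named input-semicolons.csv (a hypothetical file name), this converts it to a standard CSV:

cat input-semicolons.csv | tr ";" "," > input.csv

Keep in mind that tr replaces every occurrence of the character, including any that happen to appear inside quoted fields.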

Getting Information about the Data Set

Before working with a data set, you’d sometimes like to get a sense of how big it is and the types of values that it stores. This section provides a few ways of doing this.

Taking a Peek at the Data Set

In the previous recipes, we removed comments and the header from a CSV file. We did so because we knew beforehand that they existed in the data set. If we’re not familiar with a file, we can usually tell whether it has a header or comments by looking at its first few lines. Rather than opening a potentially very large file in a text editor, we can take a quick look with the head command:

head input.csv

This command will display the first 10 lines of input.csv. To see more or fewer lines, we can specify the number using the -n argument:

head -n 7 input.csv

This should give us an idea whether or not there’s a header in the file and if there are comments within the first few lines. If we want to search the whole file for comments, we can use grep:

grep "#" input.csv

This prints each line in input.csv that contains the # character. If we instead want to know how many comment lines occur in the file, we can send the output to the wc command, which counts the number of lines in grep’s output when given the -l argument:

grep "#" input.csv | wc -l

If the output from this command is 0, then we know that input.csv contains no comments.
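
If all you want is the count, grep can also do the counting itself: its -c argument prints the number of matching lines, with no need for wc:

grep -c "#" input.csv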

Determining the Number of Rows

The wc command lists the number of lines in a file when the -l argument is given.

wc -l input.csv

It does so indiscriminately, so its count will include headers and comments. Using previous recipes, though, we can strip these things from the file before counting the number of lines.

cat input.csv | grep -v "^#" | sed "1 d" | wc -l

This command prints out the input.csv file and sends it to grep, which strips out the lines beginning with #. That output is sent to sed, which removes the header on the first line (assuming there is one; see the previous recipe). The result of all this is then sent to wc, which counts and prints the number of lines, each of which is presumably a data record.

Determining the Number of Columns

Similarly, we can determine the number of columns in the data set by looking at the non-comment lines:

cat input.csv | grep -v "^#" | awk "{print NF}" FS=,

This command uses grep to strip out commented rows and sends the remaining rows to awk, which prints the number of fields (NF) in each line; the FS=, argument tells awk that fields are separated by commas. We can send these results to the uniq command to output only the unique field counts present in the file:

cat input.csv | grep -v "^#" | awk "{print NF}" FS=, | uniq

Typically, all rows in a CSV file contain the same number of fields. In that case, the command will return a single number. If more than one number is returned, the number of fields varies from line to line in the file, which could cause other problems.
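
One caveat: uniq only collapses adjacent duplicates, so if the field counts vary back and forth, the same number can appear in the output more than once. Sorting the counts first avoids this, and adding uniq’s -c argument shows how many lines have each field count:

cat input.csv | grep -v "^#" | awk "{print NF}" FS=, | sort | uniq -c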

Combining CSVs

The following recipes deal with combining data from two or more CSV files.

Combining Rows from Two or More CSVs

If some of our records are stored in one file, say input1.csv, and other records are stored in one or more others, we can combine these files (vertically) into one large data file called combined.csv using cat:

cat input1.csv input2.csv input3.csv > combined.csv

It is important to note that each of the files should contain the same columns, in the same order. Recipes for extracting and rearranging specific columns are included later (see Extracting Columns).

If each of the input files has a header, each of those header lines will be copied into combined.csv. The easiest way to deal with this is to strip the header line from each file as we combine them:

cat input1.csv > combined.csv
cat input2.csv | sed "1 d" >> combined.csv
cat input3.csv | sed "1 d" >> combined.csv

You may have noticed that we copied input1.csv as-is into combined.csv. By doing this, we’re using the header from the first file as the header for our new combined file. The remaining lines use sed to filter out the first line before adding the contents of input2.csv and input3.csv to combined.csv (note the use of >>, which appends the additional data to the output file). The file that results from these commands will contain the header and data from input1.csv and just the data from input2.csv and input3.csv.
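
If there are many files, repeating those lines gets tedious. Here is a minimal sketch using a shell loop, assuming every file has a single header on its first line and contains no comments:

head -n 1 input1.csv > combined.csv
for f in input1.csv input2.csv input3.csv
do
    cat "$f" | sed "1 d" >> combined.csv
done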

Combining Columns from Multiple CSVs

If the fields of your data are spread across multiple files, these can be combined (horizontally) using paste:

paste -d , input1.csv input2.csv > combined.csv

Here, the -d , argument says that the contents from input1.csv and input2.csv will be separated (delimited) by a comma. By default, paste will separate columns using a tab.

It is important to note here that doing this assumes the rows of each input file are in the same order, so that, for example, the record in row 8 of input1.csv corresponds to the record in row 8 of input2.csv.

If you’re interested in extracting specific columns from your input file(s), see the recipe for Extracting Columns.

Extracting Data Subsets

Extracting Specific Row Numbers from CSVs

Data entries in CSV files can be extracted either by specific line numbers or by matching some criteria. The former is more straightforward and can be done in a number of ways depending on which lines you’re interested in.

If we’re interested in the first few rows of data, we can use the head command. For example, to extract the first 13 lines from input.csv and save the results to first13.csv, we could execute:

head -n 13 input.csv > first13.csv

To make sure we’re extracting only data and not headers and comments, we could first strip these following the Removing Headers recipe:

cat input.csv | grep -v "^#" | sed "1 d" | head -n 13 > first13.csv

Similarly, if we’re interested in extracting the last 8 rows from the input file, we could use the tail command:

tail -n 8 input.csv > last8.csv

As with the previous example, we may want to first strip out any comments. Because we’re taking rows from the end of the file, the output will not include the header. One easy way to keep it is to first extract the header from the input file using head and then append the last 8 rows:

head -n 1 input.csv > last8.csv
tail -n 8 input.csv >> last8.csv

As we’ve seen before, the > creates a new file (or overwrites an existing one) named last8.csv and writes the output of head to it. The >> then appends the results of tail to this file. If you’d like to do these two steps on one line (it’s still two commands), they can be combined with a semicolon (;):

head -n 1 input.csv > last8.csv ; tail -n 8 input.csv >> last8.csv

If for some reason you know the specific line numbers of interest, you can carve out those lines using sed. For example, to extract lines 10 through 14:

sed -n "10,14 p" input.csv
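
Note that sed counts every line, including any header or comment lines, when deciding which lines are 10 through 14. As in the earlier recipes, the output is printed to the screen; to keep the header and save the selected rows to a new file (with a hypothetical name), we can combine head and sed:

head -n 1 input.csv > rows10-14.csv
sed -n "10,14 p" input.csv >> rows10-14.csv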

Extracting Rows Based on Some Value

The other case is when we’re interested in selecting rows by a certain value. For example, we may have a phone book stored as a CSV, where each row contains the name of a person or business and their phone number. If we wanted to create a CSV that contains all of the places (or people!) with Pizza in their name, we could use grep to find all rows in the file containing that word and save the results to a new phone book of pizza places named pizza.csv:

grep "pizza" input.csv > pizza.csv

As with the previous example, pizza.csv will not contain a header, so we can use head to copy one over first:

head -n 1 input.csv > pizza.csv
grep "pizza" input.csv >> pizza.csv

Note that pizza.csv will also contain any comments that include the word pizza, so you may want to first strip comments from your input. Also note that grep is case sensitive, so it will match records containing pizza but not ones containing Pizza. We can make grep case insensitive by adding the -i argument:

grep -i "pizza" input.csv >> pizza.csv

If you’re interested in extracting rows based on criteria other than matching some specified text, such as rows containing some value greater than a number of interest, I’d recommend moving away from command line tools and creating a script in something like Python or R.
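
That said, a very simple threshold filter is possible with awk before reaching for a script. As a rough sketch, assuming the header and comments have already been stripped (for example, the onlydata.csv file created earlier) and that the value of interest is in the fourth column, as Luminescence is in the example file above, this prints only the rows whose fourth field is greater than 5000 (note the single quotes, which keep the shell from expanding $4):

awk -F, '$4 > 5000' onlydata.csv

Anything more involved than a single comparison is, as noted above, usually easier to handle in Python or R.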

Extracting Columns

Specific columns can also be easily extracted from CSVs. For example, if we wanted to extract columns 2, 4, 5, and 6 from input.csv:

cut -d , -f 2,4-6 input.csv

Here, the -d , tells cut that columns are separated by commas, and -f 2,4-6 tells it to extract column 2 and columns 4-6. The -f argument can take a single column number or a comma-separated list of numbers and ranges.
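
One caveat: cut always writes fields in the order they appear in the input, even if the -f list is given out of order, so it cannot be used to rearrange columns. If you do need to reorder them, awk can help; here is a minimal sketch that prints column 4 followed by column 2, separated by a comma (again using single quotes so the shell leaves $4 and $2 alone):

awk -F, '{ print $4 "," $2 }' input.csv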

Working with Compressed CSV Files

If our data set gets very large, we can compress the CSV file using a file compression tool such as bzip2 or gzip to shrink the file size (sometimes dramatically). With slight modification, any of the commands shown in this cookbook can be used with compressed CSV files. Additionally, tools such as Python and R have bzip2 and gzip support, so they can read and write these compressed CSV files as well.

Compressing CSV Files with bzip2

bzip2 is an open source compression tool available on most Linux and OS X machines. To compress your input file, run:

bzip2 input.csv

This will produce a file named input.csv.bz2. If you should ever need to uncompress this file, use the --decompress (or -d) argument:

bzip2 -d input.csv.bz2

Working with bzip2-Compressed CSV Files

If you look at input.csv.bz2, it will look like an unintelligible chunk of characters. This is exactly how the file will look to any of the other utilities we’ve used in this cookbook. These utilities expect to work with plain text, so they usually can’t handle compressed data. Fortunately, we can easily give these tools the uncompressed data without actually uncompressing the file. To do so, we’ll use the bzcat command, which prints out the uncompressed contents of the file, and then use pipes to send that output to the right tools. For example, if we want to extract the third column of a compressed CSV file (see Extracting Columns) and save the results to col3.csv, we could run:

bzcat input.csv.bz2 | cut -d , -f 3 > col3.csv

Since we’re all about dealing with compressed CSVs, we could also save the results as a bzipped file by passing them through the bzip2 command:

bzcat input.csv.bz2 | cut -d , -f 3 | bzip2 > col3.csv.bz2
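
The same idea works with any of the earlier recipes. For example, to count the data records in the compressed file (as in Determining the Number of Rows), assuming a single header line and the usual # comments:

bzcat input.csv.bz2 | grep -v "^#" | sed "1 d" | wc -l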

Compressing CSV Files with gzip

Some older machines will not have bzip2 installed, but they likely have another open source compression tool called gzip. As with bzip2, files can be compressed by passing them to the gzip command:

gzip input.csv

which produces input.csv.gz. To uncompress this file, just use the gunzip command:

gunzip input.csv.gz

Working with gzip-Compressed CSV Files

Working with gzip-compressed CSV files is the same as described in Working with bzip2-Compressed CSV Files, except that the gzcat command is used in place of bzcat. If gzcat is not available on the machine, try zcat.
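
For example, the column-extraction command from above becomes:

gzcat input.csv.gz | cut -d , -f 3 > col3.csv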

A Quick Overview of the Commands Used in this Cookbook

This section provides a quick description of each of the commands used and a link to their man page (manual).

  • awk: AWK is a programming language (and command) that is great for processing text (man page)
  • cat: concatenates and prints files (man page)
  • cut: removes specific bytes, characters, or fields from files (man page)
  • grep: finds lines matching (or not matching) a given pattern (man page)
  • head: outputs the first part of a file (man page)
  • sed: performs simple manipulations of text in a file (man page)
  • tail: outputs the end of a file (man page)
  • tr: changes (translates) characters within a file (man page)
  • wc: counts the number of characters, words, and lines in a file (man page)

Some Useful Resources

CSV Format

Software Support for CSV

Introduction to Unix (also Linux, OS X, etc.) and The Command Line
