Like many, I often divide my computational work between Python and R. For a while, I’ve primarily done analysis in R. And with the power of data frames and packages that operate on them like reshape, my data manipulation and aggregation have moved more and more into the R world as well. Perhaps my favorite tool of all has been plyr, which allows you to easily split up a data set into subsets based on some criteria, apply a function or set of functions to those pieces, and combine those results back together (a.k.a. “split-apply-combine”). For example, I often use this to split up a data set by treatment, calculate some summary stats for each treatment, and put these statistics back together for comparison. With R and these excellent packages, these steps are about as painless as it gets (I actually enjoy them, but that’s probably not normal). Because of this, R has long been my choice for doing this kind of work.

About two years ago, I discovered pandas, a Python library for performing data analysis. The goal of pandas is to provide data structures and functions that make data analysis in Python just as easy as (if not easier than) in R. At the time, it was young and growing very quickly, so although I could see the huge potential, I wasn’t quite ready to make the switch. Still, the premise is extremely exciting for anyone who does at least some work in Python. Because of that, I’ve checked in on pandas now and then to see how it would handle jobs for which I’d instinctively reach for R. On a whim, I decided to try it out yesterday for a split-apply-combine job and was pleasantly surprised both by how easily the job could be done with pandas and by how quickly it produced the results.

The Data

As an example, let’s take a look at some data produced by several groups in Ben Kerr’s Experimental Evolutionary Ecology class. This data set contains the fitness of a flocculated strain of Escherichia coli relative to a non-flocculated strain when grown alone in either spatially-structured (dish) or spatially-unstructured (tube) environments.

import pandas
# NumPy provides the functions we'll pass to aggregate() later on
import numpy as np

# Load the data into a DataFrame
data = pandas.read_csv('TradeoffData.csv')

data.head(n=10)
 
  Group Treatment  Replicate  RelativeFitness
0   BKB      Tube          1         0.869963
1   BKB      Tube          2         1.000363
2   BKB      Tube          3         0.982935
3   BAC      Tube          1         0.810392
4   BAC      Tube          2         0.795107
5   JDK      Tube          1         0.849204
6   JDK      Tube          2         0.917637
7   JDK      Tube          3         0.905323
8   ETH      Tube          1         0.930821
9   ETH      Tube          2         0.958183
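Once the data are loaded, it’s worth a quick sanity check that pandas read things the way you expect. A minimal sketch, using a small inline sample in place of TradeoffData.csv:

```python
import pandas

# A small inline sample standing in for TradeoffData.csv
data = pandas.DataFrame({
    'Group': ['BKB', 'BAC'],
    'Treatment': ['Tube', 'Tube'],
    'Replicate': [1, 1],
    'RelativeFitness': [0.869963, 0.810392],
})

# Rows x columns, and the type pandas inferred for each column
print(data.shape)
print(data.dtypes)
```

If a numeric column shows up as `object`, something (a stray string, perhaps) kept pandas from parsing it as a number.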

Splitting the Data

Now that we have the data loaded, let’s divide it into groups. First, we could group the data by Treatment, which allows us to compare the Tube versus the Dish treatments:

bytreatment = data.groupby('Treatment')

This stores the grouping in a pandas DataFrameGroupBy object, which you will see if you try to print it.
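If you’d like to peek inside that object, it exposes the group labels directly, and get_group() pulls out a single subset as a regular DataFrame. A quick sketch, again using a small inline sample in place of TradeoffData.csv:

```python
import pandas

# A small inline sample standing in for TradeoffData.csv
data = pandas.DataFrame({
    'Group': ['BKB', 'BKB', 'BAC', 'BAC'],
    'Treatment': ['Tube', 'Dish', 'Tube', 'Dish'],
    'RelativeFitness': [0.87, 1.32, 0.81, 1.62],
})

bytreatment = data.groupby('Treatment')

# The mapping from each group label to the row labels it contains
print(bytreatment.groups)

# get_group() returns a single subset as a regular DataFrame
tube = bytreatment.get_group('Tube')
print(tube)
```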

Since RelativeFitness is the value we’re interested in with these data, let’s look at information about the distribution of RelativeFitness values within the groups. To get some basic statistics, we can use the describe() method:

bytreatment['RelativeFitness'].describe()
 
Treatment       
Dish       count    32.000000
           mean      1.456359
           std       0.184792
           min       0.955221
           25%       1.429005
           50%       1.510884
           75%       1.581340
           max       1.699276
Tube       count    32.000000
           mean      0.929589
           std       0.050153
           min       0.795107
           25%       0.915050
           50%       0.939089
           75%       0.953505
           max       1.000363
dtype: float64

Based on this, we can clearly see that the mean RelativeFitness values differ quite a bit between the two treatments.
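If the means are all you’re after, you can skip describe() and call mean() directly on the grouped column. A quick sketch, with a small made-up sample in place of the real data:

```python
import pandas

# A small made-up sample in place of TradeoffData.csv
data = pandas.DataFrame({
    'Treatment': ['Tube', 'Tube', 'Dish', 'Dish'],
    'RelativeFitness': [0.87, 0.99, 1.43, 1.51],
})

# Mean RelativeFitness per treatment, returned as a Series
# indexed by the Treatment labels
means = data.groupby('Treatment')['RelativeFitness'].mean()
print(means)
```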

It is often the case that the data could be grouped by more than one category. For example, we may want to look at the RelativeFitness values both per Treatment and per Group to see if each group saw similar differences between the treatments.

bygroup_treatment = data.groupby(['Group', 'Treatment'])

We can then get similar statistics for this two-level grouping:

bygroup_treatment['RelativeFitness'].describe()
 
Group  Treatment       
BAC    Dish       count    2.000000
                  mean     1.633628
                  std      0.026313
                  min      1.615022
                  25%      1.624325
                  50%      1.633628
                  75%      1.642931
                  max      1.652234
       Tube       count    2.000000
                  mean     0.802749
                  std      0.010808
                  min      0.795107
                  25%      0.798928
                  50%      0.802749
                  75%      0.806570
...
SWI    Dish       mean     1.451796
                  std      0.079652
                  min      1.362556
                  25%      1.419848
                  50%      1.477141
                  75%      1.496417
                  max      1.515692
       Tube       count    3.000000
                  mean     0.918647
                  std      0.009692
                  min      0.909023
                  25%      0.913768
                  50%      0.918514
                  75%      0.923459
                  max      0.928405
Length: 176, dtype: float64

The ... indicates that pandas isn’t printing all of the results here, although they do exist and can be iterated over.
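To work with those subsets yourself, you can iterate over the GroupBy object directly; each iteration yields the group key and the matching subset of rows. A small sketch (the sample data here is made up, standing in for the full data set):

```python
import pandas

# A small made-up sample standing in for the full data set
data = pandas.DataFrame({
    'Group': ['BAC', 'BAC', 'BKB', 'BKB'],
    'Treatment': ['Dish', 'Tube', 'Dish', 'Tube'],
    'RelativeFitness': [1.62, 0.80, 1.32, 0.95],
})

bygroup_treatment = data.groupby(['Group', 'Treatment'])

# Each iteration yields a key and the matching subset of rows; the
# key is a (Group, Treatment) tuple since we grouped on two columns
for (group, treatment), subset in bygroup_treatment:
    print(group, treatment, subset['RelativeFitness'].mean())
```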

Applying Functions to the Subsets

As we saw, the describe() method produces some very useful statistics about the RelativeFitness values for the grouped data. pandas includes a number of common aggregation methods such as mean(), max(), and median():

bygroup_treatment['RelativeFitness'].mean()
 
Group  Treatment
BAC    Dish         1.633628
       Tube         0.802749
BKB    Dish         1.315682
       Tube         0.951087
DOS    Dish         1.587148
       Tube         0.945595
ECO    Dish         1.561197
       Tube         0.971033
ETH    Dish         1.482941
       Tube         0.934431
FIT    Dish         1.001960
       Tube         0.956410
H2W    Dish         1.525228
       Tube         0.902636
HHE    Dish         1.424773
       Tube         0.939615
JDK    Dish         1.546707
       Tube         0.890721
PPP    Dish         1.547974
       Tube         0.970277
SWI    Dish         1.451796
       Tube         0.918647
Name: RelativeFitness, dtype: float64

Aside from these, arbitrary functions, including ones you write yourself, can also be applied to groups of data using the aggregate() method. Here, we’ll use NumPy’s sum function, which will calculate the sum of the RelativeFitness values per group (which doesn’t really mean anything biologically):

bygroup_treatment['RelativeFitness'].aggregate(np.sum)
 
Group  Treatment
BAC    Dish         3.267256
       Tube         1.605498
BKB    Dish         3.947045
       Tube         2.853261
DOS    Dish         4.761443
       Tube         2.836784
ECO    Dish         4.683590
       Tube         2.913100
ETH    Dish         4.448824
       Tube         2.803292
FIT    Dish         3.005880
       Tube         2.869231
H2W    Dish         4.575683
       Tube         2.707909
HHE    Dish         4.274319
       Tube         2.818846
JDK    Dish         4.640121
       Tube         2.672164
PPP    Dish         4.643923
       Tube         2.910832
SWI    Dish         4.355389
       Tube         2.755942
Name: RelativeFitness, dtype: float64
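aggregate() will also happily accept a function you write yourself. As a sketch, here’s a hypothetical fitness_range() helper of my own (not part of pandas) that reports the spread of values within each group, run against a small made-up sample:

```python
import pandas

# A small made-up sample in place of TradeoffData.csv
data = pandas.DataFrame({
    'Group': ['BAC', 'BAC', 'BKB', 'BKB', 'BKB'],
    'Treatment': ['Tube', 'Tube', 'Tube', 'Tube', 'Tube'],
    'RelativeFitness': [0.81, 0.80, 0.87, 1.00, 0.98],
})

# fitness_range() is our own helper, not a pandas built-in: it
# receives each group's values as a Series and returns one number
def fitness_range(values):
    return values.max() - values.min()

spread = data.groupby(['Group', 'Treatment'])['RelativeFitness'].aggregate(fitness_range)
print(spread)
```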

We can also apply multiple functions to the groups using agg():

bygroup_treatment['RelativeFitness'].agg([np.sum, np.mean, np.std, len])
 
                      sum      mean       std  len
Group Treatment                                   
BAC   Dish       3.267256  1.633628  0.026313    2
      Tube       1.605498  0.802749  0.010808    2
BKB   Dish       3.947045  1.315682  0.179156    3
      Tube       2.853261  0.951087  0.070794    3
DOS   Dish       4.761443  1.587148  0.006740    3
      Tube       2.836784  0.945595  0.005768    3
ECO   Dish       4.683590  1.561197  0.124910    3
      Tube       2.913100  0.971033  0.029120    3
ETH   Dish       4.448824  1.482941  0.124571    3
      Tube       2.803292  0.934431  0.022169    3
FIT   Dish       3.005880  1.001960  0.041853    3
      Tube       2.869231  0.956410  0.038808    3
H2W   Dish       4.575683  1.525228  0.014052    3
      Tube       2.707909  0.902636  0.043792    3
HHE   Dish       4.274319  1.424773  0.085070    3
      Tube       2.818846  0.939615  0.015305    3
JDK   Dish       4.640121  1.546707  0.036218    3
      Tube       2.672164  0.890721  0.036479    3
PPP   Dish       4.643923  1.547974  0.036531    3
      Tube       2.910832  0.970277  0.020897    3
SWI   Dish       4.355389  1.451796  0.079652    3
      Tube       2.755942  0.918647  0.009692    3
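One last trick worth mentioning for the “combine” side of things: unstack() pivots a level of the group index into columns, which makes side-by-side comparison of the treatments easy. A sketch against a small made-up sample:

```python
import pandas

# A small made-up sample in place of TradeoffData.csv
data = pandas.DataFrame({
    'Group': ['BAC', 'BAC', 'BKB', 'BKB'],
    'Treatment': ['Dish', 'Tube', 'Dish', 'Tube'],
    'RelativeFitness': [1.63, 0.80, 1.32, 0.95],
})

# Mean fitness per (Group, Treatment), then pivot the Treatment
# level into columns: one row per Group, one column per Treatment
means = data.groupby(['Group', 'Treatment'])['RelativeFitness'].mean()
table = means.unstack('Treatment')
print(table)
```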

With this, the sky’s the limit. Although the example I used was for experimental evolution data using microbes, the same concepts can be used to split-apply-combine any other kind of data that can be grouped into categories.


Brian Connelly
About Brian Connelly
NSF postdoctoral research fellow working in the Kerr Lab at the University of Washington and the BEACON Center for the Study of Evolution in Action. I use computational, mathematical, and microbial systems to understand how environmental conditions influence the evolution of cooperation.