Like many, I often divide my computational work between Python and R. For a while, I’ve primarily done analysis in R. And with the power of data frames and packages that operate on them like reshape, my data manipulation and aggregation have moved more and more into the R world as well. Perhaps my favorite tool of all has been plyr, which allows you to easily split up a data set into subsets based on some criteria, apply a function or set of functions to those pieces, and combine the results back together (a.k.a. “split-apply-combine”). For example, I often use this to split up a data set by treatment, calculate some summary statistics for each treatment, and put those statistics back together for comparison. With R and these excellent packages, the process is about as painless as it gets (I actually enjoy it, but that’s probably not normal). Because of this, R has long been my choice for this kind of work.

About two years ago, I discovered pandas, a Python library for performing data analysis. The goal of pandas is to provide data structures and functions that make data analysis in Python just as easy as (if not easier than) it is in R. At the time, it was young and growing very quickly, so although I could see the huge potential, I wasn’t quite ready to make the switch. Still, the premise is extremely exciting for anyone who does at least some work in Python. Because of that, I’ve checked in on pandas now and then to see how it would handle jobs for which I’d instinctively reach for R. On a whim, I decided to try it out yesterday for a split-apply-combine job and was pleasantly surprised both by how easily it could be done with pandas and by how quickly it produced the results.

## The Data

As an example, let’s take a look at some data produced by several groups in Ben Kerr’s Experimental Evolutionary Ecology class. This data set contains the fitness of a flocculated strain of *Escherichia coli* relative to a non-flocculated strain when grown alone in either spatially-structured (dish) or spatially-unstructured (tube) environments.

import pandas
import numpy as np   # NumPy functions such as np.sum are used below

# Load the data into a DataFrame
# (the file name here is a placeholder for wherever the class data set lives)
data = pandas.read_csv('fitness.csv')

# Show the first ten rows
data.head(10)

Group Treatment  Replicate  RelativeFitness
0   BKB      Tube          1         0.869963
1   BKB      Tube          2         1.000363
2   BKB      Tube          3         0.982935
3   BAC      Tube          1         0.810392
4   BAC      Tube          2         0.795107
5   JDK      Tube          1         0.849204
6   JDK      Tube          2         0.917637
7   JDK      Tube          3         0.905323
8   ETH      Tube          1         0.930821
9   ETH      Tube          2         0.958183


## Splitting the Data

Now that we have the data loaded, let’s divide it into groups. First, we could group the data by Treatment, which allows us to compare the Tube versus the Dish treatments:

bytreatment = data.groupby('Treatment')


This stores the grouping in a pandas DataFrameGroupBy object; if you try to print it, you’ll see only the object’s type, not the underlying data.
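To see what a DataFrameGroupBy offers before aggregating anything, here’s a minimal sketch using a small hand-made DataFrame (the values are made up, not the class data):

```python
import pandas

# A tiny stand-in DataFrame with the same column names as the class data
# (values are made up)
df = pandas.DataFrame({
    'Treatment': ['Tube', 'Tube', 'Dish', 'Dish'],
    'RelativeFitness': [0.90, 0.95, 1.50, 1.60],
})

grouped = df.groupby('Treatment')

# Printing shows only the object's type, not the grouped data
print(grouped)

# .groups maps each group label to the row labels belonging to it
print(grouped.groups)

# get_group() pulls out one subset as a regular DataFrame
print(grouped.get_group('Dish'))
```

get_group() is handy for spot-checking a single treatment before running any aggregations.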

Since RelativeFitness is the value we’re interested in with these data, let’s look at information about the distribution of RelativeFitness values within the groups. To get some basic statistics, we can use the describe() method:

bytreatment['RelativeFitness'].describe()


Treatment
Dish       count    32.000000
           mean      1.456359
           std       0.184792
           min       0.955221
           25%       1.429005
           50%       1.510884
           75%       1.581340
           max       1.699276
Tube       count    32.000000
           mean      0.929589
           std       0.050153
           min       0.795107
           25%       0.915050
           50%       0.939089
           75%       0.953505
           max       1.000363
dtype: float64

Based on this, we can clearly see that the mean RelativeFitness values differ considerably between the two treatments.

It is often the case that the data could be grouped by more than one category. For example, we may want to look at the RelativeFitness values both per Treatment and per Group to see if each group saw similar differences between the treatments.

bygroup_treatment = data.groupby(['Group', 'Treatment'])


We can then get similar statistics for this two-level grouping:

bygroup_treatment['RelativeFitness'].describe()


Group  Treatment
BAC    Dish       count    2.000000
                  mean     1.633628
                  std      0.026313
                  min      1.615022
                  25%      1.624325
                  50%      1.633628
                  75%      1.642931
                  max      1.652234
       Tube       count    2.000000
                  mean     0.802749
                  std      0.010808
                  min      0.795107
                  25%      0.798928
                  50%      0.802749
                  75%      0.806570
...
SWI    Dish       mean     1.451796
                  std      0.079652
                  min      1.362556
                  25%      1.419848
                  50%      1.477141
                  75%      1.496417
                  max      1.515692
       Tube       count    3.000000
                  mean     0.918647
                  std      0.009692
                  min      0.909023
                  25%      0.913768
                  50%      0.918514
                  75%      0.923459
                  max      0.928405
Length: 176, dtype: float64

The ... indicates that pandas isn’t printing all of the results here, although they do exist and can be iterated over.
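Iterating over a grouping gives you each subset in turn. A sketch with a few made-up rows (same column names as the class data) shows the shape of what comes back:

```python
import pandas

# Stand-in data (made-up values) with the same columns as the class data
df = pandas.DataFrame({
    'Group': ['BAC', 'BAC', 'SWI', 'SWI'],
    'Treatment': ['Dish', 'Tube', 'Dish', 'Tube'],
    'RelativeFitness': [1.63, 0.80, 1.45, 0.92],
})

# Iterating over a two-level grouping yields ((Group, Treatment), subset)
# pairs, where each subset is a regular DataFrame
for (group, treatment), subset in df.groupby(['Group', 'Treatment']):
    print(group, treatment, len(subset))
```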

## Applying Functions to the Subsets

As we saw, the describe() method produces some very useful statistics about the RelativeFitness values for the grouped data. pandas includes a number of other common aggregations such as mean(), max(), and median():

bygroup_treatment['RelativeFitness'].mean()


Group  Treatment
BAC    Dish         1.633628
       Tube         0.802749
BKB    Dish         1.315682
       Tube         0.951087
DOS    Dish         1.587148
       Tube         0.945595
ECO    Dish         1.561197
       Tube         0.971033
ETH    Dish         1.482941
       Tube         0.934431
FIT    Dish         1.001960
       Tube         0.956410
H2W    Dish         1.525228
       Tube         0.902636
HHE    Dish         1.424773
       Tube         0.939615
JDK    Dish         1.546707
       Tube         0.890721
PPP    Dish         1.547974
       Tube         0.970277
SWI    Dish         1.451796
       Tube         0.918647
Name: RelativeFitness, dtype: float64
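To line the Dish and Tube means up side by side for each group, the hierarchical index that groupby produces can be pivoted into columns with unstack(). A sketch with made-up values:

```python
import pandas

# Stand-in data (made-up values, one row per Group/Treatment combination)
df = pandas.DataFrame({
    'Group': ['BAC', 'BAC', 'SWI', 'SWI'],
    'Treatment': ['Dish', 'Tube', 'Dish', 'Tube'],
    'RelativeFitness': [1.63, 0.80, 1.45, 0.92],
})

means = df.groupby(['Group', 'Treatment'])['RelativeFitness'].mean()

# unstack() pivots the inner index level (Treatment) into columns,
# giving one row per Group with a Dish column and a Tube column
side_by_side = means.unstack()
print(side_by_side)

# The per-group difference between treatments is then a plain subtraction
print(side_by_side['Dish'] - side_by_side['Tube'])
```

This makes it easy to check whether each group saw a similar Dish-versus-Tube difference.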

Aside from these, arbitrary functions can also be applied to groups of data—including ones you write yourself—using the aggregate method. Here, we’ll use NumPy’s sum function, which will calculate the sum of the RelativeFitness values per group (which doesn’t really mean anything):

bygroup_treatment['RelativeFitness'].aggregate(np.sum)


Group  Treatment
BAC    Dish         3.267256
       Tube         1.605498
BKB    Dish         3.947045
       Tube         2.853261
DOS    Dish         4.761443
       Tube         2.836784
ECO    Dish         4.683590
       Tube         2.913100
ETH    Dish         4.448824
       Tube         2.803292
FIT    Dish         3.005880
       Tube         2.869231
H2W    Dish         4.575683
       Tube         2.707909
HHE    Dish         4.274319
       Tube         2.818846
JDK    Dish         4.640121
       Tube         2.672164
PPP    Dish         4.643923
       Tube         2.910832
SWI    Dish         4.355389
       Tube         2.755942
Name: RelativeFitness, dtype: float64
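The same mechanism works for functions you write yourself: aggregate() hands in each group’s values as a Series and keeps one result per group. A sketch with made-up values and a hypothetical fitness_range function:

```python
import pandas

# Stand-in data (made-up values)
df = pandas.DataFrame({
    'Treatment': ['Tube', 'Tube', 'Dish', 'Dish'],
    'RelativeFitness': [0.90, 0.95, 1.50, 1.60],
})

# A hypothetical user-written aggregation: the spread of fitness values
# within a group (max minus min)
def fitness_range(values):
    return values.max() - values.min()

spread = df.groupby('Treatment')['RelativeFitness'].aggregate(fitness_range)
print(spread)
```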

We can also apply multiple functions to the groups using agg():

bygroup_treatment['RelativeFitness'].agg([np.sum, np.mean, np.std, len])


                      sum      mean       std  len
Group Treatment
BAC   Dish       3.267256  1.633628  0.026313    2
      Tube       1.605498  0.802749  0.010808    2
BKB   Dish       3.947045  1.315682  0.179156    3
      Tube       2.853261  0.951087  0.070794    3
DOS   Dish       4.761443  1.587148  0.006740    3
      Tube       2.836784  0.945595  0.005768    3
ECO   Dish       4.683590  1.561197  0.124910    3
      Tube       2.913100  0.971033  0.029120    3
ETH   Dish       4.448824  1.482941  0.124571    3
      Tube       2.803292  0.934431  0.022169    3
FIT   Dish       3.005880  1.001960  0.041853    3
      Tube       2.869231  0.956410  0.038808    3
H2W   Dish       4.575683  1.525228  0.014052    3
      Tube       2.707909  0.902636  0.043792    3
HHE   Dish       4.274319  1.424773  0.085070    3
      Tube       2.818846  0.939615  0.015305    3
JDK   Dish       4.640121  1.546707  0.036218    3
      Tube       2.672164  0.890721  0.036479    3
PPP   Dish       4.643923  1.547974  0.036531    3
      Tube       2.910832  0.970277  0.020897    3
SWI   Dish       4.355389  1.451796  0.079652    3
      Tube       2.755942  0.918647  0.009692    3
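To finish the “combine” step, the aggregated table, which is indexed by Group and Treatment, can be flattened back into ordinary columns with reset_index(). A sketch with made-up values, using string names for the built-in aggregations:

```python
import pandas

# Stand-in data (made-up values)
df = pandas.DataFrame({
    'Group': ['BAC', 'BAC', 'SWI', 'SWI'],
    'Treatment': ['Dish', 'Tube', 'Dish', 'Tube'],
    'RelativeFitness': [1.63, 0.80, 1.45, 0.92],
})

stats = df.groupby(['Group', 'Treatment'])['RelativeFitness'].agg(['sum', 'mean'])

# The aggregated result is an ordinary DataFrame indexed by (Group, Treatment);
# reset_index() turns those index levels back into regular columns,
# giving a tidy table ready for plotting or export
tidy = stats.reset_index()
print(tidy)
```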

With this, the sky’s the limit. Although the example I used was for experimental evolution data using microbes, the same concepts can be used to split-apply-combine any other kind of data that can be grouped into categories.

##### About Brian Connelly
NSF postdoctoral research fellow working in the Kerr Lab at the University of Washington and the BEACON Center for the Study of Evolution in Action. I use computational, mathematical, and microbial systems to understand how environmental conditions influence the evolution of cooperation.