Python Pandas Pitfalls: hard lessons learned over time

Pitfall

I use python pandas all the time! I use it in jupyter notebooks for cleaning, profiling, and visualizing data, in scripts that load data into of read data from databases, in Docker containers running data-driven services, and on and on.

Over my time using pandas, I have had some great successes, but I have also caused myself a good bit of pain by not following some best practices. I wanted to share a couple of these scenarios with you, so you can avoid any related pain.

Note, the pandas website includes a great caveats and gotchas page that details many more potential hazards. Check it out.

Also, all the example dataframe and operations are recreated in the this ipython notebook.

Renaming Columns

There are many use cases for renaming columns in your dataframe. You may simply change your mind when working in a python shell, or you may want to rename column names that were inferred when reading in a CSV. In any event, let's assume that you have an existing dataframe, data_df, and you would like to rename the columns:

   old_column_1 old_column_2  old_column_3
0             1            a           1.2
1             8            z           3.5
2             4            s          34.7
3            23            b          12.1

But let's also assume that you did some kind of nifty transformations on this dataframe before renaming the columns, and these transformations included, for example, setting one column as an index and then resetting this index:

data_df = data_df.set_index('old_column_3').reset_index()

Now, data_df looks like this:

   old_column_3  old_column_1 old_column_2
0           1.2             1            a
1           3.5             8            z
2          34.7             4            s
3          12.1            23            b

As you can see, the ordering of the columns has been modified in my transformations. There are innumerable other ways to produce such a scenario, but the important thing is this: if you bang out a bunch of transformations with pandas dataframes, the resulting ordering of columns may not be what you expect. As such, you could fall into this first pitfall which is: brute force renaming columns with an array of column names.

Let me illustrate. In the above case, I want to rename old_column_1 to new_column_1, etc., and I might assume that the column ordering was preserved in my transformations. Under this assumption, I might try to rename the columns using:

data_df.columns = ['new_column_1', 'new_column_2', 'new_column_3']

If I did this, data_df would end up looking like this:

   new_column_1  new_column_2 new_column_3
0           1.2             1            a
1           3.5             8            z
2          34.7             4            s
3          12.1            23            b

The data from old_column_3 would actually get renamed to new_column_1. More importantly, any subsequent operations I was intending to apply to old_column_1's data would be applied to old_column_3's data. This could be a huge problem. If old_column_1 represented, say, company revenue and old_column_3 represented company costs, you may report negative profits to your boss when in actuality the company made money.

A better approach is to utilize pandas.DataFrame.rename. rename allows you to pass explicit dictionary of old to new column mappings that will correctly rename columns regardless of column ordering:

newcols = {
    'old_column_1': 'new_column_1', 
    'old_column_2': 'new_column_2', 
    'old_column_3': 'new_column_3'
}
data_df.rename(columns=newcols, inplace=True)

Then, we get what we want:

   new_column_3  new_column_1 new_column_2
0           1.2             1            a
1           3.5             8            z
2          34.7             4            s
3          12.1            23            b

Inserting Rows / Indices

Pandas makes it easy for you to set a column of a dataframe as an index. After doing this, you may want to reference a value in the dataframe at a particular index value or you may want to insert a new row with a corresponding new index value. These operations can trip you up if you're not careful. For example, let's take a similar dataframe to the one above:

   column_1 column_2  column_3
0         1        a       1.2
1         8        z       3.5
2         4        s      34.7
3        23        b      12.1

Now let's set the index as column_3.

          column_1 column_2
column_3                   
1.2              1        a
3.5              8        z
34.7             4        s
12.1            23        b

If we want to see the values in the row with index 34.7, we could use data_df.ix[34.7]. This yields:

column_1    4
column_2    s
Name: 34.7, dtype: object

Using some reasonable logic, we could assume, based on the above, that a good way to insert a new row with index 4.5 would be:

data_df.ix[4.5] = [17, 'e']

In fact, this would actually work (in this case):

          column_1 column_2
column_3                   
1.2              1        a
3.5              8        z
34.7             4        s
12.1            23        b
4.5             17        e

On the other hand, if we would like to insert a new row with an index value of an integer, e.g., 3, we would run into problems using this approach. Performing:

data_df.ix[3] = [9, 'y']

would not actually add a new row with index value 5, it would replace the third indexed row (zero indexed in python) with the given values:

          column_1 column_2
column_3                   
1.2              1        a
3.5              8        z
34.7             4        s
12.1             9        y
4.5             17        e

So, our second pitfall is: trying to reference a numerical valued index with the index value and accidentally overwriting another row of the dataframe. However, this can be avoided by using the loc function instead of ix:

data_df.loc[3] = [9, 'y']

which yields

          column_1 column_2
column_3                   
1.2              1        a
3.5              8        z
34.7             4        s
12.1             9        y
4.5             17        e
3.0              9        y

See more about indexing and selecting data in the pandas documentation here.

comments powered by Disqus