

In the initial stages of a project, we sometimes have to choose between storing data in Pandas DataFrames or in native Python lists of dictionaries.

Both data structures look similar enough to perform the same tasks - we can even look at a list of dictionaries as simply a less complex Pandas DataFrame (each row in a DataFrame corresponds to a dictionary in the list). The question then arises: given the increased complexity and overhead of a Pandas DataFrame, should we always default to using Python lists of dictionaries when performance is the primary consideration? We examine this through the use case of element-wise assignment*.

To run our experiment on real data, we will use a dataset containing the coordinates of all New York hotels. We will do the comparison using two different functions: a simple summation, and a Haversine function. The dataset, as well as the Haversine function we will use, is the same one used by Sofia Heisler in her PyCon 2017 presentation. You can also find the source code for this blog post on GitHub.

Comparison 1 - Summation

For the list, we will utilise a straightforward looping construct. Running both of the above with timeit, at 10 runs of 100 repeats each, returns the following result:

From the above, we can see that for summation, the DataFrame implementation is only slightly faster than the list implementation. The difference is much more pronounced for the more complicated Haversine function, where the DataFrame implementation is about 10x faster than the list implementation. Some further digging establishes the reason: Pandas implements additional optimisations in many use cases, some of them in C code.
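To make the summation comparison concrete, here is a minimal sketch of the two implementations being timed - a plain loop over a list of dictionaries versus a vectorised `Series.sum` on a DataFrame. The sample records are hypothetical stand-ins for the New York hotels dataset, and the exact timing harness is an assumption, not the post's original code.

```python
import timeit

import pandas as pd

# Hypothetical records standing in for the hotels dataset.
records = [{"latitude": 40.7 + i * 1e-4, "longitude": -74.0} for i in range(1000)]
df = pd.DataFrame(records)

def sum_list(rows):
    # Straightforward looping construct over the list of dicts.
    total = 0.0
    for row in rows:
        total += row["latitude"]
    return total

def sum_df(frame):
    # Vectorised summation; Pandas dispatches to optimised C code.
    return frame["latitude"].sum()

list_time = timeit.timeit(lambda: sum_list(records), number=100)
df_time = timeit.timeit(lambda: sum_df(df), number=100)
print(f"list: {list_time:.4f}s  dataframe: {df_time:.4f}s")
```

Absolute timings depend on the machine and dataset size; the point is only the relative gap between the looped and vectorised versions.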

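The Haversine comparison can be sketched the same way. The formula below is a standard NumPy implementation of great-circle distance (in miles); the coordinates and the reference point are hypothetical placeholders, not values from the actual hotels dataset. Written over NumPy ufuncs, the same function works element-by-element on dictionaries and vectorised over whole DataFrame columns.

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    # Great-circle distance in miles; accepts scalars or array-likes.
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 3959 * 2 * np.arcsin(np.sqrt(a))

# Hypothetical coordinates standing in for the hotels dataset.
hotels = [{"latitude": 40.7128, "longitude": -74.0060},
          {"latitude": 40.7589, "longitude": -73.9851}]
df = pd.DataFrame(hotels)

# List version: one Python-level call per dictionary.
list_dists = [haversine(40.671, -73.985, h["latitude"], h["longitude"])
              for h in hotels]

# DataFrame version: a single vectorised call over whole columns.
df_dists = haversine(40.671, -73.985, df["latitude"], df["longitude"])
```

The vectorised call avoids per-row Python overhead entirely, which is consistent with the roughly 10x gap reported above for the Haversine comparison.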