Hash-Caching query results in Python

Developing data science products will in most cases require extensive testing and tuning models using historic data. At Gousto this is the case for a number of optimisation and machine learning algorithms powering our operations and marketing efforts. In turn, the majority of these cases will involve large volume data queried from a database. In order to maintain agility in developing your data-driven product the last thing you want is to wait for queries to complete while testing iterations to your code.

To prevent re-running large queries we can construct a simple Python method to cross-reference a hashed SQL query against stored query results saved under hash-filenames. Keep reading to find out how!

At Gousto we use Amazon Redshift as our data-warehouse, so the code below is based on PostgreSQL. However, the general approach should work regardless of your data warehouse and (R)DBMS.

What we’re looking for is a method that will allow us to:

  • retrieve data from file if a particular query was previously executed
    or
  • open a connection to the relevant database, execute the query and save it to file for speedy retrieval in the future

Step 1: hashing the query

The question is, how do we efficiently recognise whether we have executed a particular query before? The answer is that we can use a hash function to map a query to a unique string or hash and name the resulting data frame after that hash.

The Python standard library contains the hashlib module that will do the trick for us. One of the common hash functions available in the module is the SHA1 hash function. Despite a recent discovery it will be fine for our purposes. The SHA1 function expects a string input, which means we need to encode our query to make it a valid input. Next, we can use the hexdigest() method to store the hash in a more compact string containing only hexadecimal digits. Assuming we can pass a saved SQL query as sql_query:

1
query_hash = hashlib.sha1((sql_query.encode()).hexdigest()

In developing the data feed-in to a particular model the query may change structurally, meaning the hash will change as well. Once you are tuning a model, however, only particular parameters in the query are likely to change, for example the time frame over which evaluate a query. We therefore need to ensure we hash queries complete with these parameters, passed to the format method as a dictionary:

1
query_hash = hashlib.sha1((sql_query.format(parameters).encode()).hexdigest()

Step 2: Caching your data

Next we want to check whether we already have data stored, e.g. as a csv, under a particular hash as its filename. To separate the data from our code, we’ll create a file_path to look for the data in a _cache folder in the same directory. If the filepath exist, we read the csv. Otherwise we will execute the query complete with parameter settings through an existing connection, create the cache folder and save the csv to the filepath. For this example we use Pandas methods for reading data from locally stored csv files and executing a query, but other approaches can be used to the same effect.

1
2
3
4
5
6
7
8
9
file_path = os.path.join("_cache","{}.csv".format(query_hash))

if os.path.exists(file_path):
df_raw = pd.read_csv(file_path)
else:
df_raw = pd.read_sql(sql_query, con=connection, params=parameters)
if not os.path.isdir("_cache"):
os.makedirs("_cache")
df_raw.to_csv(file_path, index=False)

Step 3: Handling the database connection

Finally, we’d like the query to be executed using an existing database connection if possible or set up a single-use connection otherwise, using some method open_connection() defined elsewhere in our code.

On top of that it is important to properly close the single-execution connection when the query is complete or close whichever connection we’re using in the event we break the process manually.

Pulling the previous snippets together and integrating some control flow to handle opening and closing the connection, we have the final result:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
import hashlib
import os.path

import pandas as pd

def execute_query(sql_query, parameters=None, conn=None):
"""
Method to query data from Redshift and return pandas dataframe

Parameters
----------
sql_query : str
saved SQL query
parameters : dict, optional
populates named placeholders in query template.
conn : database string URI, optional
connection created with open_connection()

Returns
-------
df_raw : DataFrame
Pandas DataFrame with raw data resulting from query
"""

# If no existing connection object is passed, open a new connection
new_conn = False
if conn is None:
connection = open_connection()
new_conn = True

# Hash the query
query_hash = hashlib.sha1((sql_query.format(parameters).encode()).hexdigest()

# Create the filepath
file_path = os.path.join("_cache","{}.csv".format(query_hash))

# Read the file or execute query
if os.path.exists(file_path):
df_raw = pd.read_csv(file_path)
else:
try:
df_raw = pd.read_sql(sql_query, con=connection, params=parameters)
except (KeyboardInterrupt, SystemExit):
connecton.close()
if not os.path.isdir("_cache"):
os.makedirs("_cache")
df_raw.to_csv(file_path, index=False)

# Close single-execution connection
if new_conn:
connection.close()

return df_raw

As we grow and develop our in-house toolsets we will be looking to open-source some of our work as a python package in the future, so watch this space!

Marc Jansen, Data Scientist
Dejan Petelin, Head of Data Science