The problem
Let’s say you have a table with millions of rows (on the order of hundreds of millions). In this scenario, you should think twice before issuing a query against such a table. It’s usually not acceptable to do something like this:
SELECT *
FROM my_big_table
ORDER BY some_column

The execution time of such a query will be huge, and your client won’t see anything until this query is done processing and you have the results in the application. So how can we execute these queries without too much of a performance penalty?
Using OFFSET / LIMIT as a solution
A common alternative is to paginate the query, like so:
SELECT *
FROM my_big_table
OFFSET 0
LIMIT 5000

And then do …
SELECT *
FROM my_big_table
OFFSET 5000
LIMIT 5000

And so on, until nothing gets returned.
This works, with a few caveats. First, in order to effectively use the offset and limit combination, you should use an ORDER BY clause in your query.
This is because most common databases do not guarantee the order of results, so different offsets can actually return repeated (or skipped) rows. How expensive it is to add ordering to these queries depends on how big your table is, its indexes, and so on. In Redshift, for example (a data warehouse provided by AWS), such a query is only acceptable if you include the SORTKEYs in your ORDER BY clause.
Another problem with this approach is that you have to issue multiple queries in this format, where the only thing that changes is the OFFSET. Every round trip to the database adds overhead, so avoiding database calls when possible is always preferable. Worse, OFFSET itself gets more expensive as it grows, because the database still has to produce and discard all the skipped rows. If the query itself is not too complicated and is well restricted, this cost may be acceptable. Sometimes, though, it is not.
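In Ruby with the pg gem, such a pagination loop might look like this (a sketch; the database name, table, sort column, and the process helper are all assumptions):

```ruby
require "pg"

PAGE_SIZE = 5000
connection = PG.connect(dbname: "my_database") # assumed database name

offset = 0
loop do
  page = connection.exec_params(
    "SELECT * FROM my_big_table ORDER BY some_column LIMIT $1 OFFSET $2",
    [PAGE_SIZE, offset]
  )
  break if page.ntuples.zero?

  page.each { |row| process(row) } # process is a hypothetical row handler
  offset += PAGE_SIZE
end
```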
Using Streaming as a solution
Another approach that I recently played around with at work is streaming the query results. With streaming, the entire result of a query is not loaded into memory at once; instead, rows are transferred to your application as soon as they are produced internally by the database.
Here’s how to do streaming in Ruby + Postgres using the pg gem:
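A minimal sketch (assuming `require "pg"` has already been done; the database name, query, and process helper are placeholders):

```ruby
connection = PG.connect(dbname: "my_database") # assumed database name
query = "SELECT * FROM my_big_table"

connection.send_query(query)
connection.set_single_row_mode

connection.get_result.stream_each do |row|
  process(row) # hypothetical per-row handler
end
```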
Let’s go over this code line by line. On line 1 we open the Postgres connection. If you’re using Ruby on Rails, this should become ActiveRecord::Base.connection.raw_connection (the raw_connection part bypasses the Postgres connection adapter used by ActiveRecord; this is important in Rails because the adapter, as of this writing, does not support this functionality by default).
Line 4 sends the specified query to the database and returns immediately (you won’t have access to the results right away).
Line 5 is the most important bit in this code. set_single_row_mode tells Postgres to return the results of the query one row at a time, as they are produced by the server. Be aware that this only works if you have previously called send_query.
connection.get_result will load the first row (wrapped in the usual PG::Result object). Note that this loads only one row! To fully execute the query, we need to consume every row it returns, which would mean calling .get_result possibly a million times before it returns nil (a nil return value from get_result means that the query finished successfully and there are no more rows left to load).
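Spelled out, that manual loop would look something like this (a sketch; process is a hypothetical handler):

```ruby
connection.send_query("SELECT * FROM my_big_table")
connection.set_single_row_mode

# In single row mode, each get_result call yields a PG::Result holding
# one row, until a final empty result and then nil signal completion.
while (result = connection.get_result)
  result.check # raises if the server reported an error
  result.each { |row| process(row) }
end
```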
To avoid repeated calls to get_result, we use stream_each, which loads each row from the result and yields it to the application. Also note that the connection goes back to normal mode (loading the entire result set of a query at once) after this query finishes.
You can now respond to the client as soon as you get the first few rows and build your response incrementally. This is a good alternative to the offset/limit combo, which hits the database with several separate queries, possibly opening and closing different database connections along the way.
The drawbacks
Unfortunately, this doesn’t eliminate memory problems: if you accumulate the streamed rows, you can still end up with millions of them in your application’s memory.
Another issue is that you must ensure proper transactional behaviour while processing the results in single row mode. Say you load (and do something with) the first 10 rows, and then an error occurs: perhaps your application raises an exception, or an internal database error happens. The query doesn’t necessarily abort, and no rollback command is issued, which means that everything you did with the earlier rows still stands. In these cases you need some way of undoing the changes you made with the previous rows, otherwise your database and/or application state may end up corrupted or invalid.
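One way to protect yourself is to track what you have already processed so you can compensate on failure (a sketch; handle and undo_changes are hypothetical, and whether a database transaction could help instead depends on where your side effects happen):

```ruby
processed_ids = []
begin
  connection.send_query("SELECT * FROM my_big_table")
  connection.set_single_row_mode

  connection.get_result.stream_each do |row|
    handle(row)                # hypothetical side-effecting handler
    processed_ids << row["id"] # remember which rows we've already touched
  end
rescue => error
  connection.discard_results  # drain any rows still pending on the connection
  undo_changes(processed_ids) # hypothetical compensation for the processed rows
  raise error
end
```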
At the end of the day though, it is always a matter of trade-offs. You need to analyse the situation and figure out which approach works best for your application, your clients and your needs. This is what a software engineer does.
Nowadays I work at Runtime Revolution. Working here has been, and continues to be, a great learning experience. I’ve matured professionally as a developer, focusing on building and maintaining large scale Ruby on Rails applications.