Build SQL queries with temporary table vs WHERE IN
When writing SQL queries to check if a value matches against a list of given values, we normally use the IN
operator. For example, suppose we have a requests
table, and it contains an account_id
column that is referencing accounts
table. If we want to find a list of requests which only belongs to certain accounts, we would write a SQL query like:
SELECT requests.* FROM requests WHERE requests.account_id IN (1,2,3, ...)
Where (1,2,3, ...)
represents a list of account IDs that we care about.
If we write this using Rails’ Active Record, it would be something like:
account_ids = [1, 2, 3, ...]
Request.where(account_id: account_ids)
Performance issue may come, when the size of the list becomes large. Moreover, some people found it’s slower than using ANY
when the query includes subquery in Postgres. Meanwhile, things could get even worse when there is no index on the matching column (requests.account_id
in our example) or the query planner of the database system fails to use the index.
Temporary Table
As someone mentioned in the above links, instead of using IN
or ANY
, we can create a TEMP TABLE
and then join this table to find matched records. In the following section, I’m going to give you an example on how to write SQL query using temporary table in Ruby.
If we use the same example as described above, the first step is to create a temporary table to hold the selected account records. In PostgreSQL, the query looks like this:
CREATE TEMP TABLE selected_accounts (
"account_id" INT8,
CONSTRAINT "id_pkey" PRIMARY KEY ("account_id")
)
ON COMMIT DROP
and then the next step is to insert the account ids into this table:
INSERT INTO selected_accounts VALUES (1), (2), (3), ...
Note that there are many other ways that might be more efficient to load the data than the above one. Such as using COPY
, which I will show you later.
Finally, you can find the results by joining it with requests
table:
SELECT requests.* FROM requests INNER JOIN selected_accounts ON requests.account_id = selected_accounts.account_id
We need to combine these 3 steps into one transaction, and since we set ON COMMIT DROP
when creating the temp table, the table is automatically dropped at the end of the transaction.
We can wrap the entire temp table query in one transaction in Ruby on Rails with Postgres, like this:
db_connection = ActiveRecord::Base.connection.raw_connection
db_connection.transaction do
db_connection.exec(%{
CREATE TEMP TABLE selected_accounts (
"account_id" INT8,
CONSTRAINT "id_pkey" PRIMARY KEY ("account_id")
)
ON COMMIT DROP;
})
db_connection.copy_data('COPY selected_accounts (account_id) FROM STDIN') do
db_connection.put_copy_data([1, 2, 3, ...])
end
db_connection.exec %{
SELECT requests.* FROM requests INNER JOIN selected_accounts
ON requests.account_id = selected_accounts.account_id
}
end
The above code also shows you how to use COPY
command to load the data, which may be more practical in real use cases, since most of the time, you will be dealing with a large number of records.
The code uses database connection from ActiveRecord::Base
, but in fact, you don’t even need to use Rails to make the above query work. For PostgreSQL, we can use pg gem to connect to database:
db_connection = PG.connect(host: 'localhost', dbname: 'xxx', account: 'xxx', password: 'xxx')
db_connection.transaction do
... ...
end
Benchmark
The SQL query that uses temporary table looks complicate and requires more steps, but how about the performance comparing to IN
or ANY
? I wrote a benchmark script to give you an idea on the performance of each approach. You can find it on Github.
By default, the code generates 500,000 rows in a requests
table for 5,000 different account IDs, and the account IDs are just integer numbers, but you can play with different numbers to see how the performance goes.
The testing queries try to get records based on a select list of account IDs by using WHERE IN ()
, WHERE ANY ()
and TEMP TABLE
, and it’s a little different than the above example. The benchmark runs against PostgreSQL 9.4.5, and the results are shown as below:
With index on requests.account_id:
user system total real
use_where_in 0.000000 0.000000 0.000000 ( 0.242052)
use_where_any 0.010000 0.000000 0.010000 ( 0.219573)
use_tmp_table 0.000000 0.000000 0.000000 ( 0.193532)
Without index on requests.account_id:
user system total real
use_where_in 0.000000 0.000000 0.000000 ( 6.290538)
use_where_any 0.000000 0.000000 0.000000 ( 0.192422)
use_tmp_table 0.000000 0.000000 0.000000 ( 0.193068)
As you can see, in our example, if there’s no index on account_id
column, WHERE IN ()
approach gives you the worst performance, but if index is added on account_id
, the performance is pretty close. Check out the repository to get more detailed information.
Conclusion
The performance may vary in different versions of Postgres, or different SQL queries, especially when there are complicated expressions in subquery, so whether use WHERE IN
or TEMP TABLE
depends on your own use cases. This post mainly tries to show you how to use temporary table.
Query with temp WHERE IN
or WHERE ANY
requires less code than TEMP TABLE
, but TEMP TABLE
seems to have more stable performance in most of the cases. There is also an old post talking about problems using WHERE IN
that is worth to read. Overall, it’s better to run some tests on your use cases before making the decision, and don’t forget to use command like EXPLAIN ANALYZE
to verify if index is properly used.