Build SQL queries with temporary table vs WHERE IN

When writing SQL queries to check if a value matches against a list of given values, we normally use the IN operator. For example, suppose we have a requests table, and it contains an account_id column that is referencing accounts table. If we want to find a list of requests which only belongs to certain accounts, we would write a SQL query like:

SELECT requests.* FROM requests WHERE requests.account_id IN (1,2,3, ...)

Where (1,2,3, ...) represents a list of account IDs that we care about.

If we write this using Rails’ Active Record, it would be something like:

account_ids = [1, 2, 3, ...]
Request.where(account_id: account_ids)

Performance issue may come, when the size of the list becomes large. Moreover, some people found it’s slower than using ANY when the query includes subquery in Postgres. Meanwhile, things could get even worse when there is no index on the matching column (requests.account_id in our example) or the query planner of the database system fails to use the index.

Temporary Table

As someone mentioned in the above links, instead of using IN or ANY, we can create a TEMP TABLE and then join this table to find matched records. In the following section, I’m going to give you an example on how to write SQL query using temporary table in Ruby.

If we use the same example as described above, the first step is to create a temporary table to hold the selected account records. In PostgreSQL, the query looks like this:

CREATE TEMP TABLE selected_accounts (
  "account_id" INT8,
  CONSTRAINT "id_pkey" PRIMARY KEY ("account_id")
)
ON COMMIT DROP

and then the next step is to insert the account ids into this table:

INSERT INTO selected_accounts VALUES (1), (2), (3), ...

Note that there are many other ways that might be more efficient to load the data than the above one. Such as using COPY, which I will show you later.

Finally, you can find the results by joining it with requests table:

SELECT requests.* FROM requests INNER JOIN selected_accounts ON requests.account_id = selected_accounts.account_id

We need to combine these 3 steps into one transaction, and since we set ON COMMIT DROP when creating the temp table, the table is automatically dropped at the end of the transaction.

We can wrap the entire temp table query in one transaction in Ruby on Rails with Postgres, like this:

db_connection = ActiveRecord::Base.connection.raw_connection
db_connection.transaction do
  db_connection.exec(%{
    CREATE TEMP TABLE selected_accounts (
      "account_id" INT8,
      CONSTRAINT "id_pkey" PRIMARY KEY ("account_id")
    )
    ON COMMIT DROP;
  })
  db_connection.copy_data('COPY selected_accounts (account_id) FROM STDIN') do
    db_connection.put_copy_data([1, 2, 3, ...])
  end
  db_connection.exec %{
    SELECT requests.* FROM requests INNER JOIN selected_accounts
    ON requests.account_id = selected_accounts.account_id
  }
end

The above code also shows you how to use COPY command to load the data, which may be more practical in real use cases, since most of the time, you will be dealing with a large number of records.

The code uses database connection from ActiveRecord::Base, but in fact, you don’t even need to use Rails to make the above query work. For PostgreSQL, we can use pg gem to connect to database:

db_connection = PG.connect(host: 'localhost', dbname: 'xxx', account: 'xxx', password: 'xxx')
db_connection.transaction do
  ... ...
end

Benchmark

The SQL query that uses temporary table looks complicate and requires more steps, but how about the performance comparing to IN or ANY? I wrote a benchmark script to give you an idea on the performance of each approach. You can find it on Github.

By default, the code generates 500,000 rows in a requests table for 5,000 different account IDs, and the account IDs are just integer numbers, but you can play with different numbers to see how the performance goes.

The testing queries try to get records based on a select list of account IDs by using WHERE IN (), WHERE ANY () and TEMP TABLE, and it’s a little different than the above example. The benchmark runs against PostgreSQL 9.4.5, and the results are shown as below:

With index on requests.account_id:
                   user     system      total        real
use_where_in    0.000000   0.000000   0.000000 (  0.242052)
use_where_any   0.010000   0.000000   0.010000 (  0.219573)
use_tmp_table   0.000000   0.000000   0.000000 (  0.193532)

Without index on requests.account_id:
                    user     system      total        real
use_where_in    0.000000   0.000000   0.000000 (  6.290538)
use_where_any   0.000000   0.000000   0.000000 (  0.192422)
use_tmp_table   0.000000   0.000000   0.000000 (  0.193068)

As you can see, in our example, if there’s no index on account_id column, WHERE IN () approach gives you the worst performance, but if index is added on account_id, the performance is pretty close. Check out the repository to get more detailed information.

Conclusion

The performance may vary in different versions of Postgres, or different SQL queries, especially when there are complicated expressions in subquery, so whether use WHERE IN or TEMP TABLE depends on your own use cases. This post mainly tries to show you how to use temporary table.

Query with temp WHERE IN or WHERE ANY requires less code than TEMP TABLE, but TEMP TABLE seems to have more stable performance in most of the cases. There is also an old post talking about problems using WHERE IN that is worth to read. Overall, it’s better to run some tests on your use cases before making the decision, and don’t forget to use command like EXPLAIN ANALYZE to verify if index is properly used.