find_in_batches, find_each

In Rails, if you need to process a lot of records, it’s good to lean on tools like find_in_batches or find_each so that you don’t load too many records into memory all at once. Under the hood, this performs queries with a LIMIT so that you get a limited number of records at a time.

The query itself will sort by id and fetch records whose id is greater than the max id of the previous “batch”. Then it will repeat for the next batch.

Here is an example query, and it is efficient:

SELECT * FROM table WHERE id > max_id_from_last_batch ORDER BY id LIMIT 1000;

However, if you want to restrict by other where conditions on other columns, you may encounter a performance bottleneck if you don’t have the right indices.

For example, suppose you have another column account_id and you want to restrict your results only to records for a given account_id.

The query would end up looking something like this:

SELECT * FROM table WHERE id > max_id_from_last_batch AND account_id = 25 ORDER BY id LIMIT 1000;

Suppose you have an index on account_id. Your DB engine can use either the index on account_id or it can use the intrinsic index on id.

If it uses the account_id index, it has to scan through all of the records for the given account to find ones that meet the id > max_id_from_last_batch filter. If you have a large table, each batch you query takes longer and longer to find.

If it uses the id index, it has to scan through a ton of records across multiple accounts to find ones that pertain to the given account_id. This can take a while if you have a lot of accounts. You might have to scan through millions of records to find 1,000 for the given account.

So either way is inefficient.

The solution (in this case) is to add a composite index on id and account_id. While this might seem counterintuitive, it can provide a dramatic speed improvement for find_in_batches or find_each queries.

This info is based on real-life investigation in PostgreSQL. Your results may vary in other DB’s.

Bonus Tip: In Rails 4, find_in_batches and find_each don’t automatically disable the ActiveRecord query cache. As a result, you can have huge memory bloat if you’re looping through tons of batches in a given process. To avoid this, use an explicit ActiveRecord::Base.uncached block.