ActiveRecord Enumerable
February 3rd, 2007 by michael
Ever wanted to iterate over a huge number of objects and found
that find(:all) doesn’t cut it? Then this plugin
might be for you. Install it with
$ script/plugin install svn://rubyforge.org/var/svn/ar-enumerable/ar_enumerable/trunk
First, a note of caution. This plugin has not seen much practical
use yet, in its current state it is only slightly above a
proof-of-concept. There are several tests included that cover most
of the functionality. If you intend to use this plugin, please
make sure to independently check that it does what you want.
Enumerating ActiveRecord objects
Let’s start with a boring example, to get the general idea
Person.where(['lastname = ?', 'Smith']).map(&:firstname)
Slightly more interesting
Person.where(['lastname = ?', 'Smith']).inject(0) do |income, person|
income += person.income
end
siblings = Person.where(
['lastname = ?', 'Smith'],
:include => :siblings).inject([0, 0]) do |memo, person|
memo[0] += person.siblings.count
memo[1] += 1
memo
end
siblings[0].to_f / siblings[1]
The point here is that your database may contain millions of Smiths,
for what it’s worth, the entire population of the earth could have
turned into Smiths, still, only a limited number of them would be
loaded into memory at the same time.
Internally, enumeration works in two completely different ways.
By default, a chunk of objects is retrieved from the database using
OFFSET and LIMIT clauses. This way is supported by all databases,
however it can be slow.
Therefore, the preferred alternative is to use a standard SQL cursor
for iterating over the result set of a query. It would simply look like
this
Person.where(['lastname = ?', 'Smith'], :use_cursor => true).inject(0) do |income, person|
income += person.income
end
A further parameter is how many records (or rows, in the case of cursors)
are fetched from the database in one access. The default is 1000. You can
set it explicitly like this
Person.where(['lastname = ?', 'Smith'], :fetch_count => 100).inject(0) do |income, person|
income += person.income
end
Performance Considerations
Some informal tests suggest that for large result sets without
included associations cursor-based enumeration can be faster than
OFFSET/LIMIT-based enumeration by a factor of 3 to 5.
When associations are included, things become more blurry. In this
case, cursor-based enumeration imposes an ordering on the result
set that can slow down things considerably.
The fetch count has considerable influence on performance, basically,
you can buy speed by spending memory.

Just a reminder, you don’t need to write the equal sign in the following:
Person.where(['lastname = ?', 'Smith']).inject(0) do |income, person|
income += person.income
end
Since the *value* of the block is what’s being stored in ‘income’. So the example should look like this:
Person.where(['lastname = ?', 'Smith']).inject(0) do |income, person|
income + person.income
end
thanks michael!
Michael -
This seems like it would be very helpful for a data validation suite I’m attempting to write using ActiveRecord. A couple of questions for you:
1. Can I install this plugin outside of the rails framework? I’m currently only using the ActiveRecord gem in my work.
2. Is there support for manually controlling iteration across the objects? For example, during data validation I’d like to manually compare records from two tables (both of which I’d like to access using your plugin, as they are quite large) and then in some case move to the next object in either 1 or both of the data sets. I need to manually control when the next object is yielded and need to iterate across two tables at the same time. If this is not supported, is there a simple way for me to create some type of where_two method which would give me this functionality?
Seems like a great start on a problem with ActiveRecord that has bothered me for a while
-Drew
Drew: I’ve never used ActiveRecord on its own, but I’m pretty sure you can use the plugin with it, without the rest of the Rails frameworks. In your setting, plugins are not loaded and initialized automatically, therefore you have to make sure yourself, that the plugin’s lib directory is added to the load path (
$:.unshift ...) and thatinit.rbis called.Regarding coordinated iteration over two tables, I haven’t thought of this before, and I don’t have a solution handy. Have a look at the standard Ruby Enumerator and Generator libraries, they may help you get what you need.
Be sure to wrap a transaction block around the combined iteration! ActiveRecord Enumerable internally uses a transaction block, but you have to ensure that all iteration/cursor are in the same single transaction.
I gather that Enumerator/Generator use some
callccmagic to jump out of and back into a scope. I’m not completely sure that this won’t mess up things in some way.Michael -
I’ve been using this plugin on a test basis and I like what I see. Another simple question: can we get a gem (for non-rails work) of this app? I think you’d be surprised the amount of people that are looking for something lik this to use with some type of statistical analysis of data outside rails using activerecord.
Thanks,
Drew
Drew: A gem on its own wouldn’t do much good. The enumeration functionality must be inserted into ActiveRecord and that’s something a gem can’t do by itself.
If you want to use the AR Enumeration in your app and the Rails plugin loading mechanism is not available, you need to do two things: First, make sure that the directory containing
ar_enumerable.rbis in the load path (the Ruby global variable$:). Then execute the code ininit.rb.Arthur Lyman asked:
Yes, the plugin is available, I just checked it. You have to make sure that you use the correct URL, in particular, it has to start with “svn:” not “Svn:”, i.e.
svn://rubyforge.org/var/svn/ar-enumerable/ar_enumerable/trunk[...] with the size of the dataset. A (possibly MySQL specific) solution described by both Jamis Buck and Michael Schuerig relies on splitting the dataset into smaller chunks based on the primary key. Since the primary key [...]