Last year, I worked on a project in which I needed to parse and handle large .xlsx files. Some of those files had more than 200K rows. I was looking for a gem that could efficiently do the job.
I wanted a simple solution that would just iterate over each row, parse its contents, and save the data to database tables.
When I called SimpleXlsxReader.open(file_path), memory consumption would jump to 2 GB and stay there until the garbage collector cleaned it up, because the whole file was parsed into memory at once.
What I needed was a library that would read Excel rows in streams. At that time, none of the gems I had tried had that ability. Then I found a gem called Creek that does exactly that – stream parsing of large Excel files. It is simple and works well. Memory consumption while parsing these files dropped to under 200 MB, which is acceptable. In the meantime, Roo has been updated and now has the same ability.
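The difference can be sketched with plain Ruby enumerators: a non-streaming reader materializes every row up front, while a streaming reader yields rows one at a time, so memory use stays flat no matter how large the file is. The row data here is made up for illustration:

```ruby
# Non-streaming: the whole sheet is built in memory at once.
eager_rows = (1..200_000).map { |i| ["row #{i}", i] }

# Streaming: rows are produced lazily, on demand, which is what
# Creek does with the underlying XML of an .xlsx file.
lazy_rows = (1..200_000).lazy.map { |i| ["row #{i}", i] }

# Only the rows we actually consume are ever materialized.
first_three = lazy_rows.first(3)
```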
Also, to reduce the number of SQL queries, I implemented saving data in batches – 1,000 rows per query. My method reads a thousand rows, builds an array of ActiveRecord objects, and then saves them all at once. I used the activerecord-import gem for this purpose.
Here's the importer class:

```ruby
class XlsxImporter
  BATCH_IMPORT_SIZE = 1000

  def initialize(file_path)
    @file_path = file_path
    @records = []
    @counter = 1
  end

  def import
    rows.each do |row|
      records << build_new_record(row)
      import_records if reached_batch_import_size? || reached_end_of_file?
      self.counter += 1
    end
  end

  private

  attr_reader :file_path, :records
  attr_accessor :counter

  def book
    @book ||= Creek::Book.new(file_path)
  end

  def rows
    # in this example, we assume that the
    # content is in the first Excel sheet
    @rows ||= book.sheets.first.rows
  end

  def row_count
    @row_count ||= rows.count
  end

  def build_new_record(row)
    # only build a new record without saving it
  end

  def import_records
    # save multiple records using activerecord-import gem
    # clear records array
  end

  def reached_batch_import_size?
    (counter % BATCH_IMPORT_SIZE).zero?
  end

  def reached_end_of_file?
    counter == row_count
  end
end
```
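The batch arithmetic itself can be sketched in plain Ruby: `each_slice` produces the same grouping that the counter-based checks implement, and with activerecord-import each group becomes a single `Model.import(records)` call, i.e. one multi-row INSERT. The numbers here are illustrative:

```ruby
rows = (1..2_500).to_a

# Group rows into batches of 1,000; in the importer, each batch
# would be flushed to the database with one Model.import call.
batch_sizes = []
rows.each_slice(1_000) { |batch| batch_sizes << batch.size }

batch_sizes  # => [1000, 1000, 500] – 3 queries instead of 2,500
```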
Finally, this task runs asynchronously in a background job, since it's too heavy to handle in the web process.
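A minimal sketch of the job wrapper, assuming ActiveJob. The class and importer names are hypothetical; in a real Rails app the class would inherit from ApplicationJob and be enqueued with perform_later, but here it is a plain Ruby object so the shape is self-contained:

```ruby
# Plain-Ruby stand-in for the background job; in Rails this would be
# `class XlsxImportJob < ApplicationJob` enqueued via perform_later.
class XlsxImportJob
  def perform(file_path)
    # In the app this would run the streaming importer, e.g.:
    # XlsxImporter.new(file_path).import
    "imported #{file_path}"
  end
end

result = XlsxImportJob.new.perform("catalog.xlsx")
```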