Last year, I worked on a project in which I needed to parse and handle large .xlsx files. Some of those files had more than 200K rows. I was looking for a gem that could efficiently do the job.
I wanted a simple solution that would just iterate over each row, parse the contents of the row and save data in database tables.
When I called SimpleXlsxReader.open(file_path), memory consumption would go to 2GB and it would stay that way until the garbage collector cleaned it.
What I needed was a library that would read Excel rows in streams. At that time, none of the mentioned gems had that ability. I found a gem called Creek that does only that - stream parsing large excel files. It is simple and works well. Memory consumption while parsing these files was now under 200 MB, which is acceptable. In the meantime, Roo was updated and now it has the same ability.
Also, to reduce the number of SQL queries, I implemented saving data in batches - 1000 rows in 1 query. So, my method reads a thousand rows, makes an array of ActiveRecord objects and then saves them at once. I used the gem called ActiveRecord Import for this purpose.
class ExcelDataParser def initialize(file_path) @file_path = file_path @records =  @counter = 1 end BATCH_IMPORT_SIZE = 1000 def call rows.each do |row| increment_counter records << build_new_record(row) import_records if reached_batch_import_size? || reached_end_of_file? end end private attr_reader :file_path, :records attr_accessor :counter def book @book ||= Creek::Book.new(file_path) end # in this example, we assume that the # content is in the first Excel sheet def rows @rows ||= book.sheets.first.rows end def increment_counter self.counter += 1 end def row_count @row_count ||= rows.count end def build_new_record(row) # only build a new record without saving it RecordModel.new(...) end def import_records # save multiple records using activerecord-import gem RecordModel.import(records) # clear records array records.clear end def reached_batch_import_size? (counter % BATCH_IMPORT_SIZE).zero? end def reached_end_of_file? counter == row_count end end
Finally, this task is being executed asynchronously in a background job since it's too heavy to handle in the web process.