Dealing with various sources of data in web applications requires us to create services that will extract information from CSV, Excel, and other file types. In that case, it’s best to use some existing libraries, or, if your backend is on Rails, use gems. There are many gems with very cool features, like CSVImporter and Roo. But you can also use plain Ruby CSV.
Either way, if those are small CSV files, you will get your job done easily. But what if you need to import large CSV files (~100MB / ~1M rows)?
The Problem
For a recent project I worked on, an external system would send a CSV file containing 200k rows every 15 minutes. The aforementioned solutions were simply not good enough; they were slow and ate up a bunch of RAM.
I knew I had to find a more efficient solution.
First Try: CSVImporter gem
Because of its simplicity, I first decided to use the CSVImporter gem. After installing the gem, the importer class was defined like this:
class ForecastCSVImporter
  include CSVImporter

  model Forecast

  column :property
  column :date
  column :value
  column :type
  column :location
  column :created_at
end
Executing an import for a specific file was done like this:
ForecastCSVImporter.new(path: file_path).run!
The implementation was simple and it worked really well on a test CSV file. Then I decided to test it in production, but it was too slow. Importing took more than 2 hours, which was fairly bad. The main problem was that each CSV row had to be converted into an ActiveRecord model and persisted with #create.
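In effect, a row-by-row import boils down to a loop like the following (a simplified sketch, not the gem’s actual code, assuming the CSV headers match the column names):

require 'csv'

# One ActiveRecord object, one round of validations and one INSERT per row;
# this is what makes the import slow on large files.
CSV.foreach('/tmp/forecast.csv', headers: true) do |row|
  Forecast.create!(row.to_h)
end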
I tried many different gems and also wrote a custom CSV importer using plain Ruby CSV with batch insert commands. The performance improvement was noticeable and the import time was reduced to around 45 minutes, but it was still too slow. I felt I still wasn’t on the right track.
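For illustration, the batched variant looked roughly like this (a sketch assuming Rails 6+’s insert_all; activerecord-import’s Forecast.import is an older alternative, and the real importer differed in details):

require 'csv'

BATCH_SIZE = 1_000
rows = []

# Buffer parsed rows and flush them as one multi-row INSERT per batch,
# skipping ActiveRecord instantiation and validations entirely.
CSV.foreach('/tmp/forecast.csv', headers: true) do |row|
  rows << row.to_h
  if rows.size >= BATCH_SIZE
    Forecast.insert_all(rows)
    rows.clear
  end
end
Forecast.insert_all(rows) if rows.any?

Most of the gain comes from issuing one INSERT per thousand rows instead of one per row, and from skipping validations and callbacks.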
Solution – Use COPY
After all of these attempts, I finally gave up on Ruby solutions and looked for help from my database.
I found that PostgreSQL has a really powerful yet very simple command called COPY, which copies data between a file and a database table. It can be used both ways:
- to import data from a CSV file into a database table
- to export data from a database table to a CSV file
Example of usage:
COPY forecasts
FROM 'tmp/forecast.csv'
CSV HEADER;
This piece of SQL code will import the content from a CSV file into our forecasts table. Note one thing: it’s assumed that the number and order of columns in the table is the same as in the CSV file.
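If you want to trigger the same command from a Rails app, you can send it through the connection. This is a minimal sketch; keep in mind that with COPY FROM a file, the path is read by the PostgreSQL server process, so the file has to live on the database server and the database role needs permission to read files there:

# Run the raw COPY command through the ActiveRecord connection.
sql = <<~SQL
  COPY forecasts
  FROM '/tmp/forecast.csv'
  CSV HEADER;
SQL

ActiveRecord::Base.connection.execute(sql)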
Results
Importing a CSV file with ~1M rows now takes under 4 seconds which is blazing fast when compared to previous solutions!
Library Support
Usually, when writing application code, it’s good to avoid writing raw SQL. By default, ActiveRecord doesn’t support the COPY command, but there is a gem which takes care of that. It’s called postgres-copy. The gem provides a simple interface for copying data between a database table and a CSV file.
Let’s see an example:
# Enable COPY command on Forecast model
class Forecast < ActiveRecord::Base
  acts_as_copy_target
end
# Run export of table data to a file
Forecast.copy_to '/tmp/forecast.csv'

# Run import from a CSV file to database
Forecast.copy_from '/tmp/forecast.csv'
The SQL commands generated by the previous calls are:
-- Forecast.copy_to '/tmp/forecast.csv'
COPY (SELECT "forecasts".* FROM "forecasts")
TO '/tmp/forecast.csv'
WITH DELIMITER ';' CSV HEADER

-- Forecast.copy_from '/tmp/forecast.csv'
COPY forecasts
FROM '/tmp/forecast.csv'
Data Manipulation
The COPY command is simple and super fast. However, it has restrictions in some advanced scenarios when importing from CSV:
- you must use all of the columns from a CSV file
- problems arise if you want to manipulate the data before it is inserted into the database table.
You can specify a mapping between CSV columns and table columns, which means the attributes can come in a different order in the CSV file than in the database table, but every column from the CSV file still has to be used.
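For example, assuming the CSV file contains only location, date and value columns (illustrative names) in that order, you can tell COPY which table columns they map to:

# Explicit column list: the CSV columns are loaded into the listed table
# columns, in order; every column present in the CSV must still be listed.
ActiveRecord::Base.connection.execute(<<~SQL)
  COPY forecasts (location, date, value)
  FROM '/tmp/forecast.csv'
  CSV HEADER;
SQL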
These problems are common, but as always, there are workarounds.
The first one is to create a temporary table into which you import the original data from the CSV file. After the COPY command has inserted all the data from the CSV file, you can run a custom INSERT command to transfer the data from the temporary table to your target table. Within that INSERT command, you can easily perform data manipulation.
Let’s see an example:
COPY forecasts_import
FROM 'tmp/forecast.csv';
INSERT INTO forecasts
SELECT location_id::int, value, forecast_type, DATE(created_at)
FROM forecasts_import;
DELETE FROM forecasts_import;
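From a Rails app, the whole staging flow can be wrapped in a single transaction. The following is just a sketch; it assumes a forecasts_import table with the same layout as the CSV file already exists:

ActiveRecord::Base.transaction do
  conn = ActiveRecord::Base.connection

  # Load the raw CSV data into the staging table.
  conn.execute("COPY forecasts_import FROM '/tmp/forecast.csv' CSV HEADER")

  # Transform the data while moving it into the target table.
  conn.execute(<<~SQL)
    INSERT INTO forecasts
    SELECT location_id::int, value, forecast_type, DATE(created_at)
    FROM forecasts_import;
  SQL

  # Clear the staging table for the next run (TRUNCATE is usually faster).
  conn.execute("DELETE FROM forecasts_import")
end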
A second approach, when the attributes in the CSV file and in our table don’t match, is to have COPY read the data from standard input and manipulate it in our application:
COPY forecasts
FROM STDIN;
In that case, you need to manually read the data, row by row, from the CSV file (e.g. by using plain Ruby CSV) and send it to the STDIN of your database connection.
require 'csv'

db_conn = ActiveRecord::Base.connection.raw_connection
copy_statement = 'COPY forecasts FROM STDIN'
file_path = '/tmp/forecast.csv'

# The CopyRow encoder turns each array passed to put_copy_data into a row in
# COPY's text format; the values must match the table's column order.
encoder = PG::TextEncoder::CopyRow.new

db_conn.copy_data(copy_statement, encoder) do
  CSV.foreach(file_path, headers: true) do |row|
    db_conn.put_copy_data(row.fields + [Time.zone.now])
  end
end
Although the import process goes through Ruby, there is no overhead of instantiating ActiveRecord objects or running validations. This is a bit slower than letting the database read the file directly, but it’s still very fast.
Can I use this if I’m not on PostgreSQL?
In addition to PostgreSQL, other databases also support native CSV importing.
For example, Microsoft SQL Server has the BULK INSERT SQL command, which is pretty similar to PostgreSQL’s COPY command, while Oracle ships a command-line tool called SQL*Loader (sqlldr). MySQL also supports CSV file imports with the LOAD DATA INFILE command or through the mysqlimport utility.
Conclusion
The database’s native CSV import will give you better performance than parsing CSV in the application and issuing many INSERT commands. Of course, you should take care of your database validations, because you are skipping the application validations.
If you’re working with small CSV files, existing libraries are usually the best solution because of their simplicity, especially if you need to do data manipulation before persisting it to the database.
In all other cases, go ahead and use the database tools because they are really powerful and easy to use.