Dealing with various sources of data in web applications requires us to create services that will extract information from CSV, Excel, and other file types. In that case, it's best to use some existing libraries, or if your backend is on Rails, use gems. There are many gems with very cool features like
Roo. But you can also use plain
Either way, if those are small CSV files, you will get your job done easily. But what if you need to import large CSV files (~100MB / ~1M rows)?
For a recent project I worked on, an external system would send a CSV file containing 200k rows every 15 minutes. Aforementioned solutions were simply not good enough; they were slow and ate up a bunch of RAM.
I knew I had to find a more efficient solution.
First Try: CSVImporter gem
Because of its simplicity, I first decided to use the
CSVImporter gem. After installing the gem, the
CSVImporter class was defined like this:
class ForecastCSVImporter include CSVImporter model Forecast column :property column :date column :value column :type column :location column :created_at end
Executing an import for a specific file was done like this:
The implementation was simple and it worked really well on a test CSV file. Then I decided to test it in production, but it was too slow. Importing took more than 2 hours which was fairly bad. The main problem was that each CSV row had to be converted into an
ActiveRecord model and had to call
I have tried many different gems and wrote a custom CSV importer by using plain
Ruby CSV with batch insert commands. Performance improvements were noticeable and the import time was reduced to around 45 minutes, but it was still too slow. I felt I still wasn't on the right track.
Solution - Use COPY
After all of these attempts, I finally gave up on Ruby solutions and looked for help from my database.
I found that PostgreSQL has a really powerful yet very simple command called
COPY which copies data between a file and a database table.
It can be used in both ways:
- to import data from a CSV file to database
- to export data from a database table to a CSV file.
Example of usage:
COPY forecasts FROM 'tmp/forecast.csv' CSV HEADER;
This piece of SQL code will import the content from a CSV file to our
forecasts table. Note one thing: it's assumed that the number and order of columns in the table is the same as in the CSV file.
Importing a CSV file with ~1M rows now takes under 4 seconds which is blazing fast when compared to previous solutions!
Usually, when writing application code, it's good to avoid writing raw SQL. By default,
ActiveRecord doesn't support the
COPY command but there is a gem which takes care of that. It's called
postgres-copy. The gem provides a simple interface for copying data between a database table and a CSV file.
Let's see an example:
# Enable COPY command on Forecast model class Forecast < ActiveRecord::Base acts_as_copy_target end # Run export of table data to a file Forecast.copy_to '/tmp/forecast.csv' # Run import from a CSV file to database Forecast.copy_from '/tmp/forecast.csv'
SQL commands of previous calls are:
-- Forecast.copy_to '/tmp/forecast.csv' COPY (SELECT "forecasts".* FROM "forecasts") TO '/tmp/forecast.csv' WITH DELIMITER ';' CSV HEADER -- Forecast.copy_from '/tmp/forecast.csv' COPY forecasts FROM '/tmp/forecast.csv'
COPY command is simple and super fast. However, it has restrictions in some advanced scenarios when importing from CSV:
you must use all of the columns from a CSV file
problems arise if you want to manipulate the data before it is inserted into the database table.
You can specify mappings between CSV columns and table columns. That means you can have different orders of attributes in a CSV file and the database table, but the table must use all of the columns from a CSV file.
These problems are common, but as always, there are workarounds.
The first one is to create a temporary table where you would import the original data from a CSV file. After the
COPY command inserts all the data from the CSV file, you can perform a custom
INSERT command to transfer data from the temporary table to your original table. Within the
INSERT command, you can easily perform data manipulation.
Let's see an example:
COPY forecasts_import FROM 'tmp/forecast.csv'; INSERT INTO forecasts SELECT location_id::int, value, forecast_type, DATE(created_at) FROM forecasts_import; DELETE FROM forecasts_import;
A second approach when our attributes in the CSV and our tables don't match is to read the data from standard input and manipulate it through our application:
COPY forecasts FROM STDIN;
In that case, you need to manually read the data, row by row, from a CSV file (e.g. by using plain
Ruby CSV) and send the data to
STDIN of your database connection.
db_conn = ActiveRecord::Base.connection.raw_connection copy_statement = 'COPY forecasts FROM STDIN' file_path = '/tmp/forecast.csv' db_conn.copy_data(copy_statement) do CSV.foreach(file_path, headers: true) do |row| db_conn.put_copy_data(row.fields + [Time.zone.now]) end end
Although the import process goes through Ruby, there is no overhead of instantiating
ActiveRecord objects and performing validations. This is a bit slower than directly importing with the database, but it's still very fast.
Can I use this if I'm not on PostgreSQL?
In addition to PostgreSQL, other databases also support native CSV importing.
For example, Microsoft SQL Server uses the
BULK INSERT SQL command which is pretty similar to PostgreSQL's
COPY command, while Oracle has a command line tool called
MySQL also supports CSV file imports with the
LOAD DATA INFILE command or by using the mysqlimport utility.
Native CSV parsers will give you better performance than using plain CSV parsers combined with many
INSERT commands. Of course, you should take care of your database validations because you are skipping application validations.
If you're working with small CSV files, existing libraries are usually the best solution because of their simplicity, especially if you need to do data manipulation before persisting it to the database.
In all other cases, go ahead and use the database tools because they are really powerful and easy to use.