When you want to move data into your PostgreSQL database, there are several options available, such as pg_dump and Azure Data Factory. The right method depends on the scenario you want to enable. Today, we’re exploring three scenarios where you can consider using PostgreSQL’s COPY command.
“COPY is the Postgres method of data-loading. Postgres's COPY comes in two separate variants, COPY and \COPY: COPY is server based, \COPY is client based.” - The PostgreSQL Wiki
Both versions of COPY move data from a file to a Postgres table.
The server-based COPY command reads and writes files on the database server itself and requires elevated permissions, so it isn’t available for use on Azure Database for PostgreSQL.
\COPY runs COPY internally, but with expanded permissions and file access. The source file does not have to exist on the same machine as the Postgres instance if you use \COPY. You can even use both absolute and relative paths.
\COPY is a psql feature. You need to have the psql application to be able to use \COPY.
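For example, assuming a hypothetical local file measurements.csv and a matching table measurements that already exists in the tests database, a client-side load is a single psql call:
PGPASSWORD=<password> psql -h <serverName>.postgres.database.azure.com -U <username>@<serverName> tests -c "\copy measurements FROM 'measurements.csv' WITH (FORMAT csv, HEADER)"
The file is read from the client machine, so a relative path like this works even though the server never sees the file.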
Today, we’ll use \COPY to enable these three scenarios:
- Migrating data from Postgres to Postgres
- Ingesting from HTTP data sources
- Moving data from other database engines, like MySQL
Migrating data from Postgres to Postgres
For our first scenario, we’ll do a data migration between two PostgreSQL servers. We’ll first move the data with pg_dump, with and without parallelism. Then we’ll compare the performance using \COPY.
These are the advantages and disadvantages I expect to see:
Advantages
- The \COPY command can stream through a pipe, so no staging space is required on the client side
- Potential for increased performance with parallelism, compared to pg_dump
Disadvantages
- The \COPY method does not migrate indexes, foreign keys, and other similar objects. You’ll have to take additional steps to move those objects (one way to do it with pg_dump is sketched right after this list). Alternatively, you can consider using a migration service.
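For example, here is a minimal sketch of moving those objects separately with pg_dump’s post-data section (indexes, constraints, triggers), applied after the data has been loaded. The server names and credentials are the same placeholders used throughout this post:
PGPASSWORD=<sourcePassword> pg_dump --section=post-data -h <sourceServerName>.postgres.database.azure.com -U <username>@<sourceServerName> tests|PGPASSWORD=<destinationPassword> psql -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests
Creating indexes and constraints after the load is usually faster than creating them before copying the data.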
For this example, I will be migrating from an Azure Database for PostgreSQL Basic tier Single Server to a General Purpose tier Single Server.
Prepare psql connection string scripts
First, I’ll store the psql connection strings to my source and destination servers in scripts. That way I can call those scripts when I want to connect.
pgsource.sh = bash script with psql and connection string to source `tests` database
pgdestination.sh = bash script with psql and connection string to destination `tests` database
cat <<EOF > pgsource.sh
PGPASSWORD=<sourcePassword> psql -t -h <sourceServerName>.postgres.database.azure.com -U <username>@<sourceServerName> tests
EOF
chmod +x pgsource.sh
cat <<EOF > pgdestination.sh
PGPASSWORD=<destinationPassword> psql -t -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests
EOF
chmod +x pgdestination.sh
My source database has 20 tables each with 1 million auto-generated rows. There are no indexes or foreign keys in this database.
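(For reference, test tables like these can be generated with a loop along the following lines; the table names t1 through t20 and the two-column layout are just an illustration, not the exact schema I used.)
for i in $(seq 1 20)
do
echo "create table t$i (id integer, payload text); insert into t$i select g, md5(g::text) from generate_series(1,1000000) g;"|./pgsource.sh
done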
Test full restoration with pg_dump/pg_restore (no parallelism)
Internally, pg_dump and pg_restore use the COPY command.
time PGPASSWORD=<sourcePassword> pg_dump -C -Fc -h <sourceServerName>.postgres.database.azure.com -U <username>@<sourceServerName> tests|PGPASSWORD=<destinationPassword> pg_restore -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> -C -c -d postgres
real 6m15.310s
user 0m22.976s
sys 0m9.393s
- Total Time: 6m15sec
Test full restoration with pg_dump/pg_restore (parallel)
We’ll need to use the directory export format and dump to regular files, since parallel pg_dump only supports the directory archive format. (See the pg_dump and pg_restore documentation to learn about format options.)
time PGPASSWORD=<sourcePassword> pg_dump -C -Fd -h <sourceServerName>.postgres.database.azure.com -U <username>@<sourceServerName> tests -j 4 -f testsdir
real 3m23.004s
user 0m20.291s
sys 0m13.888s
time PGPASSWORD=<destinationPassword> pg_restore -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> -C -c -d postgres -j 4 testsdir
real 0m56.175s
user 0m1.671s
sys 0m0.478s
- Total Time: 4m19sec
Test data move with \COPY (no parallelism)
Let's use \COPY to move data. Here are the steps we’ll follow:
- Preparation: use pg_dump and pg_restore to move the schema
- Generate \COPY command pipes
- Copy data through \COPY command pipes
time PGPASSWORD=<sourcePassword> pg_dump -C -Fc --schema-only -h <sourceServerName>.postgres.database.azure.com -U <username>@<sourceServerName> tests|PGPASSWORD=<destinationPassword> pg_restore -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> --schema-only -C -c -d postgres
real 0m15.288s
user 0m0.144s
sys 0m0.054s
time for j in `echo "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public' and table_type='BASE TABLE' ORDER BY table_name"|./pgsource.sh`
do
echo "\copy $j FROM PROGRAM 'echo copy binary $j to STDOUT|./pgsource.sh' WITH BINARY;"|./pgdestination.sh
done
real 7m47.179s
user 0m19.518s
sys 0m23.467s
- Total Time: 8m2sec
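For reference, for a single (hypothetical) table t1, each iteration of the loop above runs the equivalent of:
echo "\copy t1 FROM PROGRAM 'echo copy binary t1 to STDOUT|./pgsource.sh' WITH BINARY;"|./pgdestination.sh
The destination-side \copy reads from a program that asks the source server to stream the table in binary COPY format, so rows flow from server to server without being staged on the client’s disk.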
Test data move with \COPY (parallel)
Unlike parallelized pg_dump/pg_restore, there’s no need to stage the data in files. We can keep the same pipe mechanism we used with \COPY without parallelism, simply running several pipes at once.
export maxParal=4
# curParal.tmp tracks the tables currently being copied, one line per running job
>curParal.tmp
time for j in `echo "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public' and table_type='BASE TABLE' ORDER BY table_name"|./pgsource.sh`
do
echo $j >>curParal.tmp
# run the \copy pipe for this table in the background; when it finishes, its line is removed from curParal.tmp to free a slot
nohup $SHELL <<EOF &
echo "\copy $j FROM PROGRAM 'echo copy binary $j to STDOUT|./pgsource.sh' WITH BINARY;"|./pgdestination.sh; sed -i '/^$j$/d' curParal.tmp
EOF
# throttle: wait until fewer than maxParal copies are running before launching the next one
while [ `wc -l curParal.tmp|cut -f 1 -d " "` -ge $maxParal ]
do
sleep 1
done
done
real 2m59.491s
user 0m24.388s
sys 0m40.200s
- Total Time: 3m14sec
\COPY's advantage over pg_dump
Without parallelism:
pg_dump/pg_restore won: 6m15sec vs 8m2sec for the \COPY approach
With parallelism:
The \COPY approach won: 3m14sec vs 4m19sec for pg_dump/pg_restore
Although pg_dump/pg_restore use the COPY command internally, a manual parallelism mechanism can achieve better execution times. The improvement comes because parallel \COPY can stream through pipes and avoid saving data to a staging area, which pg_dump/pg_restore requires when running in parallel.
Using \COPY to ingest from HTTP data sources
With the \COPY command we can ingest not only .csv, .txt, or binary files, but also data piped from another database; we can even create authentic ETL-like processes with a single command.
In this case we are going to ingest the open CSV dataset of COVID-19 cases worldwide.
First, we’ll create a Postgres table called covid19casesww with the columns we need:
echo "create table covid19casesww (dateRep date,day integer,month integer,year integer,cases integer,deaths integer,countriesAndTerritories text,geoId text,countryterritoryCode text,popData2019 integer,continentExp text);"|PGPASSWORD=<destinationPassword> psql -t -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests
We ingest the data into our Postgres server with a pipe from wget to the \COPY command (tail -n +2 skips the CSV header row):
echo "set datestyle to SQL,DMY;\COPY covid19casesww FROM PROGRAM 'wget -q -O - "$@" "https://opendata.ecdc.europa.eu/covid19/casedistribution/csv"|tail -n +2' CSV DELIMITER ',';"|PGPASSWORD=<destinationPassword> psql -t -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests
Now we can access the data and calculate, for example, the 10 territories with the lowest number of cases:
echo "select sum(cases) as cases, countriesAndTerritories, continentexp from covid19casesww group by countriesAndTerritories,continentExp order by 1 asc limit 10;"|PGPASSWORD=<destinationPassword> psql -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests
cases | countriesandterritories | continentexp
-------+-----------------------------------+--------------
3 | Anguilla | America
7 | Bonaire, Saint Eustatius and Saba | America
8 | British_Virgin_Islands | America
11 | Montserrat | America
11 | Papua_New_Guinea | Oceania
11 | Seychelles | Africa
12 | Holy_See | Europe
13 | Falkland_Islands_(Malvinas) | America
13 | Greenland | America
15 | Saint_Kitts_and_Nevis | America
(10 rows)
Using \COPY to ingest from other database engines
Suppose we want to migrate data from a MySQL server to a PostgreSQL server.
First create the destination table in Postgres. Mine is called prumysql.
echo "create table prumysql (a text, b integer);"|PGPASSWORD=<destinationPassword> psql -t -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests
Then pipe from the mysql client to the \COPY command:
echo "\COPY prumysql FROM PROGRAM 'echo select a,b from prumysql|mysql -h <sourceServerName>.mysql.database.azure.com -u <username>@<sourceServerName> -p<sourcePassword> -sN --ssl tests -B' ;"|PGPASSWORD=<destinationPassword> psql -t -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests
\COPY as an alternative way to do PostgreSQL data migration
PostgreSQL’s \COPY enables a range of data loading experiences. We've shared some examples in this blog. How do you use \COPY?