When you want to move data into your PostgreSQL database, there are several options available, such as pg_dump and Azure Data Factory. The right method depends on the scenario you want to enable. Today, we’re exploring three scenarios where you can consider using PostgreSQL’s COPY command.
“COPY is the Postgres method of data-loading. Postgres's COPY comes in two separate variants, COPY and \COPY: COPY is server based, \COPY is client based.” - The PostgreSQL Wiki
Both versions of COPY move data from a file to a Postgres table.
The server-based COPY command reads and writes files on the database server itself and requires elevated permissions, so it isn’t available for use on Azure Database for PostgreSQL.
\COPY runs COPY internally, but with expanded permissions and file access. The source file does not have to exist on the same machine as the Postgres instance if you use \COPY. You can even use both absolute and relative paths.
\COPY is a psql feature. You need to have the psql application to be able to use \COPY.
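For example, assuming a hypothetical local file measurements.csv and a matching table measurements that already exists in the tests database, a client-side load is a single psql call:
PGPASSWORD=<password> psql -h <serverName>.postgres.database.azure.com -U <username>@<serverName> tests -c "\copy measurements FROM 'measurements.csv' WITH (FORMAT csv, HEADER)"
The file is read from the client machine, so a relative path like this works even though the server never sees the file.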
Today, we’ll use \COPY to enable these three scenarios:
- Migrating data from Postgres to Postgres
- Ingesting from HTTP data sources
- Moving data from other database engines, like MySQL
Migrating data from Postgres to Postgres
For our first scenario, we’ll do a data migration between two PostgreSQL servers. We’ll first move the data with pg_dump, with and without parallelism. Then we’ll compare the performance using \COPY.
These are the advantages and disadvantages I expect to see:
Advantages
- The \COPY command can stream through a pipe, so no staging space is required on the client side
- Potential for increased performance with parallelism, compared to pg_dump
Disadvantages
- The \COPY method does not migrate indexes, foreign keys, and other similar objects. You’ll have to take additional steps to move those objects (one way to do it with pg_dump is sketched right after this list). Alternatively, you can consider using a migration service.
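For example, here is a minimal sketch of moving those objects separately with pg_dump’s post-data section (indexes, constraints, triggers), applied after the data has been loaded. The server names and credentials are the same placeholders used throughout this post:
PGPASSWORD=<sourcePassword> pg_dump --section=post-data -h <sourceServerName>.postgres.database.azure.com -U <username>@<sourceServerName> tests|PGPASSWORD=<destinationPassword> psql -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests
Creating indexes and constraints after the load is usually faster than creating them before copying the data.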
For this example, I will be migrating from an Azure Database for PostgreSQL Basic tier Single Server to a General Purpose tier Single Server.
Prepare psql connection string scripts
First, I’ll store the psql connection strings to my source and destination servers in scripts. That way I can call those scripts when I want to connect.
pgsource.sh = bash script with psql and connection string to source `tests` database
pgdestination.sh = bash script with psql and connection string to destination `tests` database
cat <<EOF > pgsource.sh
PGPASSWORD=<sourcePassword> psql -t -h <sourceServerName>.postgres.database.azure.com -U <username>@<sourceServerName> tests
EOF
chmod +x pgsource.sh
cat <<EOF > pgdestination.sh
PGPASSWORD=<destinationPassword> psql -t -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests
EOF
chmod +x pgdestination.sh
My source database has 20 tables each with 1 million auto-generated rows. There are no indexes or foreign keys in this database.
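(For reference, test tables like these can be generated with a loop along the following lines; the table names t1 through t20 and the two-column layout are just an illustration, not the exact schema I used.)
for i in $(seq 1 20)
do
echo "create table t$i (id integer, payload text); insert into t$i select g, md5(g::text) from generate_series(1,1000000) g;"|./pgsource.sh
done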
Test full restoration with pg_dump/pg_restore (no parallelism)
Internally, pg_dump and pg_restore use the COPY command.
time PGPASSWORD=<sourcePassword> pg_dump -C -Fc -h <sourceServerName>.postgres.database.azure.com -U <username>@<sourceServerName> tests|PGPASSWORD=<destinationPassword> pg_restore -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> -C -c -d postgres
real 6m15.310s
user 0m22.976s
sys 0m9.393s
- Total Time: 6m15sec
Test full restoration with pg_dump/pg_restore (parallel)
We’ll need to use the directory export format and dump to regular files, since parallel pg_dump only supports the directory archive format. (See the pg_dump and pg_restore documentation to learn about format options.)
time PGPASSWORD=<sourcePassword> pg_dump -C -Fd -h <sourceServerName>.postgres.database.azure.com -U <username>@<sourceServerName> tests -j 4 -f testsdir
real 3m23.004s
user 0m20.291s
sys 0m13.888s
time PGPASSWORD=<destinationPassword> pg_restore -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> -C -c -d postgres -j 4 testsdir
real 0m56.175s
user 0m1.671s
sys 0m0.478s
- Total Time: 4m19sec
Test data move with \COPY (no parallelism)
Let's use \COPY to move data. Here are the steps we’ll follow:
- Preparation: use pg_dump and pg_restore to move the schema
- Generate \COPY command pipes
- Copy data through \COPY command pipes
time PGPASSWORD=<sourcePassword> pg_dump -C -Fc --schema-only -h <sourceServerName>.postgres.database.azure.com -U <username>@<sourceServerName> tests|PGPASSWORD=<destinationPassword> pg_restore -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> --schema-only -C -c -d postgres
real 0m15.288s
user 0m0.144s
sys 0m0.054s
time for j in `echo "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public' and table_type='BASE TABLE' ORDER BY table_name"|./pgsource.sh`
do
echo "\copy $j FROM PROGRAM 'echo copy binary $j to STDOUT|./pgsource.sh' WITH BINARY;"|./pgdestination.sh
done
real 7m47.179s
user 0m19.518s
sys 0m23.467s
- Total Time: 8m2sec
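For reference, for a single (hypothetical) table t1, each iteration of the loop above runs the equivalent of:
echo "\copy t1 FROM PROGRAM 'echo copy binary t1 to STDOUT|./pgsource.sh' WITH BINARY;"|./pgdestination.sh
The destination-side \copy reads from a program that asks the source server to stream the table in binary COPY format, so rows flow from server to server without being staged on the client’s disk.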
Test data move with \COPY (parallel)
Unlike parallelized pg_dump/pg_restore, there’s no need to stage the data in files. We can keep the same pipe mechanism we used with \COPY without parallelism, simply running several pipes at once.
export maxParal=4
# curParal.tmp tracks the tables currently being copied, one line per running job
>curParal.tmp
time for j in `echo "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public' and table_type='BASE TABLE' ORDER BY table_name"|./pgsource.sh`
do
echo $j >>curParal.tmp
# run the \copy pipe for this table in the background; when it finishes, its line is removed from curParal.tmp to free a slot
nohup $SHELL <<EOF &
echo "\copy $j FROM PROGRAM 'echo copy binary $j to STDOUT|./pgsource.sh' WITH BINARY;"|./pgdestination.sh; sed -i '/^$j$/d' curParal.tmp
EOF
# throttle: wait until fewer than maxParal copies are running before launching the next one
while [ `wc -l curParal.tmp|cut -f 1 -d " "` -ge $maxParal ]
do
sleep 1
done
done
real 2m59.491s
user 0m24.388s
sys 0m40.200s
- Total Time: 3m14sec
\COPY's advantage over pg_dump
Without parallelism:
pg_dump/pg_restore won: 6m15sec vs 8m2sec for the \COPY approach
With parallelism:
The \COPY approach won: 3m14sec vs 4m19sec for pg_dump/pg_restore
Although pg_dump/pg_restore use the COPY command internally, a manual parallelism mechanism can achieve better execution times. The improvement comes because parallel \COPY can stream through pipes and avoid saving data to a staging area, which pg_dump/pg_restore requires when running in parallel.
Using \COPY to ingest from HTTP data sources
With the \COPY command we can ingest not only .csv, .txt, or binary files, but also data piped from another database; we can even create authentic ETL-like processes with a single command.
In this case we are going to ingest the open CSV dataset of COVID-19 cases worldwide.
First, we’ll create a Postgres table called covid19casesww with the columns we need:
echo "create table covid19casesww (dateRep date,day integer,month integer,year integer,cases integer,deaths integer,countriesAndTerritories text,geoId text,countryterritoryCode text,popData2019 integer,continentExp text);"|PGPASSWORD=<destinationPassword> psql -t -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests
We ingest the data into our Postgres server with a pipe from wget to the \COPY command (tail -n +2 skips the CSV header row):
echo "set datestyle to SQL,DMY;\COPY covid19casesww FROM PROGRAM 'wget -q -O - "$@" "https://opendata.ecdc.europa.eu/covid19/casedistribution/csv"|tail -n +2' CSV DELIMITER ',';"|PGPASSWORD=<destinationPassword> psql -t -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests
Now we can access the data and calculate, for example, the 10 territories with the lowest number of cases:
echo "select sum(cases) as cases, countriesAndTerritories, continentexp from covid19casesww group by countriesAndTerritories,continentExp order by 1 asc limit 10;"|PGPASSWORD=<destinationPassword> psql -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests
cases | countriesandterritories | continentexp
-------+-----------------------------------+--------------
3 | Anguilla | America
7 | Bonaire, Saint Eustatius and Saba | America
8 | British_Virgin_Islands | America
11 | Montserrat | America
11 | Papua_New_Guinea | Oceania
11 | Seychelles | Africa
12 | Holy_See | Europe
13 | Falkland_Islands_(Malvinas) | America
13 | Greenland | America
15 | Saint_Kitts_and_Nevis | America
(10 rows)
Using \COPY to ingest from other database engines
Suppose we want to migrate data from a MySQL server to a PostgreSQL server.
First create the destination table in Postgres. Mine is called prumysql.
echo "create table prumysql (a text, b integer);"|PGPASSWORD=<destinationPassword> psql -t -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests
Then pipe from the mysql client to the \COPY command:
echo "\COPY prumysql FROM PROGRAM 'echo select a,b from prumysql|mysql -h <sourceServerName>.mysql.database.azure.com -u <username>@<sourceServerName> -p<sourcePassword> -sN --ssl tests -B' ;"|PGPASSWORD=<destinationPassword> psql -t -h <destinationServerName>.postgres.database.azure.com -U <username>@<destinationServerName> tests
\COPY as an alternative way to do PostgreSQL data migration
PostgreSQL’s \COPY enables a range of data loading experiences. We've shared some examples in this blog. How do you use \COPY?