Building Data Access Layers in Python

Source
Imagine you’re building a web application for a student organization. You need to store member information, upcoming events, and potentially even forum posts. Wouldn’t it be a pain to write all the code to connect to and query the database directly for every single action?

That’s where Data Access Layers (DALs) come in! Think of a DAL as a specialized library specifically designed for interacting with databases. It acts as a middle layer between your application code and the actual database itself. Here’s what a DAL does for you:

  • Abstraction: You don’t have to worry about the specifics of the underlying database technology (like SQL or NoSQL). The DAL handles the communication details, allowing you to write code that focuses on what the data actually represents (members, events, etc.).
  • Efficiency: DALs can optimize database queries and data retrieval, making your application perform better. They might handle things like connection pooling (reusing connections instead of constantly creating new ones) to improve database access speed.
  • Reusability: You can create generic functions within the DAL to perform common database operations (like adding, updating, or deleting data). This saves you time and ensures consistency across your codebase.
  • Maintainability: If you ever need to switch database technologies, the impact on your application code is minimized because the DAL acts as a buffer. You only need to modify the DAL itself, not rewrite large portions of your application code.

Here’s an analogy: Think of your application code as a chef preparing a delicious meal. The ingredients (data) are stored in the pantry (database). The DAL acts as your sous-chef, efficiently retrieving the exact ingredients you need (specific data) without you having to rummage through the pantry yourself. This frees you up to focus on the core logic of your application (cooking the meal).

Data Access Layers (DAL) in Python:

1. Using Object-Relational Mappers (ORMs):

ORMs like SQLAlchemy simplify data access by mapping database tables to Python classes. Here’s an example using SQLAlchemy:

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

# Database connection
engine = create_engine('sqlite:///mydatabase.db')

# Define a base class for models
Base = declarative_base()

# Create a user model class
class User(Base):
  __tablename__ = 'users'

  id = Column(Integer, primary_key=True)
  name = Column(String)
  email = Column(String)

# Create database tables (if they don't exist)
Base.metadata.create_all(engine)

# Create a session for interacting with the database
Session = sessionmaker(bind=engine)
session = Session()

# Add a new user
new_user = User(name="Alice", email="alice@hotmail.com")
session.add(new_user)
session.commit()

# Get all users
users = session.query(User).all()
for user in users:
  print(user.name, user.email)

# Close the session
session.close() 

2. Using raw SQL queries:

For more control, you can write raw SQL queries. This requires managing database connections and parsing results:

import sqlite3

# Connect to the database (assumes the users table already exists, e.g. created by the ORM example above)
conn = sqlite3.connect('mydatabase.db')
cursor = conn.cursor()

# Insert a new user
cursor.execute("INSERT INTO users (name, email) VALUES (?, ?)", ("Bob", "bob@example.com"))
conn.commit()

# Get all users
cursor.execute("SELECT * FROM users")
users = cursor.fetchall()
for user in users:
  print(user[0], user[1])  # Access data by index

# Close the connection
conn.close() 

Choosing the right approach depends on your project’s needs. ORMs offer a higher level of abstraction and are generally recommended for complex data models. Raw SQL gives you more control but requires writing more code. Lightweight libraries (like peewee) provide a balance between the two.

Remember:

  • Always handle database connections and errors appropriately.
  • Use prepared statements to prevent SQL injection vulnerabilities.
  • Consider connection pooling for performance optimization in high-traffic applications.

As you embark on your database journey, Data Access Layers (DALs) will become an invaluable tool in your arsenal. They not only streamline database interactions but also lay the foundation for clean, maintainable, and scalable code. By abstracting away the complexities of different database technologies, DALs let you focus on the core logic of your application.

Furthermore, the benefits of DALs extend beyond individual projects. As you progress to larger, more complex applications, DALs will become essential for managing intricate data models and ensuring consistency across your codebase. By adopting best practices in database interaction from the outset, you’ll be well-equipped to tackle even the most challenging data-driven projects in the future.

Read-Only User in Postgres

Create a read-only user in PostgreSQL

1. To create a new user in PostgreSQL:

CREATE USER username WITH PASSWORD 'your_password';

To learn more about creating PostgreSQL users, visit this post.

2. GRANT the CONNECT access:

GRANT CONNECT ON DATABASE database_name TO username;

3. Then GRANT USAGE on schema:

GRANT USAGE ON SCHEMA schema_name TO username;

4. GRANT SELECT

  • Grant SELECT for a specific table:
GRANT SELECT ON table_name TO username;
  • Grant SELECT for multiple tables:
GRANT SELECT ON ALL TABLES IN SCHEMA schema_name TO username;
  • If you want SELECT on tables created in the future to be granted automatically, alter the default privileges (note that this only affects tables created later by the role that runs the command, unless you add a FOR ROLE clause):
ALTER DEFAULT PRIVILEGES IN SCHEMA schema_name
GRANT SELECT ON TABLES TO username;

Reference taken from this article.

Script to Create Read-Only user:

CREATE ROLE Read_Only_User WITH LOGIN PASSWORD 'Test1234' 
NOSUPERUSER INHERIT NOCREATEDB NOCREATEROLE NOREPLICATION VALID UNTIL 'infinity';

Assign permissions to this read-only user:

GRANT CONNECT ON DATABASE YourDatabaseName TO Read_Only_User;
GRANT USAGE ON SCHEMA public TO Read_Only_User;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO Read_Only_User;
GRANT SELECT ON ALL SEQUENCES IN SCHEMA public TO Read_Only_User;
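A quick way to sanity-check the grants is the built-in has_table_privilege function; the table name below is just a placeholder for one of your own tables, and note that the unquoted role name above is folded to lower case:

SELECT has_table_privilege('read_only_user', 'public.your_table', 'SELECT');  -- expected: true
SELECT has_table_privilege('read_only_user', 'public.your_table', 'INSERT');  -- expected: false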

https://whiscardz.wordpress.com/2020/07/02/create-readonly-user-postgres/

https://serverfault.com/questions/60500/crate-a-new-read-only-user-in-postgres

MySQL Database on Aurora with Terraform

Source

I can never remember how to connect to a MySQL database deployed with RDS and Aurora in a VPC. As my setup inevitably gets more complex, I can’t seem to figure out the right combination of security groups, VPC settings, and SSH tunneling flags that lets me magically connect. So my goal for this blog is to deploy the simplest setup of Terraform + RDS + Aurora + MySQL and be able to connect to the database.

I will assume familiarity with Aurora and MySQL, but before we get started, let’s go over RDS and Aurora:

RDS:

Amazon Relational Database Service (Amazon RDS) is a web service that makes it easier to set up, operate, and scale a relational database in the cloud. It provides cost-efficient, resizable capacity for an industry-standard relational database and manages common database administration tasks.

Aurora:

Amazon Aurora is a MySQL and PostgreSQL-compatible relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases. It provides up to five times better performance than MySQL with the security, availability, and reliability of a commercial database at one tenth the cost.

So that sounds great! Let’s build it!

Build our Database Cluster and Instance

Note: These are instructions for OSX
Note: This assumes you have terraform installed

main.tf

resource "aws_rds_cluster_instance" "cluster_instances" {
  identifier         = "${var.cluster_name}-instance"
  cluster_identifier = "${aws_rds_cluster.cluster.id}"
  instance_class     = "${var.instance_class}"
}

resource "aws_rds_cluster" "cluster" {
  cluster_identifier     = "${var.cluster_name}"
  database_name          = "sample_rds"
  master_username        = "${var.username}"
  master_password        = "${var.password}"
  vpc_security_group_ids = ["${aws_security_group.aurora-sg.id}"]
  skip_final_snapshot    = true
}

resource "aws_security_group" "aurora-sg" {
  name   = "aurora-security-group"
  vpc_id = "${aws_default_vpc.default.id}"

  ingress {
    protocol    = "tcp"
    from_port   = 3306
    to_port     = 3306
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    protocol    = -1
    from_port   = 0 
    to_port     = 0 
    cidr_blocks = ["0.0.0.0/0"]
  }
}

provider "aws" {
  region = "us-east-1"
}

variable "cluster_name" {
  default = "rds-sample-cluster"
} 
  
variable "instance_class" {
  default = "db.t2.small"
}

variable "username" {
  default = "master"
}

variable "password" {
  default = "password"
}

This creates an RDS cluster, an RDS instance, and a security group:

RDS Clusters:

A DB cluster consists of one or more instances, and a cluster volume that manages the data for those instances. An Aurora cluster volume is a virtual database storage volume that spans multiple Availability Zones, with each Availability Zone having a copy of the cluster data.

RDS Instance:

A DB instance is an isolated database environment running in the cloud. It is the basic building block of Amazon RDS. A DB instance can contain multiple user-created databases, and can be accessed using the same client tools and applications you might use to access a stand-alone database instance.

Security Group:

A security group acts as a virtual firewall that controls the traffic for one or more instances.

Create our Bastion Host

More detailed instructions for building a Bastion host here

bastion.tf

resource "aws_default_vpc" "default" {}

resource "aws_instance" "bastion" {
  ami                         = "ami-1d4e7a66"
  key_name                    = "${aws_key_pair.bastion_key.key_name}"
  instance_type               = "t2.micro"
  security_groups             = ["${aws_security_group.bastion-sg.name}"]
  associate_public_ip_address = true
}

resource "aws_security_group" "bastion-sg" {
  name   = "bastion-security-group"
  vpc_id = "${aws_default_vpc.default.id}"

  ingress {
    protocol    = "tcp"
    from_port   = 22
    to_port     = 22
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    protocol    = "tcp"
    from_port   = 3306
    to_port     = 3306
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    protocol    = -1
    from_port   = 0
    to_port     = 0
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_key_pair" "bastion_key" {
  key_name   = "your_key_name"
  public_key = "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDWbz6ur89BKQ+am87EovJsv6g9QpbOiw13lTF7Kw1StbQAmkcGGrNTK2LIWsP3cQf+P+gptRAJbuqB1jQKZ283TwwREIv+l5AMKrbEkanOF4zsc8a9zitejlOLvVUxtVoMi5ROVYD2dLKjqAbDtqIC9LmMD+hcpqcXLhS6t+HVSVI862dTNVFY1EGukLGQ3IEJfw5v7FDzLn72NsuUiXEeCZu8DtlXLCTYRnqv+XkJQWVocPdFDUWISSIQ0CTFu+GJvJjdqDyAhYo3it7Eybj6XuSgLDwkQcNU45Ewz4Nn7LwV+f4Av8D25m4FZOfpWaj5+q9Fc9nRdIsB7P0oFgj5YoaTngQKy27MJ5UppMO7OOhriurJ/PBOrGpeqPcftWKLpcHLIGrm3ndoDKQx12R1s0gyYpA4JuNUWHYcxNrFa2rs/6AoFuS7wNUmM+DYB8iTjOl6dT8dS5AgMxGoZ3NepMPYilw1gf+gw9Ft3pHs2IMfDfqwZpXga8KdYwxBmRakpHdA7Nzje8ufvP/TBawsqVcW7z5gG9uPhYtfnYYezSIxv56PMSWEfqchkz+raPsElzIGtPcC1snncQlau95utV25r88BzXhCMJwNy9aDNEfSrm5SORlA97xicroCOuRjw2PnQyIXKvWDZtyqX5799x37K/HDYpJnvcgwpTlDZQ== your_email@example.com"
}

output "bastion_public_dns" {
  value = "${aws_instance.bastion.public_dns}"
}

Deploy our Database and VPC

terraform apply

Output from terraform apply:

Outputs:

bastion_public_dns = ec2-52-91-54-64.compute-1.amazonaws.com
rds_instance_endpoint = rds-sample-cluster-instance.cn03auqgmvh3.us-east-1.rds.amazonaws.com

Create an SSH tunnel to your RDS instance

ssh -L 3307:rds-sample-cluster-instance.cn03auqgmvh3.us-east-1.rds.amazonaws.com:3306 ubuntu@ec2-52-91-54-64.compute-1.amazonaws.com -N

Let’s break this down:

-L local_socket:host:hostport

Specifies that connections to the given TCP port or Unix socket on the local (client) host are to be forwarded to the given host and port, or Unix socket, on the remote side. This works by allocating a socket to listen to either a TCP port on the local side, optionally bound to the specified bind_address, or to a Unix socket. Whenever a connection is made to the local port or socket, the connection is forwarded over the secure channel, and a connection is made to either host port hostport, or the Unix socket remote_socket, from the remote machine.

We want to forward connections made to local port 3307 to port 3306 on the host rds-sample-cluster-instance.cn03auqgmvh3.us-east-1.rds.amazonaws.com,
and we reach that host through the bastion: ubuntu@ec2-52-91-54-64.compute-1.amazonaws.com.

and finally:

-N  Do not execute a remote command. This is useful for just forwarding ports.

Now we can connect to our remote database locally:

mysql -u master -p -P 3307 -h 127.0.0.1  

Success!
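Once connected, a quick sanity check should show the sample_rds database that the cluster resource above created:

SHOW DATABASES;
USE sample_rds;
SHOW TABLES;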




TLDR

main.tf

provider "aws" {
  region = "us-east-1"
}

variable "cluster_name" {
  default = "rds-sample-cluster"
} 
  
variable "instance_class" {
  default = "db.t2.small"
}

variable "username" {
  default = "master"
}

variable "password" {
  default = "password"
}

# =============
# = RDS Setup =
# =============

resource "aws_rds_cluster_instance" "cluster_instances" {
  identifier         = "${var.cluster_name}-instance"
  cluster_identifier = "${aws_rds_cluster.cluster.id}"
  instance_class     = "${var.instance_class}"
}

resource "aws_rds_cluster" "cluster" {
  cluster_identifier     = "${var.cluster_name}"
  database_name          = "sample_rds"
  master_username        = "${var.username}"
  master_password        = "${var.password}"
  vpc_security_group_ids = ["${aws_security_group.aurora-sg.id}"]
  skip_final_snapshot    = true
}

resource "aws_security_group" "aurora-sg" {
  name   = "aurora-security-group"
  vpc_id = "${aws_default_vpc.default.id}"

  ingress {
    protocol    = "tcp"
    from_port   = 3306
    to_port     = 3306
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    protocol    = -1
    from_port   = 0 
    to_port     = 0 
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# =================
# = Bastion Setup =
# =================

resource "aws_default_vpc" "default" {}

resource "aws_instance" "bastion" {
  ami                         = "ami-1d4e7a66"
  key_name                    = "${aws_key_pair.bastion_key.key_name}"
  instance_type               = "t2.micro"
  security_groups             = ["${aws_security_group.bastion-sg.name}"]
  associate_public_ip_address = true
}

resource "aws_security_group" "bastion-sg" {
  name   = "bastion-security-group"
  vpc_id = "${aws_default_vpc.default.id}"

  ingress {
    protocol    = "tcp"
    from_port   = 22
    to_port     = 22
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    protocol    = "tcp"
    from_port   = 3306
    to_port     = 3306
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    protocol    = -1
    from_port   = 0
    to_port     = 0
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_key_pair" "bastion_key" {
  key_name   = "your_key_name"
  public_key = "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDWbz6ur89BKQ+am87EovJsv6g9QpbOiw13lTF7Kw1StbQAmkcGGrNTK2LIWsP3cQf+P+gptRAJbuqB1jQKZ283TwwREIv+l5AMKrbEkanOF4zsc8a9zitejlOLvVUxtVoMi5ROVYD2dLKjqAbDtqIC9LmMD+hcpqcXLhS6t+HVSVI862dTNVFY1EGukLGQ3IEJfw5v7FDzLn72NsuUiXEeCZu8DtlXLCTYRnqv+XkJQWVocPdFDUWISSIQ0CTFu+GJvJjdqDyAhYo3it7Eybj6XuSgLDwkQcNU45Ewz4Nn7LwV+f4Av8D25m4FZOfpWaj5+q9Fc9nRdIsB7P0oFgj5YoaTngQKy27MJ5UppMO7OOhriurJ/PBOrGpeqPcftWKLpcHLIGrm3ndoDKQx12R1s0gyYpA4JuNUWHYcxNrFa2rs/6AoFuS7wNUmM+DYB8iTjOl6dT8dS5AgMxGoZ3NepMPYilw1gf+gw9Ft3pHs2IMfDfqwZpXga8KdYwxBmRakpHdA7Nzje8ufvP/TBawsqVcW7z5gG9uPhYtfnYYezSIxv56PMSWEfqchkz+raPsElzIGtPcC1snncQlau95utV25r88BzXhCMJwNy9aDNEfSrm5SORlA97xicroCOuRjw2PnQyIXKvWDZtyqX5799x37K/HDYpJnvcgwpTlDZQ== your_email@example.com"
}

# ===========
# = Outputs =
# ===========

output "bastion_public_ip" {
  value = "${aws_instance.bastion.public_ip}"
}

Make sure you replace the public_key in your aws_key_pair with your own public key.

Run:

terraform apply

Outputs:

bastion_public_dns = ec2-52-91-54-64.compute-1.amazonaws.com
rds_instance_endpoint = rds-sample-cluster-instance.cn03auqgmvh3.us-east-1.rds.amazonaws.com

Export the variables to use for your SSH tunnel:

export BASTION_PUBLIC_DNS="ec2-52-91-54-64.compute-1.amazonaws.com"
export RDS_INSTANCE_ENDPOINT="rds-sample-cluster-instance.cn03auqgmvh3.us-east-1.rds.amazonaws.com"

A fancier export uses terraform output -json:
Note: you need jq installed (brew install jq)

export BASTION_PUBLIC_DNS=$(terraform output -json | jq -r '.bastion_public_dns.value')
export RDS_INSTANCE_ENDPOINT=$(terraform output -json | jq -r '.rds_instance_endpoint.value')

-r is short for --raw-output, which for our case means the value is not surrounded by quotes.

Open SSH tunnel to your database:

ssh -L 3307:$RDS_INSTANCE_ENDPOINT:3306 ubuntu@$BASTION_PUBLIC_DNS -N

Connect to your database:

mysql -u master -p -P 3307 -h 127.0.0.1

PostgreSQL DATE_TRUNC Function

Source

Summary: this tutorial shows you how to use the PostgreSQL date_trunc() function to truncate a timestamp or interval to a specified level of precision.

Introduction to the PostgreSQL date_trunc function

The date_trunc function truncates a TIMESTAMP or an INTERVAL value based on a specified date part (e.g., hour, week, or month) and returns the truncated timestamp or interval with that level of precision.

The following illustrates the syntax of the date_trunc function:

date_trunc('datepart', field)

The datepart argument is the level of precision used to truncate the value; it can be one of the following:
  • millennium
  • century
  • decade
  • year
  • quarter
  • month
  • week
  • day
  • hour
  • minute
  • second
  • milliseconds
  • microseconds

The field argument is a TIMESTAMP or an INTERVAL value to truncate. It could be an expression that evaluates to a timestamp or an interval.

The date_trunc function returns a TIMESTAMP or an INTERVAL value.

PostgreSQL date_trunc examples

The following example truncates a TIMESTAMP value to the hour date part:

SELECT DATE_TRUNC('hour', TIMESTAMP '2017-03-17 02:09:30');

The date_trunc function returns 2017-03-17 02:00:00, i.e., the timestamp truncated to hour precision.

If you want to truncate a TIMESTAMP value to the minute, you pass the 'minute' string as the first argument; the function then returns a TIMESTAMP whose precision level is the minute.
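For example:

SELECT DATE_TRUNC('minute', TIMESTAMP '2017-03-17 02:09:30');

2017-03-17 02:09:00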

See the rental table in the sample database:

[Image: rental table - PostgreSQL date_trunc function demo]

You  can count the number of rentals by month by using the date_trunc function as follows:
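The sketch below assumes the standard dvdrental sample schema, where the rental table has rental_id and rental_date columns:

SELECT DATE_TRUNC('month', rental_date) AS month,
       COUNT(rental_id)
FROM rental
GROUP BY month
ORDER BY month;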

In this query, the date_trunc function truncates the rental date to the month, the COUNT function counts the number of rentals, and the GROUP BY clause groups the rentals by month.

Similarly, you can count the number of rentals by staff per year.
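Again assuming the sample schema (staff_id is taken to be a column of rental), a query along these lines works:

SELECT staff_id,
       DATE_TRUNC('year', rental_date) AS year,
       COUNT(rental_id)
FROM rental
GROUP BY staff_id, year
ORDER BY staff_id, year;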

In this tutorial, you have learned how to use the PostgreSQL date_trunc function to truncate a timestamp or an interval value.

PostgreSQL Transaction

In PostgreSQL, as in life, don’t wait too long to commit

Source

 

I recently helped a client with serious PostgreSQL problems. Their database is part of an application that configures and monitors large networks; each individual network node reports its status to the database every five minutes. The client was having problems with one table in particular, containing about 25,000 rows — quite small, by modern database standards, and thus unlikely to cause problems.

However, things weren’t so simple: This table was growing to more than 20 GB every week, even though autovacuum was running. In response, the company established a weekly ritual: Shut down the application, run a VACUUM FULL, and then restart things. This solved the problem in the short term — but by the end of the week, the database had returned to about 20 GB in size. Clearly, something needed to be done about it.

I’m happy to say that I was able to fix this problem, and that the fix wasn’t so difficult. Indeed, the source of the problem might well be obvious to someone with a great deal of PostgreSQL experience. That’s because the problem mostly stemmed from a failure to understand how PostgreSQL’s transactions work, and how they can influence the functioning of VACUUM, the size of your database, and the reliability of your system. I’ve found that this topic is confusing to many people who are new to PostgreSQL, and I thus thought that it would be useful to walk others through the problem, and then describe how we solved it.

PostgreSQL, like many other modern relational databases, uses a technology known as multiversion concurrency control (“MVCC”).  MVCC reduces the number of exclusive locks placed on a database’s rows, by keeping around multiple versions of each row. If that sounds weird, then I’ll try to explain. Let’s start with a very simple table:

CREATE TABLE Foo (
   id SERIAL PRIMARY KEY,
   x  INTEGER NOT NULL
 );

Now let’s add 1 million rows to that table:

INSERT INTO Foo (x) VALUES (generate_series(1,1000000));

Now, how much space does this table take up on disk? We can use the built-in pg_relation_size function, but that’ll return a large integer representing a number of bytes. Fortunately, PostgreSQL also includes pg_size_pretty, which turns a number into something understandable by computer-literate humans:

SELECT pg_size_pretty(pg_relation_size('foo'));

35 MB

What happens to the size of our table if we UPDATE each row, incrementing x by 1?

UPDATE Foo SET x = x + 1;
SELECT pg_size_pretty(pg_relation_size('foo'));

69 MB

If you’re new to MVCC, then the above is probably quite surprising. After all, you might expect an UPDATE query to change each of the existing 1 million rows, keeping the database size roughly identical.

However, this is not at all the case: Rather than replace or update existing rows, UPDATE in an MVCC system adds new rows. Indeed, in an MVCC-based system, our queries never directly change or remove any rows. INSERT adds new rows, DELETE marks rows as no longer needed, and UPDATE performs both an INSERT and a DELETE.

That’s right: In an MVCC system, UPDATE doesn’t change existing rows, although we might have the illusion of this occurring. Every time you UPDATE every row of a table, you’re roughly doubling its size.

At this point, many newcomers to PostgreSQL are even more confused: How can it be that deleted rows don’t go away? And why would UPDATE work in such a crazy way? And how can all of this possibly help me?

The key to understanding what’s happening is to realize that once a row has been added to the system, it cannot be changed. However, its space can be reused by new rows, if the old row is no longer accessible by any transaction.

This starts to make a bit more sense if you realize that every transaction in PostgreSQL has a unique ID number, known as the “transaction ID.” This number constantly increases, providing us with a running count of transactions in the system. Every time we INSERT a row, PostgreSQL records the first transaction ID from which the row should now be visible, in a column known as “xmin.” You can think of xmin as the row’s birthday. You cannot normally see xmin, but PostgreSQL will show it to you, if you ask for it explicitly:

SELECT  xmin, * FROM Foo WHERE id < 5 ORDER BY id;

│  xmin  │ id │ x │

│ 187665 │  1 │ 2 │
│ 187665 │  2 │ 3 │
│ 187665 │  3 │ 4 │
│ 187665 │  4 │ 5 │

In an MVCC-based system, DELETE doesn’t really delete a row.  Rather, it indicates that a row is no longer available to future transactions. We can see an initial hint of this if, when we SELECT the contents of a table, we ask for not only xmin, but also xmax — the final transaction (or “expiration date,” if you like) for which a row should be visible:

SELECT  xmin, xmax, * FROM Foo WHERE id < 5 ORDER BY id;

│  xmin  │ xmax │ id │ x │

│ 187665 │    0 │  1 │ 2 │
│ 187665 │    0 │  2 │ 3 │
│ 187665 │    0 │  3 │ 4 │
│ 187665 │    0 │  4 │ 5 │

We can see that each of these rows was added in the same transaction ID (187665). Moreover, all of these rows are visible, since their xmax value is 0. (Don’t worry; we’ll see other xmax values in a little bit.) If I update some of these rows, we’ll see that their xmin value will be different:

UPDATE Foo SET x = x + 1 WHERE id IN (1,3);
SELECT  xmin, * FROM Foo WHERE id < 5 ORDER BY id;

│  xmin  │ id │ x │

│ 187668 │  1 │ 3 │
│ 187665 │  2 │ 3 │
│ 187668 │  3 │ 5 │
│ 187665 │  4 │ 5 │

The rows with ID 1 and 3 were added in transaction 187668, whereas the rows with ID 2 and 4 were added in an earlier transaction, number 187665.

Since I’ve already told you that rows cannot be changed by a query, this raises an interesting question: What happened to the rows with IDs 1 and 3 that were added in the first transaction, number 187665?

The answer is: Those rows still exist in the system! They have not been modified or removed. But the system did mark them as having an xmax of 187668, the transaction ID in which the new, replacement rows are now available.

Transaction IDs only go up. But you can imagine a hypothetical time-traveler’s database, in which you could move the transaction ID down, and see rows from an earlier era. (Somehow, I doubt that this would be a very popular television series.)

The thing is, we don’t need a hypothetical, time-traveler’s database in order to see such behavior. All we need is to open two separate sessions to our PostgreSQL database at the same time, and have each session open a transaction.

You see, when you BEGIN a new transaction, your actions are hidden from the rest of the world — and similarly, modifications made elsewhere in the database are hidden from you. You can thus have a situation in which two different sessions see completely different versions of the same row, thanks to their different transaction IDs. This might sound crazy, but it actually helps the database to run smoothly, with a minimum of conflicts and locks.

So, let’s open two different PostgreSQL sessions, and take a look at what happens.  In order to distinguish between them, I used the \set command in psql to make PROMPT1 either “Session A>” or “Session B>”. Here’s session A:

Session A> BEGIN;
Session A> select  xmin, xmax, * from foo where id < 5 order by id;

│  xmin  │ xmax │ id │ x │

│ 187668 │    0 │  1 │ 3 │
│ 187665 │    0 │  2 │ 3 │
│ 187668 │    0 │  3 │ 5 │
│ 187665 │    0 │  4 │ 5 │

And here is session B:

Session B> BEGIN;
Session B> select  xmin, xmax, * from foo where id < 5 order by id;

│  xmin  │ xmax │ id │ x │

│ 187668 │    0 │  1 │ 3 │
│ 187665 │    0 │  2 │ 3 │
│ 187668 │    0 │  3 │ 5 │
│ 187665 │    0 │  4 │ 5 │

At this point, they are identical; both sessions see precisely the same rows, with the same xmin and xmax. Now let’s modify the values of two rows in session A:

Session A> update foo set x = x + 1 where id in (1,3);

What do we see now in session A?

Session A> select xmin, xmax, * from foo where id < 5 order by id;

│  xmin  │ xmax │ id │ x │

│ 187670 │    0 │  1 │ 4 │
│ 187665 │    0 │  2 │ 3 │
│ 187670 │    0 │  3 │ 6 │
│ 187665 │    0 │  4 │ 5 │

Just as we would expect, we have two new rows (IDs 1 and 3), and two old rows (IDs 2 and 4), which we can distinguish via the xmin value.  In all cases, the rows have an xmax value of 0.

Let’s return to session B, and see what things look like there.  (Note that I haven’t committed any transactions here!)

Session B> select  xmin, xmax, * from foo where id < 5 order by id;

│  xmin  │  xmax  │ id │ x │

│ 187668 │ 187670 │  1 │ 3 │
│ 187665 │      0 │  2 │ 3 │
│ 187668 │ 187670 │  3 │ 5 │
│ 187665 │      0 │  4 │ 5 │

That’s right — the rows that session B sees are no longer the rows that session A sees.  Session B continues to see the same rows as were visible when it started the transaction. We can see that the xmax of session B’s old rows is the same as the xmin of session A’s new rows. That’s because any transaction after 187670 will see the new row (i.e., with an xmin of 187670), but any transaction started before 187670 will see the old row (i.e., with an xmax of 187670).

I hope that the picture is now coming together: When you INSERT a new row, its xmin is the current transaction ID, and its xmax is 0.  When you DELETE a row, its xmax is changed to be the current transaction ID.  And when you UPDATE a row, PostgreSQL does both of these actions.  That’s why the UPDATE I did earlier, on all of the rows in the table, doubled its size.  (More on that in a moment.)

This allows different transactions to work independently, without having to wait for locks. Imagine if session B wants to look at row with ID 3; because of MVCC, it doesn’t need to wait to read it.  Heck, session B can even change the row with ID 3, although it’ll hang until session A commits or rolls back.

And indeed, another benefit of MVCC is that rollbacks are trivially easy to implement: If we COMMIT session A and roll back session B, then session A’s view of the world will be the current one, for all following transactions.  But if we ROLLBACK session A, then everything is fine; PostgreSQL keeps track of the fact that session A was rolled back, and that we should actually be using (until further notice) the rows with an xmax of 187670.

As you can imagine, this can lead to some trouble. If every UPDATE increases the number of rows, and every DELETE fails to free up any space, every PostgreSQL database will eventually run out of room, no? Well, yes — except for VACUUM, which is the garbage-collecting mechanism for PostgreSQL’s rows. VACUUM can be run manually, but it’s normally run via “autovacuum,” a program that runs in the background, and invokes VACUUM on a regular basis.

VACUUM doesn’t return space from your PostgreSQL database to the operating system. Rather, it identifies row versions that are no longer visible to any transaction (roughly, rows whose xmax is older than the oldest transaction still running), and marks their space as available for re-use. PostgreSQL thus keeps track of not only the current transaction ID, but also the ID of the earliest still-open transaction; so long as that transaction is open, it needs to be able to see the rows from that point in time.

Don’t confuse VACUUM with VACUUM FULL; the latter does indeed clear out space from a PostgreSQL database, but it takes an exclusive lock on each table while it’s doing so. If you have a serious 24/7 application, then VACUUM FULL is a very bad idea. The regular autovacuum daemon, albeit perhaps with a bit of configuration, should take care of identifying which rows can and should be marked for deletion.

In the case of my client’s problem, my initial thought was that autovacuum wasn’t running. However, a quick look at the pg_stat_user_tables view showed that autovacuum was running on a regular basis.
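A check along these lines against that standard statistics view shows when each table was last (auto)vacuumed and roughly how many dead row versions it is carrying:

SELECT relname, last_vacuum, last_autovacuum, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;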

However, VACUUM will only mark a row as eligible for reuse if no other transaction can see it. (After all, a session with an open transaction might still need to see those old row versions.) This means that if a transaction remains open from the time the system starts up, VACUUM will never identify any rows as eligible for reuse, because there is always an open transaction whose ID is lower than the xmax of the deleted rows.

And this is precisely what happened to my client: It turns out that their application, at startup, was opening several transactions with BEGIN statements… and then never committing those transactions or rolling them back.  None of those transactions led to a lock in PostgreSQL, because they weren’t executing any query that would lead to a conflict. And autovacuum was both running and reporting that it ran on each table — because it had indeed done so.

But when it came time to mark the old rows as eligible for reuse, VACUUM ignored all of the rows in the entire database, including our 23,000-row table — because of the transactions that the application had opened at startup. It didn’t matter how many times we ran VACUUM, whether it was done manually or automatically, or what else was running at the time: All of the old versions of every row in our table were kept around by PostgreSQL, leading to massive disk usage and database slowness. Stopping the application had the effect of closing those transactions, and thus allowing VACUUM FULL to do its magic, albeit at the cost of system downtime. And because the application opened the transactions at startup, any fix was bound to be a temporary one.

My client was a bit surprised to discover that the entire database could encounter such problems from an open transaction that was neither committed nor rolled back. However, I hope that you now see why the space wasn’t reclaimed, and why the database acted as it did.

The solution, of course, was to get rid of those three open, empty transactions at startup; once we did that, everything ran smoothly.
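If you suspect the same problem on your own system, a query along these lines (column names as in PostgreSQL 9.2 and later) will surface sessions that opened a transaction and then went quiet:

SELECT pid, usename, state, xact_start, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;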

Transactions are a brilliant and useful invention, but in PostgreSQL (and other MVCC systems), you shouldn’t be holding onto open transactions for too long. Doing so might have surprisingly unpleasant consequences. Commit or rollback as soon as you can — or you might find yourself with a database accumulating gigabytes full of old, unnecessary rows.

Partitioning in PostgreSQL

Inheritance

Table inheritance allows you to extract a common set of columns into a parent (master) table, with children defining additional fields.

create table articles (id serial, title varchar, content text);
create table articles_w_tags (tags text[]) inherits (articles);
create table articles_wo_tags () inherits (articles);

Let’s insert some data

insert into articles_wo_tags (title, content)
    values ('Title 1', 'Content 1'),
           ('Title 2', 'Content 2');
insert into articles_w_tags (title, content, tags)
    values ('Title 3', 'Content 3', '{"tag_1", "tag_2"}'::text[]),
           ('Title 4', 'Content 4', '{"tag_2", "tag_3"}'::text[]);

Let’s perform a select query on each of these tables.

select * from articles_wo_tags;
 id |  title  |  content
----+---------+-----------
  1 | Title 1 | Content 1
  2 | Title 2 | Content 2
select * from articles_w_tags;
 id |  title  |  content  |     tags
----+---------+-----------+---------------
  3 | Title 3 | Content 3 | {tag_1,tag_2}
  4 | Title 4 | Content 4 | {tag_2,tag_3}

When querying the master table, the query covers all rows of the master table plus all of its child tables, and only the common set of columns is displayed. The ONLY keyword can be used to indicate that the query should apply only to a particular table and not to any tables below it in the inheritance hierarchy (see the example after the next query).

select * from articles;
 id |  title  |  content
----+---------+-----------
  3 | Title 3 | Content 3
  4 | Title 4 | Content 4
  1 | Title 1 | Content 1
  2 | Title 2 | Content 2
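Since the rows above were inserted through the child tables, the parent itself stores nothing of its own, which you can confirm with only:

select * from only articles;
 id | title | content
----+-------+---------
(0 rows)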

Changes performed on the master table are propagated to the children.

update articles set content = content || ' Changed';
select * from articles_w_tags;
 id |  title  |      content      |     tags
----+---------+-------------------+---------------
  3 | Title 3 | Content 3 Changed | {tag_1,tag_2}
  4 | Title 4 | Content 4 Changed | {tag_2,tag_3}

Partitioning

Table partitioning means splitting a table into smaller pieces. It provides various performance benefits for tables that hold large amounts of data, e.g. when the size of a table is about to exceed the physical memory of the database server.

PostgreSQL allows table partitioning via table inheritance. Each partition must be created as a child table of a single parent table (which remains empty and exists only to represent the whole data set).

PostgreSQL implements range and list partitioning methods. The former is done with a range defined by a column or set of columns with no overlap between the ranges. The latter is done by explicitly listing which key values appear in each partition.

Let’s start by creating a parent table called logs.

create table logs (created_at timestamp without time zone default now(),
                   content text);

We will partition the table by date into the four quarters of the year. (Note that, for simplicity, the ranges below use inclusive day bounds; for real timestamp data you would typically use an exclusive upper bound such as created_at < date '2014-04-01', so that timestamps later in the day of March 31 do not fall between partitions.)

create table logs_q1
    (check (created_at >= date '2014-01-01' and created_at <= date '2014-03-31'))
    inherits (logs);
create table logs_q2
    (check (created_at >= date '2014-04-01' and created_at <= date '2014-06-30'))
    inherits (logs);
create table logs_q3
    (check (created_at >= date '2014-07-01' and created_at <= date '2014-09-30'))
    inherits (logs);
create table logs_q4
    (check (created_at >= date '2014-10-01' and created_at <= date '2014-12-31'))
    inherits (logs);

The next step is to create an index on the key column of each child table.

create index logs_q1_created_at on logs_q1 using btree (created_at);
create index logs_q2_created_at on logs_q2 using btree (created_at);
create index logs_q3_created_at on logs_q3 using btree (created_at);
create index logs_q4_created_at on logs_q4 using btree (created_at);

Next, let’s create a trigger function to dispatch the data among child tables.

create or replace function on_logs_insert() returns trigger as $$
begin
    if ( new.created_at >= date '2014-01-01' and new.created_at <= date '2014-03-31') then
        insert into logs_q1 values (new.*);
    elsif ( new.created_at >= date '2014-04-01' and new.created_at <= date '2014-06-30') then
        insert into logs_q2 values (new.*);
    elsif ( new.created_at >= date '2014-07-01' and new.created_at <= date '2014-09-30') then
        insert into logs_q3 values (new.*);
    elsif ( new.created_at >= date '2014-10-01' and new.created_at <= date '2014-12-31') then
        insert into logs_q4 values (new.*);
    else
        raise exception 'created_at date out of range';
    end if;

    return null;
end;
$$ language plpgsql;

Let’s attach the trigger function defined above to the logs table.

create trigger logs_insert
    before insert on logs
    for each row execute procedure on_logs_insert();

Finally, let’s insert some data into the logs table to see the partitioning at work.

insert into logs (created_at, content)
    values (date '2014-02-03', 'Content 1'),
           (date '2014-03-11', 'Content 2'),
           (date '2014-04-13', 'Content 3'),
           (date '2014-07-08', 'Content 4'),
           (date '2014-10-23', 'Content 5');
select * from logs_q1;
     created_at      |  content
---------------------+-----------
 2014-02-03 00:00:00 | Content 1
 2014-03-11 00:00:00 | Content 2
select * from logs_q2;
     created_at      |  content
---------------------+-----------
 2014-04-13 00:00:00 | Content 3
select * from logs_q3;
     created_at      |  content
---------------------+-----------
 2014-07-08 00:00:00 | Content 4
select * from logs_q4;
     created_at      |  content
---------------------+-----------
 2014-10-23 00:00:00 | Content 5

A short article that dives a bit deeper into partitioning in PostgreSQL

Partitioning refers to splitting a large table into smaller pieces. This article covers the basics of partitioning in PostgreSQL.

Currently, PostgreSQL supports range and list partitioning via table inheritance. Basically, you have to create each partition as a child table of the master table. Generally, if you want to split data into specific ranges, then use range partitioning.

Date and timestamp values are good examples for this. For example, you may want to split yearly data into quarterly partitions. Similarly, if you want to split data based on a specific list of values, then use list partitioning. For this, you can consider creating partitions for each state of a country. Generally, in range partitioning, you would be adding more partitions over time, whereas in list partitioning the data will be continuously growing within each partition. This design decision is very important from the partition maintenance point of view, which most people ignore.

A dedicated syntax for creating partitioned tables (like the one found in some commercial RDBMSs) is not available in PostgreSQL, so you may wonder exactly how to set up partitioning. You will be glad to know that PostgreSQL uses the concepts of inheritance, triggers and constraint exclusion to support partitioning. Additionally, you can also use rules instead of triggers, as sketched below.
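For example, a rule-based version of the insert redirection might look like the following sketch (it assumes the same orders table and orders_part_2011 partition used in the trigger example below):

CREATE RULE orders_insert_2011 AS
    ON INSERT TO orders
    WHERE (order_date >= DATE '2011-01-01' AND order_date < DATE '2012-01-01')
    DO INSTEAD
    INSERT INTO orders_part_2011 VALUES (NEW.*);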

You would also need to take care of quoting and NULL values if the trigger function builds its INSERT statements dynamically; that is the most cumbersome part of such an easy-to-write trigger function, so let’s try out another way of writing it:

CREATE OR REPLACE FUNCTION orders_insert() RETURNS TRIGGER AS
$$
BEGIN
    IF (NEW.order_date >= DATE '2011-01-01' AND NEW.order_date < DATE '2012-01-01') THEN
       INSERT INTO orders_part_2011 VALUES (NEW.*);
   ELSIF (NEW.order_date < DATE '2011-01-01') THEN
       INSERT INTO orders_part_2010 VALUES (NEW.*);
   ELSE
       RAISE EXCEPTION 'Date out of range. check orders_insert() function!';
   END IF;
   RETURN NULL;
END;
$$
LANGUAGE plpgsql;
CREATE TRIGGER orders_insert_trigger
BEFORE INSERT ON orders
FOR EACH ROW EXECUTE PROCEDURE orders_insert();

As you can see, we are now explicitly checking order_date to redirect data into the appropriate partitions. This doesn’t need any handling of quoting or NULL values, at least for non-partition-key columns. I have simply used NEW.* instead of referring to individual column values.

In case you are not able to create all partitions beforehand, you would need to update the above trigger function with the logic to create partitions on-the-fly. Generally, we use separate triggers for partition creation and data redirection. Obviously, the partition creation trigger should always be fired before the redirection trigger. You should take care to avoid race conditions between those two triggers.

Now that the trigger is in place, let’s insert some records and verify the setup:

pg=# INSERT INTO orders VALUES(1, 'pune', '2011-08-22');
INSERT 0 0
pg=# INSERT INTO orders VALUES(2, 'pune', '2010-02-22');
INSERT 0 0
pg=# UPDATE orders SET address = 'bengaluru' WHERE id = 2;
UPDATE 1

Do not panic on seeing the INSERT statement messages — they simply convey that zero records were inserted in the base (master) table. Actually, the records were transparently inserted into the related partitions.

Another thing to note about the UPDATE statement is that you are not supposed to update the partitioning key. Basically, any change to the partitioning key value might require moving that record to another partition, which is termed row movement.

A simple way to handle queries that do update the partitioning key is to capture the UPDATE query, delete the related record from the partition, and then fire an INSERT on the base table. Then the partitioning mechanism will kick in and redirect the record to the appropriate partition. This is not handled in the current setup, as I am updating the address value, which is a non-partition-key column.

Let’s now check where the records are stored:

SELECT * FROM orders;
 id |  address  |     order_date
----+-----------+--------------------
  1 | pune      | 22-AUG-11 00:00:00
  2 | bengaluru | 22-FEB-10 00:00:00
(2 rows)

PostgreSQL knows about the child tables of the orders table, so it assumes that the user wants all the data, from the parent as well as all the children. The partitions are normal tables, and you can query them as usual:

SELECT * FROM orders_part_2011;
 id | address |     order_date
----+---------+--------------------
  1 | pune    | 22-AUG-11 00:00:00
(1 row)
SELECT * FROM orders_part_2010;
 id |  address  |     order_date
----+-----------+--------------------
  2 | bengaluru | 22-FEB-10 00:00:00
(1 row)

You can also check if the master table is really empty, by using the ONLY clause, which restricts the lookup to only the table specified in the statement:

SELECT * FROM ONLY orders;
 id | address | order_date
----+---------+------------
(0 rows)

Querying over partitions

Use the EXPLAIN feature to check the plan for querying over partitions:

EXPLAIN SELECT * FROM orders WHERE order_date = '02-JAN-11';
                                       QUERY PLAN
-----------------------------------------------------------------
 Result  (cost=0.00..26.01 rows=7 width=40)
   ->  Append  (cost=0.00..26.01 rows=7 width=40)
         ->  Seq Scan on orders  (cost=0.00..23.75 rows=6 width=44)
               Filter: (order_date = '02-JAN-11 00:00:00'::timestamp without time zone)
         ->  Seq Scan on orders_part_2011 orders  (cost=0.00..2.26 rows=1 width=18)
               Filter: (order_date = '02-JAN-11 00:00:00'::timestamp without time zone)
(6 rows)

In the above output, you can see that, apart from the (always empty) parent table, only one partition was scanned, based on the WHERE clause conditions. Let’s look at another example:

EXPLAIN SELECT * FROM orders WHERE order_date = now();
                                     QUERY PLAN
-----------------------------------------------------------------
 Result  (cost=0.00..30.03 rows=8 width=41)
   ->  Append  (cost=0.00..30.03 rows=8 width=41)
         ->  Seq Scan on orders  (cost=0.00..26.50 rows=6 width=44)
               Filter: (order_date = now())
         ->  Seq Scan on orders_part_2011 orders  (cost=0.00..2.51 rows=1 width=18)
               Filter: (order_date = now())
         ->  Seq Scan on orders_part_2010 orders  (cost=0.00..1.01 rows=1 width=44)
               Filter: (order_date = now())
(8 rows)

Here all the partitions are scanned — definitely not what we wanted! You should be aware of the fact that the planner analyses the query before the values from the parameters or stored procedures are substituted. As the planner does not know the exact value of now() during the planning phase, it cannot prune partitions, and so scans all the partitions. You need to look out for such cases where constant values are expected. In case you are planning to use functions in the WHERE clause, do make sure to understand the various types of functions that can be created in PostgreSQL.
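Partition pruning in this inheritance-based scheme also relies on the constraint_exclusion setting; its default value of partition (on reasonably recent releases) makes the planner consider CHECK constraints for queries on inheritance children, which is what we need here:

SHOW constraint_exclusion;
SET constraint_exclusion = partition;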

Let us now go through the finer details of the three features used in PostgreSQL partitioning.

Constraint exclusion

Constraint exclusion works with only range or equality check constraints. It might not work for constraints like the following:

ALTER TABLE product_items_j ADD CONSTRAINT chk_item_name CHECK (item_name LIKE 'P%');

More importantly, the WHERE condition should be similar to the CHECK constraints. For example, if you have the following CHECK constraint:

ALTER TABLE product_items_j ADD CONSTRAINT chk_item_name CHECK (item_name BETWEEN 'P' AND 'PB');

Then constraint exclusion would not help for the following query, since the WHERE condition is not similar to the CHECK constraint:

SELECT item_name FROM product_items WHERE item_name LIKE 'Q%' ;

To get the constraint exclusion working, you need this form of query:

SELECT item_name FROM product_items WHERE item_name = 'Pen';

From the above examples, it should be clear that the user needs to take extra effort while handling the WHERE clause predicates, in order to hit the CHECK constraints. The other major problem with this setup is that there is no automatic way to verify if all the CHECK constraints are mutually exclusive. You need to be extra careful while setting them up.

The EXPLAIN feature comes in very handy to tackle these issues. Basically, any change to the query or table, even the slightest one, should be rigorously followed by looking at the EXPLAIN output. If you do not see the expected plan, then either the WHERE clause or the CHECK constraints need to be looked into.

Inheritance

The important rule that you should always be aware of is that child tables inherit only column DEFAULT values and NOT NULL and CHECK constraints. Any other constraints, like UNIQUE, PRIMARY KEY and FOREIGN KEY, will not be inherited. Also, indexes, ownership and permissions will not be inherited.

One has to set up the non-inherited constraints on all the child tables. However, there are various ways of setting up these constraints, which we will discuss in the very next section. The child tables cannot rename inherited columns, but can add new columns — this is the biggest advantage of using inheritance.

You should also note that one can enable or disable inheritance on child tables by using the ALTER TABLE command. This is, in fact, very useful when you want to remove or move partitions from the base table.
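For example, detaching a partition from the base table (and attaching it again later) is just a matter of toggling inheritance; a sketch using the orders example above:

ALTER TABLE orders_part_2010 NO INHERIT orders;  -- detach the partition; its data stays in place
ALTER TABLE orders_part_2010 INHERIT orders;     -- attach it again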

Using inheritance, you can have multiple levels of partitioning. You can always create a great-great-grandchild of a partition, but you might experience a performance loss with this kind of setup, since more base tables are involved, directly or indirectly. So generally, 1 or 2 levels for partitions and sub-partitions should be good. If you are going any deeper, then most probably you need to rethink the partitioning scheme employed.

Uniqueness

PostgreSQL uses indexes for supporting uniqueness. However, there are no multi-table indexes, which means that an index cannot span over more than one table. This is the reason for not having any primary key in our setup. In general, you might want to concentrate on indexing the partitioning key columns. For this, create non-overlapping CHECK constraints and then create a unique index on partitioning key columns over each partition. This is almost like having unique values over all partitions.

For indexing the non-partitioning-key columns, you create unique indexes over each partition. However, for verifying uniqueness, you need custom functions to scan all partitions, which would hopefully use the related indexes.

As you must have realised, this scanning of all partitions was not what we wanted, to begin with. Also, due to the lack of multi-table indexes, maintaining a primary key or a foreign key is technically not possible.

Consider that you created a primary key on the id column of each child table. Now it may happen that both orders_part_2011 and orders_part_2010 have an id value of 1. Even though the uniqueness constraint within each child table is upheld, the base table orders now has duplicate id values.

To overcome this, you can create an additional lookup table, with primary keys from the base as well as child tables. Basically, you need to use triggers to keep the lookup table updated as you insert, update or delete records from the related partitioned table. Now all the primary and foreign key constraints can be mapped to this lookup table.

That’s it! These are the basic setup issues, and you should now be able to work through them, ready to face the next challenge: performance tuning for partitions.

What’s around the corner?

The PostgreSQL community does have several discussions on this feature. For more details, you can visit this wiki entry on partitioning.

Backup and Restore Database in PostgreSQL

source

The PostgreSQL database server provides the pg_dump and psql utilities for backing up and restoring databases. This article describes various ways to use the pg_dump command to back up a database. You will also learn how to restore a database backup.


Below are some connection options that you can use to connect to a remote or password-protected server with all of the commands given in this article; a combined example follows the list.

  • -d, --dbname=DBNAME database name
  • -h, --host=HOSTNAME database server hostname or ip
  • -p, --port=PORT database server port number (default: 5432)
  • -U, --username=NAME connect as specified database user
  • -W, --password force password prompt
  • --role=ROLENAME do SET ROLE before dump
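For example, to back up a database on a remote server (the host address here is just a placeholder), combine these options with the commands below:

$ pg_dump -h 192.168.1.10 -p 5432 -U postgres -d mydb > mydb.pgsql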

1. Backup and Restore Single Database

Backup: a single database in PostgreSQL. Replace mydb with your actual database name.

$ pg_dump -U postgres -d mydb > mydb.pgsql

Restore: single database backup in PostgreSQL.

$ psql -U postgres -d mydb < mydb.pgsql

2. Backup and Restore All Databases

Backup: all databases in PostgreSQL using pg_dumpall utility.

$ pg_dumpall -U postgres > alldbs.pgsql

Restore: all databases from the backup using the following command.

$ psql -U postgres < alldbs.pgsql

3. Backup and Restore Single Table

Backup: a single table named mytable from mydb database.

$ pg_dump -U postgres -d mydb -t mytable > mydb-mytable.pgsql

Restore: a single table backup to the database. Make sure the backup file contains only the single table you want to restore.

$ psql -U postgres -d mydb < mydb-mytable.pgsql

4. Compressed Backup and Restore Database

Backup: PostgreSQL database in compressed format.

$ pg_dump -U postgres -d mydb | gzip > mydb.pgsql.gz

Restore: database from compressed backup file directly.

$ gunzip -c mydb.pgsql.gz | psql -U postgres -d mydb

5. Split Backup in Multiple Files and Restore

Backup: dump the PostgreSQL database and split the backup into multiple files of a specified size. This helps when backing up a large database and transferring it to another host. In the example below, the backup is split into files of 100 MB each.

$ pg_dump -U postgres -d mydb | split -b 100m - mydb.pgsql

Restore: the database backup from the multiple split backup files.

$ cat mydb.pgsql* | psql -U postgres -d mydb

Backup: the database in compressed split files of a specified size.

$ pg_dump -U postgres -d mydb | gzip | split -b 100m - mydb.pgsql.gz

Restore: the database from the multiple compressed split files.

$ cat mydb.pgsql.gz* | gunzip | psql -U postgres -d mydb