Tuesday, October 30, 2012

PostgreSQL tuning for MySQL admins


You can get optimum performance from your database by tuning your system in three areas: the hardware, the database, and the database server. Each increasingly more specialized than the last, with the tuning of the actual database server being unique to the software being used. If you're already familiar with tuning MySQL databases, you'll find tuning a PostgreSQL database server to be similar, but with some key differences to watch out for.
Before tuning your PostgreSQL database server, work on optimizing some of the key factors in the hardware and the database. All databases, of all types including PostgresSQL and MySQL, are ultimately limited by the I/O, memory, and processing capabilities of the hardware. The more a server has of each of these, the greater performance it's capable of. Using fast disks with hardware RAID is essential for a busy enterprise database server, as is having large amounts of memory. For the best results the server needs to have enough memory to cache the most commonly used tables without having to go to the disk. Under no circumstances should the server start swapping to hard disk. Similarly, the faster the CPU the better; for servers handling multiple simultaneous transactions, multicore CPUs are best.
On the software side, you can optimize both the actual database structure and frequently used queries. Be sure to create appropriate indexes. As with MySQL, primary key indexes are essential, and unique indexes offer advantages for data integrity and performance. Also, all full-text searches should have the correct indexes. Unlike MySQL, it is possible to build indexes while the database fulfills read and write requests. Look at the CONCURRENTLY option on the CREATE INDEX command, which allows the index to be built without taking any locks that prevent concurrent inserts, updates, or deletes on the table.
Even though an index has been created, PostgreSQL may not necessarily use it! PostgreSQL has a component called the planner that analyzes any given query and decides which is the best way to perform the requested operations. It decides between doing an index-based search or a sequential scan. In general, the planner does a good job of deciding which is the most effective way to resolve a query.
Let's see how this works in practice. Here is a simple table and some data:
CREATE TABLE birthdays (
id serial primary key,
firstname varchar(80),
surname varchar(80),
dob date

INSERT INTO birthdays (firstname, surname, dob) VALUES ('Fred', 'Smith', '1989-05-02');
INSERT INTO birthdays (firstname, surname, dob) VALUES ('John', 'Jones', '1979-03-04');
INSERT INTO birthdays (firstname, surname, dob) VALUES ('Harry', 'Hill', '1981-02-11');
INSERT INTO birthdays (firstname, surname, dob) VALUES ('Bob', 'Browne', '1959-01-21');
Use the EXPLAIN command to see what the planner will decide when executing any given query:
EXPLAIN select * from birthdays;

                          QUERY PLAN
 Seq Scan on birthdays  (cost=0.00..12.00 rows=200 width=364)
This tells us that since all the data is being requested, PostgreSQL will use a sequential scan (Seq Scan). If the query uses the primary key (id) then the planner tries a different approach:
 EXPLAIN select * from birthdays where id=2;

                                    QUERY PLAN
 Index Scan using birthdays_pkey on birthdays  (cost=0.00..8.27 rows=1 width=364)
This time it favored an Index Scan. Still, just because an index exists it doesn't mean the planner will decide to use it. Doing a search for a particular date of birth will (without the index) do a sequential scan:
EXPLAIN select * from birthdays where dob='1989-05-02';

                                    QUERY PLAN
 Seq Scan on birthdays  (cost=0.00..1.10 rows=1 width=364)
If you created an index using the command CREATE INDEX dob_idx ON birthdays(dob); and then ran the EXPLAIN command again, the result would be the same – a sequential scan would still be used. The planner makes this decision based on various table statistics, including the size of the dataset, all of which are not (by default) collected automatically. Without the latest stats, the planner's decisions will be less than perfect. Therefore, when you create an index or insert large amounts of new data, you should run the ANALYZE command to collect the latest statistics and improve the planner's decisionmaking.
You can force the planner to use the index (if it exists) using the SET enable_seqscan = off; command:
SET enable_seqscan = off;

EXPLAIN select * from birthdays where dob='1989-05-02';
                                QUERY PLAN
 Index Scan using dob_idx on birthdays  (cost=0.00..8.27 rows=1 width=364)
Turning off sequential scans might not improve performance, as index scans, for a large number of results, can be more I/O-intensive. You should test the performance differences before deciding to disable it permanently.
The EXPLAIN command can check how queries are performed and find bottlenecks on the actual database structure. It also has an ANALYZE option that performs queries and shows the actual run times. Here is the same query, but this time with the ANALYZE option:
EXPLAIN ANALYZE select * from birthdays where dob='1989-05-02';

                                             QUERY PLAN


 Seq Scan on birthdays  (cost=0.00..1.09 rows=1 width=19)
        (actual time=0.007..0.008 rows=1 loops=1)
The results now contain extra information showing the actual results returned. Unfortunatly it isn't possible to compare the "actual time" and "cost" fields, as they are measured differently, but if the rows match, or are close, it means that the planner correctly estimated the work load.
One other piece of routine maintenance that affects performance is the clearing up of unused data left behind in the database after updates and deletes. When PostgreSQL deletes a row, the actual data may still reside in the database, marked as deleted and not used by the server. This makes deleting fast, but the unused data needs to be removed at some point. Using the VACUUM command removes this old data and frees up space. The PostgreSQL documentation explains how to set up autovacuum, which automates the execution of VACUUM and ANALYZE commands.

Tweaking the PostgreSQL server parameters

The /var/lib/pgsql/data/postgresql.conf file contains the configuration parameters for the PostgreSQL server, and defines how various resources are allocated. Altering parameters in this file is similar to setting MySQL server system variables, either from the command-line options or via the MySQL configuration files. Most of the parameters are best left alone, but modifying a few key items can improve performance. However, as with all resource-based configuration, setting items to unrealistic amounts will actually degrade performance; consider yourself warned.
  • shared_buffers configures the amount of memory allocated to hold queries before they are fed into the operating system's buffers. The precise metrics of how this parameter affects performance aren't clear, but increasing it from the default of 32MB to between 6-15% of available RAM should enhance performance. For a 4GB system, a value of 512MB should be sufficient.
  • effective_cache_size tells the planner about the size of the disk cache provided by the operating system. It should be at least a quarter of the total available memory, and setting it to half of system memory is considered a normal conservative setting.
  • wal_buffers is the number of disk page buffers in shared memory for writeahead logging. Setting this to around 16MB can improve the speed of WAL writes for large transactions.
  • work_mem is the amount of working memory available during sort operations. On systems that do a lot of sorting, increasing the work_mem parameter allows PostgreSQL to use memory for sorting rather than using the disk. The parameter is per-sort, which means if a client does two sorts in a query, the specified amount of memory will be used twice. A value of, say, 10MB used by 50 clients doing two sorts each would occupy just under 1GB of system memory. Given how quickly the numbers can add up, setting this parameter too high can consume memory unnecessarily, but you can see performance gains by increasing it from the default of 1MB in certain environments.
To change a parameter, edit the conf file with a text editor, then restart the PostgreSQL server using the command service postgresql restart.
One last item to watch involves PostgreSQL's logging system, which is useful when you're trying to catch errors or during application development. However, if the logs are written to the same disk as the PostgreSQL database, the system may encounter an I/O bottleneck as the database tries to compete for bandwidth with its own logging actions. Tune the logging options accordingly and consider logging to a separate disk.
In summary, you can improve your database server's performance if you run PostgreSQL on suitable hardware, keep it routinely maintained, and create appopriate indexes. Changing some of the database server configuration variables can also boost performance, but always test your database under simulated load conditions before enabling any such changes in a production environment.

No comments:

Post a Comment