Tuesday, January 14, 2014

How to use PostgreSQL Foreign Data Wrappers for external data management


Often times, huge web projects use multiple programming languages and even multiple databases. While relational database management systems (RDBMS) are common, they have limitations when it comes to the management of highly variable data. For such applications, NoSQL databases are a better alternative. The PostgreSQL RDBMS now provides Foreign Data Wrappers (FDW) that let PostgreSQL query non-relational external data sources.
FDWs are drivers that allow PostgreSQL database administrators to run queries and get data from external sources, including other SQL databases (Oracle, MySQL), NoSQL databases(MongoDB, Redis, CouchDB), text files in CSV and JSON formats, and content from Twitter. A few of the wrappers, such as the one for Kyoto Tycoon, allow PostgreSQL to handle both read and write operations on remote data.
FDWs are based on the SQL Management of External Data (SQL/MED) standard, which supports SQL interfaces to remote data sources and objects. They have been officially supported since PostgreSQL 9.1. You can see a full list of released FDWs on the PostgreSQL wiki.

Strengths and weaknesses of relational and NoSQL databases

Why might you want to use FDWs when most web applications use relation databases on the back end? RDBMSes have been around for decades and are perfectly suitable to storing data whose structure is known in advance. Relational databases allow developers to create complex queries on data from multiple tables. They are secure and flexible when it comes to retrieving structured data, and they keep data consistent.
But RDBMSes are not suitable for storing data with huge variations in record structures or many hierarchical sublevels. NoSQL database models offer more freedom in data structure, simpler management, fewer system requirements, high scalability on multiple servers, and fast performance. They allow the storage of multidimensional structures with huge amount of data. On the minus side, however, they do not always preserve data consistency.
SQL databases maintain the properties of atomicity, consistency, isolation, and durability (ACID), which makes them a natural choice for storing important data such as financial transactions and accounts. By contrast, NoSQL databases are typically used to store less important data, such as server logs, or variable data that cannot be easily described in a structure during the design stage.
FDWs are designed to preserve PostgreSQL security and utilize the numerous features of PostgreSQL databases while taking advantage of the performance and scalability of NoSQL databases.

How to connect PostgreSQL with MongoDB through an FDW

MongoDB, one popular NoSQL solution, is a document database that allows objects with different numbers of fields to be included in a database. Objects can also be nested in other objects, with no limit on the depth. Let's see how to use FDWs in PostgreSQL to access MongoDB data.
I'll assume you have installed and configured PostgreSQL on your server. You may have the latest stable release, PostgreSQL 9.3, installed, but the latest version of the MongoDB FDW (currently mongo_fdw 2.0.0), which is developed by a third-party company, is compatible only with PotsgreSQL 9.2. The temporary lack of compatibility with the latest stable PotsgreSQL release is one of the disadvantages of this approach. While we can expect a new release of the wrapper that is compatible with PostgreSQL 9.3, there is no current information when it will be ready.
You can install the PostgreSQL 9.2 RPM package for your architecture by running the following commands:
wget http://yum.postgresql.org/9.2/redhat/rhel-6-i386/pgdg-centos92-9.2-6.noarch.rpm
rpm -ivH pgdg-centos92-9.2-6.noarch.rpm
Then use the yum search postgres command to list all the available packages for your architecture, and install them with a command like yum install postgresql92-devel postgresql92-server postgresql92-contrib.
Initialize your PostgreSQL cluster and start the server:
service postgresql-9.2 initdb
Initializing database:                                     [  OK  ]
/etc/init.d/postgresql-9.2 start
Starting postgresql-9.2 service:                           [  OK  ]
Once you have your PostgreSQL database server up and running, you can log in with the special postgres user, run the command-line interface for PostgreSQL, and create a test table with some sample data – in this case a list of shops and their addresses:
su postgres
bash-4.1$ psql
postgres=# CREATE TABLE shops(id serial primary key NOT NULL, name text NOT NULL, address char(50));
postgres=# INSERT INTO shops(name, address) VALUES ('My Hardware', 'USA, NY, 5th Avenue 33'), ('My Mobile Devices', 'UK, London, Fulham Road 22'), ('My Software', 'Germany, Berlin, Rosenthaler Street 3');
You can verify the data that you have entered through the SELECT query:
postgres=# select * from shops;
 id |       name        |                      address
  1 | My Hardware       | USA, NY, 5th Avenue 33
  2 | My Mobile Devices | UK, London, Fulham Road 22
  3 | My Software       | Germany, Berlin, Rosenthaler Street 3
(3 rows)
In an application that uses this data you might want to collect the total income from all the different types of online shops. Getting the answer might be complicated by the fact that each shop might sell totally different products, and it might be difficult to define the tables' structures during the design stage.
Instead of trying to force the data to follow a relational structure, you can use a document database like MongoDB that better supports the storage of highly variable data.
Create the corresponding /etc/yum.repos.d/mongodb.repo file with the configuration details for the MongoDB repository as explained in the official installation instructions. Then use the yum install mongo-10gen mongo-10gen-server command to install the latest stable release of the MongoDB server and the included tools. Start the service by entering /etc/init.d/mongod start at the command prompt.
You can configure both PostgreSQL and MongoDB to auto-start after your system is rebooted by entering the commands chkconfig postgresql-9.2 on && chkconfig mongod on.
Start the MongoDB command shell by typing mongo followed by the database name:
mongo myshops
MongoDB shell version: 2.4.8
connecting to: myshops
Next, enter some sample objects in the MongoDB database:
  "shop_id" : 1,
  "order_id" : 1,
  "customer" : "Joe D.",
  "products" : [ 
    { "product_id" : "SKU01", "type" : "CPU", "model" : "Intel Core i3 4340", "price" : 220 }, 
 { "product_id" : "SKU04", "type" : "CPU", "model" : "Intel Core i7 4770", "price" : 420 },
    { "product_id" : "SKU35", "type" : "laptop bag", "model" : "leather 1", "colour" : "black", "price" : 40 }
  "delivery_address" : {
 "country" : "USA",
 "state" : "California",
 "town" : "Yorba Linda",
 "street_address" : "Main street 23",
 "zip" : "92886",
 "mobile_phone" : "101001010101"
  "shop_id" : 2,
  "order_id" : 2,
  "customer" : "Mike A.",
  "products" : [ 
    { "product_id" : "SKU01", "type" : "smart phone", "model" : "Google Nexus", "price" : 400 }, 
 { "product_id" : "SKU05", "type" : "tablet", "model" : "iPad Air", "memory" : "16GB", "price" : 420 }
  "delivery_address" : {
 "country" : "Belgium", 
 "town" : "Brussels",
 "street_address" : "1st street 2",
 "zip" : "1234",
 "mobile_phone" : "1010010143"
  "shop_id" : 2,
  "order_id" : 3,
  "customer" : "Mike A.",
   "products" : [ 
    { "product_id" : "SKU04", "type" : "smart phone", "model" : "HTC Hero", "condition" : "used", "price" : 20 }, 
 { "product_id" : "SKU05", "type" : "tablet", "model" : "iPad Air", "memory" : "16GB", "promotion" : "Christmas 20% off", "price" : 336 }
  "delivery_address" : {
 "country" : "UK", 
 "town" : "London",
 "street_address" : "2nd street 22",
 "zip" : "9999",
 "mobile_phone" : "41010010143"
  "shop_id" : 3,
  "order_id" : 4,
  "customer" : "John C.",
  "products" : [ 
   { "product_id" : "SKU335", "type" : "book", "title" : "Learn PostgreSQL", "price" : 30 }
  "delivery_address" : {
 "country" : "Germany", 
 "town" : "Koln",
 "PO_box" : "223"
You can use the built-in MongoDB aggregation functionality to get the total sum for every order and store it in a separate data collection:
var total = db.orders.aggregate( [
    { $unwind: "$products" },
    { $group: {
        _id: '$_id',
  shop_id : { $first : "$shop_id" },
  order_id : { $first : "$order_id" },
  customer : { $first : "$customer" },
  sum: { $sum: '$products.price' } 
    } } 
] );
To see the result:
db.total.find().pretty().sort( { order_id: 1 } )
        "_id" : ObjectId("52c47761ab2d51cfcc878609"),
        "shop_id" : 1,
        "order_id" : 1,
        "customer" : "Joe D.",
        "sum" : 680
        "_id" : ObjectId("52c47761ab2d51cfcc87860a"),
        "shop_id" : 2,
        "order_id" : 2,
        "customer" : "Mike A.",
        "sum" : 820
        "_id" : ObjectId("52c47761ab2d51cfcc87860b"),
        "shop_id" : 2,
        "order_id" : 3,
        "customer" : "Mike A.",
        "sum" : 356
        "_id" : ObjectId("52c47762ab2d51cfcc87860c"),
        "shop_id" : 3,
        "order_id" : 4,
        "customer" : "John C.",
        "sum" : 30
Once you have sample data stored in your PostgreSQL and MongoDB databases you are ready to bind the two through the FDW.
First, use git to get the latest version of the FDW from the repository, then build and install the wrapper:
cd /usr/src/
git clone https://github.com/citusdata/mongo_fdw
cd /usr/src/mongo_fdw/
PATH=/usr/pgsql-9.2/bin/:$PATH make
PATH=/usr/pgsql-9.2/bin/:$PATH make install
Next, load the extension from the PostgreSQL command-line interface. Verify it and create a server instance for the wrapper:
postgres=# CREATE EXTENSION mongo_fdw;
postgres=# \dx mongo_fdw;
                      List of installed extensions
   Name    | Version | Schema |               Description
 mongo_fdw | 1.0     | public | foreign data wrapper for MongoDB access
(1 row)
postgres=# CREATE SERVER mongo_server FOREIGN DATA WRAPPER mongo_fdw OPTIONS (address '', port '27017');
Then set up a foreign table:
    shop_id INTEGER,
    order_id INTEGER,
 customer TEXT,
SERVER mongo_server
OPTIONS (database 'myshops', collection 'total');
Now you are ready to run SQL queries on the data stored in the MongoDB database. For example, you can list all the records from the table, and then run another query to find the total income for each shop and sort the result based on the shop ID.
SELECT * FROM shops_sales;
 shop_id | order_id | customer | sum
       3 |        4 | John C.  |  30
       2 |        2 | Mike A.  | 820
       1 |        1 | Joe D.   | 680
       2 |        3 | Mike A.  | 356
(4 rows)

SELECT shops.id AS "shop ID", shops.name AS "shop name", SUM(shops_sales.sum) AS "income" FROM shops INNER JOIN shops_sales ON shops.id = shops_sales.shop_id GROUP BY shops.id ORDER BY shops.id;
 shop ID |     shop name     | income
       1 | My Hardware       |    680
       2 | My Mobile Devices |   1176
       3 | My Software       |     30
(3 rows)

FDW future

Most of the currently available FDWs support only reading from the remote data sources. Since the release of PostgreSQL 9.3, developers can create FDWs to also perform inserts, updates, and deletes on foreign data. Whether any particular FDW supports these operations depends on the developers of the corresponding wrapper.
FDWs work as mediators between PostgreSQL databases and external data sources in different formats. You can run SQL queries on every possible source of information as long as the wrapper knows how to convert the external data to PostgreSQL format. FDWs give PostgreSQL application developers a useful tool to extract data from diverse technologies and use a single, unified way to query it.

No comments:

Post a Comment