Tuesday, October 30, 2012

PostgreSQL tuning for MySQL admins


You can get optimum performance from your database by tuning your system in three areas: the hardware, the database, and the database server. Each area is more specialized than the last, with the tuning of the actual database server being unique to the software in use. If you're already familiar with tuning MySQL databases, you'll find tuning a PostgreSQL database server to be similar, but with some key differences to watch out for.
Before tuning your PostgreSQL database server, work on optimizing some of the key factors in the hardware and the database. All databases, including PostgreSQL and MySQL, are ultimately limited by the I/O, memory, and processing capabilities of the hardware. The more a server has of each of these, the greater the performance it is capable of. Using fast disks with hardware RAID is essential for a busy enterprise database server, as is having large amounts of memory. For the best results the server needs enough memory to cache the most commonly used tables without having to go to disk; under no circumstances should the server start swapping to hard disk. Similarly, the faster the CPU the better, and for servers handling multiple simultaneous transactions, multicore CPUs are best.
On the software side, you can optimize both the actual database structure and frequently used queries. Be sure to create appropriate indexes. As with MySQL, primary key indexes are essential, and unique indexes offer advantages for both data integrity and performance. All full-text searches should also have the correct indexes. Unlike MySQL, PostgreSQL can build indexes while the database continues to fulfill read and write requests: look at the CONCURRENTLY option of the CREATE INDEX command, which builds the index without taking any locks that would prevent concurrent inserts, updates, or deletes on the table.
Even though an index has been created, PostgreSQL may not necessarily use it! PostgreSQL has a component called the planner that analyzes any given query and decides which is the best way to perform the requested operations. It decides between doing an index-based search or a sequential scan. In general, the planner does a good job of deciding which is the most effective way to resolve a query.
Let's see how this works in practice. Here is a simple table and some data:
CREATE TABLE birthdays (
id serial primary key,
firstname varchar(80),
surname varchar(80),
dob date
);

INSERT INTO birthdays (firstname, surname, dob) VALUES ('Fred', 'Smith', '1989-05-02');
INSERT INTO birthdays (firstname, surname, dob) VALUES ('John', 'Jones', '1979-03-04');
INSERT INTO birthdays (firstname, surname, dob) VALUES ('Harry', 'Hill', '1981-02-11');
INSERT INTO birthdays (firstname, surname, dob) VALUES ('Bob', 'Browne', '1959-01-21');
Use the EXPLAIN command to see what the planner will decide when executing any given query:
EXPLAIN select * from birthdays;

                          QUERY PLAN
 Seq Scan on birthdays  (cost=0.00..12.00 rows=200 width=364)
This tells us that since all the data is being requested, PostgreSQL will use a sequential scan (Seq Scan). If the query uses the primary key (id) then the planner tries a different approach:
 EXPLAIN select * from birthdays where id=2;

                                    QUERY PLAN
 Index Scan using birthdays_pkey on birthdays  (cost=0.00..8.27 rows=1 width=364)
This time it favored an Index Scan. Still, just because an index exists it doesn't mean the planner will decide to use it. Doing a search for a particular date of birth will (without the index) do a sequential scan:
EXPLAIN select * from birthdays where dob='1989-05-02';

                                    QUERY PLAN
 Seq Scan on birthdays  (cost=0.00..1.10 rows=1 width=364)
If you created an index using the command CREATE INDEX dob_idx ON birthdays(dob); and then ran the EXPLAIN command again, the result would be the same: a sequential scan would still be used. The planner makes this decision based on various table statistics, including the size of the dataset, and those statistics are not necessarily up to date. Without the latest stats, the planner's decisions will be less than perfect. Therefore, when you create an index or insert large amounts of new data, run the ANALYZE command to collect the latest statistics and improve the planner's decision-making.
You can force the planner to use the index (if it exists) using the SET enable_seqscan = off; command:
SET enable_seqscan = off;

EXPLAIN select * from birthdays where dob='1989-05-02';
                                QUERY PLAN
 Index Scan using dob_idx on birthdays  (cost=0.00..8.27 rows=1 width=364)
Turning off sequential scans might not improve performance: for a large number of matching rows, an index scan can be more I/O-intensive than a sequential scan. Test the performance difference before deciding to disable sequential scans permanently.
You can use the EXPLAIN command to check how queries are performed and to find bottlenecks in the database structure. It also has an ANALYZE option that actually runs the query and shows the real run times. Here is the same query, but this time with the ANALYZE option:
EXPLAIN ANALYZE select * from birthdays where dob='1989-05-02';

                                             QUERY PLAN


 Seq Scan on birthdays  (cost=0.00..1.09 rows=1 width=19)
        (actual time=0.007..0.008 rows=1 loops=1)
The results now contain extra information showing what actually happened when the query ran. Unfortunately it isn't possible to compare the "actual time" and "cost" fields directly, as they are measured in different units, but if the estimated and actual row counts match, or are close, it means that the planner correctly estimated the workload.
One other piece of routine maintenance that affects performance is the clearing up of unused data left behind in the database after updates and deletes. When PostgreSQL deletes a row, the actual data may still reside in the database, marked as deleted and not used by the server. This makes deleting fast, but the unused data needs to be removed at some point. Using the VACUUM command removes this old data and frees up space. The PostgreSQL documentation explains how to set up autovacuum, which automates the execution of VACUUM and ANALYZE commands.
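As a sketch of what that setup can look like, here is a postgresql.conf fragment with the main autovacuum parameters. The values shown are illustrative and the defaults differ between PostgreSQL versions, so check the documentation for your release before copying them:

```ini
# postgresql.conf -- autovacuum sketch; values are illustrative.
autovacuum = on                        # run the autovacuum launcher
autovacuum_naptime = 1min              # delay between runs on each database
autovacuum_vacuum_threshold = 50       # min row changes before a VACUUM
autovacuum_vacuum_scale_factor = 0.2   # plus this fraction of the table size
autovacuum_analyze_threshold = 50      # min row changes before an ANALYZE
autovacuum_analyze_scale_factor = 0.1  # plus this fraction of the table size
```

With these settings a table is vacuumed once roughly 20% of its rows (plus 50) have changed, and re-analyzed once roughly 10% (plus 50) have changed.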

Tweaking the PostgreSQL server parameters

The /var/lib/pgsql/data/postgresql.conf file contains the configuration parameters for the PostgreSQL server, and defines how various resources are allocated. Altering parameters in this file is similar to setting MySQL server system variables, either from the command-line options or via the MySQL configuration files. Most of the parameters are best left alone, but modifying a few key items can improve performance. However, as with all resource-based configuration, setting items to unrealistic amounts will actually degrade performance; consider yourself warned.
  • shared_buffers configures the amount of memory PostgreSQL allocates to its own shared cache of database pages. The precise effect of this parameter on performance is hard to predict, but increasing it from the default of 32MB to between 6 and 15 percent of available RAM should enhance performance. For a 4GB system, a value of 512MB should be sufficient.
  • effective_cache_size tells the planner about the size of the disk cache provided by the operating system. It should be at least a quarter of the total available memory, and setting it to half of system memory is considered a normal conservative setting.
  • wal_buffers sets the amount of shared memory used to buffer write-ahead log (WAL) data before it reaches disk. Setting this to around 16MB can improve the speed of WAL writes for large transactions.
  • work_mem is the amount of working memory available for sort operations. On systems that do a lot of sorting, increasing the work_mem parameter allows PostgreSQL to sort in memory rather than on disk. The parameter applies per sort, which means that if a client does two sorts in a query, the specified amount of memory can be used twice. A value of, say, 10MB used by 50 clients doing two sorts each would occupy just under 1GB of system memory. Given how quickly the numbers add up, setting this parameter too high can consume memory unnecessarily, but increasing it from the default of 1MB can bring performance gains in certain environments.
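The arithmetic behind that warning is easy to check. This is a minimal sketch, with the client and sort counts taken from the example above:

```shell
# Worst-case sort memory = work_mem x clients x sorts per query.
work_mem_mb=10
clients=50
sorts_per_query=2
echo "$((work_mem_mb * clients * sorts_per_query)) MB"   # prints: 1000 MB
```

Run the same multiplication against your own connection limit before raising work_mem globally; a per-session SET work_mem is often the safer choice for one sort-heavy report.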
To change a parameter, edit the conf file with a text editor, then restart the PostgreSQL server using the command service postgresql restart.
One last item to watch involves PostgreSQL's logging system, which is useful when you're trying to catch errors or during application development. However, if the logs are written to the same disk as the PostgreSQL database, the system may encounter an I/O bottleneck as the database tries to compete for bandwidth with its own logging actions. Tune the logging options accordingly and consider logging to a separate disk.
In summary, you can improve your database server's performance by running PostgreSQL on suitable hardware, keeping it routinely maintained, and creating appropriate indexes. Changing some of the database server configuration variables can also boost performance, but always test your database under simulated load conditions before enabling any such changes in a production environment.

NASA achieves data goals for Mars rover with open source software


Since the landing of NASA's rover Curiosity on Mars on August 6th (Earth time), I have been following the incredible wealth of images that have been flowing back. I am awestruck by their breadth and beauty.
The technological challenge of Curiosity sending back enormous amounts of data has, in my opinion, not been fully appreciated. From NASA reports, we know that Curiosity was sending back 'low level resolution' data (1,200 x 1,200 pixels) until it went through a software "brain transplant" and is now providing even more detailed and modifiable data.
How is this getting done so efficiently and distributed so effectively?
One recent story highlighted the 'anytime, anywhere' availability of Curiosity’s exploration that is handling "hundreds of gigabits/second of traffic for hundreds of thousands of concurrent viewers." Indeed, as the blog post from the cloud provider, Amazon Web Services (AWS), points out: "The final architecture, co-developed and reviewed across NASA/JPL and Amazon Web Services, provided NASA with assurance that the deployment model could cost-effectively scale, perform, and deliver an incredible experience of landing on another planet. With unrelenting goals to get the data out to the public, NASA/JPL prepared to service hundreds of gigabits/second of traffic for hundreds of thousands of concurrent viewers."
This is certainly evidence of the growing role that the cloud plays in real-time, reliable availability.
But look under the hood of this story, and at the diagram included, and you'll see another story: one that points to the key role of open source software in making this phenomenal mission work and its results available to so many, so quickly.
Here’s the diagram I am referring to:
Curiosity Diagram

If you look at the technology stack, you’ll see that at each level open source is key to achieving NASA’s mission goals. Let’s look at each one:


Nginx

Nginx (pronounced "engine-x") is a free, open source, high-performance HTTP server and reverse proxy, as well as an IMAP/POP3 proxy server. As a project, it has been around for about ten years. According to its website, Nginx now hosts 12.18% (22.2M) of active sites across all domains. Nginx is widely regarded as a preeminent web server for delivering content fast because of "its high performance, stability, rich feature set, simple configuration, and low resource consumption." Unlike traditional servers, Nginx doesn't rely on threads to handle requests. Instead it uses a much more scalable, event-driven (asynchronous) architecture, which uses small and, more importantly, predictable amounts of memory under load.
Among the known high-visibility sites powered by Nginx, according to its website, are Netflix, Hulu, Pinterest, CloudFlare, Airbnb, WordPress.com, GitHub, SoundCloud, Zynga, Eventbrite, Zappos, Media Temple, Heroku, RightScale, Engine Yard and NetDNA.

Railo

Railo is an open source software implementation of the general-purpose CFML server-side scripting language, deployed as an application server. It has recently been accepted as part of JBoss.org and runs on the Java Virtual Machine (JVM). It is often used to create dynamic websites, web applications and intranet systems. CFML is a dynamic language supporting multiple programming paradigms.


GlusterFS

Perhaps the most important piece of this high-demand configuration, GlusterFS is an open source, distributed file system capable of scaling to several petabytes (actually, 72 brontobytes!) and handling thousands of clients. GlusterFS clusters storage building blocks together over InfiniBand RDMA or TCP/IP interconnects, aggregating disk and memory resources and managing data in a single global namespace. GlusterFS is based on a stackable user-space design and can deliver exceptional performance for diverse workloads. It is especially adept at replicating big data across multiple platforms, allowing users to analyze the data via their own analytical tools. This technology is used to power the personalized radio service Pandora and the cloud content services company Brightcove, and by NTTPC for its cloud storage business. (In 2011, Red Hat acquired Gluster, the open source software company that supports the GlusterFS upstream community; the product is now known as Red Hat Storage Server.)
I suspect that, as for me, the cascade of visual images from this unique exploration has sparked for yet another generation the mystery and immense challenge of life beyond our own planet. And what a change from the grainy television transmissions of the first moon landing, 43 years ago this summer (at least for those who are in a position to remember it!). Even 20 years ago, the delays, the inferior quality, and the narrow bandwidth of the data that could be analyzed stood in stark contrast to what is being delivered right now, and for the next few years, from this one mission.
Taken together, the combination of cloud and open source enabled the Curiosity mission to provide these results in real time, not months delayed; at high quality, not "good enough" quality. A traditional, proprietary approach would not have been this successful, given the short time to deployment and shifting requirements that necessitated the ultimate in agility and flexibility. NASA/JPL are to be commended. And while there was one cloud offering involved, "it really could have been rolled with any number of other solutions," as the story cited at the beginning of this post notes.
As policy makers and technology strategists continue their focus on 'big data', the mission of Curiosity will provide some important lessons. One key takeaway: open source has been key to the success of this mission and to making its results as widely available as possible in so short a time frame.

Tuesday, October 23, 2012

Features of Open Source GPS Tracking System


Over the past few years, Global Positioning System (GPS) applications have become extremely popular among automobile consumers; in fact, anyone who drives a vehicle on a regular basis probably uses one. So much so that many car manufacturers offer GPS capabilities built directly into their cars. Mobile device providers have also found themselves competing with each other over location-aware applications using GPS technology. While there are several applications on the market that offer functionality for individual consumers, there is not a lot available for companies or small business owners who need to manage several vehicles at once from a central location.
The Open GTS (Open GPS Tracking System) Project is an open source project developing Open GTS, a GPS application built specifically for managing fleets of vehicles for small businesses. Fleet vehicles have different requirements for GPS applications than individual vehicles. For instance, the dispatch manager's ability to keep track of each vehicle's location through the work day is just as important as the driver's ability to find their way around with accurate real-time mapping and directions.

Types of Transportation Fleets using Open Source GPS Tracking Systems

The Open GTS package is currently the only open source application that provides these small business capabilities and has been downloaded by hundreds of small business users worldwide in over one hundred and ten countries. Companies who manage their automotive fleets with Open GTS include taxi services, parcel delivery, truck and van shipping, ATV and recreational vehicle rentals, business car rentals, water based freight ships and barges, and farm vehicles.

Skinnable Web Interfaces

The convenient centralized tracking application is built to fit into any small business application environment and can be customized accordingly. Along with the ability to code new extension modules or modifications to the base components as needed, it is easy to customize the user experience by adding your own CSS. The addition of a custom CSS will create a user experience that can fit more naturally into your existing business environment and even include your company logo and particular company background colors and fonts.

Customized Reporting

Another feature of open source GPS tracking systems is the ability to generate custom reports based on your specific data needs. For Open GTS, since it is based on XML for its underlying reporting structure, reports can be configured to provide data on a particular historical period, a particular set of vehicles in the fleet or even one vehicle in the fleet.


Customizable Geofencing Zones

Geofenced areas, also known as geozones, are geographic perimeters in which your fleet of vehicles is allowed to operate. Customizable geofencing zones allow users of open source GPS systems to define their own areas of operation and change them as their business grows. Multiple geozones can also be defined and identified with custom names for better organization of all of your different areas of operation.

Customizable Map Providers

Open GTS allows users to integrate a number of mapping programs, including Google Maps and Microsoft's mapping application Virtual Earth. The Mapstraction library is also supported, which provides access to popular mapping services such as Map24 and MapQuest.

Operating System Independence

Since open source GPS tracking systems are web applications, they can run on any operating system. The Open GTS tool is built on the Apache Tomcat application server, runs on the Java Runtime Environment, and uses MySQL for its relational database.

Localization and Compliance

GPS tracking systems, and Open GTS in particular, must support easy options for localizing their interfaces and language support. Open GTS complies with the i18n standards for internationalization and localization.

Sunday, October 21, 2012

11 Basic Linux NMAP Command Examples for System Administrators


NMAP (Network Mapper) is an important network monitoring tool that checks which ports are open on a machine.
Some important points to note about NMAP:
  • NMAP is an abbreviation of "network mapper".
  • NMAP is used to scan ports on a machine, either local or remote (all you require is an IP address or hostname to scan).
  • NMAP can be installed on Windows and Sun Solaris machines too.
  • NMAP can be used to scan large networks; remember, I am saying large networks.
  • NMAP can be used to get operating system details such as open ports, the software used for a service and its version number, the vendor of the network card, and the uptime of that system too (don't worry, we will see all these things in this post).
  • Please do not use NMAP on machines for which you don't have permission.
  • It can be used by hackers to scan systems for vulnerabilities.
  • Just a funny note: you can see NMAP used by Trinity in the Matrix-II movie, when she tries to hack into the electric grid supercomputer.
Note: the NMAP man page is one of the best man pages I have come across. It is written in such a way that even a new user can understand what each option does, and it even has examples of how to use NMAP in different situations. When you have time, read it; you will get lots of information.
Let us start with some examples to better understand nmap command:
  1. Check for particular port on local machine.
  2. Use nmap to scan local machine for open ports.
  3. Nmap to scan remote machines for open ports.
  4. Nmap to scan entire network for open ports.
  5. Fast-scan only the most common ports with the -F option.
  6. Scan a machine with -v option for verbose mode.
  7. Scan a machine for TCP protocol open ports.
  8. Scan a machine for UDP protocol open ports.
  9. Scan a machine for services and their software versions.
  10. Scan for open Protocols such as TCP, UDP, ICMP, IGMP etc on a machine.
  11. Scan a machine to check what operating system it's running.
Example1 : Scanning for a single port on a machine
nmap -p portnumber hostname
nmap -p 53 localhost
Starting Nmap 5.21 ( http://nmap.org )
Nmap scan report for localhost (127.0.0.1)
Host is up (0.000042s latency).
53/tcp open domain
Nmap done: 1 IP address (1 host up) scanned in 0.04 seconds
The above example checks whether port 53 (DNS) is open on the machine or not.
Example2 : Scan entire machine for checking open ports.
nmap hostname
Starting Nmap 5.21 ( http://nmap.org )
Nmap scan report for localhost (127.0.0.1)
Host is up (0.00037s latency).
Not shown: 998 closed ports
53/tcp open domain
631/tcp open ipp
Nmap done: 1 IP address (1 host up) scanned in 0.08 seconds
Example3 : Scan remote machine for open ports
nmap remote-ip/host
Starting Nmap 5.21 ( http://nmap.org )
Nmap scan report for localhost (
Host is up (0.00037s latency).
Not shown: 998 closed ports
53/tcp open domain
631/tcp open ipp
Nmap done: 1 IP address (1 host up) scanned in 0.08 seconds
Example4: Scan entire network for IP address and open ports.
nmap network ID/subnet-mask
Starting Nmap 5.21 ( http://nmap.org )
Nmap scan report for
Host is up (0.016s latency).
Not shown: 996 closed ports
23/tcp open telnet
53/tcp open domain
80/tcp open http
5000/tcp open upnp
Nmap scan report for
Host is up (0.036s latency).
All 1000 scanned ports on are closed
Nmap scan report for
Host is up (0.000068s latency).
All 1000 scanned ports on are closed
Nmap done: 256 IP addresses (3 hosts up) scanned in 22.19 seconds
Example5: Scan just the ports; don't scan for the IP address, hardware address, hostname, operating system name, version, uptime, and so on. It is very fast, as the man pages say; in our tests we observed it to be about 70% faster at scanning ports than a normal scan.
nmap -F hostname
-F is for fast scan, and it will not do any other scanning.
nmap -F
Starting Nmap 5.21 ( http://nmap.org ) 
Nmap scan report for
Host is up (0.028s latency).
Not shown: 96 closed ports
23/tcp open telnet
53/tcp open domain
80/tcp open http
5000/tcp open upnp
Nmap done: 1 IP address (1 host up) scanned in 0.10 seconds
Example6: Scan the machine and give as much detail as possible.
nmap -v hostname
nmap -v
Starting Nmap 5.21 ( http://nmap.org )
Initiating Ping Scan at 13:31
Scanning [2 ports]
Completed Ping Scan at 13:31, 0.00s elapsed (1 total hosts)
Initiating Parallel DNS resolution of 1 host. at 13:31
Completed Parallel DNS resolution of 1 host. at 13:31, 0.00s elapsed
Initiating Connect Scan at 13:31
Scanning [1000 ports]
Discovered open port 53/tcp on
Discovered open port 80/tcp on
Discovered open port 23/tcp on
Discovered open port 5000/tcp on
Completed Connect Scan at 13:31, 0.21s elapsed (1000 total ports)
Nmap scan report for
Host is up (0.014s latency).
Not shown: 996 closed ports
23/tcp open telnet
53/tcp open domain
80/tcp open http
5000/tcp open upnp
Read data files from: /usr/share/nmap
Nmap done: 1 IP address (1 host up) scanned in 0.26 seconds
Example7: Scan a machine for open TCP ports
nmap -sT hostname
Here -s stands for scan and T selects a TCP connect scan, which checks TCP ports only.
nmap -sT
Starting Nmap 5.21 ( http://nmap.org )
Nmap scan report for
Host is up (0.022s latency).
Not shown: 996 closed ports
23/tcp open telnet
53/tcp open domain
80/tcp open http
5000/tcp open upnp
Nmap done: 1 IP address (1 host up) scanned in 0.28 seconds
Example8: Scan a machine for open UDP ports.
nmap -sU hostname
Here U indicates UDP port scanning. This scan requires root permissions.
Example9: Scan ports and find the versions of the different services running on the machine
nmap -sV hostname
Here -s stands for scan and V shows the version of each network service running on the host.
nmap -sV localhost
Starting Nmap 5.21 ( http://nmap.org )
Stats: 0:00:06 elapsed; 0 hosts completed (1 up), 1 undergoing Service Scan
Service scan Timing: About 0.00% done
Nmap scan report for localhost (127.0.0.1)
Host is up (0.000010s latency).
Not shown: 998 closed ports
53/tcp open domain dnsmasq 2.59
631/tcp open ipp CUPS 1.5
Service detection performed. Please report any incorrect results at http://nmap.org/submit/ .
Nmap done: 1 IP address (1 host up) scanned in 6.38 seconds
Example10: Check which protocols (not ports), such as TCP, UDP, ICMP, etc., are supported by the remote machine. The -sO option lists each supported protocol and its open status.
nmap -sO hostname
nmap -sO localhost
Starting Nmap 5.21 ( http://nmap.org )
Nmap scan report for localhost (127.0.0.1)
Host is up (0.14s latency).
Not shown: 249 closed protocols
1 open icmp
2 open igmp
6 open tcp
17 open udp
103 open|filtered pim
136 open|filtered udplite
255 open|filtered unknown
Nmap done: 1 IP address (1 host up) scanned in 2.57 seconds
Example11: Scan a system for operating system and uptime details
nmap -O hostname
-O performs operating system detection along with the default port scan
nmap -O google.com
Starting Nmap 5.21 ( http://nmap.org ) 
Nmap scan report for google.com (
Host is up (0.021s latency).
Hostname google.com resolves to 11 IPs. Only scanned
rDNS record for maa03s16-in-f8.1e100.net
Not shown: 997 filtered ports
80/tcp open http
113/tcp closed auth
443/tcp open https
Device type: general purpose|WAP
Running (JUST GUESSING) : FreeBSD 6.X (91%), Apple embedded (85%)
Aggressive OS guesses: FreeBSD 6.2-RELEASE (91%), Apple AirPort Extreme WAP v7.3.2 (85%)
No exact OS matches for host (test conditions non-ideal).
OS detection performed. Please report any incorrect results at http://nmap.org/submit/ .
Nmap done: 1 IP address (1 host up) scanned in 16.23 seconds
Some sites to refer to (not for practical examples, but to get a good grasp of the concepts):
nmap.org : the official NMAP site

Thursday, October 18, 2012

Understanding Linux / Unix Filesystem Inode


Inode, short for Index Node, is what the whole Linux filesystem is laid upon. Anything that resides in the filesystem is represented by an inode. Take the example of an old-school library that still works with a register holding information about its books: which cabinet and which row each book resides in, and who its author is. In this case, the line specific to one book is the inode. In the same way, an inode stores information about a filesystem object, as we will study in detail below.
So, in a Linux system, the filesystem mainly consists of two parts: the first is the metadata, and the second is the data itself. Metadata, in other words, is data about the data. Inodes take care of the metadata part of the filesystem.

Inode Basics:

So, as I said, every file or directory in the filesystem is associated with an inode. An inode is a data structure that stores the following information about its object:
  • Size of file (In bytes)
  • Device ID (Device containing the file)
  • User ID (of the owner)
  • Group ID
  • File Modes (how owner, group or others could access the file)
  • Extended Attributes (like ACL)
  • File access, change, and modification time stamps
  • Link count (the number of hard links pointing to the inode; remember, soft links are not counted here)
  • Pointer to the disk block that stores the content.
  • File type (whether file, directory or special block device)
  • Block size of the filesystem
  • Number of blocks the file is using.
A Linux filesystem never stores the file creation time, though a lot of people are confused about that. The various time stamps stored in the inode are explained fully in this article.
A typical inode data will look something like this:
# stat 01
Size: 923383 Blocks: 1816 IO Block: 4096 regular file
Device: 803h/2051d Inode: 12684895 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2012-09-07 01:46:54.000000000 -0500
Modify: 2012-04-27 06:22:02.000000000 -0500
Change: 2012-04-27 06:22:02.000000000 -0500

How / When Inodes are created ?

The creation of inodes depends on the filesystem you are using. Some filesystems, like ext3, create all their inodes when the filesystem is created, and hence have a limited number of inodes. Others, like JFS and XFS, use dynamic inode allocation and can increase the number of inodes according to need, avoiding the situation where all the inodes get used up.

What happens when someone tries to access a file:

When a user tries to access a file, or any information related to that file, they use the file name to do so, but internally the file name is first mapped to its inode number through a directory table. Then, through that inode number, the corresponding inode is accessed. There is a table (the inode table) where this mapping of inode numbers to their respective inodes is kept.

Inode Pointer Structure:

So, as already explained, an inode stores only the metadata of the file, including the information about the blocks where the real data of the file is stored. This is where the inode pointer structure comes in.
As explained in the Wikipedia article, the structure could have 11 to 13 pointers, but most filesystems store the data structure with 15 pointers. These 15 pointers consist of:
  • Twelve pointers that directly point to blocks of the file’s data, called direct pointers.
  • One singly indirect pointer, which points to a block of pointers that then point to blocks of the file’s data.
  • One doubly indirect pointer, which points to a block of pointers that point to other blocks of pointers that then point to blocks of the file’s data.
  • One triply indirect pointer, which points to a block of pointers that point to other blocks of pointers that point to other blocks of pointers that then point to blocks of the file’s data.
The above things can be explained in a diagram like this:
Inode Pointer Structure
Inode Pointer Structure (From wikipedia, Wikimedia Commons license)


Q. How do I define an inode in one line?
A. An inode is a data structure on a traditional Unix-style file system such as UFS or ext3. An inode stores basic information about a regular file, directory, or other file system object.
Q. How can I see a file's or directory's inode number?
A. You can use the “stat” command to see the information, or use the “-i” argument with the “ls” command to see the inode number of a file.
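As a quick sketch of both commands on a scratch file (the file name is made up for the demonstration, and stat -c is GNU stat syntax; on BSD or macOS the equivalent is stat -f '%i'):

```shell
# Create a scratch file, then read its inode number two ways.
dir=$(mktemp -d)
touch "$dir/demo.txt"
ls -i "$dir/demo.txt"           # prints the inode number, then the path
stat -c '%i' "$dir/demo.txt"    # prints just the inode number
rm -r "$dir"
```

Both commands report the same number, since they are reading the same inode.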
Q. How do I find the total number of inodes in a filesystem, and the inode usage?
A. The “df -i” command will show the total number of inodes, the number used, and the number free.
Q. Why doesn't the inode information contain the filename?
A. Inodes store only information that is unique to an inode. In the case of a hard link, one inode can have two different file names pointing to it, so it's better not to store the filename inside the inode.
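You can watch this happen with a hard link on a scratch file (the names are illustrative; GNU stat syntax):

```shell
# Two names, one inode: after ln the link count rises to 2 and both
# names report the same inode number.
dir=$(mktemp -d)
touch "$dir/original"
ln "$dir/original" "$dir/alias"   # a hard link, not "ln -s"
stat -c '%h' "$dir/original"      # prints: 2  (the link count)
stat -c '%i' "$dir/original"      # same inode number...
stat -c '%i' "$dir/alias"         # ...from either name
rm -r "$dir"
```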
Q. What if an inode has no links?
A. An inode with no links (that is, a link count of 0) is removed from the filesystem and its resources are freed for reallocation, but deletion must wait until all processes that have the file open finish accessing it.
Q. Does the inode change when we move a file from one location to another?
A. The inode number stays the same when the file is moved within the same filesystem. Moving it across filesystems copies the data to a new file, so the inode number changes.
Q. When we create a new file or directory, does it create a new inode?
A. No. Creating a file or directory takes an already-allocated free inode and updates its information; it does not create a new inode. Inodes are created at filesystem creation time (with the exceptions for some other filesystems explained above).
Q. Can I find a file from an Inode number?
A. Yes, by using the following command
# find / -inum inode-number -exec ls -l {} \;
Using the same command with "rm" in place of "ls", you can also remove a file by its inode number:
# find / -inum inode-number -exec rm -f {} \;
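The hard-link answer above is easy to verify yourself. A minimal sketch (this uses GNU stat with -c; BSD/macOS stat takes -f instead):

```shell
# Create a file and a hard link to it in a throwaway directory
cd "$(mktemp -d)"
echo hello > a
ln a b                          # 'b' is a second name for the same inode
stat -c 'inode=%i links=%h' a   # link count is now 2
stat -c 'inode=%i links=%h' b   # same inode number as 'a'
[ "$(stat -c %i a)" = "$(stat -c %i b)" ] && echo "same inode"
```

Renaming 'a' with mv inside this directory would keep the inode number too, matching the move question above.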


  1. Wikipedia
  2. Linux Magazine

Regular Expressions in Linux Explained with Examples


Regular expressions (regexps) are one of the advanced concepts we require for writing efficient shell scripts and for effective system administration. For better understanding, regular expressions are divided into 3 types:
1) Basic regular expressions
2) Interval regular expressions (use option -E for grep and -r for sed)
3) Extended regular expressions (use option -E for grep and -r for sed)
Some FAQ’s before starting Regular expressions
What is Regular expressions?
A regular expression is a pattern that is matched against a given string.
Which commands support regular expressions?
vi, tr, grep, sed, awk, perl, python etc.

Basic Regular Expressions

Basic regular expressions: this set includes the very basic regular expressions, which do not require any options to use. This set of regular expressions was developed a long time back.

^ – caret; matches the beginning of a line
$ – matches the end of a line
* – matches 0 or more occurrences of the previous character
. – matches any single character
[] – matches any one character from a range or set
[^char] – matches any one character NOT in the set
\< \> – match a whole word
\ – escape character
Let's start with examples, so that we can understand regexps better.

^ Regular Expression

Example 1: Find all the regular files in a given directory
ls -l | grep ^-
As you are aware, in ls -l output the first character is - for regular files and d for directories. So what does ^- indicate? The ^ symbol matches the start of a line, so ^- displays whatever lines start with -, which in Linux/Unix indicates a regular file.
If we want to find all the directories in a folder, use grep ^d along with ls -l as shown below:
ls -l | grep ^d
How about character files and block files?
ls -l | grep ^c
ls -l | grep ^b
We can even find the lines which are commented out, using the ^ operator:
grep '^#' filename
How about finding lines in a file which start with 'abc'?
grep '^abc' filename
Any number of examples are possible with ^.

$ Regular Expression

Example 2: Match all the files whose names end with sh
ls -l | grep 'sh$'
As $ matches the end of the line, the above command will list all the files whose names end with sh.
How about finding lines in a file which end with 'dead'?
grep 'dead$' filename
How about finding empty lines in a file?
grep '^$' filename
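Here is a small self-check of the anchors together; the sample lines are invented for the demo:

```shell
# Build a throwaway file and count matches for each anchor pattern
f=$(mktemp)
printf '%s\n' '# a comment' 'abc def' 'process is dead' '' 'the end' > "$f"
grep -c '^#' "$f"      # lines starting with '#'  -> prints 1
grep -c 'dead$' "$f"   # lines ending with 'dead' -> prints 1
grep -c '^$' "$f"      # empty lines              -> prints 1
rm -f "$f"
```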

 * Regular Expression

Example 3: Match all files which have a word twt, twet, tweet, etc. in the file name.
ls -l | grep 'twe*t'
How about searching for the word apple, misspelled in a given file as ale, aple, appple, apppple, etc.? To find all such patterns:
grep 'ap*le' filename
Note that the above pattern matches even 'ale', as * means 0 or more occurrences of the previous character.
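You can verify the zero-or-more behavior with a few made-up words:

```shell
# ap*le = 'a', then zero or more 'p', then 'le'
# 'ale' matches (zero p's); 'able' does not ('b' breaks the pattern)
printf '%s\n' ale aple apple appple able | grep 'ap*le'
```

This prints ale, aple, apple and appple, but not able.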

. Regular Expression

Example 4: Find file names which contain any single character between t and t.
ls -l | grep 't.t'
Here . matches any single character, so this matches tat, t3t, t.t, t&t, and so on.
How about finding all the file names which start with a and end with x using regular expressions?
ls -l | grep 'a.*x'
Here .* matches any number of characters: . matches any character, and * repeats it 0 or more times. This will find all the files/folders whose names start with a and end with x.

[] Square braces/Brackets Regular Expression

Example 5: Find all the files which contain a number between a and x in the file name
ls -l | grep 'a[0-9]x'
This will find file names such as a0x, a1x, a2x and so on: wherever a single digit appears between a and x, it will match.
Some range operator examples for you:
[a-z] – matches any single character from a to z
[A-Z] – matches any single character from A to Z
[0-9] – matches any single character from 0 to 9
[a-zA-Z0-9] – matches any single character from a to z, A to Z, or 0 to 9
[!@#$%^] – matches any one of the characters ! @ # $ % ^
Just think about what you want to match and put those characters in the brackets.
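A quick check of a character class against invented names:

```shell
# a[0-9]x = 'a', exactly one digit, then 'x'
# a99x does NOT match: [0-9] consumes a single digit only
printf '%s\n' a1x a9x abx aZx a99x | grep 'a[0-9]x'
```

Only a1x and a9x are printed.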

[^char] Regular Expression

Example 6: Match all the file names containing a character other than a, b, or c
ls | grep '[^abc]'
This outputs every file name that contains at least one character other than a, b, or c; a name made up only of those three letters will not match.
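The negated class is worth a second look: [^abc] matches any single character outside the set, so a name matches as soon as it contains one such character (the names below are invented):

```shell
# 'cab' consists only of a/b/c characters, so [^abc] finds nothing in it
printf '%s\n' cab dog bad | grep '[^abc]'
```

This prints dog and bad (each contains a character outside the set), but not cab.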

\ Regular expression

Example 7: Search for the word abc on its own; for example, abcxyz or readabc should not appear in the output.
grep '\<abc\>' filename

\ Escape Regular Expression 

Example 8: Find files which contain [ in the name; as [ is a special character, we have to escape it
grep '\[' filename
grep '[[]' filename
Note: placing a character inside [] strips it of its special meaning, so if you want to match a special character literally you can put it inside brackets instead of escaping it.
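Both escaping forms can be checked against invented names containing a literal bracket:

```shell
# Each pattern matches the literal '[' in 'a[1].txt' exactly once
printf '%s\n' 'a[1].txt' 'plain.txt' | grep -c '\['    # prints 1
printf '%s\n' 'a[1].txt' 'plain.txt' | grep -c '[[]'   # prints 1
```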
Note: there is no need to use -E for these basic regular expressions with grep. egrep is equivalent to "grep -E" (and fgrep to "grep -F"). I suggest you concentrate on grep to complete your work; don't go for other commands if grep can resolve your issue. Stay tuned to our next post on regular expressions.

Wednesday, October 17, 2012

60 OS Replacements for Storage Software


According to IDC, the amount of digital data in our universe is doubling every two years. They say that in 2011 our world generated 1.8 zettabytes (1.8 trillion gigabytes) of data. The research firm also reports that enterprises store 80 percent of that data at some point during its lifecycle.
The problem: while the amount of storage capacity needed is growing incredibly rapidly, enterprise budgets are not increasing at the same rate.
As a result, enterprises are increasingly looking to open source solutions to help them manage their huge data stores while keeping costs down. And the open source community also has many storage-related projects that can help small businesses and consumers with their storage needs as well.
This month, Datamation is updating our list of open source software that can replace commercial storage solutions. We put together a mix of storage-related projects for home users and companies of all sizes.
Here you'll find software that you can use with industry-standard hardware to create your own storage device, utilities to help you maximize your available storage capacity, and a host of other tools to help you manage your stored data.
As always, if you have additional suggestions for the list, please note them in the comments section below.


Backup
1. Amanda
Replaces Symantec NetBackup, NovaBackup, Barracuda Backup Service
Short for "Advanced Maryland Automatic Network Disk Archiver," Amanda is a mature tool that can back up data from a very high number of workstations connected to a LAN. Currently, Zmanda supports Amanda's development and offers related cloud-based products. Operating System: Windows, Linux, OS X.
2. Areca Backup
Replaces Norton Ghost, McAfee Online Backup, NovaBackup
Appropriate for very small businesses or home users, Areca backs up individual workstations. Easy setup and versatility make this open source backup solution popular, and it supports some advanced backup features, including delta backup. Operating System: Windows, Linux.
3. Bacula
Replaces Symantec NetBackup, NovaBackup, Barracuda Backup Service
One of the most popular open source backup solutions for enterprise users, Bacula offers a suite of tools to backup, verify and recover data from large networks. See Bacula Systems for commercial support. Operating System: Windows, Linux, OS X.
4. Clonezilla
Replaces Backup Exec, Norton Ghost
Created as a replacement for Symantec software, Clonezilla is a backup and cloning solution that allows bare-metal backup and recovery, as well as multicasting for deploying multiple systems at once. Choose Clonezilla Live for backing up a single machine or Clonezilla SE to clone more than 40 systems at once. Operating System: Linux.
5. Create Synchronicity
Replaces Norton Ghost, McAfee Online Backup, NovaBackup
This solution's claim to fame is its extremely lightweight size. A good option for standalone systems, it's customizable and easy to use. Operating System: Windows.
6. FOG
Replaces Symantec NetBackup, NovaBackup, Barracuda Backup Service
Popular with schools and small businesses, FOG resides on a Linux-based server and provides cloning functionality for Windows-based networked PCs. It offers an easy-to-use Web interface, and it includes features like virus scanning, testing, disk wiping and file recovery. Operating System: Linux, Windows.
7. Partimage
Replaces Norton Ghost
Like FOG, Partimage runs on a Linux server, and it can clone Windows or Linux PCs connected to the network. Because it only images used blocks, it often runs faster than similar backup tools. Operating System: Windows, Linux.
8. Redo
Replaces Norton Ghost, McAfee Online Backup, NovaBackup
This backup solution boasts that it can do a bare-metal restore in under ten minutes. The website also claims that Redo is "so simple that anyone can use it," and  calls it "the easiest, most complete disaster recovery solution available." Operating System: Windows, Linux.


Compression
9. 7-zip
Replaces WinZip
Compressing files before you store them can help minimize the amount of storage capacity you need. 7-zip supports ZIP files and several popular compression formats, including 7Z files, which offer 30-70 percent greater compression than ZIP files. Operating System: Windows, Linux, OS X.
10. KGB Archiver
Replaces WinZip
KGB Archiver claims to offer an "unbelievably high compression rate" that's even better than 7Z. It also offers AES-256 encryption. Operating System: Windows.
11. PeaZip
Replaces WinZip
This incredibly versatile compression utility supports more than 150 different formats. It also includes security features like strong encryption, two factor authentication, encrypted password manager and secure deletion. Operating System: Windows, Linux, OS X.


Databases
12. Kexi
Replaces Microsoft Access, FileMaker
A KDE application, Kexi is sometimes called "Microsoft Access for Linux." It offers visual tools for database creation and a database engine (SQLite), but it can also be used with MySQL or PostgreSQL servers. Operating System: Windows, Linux, OS X.
13. LucidDB
Replaces Microsoft SQL Server
LucidDB claims to be "the first and only open-source RDBMS purpose-built entirely for data warehousing and business intelligence." Accordingly, it offers advanced analytics capabilities and good scalability. Operating System: Windows, Linux.
14. MySQL
Replaces Microsoft SQL Server
This Oracle-owned project boasts that it's the "world's most popular open source database." It comes in several commercial editions in addition to the open source version. Operating System: Windows, Linux, OS X.
15. PostgreSQL
Replaces Microsoft SQL Server
PostgreSQL's website claims, "Unlike many proprietary databases, it is extremely common for companies to report that PostgreSQL has never, ever crashed for them in several years of high activity operation. Not even once." It's won numerous awards and is standards-compliant. Operating System: Windows, Linux, OS X.

Data Destruction

16. BleachBit
Replaces Easy System Cleaner
If you need to clean out a hard drive and "shred" data so that it cannot be recovered, BleachBit is for you. In addition, it offers a number of other privacy protection tools that clean the cache, erase Internet history, delete cookies, get rid of temporary files and eliminate other "junk" that slows down your system. Operating System: Windows, Linux.
17. Darik's Boot And Nuke
Replaces Kill Disk, BCWipe Total WipeOut
For those who need to eliminate the data on an entire drive, Darik's Boot and Nuke does the trick. Note, however, that this is primarily a home user and small business application and doesn't erase RAID arrays. Operating System: OS Independent.
18. Eraser
Replaces BCWipe Enterprise
Eraser's website notes, "Most people have some data that they would rather not share with others - passwords, personal information, classified documents from work, financial records, self-written poems, the list continues." If you want to delete these files from your hard drive, Eraser will write over the old data so that it can never be recovered. Operating System: Windows.
19. FileKiller
Replaces BCWipe Enterprise
Like Eraser, FileKiller "shreds" old files by rewriting over the stored data. It boasts fast performance, and it allows the end user to specify how many times to overwrite the file. Operating System: Windows.
20. Wipe
Replaces BCWipe Enterprise
Wipe is similar to Eraser and FileKiller, but it works on Linux instead of Windows. Operating System: Linux.


Data Deduplication
21. Bulk File Manager
Replaces NoClone 2010, FalconStor Data Deduplication
This app performs de-duplication at the file level. In addition, it also offers bulk re-naming, bulk moving, file splitting and file joining capabilities. Operating System: Windows.
22. Opendedup
Replaces NoClone 2010, FalconStor Data Deduplication
Opendedup performs inline de-duplication to reduce storage utilization by up to 95 percent. It's available as an appliance for simplified setup and deployment. Operating System: Windows, Linux.

Document Management Systems (DMS)

23. Epiware
Replaces Documentum, Microsoft SharePoint, OpenText
This document manager offers features like search, access history, version history, calendaring, project management and a wiki. Paid support is available. Operating System: Windows, Linux.
24. LogicalDOC
Replaces Documentum, Microsoft SharePoint, OpenText
LogicalDOC boasts an easy and intuitive interface that runs through any browser. You can choose to run the free or paid version on your own server, or opt for the no-hassle cloud version. Operating System: OS Independent.
25. OpenDocMan
Replaces Documentum, Microsoft SharePoint, OpenText
OpenDocMan offers fine-grained control, Web-based access and compliance with ISO 17025 and the OIE standard for document management. In addition to the open source community version, it comes in a hosted professional version or an on-premise enterprise version. Operating System: Windows, Linux, OS X.
26. OpenKM
Replaces Documentum, Microsoft SharePoint, OpenText
This highly usable document management solution offers capabilities like sharing content, setting security roles, auditing, and finding enterprise documents and registers. It comes in community, professional, cloud and university versions. Operating System: OS Independent.
27. Xinco DMS
Replaces Documentum, Microsoft SharePoint, OpenText
Short for "eXtensible INformation COre," Xinco offers Web-based management of files, documents, contacts, URLs and more. Features include ACLs, version control and full text search. Operating System: Windows, Linux, OS X.

File Systems

28. Ceph
Replaces Unified storage hardware from Dell, EMC, HP
Ceph is a distributed file system that offers unified object, block and file-level storage. Professional support and services are available through InkTank. Operating System: Linux.
29. Gluster
Replaces hardware from EMC, IPDATA, Netgear
Sponsored by Red Hat, the Gluster file system offers unified file and object storage, as well as storage for Hadoop deployments. It's self-healing and can scale to 72 brontobytes. Operating System: Linux.
30. Lustre
Replaces hardware from EMC, IPDATA, Netgear
Oracle-owned Lustre boasts that it can handle very large and complex storage needs, "scaling to tens of thousands of nodes and petabytes of storage with groundbreaking I/O and metadata throughput." Note that although the name is similar to "Gluster," the two are completely independent projects. Operating System: Linux.

31. ZFS
Replaces hardware from EMC, IPDATA, Netgear
Originally developed by Sun, this file system supports very high storage capacities and offers features like error checking, RAID capabilities and data deduplication. It has been incorporated in many other open source projects, including FreeNAS and NAS4Free. Operating System: Solaris, OpenSolaris, Linux, OS X, FreeBSD.

File Managers

32. Explorer++
Replaces Windows Explorer
Explorer++ extends the capabilities of the standard Windows Explorer with tabbed browsing, an improved interface, keyboard shortcuts, file merge, file split, and customization capabilities. Like the regular Windows Explorer, it also offers drag-and-drop functionality. Operating System: Windows.
33. muCommander
Replaces Windows Explorer, xplorer
Java-based muCommander offers a dual-pane file management interface with a light footprint. It allows users to modify zipped files on the fly, and it supports multiple file transfer protocols. Operating System: Windows, Linux, OS X.
34. Nautilus
Replaces Windows Explorer, xplorer
Nautilus, the file manager for the Gnome desktop, is available for most Linux distributions. The intuitive interface should feel familiar to anyone who's ever used a file manager. Operating System: Linux.
35. PCManFM
Replaces Windows Explorer, xplorer
The standard file manager for the LXDE desktop, PCManFM also supports other Linux desktops. Its features include drag-and-drop support, thumbnails, icon view, tabbed windows and trash can support. Operating System: Linux.
36. QTTabBar
Replaces Windows Explorer
Similar to Explorer++, this very popular open source project extends the functionality of Windows Explorer with tabs and other interface improvements. Support for Windows 8 is planned. Operating System: Windows.
37. SurF
Replaces Windows Explorer, xplorer
SurF brings a fresh approach to file management with a unique, tree-based list of files. Other features include brief highlighting of new and recently changed files, auto-complete for search terms and network support. Operating System: Windows.
38. Thunar
Replaces Windows Explorer, xplorer
Used by the Xfce desktop environment, Thunar boasts a clean interface and very fast performance. It includes a bulk renamer and an extensions framework so that you can add any functionality you like. Operating System: Linux.
39. TuxCommander
Replaces Windows Explorer, xplorer
Like other "Commander" style file managers, TuxCommander offers a two-paned interface. It also features support for large files, a tabbed interface, a customizable mounter bar, associations and more. Operating System: Linux.

File Transfer

40. FileZilla
Replaces CuteFTP, FTP Commander
This project includes both server and client software for transferring files via FTP, FTPS or SFTP. Note that the server software is Windows only, but the client software is multiplatform. Operating System: Windows, Linux, OS X.
41. WinSCP
Replaces CuteFTP, FTP Commander
Tremendously popular, WinSCP has been downloaded more than 64 million times. It's an SFTP, SCP, FTPS and FTP client, and it offers basic file manager capabilities. Operating System: Windows.

Hierarchical Storage Management

42. OHSM
Replaces IBM Tivoli Storage Manager HSM, HPSS, EMC DiskXtender
Short for Online Hierarchical Storage Manager, OHSM automatically moves data between high- and low-cost storage media in accordance with the policies set up by the administrator. It allows policies for allocation (where to put a new file) and relocation (when to move an existing file). Operating System: Linux.


NAS
43. FreeNAS
Replaces EMC Isilon products, IPDATA appliances, Netgear ReadyNAS
Based on FreeBSD, FreeNAS allows users to create a network-attached storage (NAS) device that will allow them to share files with other systems on the network, regardless of what operating system those systems use. It includes ZFS and incorporates both file and volume management capabilities. Operating System: FreeBSD.
44. NAS4Free
Replaces EMC Isilon products, IPDATA appliances, Netgear ReadyNAS
A fork of FreeNAS, this project also creates a BSD-based NAS system. Key features include ZFS, Software RAID (0, 1, 5), disk encryption and reporting. Operating System: FreeBSD.
45. Openfiler
Replaces EMC Isilon products, IPDATA appliances, Netgear ReadyNAS
This storage management solution combines some of the characteristics of NAS with some of the characteristics of SAN devices. Use it with any industry-standard server to create your own storage device. Commercial support and plug-ins are available. Operating System: Linux.

46. OpenSMT
Replaces EMC Isilon products, IPDATA appliances, Netgear ReadyNAS
Like Openfiler, OpenSMT also allows users to turn standard system hardware into a dedicated storage device with some NAS features and some SAN features. It uses the ZFS filesystem and includes a convenient Web GUI. Operating System: OpenSolaris.
47. Turnkey Linux File Server
Replaces EMC Isilon products, IPDATA appliances, Netgear ReadyNAS
Turnkey offers a wide variety of Linux-based software that you can use to create your own appliance. The File Server version creates a simple NAS device. Operating System: Linux.

Online Data Storage

48. Cyn.in
Replaces Box, DropBox, ADrive, Amazon Cloud Drive, Google Drive
This open source collaboration suite allows users to share, organize, search and collaboratively work on files. In addition to the open source download, it's also available as a paid enterprise appliance or on an SaaS basis. Operating System: Server requires Linux; client versions are OS independent.
49. FTPbox
Replaces Box, DropBox, ADrive, Amazon Cloud Drive, Google Drive
FTPbox makes it easy to sync your files across multiple devices or share your files with others. It can use SFTP or FTPS protocol for secure file transmission. Operating System: Windows.
50. iFolder
Replaces Box, DropBox, ADrive, Amazon Cloud Drive, Google Drive
Built with syncing, backup and file sharing in mind, iFolder works much like DropBox. Simply save your files locally as usual, and iFolder will update them on your server and on the other workstations you use. It was originally developed by Novell and is now managed by Kablink. Operating System: Linux, OS X.
51. OwnCloud
Replaces Box, DropBox, ADrive, Amazon Cloud Drive, Google Drive
As the name suggests, OwnCloud makes it possible to create your own cloud for storing music, photos and all other kinds of files. Supported business and enterprise versions are available. Operating System: Windows, Linux, OS X.
52. SparkleShare
Replaces Box, DropBox, ADrive, Amazon Cloud Drive, Google Drive
Because it was built for developers, this online storage solution includes version control software (Git, to be specific). It automatically syncs all files with the hosts, and it allows you to set up multiple projects with different hosts. Operating System: Windows, Linux, OS X.
53. Syncany
Replaces Box, DropBox, ADrive, Amazon Cloud Drive, Google Drive
Syncany works with commercial online storage solutions like Amazon S3 or Google Storage, adding better synchronization functionality (akin to DropBox) and improved security. It encrypts files locally, making it more feasible to use an online service to store sensitive data. Operating System: Linux (Windows and OS X versions planned.)

RAID Controllers

54. Mdadm
Replaces RAID hardware from vendors like Dell, EMC, HP, IBM
Part of the Linux kernel, Mdadm is software that makes it possible to build your own RAID array with standard hardware. It can also monitor and report on RAID arrays. Operating System: Linux.
55. Raider
Replaces RAID hardware from vendors like Dell, EMC, HP, IBM
Raider allows users to convert any Linux disk system into a RAID array. It supports RAID levels 1, 4, 5, 6 or 10. Operating System: Linux.
56. RaidEye
Replaces RAID hardware from vendors like Dell, EMC, HP, IBM
Just as Raider and Mdadm allow you to turn Linux systems into RAID arrays, RaidEye does the same thing for Macs. It's a monitoring tool that works with the built-in RAID capabilities in OS X. Operating System: OS X.
57. Salamander
Replaces RAID hardware from vendors like Dell, EMC, HP, IBM
Another Linux project, Salamander simplifies the process of turning a multi-disk system into a RAID system. As for the name, the website explains, "Salamanders are the only vertebrates that can regenerate limbs. In the same way, a system installed with Salamander can regenerate after a hard-drive failure." Operating System: Linux.
58. SnapRAID
Replaces RAID hardware from vendors like Dell, EMC, HP, IBM
SnapRAID is a non-standard RAID level for storage arrays. It uses snapshot backup capabilities to provide redundancy that protects against the failure of up to two disks in an array. Operating System: OS X.

Storage Virtualization Management

59. Libvirt Storage Management
Replaces DataCore Software, VMware vSphere, SolarWinds Storage Manager, IBM
This open source API provides an array of virtualization management capabilities, including storage management. It supports multiple hypervisors, including KVM, Xen, VMware, Hyper-V and others. Operating System: Linux.
60. oVirt
Replaces DataCore Software, VMware vSphere, SolarWinds Storage Manager, IBM
Like Libvirt, oVirt can manage many different types of virtualized environments, including virtualized storage. It supports the KVM hypervisor only. Operating System: Linux.

Thursday, October 11, 2012

Which freaking PaaS should I use?


Most of the buzz around the cloud has centered on infrastructure as a service (IaaS). However, IaaS is no longer good enough. Sure, you can forgo buying servers and run everything virtually on Amazon's EC2 server farm. So what? You still have to manage it, and to do that you'll have a growing IT bureaucracy. Companies that want to focus on writing their code and not have to think about application servers at all are now looking to platform as a service (PaaS).
A PaaS is a virtual instance running an application server with some management and deployment tools in front of it. Management of the infrastructure and the higher-level runtime (application server, LAMP stack, and so on) are taken care of for you, and there's generally a marketplace of other services like databases and logging for you to tap. You just deploy your application and provision instances.
Amazon, Google, Microsoft, Red Hat, Salesforce.com, and VMware all have PaaS offerings. There are also smaller vendors such as CloudBees that are compelling. (We also wanted to try out Oracle Cloud, but when we attempted to create an account, the site reported that the preview was full. We look forward to trying it at a later date.) Each vendor has a set of differentiating characteristics beyond the many technical and cosmetic differences. They might even be targeting different sorts of customers. Which one should you choose?
To find out, we examined seven PaaS solutions based on the top concerns we hear from customers:
  • Key differentiators. What's the special sauce? What can you get from vendor X that you can't get from the others?
  • Lock-in. Once you get on, how easy is it to get off?
  • Security. Are important security standards (PCI, SAE, etc.) supported?
  • Reference customers. Who are they marketing to and who is not a good fit? Are there any "keynote" deployments?
In addition to posing these questions to each vendor, we subjected each PaaS offering to a simple test.
The trouble is that all of the tutorials and getting-started documentation seem to be aimed at people working on greenfield applications. Most of us spend the majority of our time working on existing applications, and the big money is in getting legacy code to run in our new cloudy world. We wanted to see how easy or difficult it would be to port a legacy application to the cloud.
Our legacy app, called Granny's Addressbook (aka Granny), is a training exercise we use at my company to teach the Java-based Spring framework. The rationale: Chances are if you're comparing PaaS, you aren't a Microsoft shop. If you're not a Microsoft shop, then statistically speaking you're a John Doe with 2.3 kids and your application is in Java. If your application is in Java, then it is probably written in the Spring framework. This is basic market math, so we compared the process of deploying Granny on each of the PaaS clouds.
If you'd like to try this yourself or examine our code, you can download the Granny's Addressbook WAR file, find the Granny source code on GitHub, and follow our steps to deploying Granny to each PaaS (with screen images).
Before diving into the details, we'll give you an executive summary of our results. CloudBees and VMware's Cloud Foundry proved the easiest for deploying our legacy app. CloudBees shined with a built-in CI (continuous integration) tool, while Cloud Foundry's IDE integration was seamless and wonderful. Heroku was a distant third, but probably would have fared better if Granny had been written in Ruby.

Google App Engine has the best SLA but also a high risk of lock-in. Red Hat's OpenShift was a bit disappointing, but we expect the kinks will get worked out as it exits preview status. It likely would've been more impressive if Granny were a Java EE application instead of a Spring app. Red Hat and VMware had the best answers with regards to lock-in.
Amazon Elastic Beanstalk isn't really a PaaS, but it might be a good compromise if you need IaaS customizability with PaaS-like capabilities. Microsoft's Windows Azure supports the most languages, but it didn't function as a "true" PaaS for our application. Microsoft's tooling for Linux didn't work, and its tooling for Eclipse was underwhelming.
It's still a bit early in the PaaS space, but you can already begin porting legacy apps to some cloud platforms with only minor changes or possibly none at all. Big companies and small companies alike may find a PaaS to be a compelling way to deploy applications and cut capital expenditures. This market isn't as crowded as it might seem, as many of the big players aren't yet out of beta. But in the coming months we can expect that to change.
Amazon Elastic Beanstalk
Amazon's AWS Elastic Beanstalk kind of sticks out in this list. It isn't a PaaS so much as a deployment tool for Amazon Web Services' EC2. Think of Elastic Beanstalk as a wizard for deploying applications to EC2 VMs. We've included it because you'd ask about it if we didn't!
Differentiators. The biggest differentiator here is the mothership. Most of the other PaaS vendors (including CloudBees, Heroku, and Red Hat's OpenShift) piggyback on Amazon's infrastructure. That means if something goes wrong at the infrastructure level, despite their SLAs, they're talking to Amazon while you talk to them. And because this is really an IaaS, you have ultimate control down to the OS level. On the other hand, where a true PaaS would give you "freedom from the obligation of control," Amazon Elastic Beanstalk still requires you to manage infrastructure-level resources.
Lock-in. Lock-in is up to you. Since this is an IaaS, you can ultimately deploy what you want to.
Security. Amazon publicly lists its security and compliance certifications. It's an extensive list that includes FIPS 140-2, ITAR, ISO 27001, PCI DSS Level 1, FISMA Moderate, and SOC 1/SSAE 16/ISAE 3402. Amazon also provides a good amount of documentation on its security processes.
Who's using it? Amazon also publishes its customer case studies. It's an impressive collection of customers ranging from Amazon (duh) to Netflix to Shazam. It's also very long.
How did it do? It was straightforward to deploy our Granny app. To get Granny working with Amazon RDS (MySQL) required provisioning the database via the Elastic Beanstalk wizard and changing the data source descriptors in our application to match. Unfortunately, our progress was blocked by a connection timeout that other people also seem to have encountered. Supposedly you can fix this by adding IP addresses to a security group. However, debugging this took longer than deploying on other PaaS offerings, so we gave up.
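For illustration, the data source change amounted to swapping the JDBC URL in our Spring descriptor for the endpoint the Elastic Beanstalk wizard reported after provisioning RDS. A sketch only -- the endpoint hostname, schema name, and credentials below are placeholders, not real values:

```xml
<!-- Hypothetical Spring bean definition pointing at an Amazon RDS MySQL
     instance; hostname, schema, and credentials are placeholders. -->
<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource"
      destroy-method="close">
    <property name="driverClassName" value="com.mysql.jdbc.Driver"/>
    <property name="url"
              value="jdbc:mysql://granny-db.example.us-east-1.rds.amazonaws.com:3306/granny"/>
    <property name="username" value="grannyadmin"/>
    <property name="password" value="secret"/>
</bean>
```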
Conclusions. Amazon Elastic Beanstalk is a middle ground between an IaaS and a PaaS. It's one throat to choke, but it isn't the real thing. You end up doing for yourself all the things a PaaS would do for you. If you're thinking of cloud but you haven't decided to "go all in" on PaaS, this might be a good compromise while you get there technically or psychologically. But if you can, go all PaaS and pick something else.

CloudBees CloudBees was one of the first PaaS offerings aimed mainly at the Java developer. Another successful startup by members of the so-called JBoss mafia, CloudBees is backed by Matrix Partners, Marc Fleury, and Bob Bickel, and led by former JBoss CTO Sacha Labourey. CloudBees supports any JVM-based language or framework.
Differentiators. According to CloudBees, a key differentiator is that this is a PaaS company from the ground up, whereas most of the competitors are software vendors with a cloud play. As a proof point, CloudBees notes that neither Red Hat, Oracle, VMware, nor Microsoft has a production-ready for-pay public PaaS offering despite all four having made such an announcement more than a year ago. The implication is that these competitors know how to build, QA, and monetize software, but not a service.
Against "pure" cloud plays such as Heroku and Google App Engine, CloudBees cites its depth in Java as a key attraction. Indeed, this showed when deploying our legacy application. CloudBees also noted its integration of the CI tool, Jenkins, which allows you to develop "full circle" in the cloud from GitHub to build and deploy.
Lock-in. CloudBees doesn't see lock-in as an inherent issue. The company pointed out that Java PaaS providers tend to be based on open source application servers like Tomcat running on open source JDKs. This means you could take an app running on a pure play PaaS vendor and move it back on-premise very easily.
Security. CloudBees noted that while its PaaS is PCI compliant, your application should also be reviewed. CloudBees provides documentation of its security process and constantly reviews those processes. CloudBees offers additional security information under NDA.
Who's using it? CloudBees notes that in addition to startups and small companies "with no access to sophisticated IT staff and capital expenditures," adoption is being driven within larger companies by specific business units. In many cases, central IT isn't responsive enough to their needs, so the business units start working directly with PaaS providers.
CloudBees lists its reference customers publicly. The company pointed to one in particular, Lose It, which generates up to 25,000 transactions per minute on the CloudBees platform. It seems this company only has four employees: two in software development and two in marketing, with zero in IT. CloudBees pointed out that this is the type of "extreme productivity" possible only in the cloud.
How did it do? To get our Granny application running, CloudBees required a simple deploy from a Web page using a "free" trial account. Getting the app to use a CloudBees-provided instance of MySQL required provisioning an instance and changing the data source descriptor to use CloudBees' JDBC driver and the appropriate JDBC URL. Although the Web GUI doesn't make it clear, CloudBees allows you to automatically override the data sources with its command-line interface in a manner similar to Cloud Foundry's IDE.
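Concretely, the descriptor edit looked something like the sketch below. The driver class and the jdbc:cloudbees:// URL scheme are assumptions drawn from CloudBees' documentation at the time, and the database name is a placeholder -- treat this as illustrative, not authoritative:

```xml
<!-- Sketch of a data source bound to a CloudBees-provided MySQL instance.
     Driver class and URL scheme assumed from CloudBees docs; "grannydb"
     and the credentials are placeholders. -->
<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource"
      destroy-method="close">
    <property name="driverClassName" value="com.cloudbees.jdbc.Driver"/>
    <property name="url" value="jdbc:cloudbees://grannydb"/>
    <property name="username" value="dbuser"/>
    <property name="password" value="secret"/>
</bean>
```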
Conclusions. Due to its simple deployment process and reasonable pricing and service-level agreements, we think CloudBees is a good choice for deploying Java applications, legacy or not. It's at a disadvantage from a business standpoint in that it doesn't have the relationships with existing customers in the manner of VMware or Red Hat. On the other hand, CloudBees isn't stuck with these companies' management structures and compensation models, either. This should allow it to be more agile in attracting new customers to the cloudy world of PaaS.

Google App Engine Google App Engine (GAE) is Google's PaaS. Initially released all the way back in 2008, it's relatively mature compared to other PaaS offerings. App Engine supports Java, Python, and Google's Go language.
Differentiators. Hey, it's Google. GAE offers the same APIs that Google uses for deploying its own applications. The pricing model allows you to pay only for what you use, and the minimums appear to be cheaper than other vendors. Google's SLA also appears to beat the competition. Moreover, App Engine runs on Google's infrastructure. Most other PaaS offerings are front-ending Amazon.
Lock-in. Google's PaaS seems to be the most proprietary of all. We're talking serious lock-in, as in "download this to CSV and fix your code not to use Google's APIs." Ouch!
Security. Google App Engine is SAS 70 (now SSAE 16/ISAE 3402) compliant.
Who's using it? Google sees mobile, Web, and gaming companies as being prime candidates. Google publishes an impressive list of customers that include companies like Pulse, Best Buy, Khan Academy, and Ubisoft.
How did it do? We couldn't get Granny to work on App Engine despite spending nearly five times as long as we spent on the others. Google provides Spring examples, but the example apps are more simply structured than our application, which was originally based on the Spring Tool Suite IDE template.
Conclusions. Google's SLA is the best. This alone is why many companies we've worked with have chosen App Engine. Also, App Engine is mature. However, App Engine might not be our first choice for a legacy app, considering the amount of work we might have to do. We'd be even more concerned about lock-in for new apps. We'd want to do a lot more due diligence to prove we weren't stuck. When your stock price is $718 per share, investors are going to look to you to provide that value somewhere. Companies that base their entire infrastructure on you and can never leave would be one way you could do that in the long run.
Heroku In development since 2007, Heroku is one of the original PaaS offerings. It was acquired by Salesforce.com in 2010. Heroku employs Yukihiro "Matz" Matsumoto, the creator of the Ruby programming language. In addition to Ruby, Heroku supports Java, Python, Node.js, Clojure, Grails, Gradle, Scala, and Play.
Differentiators. Heroku's key differentiator is its maturity. It has been publicly available for a number of years, and it enjoys a large marketplace of plug-ins. The company said more than 2.35 million apps are running live on the platform today. It noted that its official support for nine languages, along with many more community-contributed languages and frameworks, differentiates Heroku from other PaaS offerings.
Lock-in. Heroku describes its PaaS as a 100 percent open platform that offers a native developer experience for both IDE-centric and command-line centric developers. In response to the lock-in question, the company said that code written to run on Heroku around modern best practices can easily run on any other standards-based platform, in-house or in the cloud.
Indeed the risks of lock-in do not seem more significant than with other PaaS offerings. We were able to deploy the Granny application without significant changes. However, it would be interesting to see how easy or difficult it is to dump data from a PostgreSQL or MySQL instance on Heroku.
Security. Heroku publicly lists its security compliance, noting mainly that it sits on Amazon Web Services infrastructure and Amazon is compliant with ISO 27001, SOC 1/SSAE 16/ISAE 3402, PCI Level 1, FISMA Moderate, and Sarbanes-Oxley (SOX). PCI compliance is provided by offloading credit card processing to a compliant third-party service.
Who's using it? Heroku said that it sees adoption from small startups through the largest enterprise customers in the world. It lists a good number of reference accounts, including social and Facebook apps, digital media sites, corporate marketing sites, city government sites, and more. In addition to those listed on the website, the company pointed to "exciting adoption" by Macy's, which is building Java apps on Heroku.
How did it do? Heroku was easier to work with than OpenShift but harder than CloudBees or Cloud Foundry. The documentation was fairly straightforward. In addition to uploading your WAR file, you have to log into your account and set up your database, then return to Eclipse to complete the process. This swapping between the Web GUI and Eclipse makes Heroku a less attractive option than Cloud Foundry. Heroku lacks the polish of some of the other offerings despite its maturity.
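For reference, the usual pattern for running a WAR on Heroku at the time was Tomcat's webapp-runner launched from a Procfile. The sketch below assumes the Maven build has copied webapp-runner into target/dependency; the paths are illustrative, not taken from our project:

```
web: java $JAVA_OPTS -jar target/dependency/webapp-runner.jar --port $PORT target/*.war
```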
Conclusions. Heroku is a "safe" choice because it's well established, with a growing marketplace of add-on services. It isn't the easiest or hardest to work with. For a Ruby app, it might be our first choice. Our initial test was less positive, but after Heroku released improvements to the Java platform on Sept. 19, deploying Granny proved much more seamless. Heroku wouldn't be our first choice for a legacy application, but it's not bad at all.

Microsoft Windows Azure
Windows Azure is Microsoft's take on Amazon Web Services, encompassing both IaaS and PaaS offerings. In addition to .Net language support, there are SDKs for Java, Python, PHP, and Node.js.
Differentiators. First off, this is Microsoft -- your .Net apps can come here too. Further, Microsoft points out that Azure supports almost any developer language that's popular today, and more are being added. Unlike most competitors, which are AWS underneath, Azure runs on Microsoft's own cloud. Additionally, Azure is available for production today with publicly available pricing.
We don't consider Azure to be a true PaaS because the Azure tools actually deployed our entire Tomcat instance. On one hand, this is one way of answering the lock-in question. On the other hand, the whole idea behind choosing a PaaS is to be freed from having to manage your own application server.
Lock-in. According to Microsoft, making use of a PaaS solution means writing to a set of runtime libraries designed for that specific PaaS. This has excellent effects on scale, agility to write, and performance but requires custom work to move to another PaaS. The company notes that data migration is a simpler proposition because there are many ETL patterns supported by Windows Azure and other PaaS platforms.
In other words, we'd be careful to test elsewhere. Interoperability has never been a strong point for Microsoft, which focuses more on ease of entry. Then again, Microsoft does not have a history of price gouging its customers even when they're locked in. For software-freedom-loving open source guys like us, that's a hard admission, but for the most part we think that it is true.
Security. Microsoft has extensive documentation on Windows Azure's security certifications and procedures. These include ISO 27001, SSAE 16/ISAE 3402, EU Model Clauses, and HIPAA BAA. Frankly, this is how you get big government and corporate contracts, so we would expect no less. Microsoft goes above and beyond the certifications by not only penetration testing its product but offering penetration testing to its customers with seven days' advance notice.
Who's using it? Microsoft claims many thousands of Azure customers, from students to single-developer shops to Fortune 500 companies. It also notes that some legacy apps can pose a problem -- for example, an extremely stateful application with a frozen or nonmaintained code base that would preclude the architectural changes necessary to support the PaaS frameworks and its scalable tiers, availability sets, queuing across instances, and so on.
Microsoft sent us a number of published case studies. At the moment, these are mainly schools and municipalities. They don't appear to be specific to Azure either, let alone its PaaS offering. Additional case studies are available on the Azure site, but we're using Google Chrome on Ubuntu 12.04 and the site requires Silverlight.
How did it do? Microsoft called very shortly after we signed up and offered assistance. This is great customer service and honestly belongs among the differentiators. In this era of retail Internet service, "Can I help you?" can sway a decision.
Deploying Granny to Azure showed plenty of rough edges. Azure's Eclipse plug-in didn't work; in fact, it directed us to an EXE file, which obviously wasn't going to work on Linux. The Linux SDK also did not work. On Windows, deploying the application on Azure was only partially PaaS-like. Instructions for deploying the example "Hello, world" Web application include pointing the setup wizard to the local copy of your favorite application server and JDK. The app server is then merely copied to a Windows Server 2008 VM. After that, you can fairly easily have your application use Azure's SQL Server instance.
Conclusion. If you have legacy apps not based on .Net, then Azure probably won't be your first choice. However, with that hands-on approach to both customer service and security, Microsoft could go a long way.
We honestly expected a bit more from the Linux SDK and the Eclipse plug-in. Despite the talk of interoperability and all of the tweeting from OpenAtMicrosoft, Microsoft didn't shine here. Certainly, Microsoft has the wrong messaging on PaaS lock-in for our taste. That said, if you have a mixed infrastructure of .Net, Java, Ruby, Python, and PHP and can do some tweaking but prefer not to rewrite, Azure may be the best choice.

Red Hat OpenShift Red Hat's PaaS offering, called OpenShift, is aimed at Node.js, Ruby, Python, PHP, Perl, and Java developers. OpenShift combines the full Java EE stack with newer technologies such as Node.js and MongoDB.
Differentiators. OpenShift runs Java applications on the JBoss Enterprise Application Platform (JBoss EAP), Red Hat's commercial distribution of JBoss. Red Hat considers Java Enterprise Edition 6 (Java EE 6) to be a compelling differentiator, along with allowing developers to choose the best tool for the job, whether it's Java EE 6, Ruby, Python, PHP, Node.js, Perl, or even a custom language runtime.
In the coming months, Red Hat will be launching the first commercial, paid, supported tier of the OpenShift service. Red Hat said it will also release an on-premises version for enterprises that can't run in the public cloud due to security, governance, and compliance restrictions.
Lock-in. "No lock-in" was one of the foundational principles used in the design and development of OpenShift, according to Red Hat. The company noted that OpenShift uses no proprietary APIs, languages, data stores, infrastructure, or databases, but is built with pure vanilla open source language runtimes and frameworks. This means, for example, that an application built with Python and MySQL on OpenShift will seamlessly port to Python and MySQL running on a stand-alone server or in another cloud (assuming the language versions are the same). Likewise, a JBoss Java EE 6 application running on OpenShift can be moved to any JBoss server.
Security. Red Hat publicly lists OpenShift's security compliance information. The company said that Red Hat's Security Response Team (the same team that continuously monitors Linux for vulnerabilities) is involved with the design and implementation of OpenShift, and that the OpenShift Online PaaS service is continuously patched and updated by the OpenShift Operations team at the instruction of the Security team. Red Hat also noted that OpenShift runs SELinux, the security subsystem originally developed by the NSA.
Who's using it? Red Hat said a wide cross-section of companies are using OpenShift today, ranging from hobbyist developers to technology startups building their businesses in the cloud to systems integrators and service providers to Fortune 500 enterprises. The company noted that classic legacy applications that are running on mainframes or other legacy platforms are not great candidates for migration to a PaaS.
Because OpenShift is considered a "developer preview" -- Red Hat's term for beta or alpha -- the company didn't feel comfortable releasing any information about existing deployments.
How did it do? It was a lot more work than we expected to get Granny deployed to OpenShift. Swapping between the command-line deployment tool and the Web-based provisioning and management console lacked the user-friendliness of CloudBees or Cloud Foundry. The Red Hat Developer Studio plug-ins didn't work with our application out of the box. Ultimately, we had to edit a lot more descriptor files both inside and outside of the application than we did with other solutions.
Had we deployed a Java EE-compliant app, I'm sure OpenShift would have been friendlier. But when the command-line tool told me to run a command, then warned me that the command was deprecated, it left a bad taste in my mouth. This is truly a "developer preview" and rough around the edges.
Conclusions. If you're already developing JBoss applications, OpenShift may be a good fit. It's worth a preview now, but if you're looking to deploy to a PaaS today, it's not ready. Red Hat should continue to trumpet the Java EE compliance as a differentiating factor. However, even by 2006 when Andrew worked at JBoss, he noticed that most applications deployed in JBoss were written to the Spring Framework. Supporting Red Hat's existing customer base is all well and good, but greatness and business success will come from seamless deployment of applications developed by people who are not already in the Red Hat camp.

VMware Cloud Foundry
VMware bought SpringSource in 2009. Therefore, it isn't surprising that our "legacy" application, which was already based on the Spring Framework, worked seamlessly on Cloud Foundry. Although Cloud Foundry is still beta, it was very polished and worked well.
Differentiators. A key differentiator is the native support of the Spring framework. According to VMware, Cloud Foundry was built in collaboration with the SpringSource engineering team to ensure a seamless development, deployment, and management experience for Java developers. VMware also noted that Cloud Foundry is "unique in its multicloud approach," allowing developers to deploy the same application, without code or architectural changes, to multiple infrastructures both public and private. In fact this isn't unique, as OpenShift is similar, but VMware is uniquely positioned to do it. Unlike CloudBees, Heroku, and Red Hat, VMware has built its own cloud rather than building on Amazon Web Services.
Lock-in. VMware addressed the question of lock-in to my satisfaction. Because the platform is open source and there's a broad ecosystem of compatible providers (examples include CloudFoundry.com, Micro Cloud Foundry, AppFog, and Tier3), developers can easily move applications between Cloud Foundry instances, on both public clouds and private infrastructure. VMware noted that in addition to the multicloud flexibility, this open source flexibility ensures that developers and customers aren't locked into one cloud or one platform. As proof, the company pointed me to a blog post on extracting data using the Cloud Foundry data tunneling service, which goes far beyond "You can dump it to CSV and port it yourself."
Security. We were unable to find any published documentation on security certifications (PCI, SAE, and so on) for Cloud Foundry. VMware pointed me to its User Authentication and Authorization service, which appears to be a single sign-on scheme based on OAuth2. This could be a helpful service for application developers, but government organizations and large companies are going to require VMware to provide documentation of security certs before migrating to its cloud.
Who's using it? Cloud Foundry is well positioned to meet the needs of companies that want a combination of public and private PaaS. Its focus on an ecosystem of Cloud Foundry providers is a strong point, especially with regards to lock-in. Cloud Foundry is clearly aimed at Ruby, Node.js, and JVM-based languages. If you have a more diverse technology base, this may not be your first choice.
VMware pointed me to several published case studies, including Intel, Diebold, AppFog, Cloud Fuji, and others.
How did it do? We installed the Eclipse plug-in, deployed the WAR, and changed nothing. In fact, the first time we deployed Granny, we accidentally deployed it configured with CloudBees' JDBC information. Cloud Foundry automatically detected our Spring configuration and reconfigured the database settings for our Cloud Foundry database. This kind of magic may make some people nervous, but it worked seamlessly.
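That "magic" is Cloud Foundry's Spring auto-reconfiguration: when the application defines a single data source bean, the platform rewrites its connection properties at staging time to point at the bound service. So even a descriptor like this sketch (all values are placeholders, standing in for the leftover CloudBees settings we accidentally deployed with) gets silently redirected:

```xml
<!-- Placeholder local settings; Cloud Foundry's auto-reconfiguration
     overrides the URL and credentials with those of the bound MySQL
     service at staging time. -->
<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource"
      destroy-method="close">
    <property name="driverClassName" value="com.mysql.jdbc.Driver"/>
    <property name="url" value="jdbc:mysql://localhost:3306/granny"/>
    <property name="username" value="dev"/>
    <property name="password" value="dev"/>
</bean>
```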
Conclusions. Cloud Foundry "just worked" -- we did nothing to the application but install an Eclipse plug-in. What's not to love? For ops teams, there's also a command-line interface. Once this PaaS launches, depending on pricing and such, it will certainly be a viable choice for Java developers. We can assume that for Ruby, which Cloud Foundry is written in, you would have a similar experience. (We have also tested the Node.js interface, which was a little trickier but still very workable.)
Cloud Foundry worked great and was the most straightforward. We were so successful with the Eclipse plug-in that we didn't try the command-line interface. Of course the test wasn't perfectly "fair" in that the app was a Spring app in the first place, but the app was written to be run on a local Tomcat instance, yet it deployed seamlessly to the Cloud Foundry cloud. Considering much of the legacy that will move to the cloud is Java and most existing Java apps are written in Spring, we're excited to see Cloud Foundry launch.
This article, "Which freaking PaaS should I use?," originally appeared at InfoWorld.com.