Forget about joins and SQL and try NoSQL databases – specifically MongoDB, the leading example
MongoDB is an open source document-oriented database system written
in C++ by Dwight Merriman and Eliot Horowitz. It runs on UNIX machines
as well as Windows and supports replication and sharding (aka horizontal
partitioning) – the process of separating a single database across a
cluster of machines.
Many programming languages – including C, C++, Erlang, Haskell, Perl,
PHP, Python, Ruby and Scala – support MongoDB. It is suitable for many
things, including archiving, event logging, storing documents, agile
development, real-time statistics and analysis, gaming, and mobile and
location services.
This article will show you how to store Apache log files in a MongoDB
database with the help of a small Python script. We’ll also demonstrate
how to implement replication in MongoDB.
The replica set consists of nodes 192.168.2.4 (port 27019), 192.168.1.10 (port 27019) and 192.168.2.3 (port 27018).
Step 01
Connecting to MongoDB for the first time
Your Linux distribution probably includes a MongoDB package, so go
ahead and install it. Alternatively, you can download a precompiled
binary or get the source code from www.mongodb.org and compile it
yourself.
After installation, type mongo --version to find out which MongoDB
version you are using, then type mongo to start the MongoDB shell and
check whether the MongoDB server process is running.
Step 02
MongoDB terminology
NoSQL databases are designed for the web and do not support joins,
complex transactions and other features of the SQL language. You can
update a MongoDB database schema without downtime, but you should design
your MongoDB database with the absence of joins in mind.
MongoDB's terminology is a little different from that of relational
databases, so you should familiarise yourself with it: a collection
corresponds to a table, a document to a row and a field to a column.
Step 03
The _id field
Every time you insert a BSON document in MongoDB without supplying an
_id field, MongoDB automatically generates one. The _id field acts as
the primary key, and a generated value is an ObjectId that is always
12 bytes long. To find the creation time of the object with _id
‘51cb590584919759671e4687’, execute the following command from the MongoDB shell:
Note: Remember that queries are case-sensitive.
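The creation time can be recovered because the first four bytes of a generated ObjectId hold a big-endian Unix timestamp in seconds. In the shell, the ObjectId(...).getTimestamp() helper reads those same bytes. As a minimal sketch of the decoding in plain Python (the function name objectid_creation_time is made up for illustration, and no MongoDB connection is needed):

```python
from datetime import datetime, timezone


def objectid_creation_time(oid_hex):
    """Return the creation time embedded in a 24-character ObjectId string."""
    if len(oid_hex) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    # The first 4 bytes (8 hex characters) are seconds since the Unix epoch.
    seconds = int(oid_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# The _id used in this step
print(objectid_creation_time("51cb590584919759671e4687").isoformat())
# → 2013-06-26T21:11:33+00:00
```

The remaining eight bytes only make the value unique, so they are ignored here.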
Step 04
Inserting an Apache log file into MongoDB
Now that you know some things about MongoDB, it is time to do
something interesting and useful. A log file from Apache will be
inserted inside a MongoDB database using a Python script.
The Python script is executed as follows:
…where www6.ex000704.log.gz is the name of the compressed (for saving disk space) log file.
Step 05
The storeDB.py Python script
The storeDB.py script uses the PyMongo Python module to connect to
MongoDB. The MongoDB server is running on localhost and listens on port
27017. For every inserted BSON document, its _id field is printed on
screen. Finally, the script prints the total number of documents
inserted in the MongoDB database.
The host and its port number are hard-coded inside the script, so change them to match yours.
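The behaviour described above can be sketched roughly as follows. This is not the storeDB.py script itself, only an illustration under stated assumptions: the log is assumed to be in Apache's Common Log Format, every field name except StatusCode is invented here, and the calls used (MongoClient, insert_one) are those of current PyMongo releases.

```python
import gzip
import re
import sys

# Assumed Common Log Format: host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'(?P<Host>\S+) \S+ \S+ \[(?P<Time>[^\]]+)\] '
    r'"(?P<Request>[^"]*)" (?P<StatusCode>\d{3}) (?P<Size>\S+)'
)


def parse_line(line):
    """Turn one log line into a dict of strings, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def store_log(path, collection):
    """Insert every parsable line of a gzipped log; return the number inserted."""
    total = 0
    with gzip.open(path, "rt", errors="replace") as handle:
        for line in handle:
            doc = parse_line(line)
            if doc is None:
                continue  # skip lines that are not in the assumed format
            result = collection.insert_one(doc)
            print(result.inserted_id)  # the _id generated for this document
            total += 1
    return total


if __name__ == "__main__" and len(sys.argv) == 2:
    from pymongo import MongoClient  # requires the PyMongo module

    client = MongoClient("localhost", 27017)  # hard-coded host and port
    try:
        count = store_log(sys.argv[1], client.LUD.apacheLogs)
        print("Inserted", count, "documents")
    finally:
        client.close()
```

Because every captured value is kept as a string, a status code ends up stored as "404" rather than the number 404, which matches the string used in the find() query later in this article.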
Step 06
Connecting to MongoDB using PyMongo
You first need to connect to MongoDB using:
You then select the database name you want (LUD) using the following line of code:
db = connMongo.LUD
And finally you select the name of the collection (apacheLogs) to store the data:
logs = db.apacheLogs
After finishing your interaction with MongoDB you should close the connection as follows:
connMongo.close()
Step 07
Displaying BSON documents from the apacheLogs collection
Type the following in order to connect to the MongoDB shell:
$ mongo
Select the desired database as follows:
> use LUD
See the available collections for the LUD database as follows:
> show collections
apacheLogs
system.indexes
Lastly, execute the following command to see all the contents of the apacheLogs collection:
> db.apacheLogs.find()
If the output is long, type ‘it’ to go to the next screen.
Step 08
A replication example
Imagine that you have your precious data on your MongoDB server and
there is a power outage. Can you access your data? Is your data safe?
To avoid such difficult questions, you can use replication to keep
your data both safe and available. Replication also allows you to do
maintenance tasks without downtime and have MongoDB servers in different
geographical areas.
Step 09
Running the three MongoDB servers from the command line
For this example, you need three MongoDB server processes running.
We ran the three MongoDB servers, on their respective machines, as follows:
Note: You are going to see lots of output on your screen.
Step 10
More information about the three MongoDB servers
You should specify the name of the replica set (LUDev) when you start
the MongoDB server, and the data directory specified by the --dbpath
parameter must already exist. You do not necessarily need three
discrete Linux machines. You can use the same machine (IP address) as
long as you are using different port numbers and directories.
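To make the single-machine variant concrete, here is a small sketch that prints one mongod command line per member. The --replSet, --dbpath and --port options are standard mongod options. The ports and data directories are hypothetical placeholders, and the helper name mongod_command is invented for this example.

```python
def mongod_command(replica_set, dbpath, port):
    """Build the argument list for one mongod process in a replica set."""
    return ["mongod", "--replSet", replica_set,
            "--dbpath", dbpath, "--port", str(port)]


# Hypothetical layout: three members on one machine, each with its own
# port and its own data directory (which must already exist).
members = [
    (27017, "/data/LUDev/node1"),
    (27018, "/data/LUDev/node2"),
    (27019, "/data/LUDev/node3"),
]

for port, dbpath in members:
    print(" ".join(mongod_command("LUDev", dbpath, port)))
```

Each printed line would be run in its own terminal, since mongod stays in the foreground and writes its log to the screen.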
Step 11
The rs.initiate() command
Once you have your MongoDB server processes up and running, you
should run the rs.initiate() command to actually create and enable the
replica set.
If everything is okay, you will see similar output on your screen. If
the MongoDB server processes are successfully running, most errors come
from misspelled IPs or port numbers. The rs.initiate() command is
simple but has a huge impact!
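rs.initiate() can also be given a configuration document that names the replica set and lists its members, which is where a misspelled IP or port would end up. The sketch below builds such a document in Python, assuming the host list given in the introduction. The helper name replica_set_config is invented for this example.

```python
def replica_set_config(name, hosts):
    """Build a replica set configuration document for the given members."""
    return {
        "_id": name,  # must match the name given when mongod was started
        "members": [
            {"_id": index, "host": host} for index, host in enumerate(hosts)
        ],
    }


config = replica_set_config(
    "LUDev",
    ["192.168.2.4:27019", "192.168.1.10:27019", "192.168.2.3:27018"],
)
print(config)
```

In the shell, the equivalent document would be passed as the argument to rs.initiate().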
Step 12
Information about replication
Any node can be primary, but only one node can be primary at a given time.
All write operations are executed at the primary node.
Read operations go to the primary and optionally to a secondary node.
MongoDB performs automatic failover.
MongoDB performs automatic recovery.
Replication is not a substitute for backup, so you should not forget to take backups.
Step 13
More information about replication
The former primary will rejoin the set as a secondary if it recovers.
Every node contacts the other nodes every few seconds to make sure that everything is okay.
It is advised to read from the primary node as it is the only one that contains the latest information for sure.
All the machines of a replica set should be equally powerful, because any of them may become primary and must then handle the full load of the MongoDB database.
Step 14
The rs.status() command output
The rs.status() command shows you the current status of your replica
set. It is the first command to execute to find out what is going on.
Apart from primary and secondary nodes, a third type of node exists.
It is called an arbiter. An arbiter node does not have a copy of the
data and cannot become primary. Arbiter nodes are only used for voting
in elections for a primary node.
Step 15
Selecting a new primary node
If you shut down the primary MongoDB server (by pressing Ctrl+C), the
logs of the remaining two MongoDB servers will show the failure of the
192.168.1.10:27018 MongoDB server:
Mon Jul 1 11:21:29.371 [rsHealthPoll] couldn't connect to 192.168.1.10:27018: couldn't connect to server 192.168.1.10:27018
Mon Jul 1 11:21:29.371 [rsHealthPoll] couldn't connect to 192.168.1.10:27018: couldn't connect to server 192.168.1.10:27018
It takes about 30 seconds for the new primary server to come up and
the new status can be seen by running the rs.status() command.
Important note: Once a primary node is down, a new primary can only be
elected if more than 50 per cent of the replica set's voting members are
still up and can reach each other. In this three-node set, the two
remaining nodes are enough.
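The majority is counted against all voting members of the set, not only against the nodes that are still up. A tiny sketch of that arithmetic (the function name can_elect_primary is made up for illustration):

```python
def can_elect_primary(voting_members, reachable_members):
    """True if the reachable members form a strict majority of the whole set."""
    return 2 * reachable_members > voting_members


# A three-member set survives the loss of one member but not of two.
for reachable in (3, 2, 1):
    print(reachable, "of 3 reachable:", can_elect_primary(3, reachable))
```

This is also why a set with an even number of data-bearing nodes is often given an arbiter: with four voting members, losing two leaves exactly 50 per cent, which is not a majority.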
Step 16
Trying to write data to a non-master node
If you try to write to a non-master node, MongoDB will not allow it and will return an error message.
Step 17
Useful MongoDB commands
Delete the full apacheLogs collection: db.apacheLogs.drop()
Show available databases: show dbs
Find documents within the apacheLogs collection that have a StatusCode of 404: db.apacheLogs.find({"StatusCode" : "404"})
Connect to the 192.168.1.10 server using port number 27017: mongo 192.168.1.10:27017
Step 18
Hints and tips
It is highly recommended that you first run find() to verify your criteria before actually deleting the data with remove().
Should you need to change the database schema and add another field,
MongoDB will not complain and will do it for you without any problems or
downtime.
The way to handle very large datasets is through sharding.
MongoDB has its own mechanism for storing large files, called GridFS, which splits each file into chunks stored as documents.
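The first hint, checking the criteria with find() before deleting, could look like this from Python. It is a sketch under assumptions: it uses count_documents, find and delete_many from current PyMongo releases in place of the shell's remove(), and the function name preview_then_delete and the confirm callback are invented for this example.

```python
def preview_then_delete(collection, criteria, confirm):
    """Show what matches criteria, then delete only if confirm(count) is true."""
    matching = collection.count_documents(criteria)
    # Print a few matching documents so the criteria can be checked by eye.
    for doc in collection.find(criteria, limit=5):
        print(doc)
    if matching and confirm(matching):
        return collection.delete_many(criteria).deleted_count
    return 0


# Possible usage (needs PyMongo and a running server, so left commented out):
# from pymongo import MongoClient
# logs = MongoClient("localhost", 27017).LUD.apacheLogs
# preview_then_delete(logs, {"StatusCode": "404"},
#                     lambda n: input("Delete %d documents? [y/N] " % n) == "y")
```

Nothing is deleted unless the callback returns a true value, so the same criteria are always inspected before they are used destructively.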