http://www.linuxuser.co.uk/features/create-and-save-data-with-a-mongodb-database
Forget about joins and SQL and try NoSQL databases – specifically MongoDB, the leading example
MongoDB is an open source document-oriented database system written
in C++ by Dwight Merriman and Eliot Horowitz. It runs on UNIX machines
as well as Windows and supports replication and sharding (aka horizontal
partitioning) – the process of separating a single database across a
cluster of machines.
Many programming languages – including C, C++, Erlang, Haskell, Perl, PHP, Python, Ruby and Scala – support MongoDB. It is suitable for many things, including archiving, event logging, storing documents, agile development, real-time statistics and analysis, gaming, and mobile and location services.
This article will show you how to store Apache log files in a MongoDB database with the help of a small Python script. We’ll also demonstrate how to implement replication in MongoDB.
Resources
MongoDB
PyMongo
Step by step
Step 01
Connecting to MongoDB for the first time
Your Linux distribution probably includes a MongoDB package, so go ahead and install it. Alternatively, you can download a precompiled binary or get the source code from www.mongodb.org and compile it yourself.
After installation, type mongo --version to find out which MongoDB version you are using, then type mongo to run the MongoDB shell and check whether the MongoDB server process is running.
Step 02
MongoDB terminology
NoSQL databases are designed for the web and do not support joins, complex transactions and other features of the SQL language. You can update a MongoDB database schema without downtime, but you should design your MongoDB database so that it does not need joins.
MongoDB's terminology is a little different from that of relational databases, so you should familiarise yourself with it: a database holds collections (roughly, tables), a collection holds documents (rows), and a document is made up of fields (columns).
Step 03
The _id field
Every time you insert a BSON document that has no _id field, MongoDB automatically generates one. The _id field acts as the primary key, and a generated value is an ObjectId that is always 12 bytes long. To find the creation time of the object with _id '51cb590584919759671e4687', execute the following command from the MongoDB shell:
> ObjectId("51cb590584919759671e4687").getTimestamp()
ISODate("2013-06-26T21:11:33Z")
Note: You should remember that queries are case-sensitive.
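The shell call works because the first four bytes of an ObjectId store its creation time as seconds since the Unix epoch. As a minimal sketch, the same value can be decoded in plain Python without PyMongo; the function name objectid_timestamp is made up for this illustration.

```python
from datetime import datetime, timezone


def objectid_timestamp(oid_hex):
    """Decode the creation time embedded in a 24-character ObjectId hex string."""
    if len(oid_hex) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    # The first 4 bytes (8 hex digits) are a big-endian count of seconds since 1970.
    seconds = int(oid_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(objectid_timestamp("51cb590584919759671e4687").isoformat())
# prints 2013-06-26T21:11:33+00:00
```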
Step 04
Inserting an Apache log file into MongoDB
Now that you know some things about MongoDB, it is time to do something interesting and useful: inserting an Apache log file into a MongoDB database using a Python script.
The Python script is executed as follows:
$ zcat www6.ex000704.log.gz | python2.7 storeDB.py
…where www6.ex000704.log.gz is the name of the log file, compressed to save disk space.
Step 05
The storeDB.py Python script
The storeDB.py script uses the PyMongo Python module to connect to MongoDB. The MongoDB server is running on localhost and listens to port 27017. For every inserted BSON document, its _id field is printed on screen. Finally, the script prints the total number of documents inserted in the MongoDB database.
The host and its port number are hard-coded inside the script, so change them to match yours.
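If you would rather not edit the script for every server, one option is to build the connection URI from environment variables. The sketch below is only an illustration: the variable names MONGO_HOST and MONGO_PORT and the helper mongo_uri are assumptions, not part of storeDB.py.

```python
import os


def mongo_uri(env=os.environ):
    """Build a MongoDB URI from the environment, defaulting to localhost:27017."""
    host = env.get("MONGO_HOST", "localhost")   # hypothetical variable name
    port = int(env.get("MONGO_PORT", "27017"))  # hypothetical variable name
    return "mongodb://%s:%d" % (host, port)


# With no variables set, the defaults from the article apply.
print(mongo_uri({}))
# Overriding both values, here with one of the replica set servers.
print(mongo_uri({"MONGO_HOST": "192.168.1.10", "MONGO_PORT": "27018"}))
```

The returned string can be passed directly to PyMongo when opening the connection.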
Step 06
Connecting to MongoDB using PyMongo
You first need to connect to MongoDB. With PyMongo 3.0 or later, use MongoClient (the pymongo.Connection class of older releases has been removed):
import pymongo
connMongo = pymongo.MongoClient('mongodb://localhost:27017')
You then select the database name you want (LUD) using the following line of code:
db = connMongo.LUD
And finally you select the name of the collection (apacheLogs) to store the data:
logs = db.apacheLogs
After finishing your interaction with MongoDB you should close the connection as follows:
connMongo.close()
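The full storeDB.py listing is not reproduced here, so the following is only a sketch of how such a script could look, assuming log lines in Apache's Common Log Format and PyMongo 3.0 or later on Python 3. Apart from StatusCode, which appears in the queries later in this article, the field names are assumptions, as are the helper names parse_line, store_lines and main.

```python
import re

# Common Log Format: host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)'
)


def parse_line(line):
    """Turn one Apache log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    host, _ident, user, timestamp, request, status, size = match.groups()
    return {
        "Host": host,
        "User": user,
        "Time": timestamp,
        "Request": request,
        "StatusCode": status,  # kept as a string, e.g. "404"
        "Size": size,
    }


def store_lines(lines, collection):
    """Insert every parsable line into collection and return how many were stored."""
    total = 0
    for line in lines:
        doc = parse_line(line)
        if doc is None:
            continue  # skip malformed lines
        result = collection.insert_one(doc)
        print(result.inserted_id)
        total += 1
    return total


def main():
    """Read log lines from standard input and store them in LUD.apacheLogs."""
    import sys
    import pymongo  # imported here so the parsing helpers work without PyMongo

    client = pymongo.MongoClient("mongodb://localhost:27017")
    try:
        total = store_lines(sys.stdin, client.LUD.apacheLogs)
        print("Inserted %d documents" % total)
    finally:
        client.close()


sample = '127.0.0.1 - - [26/Jun/2013:21:11:33 +0000] "GET /missing.html HTTP/1.1" 404 209'
print(parse_line(sample)["StatusCode"])  # prints 404
```

To use it as a command-line script, add a call to main() at the bottom and pipe the log into it as described in Step 04. Keeping the parsing separate from the insertion lets you check the regular expression against your own log format before touching the database.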
Step 07
Displaying BSON documents from the apacheLogs collection
Type the following in order to connect to the MongoDB shell:
$ mongo
Select the desired database as follows:
> use LUD
See the available collections for the LUD database as follows:
> show collections
apacheLogs
system.indexes
Lastly, execute the following command to see all the contents of the apacheLogs collection:
> db.apacheLogs.find()
If the output is long, type 'it' to go to the next screen.
Step 08
A replication example
Imagine that you have your precious data on your MongoDB server and there is a power outage. Can you access your data? Is your data safe?
To avoid such difficult questions, you can use replication to keep your data both safe and available. Replication also allows you to do maintenance tasks without downtime and have MongoDB servers in different geographical areas.
Step 09
Running the three MongoDB servers from the command line
For this example, you need three MongoDB server processes running.
We ran the three MongoDB servers, on their respective machines, as follows:
$ mongod --port 27018 --bind_ip 192.168.1.10 --dbpath ./mongo10 --rest --replSet LUDev
$ mongod --port 27019 --bind_ip 192.168.2.6 --dbpath ./mongo6 --rest --replSet LUDev
$ mongod --port 27018 --bind_ip 192.168.2.5 --dbpath ./mongo5 --rest --replSet LUDev
Note: You are going to see lots of output on your screen.
Step 10
More information about the three MongoDB servers
You should specify the name of the replica set (LUDev) when you start each MongoDB server, and the data directory specified by the --dbpath parameter must already exist. You do not necessarily need three discrete Linux machines: you can use the same machine (IP address) as long as each server uses a different port number and data directory.
Step 11
The rs.initiate() command
Once you have your MongoDB server processes up and running, you should run the rs.initiate() command from the MongoDB shell to actually create and enable the replica set.
If everything is okay, the command returns a document containing "ok" : 1. If the MongoDB server processes are running successfully, most errors come from misspelled IPs or port numbers. The rs.initiate() command is simple but has a huge impact!
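rs.initiate() accepts an optional configuration document that names the replica set and lists its members. The exact document used for this article is not shown, so the sketch below simply generates one for the three servers from Step 09; the helper name replica_set_config is made up for the illustration.

```python
import json


def replica_set_config(name, hosts):
    """Return a replica set configuration document with one member per host."""
    return {
        "_id": name,
        "members": [
            {"_id": index, "host": host} for index, host in enumerate(hosts)
        ],
    }


config = replica_set_config(
    "LUDev",
    ["192.168.1.10:27018", "192.168.2.6:27019", "192.168.2.5:27018"],
)
# The printed JSON can be passed to rs.initiate() in the MongoDB shell.
print(json.dumps(config, indent=2))
```

Generating the document from a list of host:port strings makes misspelled addresses, the most common source of errors, easier to spot in one place.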
Step 12
Information about replication
Any node can be primary, but only one node can be primary at a given time.
All write operations are executed at the primary node.
Read operations go to primary and optionally to a secondary node.
MongoDB performs automatic failover.
MongoDB performs automatic recovery.
Replication is not a substitute for backup, so you should not forget to take backups.
Step 13
More information about replication
The former primary will rejoin the set as a secondary if it recovers.
Every node contacts the other nodes every few seconds to make sure that everything is okay.
It is advised to read from the primary node, as it is the only one guaranteed to hold the latest data.
All the machines of a replica set should be equally powerful, because any of them may become primary and have to handle the full load of the MongoDB database.
Step 14
The rs.status() command output
The rs.status() command shows you the current status of your replica set. It is the first command to execute to find out what is going on.
Apart from primary and secondary nodes, a third type of node exists, called an arbiter. An arbiter node does not have a copy of the data and cannot become primary. Arbiter nodes are only used for voting in elections for a primary node.
Step 15
Selecting a new primary node
If you shut down the primary MongoDB server (by pressing Ctrl+C), the logs of the remaining two MongoDB servers will show the failure of the 192.168.1.10:27018 MongoDB server:
Mon Jul 1 11:21:29.371 [rsHealthPoll] couldn’t connect to 192.168.1.10:27018: couldn’t connect to server 192.168.1.10:27018
Mon Jul 1 11:21:29.371 [rsHealthPoll] couldn’t connect to 192.168.1.10:27018: couldn’t connect to server 192.168.1.10:27018
It takes about 30 seconds for the new primary server to come up and the new status can be seen by running the rs.status() command.
Important note: Once a primary node is down, the nodes that are still reachable must make up more than 50 per cent of all voting members of the replica set in order to elect a new primary server.
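An election succeeds only when the reachable nodes form a strict majority of the replica set's voting members. The arithmetic can be sketched as follows; the function name can_elect_primary is hypothetical and the code only illustrates the rule, it does not talk to MongoDB.

```python
def can_elect_primary(voting_members, reachable_members):
    """Return True if the reachable members form a strict majority of the set."""
    if not 0 <= reachable_members <= voting_members:
        raise ValueError("reachable_members must be between 0 and voting_members")
    return 2 * reachable_members > voting_members


# Three-node set from this article: losing one node still leaves a majority.
print(can_elect_primary(3, 2))  # prints True
# Losing two of three nodes does not.
print(can_elect_primary(3, 1))  # prints False
# An even-sized set split in half cannot elect a primary either.
print(can_elect_primary(4, 2))  # prints False
```

This is also why an arbiter is useful: it adds a vote, and so helps reach a majority, without storing a copy of the data.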
Step 16
Trying to write data to a non-master node
If you try to write to a non-master (secondary) node, MongoDB will not allow it and will generate an error message.
Step 17
Useful MongoDB commands
Delete the full apacheLogs collection: db.apacheLogs.drop()
Show available databases: show dbs
Find documents within the apacheLogs collection that have a StatusCode of 404: db.apacheLogs.find({"StatusCode" : "404"})
Connect to the 192.168.1.10 server using port number 27017: mongo 192.168.1.10:27017
Step 18
Hints and tips
It is highly recommended that you first run find() to verify your criteria before actually deleting the data with remove().
Should you need to change the database schema and add another field, MongoDB will not complain and will do it for you without any problems or downtime.
The way to handle very large datasets is through sharding.
MongoDB also provides GridFS, a specification for storing and retrieving files that are too large for a single BSON document by splitting them into chunks.