Sharing Knowledge With The World…

Category: No-SQL

No-SQL — A Basic Overview

NoSQL Systems
What are they?
That was the first question that came to my mind when I initially started working with these systems.
I heard about all kinds of technologies: Hadoop, MongoDB, Couch, Hive, Pig, Cassandra; and all kinds of technical jargon kicked in: Map-Reduce, JSON, and so on. So here I try to give our readers a basic overview of what NoSQL systems are.
With the boom of data in the modern day, not all large-scale data management and analysis problems are best solved using an RDBMS, though we can't say that NoSQL systems are here to replace traditional databases in any way. The choice of database depends largely on the application or the problem at hand. NoSQL systems are evolving at a fast pace, and it's a treat to watch the developments. As of now most NoSQL systems have no declarative query language, hence more programming is needed.
Let's start by discussing the features of a NoSQL system:

  • No defined schema: unlike a traditional DB, there are no predefined tables with columns; in NoSQL systems the schema is flexible.
  • Cheap: most NoSQL systems are open source and very easy to set up.
  • Scalability: these systems are highly scalable.
  • Availability: highly reliable in terms of availability.
  • Performance: high performance.

There are various NoSQL databases available as of now; I would broadly group them into four categories:

  • Map-Reduce framework.
  • Key-Value stores.
  • Document stores.
  • Graph database systems.

Map-Reduce framework
This framework is mostly implemented in OLAP (Online Analytical Processing) systems, where complex analysis covers a large section of the data. It was originally invented by Google, and there is now an open-source implementation of the MapReduce framework, Hadoop.
There is no data model; data is stored in files, both as input and as output. In Hadoop, the implemented file system is called HDFS (Hadoop Distributed File System).
The user provides a set of specific functions to process data using HDFS:
reader() – this function is used to read records from a file.
writer() – this function is used to write records to a file.

Once the user provides the above functions, the system provides the data processing and scalability.

Key-Value stores
This framework is more specific to OLTP (Online Transaction Processing) systems. Key-value stores allow users to store schema-less data in the form of a key-value relationship, in contrast to traditional SQL, which has to scan through hierarchies of tables or a structured schema to get complex data sets.

Document stores

Graph database systems



Journaling in MongoDB

Ankit Kansal & Nayan Naik
JOURNALING is the concept in MongoDB which provides you with a write-ahead logging (WAL) mechanism. It helps in achieving durability of write operations and crash recovery for your system.
The write-ahead logging technique in MongoDB provides durability against mishaps like power failures. A WAL-enabled system writes data to a log file before it writes data to the actual data files of the database. If a system using WAL suffers a failure, then after restarting it cross-verifies its state against the log, and after applying some operations it comes into a consistent state.
If a mongo server, i.e. a mongod instance, is running without journaling enabled, then there is a strong possibility that your system is not in a consistent state. In that scenario you must perform the repair mechanism, or resync your system if replication (keeping duplicate copies of data) is enabled.
On 64-bit systems journaling is enabled by default, while on 32-bit systems you explicitly have to enable it using the --journal option.
By default on 64-bit systems, mongod creates a folder named journal inside the specified db folder; however, you can change the destination of your journal log files. By default a journal file is 1GB in size, and when the max size is reached a new file is created. You can change this limit explicitly by changing some parameters.

When you start your mongod instance, the operating system maps your data files, i.e. the DB files present on disk, to the shared view (a view that sits between your mongod instance and the DB files). Basically it maps the memory addresses of the DB files, so from there you have access to the data present in the data files.
Once the OS has mapped the addresses and you perform any CRUD operations, the operating system simply flushes all the changes to your DB files with the help of the shared view. This is what happens when you do not have journaling enabled on your system. If the system crashes during this time, all the data that has not yet been flushed to the DB files will not be recovered, and thus your DB will be in an inconsistent state.

When you have journaling enabled, the private view comes into the picture. The private view is a view that does not have direct access to your DB files. So when you perform some operations, mongod writes these changes to the private view, and the private view then maps the data onto the journal files. As more and more data comes in, the journal file is appended to until its max limit is reached. Once the data is in the journal file, it can be recovered easily even if the system crashes. Next, the journal file contents are mapped to the shared view for flushing to the DB files, but before the data is flushed to the DB files the private view is remapped with the shared view. And finally the data is flushed to the DB files.
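The essential guarantee described above (log first, flush later, replay on restart) can be sketched in a few lines of JavaScript. This is a toy model, not MongoDB's implementation: ToyWAL, journal, and dbFiles are invented names standing in for the journal files and data files.

```javascript
// Toy write-ahead log: every operation is appended to a journal
// before it is applied to the data files, so a crash between the
// two steps can be repaired by replaying the journal.
class ToyWAL {
  constructor() {
    this.journal = [];   // stands in for the on-disk journal files
    this.dbFiles = {};   // stands in for the flushed data files
  }
  write(key, value) {
    this.journal.push({ key, value });  // 1. log first (durable)
    // 2. ...a crash here loses nothing: the write is in the journal...
    this.dbFiles[key] = value;          // 3. then flush to data files
  }
  recover(journal) {
    // After a crash, replay the journal to reach a consistent state.
    for (const { key, value } of journal) this.dbFiles[key] = value;
  }
}

const node = new ToyWAL();
node.write("a", 1);
node.journal.push({ key: "b", value: 2 }); // logged but never flushed: "crash"

const restarted = new ToyWAL();
restarted.recover(node.journal);           // replay recovers both writes
console.log(restarted.dbFiles); // { a: 1, b: 2 }
```

The real mongod adds group commits, memory-mapped views, and compression, but the recovery story is the same replay-the-log idea.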

Thanks a lot for reading…


Functions in MongoDB

Ankit Kansal & Nayan Naik 
MongoDB provides you with the functionality of creating functions at the session level as well as at the database level.
SESSION SPECIFIC:- Session-specific functions are those which you create for that session only, and they can only be accessed while you are logged in to that client. Once you disconnect from the session, the function is removed.
Let's start with a basic example:
  1. Write the keyword function.
  2. Write the name of the function which you want to create.
  3. Define the parameters.
  4. Write the logic for which the function is to be created.
  5. Use a return statement if you want to return any value from your function.

*The compiler itself will not check whether you are passing the correct data type or not; if you pass one number and one character, it will concatenate them and return the result.
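Putting the steps together: a session-specific function typed at the mongo shell prompt is just a JavaScript function. The name addValues below is our own example; the second call demonstrates the note above about the missing type check:

```javascript
// Steps 1-5: keyword, name, parameters, logic, return statement.
function addValues(a, b) {
  return a + b;   // no type checking: "+" may add or concatenate
}

console.log(addValues(2, 3));    // 5
console.log(addValues(1, "a"));  // 1a (number and string are concatenated)
```

Typing just the name addValues at the shell (without parentheses) would echo this definition back, as described below.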

The functions described are session specific, so once you close down your session, they are no longer available to you.
To use the function again and again, you have to save it inside the DB itself, in the collection named system.js.

*To view the definition of a function, you just have to write the name of the function and the shell will return the whole definition of the function.

MongoDB also enables you to save a function at the database level, in the collection named system.js. The stored document has the shape:
                   { _id: "nameOfFunction",
                     value : function(args) { /* body of the function */ } }

Saving such a document (e.g. with puts the function at the DB level in the system.js collection. By default this operation acquires a write and read lock on your mongod instance, hence you won't be able to read from or write to the DB for that particular time.


To call a function there are a few ways. One way is to use the
db.eval() method as shown in the figure below, but there are some restrictions with this method: as I already told you, when you call a function which resides at the DB level it acquires a full mongod lock. There is a way to avoid that, but using the db.eval() method you cannot achieve this functionality.

**If you wish to execute a function that does not block the mongod instance, then you have to use the eval command with the db.runCommand() method.


From the latest version you can directly call the DB-level methods: you just have to call the
>db.loadServerScripts()
method, and then by simply typing the method name you can run any of the system.js collection methods.
>printValue(args) – and the method will execute.

Thanks for reading…


Adding Mongo Components to Talend Open Studio

Ankit Kansal & Nayan Naik


Today we had an opportunity to implement data integration using Talend, where our source and target were both MongoDB. Talend has been playing a major role in implementing analytics over big database systems, and the functionality it provides is a treat! Strangely, Talend does not ship its own MongoDB components by default, though they can be downloaded from Talend Exchange (thanks to the components developed by Adrien MOGENET, you made our job really easy today!).
Components provided for MongoDB:

We are going to walk through the steps we followed to add these components to TOS and use them.

1) Download the components from Talend Exchange and search for Mongo.

2) The components will be in zipped format; unzip them.

3) Copy these unzipped components to the installation path of TOS.

4) Copy the mongo-1.3.jar file from the component folder into C:\Talend\TOS_DI-Win32-r84309-V5.1.1\lib\java
On many systems you might not be able to see this file; in that case proceed with ADMINISTRATOR privileges.

5) (optional for a few systems) Inside index.xml add:
<jarsToRelativePath key="mongo-1.3.jar" value="org.talend.designer.components.localprovider_5.1.1.r84309/components/tMongoDBConnection/mongo-1.3.jar"/>
<jarsToRelativePath key="mongo-1.3.jar" value="org.talend.designer.components.localprovider_5.1.1.r84309/components/tMongoDBInput/mongo-1.3.jar"/>
<jarsToRelativePath key="mongo-1.3.jar" value="org.talend.designer.components.localprovider_5.1.1.r84309/components/tMongoDBOutput/mongo-1.3.jar"/>

save index.xml

6) Restart TOS

Drag and drop the components from the palette under the No-SQL tag.
Describe the schema for the input.
Describe the schema for the output.
In this way you can transfer data to and from MongoDB.
* But keep in mind that these components do not support nested MongoDB documents at all…
Thanks for Reading…..

Talend Interview Questions

Author : Ankit Kansal & Nayan Naik


Hive Query Language

-by Ankit Kansal and Nayan Naik
DDL Operations

1)Creating Hive tables and browsing through them

    hive> CREATE TABLE pokes (foo INT, bar STRING);
    This command creates a table called pokes with two columns, the first being an integer and the other a string.
    hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
    This command creates a table called invites with two columns and a partition column called ds. The partition column is a virtual column. It is not part of the data itself but is derived from the partition that a particular dataset is loaded into.

By default, tables are assumed to be in text input format and the delimiters are assumed to be ^A (ctrl-a).

    hive> SHOW TABLES;
    This lists all the tables.

    hive> DESCRIBE invites;
    This shows the list of columns.
As for altering tables, table names can be changed and additional columns can be added:

    hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
    hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
    hive> ALTER TABLE events RENAME TO 3koobecaf;

DML Operation

1)Loading data from flat files into Hive:
    hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

This loads a file that contains two columns separated by ctrl-a into the pokes table. 'LOCAL' signifies that the input file is on the local file system. If 'LOCAL' is omitted, then it looks for the file in HDFS.

    - No verification of data against the schema is performed by the load command.
    - If the file is in HDFS, it is moved into the Hive-controlled file system namespace.

     The root of the Hive directory is specified by the option hive.metastore.warehouse.dir in hive-default.xml. We advise users to create this directory before trying to create tables via Hive.

hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');

The two LOAD statements above load data into two different partitions of the table invites. Table invites must be created as partitioned by the key ds for this to succeed.
hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
The above command will load data from an HDFS file/directory into the table.
Note that loading data from HDFS will result in moving the file/directory. As a result, the operation is almost instantaneous.

JDBC Connectivity

To connect to Hive using JDBC clients, we need to run a Hive server and connect to it from the client.


Note that you will need to run one hiveserver per client, since accessing the Hive server from multiple clients may land you in concurrency issues. Although hiveserver is multithreaded, it is not recommended to connect to it from multiple clients. Hive servers connect to the metastore, so you will also need to have the metastore service running.
You may refer to this link on how to write a JDBC client –
Are you connecting to Hive using the Hive CLI? In that case, you may run a single metastore or multiple metastores, and all of the Hive clients can connect to the metastore and run queries happily.


If you have a MySQL metastore configured, all you need to do is set the parameter javax.jdo.option.ConnectionURL in your hive-site.xml to the JDBC URL of the existing MySQL server and metastore database, and you should be able to connect to the existing store.
Make sure that in your hive-site.xml the parameter hive.metastore.local is set to false, that javax.jdo.option.ConnectionUserName and javax.jdo.option.ConnectionPassword are set to a username and password on the MySQL server you are connecting to, and that all privileges on the database you are using as your metastore are granted to this user. Last but not least, make sure you change hive.metastore.uris to point to the MySQL host.

You don't need to run an SQL script for MySQL. Follow these simple steps to set up your MySQL metastore.

Create a file 'hive-site.xml' under the Hive conf directory and set the following configuration parameters in that file (the property names match the parameters discussed above; the values are placeholders to fill in):

    <property>
      <name>hive.metastore.local</name>
      <value>false</value>
      <description>controls whether to connect to remote metastore server or open a new metastore server in Hive Client JVM</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value><fill in with the JDBC URL of your MySQL metastore database></value>
      <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
      <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value><fill in with user name></value>
      <description>username to use against metastore database</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value><fill in with password></value>
      <description>password to use against metastore database</description>
    </property>


If you want to use a specific port for thrift URIs, you will need to create a file called "" in the conf folder and add this line to that file (assuming you want to expose thrift URIs on port 9090):
export METASTORE_PORT=9090
Once you have these configurations set up, all you need to do is start the metastore service using the command:

hive --service metastore

This service runs in the foreground; you may use nohup to run the metastore in the background. You should be good to go.

Hive Installation over Hadoop Clusters

-by Ankit Kansal and Nayan Naik
Recently we had the opportunity to install Hive over Hadoop clusters, so we thought of sharing the steps we followed with our readers.
Before we even begin with the installation, some pre-requisites:
1) Hive requires Java 1.5 or above; however, 1.6 is recommended. You can first check whether Java is already present, and if so which version, using the following command:
    java -version
2) Hive MUST be able to find Hadoop, i.e. $HADOOP_HOME must be set.
    $HADOOP_HOME=<hadoop-install-dir>, which in our case was /usr/local/hadoop
3) Always create a separate directory for the Hive installation. We made ours in the /usr/local directory using the following command:
    mkdir hive
Now coming to the installation:
1) Download the latest stable version of Hive and place it in a directory (/usr/local). The file is in GZIP format.
2) Unzip the above file and place the Hive software in the /usr/local/hive directory we created, by executing the following commands:
    cd /usr/local

    tar -xzvf hive-x.y.z.tar.gz

    mv hive-0.8.1 hive

3) As mentioned above, Hive runs on top of Hadoop. It is sufficient to install Hive only on the Hadoop master node. Set the environment variable HIVE_HOME to point to the installation directory:
    $ cd hive-x.y.z

    $ export HIVE_HOME=$(pwd)

4) Add $HIVE_HOME/bin to your PATH variable:
    $ export PATH=$HIVE_HOME/bin:$PATH
5) Now, using the hadoop command in $HADOOP_HOME/bin, you must create the /tmp and /user/hive/warehouse directories in HDFS and give them appropriate privileges using chmod before a table can be created in Hive:
     $ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp

     $ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse

     $ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp

     $ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

6) Also set HIVE_HOME to the installation path:
      $ export HIVE_HOME=<hive-install-dir>
7) There we are, finally! Now, to use the Hive command line interface (CLI) from the shell:
    $ $HIVE_HOME/bin/hive
Some basic Configuration Information for HIVE:
1) Hive's default configuration is stored in <install-dir>/conf/hive-default.xml
2) Configuration variables can be changed by (re)defining them in <install-dir>/conf/hive-site.xml
3) The location of the Hive configuration directory can be changed by setting the HIVE_CONF_DIR environment variable.
4) Log4j configuration is stored in <install-dir>/conf/
5) Hive configuration is an overlay on top of Hadoop – meaning the Hadoop configuration variables are inherited by default.
6) Hive configuration can be manipulated by:
    Editing hive-site.xml and defining any desired variables (including hadoop variables) in it
    From the cli using the set command (see below)
    By invoking hive using the syntax:
    $ bin/hive -hiveconf x1=y1 -hiveconf x2=y2 – this sets the variables x1 and x2 to y1 and y2 respectively.
    By setting the HIVE_OPTS environment variable to "-hiveconf x1=y1 -hiveconf x2=y2", which does the same as above.
Thanks for reading the post! In our next post we will discuss some basic Hive commands.

Loops in MongoDB


Ankit Kansal & Nayan Naik
MongoDB itself provides some basic looping mechanisms through which you can achieve iteration functionality.
Let's discuss them one by one.
FOR LOOP:- The for loop is a basic loop available in MongoDB, and its syntax is just the same as in other languages.
for (var iTemp = 0; iTemp < 100; iTemp++) {
  // list of statements
}
Within the curly braces {…} you can write your own statements, and each of them will be executed as long as the loop condition is satisfied.
Let's see one example using a FOR LOOP:
for (var iTemp = 0; iTemp < 100; iTemp++) {
  db.things.insert({idies: iTemp});  // insert a document into the things collection
  db.things.remove({idies: iTemp});  // remove a document from the things collection
  print(iTemp);                      // print the local variable iTemp
}

MongoDB also lets you check a condition and then perform some operation.
Here is one example using a FOR LOOP combined with an IF condition:
for (var iTemp = 0; iTemp < 100; iTemp++) {
  if (iTemp % 2 == 0)                  // check whether the condition is true
    db.things.remove({idies: iTemp});  // remove a document from the things collection
  print(iTemp);                        // print the local variable iTemp
}
*This loop removes all the documents which have even idies.
When you fetch a result and store it in a variable, the cursor has certain methods by which you can iterate over the retrieved data and perform several actions:
  1. cursor.hasNext() –> returns a Boolean, true or false, i.e. whether any other document is present after the current one.
  2. –> returns the present document and moves the cursor pointer to the next available document.

Let's see a basic example…

    var curTemp = db.things.find();  // declare a variable curTemp and initialize it with the query result
    var temp;
    while (curTemp.hasNext()) {
      temp =;   // return a JSON document and assign it to the temp variable
      printjson(temp);            // print the JSON document
      print(temp.idies);          // print the value of the idies field, accessed via the . operator
    }
    *Once the cursor is completely traversed it will not return any value, because curTemp.hasNext() returns false:
    error: uncaught exception: error hasNext: false
    USING FOREACH:- forEach provides better functionality: with forEach you do not have to use cursor methods such as cursor.hasNext() and It performs those operations implicitly and automatically iterates over the available data/documents.


    db.collection_name.find().forEach( function(obj) { … } );


    Let's go through an example. The problem: I want to add a field that contains the employee name, department number, and employee number in a concatenated format.


    db.employee.find().forEach( function(obj) {
      var empno_str = (obj.empno).toString() + "-";
      var deptno_str = (obj.deptno).toString();
      var name_str = + "-";
      var final = empno_str + name_str + deptno_str;
      obj.new_field_name = final;
      print(final);;  // save updates the document, since the _id field is already present
    });

    *Using obj. you can access any field of the selected document, and here the save function updates the selected document, as the _id field is already present. More info: UPDATE USING SAVE