Sharing Knlowledge With The World…

Month: January 2013

Intro to MongoDB


Ankit Kansal & Nayan Naik

What is NO-SQL ?

NO-SQL is commonly interpreted as (NOT ONLY SQL).  NoSQL databases are not built primarily on tables, and generally do not use traditional SQL(Structure Query Language) for data retrieval and manipulation.
Usually in no-sql databases data will be stored in form of (key-value) pairs. These databases are generally used when business requires high amount of data irrespective of transaction management.

There are lot of NO-SQL databases available in the market such as:-
5)COUCHDB, e.t.c
But, in our article we will be telling you about the best practices of MONGODB.


                MongoDB is a open source no-sql database and is written in C++. It stores data inside JSON-like documents (using BSON — a binary version of JSON).

Some of the readers who are new to this will  search for what JSON exactly is …so to make it easy for them.
JSON(JavaScript Object Notation) :- It is a  lightweight data-interchange format. It is easy to understand. It is easy for parsing and to be generated.
                                                                                                                              Fig 1.1   
Figure shows a simple json document, where the information is stored in key-value pairs.

   empno:1    –for numerical data types codes(“”) must be avoided.

Terms used in MongoDB corresponding to RDBMS:-

                                                                                                     Fig 1.2
As, clearly mentioned in the above  figure these are the different terminologies used in MongoDb corresponding to RDBMS.


1) SCHEMA LESS STRUCTURE:- MongoDB provides schema independent approach while you are designing your collection.
  • documents in the same collection do not need to have the same set of fields or structure, and
  • common fields in a collection’s documents may hold different types of data
In Fig 1.3 there are two documents within the same collection let say emp.
In first document there are three columns _id,empno, ename where as in second document there are only two columns _id and empno this proves first within same collection it’s not necessary to have same set of fields for each document.
Again, In first document empno stores integer type data where as in second document empno is holding a string data type.
This feature provides flexibility to the application developer to design  application more easily.
2) FULL INDEX SUPPORT:-MongoDB also provides Index support for its collections as we have indexes in other databases for tables. The role and use criteria for  indexes are identical in both the databases.
Def:-An index is a data structure that allows you to quickly locate documents based on the values stored in certain specified fields/keys. 
1)Single Key Index
2)Composite/Compound key Index
3)Sparse Index
4)Unique Index

Some Special Indexes are:-
1)TTL Index
2)Geospatial Index
3)Geohaystack Index

*complete description about indexes will be available in later posts


Replication means duplicity, in mongoDB Replication plays a very important role for the production environment. Replication ensures mongoDB in achieving redundancy, backup, and automatic fail over.To achieve replication for mongoDB environment Replica set is used Replica set is nothing but a group of servers that creates a infrastructure for replication. 
Replication is based upon master and slave relationship that is a master is responsible to collect a data from client application and then slave request the master to make the data available for  replication/copy. fail over recovery is an automated process such that when master/primary node fails down the secondary automatically becomes primary without user intervention and when the failed node becomes available then it must worked as secondary till again something happen.
*implementation will be covered in future posts.

Sharding is nothing but partitioning of your collection.It majorly helps in making your application run faster while you are reading from your database. Sharding requires a Shard key to partition your collection and shard key must be chosen wisely so that your environment should work effectively because a heavy resource utilization will be used. In most of scenarios it’s preferable to choose your shard key having high cardinality.
In sharding environment a monos(router) is used. It’s a router which is responsible to route the request to the appropriate server to serve the request. It’s an automated process it automatically balances data and load across machines. Sharding provides additional write capacity by distributing the write load over a number of mongod (server)instances. 
                                         DISPLAYING SHARDING ARCHITECTURE
                                                                                                                       Fig 1.4
configsvr:-The config servers store the cluster’s metadata: shard server information and the data chunks it stored.
*implementation will be covered in future posts.



 Right now we have just covered the basic features and terminologies relating with mongoDB, in our future posts we are going to have some advanced topic and with implementation strategies. 


Continue Reading

Data Profiling with Talend

Recently had written a technical paper for our Organizationm which showcased the Talend Data Profiler capabilities.This technical paper also gives a brief overview of Talend data profiling on Big databases like Hadoop.


                                                                                         Ankit Kansal & Nayan Naik
      As a consequence of expansion of modern technology, the volume and variety of data are increasing dramatically. Therefore, the reputation blemish and losses caused are primary motivations for technologies and methodologies for Data Quality that have been applied successfully in many implementations. The first step for the Data Quality is Data Profiling. It is the process of collecting the statistics of the data to understand the quality and grade of the data before doing any operations on it. The statistics involve the metrics of data quality, conformance to the standards and business logics, data risks, duplications and many others.
     The aim of this contribution is to show the role of data profiling for a best implementation and capabilities of Talend in Data profiling. The problem is very difficult because the present data sets are in many different forms and are important. Business executives understand that making better and faster business decisions plays most important role in the business development and customer satisfaction and definitely the per-requisite for these kinds of decisions is the effective and efficient data. Bulks of data from several sources are successfully combined in a data warehouse and used to do many business analyses. This kind of accurate analysis can be done only with varied data but not with a single file.
     For example if duplicates are present within a file or across a set of files, then the duplicates might be identified. Record linkage uses name, address and other information such as income ranges, type of industry, and category to determine whether two or more records should be associated with the same entity. The uses of the data are affected by lack of quality due to duplication of records and missing or erroneous values of variables. Duplication can waste money and yield error. If a financial institution or a bank has a customer incorrectly represented in two different accounts, then the bank might repeatedly credit/debit the customer. This kind of analysis should be done at the earlier stage of the implementations and the proper data should be used for reporting for better benefits/profits.
There are various flavors of the data profiling tools in the market. One of the most effective profiling tools is Talend Data Quality. The Talend Data Profiler is part of the Talend Data Quality Suite.


Talend provides a wide range of Analysis and Statistics for examining the data available in an existing data source. The purpose of these features is basically to provide information about the data quality of the source based on a set of metrics. Apart from the basic statistics, Talend also provides additional metadata information obtained during data profiling. Talend provides different structural level Analysis, starting from column-level analysis to multi column analysis to Table Analysis to Schema level Analysis to Database Structure Analysis and offers excellent performance throughout all structures. Talend provides a library metadata, where-in regex patterns (for multiple databases as well as Java), SQL’s,Business Rules,System Indicators and User-defined Indicators. Talend data Profiler can connect to a various databases along with support for Hive database. Profiling can also be done on flat files using Talend.MDM connections can also be made.Apart from the basic Statistics provided by Talend for an analysis, talend provided flexibility to DEFINE used-defined-statistics.

1) Connection Analysis

2) Catalog Analysis
3) Schema Analysis
4) Table Analysis
5) Column Analysis
6) Redundancy Analysis
7) Column Co-relation Analysis
Connection Analysis:
Connection analysis helps us perform profiling on the complete database connected. It gives an overview of the content of your database.
Data Structure Analysis:
Overview of the database , the number of tables in the database , the number of views , no of rows per table/views , the indexes present and the number of key’s present.
Catalog Analysis:
This analysis is specific to the databases which defines catalogs. Analysis will give an overview of the catalog. The analysis computes the number of tables, views and the number of rows per table, views along with the number of keys and indexes for each catalog.
Schema Analysis:
This analysis is specific to the databases where schemas are defined for a database. The analysis computes the number of tables, views and the number of rows per table, views along with the number of keys and indexes for each schema.
Business Rule Analysis:
This analysis defines a business rule over which the table analysis is based. If age should be greater than 18, based on the business rule we come up with an Analysis. Talend provides flexibility in adding multiple rules onto a table. We can add rules in the Library section.
Functional Dependency:
Determines up to what extent the value of one column in a table determines the other column. If the “Zip code” column will always depend on the “City” column.
Column Set Analysis:
This Analysis provides profiling on a set of columns rather than a single column. Patterns as well as Simple indicators can be added to this type of Analysis. This analysis will retrieve as many rows as the number of distinct rows in order to compute the statistics. Hence, it is advised to avoid selecting primary keys in this kind of analysis.
Column Analysis:
This Analysis helps you to profile the data on basis of a single column.Apart for Patterns and SQL’s,Simple Indicators like No of Nulls,Unique Count,Row count etc can be a part of the analysis.User defined Indicators can also be added.
Column Set analysis:
This Analysis is similar to the one defined at the Table Analysis Level.Here we have a data mining option as well which lets Talend choose the appropriate metrics for the associated columns since not all indicators can be computed on all metrics.
Column Set analysis lets you add patterns and the set of Indicators are limited to Row Count,Distinct Count,Duplicate Count,Unique Count.
Redundancy Analysis: Column Content Analysis
This analysis compares data of two columns to check how many values of column 1 are present in column2. This is basically used to verify Foreign Key/Primary Key relationship.
Column Correlation Analysis: Numerical Co-relation Analysis
Shows co-relation between nominal and numerical values. This type of analysis returns a bubble chart, the average of the numerical data is computed and represented on a vertical axis. This helps us easily identify the data quality issues by looking at the extreme values in the graph.
Time Co-relation Analysis:
Shows co-relation between nominal and time data in a gnatt chart.
Nominal Co-relation Analysis:
This kind of analysis shows correlations between nominal and numerical data in a bubble chart. For each nominal value, the average of the numerical data is computed and represented on the vertical axis.
The statistics visualizations in very rich and more analyzable formats. Below are some of the Profiling reports generated using Talend Data Quality.



Data Profiling with Talend also allows users to analyze their data in their Hive database on Hadoop. It offers an “IN PLACE” data profiling which means data does not need to go through the time-consuming process of being extracted from Hadoop before being profiled.
Apart from the basic profiling solutions offered by Talend which helps solving the data redundancy, data inconsistency issues which are persistent in Big Data, Talend also offers a variety of domain specific tests like E-mail validation ,Postal codes,Hex-Color codes, VAT-Number codes, Date-formats, Phone-number formats and many more.
The main advantage of profiling on BigData is, you do not need to do things differently for big data. Its as simple as doing an Analysis on a RDBMS or on a Flat File. Talend provides solutions to perform Profiling on Apache Hadoop,Hive database without having specific expertise in these areas and helps understanding the structure of the Hadoop clusters. Talend Also provides the flexibility to come up with User defined Indicators which can be used in a data profiling Analysis. This UDI is Hive Query Language specific.It has been noted that there is no significant changes in performance while performing a data profiling Analysis on a huge volume of data in a database like Hive. The ease of connectivity with the hive database is another feature which talend offers, All hive table information along with the column data information is part of the metadata in Talend once the connection is established.
Talend offers a variety of Analysis to get detailed understanding on the level of quality in the organization’s data. Talend uses the Hadoop clusters to its maximum advantage, allowing user to add multiple servers and increase performance.
1) Easy connectivity to Hive database
2) A set of System Defined Indicators and User defined Indicators (Statistics) that can be added to the Hive database analysis to get a graphical output for the Analysis.
3) Set of Rules.
4)A source file folder where in we can store Scripts.
5)Talend also provides the capability to set Quality threshholds to the data in order to define a Data Quality parameter.

Comparisions Between Talend and Pentaho
          Capabilities/Analysis                            Talend       Pentaho

  • Connection Analysis                               Yes            Yes
  • Data Structure Analysis                          Yes            No
  • Catalog Analysis                                    Yes            No
  • Schema Analysis                                    Yes            No
  • Column Analysis                                    Yes            Yes
  • Time Co-relation Analysis                      Yes            No
  • Nominal Co-relation Analysis                 Yes           No
  • Column Set Analysis                              Yes           No
  • String,Boolean,Number Analyzer           Yes           Yes
  • Date Gap Analyzer                                No           Yes
  • Date Time Analyzer                               No           Yes
  • value Distribution                                  Yes          Yes
  • Weekdays Analyzer                              No           Yes
  • Reference Data Analyzer                      No           Yes
If you consider using Talend Data Quality for Data profiling it will typically cost you nothing to try it out first as it is open source software. None of this is to say, of course, that your business should necessarily use open source software for everything. But with all the many benefits it holds, you’d be remiss not to consider it seriously.
Though there are multiple Data profiling and quality tools are available in the market, with the most powerful analysis features, visualization options and various connectivity options Talend Data Quality is one tool for consideration.

Continue Reading