Recently had written a technical paper for our Organizationm which showcased the Talend Data Profiler capabilities.This technical paper also gives a brief overview of Talend data profiling on Big databases like Hadoop.
DATA PROFILING WITH TALEND
Ankit Kansal & Nayan Naik
As a consequence of expansion of modern technology, the volume and variety of data are increasing dramatically. Therefore, the reputation blemish and losses caused are primary motivations for technologies and methodologies for Data Quality that have been applied successfully in many implementations. The first step for the Data Quality is Data Profiling. It is the process of collecting the statistics of the data to understand the quality and grade of the data before doing any operations on it. The statistics involve the metrics of data quality, conformance to the standards and business logics, data risks, duplications and many others.
The aim of this contribution is to show the role of data profiling for a best implementation and capabilities of Talend in Data profiling. The problem is very difficult because the present data sets are in many different forms and are important. Business executives understand that making better and faster business decisions plays most important role in the business development and customer satisfaction and definitely the per-requisite for these kinds of decisions is the effective and efficient data. Bulks of data from several sources are successfully combined in a data warehouse and used to do many business analyses. This kind of accurate analysis can be done only with varied data but not with a single file.
For example if duplicates are present within a file or across a set of files, then the duplicates might be identified. Record linkage uses name, address and other information such as income ranges, type of industry, and category to determine whether two or more records should be associated with the same entity. The uses of the data are affected by lack of quality due to duplication of records and missing or erroneous values of variables. Duplication can waste money and yield error. If a financial institution or a bank has a customer incorrectly represented in two different accounts, then the bank might repeatedly credit/debit the customer. This kind of analysis should be done at the earlier stage of the implementations and the proper data should be used for reporting for better benefits/profits.
There are various flavors of the data profiling tools in the market. One of the most effective profiling tools is Talend Data Quality. The Talend Data Profiler is part of the Talend Data Quality Suite.
HOW CAN TALEND HELP IN DATA PROFILING??
Talend provides a wide range of Analysis and Statistics for examining the data available in an existing data source. The purpose of these features is basically to provide information about the data quality of the source based on a set of metrics. Apart from the basic statistics, Talend also provides additional metadata information obtained during data profiling. Talend provides different structural level Analysis, starting from column-level analysis to multi column analysis to Table Analysis to Schema level Analysis to Database Structure Analysis and offers excellent performance throughout all structures. Talend provides a library metadata, where-in regex patterns (for multiple databases as well as Java), SQL’s,Business Rules,System Indicators and User-defined Indicators. Talend data Profiler can connect to a various databases along with support for Hive database. Profiling can also be done on flat files using Talend.MDM connections can also be made.Apart from the basic Statistics provided by Talend for an analysis, talend provided flexibility to DEFINE used-defined-statistics.
1) Connection Analysis
2) Catalog Analysis
3) Schema Analysis
4) Table Analysis
5) Column Analysis
6) Redundancy Analysis
7) Column Co-relation Analysis
Connection analysis helps us perform profiling on the complete database connected. It gives an overview of the content of your database.
Data Structure Analysis:
Overview of the database , the number of tables in the database , the number of views , no of rows per table/views , the indexes present and the number of key’s present.
This analysis is specific to the databases which defines catalogs. Analysis will give an overview of the catalog. The analysis computes the number of tables, views and the number of rows per table, views along with the number of keys and indexes for each catalog.
This analysis is specific to the databases where schemas are defined for a database. The analysis computes the number of tables, views and the number of rows per table, views along with the number of keys and indexes for each schema.
Business Rule Analysis:
This analysis defines a business rule over which the table analysis is based. If age should be greater than 18, based on the business rule we come up with an Analysis. Talend provides flexibility in adding multiple rules onto a table. We can add rules in the Library section.
Determines up to what extent the value of one column in a table determines the other column. If the “Zip code” column will always depend on the “City” column.
Column Set Analysis:
This Analysis provides profiling on a set of columns rather than a single column. Patterns as well as Simple indicators can be added to this type of Analysis. This analysis will retrieve as many rows as the number of distinct rows in order to compute the statistics. Hence, it is advised to avoid selecting primary keys in this kind of analysis.
This Analysis helps you to profile the data on basis of a single column.Apart for Patterns and SQL’s,Simple Indicators like No of Nulls,Unique Count,Row count etc can be a part of the analysis.User defined Indicators can also be added.
Column Set analysis:
This Analysis is similar to the one defined at the Table Analysis Level.Here we have a data mining option as well which lets Talend choose the appropriate metrics for the associated columns since not all indicators can be computed on all metrics.
Column Set analysis lets you add patterns and the set of Indicators are limited to Row Count,Distinct Count,Duplicate Count,Unique Count.
Redundancy Analysis: Column Content Analysis
This analysis compares data of two columns to check how many values of column 1 are present in column2. This is basically used to verify Foreign Key/Primary Key relationship.
Column Correlation Analysis: Numerical Co-relation Analysis
Shows co-relation between nominal and numerical values. This type of analysis returns a bubble chart, the average of the numerical data is computed and represented on a vertical axis. This helps us easily identify the data quality issues by looking at the extreme values in the graph.
Time Co-relation Analysis:
Shows co-relation between nominal and time data in a gnatt chart.
Nominal Co-relation Analysis:
This kind of analysis shows correlations between nominal and numerical data in a bubble chart. For each nominal value, the average of the numerical data is computed and represented on the vertical axis.
The statistics visualizations in very rich and more analyzable formats. Below are some of the Profiling reports generated using Talend Data Quality.
TALEND BIG DATA PROFILING
Data Profiling with Talend also allows users to analyze their data in their Hive database on Hadoop. It offers an “IN PLACE” data profiling which means data does not need to go through the time-consuming process of being extracted from Hadoop before being profiled.
Apart from the basic profiling solutions offered by Talend which helps solving the data redundancy, data inconsistency issues which are persistent in Big Data, Talend also offers a variety of domain specific tests like E-mail validation ,Postal codes,Hex-Color codes, VAT-Number codes, Date-formats, Phone-number formats and many more.
The main advantage of profiling on BigData is, you do not need to do things differently for big data. Its as simple as doing an Analysis on a RDBMS or on a Flat File. Talend provides solutions to perform Profiling on Apache Hadoop,Hive database without having specific expertise in these areas and helps understanding the structure of the Hadoop clusters. Talend Also provides the flexibility to come up with User defined Indicators which can be used in a data profiling Analysis. This UDI is Hive Query Language specific.It has been noted that there is no significant changes in performance while performing a data profiling Analysis on a huge volume of data in a database like Hive. The ease of connectivity with the hive database is another feature which talend offers, All hive table information along with the column data information is part of the metadata in Talend once the connection is established.
Talend offers a variety of Analysis to get detailed understanding on the level of quality in the organization’s data. Talend uses the Hadoop clusters to its maximum advantage, allowing user to add multiple servers and increase performance.
1) Easy connectivity to Hive database
2) A set of System Defined Indicators and User defined Indicators (Statistics) that can be added to the Hive database analysis to get a graphical output for the Analysis.
3) Set of Rules.
4)A source file folder where in we can store Scripts.
5)Talend also provides the capability to set Quality threshholds to the data in order to define a Data Quality parameter.
Comparisions Between Talend and Pentaho
Capabilities/Analysis Talend Pentaho
Connection Analysis Yes Yes
Data Structure Analysis Yes No
Catalog Analysis Yes No
Schema Analysis Yes No
Column Analysis Yes Yes
Time Co-relation Analysis Yes No
Nominal Co-relation Analysis Yes No
Column Set Analysis Yes No
String,Boolean,Number Analyzer Yes Yes
Date Gap Analyzer No Yes
Date Time Analyzer No Yes
value Distribution Yes Yes
Weekdays Analyzer No Yes
Reference Data Analyzer No Yes
If you consider using Talend Data Quality for Data profiling it will typically cost you nothing to try it out first as it is open source software. None of this is to say, of course, that your business should necessarily use open source software for everything. But with all the many benefits it holds, you’d be remiss not to consider it seriously.
Though there are multiple Data profiling and quality tools are available in the market, with the most powerful analysis features, visualization options and various connectivity options Talend Data Quality is one tool for consideration.