Sharing Knowledge With The World…

Data Profiling with Talend

I recently wrote a technical paper for our organization which showcases the capabilities of the Talend Data Profiler. The paper also gives a brief overview of Talend data profiling on big databases such as Hadoop.

DATA PROFILING WITH TALEND

                                                                                         Ankit Kansal & Nayan Naik
      As a consequence of the expansion of modern technology, the volume and variety of data are increasing dramatically. The reputational damage and financial losses caused by poor data are therefore the primary motivation for the Data Quality technologies and methodologies that have been applied successfully in many implementations. The first step in Data Quality is Data Profiling: the process of collecting statistics about the data in order to understand its quality and grade before performing any operations on it. These statistics cover data quality metrics, conformance to standards and business logic, data risks, duplication and many other aspects.
     The aim of this contribution is to show the role of data profiling in a successful implementation and the capabilities of Talend for data profiling. The problem is difficult because today's data sets come in many different forms and are business-critical. Business executives understand that making better and faster business decisions plays the most important role in business development and customer satisfaction, and the prerequisite for such decisions is effective, reliable data. Bulk data from several sources is combined in a data warehouse and used for many kinds of business analysis; accurate analysis of this kind requires data consolidated from varied sources, not a single isolated file.
     For example, if duplicates are present within a file or across a set of files, they need to be identified. Record linkage uses name, address and other information such as income range, type of industry and category to determine whether two or more records should be associated with the same entity. The usefulness of the data is reduced when quality suffers from duplicated records and missing or erroneous values. Duplication can waste money and produce errors: if a financial institution or a bank has a customer incorrectly represented in two different accounts, the bank might repeatedly credit or debit that customer. This kind of analysis should be done at an early stage of the implementation so that clean data is used for reporting and better business benefit.
There are various flavors of data profiling tools in the market. One of the most effective is Talend Data Quality; the Talend Data Profiler is part of the Talend Data Quality suite.

HOW CAN TALEND HELP IN DATA PROFILING?

Talend provides a wide range of analyses and statistics for examining the data available in an existing data source. The purpose of these features is to report on the data quality of the source based on a set of metrics. Apart from the basic statistics, Talend also provides additional metadata information gathered during profiling. Talend supports analysis at different structural levels, from column-level analysis to multi-column analysis, table analysis, schema-level analysis and database structure analysis, and offers excellent performance throughout. Talend also provides a metadata library containing regex patterns (for multiple databases as well as Java), SQL patterns, business rules, system indicators and user-defined indicators. The Talend Data Profiler can connect to various databases, including Hive; profiling can also be done on flat files, and MDM connections can be made. Apart from the basic statistics provided for an analysis, Talend gives you the flexibility to define user-defined statistics. The main analysis types are:

1) Connection Analysis

2) Catalog Analysis
3) Schema Analysis
4) Table Analysis
5) Column Analysis
6) Redundancy Analysis
7) Column Correlation Analysis
Connection Analysis:
Connection analysis performs profiling on the complete connected database and gives an overview of its content.
Data Structure Analysis:
This gives an overview of the database: the number of tables, the number of views, the number of rows per table/view, the indexes present and the number of keys.
Catalog Analysis:
This analysis is specific to databases that define catalogs and gives an overview of each catalog. It computes the number of tables and views, the number of rows per table and view, and the number of keys and indexes for each catalog.
Schema Analysis:
This analysis is specific to databases where schemas are defined. It computes the number of tables and views, the number of rows per table and view, and the number of keys and indexes for each schema.
Business Rule Analysis:
This analysis defines a business rule on which a table analysis is based. For example, a rule might state that age must be greater than 18, and the analysis then reports how well the data conforms to it. Talend provides the flexibility to add multiple rules to a table; rules are created in the Library section.
Functional Dependency:
Determines to what extent the values of one column in a table determine the values of another column, for example whether the "Zip code" column always depends on the "City" column.
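As a rough illustration of the idea (this is not Talend's implementation, and the city/zip sample data is made up), the Java sketch below checks, on a small in-memory sample, whether each "City" value maps to exactly one "Zip code" value:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class FunctionalDependencyCheck {
    public static void main(String[] args) {
        // Hypothetical (city, zip code) pairs taken from a table.
        String[][] rows = {
            {"Bangalore", "560001"}, {"Bangalore", "560001"},
            {"Mumbai", "400001"},    {"Mumbai", "400002"}   // violates City -> Zip
        };

        // Collect every zip code seen for each city.
        Map<String, Set<String>> zipsPerCity = new HashMap<>();
        for (String[] row : rows) {
            zipsPerCity.computeIfAbsent(row[0], k -> new HashSet<>()).add(row[1]);
        }

        // The dependency City -> Zip holds only for cities with a single zip code.
        long compliant = zipsPerCity.values().stream().filter(z -> z.size() == 1).count();
        System.out.println(compliant + " of " + zipsPerCity.size()
                + " cities determine a single zip code");
    }
}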
  
Column Set Analysis:
This analysis profiles a set of columns rather than a single column. Patterns as well as simple indicators can be added to this type of analysis. The analysis retrieves as many rows as there are distinct rows in order to compute the statistics, so it is advisable to avoid selecting primary keys in this kind of analysis.
Column Analysis:
This analysis helps you profile the data of a single column. Apart from patterns and SQL patterns, simple indicators such as number of nulls, unique count and row count can be part of the analysis. User-defined indicators can also be added.
Column Set analysis:
This analysis is similar to the one defined at the table analysis level. Here there is also a data mining option that lets Talend choose the appropriate metrics for the associated columns, since not all indicators can be computed for every type of data.
Column set analysis lets you add patterns, and the set of indicators is limited to row count, distinct count, duplicate count and unique count.
Redundancy Analysis: Column Content Analysis
This analysis compares the data of two columns to check how many values of column 1 are present in column 2. It is typically used to verify a foreign key/primary key relationship.
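Conceptually, the check boils down to counting how many entries of the child column find a match in the parent column. The Java/JDBC sketch below illustrates that idea; the connection URL, table and column names are hypothetical, and Talend's own implementation may differ.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RedundancyCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; replace with your own database.
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/sales", "user", "password");
             Statement stmt = con.createStatement()) {
            // Count order rows whose customer_id has a match in the customer table,
            // versus the total number of order rows.
            String sql = "SELECT "
                    + " SUM(CASE WHEN c.id IS NOT NULL THEN 1 ELSE 0 END) AS matched, "
                    + " COUNT(*) AS total "
                    + "FROM orders o LEFT JOIN customers c ON o.customer_id = c.id";
            try (ResultSet rs = stmt.executeQuery(sql)) {
                if (rs.next()) {
                    System.out.println(rs.getLong("matched") + " of "
                            + rs.getLong("total") + " values have a match");
                }
            }
        }
    }
}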
Column Correlation Analysis: Numerical Correlation Analysis
Shows the correlation between nominal and numerical values. This type of analysis returns a bubble chart: the average of the numerical data is computed and represented on the vertical axis. The extreme values in the graph make data quality issues easy to spot.
Time Co-relation Analysis:
Shows the correlation between nominal and time data in a Gantt chart.
Nominal Correlation Analysis:
This kind of analysis shows correlations between nominal and numerical data in a bubble chart. For each nominal value, the average of the numerical data is computed and represented on the vertical axis.
The statistics are visualized in rich and easily analyzable formats. Below are some of the profiling reports generated using Talend Data Quality.

 

TALEND BIG DATA PROFILING

Data Profiling with Talend also allows users to analyze data in a Hive database on Hadoop. It offers "in place" data profiling, which means the data does not need to go through the time-consuming process of being extracted from Hadoop before being profiled.
Apart from the basic profiling features, which help address the data redundancy and inconsistency issues that are common in big data, Talend also offers a variety of domain-specific checks such as e-mail validation, postal codes, hex colour codes, VAT number codes, date formats, phone number formats and many more.
The main advantage of profiling big data with Talend is that you do not need to do anything differently: it is as simple as running an analysis on an RDBMS or on a flat file. Talend lets you profile Apache Hadoop/Hive databases without specific expertise in these areas and helps you understand the structure of the Hadoop clusters. Talend also provides the flexibility to create user-defined indicators (UDIs) that can be used in a profiling analysis; for Hive these UDIs are written in Hive Query Language. No significant change in performance has been observed when profiling a large volume of data in a database like Hive. Ease of connectivity with the Hive database is another feature Talend offers: once the connection is established, all Hive table information, along with the column information, becomes part of the Talend metadata.
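To give a flavour of what such a Hive-specific indicator might look like, the sketch below runs a hypothetical e-mail check (rows whose email column contains no '@') through the Hive JDBC driver. The host, database, table and column names are made up, and this is only an illustration of the kind of HiveQL a UDI could contain, not Talend's own code.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveEmailIndicator {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 connection; adjust host, port and database.
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {
            // Example indicator: rows whose email column has no '@' at all.
            String hql = "SELECT COUNT(*) AS bad_emails "
                       + "FROM customers WHERE email NOT LIKE '%@%'";
            try (ResultSet rs = stmt.executeQuery(hql)) {
                if (rs.next()) {
                    System.out.println("Rows failing the e-mail check: " + rs.getLong(1));
                }
            }
        }
    }
}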
Talend offers a variety of analyses to get a detailed understanding of the level of quality in the organization's data, and it uses the Hadoop clusters to their full advantage, allowing users to add multiple servers and increase performance. Key capabilities include:
1) Easy connectivity to the Hive database.
2) A set of system-defined indicators and user-defined indicators (statistics) that can be added to the Hive database analysis to get a graphical output for the analysis.
3) A set of rules.
4) A source file folder in which scripts can be stored.
5) The capability to set quality thresholds on the data in order to define a data quality parameter.

Comparison Between Talend and Pentaho
    
  Capabilities/Analysis              Talend    Pentaho

  Connection Analysis                Yes       Yes
  Data Structure Analysis            Yes       No
  Catalog Analysis                   Yes       No
  Schema Analysis                    Yes       No
  Column Analysis                    Yes       Yes
  Time Correlation Analysis          Yes       No
  Nominal Correlation Analysis       Yes       No
  Column Set Analysis                Yes       No
  String, Boolean, Number Analyzer   Yes       Yes
  Date Gap Analyzer                  No        Yes
  Date Time Analyzer                 No        Yes
  Value Distribution                 Yes       Yes
  Weekdays Analyzer                  No        Yes
  Reference Data Analyzer            No        Yes
Conclusion
If you are considering Talend Data Quality for data profiling, it will typically cost you nothing to try it out first, as it is open-source software. None of this is to say, of course, that your business should necessarily use open-source software for everything, but with all the many benefits it holds you would be remiss not to consider it seriously.
Although multiple data profiling and quality tools are available in the market, with its powerful analysis features, visualization options and wide range of connectivity options Talend Data Quality is certainly one tool to consider.
REFERENCES

http://www.talend.com/resource/data-profiling.html


nayan


Informatica Tutorial – The Definitive Guide


Informatica is the most important (and popular) tool in the Data Integration industry. It is actually a collection of several different client tools: you need to master the Mapping Designer, the Workflow Manager and the Workflow Monitor if you want to master Informatica.

INFORMATICA IS ONE OF THE BEST ETL TOOLS OUT THERE, AND WE HAVE THE PERFECT INFORMATICA TUTORIAL GUIDE BOOK WITH THE BEST RESOURCES AROUND THE WEB.

nayan


Abstraction in object-oriented programming


Abstraction comes from the Latin words abs, meaning 'away', and trahere, meaning 'to draw'. So we can define abstraction in an object-oriented programming language as the process of taking away the characteristics of something (an object) in order to reduce it to a set of essential characteristics.
Through abstraction in object-oriented programming, a programmer shows only the relevant data of an object and omits all unwanted details in order to reduce complexity and increase efficiency.
In the process of abstraction, the programmer tries to ensure that the entity is named in a manner that makes sense and that it includes all the relevant aspects and none of the extraneous ones.
If we try to describe the process of abstraction in a real-world scenario, it might work like this:

You (the object) are going to receive your father's friend from the railway station. The two of you have never met, so you take his phone number from your father and call him when the train arrives.
On the phone you tell him, "I am wearing a white T-shirt and blue jeans and standing near the exit gate." That is, you tell him the colour of your clothes and your location so he can identify and locate you. This is all the data that helps the procedure (finding you) work smoothly.

You should include all that information. On the other hand, there are many bits of information about you that aren't relevant to this situation, such as your age, your PAN card number or your driving licence number, which might be essential information in some other scenario (like opening a bank account). Since entities can have any number of abstractions, you may get to use those details in another procedure in the future.
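A minimal Java sketch of the same idea (the class and method names are invented purely for illustration): the Visitor type exposes only the details the station pickup needs, the clothing colour and the location, and leaves everything else out of the picture.

// Only the details relevant to being identified at the station are exposed.
public class Visitor {
    private final String shirtColour;
    private final String location;
    private final String panCardNumber;   // present, but deliberately not exposed here

    public Visitor(String shirtColour, String location, String panCardNumber) {
        this.shirtColour = shirtColour;
        this.location = location;
        this.panCardNumber = panCardNumber;
    }

    // The abstraction: just enough information to be found in the crowd.
    public String describeForPickup() {
        return "Wearing a " + shirtColour + " T-shirt, standing near the " + location;
    }
}

A caller only ever sees describeForPickup(); the PAN card number stays hidden until some other abstraction (say, a bank account form) needs it.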

Lyncean Patel


Encapsulation in object-oriented language


Encapsulation in an object-oriented language such as Java is the packing of data and functions into a single component while protecting its variables and functions from access outside the class, in order to better manage that piece of code and to ensure that changes to the protected code have little or no impact on other parts of the program.
Encapsulation can also be described as a protective barrier that prevents the code and data from being randomly accessed by other code defined outside the class. Access to the data and code is tightly controlled through an interface (the functions that are exposed to the outside world).
The main benefit of encapsulation is the ability to modify our implementation without breaking the code of others who use it. This gives our code maintainability, flexibility and extensibility.
Example:
public class UserPin {
    private int pin;

    public void setPin(int pin) {
        // Save the pin to the database
        this.pin = pin;
    }

    public int getPin() {
        // Fetch the pin from the database and return it
        return pin;
    }
}

Encapsulation makes sure that the user of the class is unaware of how the class stores its data. It also makes sure that users of the class do not need to change any of their code if there is a change in the class.
In the code example above we store the user's PIN as an integer. Say that, for security reasons, we now have to encrypt the PIN and store the encrypted value, and the encryption algorithm requires the PIN as a String.
public class UserPin {
    private int pin;

    public void setPin(int pin) {
        // Convert the pin from int to String
        // Encrypt the PIN
        // Save the encrypted pin to the database
    }

    public int getPin() {
        // Fetch the encrypted pin from the database
        // Decrypt and convert it back to int
        // Return the pin
        return pin;
    }
}
As we can see, there is no change in the signatures of the functions, so users of the class do not have to change their code.
We can also implement a security layer, since users access the field only through the functions (known as getters and setters).
public class UserPin {
    private int pin;

    public void setPin(int pin) {
        // Validate the value of the PIN
        // Convert the pin from int to String
        // Encrypt the PIN
        // Save the encrypted pin to the database
    }

    public int getPin() {
        // Fetch the encrypted pin from the database
        // Decrypt and convert it back to int
        // Return the pin
        return pin;
    }
}
The fields can be made read-only (if we do not define setter methods in the class) or write-only (if we do not define getter methods in the class).
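For example, a minimal sketch of a read-only field (a hypothetical class, shown only to make the point): the value is set once through the constructor and exposed through a getter, with no setter at all.

public class AccountNumber {
    private final String value;

    public AccountNumber(String value) {
        this.value = value;        // set once, never changed afterwards
    }

    public String getValue() {     // read-only: getter only, no setter
        return value;
    }
}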

The whole idea behind encapsulation is to hide the implementation details from users. That’s why encapsulation is known as data hiding.

The idea of encapsulation in object-oriented language is “don’t tell me how you do it; just do it.”

Lyncean Patel


Access Apex Rest API Salesforce from TalenD


Hello Readers,

This is a follow-up to our post on Talend Interview Questions; below are all the required steps to access Salesforce data from your own Talend instance using the APEX REST API.

Step 1

In Salesforce, go to Setup > Create > Apps, scroll to the bottom of the page to the Connected Apps section and click New. For background on the authentication flow, see the URL below.

https://www.salesforce.com/us/developer/docs/api_rest/Content/intro_understanding_authentication.htm

[Screenshot: Access Apex Rest API Salesforce from TalenD]

 

The name can be anything as long as you know what it is, and the callback URL does not really matter (use the same one as in the example). The important thing is selecting "Access and manage your data" in the scopes.

Step  2

After you create the app, the Consumer Key and Consumer Secret values are what you use in the call to the OAuth API. Please see the screenshot below.

[Screenshot: Access Apex Rest API Salesforce from TalenD]

 

Step 3

After setting up the Connected App in Salesforce, we need to make a call to the OAuth API to get a token, i.e. an access token. For making the call we need to have cURL installed; there may be other options, but I prefer cURL.

 Step 4

You can download cURL with SSL for your OS, along with the required certificate, from the link below: https://support.zendesk.com/hc/en-us/articles/203691436-Installing-and-using-cURL

Step 5

Create a cURL folder on your machine and move cURL.exe and its certificate into that folder. Add the folder to the "Path" environment variable so that cURL can be accessed from anywhere in the command prompt. Please see the screenshot below.

[Screenshot: Access Apex Rest API Salesforce from TalenD]

 

 

Step 6

Once cURL is set up, run the command below in the command prompt to get the access token mentioned in Step 3.

curl --data "grant_type=password&client_id=<insert consumer key here>&client_secret=<insert consumer secret here>&username=<insert your username here>&password=<insert your password and token here>" -H "X-PrettyPrint:1" https://test.salesforce.com/services/oauth2/token

The response will look something like this:

{
  "id" : "https://test.salesforce.com/id/00Dc0000003txdzEAA/005D0000001wi7EIAQ",
  "issued_at" : "1421777842655",
  "token_type" : "Bearer",
  "instance_url" : "https://<instance>.salesforce.com",
  "signature" : "AJjrVtbIpJkce+T4/1cm/KbUL7d4rqXyjBJBhewq7nI=",
  "access_token" : "00Dc0000003txdz!ARQAQHJEpvN8IcIYcX8.IfjYi0FJ6_JFICLcMk6gnkcHdzMF1DYd2.ZW9_544ro7CnCpO4zzPmkgQ7bE9oFd8yhBALGiIbx7"
}

Step 7

Use the “access_token” value in tRESTClient in “Bearer Token”. Please see the screenshot below.

[Screenshot: Access Apex Rest API Salesforce from TalenD]
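If you want to test the token outside Talend first, the plain-Java sketch below makes the same kind of call. The instance URL and the Apex REST endpoint path are placeholders that you would replace with your own; the only point being illustrated is that the token travels in the Authorization header as a Bearer token, the same value that goes into tRESTClient's "Bearer Token" field.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ApexRestCall {
    public static void main(String[] args) throws Exception {
        String accessToken = "<insert access_token here>";
        // Placeholder endpoint: replace with your instance URL and Apex REST path.
        URL url = new URL("https://yourInstance.salesforce.com/services/apexrest/yourService");

        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("GET");
        // The access token is sent in the Authorization header as a Bearer token.
        con.setRequestProperty("Authorization", "Bearer " + accessToken);

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}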

 

 Step 8

Use two tLogRow components, one for showing the successful result and the other for displaying any error thrown. Please see the screenshot below.

[Screenshot]

 

Step 9

Execute the job and you will see a result like the one below.

[Screenshot]

 

Thank you very much for reading the article!!!

Please feel free to post your comments.

 

Ankit Kansal


Informatica Powercenter Performance Tuning Tips


Here are a few points to get you started with Informatica PowerCenter performance tuning. Some of these tips are very general in nature, so please consult your project members before implementing them in your projects.

 

1) Optimize the input query if the source is relational (e.g. an Oracle table):

  i. Reduce the number of rows queried by using WHERE conditions instead of Filter/Router transformations later in the mapping. Since you start with fewer rows, your mapping will run faster.
  ii. Make sure appropriate indexes are defined on the necessary columns and analyzed. Indexes should especially be defined on the columns in the WHERE clause of your input query.
  iii. Eliminate columns that you do not need for your transformations.
  iv. Use hints as necessary.
  v. Use a sort order based on need.

Note: For item i, results will vary based on how the table columns are indexed/queried, etc.

 

2) Use the Filter transformation as close to the SQ Transformation as possible.

3) Use sorted input data for Aggregator or Joiner Transformation as necessary.

4) Eliminate un-used columns and redundant code in all the necessary transformations.

5) Use Local variables as necessary to improve the performance.

6) Reduce the amount of data caching in Aggregator.

7) Use parameterized input query and file for flexibility.

8) Change memory-related settings at the workflow/session level as necessary.

9) When using multiple condition columns in a Joiner/Lookup transformation, make sure to use a numeric data type column as the first condition.

10) Use persistent cache if possible in Lookup transformations.

11) Go through the session logs CLOSELY to find out any issues and change accordingly.

12) Use override queries in Lookup transformations to reduce the amount of data cached.

13) Make sure the data types and sizes are consistent throughout the mapping as much as possible.

14) For Target Loads use Bulk Load as and when possible.

15) For target loads, use SQL*Loader with the DIRECT and UNRECOVERABLE options for large volumes of data.

16) Use Partitioning options as and when possible. This is true for both Informatica and Oracle. For Oracle, a rule of thumb is to have around 10M rows per partition

17) Make sure that there are NO indexes for all "Pre/Post DELETE SQLs" used in all the mappings/workflows.

18) Use datatype conversions wherever possible.

e.g.: 1) Use ZIPCODE as an Integer instead of a Character; this improves the speed of lookup transformation comparisons.

 2) Use port-to-port datatype conversions to improve session performance.

19) Use operators instead of functions

e.g: For concatenation use “||” instead of CONCAT

20) Reduce the amount of data written to the logs for each transformation by setting the log level to Terse, but only as necessary.

21) Use reusable transformations and Mapplets wherever possible.

22) In Joiners, use the source with the smaller number of rows as the master.

23) Perform joins in the database rather than using a Joiner transformation wherever possible.

 


nayan


Informatica Best Practices for Cleaner Development


 

Don’t you just hate it when you can’t find that one mapping out of the thousand odd mappings present in your repository ??

A best practice is a method or technique that has consistently shown results superior to those achieved with other means, and that is used as a benchmark. In addition, a "best" practice can evolve to become better as improvements are discovered. Following these Informatica Best Practices guidelines allows better repository management, which makes your life easier. Incorporate these practices when you create Informatica objects and your work will be much simpler:

Mapping Designer

  • There should be a place holder transformation (expression) immediately after the source and one before the target.
  • Active transformations that reduce the number of records should be used as early as possible.
  • Connect only the ports that are required in targets to subsequent transformations.
  • If a join must be used in the mapping, select the driving/master table carefully.
  • For generic logic to be used across mappings, create a mapplet and reuse across mappings.

 

Transformation Developer

  • Replace complex filter expressions with (Y/N) flags; the Filter transformation will take less time to process the flags than the full logic.
  • Persistent caches should be used in lookups if the lookup data is not expected to change often.

Naming conventions – name Informatica transformations starting with the first three letters in lower case indicating the transformation type, e.g. lkp_<name of the lookup> for Lookup, rtr_<name of router> for Router transformation, etc.

 

Workflow Manager

  • Naming convention for session, worklet, workflow- s_<name of the session>, wlt_<name of the worklet>, wkf_<name of the workflow>.
  • Sessions should be created as reusable so that they can be used in multiple workflows.
  • While loading tables for full loads, the truncate target table option should be checked.
  • The workflow property "Commit interval" (default value: 10,000) should be increased for volumes of more than 1 million records.
  • Pre-Session command scripts should be used for disabling constraints, building temporary tables, moving files etc. Post-Sessions scripts should be used for rebuilding indexes and dropping temporary tables.

 

Performance Optimization Best Practices

We often come across situations where Data Transformation Manager(DTM) takes more time to read from Source or when writing in to a Target. Following standards/guidelines can improve the overall performance.

  • Use Source Qualifier if the Source tables reside in the same schema
  • Make use of Source Qualifier “Filter” properties if the Source type is Relational
  • Use flags as integer, as the integer comparison is faster than the string comparison
  • Use the table with the smaller number of records as the master table for joins
  • While reading from Flat files, define the appropriate data type instead of reading as String and converting
  • Have all ports that are required connected to subsequent transformations; otherwise check whether these ports can be removed

 

  • Suppress the ORDER BY clause by adding a comment ('--') at the end of the SQL override in Lookup transformations
  • Minimize the number of Update strategies
  • Group by simple columns in transformations like Aggregate, Source qualifier
  • Use Router transformation in place of multiple Filter transformations
  • Turn Off the Verbose logging while moving the mappings to UAT/Production environment
  • For large volume of data drop index before loading and recreate indexes after load
  • For large volumes of records, use bulk load and increase the commit interval to a higher value
  • Set ‘Commit on Target’ in the sessions

 

These are a few things a beginner should know when starting to code in Informatica. These Informatica Best Practices guidelines are a must for efficient repository management and overall project management and tracking.

nayan


Top Informatica Questions And Answers


Hey folks! As discussed in our earlier post, this is our follow-up post with Informatica interview questions. Please subscribe to get a free copy of the PDF with answers, and leave a comment.

Informatica Questions And Answers :-

1)   What is the difference between reusable transformation & shortcut created ?
2)   Which one is true for mapplets (can you use a Source Qualifier, can you use a Sequence Generator, can you use a target)?
3)   What are the ways to recover rows from a failed session ?
4)   Sequence Generator: when you move from development to production, how will you reset it?
5)   What is global repository ?
6)   How do you set a variable in incremental aggregation?
7)   What is the basic functionality of pre-load stored procedure ?
8)   What are the different properties for an Informatica Scheduler ?
9)   In a concurrent batch, if a session fails, can you start again from that session?
10)  When you move from development to production then how will you retain a variable value ?
11)  Performance tuning( what was your role) ?
12)  what are conformed dimensions?
13)  Can you avoid static cache in the lookup transformation? I mean can you disable caching in a lookup transformation?
14)  What is the meaning of complex transformation?
15)  In any project how many mappings they will use(minimum)?
16)  How do you implement an unconnected Stored Procedure in a mapping?
17)  Can you access a repository created in previous version of Informatica?
18)  What happens if the info. Server doesn’t find the session parameter in the parameter file?
19)  How did you handle performance issues If you have data coming in from multiple sources, just walk through the process of loading it into the target
20)  How will you convert rows into columns or columns into rows?
21)  What are the steps involved in the migration from older version to newer version of Informatica Server?
22)  What are the main features of Oracle 11g with context to data warehouse?
24)  How to run a session, which contains mapplet?
25)  Differentiate between Load Manager and DTM?
26)  What are session parameters ? How do you set them?
27)  What are variable ports and list two situations when they can be used?
28)  Describe Informatica Architecture in Detail ?
29)  How does the server recognise the source and target databases.
30)  What is the difference between sequential batch and concurrent batch and which is recommended and why?
31)  A session S_MAP1 is in Repository A. While running the session error message has displayed
‘server hot-ws270 is connect to Repository B ‘. What does it mean?
32)  How do you do error handling in Informatica?
33)  How can you run a session without using server manager?
34)  Consider two cases:
1. Power Center Server and Client on the same machine
2. Power Center Sever and Client on the different machines
what is the basic difference in these two setups and which is recommended?
35)  Informatica Server and Client are in different machines. You run a session from the server manager by specifying the source and target databases. It displays an error. You are confident that everything is correct. Then why it is displaying the error?
36)  What is the difference between normal and bulk loading? Which one is recommended?
37)  What is a test load?
38)  How can you use an Oracle sequences in Informatica? You have an Informatica sequence generator transformation also. Which one is better to use?
39)  What are Business Components in Informatica?
40)  What is the advantage of persistent cache? When it should be used.
41)  When will you use SQL override in a lookup transformation?



Ankit Kansal

