Fuzzy lookup in SSIS 2008 to keep data integrity

Thursday, August 5, 2010

People make mistakes; that much is obvious. Typos are easy to make during data entry, but as database professionals it is our duty to keep the data consistent, and the Fuzzy Lookup transformation helps in this case. Before we start building the package in SSIS, let us do some preparation. We are going to create one source table (the source could be anything, such as an Excel or CSV file, but we are using SQL Server) and one reference table, which is guaranteed to hold correct data. Here is the T-SQL to create the source and reference tables and insert some dummy data.

-- Source table: names as entered, including typos
create table fuzzyLookupSource
(
      firstName varchar(10),
      LastName varchar(10),
      BirthDate datetime
)
-- Date literals are written as mm/dd/yyyy (assumes an mdy date format setting)
insert into fuzzyLookupSource
select 'Rites','Shah','02/07/1980' union all
select 'Rajen','Shah','03/31/1983' union all
select 'Dharmesh','Kalaria','04/09/1980'  union all
select 'Jesica','Cruize','05/05/1980'  union all
select 'Roger','Moore','04/15/1980'
GO

-- Reference table: the trusted, correctly spelled data
create table fuzzyLookupReference
(
      firstName varchar(10),
      LastName varchar(10),
      BirthDate datetime
)
-- Note: there is no row for 'Roger Moore' in the reference data
insert into fuzzyLookupReference
select 'Ritesh','Shah','02/07/1980' union all
select 'Rajan','Shah','03/31/1983' union all
select 'Jessica','Cruise','06/05/1980'  union all
select 'Dharmesh','Kalaria','04/09/1980'
GO

Observe the data in both tables. The first (source) table contains some typos; by comparing it with the second (reference) table you can get the cleaned data.
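To see why an ordinary exact-match comparison is not enough here, a quick check like the one below can help. It is only a sketch against the two tables created above: it lists the source rows whose first and last names have no exact counterpart in the reference table.

```sql
-- Source rows with no exact name match in the reference table.
-- A plain equality join treats 'Rites' and 'Ritesh' as different values,
-- which is the gap the Fuzzy Lookup transformation is meant to close.
SELECT s.firstName, s.LastName, s.BirthDate
FROM fuzzyLookupSource AS s
LEFT JOIN fuzzyLookupReference AS r
       ON r.firstName = s.firstName
      AND r.LastName  = s.LastName
WHERE r.firstName IS NULL;
```

With the sample data above, this should return every source row except Dharmesh Kalaria, the only name spelled identically in both tables.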

Once both tables are ready, create a new project in BIDS (Business Intelligence Development Studio) and drag a “Data Flow Task” from the toolbox onto the “Control Flow” tab. Double-click the “Data Flow Task” to configure it; this takes you to the “Data Flow” tab.
Now add an “ADO NET Source” that points to our “fuzzyLookupSource” table in the SQL Server database. Double-click the “ADO NET Source” to configure it; the image below gives a clear idea of its configuration.


Now drag a “Fuzzy Lookup” transformation below your “ADO NET Source” and connect the green arrow from the “ADO NET Source” to the Fuzzy Lookup. Double-click the “Fuzzy Lookup” transformation to configure it.
On the “Reference Table” tab, select the connection to your database and the reference table, which is “fuzzyLookupReference” in our case. See the image below.


Click the “Columns” tab to choose which source columns are checked against the reference table. Connect “firstName” and “LastName” in the source to the same columns in the reference table so that the Fuzzy Lookup compares these two fields.


Once the columns are configured, click the “Advanced” tab, where you can set the “Similarity Threshold”. Similarity is a score between 0 and 1 that tells you how alike the source and reference values are: 1 is a perfect match and 0 means no match (or no such data in the reference table), so the closer the score is to 1, the better the match. The threshold is the minimum score a candidate must reach to qualify as a match. We are not going to branch on the score (for example, do one thing if it is greater than .50 and something else otherwise), so it is fine to leave “Similarity Threshold” unchanged.


Now drag a “SQL Server Destination” onto the design surface so that the matched and unmatched rows land in a SQL Server table, although we have not created that table yet. Connect the green arrow from the “Fuzzy Lookup” transformation to the “SQL Server Destination”. Before configuring the “SQL Server Destination”, there is one more thing to do: double-click the green arrow between the Fuzzy Lookup and the SQL Server Destination.
We would like to see the data in a grid while the package runs, before it lands in the destination table, and this is the place to set that up.
Double-clicking the green arrow opens the “Data Flow Path Editor”; click the “Data Viewers” tab and then the “Add” button to add a “Grid”.


Now double-click the “SQL Server Destination” to configure it. Choose the connection to your SQL Server and database in “Connection Manager”. Since we have not created a destination table for our data, click the “New” button beside the “Use a table or view” property, which will create a destination table in our SQL Server database.
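The “New” button proposes a CREATE TABLE statement built from the columns arriving from the Fuzzy Lookup. The statement below is only a hypothetical sketch of what that might look like. It assumes the three source columns are passed through and that the Fuzzy Lookup adds its `_Similarity` and `_Confidence` score columns; the statement the designer actually generates depends on which lookup columns you selected and may contain more columns than shown here.

```sql
-- Hypothetical sketch of the generated destination table.
-- The real statement comes from the designer and may differ:
-- it can include copies of the reference columns and per-column
-- similarity scores in addition to the columns listed here.
CREATE TABLE [SQL Server Destination] (
    [firstName]   varchar(10),  -- passed through from the source
    [LastName]    varchar(10),  -- passed through from the source
    [BirthDate]   datetime,     -- passed through from the source
    [_Similarity] real,         -- overall similarity score, 0 to 1
    [_Confidence] real          -- confidence in the chosen match, 0 to 1
);
```

You can edit the proposed statement, including the table name, before accepting it.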


Now you are ready to run the package. Press F5 to run it; when execution passes the Fuzzy Lookup, the data is shown in the grid. Check it, then click the green arrow button above the grid in the same dialog box so that the data continues into our SQL Server table.


You can check the same data later in SQL Server by running a T-SQL query. If you did not rename the table when generating it, it is named [SQL Server Destination] by default, so you can execute something like

SELECT * FROM [SQL Server Destination]
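If you later want to act on the match score, a query along these lines may be a useful starting point. It is a sketch only: it assumes the default table name and that the `_Similarity` column produced by the Fuzzy Lookup was written to the destination table. The 0.50 cut-off and the `MatchQuality` labels are arbitrary illustrations, not recommended values.

```sql
-- Label each row by its overall similarity score.
-- Assumes [_Similarity] was written to the destination table;
-- the 0.50 cut-off and the labels are illustrative only.
SELECT d.*,
       CASE
           WHEN d.[_Similarity] >= 1    THEN 'similarity = 1'
           WHEN d.[_Similarity] >= 0.50 THEN 'similarity >= 0.50, review'
           ELSE 'similarity < 0.50 or no match'
       END AS MatchQuality
FROM [SQL Server Destination] AS d
ORDER BY d.[_Similarity] DESC;
```

Rows that fall into the last bucket are the ones most likely to need manual review.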

Reference: Ritesh Shah
http://www.sqlhub.com
Note: Microsoft Books Online is the default reference for all articles, but the examples and explanations are prepared by Ritesh Shah, founder of
http://www.SQLHub.com


