Network World
Thursday, August 21, 2008
DNSstuff.com
Get information about your IP
IP Information
50+ On-demand DNS and network tools

Brian Egler's SQL Server Strategies

Microsoft Subnet

Navigation

Too many catalogs in the mail? Fuzzy Lookups may help!

Have you ever received unwanted catalogs? Or multiple catalogs from the same company? Well, there's a cool new website that will help you eliminate unwanted catalogs to simplify your life and save natural resources. Sound good? Here it is http://www.catalogchoice.org/. But why do we get multiple catalogs from the same company? Even wanted ones...The answer is: those companies are not using "Fuzzy" technology. SQL Server Integration Services (SSIS) has a couple new transformations called the Fuzzy Lookup and the Fuzzy Grouping that may help.

SSIS is the new ETL (Extract, Transform and Load) tool from Microsoft that supersedes the previous DTS tool delivered with SQL Server 7.0 and SQL Server 2000. As discussed in a previous blog entry, SSIS has a new level of sophistication that means it can compete with expensive third-party ETL tools. It includes a new Lookup transformation that allows you to do an exact match on an existing Customer list, for instance. But what if we don't have an exact match? For instance, the company may have gathered potential customer information from a variety of sources such as market surveys, referrals or partner web sites. We may have completed one form with a middle initial and another without. Or maybe we've used the word "Street" in our address on one form then "St." on another. Or maybe there is just a typo during data entry. The result may be multiple entries for the same person and multiple catalogs in the mail.

The Fuzzy Lookup transformation will match one or more columns between an input data source and a reference table then will give a "Similarity" score based on how close the data matches. This internal algorithm will give a score between 0 and 1 with 1 being an exact match. Then it's just a matter of deciding on the threshold to accept using a Conditional Split transform. For instance, you could define .70 as the similarity threshold. Rows that have a similarity score of .70 or greater will be deemed as duplicates and therefore identify an existing customer in our example. Less than .70 will be judged to be a new customer. For instance, example address column values of "111 Broadway" and "111 Broadway Ave" give a similarity score of 0.77.

But what if we want to remove duplicates within a single input data source? That's where the Fuzzy Grouping transformation takes over. Using the same algorithm, multiple columns can be compared and again a conditional split will remove duplicates based on a threshold value. In our example, that will remove duplicate entries for new customers.

The end result: no more duplicate catalogs in the mail. But then it's up to you to stop the catalog coming in the first place...

Cheers

Brian

Recent posts:

Teaching an old dog new tricks with Report Designer 2008

Don’t PASS up these Conferences!

Need a Change? Try Katmai!


About Brian Egler

Brian D. Egler, MCITP-DBA/MCSE/MCT, is currently an instructor with Global Knowledge, teaching various Microsoft training courses such as MCSE, MCITP-DBA and other SQL Server courses. He is a SQL specialist and an expert on Exchange, Windows, .Net and XML. Egler has been a technical instructor for 16 years and has more than 10 years experience with SQL Server, data modeling, database design, application development including IMS, DB2, Sybase. In addition, he is member of the Project Management Institute.

RSS feed

Egler's archive.

Global Knowledge sponsors a monthly giveaway on Microsoft Subnet. Check out the Microsoft Subnet home page for details.

Global Knowledge offers a comprehensive catalog of Microsoft courses, including:

MCITP: Database Administrator Boot Camp
SQL Server 2005 Administration
MCITP: Enterprise Administrator Boot Camp
More Microsoft Courses

Microsoft Subnet

RSS feed Microsoft news RSS feed

Advertisement: