Have you ever received unwanted catalogs? Or multiple catalogs from the same company? Well, there's a cool new website that will help you eliminate unwanted catalogs to simplify your life and save natural resources. Sound good? Here it is http://www.catalogchoice.org/. But why do we get multiple catalogs from the same company? Even wanted ones...The answer is: those companies are not using "Fuzzy" technology. SQL Server Integration Services (SSIS) has a couple new transformations called the Fuzzy Lookup and the Fuzzy Grouping that may help.
SSIS is the new ETL (Extract, Transform and Load) tool from Microsoft that supersedes the previous DTS tool delivered with SQL Server 7.0 and SQL Server 2000. As discussed in a previous blog entry, SSIS has a new level of sophistication that means it can compete with expensive third-party ETL tools. It includes a new Lookup transformation that allows you to do an exact match on an existing Customer list, for instance. But what if we don't have an exact match? For instance, the company may have gathered potential customer information from a variety of sources such as market surveys, referrals or partner web sites. We may have completed one form with a middle initial and another without. Or maybe we've used the word "Street" in our address on one form then "St." on another. Or maybe there is just a typo during data entry. The result may be multiple entries for the same person and multiple catalogs in the mail.
The Fuzzy Lookup transformation will match one or more columns between an input data source and a reference table then will give a "Similarity" score based on how close the data matches. This internal algorithm will give a score between 0 and 1 with 1 being an exact match. Then it's just a matter of deciding on the threshold to accept using a Conditional Split transform. For instance, you could define .70 as the similarity threshold. Rows that have a similarity score of .70 or greater will be deemed as duplicates and therefore identify an existing customer in our example. Less than .70 will be judged to be a new customer. For instance, example address column values of "111 Broadway" and "111 Broadway Ave" give a similarity score of 0.77.
But what if we want to remove duplicates within a single input data source? That's where the Fuzzy Grouping transformation takes over. Using the same algorithm, multiple columns can be compared and again a conditional split will remove duplicates based on a threshold value. In our example, that will remove duplicate entries for new customers.
The end result: no more duplicate catalogs in the mail. But then it's up to you to stop the catalog coming in the first place...
Cheers
Brian
Recent posts:
Teaching an old dog new tricks with Report Designer 2008
Brian D. Egler, MCITP-DBA/MCSE/MCT, is currently an instructor with Global Knowledge, teaching various Microsoft training courses such as MCSE, MCITP-DBA and other SQL Server courses. He is a SQL specialist and an expert on Exchange, Windows, .Net and XML. Egler has been a technical instructor for 16 years and has more than 10 years experience with SQL Server, data modeling, database design, application development including IMS, DB2, Sybase. In addition, he is member of the Project Management Institute.
Global Knowledge sponsors a monthly giveaway on Microsoft Subnet. Check out the Microsoft Subnet home page for details.
Global Knowledge offers a comprehensive catalog of Microsoft courses, including:
MCITP: Database Administrator Boot Camp
SQL Server 2005 Administration
MCITP: Enterprise Administrator Boot Camp
More Microsoft Courses
|
|