Any quality data cleansing process includes deduping. Data deduplication is the procedure of identifying duplicate records and removing them.
Why should you de-dupe?
– Duplicated records distort the data by adding repeated information, which leads to wrong reporting and incorrect insights.
– Duplicates skew marketing campaigns by lowering the response rate and hurting return on investment.
– Duplicates reduce workforce productivity because staff end up contacting the same customers multiple times.
– Duplicate accounts erode customer loyalty and, in turn, decrease profitability.
What defines a duplicate?
At first sight it may sound trivial. You see the same names, the same phone numbers, emails, etc., and your action seems clear: delete one and keep the other. Yet marketers, analysts and data scientists know that it is a complex and time-consuming task, because there can be legitimate reasons for apparent duplicates to exist in your database. For example, you may be billing two different departments of the same enterprise customer; the same address may belong to multiple people who all order separately; or a corporate phone number may be shared by multiple locations while you sell per location.
To define duplicates, you therefore need a plan for what counts as a duplicate, agreed by consulting multiple data stakeholders. After that, you need to match records. Data matching reveals exactly matching records, which you can delete while keeping one as the surviving record. The more complex form of data matching is finding records that look similar and have a high probability of describing the same entity.
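The two kinds of matching described above can be sketched roughly as follows. This is a minimal illustration, not a production matcher: the record layout, the field names, the `0.85` threshold and all function names are assumptions made for the example, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever similarity measure your team chooses.

```python
from difflib import SequenceMatcher

# Hypothetical sample records; field names are illustrative only.
records = [
    {"id": 1, "name": "Jane Doe", "email": "jane.doe@example.com"},
    {"id": 2, "name": "Jane Doe", "email": "jane.doe@example.com"},
    {"id": 3, "name": "Jayne Doe", "email": "jane.doe@example.com"},
    {"id": 4, "name": "John Smith", "email": "john.smith@example.com"},
]


def exact_key(record):
    """Fields that must all be identical for an exact match."""
    return (record["name"], record["email"])


def find_exact_duplicates(rows):
    """Return ids of records whose key was already seen (the first one survives)."""
    seen = set()
    duplicate_ids = []
    for row in rows:
        key = exact_key(row)
        if key in seen:
            duplicate_ids.append(row["id"])
        else:
            seen.add(key)
    return duplicate_ids


def similarity(a, b):
    """Ratio in [0, 1]; 1.0 means the strings are identical ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_probable_duplicates(rows, threshold=0.85):
    """Return id pairs that share an email and have similar, non-identical names."""
    pairs = []
    for i, left in enumerate(rows):
        for right in rows[i + 1:]:
            if left["email"] != right["email"]:
                continue
            score = similarity(left["name"], right["name"])
            if threshold <= score < 1.0:
                pairs.append((left["id"], right["id"]))
    return pairs


print(find_exact_duplicates(records))
print(find_probable_duplicates(records))
```

Probable matches are returned as pairs for review rather than deleted automatically, since the threshold is a judgment call that the data stakeholders should agree on. The pairwise loop is also quadratic in the number of records, so larger datasets would need some form of grouping before comparison.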
What fields help identify a duplicate?
For contact data, you usually start with the email address, which is one of the best identifiers. A mobile phone number is another useful identifier: today people can keep their cellphone number even when they move or change phone providers.
For account data, it is helpful to look at domain names. The mailing address is another way to match records. The more fields you apply during the deduping procedure, the cleaner your data will become. Matching on these fields lets you select the master, or surviving, record that you keep.
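Identifiers such as emails, phone numbers and domain names only match reliably once they are written the same way. The sketch below shows one possible normalization step; the function names are hypothetical, and it assumes that email addresses can be compared case-insensitively and that phone numbers are stored with or without a country code consistently.

```python
import re


def normalize_email(email):
    """Trim and lowercase so 'Jane@Example.com ' equals 'jane@example.com'."""
    return email.strip().lower()


def normalize_phone(phone):
    """Keep digits only so formatting differences do not hide a match."""
    return re.sub(r"\D", "", phone)


def normalize_domain(website):
    """Reduce a URL to a bare host name (scheme, 'www.' and path removed)."""
    host = website.strip().lower()
    host = re.sub(r"^[a-z]+://", "", host)
    host = host.split("/")[0]
    if host.startswith("www."):
        host = host[4:]
    return host


print(normalize_email(" Jane.Doe@Example.COM "))
print(normalize_phone("(555) 010-0199"))
print(normalize_domain("https://www.Example.com/about"))
```

Comparing the normalized values rather than the raw ones means that formatting differences alone will not hide a duplicate. Note that this simple phone rule treats "+1 555 010 0199" and "555 010 0199" as different numbers, so country codes would need their own handling.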
What is a master/surviving record?
To dedupe, you must keep one record and merge related data from the other records into it. This record will be your master record. To choose the master record, you can follow certain rules.
Start by identifying the data that comes from the most reliable source. Rely only on reputable and valid sources of data. Use a third party to validate, for example, domain names, mobile phone providers, zip codes, etc.
Make sure you select the most recent data. Pick the latest update available in your data. Keep in mind that people change homes and jobs fairly often.
Prefer the record that is more detailed over a similar one that lacks descriptions. Be careful with abbreviations; they can easily create noise.
Disregard vague attributes and null data. For example, select the zip codes that are complete.
When selecting the master record, always set up rules that reflect the nature of your business. For example, you may want to flag customer records as masters and merge prospect records into them.
Once the master record is identified, the other duplicate records can be disregarded or merged into it. Always write down your deduping logic step by step and discuss it with your team to refine it.
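The selection rules above can be written down as an explicit ordering, which also makes them easier to discuss with your team. The following is a sketch under stated assumptions: the `SOURCE_RANK` table, the field names and the sample group are invented for illustration, and the order of the tie-breakers (source, then recency, then completeness) is one possible choice that your business rules may well change.

```python
from datetime import date

# Hypothetical source ranking: a lower number means a more trusted source.
SOURCE_RANK = {"crm": 0, "web_form": 1, "purchased_list": 2}

# Bookkeeping fields that do not count towards completeness.
META_FIELDS = ("id", "source", "updated")


def completeness(record):
    """Count data fields that hold a real value (not None or empty)."""
    return sum(
        1
        for key, value in record.items()
        if key not in META_FIELDS and value not in (None, "")
    )


def pick_master(duplicates):
    """Prefer the most trusted source, then the latest update, then the fullest record."""
    return min(
        duplicates,
        key=lambda r: (
            SOURCE_RANK.get(r["source"], len(SOURCE_RANK)),
            -r["updated"].toordinal(),
            -completeness(r),
        ),
    )


def merge_into_master(master, duplicates):
    """Fill the master's empty fields from the other records; never overwrite."""
    merged = dict(master)
    for record in duplicates:
        if record is master:
            continue
        for key, value in record.items():
            if merged.get(key) in (None, "") and value not in (None, ""):
                merged[key] = value
    return merged


# Hypothetical group of records already matched as duplicates of each other.
group = [
    {"id": 1, "source": "web_form", "updated": date(2023, 5, 1),
     "email": "jane.doe@example.com", "phone": "", "zip": "02139"},
    {"id": 2, "source": "crm", "updated": date(2022, 11, 15),
     "email": "jane.doe@example.com", "phone": "5550100199", "zip": ""},
    {"id": 3, "source": "crm", "updated": date(2023, 2, 10),
     "email": "jane.doe@example.com", "phone": "", "zip": None},
]

master = pick_master(group)
merged = merge_into_master(master, group)
print(master["id"], merged)
```

In this sample group, record 3 becomes the master because it comes from the highest-ranked source and is the more recent of the two `crm` records; its empty phone and zip fields are then filled from the other two records. Gaps are filled in list order here, so a stricter version might sort the remaining records by the same rules first.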
Why do you need to test the deduping logic?
Before running a deduping algorithm based on your rules, don’t forget to create a backup file. Run tests and verify the results by randomly checking the data. Following this pattern will let you find errors and fix them along the way. Keeping the backup file will also prevent data loss.
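A test pass along these lines might look like the sketch below. It assumes the data lives in a CSV file with an `email` column; the file layout, the function names and the stand-in `dedupe_by_email` rule are all hypothetical, and your own rules would replace that function. Nothing is written back to the original file: the pass only reports what would be removed.

```python
import csv
import random
import shutil
from pathlib import Path


def backup_file(path):
    """Copy the data file next to itself with a '.bak' suffix before any change."""
    source = Path(path)
    target = source.with_name(source.name + ".bak")
    shutil.copy2(source, target)
    return target


def dedupe_by_email(rows):
    """Stand-in rule: keep the first row seen for each normalized email."""
    seen = set()
    survivors, removed = [], []
    for row in rows:
        key = row["email"].strip().lower()
        if key in seen:
            removed.append(row)
        else:
            seen.add(key)
            survivors.append(row)
    return survivors, removed


def spot_check_sample(removed_rows, sample_size=20, seed=None):
    """Pick removed rows at random so a person can review them by hand."""
    rng = random.Random(seed)
    size = min(sample_size, len(removed_rows))
    return rng.sample(removed_rows, size)


def run_test_pass(path):
    """Back up the file, dedupe in memory, and return what would be removed."""
    backup = backup_file(path)
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    survivors, removed = dedupe_by_email(rows)
    return backup, survivors, removed
```

Reviewing the output of `spot_check_sample` by hand is the random check described above: if a sampled row turns out not to be a real duplicate, the rule needs refining before it is run for real.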
Duplicates never stop. Constant data changes, such as data updates and maintenance, new data additions and, often, simple manual data entry, keep producing new duplicates. Keeping your data clean is crucial for your organization’s success. Once the process is set up, run your deduplication on a regular basis or turn to professional data cleansing teams for help. It will save you time and help keep your data clean, functional and effective.