Clean data does not happen overnight; it requires proper processes and an ongoing data cleansing initiative. A substantial part of any data cleansing procedure is data deduping: the practice of identifying sets of duplicate or repeated records, selecting the master (surviving) record, and merging or purging the non-surviving records. Before de-duplicating, however, it is important to perform some simple logical cleanup and match up records.

Let’s assume that you have identified and grouped similar data sets, and it is obvious that some records contain data entry typos or inconsistent abbreviations. For instance, there are accounts where part of the address appears as “Apple Crescent”, “Apple Cres.” and “Apple Cr”. All these records have the same account ID, so the three address variants should be consolidated into one, i.e. “Apple Crescent”. Another example is a set of otherwise identical email addresses for “Adam1” ending in “gmail.co”, “gmail.com” and “gmail.cpm”, which should be matched up. Phone number area codes are often wrong but can usually be corrected using the full address or by looking them up in a reliable data source. You can also easily match up records in which different abbreviations refer to the same company or organization. Performing these match-ups significantly simplifies the whole deduping process.
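The match-ups described above can be sketched in a few lines of Python. This is only an illustration: the lookup tables `STREET_SUFFIXES` and `EMAIL_DOMAIN_FIXES` and the function names are hypothetical, and the sketch assumes it is acceptable to lower-case the whole email address. A real project would maintain much larger, reviewed lookup tables.

```python
# Hypothetical lookup tables built from the examples in this article.
STREET_SUFFIXES = {"cres": "Crescent", "cr": "Crescent"}
EMAIL_DOMAIN_FIXES = {"gmail.co": "gmail.com", "gmail.cpm": "gmail.com"}


def normalize_street(street: str) -> str:
    """Expand a known abbreviation in the last word of a street name."""
    words = street.strip().split()
    if not words:
        return ""
    # Compare without the trailing period and without regard to case.
    key = words[-1].rstrip(".").lower()
    words[-1] = STREET_SUFFIXES.get(key, words[-1])
    return " ".join(words)


def normalize_email(email: str) -> str:
    """Lower-case an email address and repair a known domain typo."""
    cleaned = email.strip().lower()
    local, sep, domain = cleaned.rpartition("@")
    if not sep:
        # No "@" found: leave the value for manual review.
        return cleaned
    return f"{local}@{EMAIL_DOMAIN_FIXES.get(domain, domain)}"
```

With these helpers, “Apple Cres.” and “Apple Cr” both become “Apple Crescent”, so the three address variants can be compared as equal before deduping starts.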

Let’s say the simple match-up is done, the duplicates are identified and the master records are selected. The next question is what to do with the non-surviving records. There are two options: merge or purge. The objective of a merge/purge action is to end up with one distinct, complete record and to eliminate inessential data.

For example, there are five records containing customer data. All five have the same first and last name, but only one record contains the full address while the others are missing it, or only records three and four include a full email address or phone number. This scenario calls for logic that fills in the missing information. The complete data should be restored or replaced in all non-surviving records.

So what is data merging?

Data merging is the selection of partial data from two or more records, combined in a way that keeps data loss to a minimum.

When merging, it is critical to pull and insert only the most recent, updated data. This can be done by applying a most-recent-data algorithm, in which the value from the most recently updated record wins for each field. The other important part of merging is appending data that might be essential for historical trend analysis. The goal is to keep as much data as possible while removing the bad data.
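One way to picture a most-recent-data merge is the minimal sketch below. It assumes each record is a dictionary with a hypothetical `updated` field holding a date, and that empty strings and `None` count as missing. For each field, the newest non-empty value wins, so older records only fill gaps.

```python
from datetime import date


def merge_records(records, updated_field="updated"):
    """Merge duplicate records, preferring the newest non-empty value per field."""
    if not records:
        return {}
    # Newest first, so the first non-empty value seen for a field wins.
    ordered = sorted(records, key=lambda r: r[updated_field], reverse=True)
    merged = {updated_field: ordered[0][updated_field]}
    for record in ordered:
        for field, value in record.items():
            if field == updated_field:
                continue
            if value not in (None, "") and field not in merged:
                merged[field] = value
    return merged


duplicates = [
    {"name": "Adam Smith", "email": "", "address": "1 Apple Crescent",
     "updated": date(2020, 1, 5)},
    {"name": "Adam Smith", "email": "adam1@gmail.com", "address": None,
     "updated": date(2021, 3, 1)},
    {"name": "Adam Smith", "email": "old@example.com", "address": "9 Old Road",
     "updated": date(2019, 6, 1)},
]
master = merge_records(duplicates)
```

Here the email comes from the 2021 record and the address from the 2020 record, because the 2021 record has no address. The sample names and addresses are made up for illustration.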

Very often customer data is populated from various sources, and the formats of these files often differ. In such cases the merging algorithm should include data reformatting, i.e. applying the same format to the same kind of data.
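As a small illustration of reformatting, the sketch below brings phone numbers from different sources into one format. It assumes 10-digit North American numbers and a target format of `(NNN) NNN-NNNN`; both the function name and the target format are hypothetical choices, not a standard.

```python
import re
from typing import Optional


def normalize_phone(raw: str) -> Optional[str]:
    """Return the number as '(NNN) NNN-NNNN', or None if it is not 10 digits."""
    # Keep digits only, dropping spaces, dots, dashes, brackets and "+".
    digits = re.sub(r"\D", "", raw)
    # Drop a leading country code of 1, if present.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return None  # leave for manual review rather than guessing
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
```

Numbers that cannot be interpreted return `None` instead of a guess, so that a bad value never silently overwrites a good one during the merge.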

What is data purging?

Purging is the action of removing non-surviving records; in other words, deleting duplicate records. In cases of simple redundancy, where a record just repeats itself, there is no value in keeping the copies, and purging them cleans up the data. Purging is widely used for deleting non-essential, non-surviving records, bounced records, and historical accounts that are kept in historical files and need to be deleted to free up space.
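Purging simple redundancy can be sketched as follows. The function name and the choice of `key_fields` are assumptions made for illustration: two records are treated as repeats when all key fields match after trimming whitespace and ignoring case, and the first occurrence is kept.

```python
def purge_exact_duplicates(records, key_fields):
    """Keep the first record for each key; drop later records that repeat it."""
    seen = set()
    kept = []
    for record in records:
        # Missing or None values are treated as empty strings.
        key = tuple(
            str(record.get(field) or "").strip().lower() for field in key_fields
        )
        if key in seen:
            continue  # a repeat of a record we already kept: purge it
        seen.add(key)
        kept.append(record)
    return kept


rows = [
    {"name": "Adam Smith", "email": "adam1@gmail.com"},
    {"name": "adam smith ", "email": "Adam1@gmail.com"},
    {"name": "Beth Jones", "email": "beth@example.com"},
]
unique_rows = purge_exact_duplicates(rows, ["name", "email"])
```

In this made-up sample the second row is purged because it differs from the first only in case and trailing whitespace.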

When to use data purging vs. data merging?

Both merging and purging are important data cleansing tools that help keep data up to date, complete and efficient. Merging is the preferred method for ensuring that no data loss occurs; however, it comes with a risk of overwriting good data with bad. Purging is preferred for data that is identical or completely useless. For example, if data completeness is close to 0% or the source of the data cannot be trusted, then purging is preferred.
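The rule of thumb above could be expressed as a simple decision function. This is a sketch under stated assumptions: the function names, the `trusted_source` flag and the `min_completeness` threshold of 0.2 are hypothetical, and the right cut-off for "close to 0%" depends on the data set.

```python
def completeness(record, fields):
    """Share of the given fields that hold a non-empty value (0.0 to 1.0)."""
    if not fields:
        return 0.0
    filled = sum(1 for field in fields if record.get(field) not in (None, ""))
    return filled / len(fields)


def choose_action(record, master, fields, trusted_source=True,
                  min_completeness=0.2):
    """Return 'purge' or 'merge' for a non-surviving record."""
    if not trusted_source:
        return "purge"  # untrusted data could overwrite good values
    if all(record.get(field) == master.get(field) for field in fields):
        return "purge"  # identical to the master: nothing to gain
    if completeness(record, fields) < min_completeness:
        return "purge"  # nearly empty record
    return "merge"
```

For example, a duplicate that carries an email address missing from the master record would be routed to "merge", while an exact copy of the master or a nearly empty record would be routed to "purge".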

Data cleansing companies typically utilize both techniques to ensure that the proper process is used for each scenario. For de-duping or data cleansing services, please contact StrategicDB at www.strategicbd.com to get started!
