Deduplication is a method used to eliminate redundant copies of data from a database with the objective of reducing storage and cost whilst improving customer experience and loyalty. Something like 2.5 quintillion bytes of data are produced each day and that is only accelerating with the growth and buzz around Big Data and the Internet of Things (IoT).
There are many technologies on the market that can assist with deduplication, such as front-end systems that carry out real-time verification and storage efficiencies. A great example that is often quoted is where companies might send a group email to all staff with a 5MB attachment. When backing up, deduplication software will ensure only one copy is saved to optimize storage and reduce cost.
However, companies who are just starting out don’t always have the budget to invest in deduplication technology and instead, rely on the software and tools that they already have. The best example of this is Microsoft Excel. Whilst Excel is better known as a spreadsheet or financial application, it can be very effective as a deduping technology. This article explains how Excel can be used to dedupe your data.
Using Excel to remove duplicate data
Excel has a feature which can remove duplicate rows of data. Marketing teams will often use this for ensuring email addresses are unique before sending a campaign as an example, the reason being that many marketing platforms don’t have the built-in functionality (unless it is paid for).
Before starting to dedupe it is important to retain backups of the original dataset in case of any problems requiring the need to roll back changes. Some changes in Excel are not reversible when it includes deleting rows.
Imagine you have a spreadsheet database in MS Excel of customer details containing Name and Email Address fields. For an email campaign, you would like to dedupe Excel to ensure customers are not mailed more than once. When doing duplicate removal, only the cells in the table that you select will be affected so if you have information on other tabs outside of those, it will not be impacted.
There are 3 quick steps in Excel to remove duplicate rows:
- Select (highlight) the cells that you want to remove duplicates from. For example, if your Names and Emails are in cells A1 to B500, highlight those cells.
- On the Data tab in Excel, click the Remove Duplicates icon, highlighted below for reference.
- A pop up will appear asking which columns contain the data where duplicates are found. In this case, that will be the email address field, so select the check box next to that option. You can select as many fields as you like here.
Excel will tell you how many duplicate values will be removed. Click OK and you are done.
Using Excel to highlight duplicate data
You may not always want to delete duplicate data using Excel. Imagine your email addresses are in a back office system which does not have the technology to find duplicates, you may simply use Excel to highlight them before going back to the system to formally delete the records.
This can be achieved using Conditional Formatting in Excel. In simple terms, this sets rules against a range of data, telling it to do some action if it meets said rule. The screenshot below shows where you can find this option
Once you choose to highlight duplicates, Excel will automatically find them in your selected range and apply the rule you have chosen – in this case, Light Red Fill with Dark Red Text
Now you have identified all the duplicates, you can take whatever action is required rather than permanently deleting them. This negates the risk of losing records and is often preferred when setting up email campaigns initially to check the data.
Potential risks of Excel deduping
Whilst deduping in Excel does offer an option to smaller businesses who don’t have the option to invest in the technology it does come with a few risks that should be noted.
Firstly, Excel won’t automatically back up your data. If you do make any mistakes when deduping having forgotten to backup, there is no audit trail and you run the risk of losing customer records. It is important to backup everything before applying any dedupe back into another system.
Also, Excel cannot offer a permanent deduping solution. It is good for one-off campaigns but cannot solve other issue with duplicate records e.g. storage and cost where it doesn’t integrate with other core platforms you might have. For example, you will need to run this process in Excel before every campaign and it could become time consuming if you don’t start implementing a more proactive or real-time solution.