Data has become vast over the last few years and companies are continually looking to derive value from it wherever possible. Every business has several sources of data, from multiple channels and virtually always in different formats. It is not uncommon to have batches of transactional or financial data, website data such as Google Analytics, perhaps unstructured data like social media comments, images, documents, speech and many others.
Having all this data available is fantastic but it is often underestimated as to how much work needs to go into keeping that data clean for it to be ready to create an impact. You could have the best Data Scientists in the world building some amazing algorithms which could generate millions of dollars in income but if the data feeding it doesn’t make any sense, their true value is ultimately zero. And worst your data scientists now have to spend their time data cleansing and data preparing instead of coding.
For any adventurous companies looking into the world of Big Data, AI and machine learning, having a data cleaning strategy in place is imperative to success. New processes and technology can do amazing thinks but if they are being fed the wrong information, implementations will end in failure.
And what about the small businesses, who have a data set of 5,000 companies they are trying to sell into or an ecommerce shop with perhaps a 100 new sign-ups a month?
Whilst data cleaning strategies will be dependent on your line of business and the data being ingested, there are some generic guidelines that be followed to ensure you are making the most out of what you have.
- Does it make sense?
Without advocating “gut feel” to analyze data, most people are able to look at a subsection of data and see whether it makes sense at face value. This could be taking a random 10% sample of your data (if you have billions of records, something like 1% could suffice) and going through it manually whilst answering some standard questions like:
- Does the data match what the fields headings say?
- Do columns that are supposed to contain data of a certain format, contain that format e.g. are date fields formatted as dates
- Are all contact fields like phone, email and mailing address complete?
- Are there any duplicate rows?
- Is there any incomplete information or anything that appears inconsistent within the rows of data?
- Is the data usable or do I have to pull a different variations of the same thing cause its spelled or inputted a bit differently?
If the answer to any of these questions is no, there is clearly a case for data cleaning. Sometimes the results are inconclusive and you need to take another sample to be certain but sense checking is a logical first step in the data cleaning strategy.
- Can you correct the incorrect information?
If you are happy with the sense checking, it is time to start deploying the data cleaning exercise. This will involve finding out the best way to correct it. For example, if you have inconsistent data, is that information already in the business for you to be able to complete it? There might be a need to use third party data appending/ enhancement or verification tools if you cannot complete the data cleaning with business data alone.
There might be some items that are typos e.g. J0hn instead of John where you can clean them quickly but if customers have incorrect email addresses it will be difficult to work out the correct version without speaking with the customer first.
As a rule of thumb, data cleaning can be split into two sections; items you can fix logically such as typos with existing data or small amendments that you have 100% confidence are correct. Missing data will need more research and analysis before data cleaning can take place which could involve investment in tools, time or data cleaning services.
- Data cleaning tools
For data cleaning tasks, MS Excel can be enough where there are fewer than 100,000 records and there are logical patterns in the data. Excel can only work with structured information but has enough power with it functions to fix problems such as typos or even lookup addresses from zip codes if required (VLOOKUP).
More complex tasks might need data programming tools such as Python or R which are both open source platforms. These tools can take on vast amount of data for cleaning and process it incredibly quickly. Like Excel, they do generally need logical patterns but can do more complex algorithms with multiple libraries on offer to do so.
However, your best bet is to hire a data cleansing company who have access to the best tools on the market and who do this professionally. Often the cost of hiring a data cleansing firm is less than the time it would take data scientists or company employees doing it manually.
- Third party data
The use of third-party data sometimes known as data enrichment, is a great way of data cleaning. For example, imagine if you had zip codes for every customer but failed to collect the full addresses as they were not needed when starting up. However, you now need to send direct mail which requires an address list.
A third-party like Google can lookup each zip code via an API and return the addresses for each to do the mailing. The data can then be appended to your existing systems if there is adequate reason to do so in line with compliance.
These are some of the key aspects to achieving clean data but the list is heavily dependent on your business type. With the growth of Big Data an unstructured information, the nature of clean data is far harder to understand than ever before. For example, how would you go about a data cleaning exercise for all the information gathered by an Amazon Alexa device? Something like that could be a thankless task but hugely valuable for the future of the AI and IoT technology.
Businesses need to have a clearly defined data cleaning strategy to ensure they remain competitive and provide the best possible customer service. If you are looking for data cleansing companies for your custom data cleansing project, StrategicDB can help!