Machine learning is becoming a part of our daily life. It is a core element in many mobile applications, health diagnostics, image and voice recognition, natural language processing, translation, sports analytics, business decisions, marketing insights – just to name a few. The one element all these industries have in common is the use of data. In most cases the datasets are too large for humans to process, which makes data quality crucial to developing and running successful machine learning algorithms.
Data quality matters when training and validating a model. It is especially important when running the model against new samples, that is, when an already trained model is applied – a process known as scoring. Data quality also matters after a model has been trained: during its implementation, during monitoring, and whenever the model needs to be re-calibrated.
Clean data is required for both supervised and unsupervised training. “Dirty” or incomplete data undermines your ability to produce a usable and accurate model. Data cleaning needs to be embedded in all stages of the machine learning cycle: training, validating, tuning, systematic monitoring of model performance, model re-calibration, and preparation of scoring samples.
One may argue that machine learning is usually associated with “big data” or data lakes and that, therefore, some “noise” in the data is not so critical when training a model. It is true that training machine learning models often requires millions of examples and that incomplete data is simply part of the process. However, unclean data may produce undesired results, such as the creation of incorrect segments in classification tasks or erroneous predictions. Outliers may affect applications built on big data as well as applications trained on a limited number of observations. Modern machine learning models such as decision trees, random forests, artificial neural nets, and gradient boosting are indeed more robust in dealing with outliers than traditional regression models. Even so, the very existence of outliers can be a real problem. For example, unsupervised learning such as clustering may produce a special cluster (or clusters) that groups the outliers. If those outliers are the result of data collection errors, then that cluster is incorrect, and future scoring may lead to wrong cluster identification.
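One way to keep data-collection errors from forming a spurious cluster is to flag suspicious values before training. The sketch below is a hypothetical example (the function name, the interquartile-range rule, and the sample values are all assumptions, not part of the original text) of how such a pre-clustering check might look:

```python
# Hypothetical sketch: flag likely data-collection errors with the IQR rule
# before clustering, so that erroneous points cannot form their own cluster.
# The crude quartile lookup below is for illustration only.

def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    xs = sorted(values)
    n = len(xs)
    q1 = xs[n // 4]          # rough first quartile
    q3 = xs[(3 * n) // 4]    # rough third quartile
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lo or v > hi]

daily_sales = [102, 98, 105, 99, 101, 97, 9999]  # last value: an entry error
suspects = iqr_outliers(daily_sales)  # indices of suspicious records
```

Flagged records would then be reviewed or excluded before the clustering step, rather than left to distort the clusters themselves.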
Furthermore, some machine learning algorithms require the original data to be transformed first, for example by scaling or encoding the features.
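As a minimal illustration of such a transformation, the sketch below applies z-score standardization, one common preprocessing step; the function name and sample values are assumptions made for the example:

```python
# Minimal sketch of a common transformation: z-score standardization,
# rescaling a feature to zero mean and unit standard deviation.

def standardize(values):
    """Return values rescaled to mean 0 and standard deviation 1."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5
    return [(v - mean) / std for v in values]

scaled = standardize([10.0, 20.0, 30.0])  # symmetric around 0
```

If the raw data contains errors, the transformation quietly propagates them: a single wrong value shifts the mean and standard deviation for every record.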
Generally, data cleaning in machine learning should focus on features (also known as inputs or predictive variables), labels (also known as target variables), incomplete data such as omitted features or missing values, and the existence of redundant data. The latter may bias results by overweighting the information contained in the redundant records.
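Removing exact duplicate records is one straightforward guard against that overweighting. The following sketch is a hypothetical example (the function name and the toy records are assumptions) of keeping only the first occurrence of each record:

```python
# Hypothetical sketch: drop exact duplicate records so repeated rows
# do not overweight their information during training.

def deduplicate(records):
    """Keep the first occurrence of each record (dicts with hashable values)."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))  # order-independent record signature
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

rows = [{"id": 1, "y": 0}, {"id": 1, "y": 0}, {"id": 2, "y": 1}]
clean = deduplicate(rows)  # two unique rows remain
```

In practice, near-duplicates (same entity, slightly different field values) are harder to catch and usually need fuzzy matching rather than an exact-signature check like this one.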
Uncleaned data affects machine learning in two ways: (1) wrong, incomplete, or biased data used to train a model, and (2) “dirty” sample data scored by a model.
The former problem affects the quality of the trained model. A wrongly trained model will produce bad results even on clean scoring samples. What makes this problem severe is that performance criteria such as misclassification rate, mean square error, R-square, or the Gini and ROC indexes may indicate a good model with no overfitting when comparing training and validation data. In reality, a model that fits unclean data well may be irrelevant when applied to a good dataset.
The problem of a “dirty” sample for scoring will affect the outcome of that scoring even if you have an excellently fitted model. When the number of problematic samples is small, the problem will probably go undetected. If there are many bad samples, the issue may be caught by a systematic monitoring process, once a degradation in model performance is noticed. However, by then the wrong scoring results of the past cannot be corrected – it is simply too late. In addition, concluding that the model is underperforming may lead to unnecessary re-calibration or re-training. As a result, a worse model may be produced to replace a good one.
Let us consider an example where a time series of daily sales has a seasonality effect at the end of each month. Assume the date format was mangled during transfer, swapping days and months. As a result, no record will show a day of the month greater than 12. When predicting sales, the existing seasonality effect will be missed even by the most advanced recurrent neural network, and the outcome will simply be wrong.
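The day/month mix-up above is easy to reproduce with Python's standard date parsing; the sample date string below is an assumption chosen to show the silent swap:

```python
# Parsing a date with the wrong format string silently swaps day and month
# whenever the day is <= 12 -- and fails outright whenever it is not.

from datetime import datetime

raw = "04/05/2023"  # intended as day/month/year: 4 May 2023
wrong = datetime.strptime(raw, "%m/%d/%Y")  # misread as 5 April 2023
right = datetime.strptime(raw, "%d/%m/%Y")  # correctly read as 4 May 2023

# Dates with a day after the 12th cannot even be misparsed:
try:
    datetime.strptime("13/05/2023", "%m/%d/%Y")  # there is no month 13
    failed = False
except ValueError:
    failed = True
```

This is exactly why the corrupted dataset in the example contains no days after the 12th: records that would expose the error fail to parse, and the rest are silently swapped.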
Incomplete data is one of the major machine learning problems. Some machine learning algorithms cannot handle missing values at all: they may discard entire rows even if only one field is missing. Others may treat a missing value as a valid one. All of these cases lead to a suboptimal model during training and to incomplete results or elevated errors during scoring. Even populating missing values – whether with simple approaches such as averages or with more sophisticated ones – cannot mitigate the issue in most cases.
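Mean imputation, the simplest of those approaches, can be sketched as follows; the function name and sample values are assumptions, and, as the text notes, this is a stopgap that can still bias the model:

```python
# Hedged sketch of mean imputation for a numeric feature: replace each
# missing entry with the average of the observed entries.

def impute_mean(values, missing=None):
    """Fill entries equal to `missing` with the mean of the rest."""
    observed = [v for v in values if v is not missing]
    mean = sum(observed) / len(observed)
    return [mean if v is missing else v for v in values]

feature = [3.0, None, 5.0, None, 4.0]
filled = impute_mean(feature)  # gaps filled with the observed mean
```

Note that imputation shrinks the feature's variance and can mask whatever caused the values to be missing in the first place, which is why it does not fully solve the problem.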
Another typical example of “dirty” data relates to uncleaned geographical places. If there is an inconsistency between the city and state fields, then New York may be recognized as either a city or a state; the same goes for Washington. More intriguingly, Geneva can belong either to Switzerland or to New York state. Another example is formatted numeric data, such as phone numbers, used to identify geographical location. These numbers may suffer from poor data quality, for instance mixing the area code with the country code. Running spatial machine learning algorithms on examples of such poor quality may place locations thousands of miles from their actual positions.
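A simple guard against such city/state inconsistencies is a lookup against a reference table. The sketch below is hypothetical (the tiny `VALID` table and function name are assumptions; a real pipeline would use a full gazetteer):

```python
# Hypothetical consistency check between city and state fields against a
# small assumed reference table of valid (city, state) pairs.

VALID = {
    ("New York", "NY"),
    ("Geneva", "NY"),      # Geneva, New York state
    ("Seattle", "WA"),
}

def inconsistent(records):
    """Return records whose (city, state) pair is not in the reference table."""
    return [r for r in records if (r["city"], r["state"]) not in VALID]

rows = [
    {"city": "New York", "state": "NY"},
    {"city": "Seattle", "state": "NY"},   # mismatched pair
]
bad = inconsistent(rows)  # only the mismatched record is flagged
```

Records flagged this way would be routed to correction rather than fed to a spatial model that could place them thousands of miles off.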
The bottom line: clean data is crucial to success in machine learning, and you should perform effective data cleaning before training a model and again whenever you use it.
So how do you deal with incomplete, wrong, and unstandardized data? Well, before you can think of ways to clean your data, you need to identify where the bad data is located. Is it affecting one field only? Does it impact most records or just a few? A quick way to find out is to examine a small sample of the dataset and look for issues. Once you have found the problem, you can decide whether to clean it or to exclude certain data points or fields from your models. Your other option is to clean your historical data using third-party data cleaning services and to prevent the data quality issues from recurring in the future.
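That first step – locating the bad data – amounts to a simple profiling pass. The sketch below is an assumed example (function name, field list, and sample records are all illustrative) of counting missing values per field:

```python
# A minimal profiling pass: count missing values per field to see
# where bad data lives before deciding how to clean it.

def missing_counts(records, fields):
    """Count entries that are None or empty strings, per field."""
    counts = {f: 0 for f in fields}
    for rec in records:
        for f in fields:
            if rec.get(f) in (None, ""):
                counts[f] += 1
    return counts

sample = [
    {"age": 34, "city": "Geneva"},
    {"age": None, "city": ""},
    {"age": 29, "city": "Seattle"},
]
profile = missing_counts(sample, ["age", "city"])  # {"age": 1, "city": 1}
```

Such a profile answers the questions above directly: it shows whether one field or many are affected, and whether the problem touches a handful of records or most of them.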