Cleaning data is essential prior to beginning your analysis or loading data into your business intelligence tool. If you load data that is inconsistent, inaccurate, unstandardized, duplicated, incomplete and outdated it will be impossible to get the right analysis done. What can be worst is that without realizing you have bad data is that you will make the wrong business decision which can impact business’s bottom line. Here is a 7 step approach to cleaning data for your analytics.
- Data Audit – The first step is to analyze the actual integrity of your data. You can do this by exporting the first 100 rows of your data set(s) with all the columns that you wish to use in your analytics. Examining the data will provide you with your data cleansing plan. Some things to look for includes: data completeness, any standardization/normalization issues, are there duplicates, how outdated is your data (if you do not filter by dates), any formatting issues and how correct is the data (which can be hard to establish but still should be examined.
- Formatting – The second step is to format your data so it is usable. This could include standardizing your fields by grouping them or identifying the need to use text mining techniques on text fields. Date fields may need formatting and so on.
- Data Validating/Verification – Perhaps the most important part of the process is making sure the data pulled is accurate. Sometimes you can reference the data from a different data source or validate it using third party databases. Alternatively, you may wish to consult different departments to understand how reliable the data is for your analysis. You may wish to exclude data points that maybe not accurate or decrease the confidence level of using those variables for your analysis. Depending on data structure you may not be able to verify or validate the data.
- Data Appending – Once you have validated or verified information, you may uncover that you are missing critical information that may skew your models or make it impossible to have any reliable results. The solution is to figure out how to acquire this data. You may look at other internal data sources that can host this data, acquire the data using third party data providers or conduct research to try to find these data points.
- De-duping – You may uncover that you have duplicate records in your system, this can skew your results. Therefore, de-duping based on business rules should be included as part of your data cleansing. You can use de-duping tools to speed up your data cleansing process.
- Identify data filters Now that you have formatted, validated, appended and de-duped, you may wish to exclude certain data points from your analysis by applying a filter(s) to your overall data set. For example, you may have discovered that 30% of your data is from the 1970s and is not applicable to your eCommerce business today, therefore excluding any data that is older than 5 years maybe beneficial.
- Running basic metrics – Finally you can begin your analysis, however, prior to doing so it is recommended to run some basic description metrics to finalize your data cleaning initiative. This could involved basic record count, some basic stats on percentage of users that x, and so on. Running these basic results with different people in your organization can help validate your data cleansing initiative and help with establishing a higher confidence in your future analysis.
Cleaning data for analysis can be time consuming and will typically take longer than doing the analysis! If you are looking to clean your data so you can focus on analysis, StrategicDB can help, we are a full service data cleansing and analytics company, focusing on high data integrity for business decision making.