- This post shares the challenges we faced and the insights we gained as the data team at AIMMS while working with data from various sources.
- We used Great Expectations to power our data monitoring system.
- With this system we identified data quality issues and prevented unexpected results in our analysis.
Most people working in analytics will recognize the following scenario: you need to analyze data that is produced by other teams. Because you only consume the data source and do not manage it yourself, you first have to investigate it to learn whether the analysis can be done and how confident you can be in the results. However, a one-time investigation is not enough: data can change, and such a change can have a huge impact on the results you deliver as an analytics team.
The data challenge at AIMMS
As the data team at AIMMS, we are tasked with creating insights from the data we collect at AIMMS. This means combining different data sources and performing all kinds of analysis on the resulting data. As in most organizations, data at AIMMS is scattered across multiple locations and is maintained by different teams for their specific use cases. For example, the commercial teams maintain the contract data, while the product teams maintain data about product usage. We realized that if we want to generate accurate insights, we need to be able to trust the data we use. In other words, what we expect of the data should always hold, or we should be alerted when an expectation no longer holds so that we can act.
In this post I would like to share our approach and experiences with implementing a data monitoring system to make sure the data we use for analysis has the quality we expect of it.
The data monitoring system under the hood
To realize our data monitoring solution we decided to use a tool called Great Expectations (GE). GE lets you define your expectations about your data. For instance, if you collect a datapoint such as the number of visits to a website, you expect this datapoint to always be a positive number and to be an integer. GE has a lot of ready-to-use expectations available, and it’s possible to write your own custom expectations. Expectations can be defined in simple Python code, which makes the system flexible and extensible.
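To make the idea of an expectation concrete, here is a minimal sketch in plain Python of the website-visits example. It does not use the Great Expectations API; the names `ExpectationResult`, `expect_values_to_satisfy`, `is_integer` and the sample data are hypothetical and only illustrate the concept. In GE itself, ready-made expectations such as `expect_column_values_to_be_between` and `expect_column_values_to_be_of_type` cover checks of this kind.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class ExpectationResult:
    """Outcome of one expectation (hypothetical structure, not GE's own)."""
    name: str
    success: bool
    unexpected_values: List[Any] = field(default_factory=list)


def expect_values_to_satisfy(
    name: str, values: List[Any], predicate: Callable[[Any], bool]
) -> ExpectationResult:
    """Collect every value that violates the predicate."""
    unexpected = [value for value in values if not predicate(value)]
    return ExpectationResult(name, success=not unexpected, unexpected_values=unexpected)


def is_integer(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


# Made-up sample data: two values break our expectations.
website_visits = [120, 98, -3, 101.5, 87]

results = [
    expect_values_to_satisfy("visits are integers", website_visits, is_integer),
    expect_values_to_satisfy("visits are positive", website_visits, lambda v: v > 0),
]

for result in results:
    status = "passed" if result.success else "FAILED"
    print(f"{result.name}: {status} {result.unexpected_values}")
```

Each check reports which values were unexpected rather than just failing, which is what makes the outcome actionable for whoever has to fix the data.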
We set up our data monitoring solution as a job that runs weekly on our data platform and checks the data against the expectations we defined. Each run executes a GE checkpoint, and the resulting HTML report is sent to the data team by email so the team can act on this information and clean up any suspect data.
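The reporting step could look roughly like the sketch below, which turns a list of validation outcomes into an HTML report and wraps it in an email message using only the Python standard library. This is an illustration under assumptions, not our production job: the function names, the `(name, success, unexpected_values)` tuple format, and the example addresses are all hypothetical, and GE produces its own HTML output rather than the simple table built here.

```python
from email.message import EmailMessage
from html import escape


def render_html_report(results):
    """Render (name, success, unexpected_values) tuples as a small HTML table."""
    rows = "".join(
        "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(
            escape(name),
            "passed" if success else "FAILED",
            escape(str(unexpected)),
        )
        for name, success, unexpected in results
    )
    header = "<tr><th>Expectation</th><th>Status</th><th>Unexpected values</th></tr>"
    return (
        "<html><body><h1>Weekly data validation</h1>"
        "<table>" + header + rows + "</table></body></html>"
    )


def build_report_email(results, sender, recipient):
    """Build (but do not send) an email carrying the HTML report."""
    failed = sum(1 for _, success, _ in results if not success)
    message = EmailMessage()
    message["Subject"] = f"Data validation: {failed} of {len(results)} expectations failed"
    message["From"] = sender
    message["To"] = recipient
    # Plain-text fallback for mail clients that do not render HTML.
    message.set_content("This report is best viewed in an HTML-capable mail client.")
    message.add_alternative(render_html_report(results), subtype="html")
    return message


# Made-up outcomes of one weekly run.
example_results = [
    ("visits are integers", False, [101.5]),
    ("visits are positive", False, [-3]),
    ("contract id is present", True, []),
]

report_email = build_report_email(
    example_results, "data-monitor@example.com", "data-team@example.com"
)
print(report_email["Subject"])
```

Actually sending the message would be a matter of passing it to `smtplib.SMTP.send_message` with your own mail server settings, and the weekly schedule would come from whatever scheduler your data platform provides.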
Results of our data validation monitoring system
The data monitoring solution has now been running in production for more than a month, and it has surfaced various issues in our production data: six of our data pipelines delivered low-quality data. With the monitoring in place, we were able to identify, monitor and resolve these problems within days, instead of being surprised after the data had entered the analysis and affected the results.
I am curious about your experience and how you approach data validation, especially when working with AIMMS Low Code Platform or SC Navigator. Let me know what you think in the comments.