The world is full of data-intensive applications. Data is everywhere. And it is a logical evolution that over the last decades the way management decisions are made has shifted drastically: from the board of directors' gut feeling to diligent analysis by departments of highly educated people, who build their conclusions on the solid foundation of market, sales, customer, and other data.
The more data sources you have, the more people are involved in gathering, massaging, and analyzing the data. Mobile apps, marketing, the online store, stock management… So many sources of insightful, useful information for decision-making! And all of them will want to integrate with your analytics and data engineering teams. The bigger your company, the more integration points your data team will have with its data partners.
“Well, yes,” you would say, “what can possibly happen here?” Oh, my sweet summer child!
It is obvious that in this changing world you can never expect all your systems to work stably, without failures and, more importantly, without changes. Change is the natural way for a company to react to external turbulence. Even if you are not a fast-growing business, you need to adapt: changes in laws, court decisions, new markets opening up for your goods. The world is challenging your company to be flexible and agile.
Everything written above has a huge impact on your data team. At some point, you can end up with a crowd of data providers asking you to implement an integration change right here, right now, because they have already changed their code and deployed it to production.
You will have to obey. You will have no time for improvements or proofs of concept, and sometimes not even for fixing bugs. Every engineer on the data team will be working on yet another change request. You will lose control of the company’s data future, as everyone’s contribution to the growth of the data platform and the evolution of the team will be negligible. That is data hell. A hell you cannot escape, because most of these data sources feed the decision-making process, and stopping or postponing these changes will directly hurt financial results.
Hopefully, I have not scared you into shock. Keep your seatbelts fastened, kids, everything is under control. You are not the only one waiting for this not-so-bright, not-so-shiny future with all its twists and turns. Many companies are trying to solve these no-longer-hypothetical problems right now. And there is one logical, well-reasoned solution. But first, let’s find the cause of the problems.
When your data team agrees with another team on a new data supply, I hope they discuss the future integration in detail (say “yes” here, don’t disappoint me, please). You have to agree on the tool and technology, on data types and frequency, on schedules and event triggers. Will it be file integration via SFTP or a new Kafka topic? A daily load, or a process that starts when a new file arrives in object storage? Will each record have an integer identifier or a string-based one? And so forth. This negotiation and its results are usually well documented and shared among all parties.
The main cause of potential problems here is that compliance with the agreement will be checked manually. Compare this to the difference between manual and automated test execution. Yes, you have a test. Yes, you can execute it. No, you are not protected from production issues, because we are all human. You can only rely on the rules, instructions, and regulations for code deployments in your company and pray that people follow them. You asked at the beginning what could happen here. Well, you get a call from your modern, expensive alerting system: your nightly pipeline run has failed because of an unexpected file structure. The data provider recently deployed changes, and you have already received a file with the new structure. There is no way to revert their changes. All you can do is apply a hotfix to your pipeline.
How do you protect your data car from drifting out of its lane? Correct: you switch to a car with lane-keeping assistance, which holds the car in its lane for you automatically.
First of all, the contract between your data team and a provider (let’s call it a data contract) should be formalized and converted into a machine-readable format. It could be JSON, it could be YAML; it doesn’t matter, you just need to agree on a company standard. Once that agreement is made, define and describe all the existing data contracts between the data team and its suppliers.
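As an illustration, a minimal machine-readable contract could look like the sketch below. Every field name, and the JSON-Schema-style layout itself, is a hypothetical choice for this example, not a standard the article prescribes:

```python
import json

# A hypothetical data contract for a daily SFTP file feed.
# The structure loosely follows JSON Schema for the payload part;
# all names here are illustrative assumptions.
order_events_contract = {
    "contract_name": "order_events_daily",
    "delivery": {"channel": "sftp", "schedule": "daily"},
    "schema": {
        "type": "object",
        "required": ["order_id", "amount", "created_at"],
        "properties": {
            "order_id": {"type": "integer"},
            "amount": {"type": "number"},
            "created_at": {"type": "string"},
        },
    },
}

# Serialize to JSON so the contract can be stored, diffed, and shared.
contract_json = json.dumps(order_events_contract, indent=2)
```

The same document could just as well be kept as YAML; what matters is that both sides read and validate the exact same machine-readable artifact.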
The second step is to store all the data contracts somewhere. Any object storage, such as AWS S3, will do. If the solution grows, you will be able to manipulate those contracts with middleware such as AWS Lambda.
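A sketch of the storage step, assuming S3 via boto3. The bucket name and key layout are made up for illustration, and the upload itself sits inside a function (with a lazy import, since it needs AWS credentials at runtime), while the key-building convention is plain, testable logic:

```python
import json

CONTRACT_BUCKET = "acme-data-contracts"  # hypothetical bucket name


def contract_key(team: str, name: str) -> str:
    """Build a predictable S3 key so any party can locate a contract."""
    return f"contracts/{team}/{name}.json"


def publish_contract(team: str, name: str, contract: dict) -> str:
    """Upload a contract document to object storage; returns its key."""
    import boto3  # lazy import: only needed when actually publishing

    key = contract_key(team, name)
    boto3.client("s3").put_object(
        Bucket=CONTRACT_BUCKET,
        Key=key,
        Body=json.dumps(contract).encode("utf-8"),
        ContentType="application/json",
    )
    return key
```

With a fixed convention like `contracts/<team>/<name>.json`, providers’ CI jobs can fetch the consumer-managed contract without any extra lookup service.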
The third and most important step is to add contract validation to the data providers’ CI processes. This has to happen on their side, so potential integration problems are prevented at the source, before data is ever generated in a different format. Providers can implement this step on their own, or you can build a company-wide solution and hand them a simple tool. The key point is that the contract is managed by the consumer.
The whole idea of data contracts is to generate a sample from the provider’s application every time they make changes and verify that the sample complies with the data contract. If the data team is not ready to receive data in the new format, it makes no sense for the provider to promote those changes to production, and their CI should fail at this step.
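The CI check itself can be as small as validating a generated sample against the stored contract. Here is a minimal, stdlib-only sketch; a real setup would more likely use a library such as jsonschema, and the type mapping below is an illustrative assumption:

```python
# Maps JSON-Schema-style type names to Python types (illustrative subset).
_TYPES = {"integer": int, "number": (int, float), "string": str}


def validate_sample(sample: dict, schema: dict) -> list:
    """Return a list of violations; an empty list means the sample complies."""
    errors = []
    for field in schema.get("required", []):
        if field not in sample:
            errors.append(f"missing required field: {field}")
    for field, rules in schema.get("properties", {}).items():
        if field in sample and not isinstance(sample[field], _TYPES[rules["type"]]):
            errors.append(f"wrong type for {field}: expected {rules['type']}")
    return errors


schema = {
    "required": ["order_id", "amount"],
    "properties": {"order_id": {"type": "integer"}, "amount": {"type": "number"}},
}

# The provider's CI generates a sample from the changed code; any violation
# fails the build (e.g. via sys.exit(1)) before the change reaches production.
sample = {"order_id": "A-17", "amount": 9.99}  # order_id drifted to a string
violations = validate_sample(sample, schema)
```

In this sketch the drifted `order_id` produces a violation, which is exactly the point: the provider learns about the break in CI, not from your failed nightly pipeline.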
This solution can definitely be improved and adapted to different situations. For example, contract versioning can be added later, as can support for different data formats. For really big teams, it might be worth adding contract status tracking to the service.
One more thing. You probably won’t be able to inject this validation into external data generators such as public data providers. However, you can implement checks at the point closest to the integration. Of course, for large files, validating all incoming data may not be cost-effective, but optimizations deserve an article of their own.