DataChef's Data Product Programming Model: An overview
March 6, 2023
Has anything like this ever happened to you?
- Developing three Spark applications to move 100 rows of Excel data through a pipeline?
- Building a streaming pipeline that only receives a few events per week?
- Spending several sprints on a product the business only needs once?
- Delivering data products late because of platform complexity?
You are not alone, and the following might interest you! 🤞🏻
When working on a data platform, developers typically design the main architecture first, regardless of the methodology they use. All future data products are then expected to build on this architecture; consequently, it is designed to support any data product the organization might conceivably need in the future.
Team members’ interests sometimes influence this design: if the leading developers are Kafka experts, Kafka will likely be part of that architecture. And when the CTO has promised an innovative solution, it is reasonable to expect the main architecture to incorporate bleeding-edge technologies or paradigms.
At first, everything seems bright and smooth, and the system even meets business needs to a certain extent. The real problems arise when the team realizes the following:
- Actual and expected throughput differ significantly.
- The availability of data sources and products varies.
- Some products are experimental, and some data products may only need to run once.
- As a result, delivery timelines and estimates are becoming unrealistic.
The Only Constant in Life is Change - Heraclitus
And that’s true in the world of software and data products.
What are we missing?
A major problem with designing the architecture before building the data products is that no one yet knows what those products should look like. The data mesh paradigm, for example, provides extensive documentation on how to shape a data product. However, its primary focus is on the data and the platform rather than the architecture. Typically, the architecture is designed once, and the data teams then build products on top of it.
Some teams develop sample data products, proofs of concept, or actual data products based on one data source. However, they lose sight of all the future data products they will need to build.
Because of this limited view of the available data, and of what to expect in the future, the problems listed in the previous section are bound to occur.
With the virtually unlimited resources cloud services offer today, optimizing an application for the highest possible load is easier than ever. Any manager will be interested in how far the proposed architecture can scale up; hardly anyone asks how well it can scale down. Then the bleeding-edge stream processing architecture ends up handling 10k events in six months at a cost of $5,000 per month, and all eyes turn to the developers to justify its actual impact.
Listen to your customers!
We are missing a valuable asset that can help us prevent this issue: our customers, the people who will actually use our product. Listening to their needs, and listening carefully, is the first principle of a successful outcome. After all, what we create is supposed to generate value for the organization by supporting our customers' work.
However, this doesn’t mean we should expect technical guidance from them. They are not tech-savvy like us, but they are business experts, and we need to understand each other within clear boundaries.
On the other hand, they also need to understand our challenges. Communication should be bidirectional: it helps them see whether the complexity of implementing a requirement is worth the development cost.
Only when we design our products based on a clear mutual understanding, and tailor them to our customers' needs, can we be confident that they provide the value we were looking for.
In need of a common language
While working on client projects at DataChef, we recognized this problem and committed to overcoming it. We place a high priority on the value of the products we create and on their impact on the overall performance of the organization. To address the lack of effective communication, we developed a common language.
This language, called the “Data Products Programming Model,” is incorporated into our process of creating data products and helps avoid the issues we initially mentioned in this article.
What is the Data Products Programming Model?
To design a data product, we have three main goals to achieve. Each data product we create should be:
- Understandable: all parties interacting with it should be able to understand it.
- Easy to implement: fast delivery and a short feedback cycle prevent the unnecessary cost of developing overly complex data products.
- Lightweight: easy to maintain and change in the future, so it can adapt to new requirements.
With these three goals defined, the programming model acts as a compass that helps the engineering team achieve them. Let’s have a high-level overview of how that happens.
Understandable
The programming model should be simple enough for non-technical people to pick up and use in design conversations. One of its primary audiences is business experts, who are naturally not interested in learning the technical complexity of existing systems. Therefore, we don’t want to spend time training them on an unfamiliar concept in which they probably won’t have much interest. The programming model achieves this goal by being defined around a single image, which takes less than 5 minutes to explain (so no special certification or resume points are expected from it 🤓).
Lightweight
The model keeps each product lightweight by:
- making the business aspects of the product understandable to engineers;
- making the implementation challenges visible to business customers;
- tailoring a unique architecture for each data product.
Incorporating the programming model into the data product design process in this way results in lightweight components that are easy to maintain and change.
Easy to implement
This goal is a direct outcome of the other two: by designing a lightweight and understandable architecture, we ensure the final implementation will be reasonably low cost.
In this blog post, we defined the main problem and discussed how we resolve it for our clients at DataChef. In the rest of this series, we’ll cover the end-to-end process of using this model and integrating it into our data product design workflow.