All About Modularity and Scale (Software Stack @ Zenatix, Part-3)
This is the third and final article in the series I have written to provide an inside peek into the IoT stack that we have been developing at Zenatix. The first one in the series discussed the core components that constituted the initial architecture, and the second one focused on how we closed the loop with automated control and analytics (metrics and issues).
As the business started to grow, a few things started becoming imperative, resulting in us taking a few critical steps during 2018–2020 that helped us prepare for an order of magnitude larger scale.
- Our stack was becoming a complex monolith, with complex logic interleaved across different parts of the stack. As a startup, since speed of execution was important, comprehensive testing was not done. The stack was not written with testing in mind; in other words, Test Driven Development (TDD) was not followed. As a result, even small incremental changes were becoming overly complex. That is when we decided that we needed to start the journey of converting this monolith into a service oriented architecture.
- Diverse UI requirements started coming in, both for different types of users within an organisation (e.g. a store manager vs a regional manager vs a national head) and for users across organisations. The way the dashboard was built, it was not easy for us to accept and promptly deliver custom requirements for some of these users.
- The edge stack was getting deployed at 1000s of locations, and new business requirements kept coming in for interfacing with new devices. There was a need to rethink the edge as a micro-platform interacting with a diverse variety of wired and wireless devices.
- Requirements from the cloud infrastructure were becoming complex. We had free credits from one provider expiring but a whole bunch still left over with another cloud provider. To be cost effective, we were hosting parts of the stack across different cloud providers. In the process, focused DevOps emerged as a requirement.
Before I get into the technology aspects of how these requirements for modularity and scale were addressed, let me first comment on a couple of aspects that started gaining importance at this stage of the company.
Need for the “Product” role — While the initial years were all about building things iteratively (as a suitable product market fit is recognised), as the stack matured and was ready to serve multiple use-cases, it became important that the focus be clearly defined. This is where a separate “Product” function (a role that had been mixed with the technology team until this time) starts becoming important. At Zenatix, we started building the product function with the following thoughts:
- We need to think of two types of users — internal (our deployment and troubleshooting team for whom we have been building the technology components to effectively manage large scale deployments) and external (our customers), and define product requirements for both.
- Start with someone internal to the company, who has the context, and later on get someone from outside who is experienced in product role.
- Product should be independent of technology, and hence the reporting structure should be planned accordingly.
Establishing the Product function helped in two major ways — requirements from the business side got filtered at the product level, and a high level roadmap got created with a focus on where we want to go as a company.
Deeper focus on culture — As we keep scaling the company, it becomes all the more important to talk repeatedly about the culture. In the early stages, new recruits work closely with the founders and the culture rubs off on them. As the company scales up, we had to ensure that these cultural aspects are deeply ingrained so that they get passed on to new recruits (who, in a scaled up company, do not work closely with the founders).
Coming back to the technology development needed to cater to the critical needs outlined above.
A monolith to a service oriented architecture — We started taking one service at a time (whichever was the one in which we had to make significant changes or feature additions) out of the monolith. It started with the Issues service, and the most recent one to come out is the Metrics service. Having started almost 18 months ago, we aim to finish this transition into a fully service oriented architecture in the next 3–6 months. A lesson learned through this technology transition is that moving to a service oriented architecture is a long journey, and one should plan for it accordingly. As we started separating services out of the monolith, we started dockerising them.
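As an illustration of what "carving a service out" tends to produce (this is a hypothetical sketch, not our actual code), each extracted service becomes a small, independently deployable web application with its own narrow API. The service name, endpoints and payload below are assumptions made purely for illustration, here sketched in Python with Flask:

```python
# metrics_service.py -- hypothetical sketch of a small service carved out of a monolith.
# Endpoint names and payload shape are illustrative assumptions, not the actual Zenatix API.
from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory store used only for the sketch; a real service would own its own database.
METRICS = {}

@app.route("/health")
def health():
    # Simple liveness endpoint, useful once the service runs inside a container or cluster.
    return jsonify(status="ok")

@app.route("/metrics/<site_id>", methods=["GET", "POST"])
def metrics(site_id):
    if request.method == "POST":
        # Accept a metric reading such as {"name": "energy_kwh", "value": 12.4}
        METRICS.setdefault(site_id, []).append(request.get_json(force=True))
        return jsonify(accepted=True), 201
    return jsonify(METRICS.get(site_id, []))

if __name__ == "__main__":
    app.run(port=8080)
```

A service shaped like this can then be dockerised and versioned, deployed and scaled on its own, independent of the rest of the stack.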
As the number of these services grew, we realised that managing so many of them was becoming tedious, and that we needed to move them into a Kubernetes cluster.
Kubernetes was a very new infrastructure technology for us, but we committed to the transition (understanding the benefits it would provide) and over time have moved all the services (other than the database) into the cluster. Such a service oriented architecture also necessitated changing the way we ingest data, and thus a Kafka cluster came into the stack.
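To make the change in ingestion concrete, the sketch below shows the general pattern (under assumed broker address, topic name and message fields, not our production code): instead of every component writing into one shared pipeline, gateways and services publish readings to a topic, and each downstream service consumes it at its own pace.

```python
# Hypothetical ingestion sketch using the kafka-python client.
# Broker address, topic name and message fields are illustrative assumptions.
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

reading = {
    "site_id": "store-42",        # assumed identifier
    "sensor": "energy_meter_1",   # assumed sensor name
    "value": 12.4,
    "ts": time.time(),
}

# Metrics, issues, dashboards and other services can each consume this topic
# independently instead of being coupled inside a single monolithic pipeline.
producer.send("sensor-readings", value=reading)
producer.flush()
```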
Modular Dashboard — Over the years we faced many dashboard requirements needing configurations that vary not just per customer but also across different stakeholders within a customer (e.g. from a restaurant manager to an operations head managing 100s of restaurants across the country). As we worked on the suite of services (that we internally term the configurable dashboard) for this configurability, we also introduced a real time database that enabled us to provide fast and responsive web interfaces, in spite of the huge volume of data processed by the widgets therein. We have come such a long way in achieving the goal of configurable dashboards across customers and use cases that I am sometimes surprised by the dashboards the program management team configures, without any involvement from the technology team, for the new use cases that the business team keeps bringing in.
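One way to picture this configurability (a conceptual sketch with hypothetical field names and roles, not the actual dashboard platform) is that widgets and layouts are described as data rather than code, so different stakeholders get different views without any involvement from engineering:

```python
# Hypothetical sketch of configuration-driven dashboards: widgets are plain data,
# so a new stakeholder view is a new configuration, not a new code change.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Widget:
    kind: str          # e.g. "energy_trend", "open_issues"
    title: str
    sites: List[str]   # which sites this widget aggregates over

@dataclass
class Dashboard:
    role: str
    widgets: List[Widget] = field(default_factory=list)

# A store manager sees one site in detail; an operations head sees a roll-up of many.
store_manager = Dashboard(role="store_manager", widgets=[
    Widget(kind="energy_trend", title="Today's consumption", sites=["store-42"]),
    Widget(kind="open_issues", title="Open issues", sites=["store-42"]),
])

ops_head = Dashboard(role="operations_head", widgets=[
    Widget(kind="energy_trend", title="Consumption across region",
           sites=[f"store-{i}" for i in range(1, 101)]),
])

if __name__ == "__main__":
    print(store_manager)
```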
Edge as a micro-platform — In our attempt to simplify deployment (and make it as plug and play as possible), we decided to invest in a low power, multi-hop wireless mesh technology. This required a significant effort on the end node side (building upon the OpenThread protocol from Google). It also required significant development at the edge gateway to support IPv6, much more than simply creating a driver as we had been doing for the other wired and wireless interfaces supported thus far. Over the years, we had also made a whole lot of ad-hoc changes in the edge IoT stack. As a result, we felt the need to re-architect the edge middleware, knowing all the detailed features we need to support, and to implement those features in the right way.
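One way to think about the driver side of such a re-architecture (a sketch under assumed class and method names, not the actual edge middleware) is a common interface that every wired or wireless integration implements, so adding a new device type does not touch the rest of the gateway stack:

```python
# Hypothetical sketch of a uniform driver interface at the edge gateway.
# Class and method names are illustrative assumptions, not the real middleware API.
from abc import ABC, abstractmethod
from typing import Dict

class DeviceDriver(ABC):
    """Common contract that every wired/wireless integration implements."""

    @abstractmethod
    def connect(self) -> None: ...

    @abstractmethod
    def read(self) -> Dict[str, float]:
        """Return the latest readings as a flat dict of point name -> value."""

class ModbusEnergyMeter(DeviceDriver):
    def __init__(self, port: str, unit_id: int):
        self.port, self.unit_id = port, unit_id

    def connect(self) -> None:
        # A real driver would open the serial port here.
        pass

    def read(self) -> Dict[str, float]:
        # Placeholder values; a real driver would issue Modbus requests.
        return {"energy_kwh": 1234.5, "voltage": 229.8}

class ThreadSensorNode(DeviceDriver):
    def __init__(self, ipv6_address: str):
        self.ipv6_address = ipv6_address

    def connect(self) -> None:
        # A real driver would join/verify the IPv6 mesh here.
        pass

    def read(self) -> Dict[str, float]:
        return {"temperature_c": 24.1}
```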
We undertook this massive edge stack restructuring (the first project where we followed TDD, and for which a complete testing environment was also developed), which was followed by months of effort to ensure that all the gateways running in the field got transitioned to the new stack. This update, across all the field gateways, became challenging since several of the gateways were running older (and varied) versions of the OS and had missing dependent libraries, requiring manual intervention at many of the sites. This was compounded by the limited bandwidth (due to 2G connections, or 3G being rolled back by the telecom provider) available at a number of locations.
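Following TDD for the edge stack meant drivers like the ones sketched above could be exercised against fake devices rather than field hardware. A minimal pytest-style sketch (again with hypothetical names) of that kind of test-first coverage:

```python
# Hypothetical pytest sketch of the test-first coverage used while restructuring
# edge drivers; FakeEnergyMeter stands in for real hardware on a serial port.
from typing import Dict

class FakeEnergyMeter:
    """Test double that mimics a driver without touching real hardware."""
    def connect(self) -> None:
        self.connected = True

    def read(self) -> Dict[str, float]:
        return {"energy_kwh": 100.0, "voltage": 230.0}

def poll_once(driver) -> Dict[str, float]:
    # The function under test: connect and take one reading.
    driver.connect()
    return driver.read()

def test_poll_once_returns_expected_points():
    readings = poll_once(FakeEnergyMeter())
    assert set(readings) == {"energy_kwh", "voltage"}
    assert readings["energy_kwh"] >= 0
```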
While getting on this journey of restructuring the edge middleware, we have internally started looking at the edge and end nodes together as a micro-platform, which requires thinking through how a wide variety of end nodes communicate (in a seamless manner) with the edge gateway, how intelligent decisions are made at the edge, and how the API layer communicates effectively with the cloud based device management layer.
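The API layer between the gateway and the cloud can be pictured as a small sync loop that batches readings locally and posts them upstream when connectivity allows; the endpoint URL, payload shape and batching policy below are assumptions for illustration, kept deliberately simple with low-bandwidth sites in mind.

```python
# Hypothetical sketch of the gateway-to-cloud API layer: buffer readings locally
# and upload them in batches. The URL and payload shape are illustrative assumptions.
import json
import time
import urllib.request

CLOUD_ENDPOINT = "https://example.invalid/api/v1/readings"  # placeholder URL

buffer = []

def enqueue(reading: dict) -> None:
    buffer.append(reading)

def flush() -> None:
    if not buffer:
        return
    body = json.dumps({"gateway_id": "gw-001", "readings": buffer}).encode("utf-8")
    req = urllib.request.Request(CLOUD_ENDPOINT, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        if resp.status == 200:
            buffer.clear()

if __name__ == "__main__":
    enqueue({"sensor": "energy_meter_1", "value": 12.4, "ts": time.time()})
    try:
        flush()
    except Exception:
        # The placeholder URL is not a real endpoint; in practice the batch would
        # be retried later, which also suits intermittent 2G/3G connectivity.
        pass
```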
As the IoT stack evolved, we realised that we needed to bring together all the services we have been building to manage the 1000s of devices that we have installed and continue to install. Thus came the Device Management platform, through which we are structurally unifying the different services (such as remote updates, internally called OTA or Over The Air service, and Device Shadow) in the cloud with a suitably planned edge SDK and a user interface (developed using the same configurable dashboard platform that we use for our IoT stack). While our internal team is already using this unified platform, we are enhancing it to get it ready for many of our partners who help us deploy and maintain the systems across geographies. More on this when I get to writing about the next version of our stack, a couple of years down the road.
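The device shadow idea underlying such a platform can be illustrated in a few lines of plain Python (a conceptual sketch, not the actual service): the cloud holds a desired state, the gateway reports its actual state, and the delta between the two drives what gets pushed down, such as an OTA update.

```python
# Conceptual sketch of a device shadow: compute the delta between what the cloud
# wants ("desired") and what the gateway last reported ("reported").
def shadow_delta(desired: dict, reported: dict) -> dict:
    """Return the settings that still need to be pushed to the device."""
    return {k: v for k, v in desired.items() if reported.get(k) != v}

desired = {"firmware": "2.4.1", "sampling_interval_s": 30, "relay_1": "on"}
reported = {"firmware": "2.3.9", "sampling_interval_s": 30, "relay_1": "on"}

# Only the firmware differs, so only an OTA update would be scheduled.
print(shadow_delta(desired, reported))   # {'firmware': '2.4.1'}
```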
The current architecture is shown in Figure 3.
A lot of effort and planning has gone into evolving the technology from the early stage architecture (shown in my earlier article) to the one shown in the figure above. Lots of detailed discussion and research went into every single change, carefully evaluating the different options through multiple Proofs of Concept (PoCs) before deciding whether we should even move forward in that direction and, if so, with which of the possible options.
Looking forward to scaling up the technology for the next leg of growth at Zenatix.