At Wimdu we are working hard to move to a microservice architecture. As one of the first projects in this effort, we started to integrate some of our main application’s data with a 3rd party service. It looked like an ideal project, as it helped us reach the two main technical targets we had set: first, to build basic tooling that other teams could reuse in the future for a smoother move to microservices, and second, to build up knowledge of distributed systems in the company by facing and solving real-world problems. And believe us, we did face many problems.
Our team’s current understanding is that a microservices architecture gives the biggest advantages over a big monolithic system once teams start suffering from a slow development pace. A slow development pace is usually a good signal that the system has become too big and hard to maintain, and that it is time to try architectural solutions that were overkill in the past but might fit the current needs.
With respect to the micro part, we think it has a very good marketing aura (everybody is talking about microservices), but it can also lead to misguided approaches and useless discussions, namely: how micro is micro? How many lines of code must I have in the service? How do I make sure I am not insulting the microservices gods with services that are too big? The answer, as always, is to apply common sense. The final goal of microservices is a distributed, loosely coupled system of independently developed and deployed services, each with its own responsibility. A service should be of a size that its developers can comfortably support and develop, and it must follow the single responsibility principle. I would say: do not focus on the micro part. It is the last topic we should worry about; the other technical topics are much more important. When you start extracting services from the big monolithic application, you need to make sure you are extracting a bounded context, and a bounded context is usually big. It is OK to start by splitting the bounded context into a couple of big services. To reach a good architectural solution you will probably iterate several times until you arrive at ‘pretty small’ services, each of them fulfilling the goal of ‘1 service, 1 responsibility’. Once your services feel ‘pretty small’ and each of them does one thing and does it well, you can agree that you have a good architectural solution.
For our particular system we decided to produce an event stream from the main application to a message bus and to consume this event stream in our microservices, which in turn synchronise the data to the 3rd party API. With this setup we encountered many common problems in distributed systems: applications failing, network errors, duplicate events, events consumed in a different order than they were produced, stale events, race conditions…
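To make the duplicate and stale event problems more concrete, here is a minimal sketch of a consumer that guards against both. It is not our production code: the `Event` fields (`event_id`, `entity_id`, `version`) and the `Consumer` class are hypothetical names chosen for illustration, and the sketch assumes the producer attaches a unique ID and a per-entity version counter to every event.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    event_id: str   # assumption: unique ID assigned by the producer
    entity_id: str  # the entity this event describes
    version: int    # assumption: per-entity counter that grows with each change
    payload: dict


@dataclass
class Consumer:
    seen_ids: set = field(default_factory=set)
    latest_version: dict = field(default_factory=dict)
    applied: list = field(default_factory=list)

    def handle(self, event: Event) -> str:
        # Duplicate delivery: the same event was already processed.
        if event.event_id in self.seen_ids:
            return "duplicate"
        self.seen_ids.add(event.event_id)

        # Stale or out-of-order event: a newer version was already applied.
        if event.version <= self.latest_version.get(event.entity_id, 0):
            return "stale"

        self.latest_version[event.entity_id] = event.version
        self.applied.append(event)  # stand-in for the call to the 3rd party API
        return "applied"
```

In a real deployment the `seen_ids` and `latest_version` state would have to live in storage shared by all consumer instances; keeping it in memory only works for a single process.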
We always kept in mind that any part of the system can fail, and this motto was the driving force behind most of our design decisions. We haven’t regretted that. When you build resiliency into the implementation, it pays off later. For example, we had many different failures, from RabbitMQ failing to a 3rd party service being inaccessible for several hours. We would not have expected them to happen that way, but they weren’t actually a problem thanks to the design of the system. A system built this way is easier to maintain and much more flexible: when something breaks, it is much easier to bring it back to the correct state.
We also had a particularly interesting problem: simultaneous updates of the same entity by different instances of the same service through 3rd party APIs. Here we had race conditions and, as a result, uncertain results of the update operation and possibly an incorrect state of the entity in the 3rd party system. In cases like this you either need to make the update operations idempotent or synchronise the services using locking. Idempotent operations are impossible in our case because we are using a 3rd party API which we cannot change. That means the solution available for our problem was distributed locking. We will cover this in future posts.
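Until those posts are out, the general idea of per-entity locking can be sketched roughly as follows. This is an illustration under stated assumptions, not the mechanism we run in production: `InMemoryLockStore` is a hypothetical in-process stand-in for a store shared by all service instances, and `update_entity` and the 30 second expiry are made up for the example.

```python
import threading
import time


class InMemoryLockStore:
    """Hypothetical stand-in for a lock store shared by all service instances."""

    def __init__(self, clock=time.monotonic):
        self._locks = {}                 # key -> (owner, expires_at)
        self._guard = threading.Lock()   # makes check-and-set atomic in-process
        self._clock = clock

    def acquire(self, key, owner, ttl_seconds):
        with self._guard:
            now = self._clock()
            holder = self._locks.get(key)
            if holder is not None and holder[1] > now:
                return False             # somebody else holds a live lock
            self._locks[key] = (owner, now + ttl_seconds)
            return True

    def release(self, key, owner):
        with self._guard:
            holder = self._locks.get(key)
            if holder is not None and holder[0] == owner:
                del self._locks[key]     # only the owner may release
                return True
            return False


def update_entity(store, entity_id, owner, do_update):
    """Run do_update only while holding the lock for this entity."""
    key = f"entity:{entity_id}"
    if not store.acquire(key, owner, ttl_seconds=30):
        return False                     # caller can requeue the event and retry
    try:
        do_update(entity_id)             # stand-in for the 3rd party API call
        return True
    finally:
        store.release(key, owner)
```

The expiry time is there so that a service instance that crashes while holding a lock does not block the entity forever, which fits the "any part of the system can fail" motto.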
Another interesting use case we had was the need to synchronise all the existing data with this 3rd party API before we could start serving updates through an event stream. What is more, we had to do a full resynchronisation a couple of times when we found bugs or developed new features. The evented architecture helped to solve this problem. We set up two event streams: the first for normal application events and the second for synchronisation events. The synchronisation events can be generated from a snapshot of the database. With these two streams we got the flexibility of stopping all the consumers of normal events while doing a full resynchronisation. The events produced by the main application in the meantime still accumulate in the RabbitMQ queue, and we can consume them after the full resynchronisation is done. This helps when we want to avoid overwriting the newest data with old synchronisation events.
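The interplay between the two streams can be sketched with two in-memory queues. This is a simplified model rather than our RabbitMQ setup: the `TwoStreamSync` class and its method names are hypothetical, and the `state` dictionary stands in for the data held by the 3rd party system.

```python
from collections import deque


class TwoStreamSync:
    """Toy model: one queue for normal events, one for synchronisation events."""

    def __init__(self):
        self.normal = deque()
        self.sync = deque()
        self.normal_paused = False
        self.state = {}  # stand-in for the data in the 3rd party system

    def publish_normal(self, entity_id, value):
        # The main application keeps publishing even while consumers are paused.
        self.normal.append((entity_id, value))

    def start_resync(self, snapshot):
        # Stop normal consumers, then generate sync events from the snapshot.
        self.normal_paused = True
        for entity_id, value in snapshot.items():
            self.sync.append((entity_id, value))

    def drain_sync(self):
        while self.sync:
            entity_id, value = self.sync.popleft()
            self.state[entity_id] = value
        self.normal_paused = False  # resync done, normal consumers may resume

    def drain_normal(self):
        if self.normal_paused:
            return 0  # events stay queued until the resync has finished
        handled = 0
        while self.normal:
            entity_id, value = self.normal.popleft()
            self.state[entity_id] = value
            handled += 1
        return handled
```

Because the normal events are only consumed after the synchronisation queue is empty, a change made during the resynchronisation is applied last and is not overwritten by the older snapshot value.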
For normal events we usually run 6 instances of our consumer services; according to our measurements of the regular workload this is enough. But a full resynchronisation is a different story: in this case we have a peak workload of millions of events in the queue. To handle that we spawn many more consumer services, around 150 in our case. Fortunately it is very easy and fast to spawn many Docker containers, and you can automate everything. The consumer services in this system are neither CPU-bound nor memory-bound, they are API-bound, and when we spawn more than 150 services the API starts failing.
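The sizing reasoning can be written down as a small calculation. Only the numbers 6 and 150 come from our setup; the function `consumers_needed`, the throughput of 100 events per consumer per minute and the one hour target are hypothetical values picked to show the shape of the arithmetic.

```python
import math


def consumers_needed(backlog, events_per_consumer_per_min, target_minutes, api_safe_max):
    """Estimate how many consumers to spawn, capped by what the API tolerates."""
    # Consumers required to clear the backlog within the target time.
    wanted = math.ceil(backlog / (events_per_consumer_per_min * target_minutes))
    # Never go below one consumer, never above the API-imposed ceiling.
    return max(1, min(wanted, api_safe_max))


# Hypothetical regular workload: a small backlog needs only a handful of consumers.
regular = consumers_needed(30_000, 100, 60, api_safe_max=150)

# Hypothetical resynchronisation: millions of events hit the API ceiling.
resync = consumers_needed(3_000_000, 100, 60, api_safe_max=150)
```

With these assumed inputs the resynchronisation case is limited by the API ceiling rather than by the backlog, which is what being API-bound means in practice.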
One more thing we learned the hard way is that if you are going to apply microservices in your architecture, you need to involve real DevOps engineers in the migration. By DevOps engineers we mean engineers with a focus on system administration who are part of the development team, not part of a separate team, and who work closely with developers on the technical solutions. The only difference between DevOps and Software Engineers worth mentioning is that DevOps people spend most of their time on the infrastructure level and Software Engineers mostly on the application level, but they also work in pairs in order to share knowledge inside the team and achieve a better understanding for everyone involved. You need this setup because if you have a complex monolithic application which is becoming extremely hard to maintain, with microservices you are moving this complexity towards the infrastructure. The infrastructure is going to be much harder to maintain than before, but this effort will eventually relieve some pressure from the application developers. In microservices, infrastructure is the biggest complexity point, and if you have no DevOps resources to support it, do not expect the migration to go quickly, effectively and smoothly. Right now we are working hard to build these diverse teams. In the end, the objective is for all of our developers to have knowledge about systems, but that will take time, and having good DevOps engineers around will always be a necessity.
The main advantages we have seen in the microservices architecture are:
Currently our infrastructure efforts go towards continuous deployment on AWS: Docker images of our services are built automatically after a push to a branch and run on EC2 machines provisioned by Puppet.
In future blog posts we will focus on some technical details of the solutions we are applying here, so stay tuned!