How We Scale Up Bukalapak Engineering Team

Author's Note: This post was originally published as an article on LinkedIn.

Bukalapak was founded seven years ago, and in the past five years alone it has scaled up more than 6000x, rapidly outpacing various established players in the Indonesian market and becoming one of the largest e-commerce companies in Southeast Asia, with more than $1 billion in annual transactions and more than 2 billion monthly views from nearly 70 million visitors.

Compared to other players in the market, we have an atypical strategy for achieving significant growth and scale. Rather than spending senseless amounts on promotions or massive subsidies to acquire unsustainable traffic and activations, we focus on running hundreds of growth-hack experiments and making countless incremental improvements to the quality and experience of our product and service.

This strategy relies heavily on a robust, top-class engineering team that can scale up fast enough to keep pace with that growth. As we scaled up, we found that scaling an engineering team is not just about adding more engineers, but also about building the right kind of culture and organization, one capable of supporting productive collaboration and knowledge sharing among hundreds of engineers.

This focus on scaling up our engineering team began eighteen months ago, when Bukalapak asked me to join the management team as VP of Engineering. At that time, we had started to experience growing pains and bottlenecks as we scaled beyond our first couple dozen engineers. In this post, I am going to share my personal experience in leading and transforming the engineering team at Bukalapak to achieve its current scalability.

Metrics for engineering scalability

One of the first things I did during my first month at Bukalapak was to establish a set of metrics that we could use to objectively measure the overall productivity of our engineering team as we transformed our development process. After thorough engagement with dozens of engineers and stakeholders, we converged on the following set of metrics: team size, development velocity, and engineering quality. We believe any strategy for achieving excellent engineering scalability needs to track all three metrics and maintain a healthy balance between them.

Team Size Metric

This metric is probably the easiest to track and measure. We simply define it as the number of engineers in our engineering team. Scaling up this metric requires us to identify bottlenecks in our recruiting pipeline while keeping a careful balance between hiring the right talent and meeting our growth demands.

Development Velocity Metric

We chose this metric to give us visibility into whether our development output is slowing down because of bottlenecks. Since we use Scrum, we can define this metric as the total number of stories (including bug-fix stories) deployed to production during a specific period. The complexity of each story may vary, but we found that when sampling across the entire group of teams, individual and seasonal variance tends to smooth out, and we get a pretty reliable metric for the total velocity of the engineering team.

Engineering Quality Metric

A sole focus on velocity is detrimental if we do not also take into account the quality of the output that we deliver to production. We define this metric as the number of emergency-level incidents or bugs occurring in production. Alongside it, we devised a strategy to triage each production incident, estimate its impact, and map it to one of several existing severity levels. Thanks to the wealth of data that we collect in production, in most cases we can estimate the impact of an incident with reasonable accuracy within minutes of becoming aware of the issue.

One year's worth of data

We meticulously log and track the above metrics on a weekly basis, making it easy for us to see the progression of our engineering team and compare the data over time. The following chart shows one year's worth of that data, aggregated by month and using the first month's data as the general baseline.

In the first few months of our transformation, we can see rapid growth in total velocity as we addressed the lowest-hanging fruit in our laundry list of productivity bottlenecks. We are also delighted that we were able to maintain our productivity afterward, even as the size of our team more than doubled.

Perhaps the most satisfying data in the chart is the number of emergency incidents, which we managed to bring lower than before even as our total velocity nearly tripled. There is genuine truth to the adage that the more complex a system is, the more bugs it has, so this was far from a given.

We were quite concerned in the second month when we observed that, as our total velocity grew, the number of emergency incidents caused by our bugs also increased linearly. Fortunately, we were able to deploy various quality improvements and painstakingly beat this number down month by month. After twelve months, measured by the ratio of incidents per story, we had improved our engineering quality by 10x.
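To make the arithmetic behind these numbers concrete, here is a minimal Python sketch of how weekly logs could be rolled up by month, indexed against the first month, and turned into an incidents-per-story ratio. The function names and the sample figures are hypothetical and chosen only for illustration; they are not the actual tooling or data behind the chart.

```python
from collections import defaultdict


def monthly_totals(weekly_records):
    """Aggregate weekly (month, stories, incidents) records into monthly totals."""
    totals = defaultdict(lambda: {"stories": 0, "incidents": 0})
    for month, stories, incidents in weekly_records:
        totals[month]["stories"] += stories
        totals[month]["incidents"] += incidents
    return dict(totals)


def indexed_to_baseline(totals, baseline_month):
    """Express each month's totals as a multiple of the baseline month.

    Assumes the baseline month has at least one story and one incident.
    """
    base = totals[baseline_month]
    return {
        month: {
            "velocity": t["stories"] / base["stories"],
            "incidents": t["incidents"] / base["incidents"],
        }
        for month, t in totals.items()
    }


def quality_improvement(totals, baseline_month, month):
    """Return how many times lower incidents-per-story is than at baseline."""
    base, cur = totals[baseline_month], totals[month]
    if cur["incidents"] == 0:
        return float("inf")  # no incidents at all in the compared month
    # (base incidents / base stories) / (current incidents / current stories),
    # rearranged into a single division.
    return (base["incidents"] * cur["stories"]) / (cur["incidents"] * base["stories"])


# Hypothetical weekly logs: (month, stories deployed, emergency incidents).
weekly = [
    ("M1", 25, 3), ("M1", 25, 2), ("M1", 25, 3), ("M1", 25, 2),
    ("M12", 75, 1), ("M12", 75, 0), ("M12", 75, 1), ("M12", 75, 1),
]
totals = monthly_totals(weekly)
index = indexed_to_baseline(totals, "M1")
improvement = quality_improvement(totals, "M1", "M12")
```

With these made-up figures, velocity in M12 is 3 times the baseline, incidents fall to 0.3 of the baseline, and the incidents-per-story ratio improves tenfold. Indexing everything to the first month keeps three very different units (people, stories, incidents) comparable on a single chart.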

What did we do?

There is no silver bullet behind our scale-up. It was a continuous team effort spanning more than a year, in which we executed dozens of action plans to eliminate productivity bottlenecks, streamline our development process, and empower people by giving them the support and trust they need to work efficiently.

That being said, several action plans stand out for having had a more significant impact than the rest:

  • Foster a healthy sharing, helping, and learning culture. This was already prominent in our culture when I joined Bukalapak; all we needed to do was provide a framework to nurture it. For example, we built a chat bot and created a Telegram group where people can give virtual points, called high-fives, to each other as thanks for help at work. We also encouraged knowledge and learning guilds to form and self-organize. Guilds are internal communities centered on a particular work-related domain that anyone can join to learn together and take turns presenting exciting topics to each other. Anyone can start a guild, provided there is enough interest in the subject. So far, dozens of guilds have sprung up on their own, covering topics ranging from Artificial Intelligence to Agile Methodologies to Software Craftsmanship.
  • Daily release trains. From early on, we have been able to deploy or roll back our services in production at a moment's notice and with zero downtime, and we sometimes did so more than a hundred times per day across all of our services. Switching to fixed release trains, three times a day, actually put the brakes on our deployment frequency, but in exchange we gained more stable release checks and easier rollback if a release goes haywire.
  • Canary release. For each daily release, we first deploy the release to a small subset of users, less than one percent of the overall user base, and observe whether it causes any spikes in resource usage or errors. If the release is problematic, we can roll it back within minutes. Our services are entirely stateless, so we had to devise a way to emulate sticky sessions in our load balancer. The adoption of canary releases is the primary cause of the significant drop in emergency incidents at M9 in the chart above.
  • More real-time monitoring and alerting. We already had various technical-level instrumentation and monitoring in place, but that monitoring cannot capture mistakes in our logic or presentation layers. For example, if we gather and monitor transaction data in real time, we will know within minutes that there is a bug in our checkout system when the number of transactions drops precipitously. Over the past year, we have worked together with the data science team to collect billions of data points per day and build thousands of high-level data visualizations that give us glanceable insight into the health of our system.
  • Use a consistent development methodology. When we started our transformation, the product teams had no agreed-upon development methodology; each of them was free to define its own, or even to adopt no methodology at all. The result was, predictably, quite chaotic, and it was especially hard to align any plans that required cross-team collaboration. We decided to adopt Scrum as our methodology, mainly for the sake of better planning, predictability, cross-team alignment, and task organization, but we altered several of its practices since we deemed Scrum as-is to be too rigid for the rapidly changing competitive landscape of e-commerce. These changes were inspired by the eight years of Scrum practice at bol.com, my previous place of work, which also happens to be the largest e-commerce company in the Benelux region.
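Two of the items above, canary releases and business-level alerting, lend themselves to a short sketch. The Python below is only an illustration under stated assumptions, not a description of how this was actually built: `in_canary` shows one common way to get sticky behavior from stateless services, by hashing a stable user identifier so the same user always lands in the same group, and `transactions_dropped` shows a simple threshold check against the median of recent transaction counts. All names, the `salt` value, and the thresholds are hypothetical.

```python
import hashlib
import statistics
from typing import Sequence


def in_canary(user_id: str, percent: float, salt: str = "canary") -> bool:
    """Deterministically decide whether a user belongs to the canary group.

    The same user_id always yields the same answer, which emulates a sticky
    session without storing any per-user state. `percent` is the share of
    users to route to the canary, e.g. 1.0 for one percent.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return bucket < percent * 100  # 1.0 percent maps to 100 of 10,000 buckets


def transactions_dropped(
    current: int, recent: Sequence[int], max_drop: float = 0.5
) -> bool:
    """Flag a precipitous drop in transactions versus recent intervals.

    `recent` holds transaction counts for comparable earlier intervals;
    the median is used as the baseline so one odd interval does not skew it.
    `max_drop` is an assumed tolerance (0.5 means a drop of more than 50%).
    """
    if not recent:
        return False  # nothing to compare against yet
    baseline = statistics.median(recent)
    return current < baseline * (1.0 - max_drop)


# Example: alert when the latest interval falls far below the recent median.
should_alert = transactions_dropped(40, [100, 110, 90, 105, 95])
```

Hashing keeps the assignment stable across requests and servers, which is what a sticky session would otherwise provide. The alert check is deliberately naive; a real system would also need to account for daily and weekly seasonality.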

So what's next for Bukalapak Engineering?

Improving the development process is an ongoing challenge, and we will keep a close eye on our engineering scalability metrics to project our scale-up capacity and catch emerging bottlenecks early on.

Right now we are adopting the model of independent cross-functional teams that are empowered to a significant degree and given an enormous amount of trust and freedom to execute their vision.

This model works well even at our current size of nearly 300 engineers, but we foresee a scalability horizon looming within a couple of years as we yet again double the size of our engineering team. There are other possible models that we can explore and experiment with, and we will share an update about this in the future.

One more thing...

We are still scaling up the team and continue to hire top tech talent to help us connect and empower the economy of millions of Indonesians. Our engineers are a diverse mix of people from all over Indonesia and dozens of former members of the Indonesian diaspora, many of whom returned home for good to join us here at Bukalapak because of our purpose and culture.

Part of the reason why we can provide an exciting home for our talent, and for our diaspora to return home to, is our strong focus on building a healthy tech culture by amalgamating cultures and best practices from all over the world, brought in by our ex-diaspora colleagues.

Interested in joining us and helping improve the economic prosperity of millions of Indonesians? Take a look at our Stack Overflow page and send us your CV. :)