Since the outbreak of COVID-19 and the international lockdown, we've seen a huge increase in the number of people using Crowdcast. We’re so grateful for the opportunity to empower people in making the shift from in-person to online events during these challenging times.

Aside from the uncertainty we’ve all been experiencing, we faced many technical challenges at Crowdcast as we worked to scale to the new levels of demand for online events. We’ve been working hard to resolve the problems that have come up as quickly as possible, and we’re deeply sorry to everyone who was affected by any of these technical issues.

At Crowdcast, one of our core values is transparency, and we believe you should be aware of what we are doing to make things better. So we've decided to create a blog post to keep you plugged into our behind-the-scenes efforts to scale.

Sudden Growth

Crowdcast launched in 2016, and our traffic grew organically at 100% year over year. When COVID-19 and the lockdown happened in March 2020, our growth rate suddenly jumped to around 50% every two weeks.

Increasing number of users on our platform

The number of active users and sessions (live broadcasts) on the platform increased nearly tenfold.

Increase of sessions in Google Analytics

Like the rest of the world, we didn’t expect all of the changes that were coming.

The extreme growth created many challenges for the small team of developers working at Crowdcast before March 2020.

During the 3 month period following the start of the lockdown, our engineering team grew from two to six full-time developers and our customer success team went from two teammates to nine. Our new team had some catching up to do.

Technical Challenges

Software engineering has a lot in common with other engineering disciplines, architecture among them: if you try to build on a weak foundation, you're in for trouble.

In software architecture, the data model and the database system are often referred to as the foundation.

In our case, we built too much of our application logic on Google Firebase, which became a performance bottleneck. And because Firebase was the foundation of a huge part of Crowdcast, it was very difficult to change later. It's a database, and we were using it to store information, handle user logins, and power our chat. When Crowdcast launched, relying on Firebase allowed us to move quickly and iterate, but it caused problems as we grew, because many essential parts of Crowdcast depended on one external service.

When we had multiple large events going on at the same time, they put too much stress on Firebase, which caused timeouts. During a "timeout," the website simply won't load.

For reference, the worst timeout in the history of Crowdcast happened during the Hay Festival, where we ran an event with 36,000 registered users.

charts showing a critical load
High load bringing our site down

Luckily, the Hay Festival organizers kept their cool. The event went well, and over the next 10 days they brought half a million people together for the online festival. Audience members who couldn't get in live during that time were able to watch the replay.

Working with an online festival watched by hundreds of thousands of people taught us new ways to identify performance bottlenecks in the future.

Finding Solutions

Thousands of our users were affected every time we had a timeout—so the engineering team was eager to come up with solutions quickly.

Understanding our traffic

It was hard for us to tell what exactly was causing the high load on our database, because Firebase doesn't offer that information directly. But we were able to set up a system to track that data every two minutes, and that gave us some insights into where spikes were happening.
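The idea behind that sampler can be sketched roughly as follows. This is a minimal illustration, not Crowdcast's actual code: `sampleDatabaseLoad` and the per-path request counts are hypothetical stand-ins for whatever metric source is available, since Firebase doesn't expose this information directly.

```typescript
// Sketch: record a load sample every two minutes and flag spikes.
// The shape of the sample and the threshold are illustrative assumptions.

type LoadSample = { timestamp: number; requestsPerPath: Record<string, number> };

const history: LoadSample[] = [];

function recordSample(sample: LoadSample, spikeThreshold = 1000): string[] {
  history.push(sample);
  // Report any database path whose request count crosses the threshold.
  return Object.entries(sample.requestsPerPath)
    .filter(([, count]) => count > spikeThreshold)
    .map(([path]) => path);
}

// In production this would run on a timer, e.g.:
// setInterval(() => recordSample(sampleDatabaseLoad()), 2 * 60 * 1000);
```

Even a simple sampler like this makes it possible to correlate load spikes with specific parts of the data model.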

Caching, caching, caching

Once we understood which parts of the system were causing the most traffic, we started to look for ways to optimize.

One solution we found was caching, which is a way to speed things up by storing information so it can be accessed faster. Caches are useful for user information that doesn't change often, like usernames, avatars, and event details. Caching helped take load off the database, especially during times of higher traffic to the site.

By making these changes, we made great progress in keeping the database from getting overloaded. But caching has its downsides too: a common bug it introduces is serving stale data. So whenever a user updated info like their username or profile picture, the cache needed to be refreshed to reflect the changes.
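The pattern described above, a cache that expires entries after a while and is explicitly invalidated when the underlying data changes, can be sketched like this. The class and names are illustrative, not Crowdcast's actual implementation:

```typescript
// Minimal in-memory cache with a time-to-live (TTL) and explicit invalidation.

class ProfileCache {
  private store = new Map<string, { value: string; expires: number }>();

  // `now` is injectable so the cache can be tested without real clocks.
  constructor(private ttlMs: number, private now: () => number = Date.now) {}

  get(userId: string): string | undefined {
    const entry = this.store.get(userId);
    if (!entry || entry.expires < this.now()) return undefined; // miss or expired
    return entry.value;
  }

  set(userId: string, value: string): void {
    this.store.set(userId, { value, expires: this.now() + this.ttlMs });
  }

  // Called whenever a user edits their profile, so readers never see stale data.
  invalidate(userId: string): void {
    this.store.delete(userId);
  }
}
```

The key design point is the `invalidate` call: the TTL alone bounds staleness, but invalidating on write means a profile change is visible immediately.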

person playing whack-a-mole game
When our team fixed one set of problems, new issues popped up

Updating our chat

We also found that large events with an active chat were causing problems. Firebase pushed every new chat message to every viewer, which put even more load on it, and in the worst cases caused severe outages of 10-20 minutes for the whole site.

To solve this, we architected a new custom solution and moved the chat into its own separate microservice. The serverless technologies we use for the chat can now automatically scale with traffic, because the chat is independent of the other parts of Crowdcast.
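The isolation idea can be sketched as a small publish/subscribe service: chat fan-out lives in its own component with its own subscribers, so a burst of messages never touches the main database. This is a toy in-memory version to show the shape of the design; a real deployment would sit behind a managed pub/sub or WebSocket layer.

```typescript
// Sketch: a standalone chat service that fans out messages to subscribers
// per event, entirely outside the main database. Names are illustrative.

type ChatMessage = { eventId: string; author: string; text: string };
type Listener = (msg: ChatMessage) => void;

class ChatService {
  private listeners = new Map<string, Listener[]>();

  subscribe(eventId: string, listener: Listener): void {
    const list = this.listeners.get(eventId) ?? [];
    list.push(listener);
    this.listeners.set(eventId, list);
  }

  publish(msg: ChatMessage): number {
    const list = this.listeners.get(msg.eventId) ?? [];
    list.forEach((l) => l(msg));
    return list.length; // how many viewers received the message
  }
}
```

Because the chat service owns its own state and connections, a flood of messages in one event stresses only this component, not the rest of the platform.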

Moving Forward

As a result of the increased traffic due to COVID, we’ve implemented many improvements under the hood to increase our stability and resiliency overall—but clearly there is more to do.

We've recently experienced minor outages due to third-party services we rely on to make Crowdcast work. One solution we're implementing is setting up these services in multiple regions. So even if service to one region goes down, we’ll still have a back-up to make sure we stay online.
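The failover logic behind that back-up can be sketched as "try regions in order until one succeeds." This is a simplified synchronous illustration (real service calls would be asynchronous), and the region list is hypothetical:

```typescript
// Sketch: try each region's fetcher in order; fall back to the next one if
// the current region fails. Throws only if every region is down.

function withFailover<T>(regions: Array<() => T>): T {
  let lastError: unknown;
  for (const tryRegion of regions) {
    try {
      return tryRegion();
    } catch (err) {
      lastError = err; // this region is down; try the next one
    }
  }
  throw lastError;
}
```

Usage would look something like `withFailover([() => fetchFrom("us-east"), () => fetchFrom("eu-west")])`, where `fetchFrom` is whatever client the third-party service provides.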

In addition to these fixes, we're working on a massive platform upgrade with a new foundation that we’re hoping to release this fall. Our main focus for the next version of Crowdcast is performance, platform stability, and accessibility.

We take the responsibility of being a space where millions of people come together to communicate very seriously, and we want to thank everyone who’s been using Crowdcast over the past few months for your patience as we’ve been moving through these growing pains.

We're more excited than ever about the road ahead as we continue towards our vision of being a global plaza for live video conversations. See you on a crowdcast soon!