Storm Forecast
Blog on APIs, Load Testing, Performance and StormForger.
https://stormforger.com/blog
2020-07-29T23:00:00+00:00
Various authors, mainly StormForger GmbH, https://stormforge.io
Open? Or Closed? On Workload Models for Performance Testing
https://stormforger.com/blog/open-closed-workloads/
2020-07-29T23:00:00+00:00
Sebastian Cohnen
<p>There are many differentiating properties when it comes to load and performance testing tools. The <em>general workload model</em> is one aspect that is often overlooked. However, tests that run using the wrong workload model can vastly underestimate latencies and provide a false sense of security. In the following we explain what workload models are and we shed some light on <em>why</em> we believe that StormForger is using the “right” approach for many, if not most cases.</p>
<p>Workload models are a topic we talk about a lot when giving presentations, doing product demos or onboarding our customers. We think a basic understanding of them is essential for good, realistic performance tests.</p>
<p>At first sight, workload models are a boring, seemingly "theoretical" topic. As with many theoretical topics, it turns out that workload models have a <strong>very substantial impact</strong> in practice. They greatly influence <em>what</em> you are actually testing and are probably far more relevant than one might think. This also explains why you get vastly different results with different performance testing tools (because they implement different models). I'd also argue that this fundamental principle is often simply overlooked when choosing a tool to run any kind of performance test.</p>
<p><strong>Workload</strong> describes a unit of work that is being executed, e.g., by a simulated agent and is usually a series of requests and other steps. This could be one or more business transactions like "load start page", "search for product", "add to basket", "begin checkout", …</p>
<p><strong>Workload models</strong> describe the basic principle <em>how</em> a defined workload is executed in order to perform a performance test. In this article, we want to differentiate between <em>open</em> and <em>closed</em> workload models.</p>
<h2 id="closed-workloads">Closed Workloads</h2>
<p>In a closed workload model, you define a <strong>fixed number of concurrent agents</strong>, isolated from one another, each performing a defined sequence of tasks (the workload) over and over again in a loop. There may be a pause between iterations, but that is not important for this article.</p>
<p>Here is some pseudo code to give you an idea of how a closed workload tool works in principle:</p>
<div class="highlight"><pre class="highlight ruby"><code><span class="k">for</span> <span class="mi">1</span> <span class="n">to</span> <span class="vg">$concurrency</span> <span class="k">do</span>
  <span class="nb">fork</span> <span class="k">do</span>
    <span class="nb">loop</span> <span class="k">do</span>
      <span class="n">executeWorkload</span><span class="p">()</span>
      <span class="nb">sleep</span><span class="p">(</span><span class="vg">$iterationDelay</span><span class="p">)</span>
    <span class="k">end</span>
  <span class="k">end</span>
<span class="k">end</span>
</code></pre></div>
<p>A typical, well known and very simple example for a closed model is <a href="https://en.wikipedia.org/wiki/ApacheBench">Apache Bench</a> (or <code>ab</code> for short). With <code>ab</code> you basically state how much concurrency you want and what you want to hit and it will try to perform those requests as fast as possible. <a href="https://jmeter.apache.org">JMeter</a> is another example of a closed model load generator.</p>
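<p>As a concrete companion to the pseudo code, here is a runnable sketch of a closed-model generator in Python (an illustration only; the function and variable names are made up and belong to no particular tool). A fixed pool of agents runs the workload in a loop, and each agent starts a new iteration only after finishing the previous one:</p>

```python
import threading

def run_closed_model(execute_workload, concurrency, iterations):
    """Closed model: a fixed number of agents, each looping over the workload."""
    def agent():
        for _ in range(iterations):
            execute_workload()  # next iteration starts only when this returns
    threads = [threading.Thread(target=agent) for _ in range(concurrency)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

completed = []
lock = threading.Lock()

def workload():
    # Stand-in for "load start page", "search", "add to basket", ...
    with lock:
        completed.append(1)

run_closed_model(workload, concurrency=4, iterations=10)
print(len(completed))  # 4 agents x 10 iterations = 40
```

<p>If one workload call blocks, that agent schedules nothing new until the call returns: the generated load is coupled to the behaviour of the system being tested.</p>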
<h2 id="open-workloads">Open Workloads</h2>
<p>In an open workload model, you have a <strong>defined rate of arrival</strong> at which new agents are spawned. Agents are isolated from one another, performing a defined sequence of tasks and are terminated when they have finished. Started agents are independent from one another and new agents are launched regardless of the state of currently active agents.</p>
<p>Again, here is pseudo code showing how an open model implementation looks:</p>
<div class="highlight"><pre class="highlight ruby"><code><span class="nb">loop</span> <span class="k">do</span>
  <span class="nb">fork</span> <span class="k">do</span>
    <span class="n">executeWorkload</span><span class="p">()</span>
  <span class="k">end</span>
  <span class="nb">sleep</span><span class="p">(</span><span class="mi">1</span><span class="o">/</span><span class="vg">$rateOfArrival</span><span class="p">)</span>
<span class="k">end</span>
</code></pre></div>
<p>StormForger uses an open workload model with the ability to mimic a closed model.<sup id="fnref:sf_closed_workload" role="doc-noteref"><a href="#fn:sf_closed_workload" class="footnote" rel="footnote">1</a></sup> Another example of an open workload model is <a href="http://tsung.erlang-projects.org">tsung</a> (which we've been using internally <a href="/blog/load-testing-engine-evolution/">for many years</a>).</p>
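<p>For comparison, here is a runnable Python sketch of an open-model generator (again an illustration with made-up names, not any tool's API). A new agent is spawned at a fixed arrival rate, regardless of whether earlier agents have finished:</p>

```python
import threading
import time

def run_open_model(execute_workload, rate_per_s, total_agents):
    """Open model: spawn a new, independent agent every 1/rate seconds."""
    threads = []
    for _ in range(total_agents):
        t = threading.Thread(target=execute_workload)
        t.start()           # started regardless of the state of earlier agents
        threads.append(t)
        time.sleep(1.0 / rate_per_s)
    for t in threads:
        t.join()
    return len(threads)

# Each agent needs 0.2 s, but at 50 arrivals/s a new agent starts every
# 20 ms -- so many agents are active at once and none waits for another.
started = run_open_model(lambda: time.sleep(0.2), rate_per_s=50, total_agents=20)
print(started)  # 20
```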
<h2 id="whats-the-big-deal">What’s the big deal?</h2>
<p>One obvious difference between the models is that closed workload systems are easier to reason about: You know the number of agents in the system beforehand (it's constant). For open models this number is given by <a href="https://en.wikipedia.org/wiki/Little's_law">Little's Law</a>.<sup id="fnref:littles_law" role="doc-noteref"><a href="#fn:littles_law" class="footnote" rel="footnote">2</a></sup></p>
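<p>Applied to a load test, Little's Law makes the number of active agents in an open model easy to estimate (a plain calculation, using the same numbers as the footnote):</p>

```python
def avg_active_agents(arrival_rate_per_s, avg_time_in_system_s):
    """Little's Law, L = lambda * W, for an open-model load test."""
    return arrival_rate_per_s * avg_time_in_system_s

# 100 new agents per second, each spending 15 s in the system
# (requests plus think times):
print(avg_active_agents(100, 15))  # 1500
```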
<p>To quickly re-iterate: Closed model systems have a fixed number of agents executing workloads. New work is only scheduled for an agent when it is done with the previous one. This leads to a fundamental drawback: <strong>The system under test (SuT) coordinates the test itself.</strong> If the SuT slows down or stalls, the entire test is impacted. During this time, all agents waiting for responses also stall, no new requests are made and load is taken away from the SuT, which in turn allows the system to recover. Currently active agents in an open model system would also be impacted, but the crucial difference is that new agents continue to arrive at the system. This keeps the pressure up at the target and better mimics real world situations like marketing campaigns: A slow shop experience will not stop customers from clicking ads, hitting reload or opening newsletters.</p>
<p>Imagine what happens to the <em>rate of new transactions</em> per time when the System under Test has a hiccup:</p>
<p><img src="/blog/open-closed-workloads/open-closed-ae0ead2f.png" alt="New transactions started with open VS closed models" /></p>
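<p>The chart can be reproduced with a deliberately simplistic toy model (made-up numbers; it assumes one outstanding request per closed-model agent and a complete stall of the SuT from t=3s to t=6s):</p>

```python
def transactions_started(model, horizon_s=10, rate=5, stall=range(3, 6)):
    """New transactions started per second while the SuT stalls for 3 s.

    Closed model: `rate` agents, each waiting on one outstanding request,
    so nothing new starts during the stall. Open model: arrivals are
    independent of the SuT and simply keep coming."""
    if model == "open":
        return [rate] * horizon_s
    return [0 if t in stall else rate for t in range(horizon_s)]

print(transactions_started("open"))    # [5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
print(transactions_started("closed"))  # [5, 5, 5, 0, 0, 0, 5, 5, 5, 5]
```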
<p>There are also other issues regarding closed models, like the "Coordinated Omission Problem", a term coined by Gil Tene.<sup id="fnref:coordinated_omissions" role="doc-noteref"><a href="#fn:coordinated_omissions" class="footnote" rel="footnote">3</a></sup></p>
<p>True open models only exist in theory, though. The number of active clients is technically unbounded, but resources are limited, so there are practical upper limits. In case you are hitting resource limits with your open model tests, they should be considered inconclusive, discarded and repeated with more testing resources (or with a lower traffic model). This is the main reason why it is critically important to monitor the test itself closely, which is what StormForger does automatically for every executed test.</p>
<h2 id="conclusion">Conclusion</h2>
<p>There are good reasons to use open and closed workload models. It is important to know the difference though.</p>
<p>We at StormForger believe that the open workload model is the one you probably want to use for many scenarios. It reflects better what happens in real world situations and it is better suited to detect problems with your system under test in many cases. Open models are harder to reason about but it's worth it to ensure your load tests are realistic and your system can perform in production as expected.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:sf_closed_workload" role="doc-endnote">
<p>Mimicking a closed workload model in StormForger is possible. You can specify an upper bound of clients to start per arrival phase. Using this, in combination with an endless loop around each scenario you effectively get a closed workload model. <a href="#fnref:sf_closed_workload" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:littles_law" role="doc-endnote">
<p>Little's Law (queueing theory) states that for a stable system <em>L = λW</em>, where <em>L</em> is the average number of customers in the system, <em>λ</em> is the average rate of arrival and <em>W</em> the average time a customer spends in the system.</p>
<p>This can be directly translated to an open model performance test: With an arrival rate of 100 per second and an average time in the system of 15 sec (including requests and think times), the average number of active agents is: 100 arrivals/sec * 15 sec = 1500 clients. <a href="#fnref:littles_law" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:coordinated_omissions" role="doc-endnote">
<p>The <em>"Coordinated Omission Problem"</em>, a term coined by Gil Tene, see <a href="https://www.youtube.com/watch?v=lJ8ydIuPFeU">"How NOT to Measure Latency" (video)</a>.</p>
<p>Gil talks at length about what happens to your measurements and how it leads to false assumptions especially if systems under test stall for closed workload tests. While compensation is possible to a certain degree in those cases, it quickly gets difficult for non-trivial workload models. <a href="#fnref:coordinated_omissions" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<hr>
<p>
Do you want to learn more about StormForger - Performance Testing as a Service?
<a href="https://stormforger.com">Sign up for free!</a>
</p>
Cookie Domain Mapping
https://stormforger.com/blog/cookie-domain-mapping/
2020-06-24T23:00:00+00:00
Sebastian Cohnen
<p><a href="/blog/cookie-domain-mapping/"><img src="/blog/cookie-domain-mapping/inline-left-cookies-a2fba0b3.jpg" alt="Cookie"></a>
Performance testing often happens on non-production systems, like a QA or dedicated "perf" environment. In some cases it is easier to access those systems directly for testing instead of going through a load balancer serving traffic under a shared domain. This approach leads to a typical problem though: Cookie domain mismatches and lost sessions.</p>
<p><img src="/blog/cookie-domain-mapping/cookie-map-8837235e.png" alt="Cookies!" /></p>
<p>Let's say you are running an ecommerce site and have a QA environment under <code>shop.qa.example.com</code>. The shopping cart is managed by a dedicated service running under <code>cart.qa.example.com</code>. In production the entire shop, including the cart component, is served from the same domain under <code>https://example.com</code>. Requests to <code>/cart/*</code> are routed to the shopping cart, while the rest is routed to the shop system.</p>
<p>In production, when everything is served under the same domain, you don't have any problems. For performance testing there can be any number of reasons to test directly against upstream systems. There is usually one problem though: Cookies. The standard cookie rules will not send cookies set for one system to another if the domain does not match.<sup id="fnref:rfc6265" role="doc-noteref"><a href="#fn:rfc6265" class="footnote" rel="footnote">1</a></sup></p>
<p>For this reason, we recently added a small feature called <a href="https://docs.stormforger.com/reference/sessions/#cookie-domain-mapping">"Cookie Domain Mapping"</a>. Cookie Domain Mapping allows you to change how cookies are handled in a StormForger test by providing a mapping configuration for the <em>domain property</em> of cookies.</p>
<p>How can we apply this to the ecommerce example described above? The goal is that cookies set by either system, <code>{shop,cart}.qa.example.com</code>, have to be available to the other. To achieve this, you can provide a mapping like this:</p>
<div class="highlight"><pre class="highlight javascript"><code><span class="nx">session</span><span class="p">.</span><span class="nx">setOption</span><span class="p">(</span><span class="dl">"</span><span class="s2">cookie_domain_map</span><span class="dl">"</span><span class="p">,</span> <span class="p">{</span>
<span class="dl">"</span><span class="s2">cart.qa.example.com</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">shop.qa.example.com</span><span class="dl">"</span><span class="p">,</span>
<span class="p">});</span>
</code></pre></div>
<p>Cookies set by <code>{shop,cart}.qa.example.com</code> will be handled as if they were set by <code>shop.qa.example.com</code> (storing cookies). The same mapping applies when deciding whether cookies should be sent back to the server (reading cookies). Note that cookies won't be sent to other domains, like <code>static.qa.example.com</code>.</p>
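<p>To illustrate the storing and reading rules, here is a toy cookie jar in Python. This is not StormForger's implementation (paths, expiry and the full RFC 6265 matching rules are deliberately ignored), but it captures the mapping idea:</p>

```python
class MappingCookieJar:
    """Toy illustration of the cookie_domain_map idea: cookies from a
    mapped domain are stored and looked up under the target domain."""

    def __init__(self, domain_map):
        self.domain_map = domain_map
        self.cookies = {}  # (canonical domain, cookie name) -> value

    def _canonical(self, domain):
        return self.domain_map.get(domain, domain)

    def store(self, domain, name, value):
        # Storing: treat the cookie as if the mapped domain had set it.
        self.cookies[(self._canonical(domain), name)] = value

    def for_request(self, domain):
        # Reading: the same mapping decides which cookies to send back.
        wanted = self._canonical(domain)
        return {name: value
                for (d, name), value in self.cookies.items() if d == wanted}

jar = MappingCookieJar({"cart.qa.example.com": "shop.qa.example.com"})
jar.store("cart.qa.example.com", "session", "abc123")

print(jar.for_request("shop.qa.example.com"))    # {'session': 'abc123'}
print(jar.for_request("cart.qa.example.com"))    # {'session': 'abc123'}
print(jar.for_request("static.qa.example.com"))  # {}
```

<p>The mapping is applied symmetrically: both when a <code>Set-Cookie</code> is stored and when the jar decides which cookies to attach to an outgoing request.</p>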
<h2 id="example">Example</h2>
<p>To get a more complete picture, let's take a look at a simple user journey with three steps: Visit start page, view a product and add a product to the shopping cart.</p>
<p><strong>Visit start page</strong>: A new client visits the start page. In the background a session cookie is generated and returned with a <code>Set-Cookie</code> header:</p>
<div class="highlight"><pre class="highlight javascript"><code><span class="nx">session</span><span class="p">.</span><span class="kd">get</span><span class="p">(</span><span class="dl">"</span><span class="s2">http://shop.qa.example.com/</span><span class="dl">"</span><span class="p">);</span>
</code></pre></div>
<p>Nothing special happens here in terms of cookie handling.</p>
<p><strong>View a product</strong>: The client now visits a product page. The value from the session cookie is used, for example, to fill the "recently viewed articles" list.</p>
<p>There is also no special treatment required for this, as the start page was served from the same domain and cookie handling works automatically in those cases:</p>
<div class="highlight"><pre class="highlight javascript"><code><span class="nx">session</span><span class="p">.</span><span class="kd">get</span><span class="p">(</span><span class="dl">"</span><span class="s2">http://shop.qa.example.com/4711</span><span class="dl">"</span><span class="p">);</span>
</code></pre></div>
<p><strong>Add product to shopping cart</strong>: Now we want to add the item to the shopping cart by performing a <code>POST</code> request. The session cookie is required to internally associate the cart to the correct user. <strong>But the domains do not match.</strong> We have to configure a domain map to send the cookie regardless of the mismatch:</p>
<div class="highlight"><pre class="highlight javascript"><code><span class="nx">session</span><span class="p">.</span><span class="nx">setOption</span><span class="p">(</span><span class="dl">"</span><span class="s2">cookie_domain_map</span><span class="dl">"</span><span class="p">,</span> <span class="p">{</span>
<span class="dl">"</span><span class="s2">cart.qa.example.com</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">shop.qa.example.com</span><span class="dl">"</span><span class="p">,</span>
<span class="p">});</span>
<span class="nx">session</span><span class="p">.</span><span class="nx">post</span><span class="p">(</span><span class="dl">"</span><span class="s2">http://cart.qa.example.com/add</span><span class="dl">"</span><span class="p">,</span> <span class="p">{</span> <span class="na">payload</span><span class="p">:</span> <span class="dl">"</span><span class="s2">ean=4711</span><span class="dl">"</span> <span class="p">});</span>
</code></pre></div>
<p>🎉</p>
<h2 id="conclusion">Conclusion</h2>
<p>Cookie Domain Mapping is a simple feature to manipulate how cookies are saved and looked up by simulated clients in a StormForger test. It makes cookies work seamlessly across different domain names, which comes in handy when testing non-production environments.</p>
<p>Check out our documentation on <a href="https://docs.stormforger.com/reference/sessions/#cookie-domain-mapping">"Cookie Domain Mapping"</a> to learn more.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:rfc6265" role="doc-endnote">
<p>This is of course a gross simplification. 😀 If you really want to understand how this works, check out <a href="https://tools.ietf.org/html/rfc6265">RFC6265: HTTP State Management Mechanism</a>. As it turns out, handling cookies correctly is surprisingly complex… 😞 <a href="#fnref:rfc6265" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<hr>
<p>
Do you want to learn more about StormForger - Performance Testing as a Service?
<a href="https://stormforger.com">Sign up for free!</a>
</p>
Load Testing Engine Evolution
https://stormforger.com/blog/load-testing-engine-evolution/
2020-06-15T23:00:00+00:00
Sebastian Cohnen
<p><a href="/blog/load-testing-engine-evolution/"><img src="/blog/load-testing-engine-evolution/inline-right-big-bang-0d008f07.jpg" alt="Load testing engine evolution"></a>
StormForger's load testing engine has been based on Tsung since we started our SaaS offering in 2014. In 2019 we began to design a new engine, which has now been rolled out to all of our customers. Today I'd like to shed some light on where we are coming from and why we are moving forward without Tsung.</p>
<p>I'd like to start with how we came to use Tsung<sup id="fnref:tsung-usage" role="doc-noteref"><a href="#fn:tsung-usage" class="footnote" rel="footnote">1</a></sup> as our basis to build what you now know as StormForger.</p>
<h2 id="the-origin">The Origin</h2>
<p>I have been a long-time user of <a href="https://github.com/processone/tsung">Tsung</a> since around 2009, almost 5 years before StormForger was founded. It started as the foundation of an evaluation framework for my master's thesis on "Design Patterns for Scalable, Service-oriented Webarchitectures" and I continued to use it in my time as a freelance consultant.<sup id="fnref:adcloud" role="doc-noteref"><a href="#fn:adcloud" class="footnote" rel="footnote">2</a></sup></p>
<p><img src="/blog/load-testing-engine-evolution/tsung-logo-12b429be.png" alt="Tsung" /></p>
<p>Tsung has always fascinated me, mainly because of its use of <a href="https://www.youtube.com/watch?v=xrIjfIjssLE">Erlang</a> and its design approach. It is relatively old (first commit 2001<sup id="fnref:tsung-first-commit" role="doc-noteref"><a href="#fn:tsung-first-commit" class="footnote" rel="footnote">3</a></sup>) and often misunderstood in terms of its strengths and weaknesses. Articles trying to compare Tsung to other performance testing tools often fall short because of a lack of understanding of how Tsung is meant to be used.<sup id="fnref:tool-comparison" role="doc-noteref"><a href="#fn:tool-comparison" class="footnote" rel="footnote">4</a></sup></p>
<p>Tsung is <strong>incredibly efficient</strong> and scales easily to hundreds of thousands of concurrently active clients on a moderately sized distributed load generator cluster. Because of Erlang's, or rather BEAM's, soft real-time properties it works beautifully for measuring durations, which allowed us to run <a href="/blog/load-testing-an-interactive-tv-show-with-over-1-million-users/">very large scale tests early on</a>.<sup id="fnref:tsung-process-hibernation" role="doc-noteref"><a href="#fn:tsung-process-hibernation" class="footnote" rel="footnote">5</a></sup></p>
<p>But Tsung is not without its flaws. One major shortcoming in the context of automation is that Tsung was designed to be used by an operator: a human sitting in front of a machine (or SSH'd into one), running a bunch of commands, looking at logs and so on. Tsung does not provide nice, machine-readable error messages, and some statistics come in a <a href="http://tsung.erlang-projects.org/user_manual/reports.html">strange format</a> meant to be processed by a <a href="https://www.perl.org">Perl</a> script generating gnuplot programs<sup id="fnref:tsung-reports" role="doc-noteref"><a href="#fn:tsung-reports" class="footnote" rel="footnote">6</a></sup>. Over time we found several improvements and workarounds for the fact that Tsung was never meant to be automated. Changing this fundamentally is very hard though.</p>
<p>Another source of problems for us was that Tsung is a multi-protocol<sup id="fnref:tsung-protocols" role="doc-noteref"><a href="#fn:tsung-protocols" class="footnote" rel="footnote">7</a></sup> load testing tool, almost more of a framework for network-based performance testing. StormForger, on the other hand, has always been focused on HTTP. This generic approach sometimes made it very hard to change things in Tsung. For example, supporting HTTP/2 would be a massive undertaking<sup id="fnref:tsung-http2" role="doc-noteref"><a href="#fn:tsung-http2" class="footnote" rel="footnote">8</a></sup>, which we considered multiple times. In the end it's a good example that you cannot have it both ways: highly generic and still strong in specialised scenarios.</p>
<p>We came to the conclusion that we need a purpose-built engine that is less generic and tailored specifically to our needs.</p>
<h2 id="the-goal">The Goal</h2>
<p>The goal with our new engine was to be specialised to what we need: focused on HTTP and most importantly built to be automated and integrated from the beginning.</p>
<p>We have a lot of interesting features in mind for the next generation of our engine, but our first goal was to have a new foundation to build upon. This was also important to us because we have to keep in mind the many thousands of test case definitions our customers have written over the years. Since many of them run automatically, we need to minimize the migration effort for our customers.</p>
<p>We quickly came to the conclusion that we should aim for the following:</p>
<ol>
<li>Be as close as possible to a <strong>drop-in replacement</strong> for the current feature set.</li>
<li>Support <strong>new need-to-have features right away</strong> if this is not in direct conflict with point 1.</li>
<li>Design and build for <strong>automation from the beginning</strong>, including live profiling in production, better monitoring and observability and simpler operations in general.</li>
<li>Lay out a <strong>foundation for new features</strong> with possible breaking changes to existing test cases.</li>
</ol>
<p>So far we are quite pleased with the outcome: Only some undocumented, special features we built for our customers needed adjustment. Without directly breaking compatibility, we were still able to bring some <a href="https://docs.stormforger.com/reference/engine/">new features and better behaviour</a> to all our customers right from the beginning. For example, this includes better debuggability, support for HTTP/2 and TLS 1.3 and an even faster "time to first request" when a test starts.</p>
<p>We also greatly improved our internal development and testing processes which already allowed us on several occasions to deliver new features rapidly.</p>
<h2 id="the-migration-path">The Migration Path</h2>
<p>Before we actively started the migration of StormForger customers to our new engine, we ran a lot of internal experiments and sanity checks in addition to our usual automated test suite. Since our <a href="https://docs.stormforger.com/guides/getting-started/">test case DSL</a> is a declarative description, we could also challenge our new engine with many thousands of existing test cases to see if we missed something – all without actually running tests against our customers' infrastructure.</p>
<p>The next step was migrating the first customers to our new engine. We started a few months ago by picking customers with whom we have shared Slack channels as part of their extended support package and offered to let them give our new engine a try. We switched over only a couple of customers at first and observed their test runs very closely. Since many run their tests automatically on a daily basis, we were able to gather lots of data, address smaller issues right away and push out updated versions rapidly.</p>
<p>Some weeks ago, we defaulted all new customers to our new engine while only a small fraction remained pinned to our old engine based on Tsung. This helped customers with a larger test code base and lots of active DevOps teams to get the required evaluation done and to ensure that everything is working as expected.</p>
<h2 id="the-road-ahead">The Road Ahead</h2>
<p>Our new <a href="https://docs.stormforger.com/reference/engine/">engine</a> solved some long-standing design problems and enabled new features right from the start: mainly HTTP/2 and multiple open connections per simulated client, which was simply not possible with Tsung before. We also improved a lot of our internal tooling and significantly improved the tools we can offer our customers to create and debug their test cases.</p>
<p>With our new engine and a new foundation in place, we are already working on new features related to capturing more metrics, better and deeper integration into our customers' development and QA processes as well as making our test case DSL even more expressive.</p>
<p>We will keep supporting our beloved, legacy engine for a few more weeks for the unlikely case that customers encounter an issue. After that we can tackle the next features we have in mind that are a breaking change to our old engine.</p>
<p>It is not without mixed feelings that we have to say: Farewell, dear Tsung. Thank you for many billion requests and years of great service. We'll miss you. 🤧</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:tsung-usage" role="doc-endnote">
<p>Although we never denied that we use Tsung, we did not communicate it publicly. Some folks are surprisingly good at spotting details, as <a href="https://news.ycombinator.com/item?id=7921942">a commenter on Hacker News in 2014</a> shows: He pointed out that parts of our logs looked like an Erlang process identifier and that many keywords in our DSL would be familiar to Tsung users. Crazy! <img src="/blog/load-testing-engine-evolution/hacker-news-erlang-pid-8e05ded2.png" alt="Hacker News" /> <a href="#fnref:tsung-usage" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:adcloud" role="doc-endnote">
<p>At one gig I had as a freelancer, working on performance optimisations and scalable architectures, I actually worked together with <a href="/team">Lars and Stephan</a>. <a href="#fnref:adcloud" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:tsung-first-commit" role="doc-endnote">
<p>Tsung's first commit in May 2001: <a href="https://github.com/processone/tsung/commit/cf947d6fa1f1341b3253c6e5918379ecd2fdb30e">"New repository initialized by cvs2svn"</a> <a href="#fnref:tsung-first-commit" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:tool-comparison" role="doc-endnote">
<p>I won't call names, but usually comparisons are biased. Tsung seems to be the misunderstood underdog in most articles I've read so far and I can't remember a writeup where Tsung actually made it as "the winner" 🤔 <a href="#fnref:tool-comparison" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:tsung-process-hibernation" role="doc-endnote">
<p>Actually, one of the reasons Tsung is so efficient with many tens of thousands of active users is <a href="http://www.erlang.org/doc/man/erlang.html#hibernate-3">Erlang's process hibernation feature</a> <a href="#fnref:tsung-process-hibernation" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:tsung-reports" role="doc-endnote">
<p><a href="http://tsung.erlang-projects.org/user_manual/reports.html#tsung-summary">http://tsung.erlang-projects.org/user_manual/reports.html#tsung-summary</a> <a href="#fnref:tsung-reports" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:tsung-protocols" role="doc-endnote">
<p>Supported protocols are HTTP, WebSockets, WebDAV, SOAP, PostgreSQL, MySQL, LDAP, MQTT, AMQP and Jabber/XMPP <a href="#fnref:tsung-protocols" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:tsung-http2" role="doc-endnote">
<p>I don't want to go into all the details of why HTTP/2 would be quite hard to implement. Tsung comes from a time when there was little third-party ecosystem available for Erlang. Everything is hand-crafted and implemented in Tsung directly, down to the HTTP client and connection handling. While certainly possible, it was not feasible with our resources. <a href="#fnref:tsung-http2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<hr>
<p>
Do you want to learn more about StormForger - Performance Testing as a Service?
<a href="https://stormforger.com">Sign up for free!</a>
</p>
Don't give up yet… keep-alive!
https://stormforger.com/blog/dont-give-up-yet-keep-alive/
2020-06-04T23:00:00+00:00
Sebastian Cohnen
<p><a href="/blog/dont-give-up-yet-keep-alive/"><img src="/blog/dont-give-up-yet-keep-alive/inline-left-telephone-exchange-15accccf.jpg" alt="Telephone exchange"></a>
There is one source of performance problems we've been encountering since before we started StormForger and still see all the time: Missing HTTP Keep-Alive. This article is about why this is still a problem in 2020 and why performance testing is an important tool to uncover such issues.</p>
<p>At 24 years old, HTTP is a well-aged fellow among the web protocols.<sup id="fnref:http09" role="doc-noteref"><a href="#fn:http09" class="footnote" rel="footnote">1</a></sup> Today we mostly use HTTP/1.1<sup id="fnref:http11_rfc" role="doc-noteref"><a href="#fn:http11_rfc" class="footnote" rel="footnote">2</a></sup> or <a href="https://tools.ietf.org/html/rfc7540">HTTP/2</a>, and if you have fully embraced the new HTTP/2 world across your entire system, this article is mostly an anecdote of past issues. But HTTP/1.1 is still alive and kicking in many systems. And even given its age, people still forget about a very important feature that earlier versions did not provide: keep-alive.<sup id="fnref:http10_keep_alive" role="doc-noteref"><a href="#fn:http10_keep_alive" class="footnote" rel="footnote">3</a></sup></p>
<p>To clarify: I am not talking about TCP keep-alive (which is disabled by default), nor about the keep-alive mechanisms of other protocols, which are equally important to keep an eye on. Today we will focus on HTTP keep-alive.</p>
<h2 id="how-does-http-work">How does HTTP work?</h2>
<p>HTTP (at least prior to HTTP/2) is a very simple protocol. For a given request to fetch data from a server the following steps happen (simplified):</p>
<p><img src="/blog/dont-give-up-yet-keep-alive/tcp-tls-handshake-433c08e0.svg" alt="TLS Handshake!" /></p>
<ul>
<li>a DNS lookup is made (not pictured),</li>
<li>a new TCP connection is established,</li>
<li>the TLS handshake is performed,</li>
<li>request headers and an optional payload are sent,</li>
<li>the response is read and</li>
<li>the <strong>connection is closed</strong>.</li>
</ul>
<p>The last point is the topic of this article: Don't close the connection!</p>
<p>HTTP/1.1 learned to re-use an existing connection: once the response has been read entirely, a new request can be sent over the same connection. This happens automatically if both parties support it. Unless the client sets the <code>Connection: close</code> request header or the server actively closes the connection, it will be reused for subsequent requests. Sounds like a no-brainer, right?</p>
<h2 id="why-is-this-important-why-bother">Why is this important? Why bother?</h2>
<p>We tend to forget that keep-alive can silently be missing. Almost everyone is aware that this concept exists, but few actively check that everything is working as expected. You might be surprised how often keep-alive is not configured properly!</p>
<p>The other issue is that developers and operations people <strong>heavily</strong> underestimate the impact of doing a DNS lookup, establishing a TCP connection and performing a TLS handshake. Over and over again. For every single HTTP request. Every. Single. Time.</p>
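<p>To get a feel for the magnitude, here is a back-of-the-envelope sketch. The RTT and DNS numbers below are illustrative assumptions, not measurements from this article:</p>

```javascript
// Rough per-request setup cost of a fresh HTTPS/1.1 connection.
// All numbers are assumptions for illustration; your RTT will differ.
const rttMs = 20;           // client <-> server round-trip time
const dnsMs = 10;           // lookup against a nearby resolver
const tcpHandshakeRtts = 1; // SYN / SYN-ACK
const tlsHandshakeRtts = 2; // full TLS 1.2 handshake (TLS 1.3 needs only 1)
const setupMs = dnsMs + (tcpHandshakeRtts + tlsHandshakeRtts) * rttMs;
console.log(setupMs); // 70 (ms) paid on top of every request without keep-alive
```

<p>Even with these friendly numbers, closing the connection adds tens of milliseconds to every single request before the server has done any actual work.</p>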
<p>From our experience, this overhead adds up very quickly. And it does not make a big difference what kind of system you are building: even for internal or local systems there is usually nothing to gain from closing the connection. You don't have to take our word for it – there are <a href="https://hpbn.co/http1x/#benefits-of-keepalive-connections">many resources</a> out there supporting this.</p>
<p>What we and our customers observe when running tests with missing keep-alive is slower response times, even at moderate load. As more requests take longer to process, more connections stay active, so more resources are consumed and blocked. In many cases the system under test does not recover until the traffic stops.</p>
<p>Here is a quick example I did a while back for a <a href="https://www.meetup.com/aws-cologne/events/260502656/">talk at the AWS User Group in Cologne</a>. I used a simple StormForger test case to give you an idea of how TCP reconnects impact latency (you can find the test definition at the end of this article). The following image is a latency histogram over all requests made by this test (available in all StormForger reports):</p>
<p><img src="/blog/dont-give-up-yet-keep-alive/bimodal-dist-69204a66.png" alt="Bimodal Distribution" /></p>
<p>You might have already guessed it: Left is with keep-alive, right is without. Same target, same request, same response.</p>
<p>Yes, this is a simple and somewhat artificial example, but it is not far from many setups our customers are testing. We see a clear bimodal distribution: one peak where new connections need to be established and another where an existing connection is reused. The difference is rather significant.</p>
<p>The difference comes from multiple factors:</p>
<ul>
<li>the DNS lookup, TCP handshake and TLS handshake are paid only once per peer (or a few times if you are using a pool of connections)</li>
<li>allocating a TCP socket is not free either, especially when the system is under load</li>
<li>resources are finite, and keeping sockets around can quickly add up; also look out for sockets in the <code>TIME_WAIT</code> state</li>
<li>worst case: you can run out of <a href="https://en.wikipedia.org/wiki/Ephemeral_port">ephemeral ports</a></li>
</ul>
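<p>The ephemeral-port worst case lends itself to a quick back-of-the-envelope calculation. This sketch assumes the default Linux ephemeral port range and a 60-second <code>TIME_WAIT</code>, both of which are tunable:</p>

```javascript
// How fast can a single client IP churn connections to one destination
// before running out of ephemeral ports? (Linux defaults assumed.)
const ephemeralPorts = 60999 - 32768 + 1; // default net.ipv4.ip_local_port_range
const timeWaitSeconds = 60;               // closed sockets linger in TIME_WAIT
// Each closed connection blocks its source port for ~60s, capping the
// sustained rate of *new* connections to a single peer at roughly:
const maxNewConnectionsPerSecond = Math.floor(ephemeralPorts / timeWaitSeconds);
console.log(maxNewConnectionsPerSecond); // 470
```

<p>A few hundred new connections per second per source IP is not much for a busy service, which is exactly why reusing connections matters.</p>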
<p>If you want to learn more about TCP, sockets and <code>TIME_WAIT</code> and how to optimize your servers, check out this great article by Vincent Bernat: <a href="https://vincent.bernat.ch/en/blog/2014-tcp-time-wait-state-linux">https://vincent.bernat.ch/en/blog/2014-tcp-time-wait-state-linux</a>.</p>
<h2 id="keep-alive-and-current-architectural-approaches">Keep-Alive and Current Architectural Approaches</h2>
<p>The issue with overlooking keep-alive is that its impact grows with some currently trending architectural approaches.</p>
<p>Take, for example, serverless or Function-as-a-Service (FaaS)<sup id="fnref:nodejs_keep_alive" role="doc-noteref"><a href="#fn:nodejs_keep_alive" class="footnote" rel="footnote">4</a></sup>. With FaaS your functions need to be stateless, but an application is usually not fully stateless. Most of the time you solve this by externalizing state to other components and services. And how do you access that state again? Quite often via HTTP. You should also check out <a href="https://twitter.com/theburningmonk">Yan Cui's</a> article on <a href="https://theburningmonk.com/2019/02/lambda-optimization-tip-enable-http-keep-alive/">HTTP keep-alive as an optimization for AWS Lambda</a>.</p>
<p>This especially affects microservices, where HTTP is often the communication protocol of choice.</p>
<p>Again and again we witness our customers uncovering these problems with performance tests and scoring rather quick wins in terms of latency, stability and general efficiency.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Use HTTP keep-alive. Always.</p>
<p>More importantly, don't just assume it is used: check it. It can easily be tested with curl via <code>curl -v http://example.com</code>, looking for <code>* Connection #0 to host example.com left intact</code> at the end of the output. Testing it at larger scale, and especially testing its impact, is also easily done with a performance test using StormForger. Catching a misconfiguration or an unintended configuration change with automated performance testing is even better, because you minimize the risk of potential havoc.</p>
<h2 id="more-details">More Details</h2>
<p>I've been using a simple test case to showcase the impact of HTTP keep-alive. We have two scenarios, each weighted at 50%. One session makes 25 HTTP requests with keep-alive (the default with StormForger) and the other makes 25 HTTP requests without keep-alive.</p>
<p>Note that our <a href="https://github.com/stormforger/testapp">testapp</a> does HTTP keep-alive by default:</p>
<div class="highlight"><pre class="highlight javascript"><code><span class="nx">definition</span><span class="p">.</span><span class="nx">session</span><span class="p">(</span><span class="dl">"</span><span class="s2">keep-alive</span><span class="dl">"</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">session</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// Every clients gets a new environment, so the first</span>
<span class="c1">// request cannot reuse an existing connection.</span>
<span class="nx">session</span><span class="p">.</span><span class="kd">get</span><span class="p">(</span><span class="dl">"</span><span class="s2">http://testapp.loadtest.party/</span><span class="dl">"</span><span class="p">,</span> <span class="p">{</span> <span class="na">tag</span><span class="p">:</span> <span class="dl">"</span><span class="s2">no-keep-alive</span><span class="dl">"</span><span class="p">,</span> <span class="p">});</span>
<span class="c1">// HTTP Keep-Alive is the default, so for all the following</span>
<span class="c1">// requests in this loop, we can reuse the connection.</span>
<span class="nx">session</span><span class="p">.</span><span class="nx">times</span><span class="p">(</span><span class="mi">25</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">context</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">context</span><span class="p">.</span><span class="kd">get</span><span class="p">(</span><span class="dl">"</span><span class="s2">http://testapp.loadtest.party/</span><span class="dl">"</span><span class="p">,</span> <span class="p">{</span> <span class="na">tag</span><span class="p">:</span> <span class="dl">"</span><span class="s2">keep-alive</span><span class="dl">"</span> <span class="p">});</span>
<span class="nx">context</span><span class="p">.</span><span class="nx">waitExp</span><span class="p">(</span><span class="mf">0.5</span><span class="p">);</span>
<span class="p">});</span>
<span class="p">});</span>
<span class="nx">definition</span><span class="p">.</span><span class="nx">session</span><span class="p">(</span><span class="dl">"</span><span class="s2">no-keep-alive</span><span class="dl">"</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">session</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// Setting the "Connection: close" header, we signal our</span>
<span class="c1">// client to close the connection when the transfer has</span>
<span class="c1">// finished, regardless if the server offers to keep the</span>
<span class="c1">// connection intact.</span>
<span class="nx">session</span><span class="p">.</span><span class="nx">times</span><span class="p">(</span><span class="mi">25</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">context</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">context</span><span class="p">.</span><span class="kd">get</span><span class="p">(</span><span class="dl">"</span><span class="s2">http://testapp.loadtest.party/</span><span class="dl">"</span><span class="p">,</span> <span class="p">{</span>
<span class="na">tag</span><span class="p">:</span> <span class="dl">"</span><span class="s2">no-keep-alive</span><span class="dl">"</span><span class="p">,</span>
<span class="na">headers</span><span class="p">:</span> <span class="p">{</span> <span class="na">Connection</span><span class="p">:</span> <span class="dl">"</span><span class="s2">close</span><span class="dl">"</span><span class="p">,</span> <span class="p">},</span>
<span class="p">});</span>
<span class="nx">context</span><span class="p">.</span><span class="nx">waitExp</span><span class="p">(</span><span class="mf">0.5</span><span class="p">);</span>
<span class="p">});</span>
<span class="p">});</span>
</code></pre></div>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:http09" role="doc-endnote">
<p>Actually HTTP is even older, but I'm referring to <a href="https://tools.ietf.org/html/rfc1945">RFC 1945</a>, i.e. HTTP/1.0. <a href="https://www.w3.org/Protocols/HTTP/AsImplemented.html">HTTP/0.9</a> actually dates back almost 30 years. <a href="#fnref:http09" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:http11_rfc" role="doc-endnote">
<p>HTTP 1.1 is actually a collection of RFCs: <a href="https://tools.ietf.org/html/rfc7230">RFC 7230, HTTP/1.1: Message Syntax and Routing</a>, <a href="https://tools.ietf.org/html/rfc7231">RFC 7231, HTTP/1.1: Semantics and Content</a>, <a href="https://tools.ietf.org/html/rfc7232">RFC 7232, HTTP/1.1: Conditional Requests</a>, <a href="https://tools.ietf.org/html/rfc7233">RFC 7233, HTTP/1.1: Range Requests</a>, <a href="https://tools.ietf.org/html/rfc7234">RFC 7234, HTTP/1.1: Caching</a>, <a href="https://tools.ietf.org/html/rfc7235">RFC 7235, HTTP/1.1: Authentication</a> <a href="#fnref:http11_rfc" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:http10_keep_alive" role="doc-endnote">
<p>Technically, HTTP/1.0 could also support keep-alive, but it was opt-in and it was never specified in detail how it should work. If the client wanted a connection to be reused, it had to send <code>Connection: keep-alive</code> and check whether the server responded with the same header. Only then (depending on the implementation) was the connection kept intact after a request. <a href="#fnref:http10_keep_alive" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:nodejs_keep_alive" role="doc-endnote">
<p>Node.js's HTTP client, or rather its <a href="https://nodejs.org/api/http.html#http_class_http_agent">HTTP Agent</a>, does not keep connections alive by default. You have to configure it explicitly, which is a bummer, because Node.js is a pretty popular technology for FaaS and serverless applications. <a href="#fnref:nodejs_keep_alive" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<hr>
<p>
Do you want to learn more about StormForger - Performance Testing as a Service?
<a href="https://stormforger.com">Sign up for free!</a>
</p>
Performance Testing at over 1 Million Requests per Second
https://stormforger.com/blog/perf-testing-over-1-million-requests/
2020-05-28T23:00:00+00:00
2020-05-28T23:00:00+00:00
Sebastian Cohnen
<p><a href="/blog/perf-testing-over-1-million-requests/"><img src="/blog/perf-testing-over-1-million-requests/inline-right-streets-bd98b56f.jpg" alt="Load testing engine evolution"></a>
At StormForger we regularly evaluate our load testing setup and infrastructure to make sure everything works as expected for our customers and to anticipate issues early on. Since we have been working on a revamp of our core load testing engine for the past weeks and months it was about time to scale our internal testing efforts up to eleven!</p>
<h2 id="goal">Goal</h2>
<p>We recently finished switching our customers over to our new load testing engine. While we had done a lot of testing internally, we thought: why not go a little beyond our requirements? No sooner said than done. Our goal? Make more than one million requests per second with at least one million active clients. 😱</p>
<p>To be honest: most of our customers need to simulate less traffic than that. Some require a large amount of bandwidth, others have very little, but very valuable, traffic. But we always need to know where our limits are so that our customers can rely on StormForger.</p>
<h2 id="premise">Premise</h2>
<p>The use case we selected as the basis for our experiments is a fictitious <a href="https://en.wikipedia.org/wiki/Hybrid_Broadcast_Broadband_TV">HbbTV</a> scenario where the requirement is to have a large number of viewers whose HbbTV-enabled devices send a regular heartbeat.</p>
<p>Although streaming is getting more and more popular, "normal" broadcast television is still VERY big around the world. HbbTV is a solution to bring more interactivity into the broadcasting industry. It is also used for analytics purposes which we want to use as an example in this article.</p>
<p>What we had in mind boils down to a very simple scenario per simulated client:</p>
<ul>
<li>randomly pick a station identifier from a data source</li>
<li>make a POST request acquiring a session token for the station being watched</li>
<li>enter a loop for several minutes</li>
<li>make a heartbeat HTTP request including the session token roughly every second</li>
</ul>
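<p>The arrival rate such a scenario requires follows from Little's law (active clients = arrival rate × session duration). A quick sketch with the numbers used in this article:</p>

```javascript
// Little's law: to keep N clients active when each session lasts T seconds,
// new clients must arrive at N / T per second.
const targetActiveClients = 1000000;   // 1M concurrently active devices
const sessionDurationSeconds = 5 * 60; // "a loop for several minutes"
const arrivalsPerSecond = targetActiveClients / sessionDurationSeconds;
console.log(arrivalsPerSecond.toFixed(2)); // "3333.33"
```

<p>With each client sending roughly one heartbeat per second, one million active clients also translates directly into about one million requests per second.</p>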
<p>By the way, some readers will remember that we have written about TV-related and <a href="https://stormforger.com/blog/load-testing-an-interactive-tv-show-with-over-1-million-users/">very large tests</a> before. Compared to our adventures testing a very large-scale interactive German TV show in 2014, this test will be way bigger!</p>
<h2 id="challenges-and-test-setup">Challenges and Test Setup</h2>
<p>The first problem we encountered was actually not our load generator setup itself. We have many years of experience in ad-hoc provisioning of cloud resources and managing them without any manual intervention. The problem was setting up a target that can actually handle the load. So how do you test a very efficient, scalable load testing engine?</p>
<p>As you might know, we have our <a href="https://github.com/stormforger/testapp">open source testapp</a> publicly <a href="https://testapp.loadtest.party">available</a>. It is a simple Go application, and we run it on a cheap yet beefy bare-metal server. The box is quite powerful, but not nearly enough to handle north of 1,000,000 requests per second with many more established TCP connections.</p>
<p>We are always looking at different cloud technologies, mostly out of curiosity. One of the things that caught our interest in the past was <a href="https://stormforger.com/blog/aws-fargate-network-performance/">AWS Fargate</a>, AWS's solution for running containers without having to manage servers or clusters. Since we already build a <a href="https://hub.docker.com/r/stormforger/testapp">Docker image for our testapp</a>, we thought: why not try to run and scale it on Fargate?</p>
<p>After some experiments on how efficient our testapp runs on Fargate we ended up with the following configuration:</p>
<ul>
<li>one AWS Network Load Balancer to have a single target to test</li>
<li>150 Fargate containers with 2 cores and 4 GB RAM each, running our testapp (we hit some problems utilising more than 2 cores efficiently, but that's a story for another time)</li>
<li>test target and load generator cluster located in Dublin, Ireland</li>
</ul>
<p>To set up and manage our target on Fargate we used <a href="https://github.com/awslabs/fargatecli">fargatecli</a>, a command line tool for setting up Fargate tasks and services. Once we had requested an increase in the number of allowed Fargate tasks, provisioning our test target was very simple: first we created a new Network Load Balancer (NLB), then a new Fargate service with 150 instances of our testapp.</p>
<h2 id="preparations">Preparations</h2>
<p>Before we actually ran the fully scaled test, we set ourselves some intermediary steps. The issue with going from 0 to 100 is that problems quickly generate a lot of noise in high-traffic scenarios. It is generally a good idea to cover some basic ground first.</p>
<p>We defined three steps, which is something we recommend our customers do as well:</p>
<p><strong>0) Measure the base capacity of the target system</strong></p>
<p>First we made a series of tests against different combinations of CPU/memory configurations of our testapp deployed on a single AWS Fargate task to establish a baseline. Our goal was to get close to 80% CPU utilisation per container.</p>
<p>While our testapp usually runs on a beefy bare-metal server, we quickly realised that it behaves quite differently on AWS Fargate: we get around 6,250 req/sec/core on our bare-metal 8-core box but only 3,250 req/sec/core on 2-core/4 GB containers. 🤔 Since our goal here was not to optimise for performance, and scaling Fargate is not an issue, we gritted our teeth and carried on. 😁</p>
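<p>As a rough sanity check, the per-core measurement also explains the cluster sizing. This sketch only uses the figures from this article; actual capacity depends on how far beyond 80% utilisation the tasks can be pushed:</p>

```javascript
// Estimated aggregate capacity of the Fargate target cluster.
const containers = 150;
const coresPerContainer = 2;
const reqPerSecPerCore = 3250; // measured on 2-core/4 GB Fargate tasks
const clusterCapacity = containers * coresPerContainer * reqPerSecPerCore;
console.log(clusterCapacity); // 975000 -- right around the 1M rps goal
```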
<p>We also noticed that the <a href="https://docs.amazonaws.cn/en_us/AmazonECS/latest/developerguide/task_definition_parameters.html#container_definition_limits">default <code>nofile</code> limit in Fargate tasks is 2048/4096 (soft/hard)</a>, which we had to increase, as we were planning to have many thousands of connections per container.</p>
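<p>For reference, the linked container definition parameter takes a list of <code>ulimits</code>. A fragment raising <code>nofile</code> might look like the following; the limit value here is an example, size it for your expected connection count:</p>

```json
"ulimits": [
  {
    "name": "nofile",
    "softLimit": 65536,
    "hardLimit": 65536
  }
]
```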
<p><strong>1) Run at ~10% for 5 minutes</strong></p>
<p>The first step towards our goal went very smoothly. We mainly used it to verify the results from our capacity testing step and to improve our confidence for the next steps.</p>
<p><strong>2) Run at ~50% for 10 minutes</strong></p>
<p>Because we do not use AWS Fargate in our daily operations, we immediately hit the default account limit on available AWS Fargate task instances. Before we could continue, we had to request an increase from AWS Support. Once this was approved, we could run our test without further issues.</p>
<h2 id="fire-it-up">Fire it up!</h2>
<p>After passing our internal milestones to verify that our target would not be overloaded by accident and everything behaves the way we want it for our experiment, it was time for our first test run.</p>
<p>It took almost exactly 11 seconds for our load generator cluster to be provisioned after we hit the "Launch new Test Run" button. 17 seconds later we reached 1 million requests per second!</p>
<p><img src="/blog/perf-testing-over-1-million-requests/running-test-ff703fb1.png" alt="1M rps!" /></p>
<p>After that, we performed a series of tests at that scale to pinpoint possible optimisations in our test infrastructure and analytics components. It turns out that processing hundreds of millions of requests per test run takes a moment and generates quite a bit of data that needs to be taken care of.</p>
<h2 id="conclusion">Conclusion</h2>
<p>With our new load generator engine (which is now enabled by default for all customers), scaling a StormForger performance test up to very large scenarios takes the same effort from the user's perspective as a very small-scale test.</p>
<p>Setting up a test target that can handle over a million active clients and well over 1 million requests per second was a bit more involved. Our usual and recommended approach of making intermediate steps towards your testing goal has proven to be a good idea, especially because we saw significant differences in vertical scalability on Fargate for our testapp (investigation pending). Horizontal scaling, on the other hand, was very straightforward: just crank up the number of tasks and you are good to go!</p>
<p>During our testing we generated hundreds of millions of requests that needed to be analysed. We confirmed some assumptions about our test analysis pipeline, which we plan to improve further in terms of processing times; our goal is that nothing should take longer than 60 seconds. Once the analysis was done, we were glad to see that our reports and latency analysis tooling worked as expected with sub-second delays.</p>
<p>Do you have questions or remarks regarding AWS Fargate, large scale testing or other performance topics? Just <a href="/support">drop us a line</a> :)</p>
<p><img src="/blog/perf-testing-over-1-million-requests/graphs-7cfd93f1.png" alt="Test Run Graphs" /></p>
<p><img src="/blog/perf-testing-over-1-million-requests/latency-dist-0a5a1266.png" alt="Latency Distribution Analysis" /></p>
<h2 id="some-more-details">Some More Details</h2>
<p>Here are some more details on the setup:</p>
<ul>
<li>we used <a href="https://github.com/awslabs/fargatecli">fargatecli</a> to setup the Network Load Balancer and Fargate tasks</li>
<li>we used 150 Fargate containers, each with 2048 CPU units and 4096 MB memory</li>
<li>each container handled around 6,500 requests per second at 80% CPU utilisation</li>
<li>we used HTTP/2 and gzip compression for all requests</li>
</ul>
<p>The StormForger test case definition we executed is pretty straightforward. Note that we artificially delay the responses of our testapp to simulate a more realistic "processing time" on the target.</p>
<div class="highlight"><pre class="highlight javascript"><code><span class="nx">definition</span><span class="p">.</span><span class="nx">setTarget</span><span class="p">(</span><span class="dl">"</span><span class="s2">http://testapp-lb-657638500f694428.elb.eu-west-1.amazonaws.com:8080</span><span class="dl">"</span><span class="p">);</span>
<span class="c1">// 1,000,000 active users, ~5min per User: 3333.33 arrivals/sec</span>
<span class="nx">definition</span><span class="p">.</span><span class="nx">setArrivalPhases</span><span class="p">([{</span> <span class="na">duration</span><span class="p">:</span> <span class="mi">10</span> <span class="o">*</span> <span class="mi">60</span><span class="p">,</span> <span class="na">rate</span><span class="p">:</span> <span class="mf">3333.33</span><span class="p">,</span> <span class="p">}]);</span>
<span class="nx">definition</span><span class="p">.</span><span class="nx">session</span><span class="p">(</span><span class="dl">"</span><span class="s2">up-to-eleven</span><span class="dl">"</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">session</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">const</span> <span class="nx">channels</span> <span class="o">=</span> <span class="nx">session</span><span class="p">.</span><span class="nx">ds</span><span class="p">.</span><span class="nx">loadStructured</span><span class="p">(</span><span class="dl">"</span><span class="s2">hbbtv_channels.csv</span><span class="dl">"</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">channel</span> <span class="o">=</span> <span class="nx">session</span><span class="p">.</span><span class="nx">ds</span><span class="p">.</span><span class="nx">pickFrom</span><span class="p">(</span><span class="nx">channels</span><span class="p">);</span>
<span class="c1">// Get a "watching token" for the selected channel</span>
<span class="nx">session</span><span class="p">.</span><span class="nx">post</span><span class="p">(</span><span class="dl">"</span><span class="s2">/random/get_token?delay=100&channel=:id</span><span class="dl">"</span><span class="p">,</span> <span class="p">{</span>
<span class="na">tag</span><span class="p">:</span> <span class="dl">"</span><span class="s2">get_token</span><span class="dl">"</span><span class="p">,</span>
<span class="na">params</span><span class="p">:</span> <span class="p">{</span>
<span class="na">id</span><span class="p">:</span> <span class="nx">channel</span><span class="p">.</span><span class="kd">get</span><span class="p">(</span><span class="dl">"</span><span class="s2">id</span><span class="dl">"</span><span class="p">),</span>
<span class="p">},</span>
<span class="na">extraction</span><span class="p">:</span> <span class="p">{</span>
<span class="na">jsonpath</span><span class="p">:</span> <span class="p">{</span>
<span class="dl">"</span><span class="s2">token</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">$.token</span><span class="dl">"</span><span class="p">,</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">});</span>
<span class="nx">session</span><span class="p">.</span><span class="nx">assert</span><span class="p">(</span><span class="dl">"</span><span class="s2">token_received</span><span class="dl">"</span><span class="p">,</span> <span class="nx">session</span><span class="p">.</span><span class="nx">getVar</span><span class="p">(</span><span class="dl">"</span><span class="s2">token</span><span class="dl">"</span><span class="p">),</span> <span class="dl">"</span><span class="s2">!=</span><span class="dl">"</span><span class="p">,</span> <span class="dl">""</span><span class="p">);</span>
<span class="kd">const</span> <span class="nx">pings</span> <span class="o">=</span> <span class="nx">session</span><span class="p">.</span><span class="nx">ds</span><span class="p">.</span><span class="nx">generate</span><span class="p">(</span><span class="dl">"</span><span class="s2">random_number</span><span class="dl">"</span><span class="p">,</span> <span class="p">{</span> <span class="na">range</span><span class="p">:</span> <span class="p">[</span><span class="mi">280</span><span class="p">,</span> <span class="mi">320</span><span class="p">]});</span>
<span class="nx">session</span><span class="p">.</span><span class="nx">times</span><span class="p">(</span><span class="nx">pings</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">ctx</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">ctx</span><span class="p">.</span><span class="kd">get</span><span class="p">(</span><span class="dl">"</span><span class="s2">/ping?delay=100&token=:token</span><span class="dl">"</span><span class="p">,</span> <span class="p">{</span>
<span class="na">tag</span><span class="p">:</span> <span class="dl">"</span><span class="s2">ping</span><span class="dl">"</span><span class="p">,</span>
<span class="na">params</span><span class="p">:</span> <span class="p">{</span>
<span class="na">token</span><span class="p">:</span> <span class="nx">session</span><span class="p">.</span><span class="nx">getVar</span><span class="p">(</span><span class="dl">"</span><span class="s2">token</span><span class="dl">"</span><span class="p">),</span>
<span class="p">}</span>
<span class="p">});</span>
<span class="nx">ctx</span><span class="p">.</span><span class="nx">waitExp</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
<span class="p">});</span>
<span class="p">});</span>
</code></pre></div>
<hr>
<p>
Do you want to learn more about StormForger - Performance Testing as a Service?
<a href="https://stormforger.com">Sign up for free!</a>
</p>
AWS Fargate Network Performance
https://stormforger.com/blog/aws-fargate-network-performance/
2018-06-03T23:00:00Z
2018-06-03T23:00:00Z
Sebastian Cohnen
<p><a href="/blog/aws-fargate-network-performance"><img src="/blog/aws-fargate-network-performance/aws_fargate_logo-3b0a0457.png" alt="AWS Fargate network Performance"></a></p>
<p>AWS Fargate is Amazon's solution for running containers without managing servers or clusters. In many aspects AWS Fargate is similar to AWS Elastic Container Service, but without having to deal with EC2 clusters. It is basically "serverless containers".</p>
<p>Let's take a first look at AWS Fargate's network performance!</p>
<p><strong>TL;DR</strong>: Sustained network throughput is very stable, but it is not symmetric and does not really grow with assigned container resources.</p>
<p><a href="https://aws.amazon.com/fargate/">AWS Fargate</a> was <a href="https://aws.amazon.com/about-aws/whats-new/2018/04/aws-fargate-now-available-in-ohio--oregon--and-ireland-regions/">recently launched</a> in Dublin/Ireland (<code>eu-west-1</code>) which is a good opportunity to take a look at some performance characteristics of that service.</p>
<p>There are various articles on AWS EC2 network performance, sometimes between instances, sometimes to other AWS services like S3 (e.g. by <a href="https://cloudonaut.io/ec2-network-performance-cheat-sheet/">Andreas Wittig</a>). For EC2 we know those numbers very well, and we regularly check them in order to recommend correct <a href="https://docs.stormforger.com/reference/test-cluster/">StormForger cluster sizes</a> to our customers. <strong>Knowing the network performance is very important in order not to be misled when doing performance testing.</strong> For EC2, raw network performance in terms of bandwidth depends on the instance type and, more importantly, on the instance size within one instance family. The general rule of thumb: the bigger the instance (and the more you pay), the better networking gets.</p>
<p>Back to AWS Fargate. We were wondering: what kind of network performance can one expect from AWS Fargate? And how does the container's resource sizing (CPU and memory) relate to its network performance? These can be important parameters to know if you have a bandwidth-dependent workload.</p>
<h2 id="preparation-and-test-setup">Preparation and Test Setup</h2>
<p>To assess network performance we are going to use <a href="https://iperf.fr/iperf-download.php">iPerf3</a>. To make sure the measurement is not limited by the iPerf server side, we are using beefy 72-core <code>c5.18xlarge</code> instances, which are advertised with up to 25 Gbit/s of network performance. During preparation we did sanity checks between two <code>c5.18xlarge</code> instances, where we saw 22 Gbit/s sustained throughput with peaks reaching close to 25 Gbit/s. That should provide us with enough headroom for our experiments.</p>
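<p>The EC2-to-EC2 sanity check can be reproduced with a plain iPerf3 run using the same parameters as the later Fargate measurements. Here is a dry-run sketch; the peer IP is a placeholder, and <code>iperf3 --server</code> must already be running on the other instance:</p>
<div class="highlight"><pre class="highlight console"><code>
```shell
# Dry run of the EC2-to-EC2 sanity check: one instance runs `iperf3 --server`,
# the other runs the client command below. PEER_IP is a placeholder address.
PEER_IP="10.0.1.10"
CLIENT_CMD="iperf3 --client $PEER_IP --time 60 --omit 10 --parallel 2"
echo "$CLIENT_CMD"   # printed here; run the command directly on the instance
```
</code></pre></div>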
<p><img src="/blog/aws-fargate-network-performance/cores-aec6cb1c.png" alt="MORE CORES" /></p>
<p>Our test target will be AWS Fargate containers launched in <code>eu-west-1</code>. Fargate allows five tiers of "CPU Units" with a range of memory (check <a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/AWS_Fargate.html">AWS Fargate documentation</a>). Since we were interested in the relation of network performance and CPU/memory sizing, we decided to test 10 scenarios, the minimum and maximum memory values for each of the five CPU Unit tiers:</p>
<table>
<thead>
<tr>
<th>CPU Units</th>
<th>vCPU</th>
<th>Memory (MiB)</th>
<th>Price per hour (USD)</th>
</tr>
</thead>
<tbody>
<tr>
<td>256</td>
<td>0.25</td>
<td>512</td>
<td>0.019</td>
</tr>
<tr>
<td>256</td>
<td>0.25</td>
<td>2048</td>
<td>0.03805</td>
</tr>
<tr>
<td>512</td>
<td>0.5</td>
<td>1024</td>
<td>0.038</td>
</tr>
<tr>
<td>512</td>
<td>0.5</td>
<td>4096</td>
<td>0.0761</td>
</tr>
<tr>
<td>1024</td>
<td>1</td>
<td>2048</td>
<td>0.076</td>
</tr>
<tr>
<td>1024</td>
<td>1</td>
<td>8192</td>
<td>0.1522</td>
</tr>
<tr>
<td>2048</td>
<td>2</td>
<td>4096</td>
<td>0.152</td>
</tr>
<tr>
<td>2048</td>
<td>2</td>
<td>16384</td>
<td>0.3044</td>
</tr>
<tr>
<td>4096</td>
<td>4</td>
<td>8192</td>
<td>0.304</td>
</tr>
<tr>
<td>4096</td>
<td>4</td>
<td>30720</td>
<td>0.5834</td>
</tr>
</tbody>
</table>
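<p>For scripting the runs, the configuration matrix above can be enumerated with a small shell loop. This is only a sketch (not part of the original setup); the values are copied from the table:</p>
<div class="highlight"><pre class="highlight console"><code>
```shell
# Enumerate the 10 tested configurations: each CPU Unit tier with its
# minimum and maximum memory value (in MiB), taken from the table above.
for tier in 256:512:2048 512:1024:4096 1024:2048:8192 \
            2048:4096:16384 4096:8192:30720; do
  cpu=${tier%%:*}; rest=${tier#*:}
  min_mem=${rest%%:*}; max_mem=${rest#*:}
  for mem in "$min_mem" "$max_mem"; do
    echo "cpu=$cpu memory=$mem"
  done
done
```
</code></pre></div>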
<p>We also tested the same 10 configurations with iPerf's <code>--reverse</code> option to verify whether network performance is symmetric, resulting in a total of 20 scenarios.</p>
<p>All test scenarios are configured to omit the first 10 seconds of measurement to skip past the <a href="https://en.wikipedia.org/wiki/TCP_congestion_control#Slow_start">TCP slow-start</a> window and any (initial) peak performance. The actual measurement runs for 60 seconds using two parallel TCP streams. We are primarily interested in sustained network performance, ignoring short throughput peaks.</p>
<h2 id="results">Results</h2>
<p>The results were quite interesting and rather surprising to us. But without further ado, let's take a look.</p>
<p>Here is a plot of all tested configurations with the average bandwidth measured over 60 seconds. "Fargate out" refers to traffic being sent from AWS Fargate (iPerf server) to our EC2 instance (iPerf client), done via iPerf's <code>--reverse</code> flag. "Fargate in" is the other direction, without <code>--reverse</code>, sending data from EC2 to the Fargate container.</p>
<p><img src="/blog/aws-fargate-network-performance/bandwidth-18b948a1.svg" alt="AWS Fargate Network Performance" /></p>
<p>Here is the same result data in tabular form:</p>
<table>
<thead>
<tr>
<th>CPU Units</th>
<th>vCPU</th>
<th>Memory (MB)</th>
<th>Outgoing (MBit/s)</th>
<th>Incoming (MBit/s)</th>
</tr>
</thead>
<tbody>
<tr>
<td>256</td>
<td>0.25</td>
<td>512</td>
<td>268</td>
<td>136</td>
</tr>
<tr>
<td>256</td>
<td>0.25</td>
<td>2048</td>
<td>268</td>
<td>136</td>
</tr>
<tr>
<td>512</td>
<td>0.5</td>
<td>1024</td>
<td>268</td>
<td>136</td>
</tr>
<tr>
<td>512</td>
<td>0.5</td>
<td>4096</td>
<td>625</td>
<td>319</td>
</tr>
<tr>
<td>1024</td>
<td>1</td>
<td>2048</td>
<td>268</td>
<td>137</td>
</tr>
<tr>
<td>1024</td>
<td>1</td>
<td>8192</td>
<td>455</td>
<td>454</td>
</tr>
<tr>
<td>2048</td>
<td>2</td>
<td>4096</td>
<td>3404</td>
<td>319</td>
</tr>
<tr>
<td>2048</td>
<td>2</td>
<td>16384</td>
<td>637</td>
<td>636</td>
</tr>
<tr>
<td>4096</td>
<td>4</td>
<td>8192</td>
<td>2753</td>
<td>456</td>
</tr>
<tr>
<td>4096</td>
<td>4</td>
<td>30720</td>
<td>637</td>
<td>636</td>
</tr>
</tbody>
</table>
<p>Although the bandwidth is not particularly high for most configurations, it is very stable over time. For all tests there was an initial peak, but the sustained bandwidth was very solid with little to no variation, which makes it quite predictable.</p>
<p>In addition to the network throughput we also took a look at the containers' CPU utilization (normalized to 100%). This draws a very interesting picture, which could at least explain why network performance is not symmetric (CPU utilization for "in" is higher than for "out" in all cases):</p>
<p><img src="/blog/aws-fargate-network-performance/cpu-714819a4.svg" alt="AWS Fargate Network Performance" /></p>
<h2 id="conclusions">Conclusions</h2>
<p>We are not surprised by the relatively low throughput of the smaller container configurations (this is basically very similar to EC2 network performance). Two aspects strike us as very odd:</p>
<ol>
<li><strong>Network throughput on AWS Fargate does not seem to be symmetric.</strong> Often there is "just" a 2x difference, but it goes up to over 10x.</li>
<li><strong>Outgoing performance of the 2048 CPU Units / 4 GB memory configuration is way off at over 3 Gbit/s</strong> (same with 4096 CPU Units / 8 GB memory). And the throughput goes down again when we increase the CPU/memory of the container.</li>
</ol>
<p>This is a <strong>really</strong> strange observation. The biggest container you can get, with 4096 CPU Units and 30 GB of memory, has roughly the same network performance as the <strong>50% cheaper 2048 CPU Units / 16 GB option</strong>.</p>
<p>Our initial assumption was that network performance correlates with allocated resources, as it does with EC2. At first this seems to be the case, until the 2048 CPU Unit tier, where things get really odd. Compared to EC2 pricing and network performance, the price per MBit/s is actually not that bad (at least in the outgoing direction).</p>
<p>For a follow-up one might look into more details, like measuring all memory tiers (currently in 1 GB increments), varying the number of TCP streams in iPerf, etc. Some configuration tweaking might be possible to increase network performance overall, but that would not explain the immense (positive) performance outliers.</p>
<p>Do you have questions or remarks regarding AWS Fargate or other performance topics? Just <a href="/support">drop us a line</a> :)</p>
<h2 id="more-details">More Details</h2>
<p>In case you are interested in a bit more detail of the test setup, here you go!</p>
<p>We used an unofficial <a href="https://github.com/jpignata/fargate"><code>fargate</code> CLI tool</a> by John Pignata, which has a really nice and simple interface. We had to <a href="https://github.com/jpignata/fargate/pull/54">patch the tool</a> to support the newly available regions, including <code>eu-west-1</code> which we wanted to use for testing.</p>
<p>We used the following <code>Dockerfile</code> to set up the iPerf3 servers running in Fargate:</p>
<div class="highlight"><pre class="highlight docker"><code><span class="k">FROM</span><span class="s"> alpine:latest</span>
<span class="k">RUN </span>apk <span class="nt">--update</span> add iperf3 <span class="se">\
</span> <span class="o">&&</span> <span class="nb">rm</span> <span class="nt">-rf</span> /var/cache/apk/<span class="k">*</span>
<span class="k">ENTRYPOINT</span><span class="s"> ["iperf3"]</span>
<span class="k">CMD</span><span class="s"> ["--server", "--json", "--verbose"]</span>
</code></pre></div>
<p>Creating an AWS Fargate task based on this <code>Dockerfile</code> with our desired parameters was very straightforward. This command will build the container image, upload it to your private Docker registry, and launch a Fargate task with the given resource configuration:</p>
<div class="highlight"><pre class="highlight console"><code><span class="gp">$</span><span class="w"> </span>bin/fargate task run iperf-test <span class="nt">--region</span><span class="o">=</span>eu-west-1 <span class="nt">--cpu</span> <span class="nv">$CPU_UNITS</span> <span class="nt">--memory</span> <span class="nv">$MEMORY_UNITS</span>
</code></pre></div>
<p>When the container is ready, you can start iPerf from the beefy EC2 instance like this:</p>
<div class="highlight"><pre class="highlight console"><code><span class="gp">$</span><span class="w"> </span>iperf3 <span class="se">\</span>
<span class="nt">--time</span> 60 <span class="se">\</span>
<span class="nt">--omit</span> 10 <span class="se">\</span>
<span class="nt">--parallel</span> 2 <span class="se">\</span>
<span class="nt">--reverse</span> <span class="se">\</span>
<span class="nt">--verbose</span> <span class="se">\</span>
<span class="nt">--json</span> <span class="se">\</span>
<span class="nt">--get-server-output</span> <span class="se">\</span>
<span class="nt">--client</span> <span class="nv">$IPERF_TARGET_CONTAINER_IP</span> <span class="o">></span> measurements/c<span class="nv">$CPU_UNITS</span><span class="nt">-m</span><span class="nv">$MEMORY_UNITS</span>.json
</code></pre></div>
<p>After each test, we wait 20 seconds before starting the next experiment. This is to give both systems a bit of time to cool down.</p>
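<p>Putting the pieces together, the whole series of 20 experiments, including the cool-down, could be driven by a loop like the following. This is a hypothetical sketch shown as a dry run (commands are only echoed; the target IP is a placeholder):</p>
<div class="highlight"><pre class="highlight console"><code>
```shell
# Hypothetical driver for all 20 scenarios: every CPU/memory configuration in
# both directions, with a 20-second cool-down between runs. Commands are only
# echoed here (dry run); remove the echo prefixes to execute for real.
IPERF_TARGET_CONTAINER_IP="${IPERF_TARGET_CONTAINER_IP:-203.0.113.10}"  # placeholder
for tier in 256:512 256:2048 512:1024 512:4096 1024:2048 1024:8192 \
            2048:4096 2048:16384 4096:8192 4096:30720; do
  CPU_UNITS=${tier%:*} MEMORY_UNITS=${tier#*:}
  for direction in "" "--reverse"; do
    echo iperf3 --time 60 --omit 10 --parallel 2 $direction --json \
         --client "$IPERF_TARGET_CONTAINER_IP" \
         ">" "measurements/c$CPU_UNITS-m$MEMORY_UNITS.json"
    echo sleep 20   # cool-down between experiments
  done
done
```
</code></pre></div>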
<p>The iPerf3 client (EC2 instance) and all Fargate containers were started in the same VPC, using the same Security Group, within the <code>eu-west-1</code> region. No other user processes were running on the EC2 instance, besides network monitoring tools for sanity checking.</p>
<hr>
<p>
Do you want to learn more about StormForger - Performance Testing as a Service?
<a href="https://stormforger.com">Sign up for free!</a>
</p>
Determine Your Performance Impact by Meltdown & Spectre
https://stormforger.com/blog/determine-performance-impact-by-meltdown-and-spectre/
2018-01-04T23:00:00+00:00
2018-01-04T23:00:00+00:00
Sebastian Cohnen
<p><a href="/blog/determine-performance-impact-by-meltdown-and-spectre"><img src="/blog/determine-performance-impact-by-meltdown-and-spectre/inline-left-meltdown_spectre_logo-5ff22a04.png" alt="Meltdown & Spectre"></a></p>
<p>Major security-related issues were disclosed just a few days ago, affecting CPUs across all vendors and architectures, including Intel and AMD. These vulnerabilities have become known as Meltdown and Spectre and are very severe.</p>
<p>Mitigations have been released for many systems and environments, but you should check that you are fully patched <strong>before continuing with this article</strong>! The problem affects all systems, virtualized or not.</p>
<p>There is still a lot of speculation on possible performance impacts caused by mitigations of <a href="https://meltdownattack.com/">Meltdown and Spectre</a>. While AWS states that they <a href="https://aws.amazon.com/security/security-bulletins/AWS-2018-013/"><em>"have not observed meaningful performance impact for the overwhelming majority of EC2 workloads"</em></a>, other reports indicate quite an impact (e.g. as reported on the PostgreSQL mailing list).</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">PostgreSQL SELECT 1 with the KPTI workaround for Intel CPU vulnerability <a href="https://t.co/N9gSvML2Fo">https://t.co/N9gSvML2Fo</a><br /><br />Best case: 17% slowdown<br />Worst case: 23%</p>— The Register (@TheRegister) <a href="https://twitter.com/TheRegister/status/948342806367518720?ref_src=twsrc%5Etfw">January 2, 2018</a></blockquote>
<h2 id="performance-impacts-are-workload-related">Performance impacts are workload related</h2>
<p>The security problem is related to the isolation of user and kernel processes, so the mitigations work at exactly that boundary. The performance degradation happens because user processes have to ask the kernel for many tasks, for example I/O-related operations like disk access or networking (<a href="https://en.wikipedia.org/wiki/System_call">system calls</a>). This probably explains why a pure database workload is more impacted than a typical web application that does much more non-I/O business logic.</p>
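<p>One crude way to see the syscall cost in isolation (not from the reports above, but a common micro-benchmark) is to run something that does almost nothing but system calls, e.g. <code>dd</code> with a small block size:</p>
<div class="highlight"><pre class="highlight console"><code>
```shell
# Each 512-byte block costs one read() and one write() system call, so this
# run is dominated by syscall overhead -- the code path made more expensive
# by the KPTI mitigation. Compare the wall-clock time before and after patching.
time dd if=/dev/zero of=/dev/null bs=512 count=1000000
```
</code></pre></div>
<p>This only isolates the syscall overhead; it says nothing about your actual workload, which is why the article recommends proper performance tests below.</p>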
<h2 id="determine-how-your-performance-is-impacted">Determine how your performance is impacted</h2>
<p>While not protecting against these issues is not an option, you might want to know what the performance, and thus the business, impact is.</p>
<p>Ideally you already know the performance characteristics of your system. In this case you can compare pre- and post-patch behavior and look for potential issues, like increased resource utilization or latencies. Hint: if you are looking for a nice overview of Linux performance analysis, check out <a href="https://netflixtechblog.com/linux-performance-analysis-in-60-000-milliseconds-accc10403c55">Linux Performance Analysis in 60,000 Milliseconds</a> by Netflix.</p>
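<p>The Netflix article boils down to a handful of standard Linux tools. A quick pre-/post-patch baseline could, for example, capture the following (the repetition counts are our own choice; <code>mpstat</code> and <code>iostat</code> come from the sysstat package and may need to be installed):</p>
<div class="highlight"><pre class="highlight console"><code>
```shell
# Capture a quick baseline (run once before and once after patching, under
# comparable load, then compare the two). A rising %sys / sy share after
# patching hints at increased syscall cost.
uptime                                              # load averages
vmstat 1 3                                          # CPU split: watch the sy column
command -v mpstat >/dev/null && mpstat -P ALL 1 3   # per-CPU utilization
command -v iostat >/dev/null && iostat -xz 1 3      # disk latency and utilization
free -m                                             # memory usage
```
</code></pre></div>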
<p>In any case, the only way to reliably determine that your business won't be affected by the yet unknown performance penalty is to do <a href="/blog/types-of-performance-testing/">performance tests</a>.</p>
<h3 id="how-to-get-started">How to get started?</h3>
<p>Performance testing is hard and you need to invest some time. However, to get a first impression, just create a test case for your web application or HTTP API.</p>
<p class="text-center">
<a href="https://app.stormforger.com" class="Button-cta">
<strong>Start to run tests with 300 Clients for free!</strong>
</a>
</p>
<p>If you have any questions just <a href="/support">drop a line</a> – we're happy to help!</p>
<h2 id="read-further">Read Further</h2>
<ul>
<li><a href="https://googleprojectzero.blogspot.de/2018/01/reading-privileged-memory-with-side.html">Google Project Zero: Reading privileged memory with a side-channel</a></li>
<li><a href="https://newsroom.intel.com/news/intel-responds-to-security-research-findings/">Official statement by Intel Corporation</a></li>
<li><a href="https://danielmiessler.com/blog/simple-explanation-difference-meltdown-spectre/">A Simple Explanation of the Differences Between Meltdown and Spectre</a></li>
<li><a href="https://www.theregister.co.uk/2018/01/02/intel_cpu_design_flaw/">Kernel-memory-leaking Intel processor design flaw forces Linux, Windows redesign</a></li>
</ul>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<hr />
<h5 id="disclaimer">Disclaimer</h5>
<p>Performance testing is not security or penetration testing!</p>
<p>The field of operation of StormForger is performance testing and <strong>not any type of security testing or security auditing</strong>; please refer to experts like Cure53 for this.</p>
<hr>
<p>
Do you want to learn more about StormForger - Performance Testing as a Service?
<a href="https://stormforger.com">Sign up for free!</a>
</p>
Advices on performance from AWS' "Well-Architected Framework"
https://stormforger.com/blog/advices-on-performance-from-aws-well-architected-framework/
2017-03-21T23:00:00+00:00
2017-03-21T23:00:00+00:00
Denise Schynol
<p><a href="/blog/advices-on-performance-from-aws-well-architected-framework/"><img src="/blog/advices-on-performance-from-aws-well-architected-framework/inline-left-pillars-cloud-583732d5.jpg" alt="Cloud Pillars"></a></p>
<p>“AWS Well-Architected Framework” is a guideline for reviewing and improving
cloud-based architectures. Amazon Web Services recommends “general design
principles” and describes best-practice examples.</p>
<p>In this blog post we sum up some advice on performance testing.</p>
<p>First, let's start with an explanation of the general design principles suggested
by AWS, followed by a closer look at some of the so-called “five pillars” of the
<a href="https://aws.amazon.com/de/blogs/aws/well-architected-working-backward-to-play-it-forward/">AWS Well-Architected
Framework</a> (p. 2).
Speaking of “five pillars”: in an earlier blog post from October 2015,
<a href="https://aws.amazon.com/de/blogs/aws/are-you-well-architected/">Jeff Barr counted only four
pillars</a>, so the pillar “operational excellence”, not mentioned there,
seems to have gained importance over time.</p>
<h2 id="security-reliability-performance-efficiency-cost-optimization--operational-excellence--five-pillars-of-the-aws-well-architected-framework">Security, Reliability, Performance Efficiency, Cost Optimization & Operational Excellence – Five Pillars of the AWS Well-Architected Framework</h2>
<p>In their definition of a Well-Architected Framework, AWS speaks about how they
help customers make “architectural trade offs as your designs evolve” and
about the effects on, and learnings about, performance after deploying into a live
environment (p. 1). Based on this they created the AWS Well-Architected
Framework – a “set of questions you can use to evaluate how well an architecture
is aligned to AWS best practices” (p. 2). Five topics (“pillars”) are defined:
<strong>security</strong>, <strong>reliability</strong>, <strong>performance efficiency</strong>, <strong>cost
optimization</strong>, and <strong>operational excellence<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></strong> (p. 2). AWS describes that
customers have to “make trade-offs between pillars based upon your business
context” (p. 2) when they improve their architecture. For example</p>
<blockquote>
<p>the optimization of performance efficiency has crucial effects on your cost
efficiency.</p>
</blockquote>
<p>So, it is recommended to do performance analysis and optimization of performance
bottlenecks on a regular basis, to benefit, for example, your cost efficiency.</p>
<p><img src="/blog/advices-on-performance-from-aws-well-architected-framework/5-pillars-aws-well-architected-framework-9f598f65.svg" alt="5 Pillars of the AWS Well Architected
Framework" /></p>
<h2 id="general-design-principles">General Design Principles</h2>
<p>The definition of the AWS Well-Architected
Framework is followed by a set of general design principles “to facilitate
good design in the cloud” (p. 2). We pick out three points concerning load and
performance testing. The first one deals with testing systems at production
scale:</p>
<blockquote>
<p>In the cloud, <strong>you can create a production-scale test environment on
demand</strong>, complete your testing, and then decommission the resources. Because
you only pay for the test environment when it is running, you can simulate
your live environment for a fraction of the cost of testing on premises. (p.
3)</p>
</blockquote>
<p>AWS makes it easy and affordable for you to test in the cloud. You can run your
tests without a huge increase in costs.</p>
<p>Another point is AWS’s suggestion to do improvements through “game days”:</p>
<blockquote>
<p>Test how your architecture and processes perform by regularly scheduling game
days to simulate events in production. This will help you understand where
improvements can be made and can help develop organizational experience in
dealing with events. (p. 3)</p>
</blockquote>
<p>An event might be the transmission of a newsletter, any other promotion, or a
business-related event like the launch of a new website or product. In this case
you can make use of spike testing, which is comparable to a load or stress test.
If you are curious about the different types of testing, check out our <a href="/blog/types-of-performance-testing/">blog post about types of performance testing</a>. One more notable
point deals with allowing for evolutionary architectures:</p>
<blockquote>
<p>(…) In the cloud, the capability to automate and test on demand lowers the
risk of impact from design changes. This allows systems to evolve over time so
that businesses can take advantage of innovations as a standard practice. (p.
3)</p>
</blockquote>
<p>It is common to introduce new features or a new technical product to your
architecture, test it in an automated, cloud-based environment, and learn about
the performance impact of the change. Test automation in particular is one of
StormForger’s major topics, as we deliver all the tools you need to do continuous
load testing in the cloud.</p>
<h2 id="operational-excellence">Operational Excellence</h2>
<p>Operational excellence is another pillar that is interesting when you consider
load testing:</p>
<blockquote>
<p>"The Operational Excellence pillar includes operational practices and procedures used to manage production workloads. This includes how planned changes are executed, as well as responses to unexpected operational events.” (p. 33).</p>
</blockquote>
<p>As mentioned earlier in this article, spike testing and stress testing
are powerful ways to analyze the effects that planned or unplanned events may
cause. AWS advises customers to document, test, and regularly review “[all]
processes and procedures of operational excellence” (p. 33).</p>
<h2 id="reliability">Reliability</h2>
<p>As part of the reliability pillar the AWS white paper mentions
“Change Management” and asks “How does your system adapt to changes in demand?”
(p. 49). The listing for best practices contains two suggestions: Automated
scaling and load testing. While Amazon provides a multitude of automatically
scalable services (like Amazon S3, Amazon CloudFront or AWS Elastic Beanstalk)
the choice of a load and performance testing tool remains your own decision.
StormForger is both a low-barrier and a comprehensive tool for continuous
performance testing, with additional support for agile and DevOps teams. With our
JavaScript DSL to describe your test cases and our comprehensive reporting, we
provide you with everything to get started early and enable you to test as often as
needed to fulfill your requirements and learn about your system's behavior.
<a href="/how-to-start/">Learn more about how StormForger works</a>.</p>
<h2 id="performance">Performance</h2>
<p>The white paper actually speaks about the “Performance Efficiency”
pillar, which is a bit redundant, as performance actually means resource
efficiency.</p>
<p>On page 20 the white paper asks the question: “How do you select the best
performing architecture?”. AWS offers various kinds of support and services to
help answer this question and to help you establish a well-working
architecture. But there is one thing to keep in mind: “data obtained through
(…) load testing will be required to optimize your architecture” (p. 20). Of
course, we totally agree: you just can’t avoid load and performance testing!</p>
<p>Depending on your workload, different resource types and sizes may fit
your performance requirements. With a defined test case it is easy to rerun the
same test against different infrastructure setups and gather the needed data
and learnings.</p>
<blockquote>
<p>Deploy the latest version of your system on AWS using different resource types
and sizes, use monitoring to capture performance metrics, and then make a
selection based on a calculation of performance/cost. (p. 53)</p>
</blockquote>
<p>What AWS is suggesting here is what we call Configuration Testing. Configuration
Testing does not only cover your software configuration, but also your entire
environment. Rerunning performance tests is part of a review routine to
“ensure that you continue to have the most appropriate resource type” (p. 56).
Approaches change and new technologies develop – use these advances to refine
your architecture and improve its performance.</p>
<p>Load testing is also necessary in terms of using trade-offs to improve your
performance<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>: “Trade-offs can increase the complexity of your architecture,
and require load testing to ensure that a measurable benefit is obtained.” (p.
25)</p>
<p>It all boils down to this: performance testing is not a one-hit wonder. Continuity is
key to success, and you should be set up to do performance tests whenever
needed. In general, you should start testing early and do it on a regular basis.
If possible, integrate performance analysis into your development and testing
processes, ideally alongside your functional tests in your continuous
integration system.</p>
<h2 id="conclusion">Conclusion</h2>
<p>AWS released a fine and valuable white paper worth reading: a
practical guideline to design and steadily improve a well-architected system.
They offer a lot of useful services you might check out. We see most of the
problems it targets in the wild on a regular basis and make the same
recommendations.</p>
<p>We always recommend starting early with small, easy, and understandable test case
scenarios. It is easy to set up your first load test with StormForger for free.
<a href="https://app.stormforger.com/signup">Sign up</a>, <a href="https://docs.stormforger.com/guides/onboarding/">learn more in our documentation</a> and get in touch
with us for <a href="mailto:support@stormforger.com">a personal onboarding</a>.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p><a href="https://aws.amazon.com/de/blogs/aws/are-you-well-architected/">In a former blog post from October 2015 the author Jeff Barr counts only four pillars</a>. So, the here not mentioned pillar “operational excellence” seems to have gained more importance over the time. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Focused on space-time trade-offs for performance efficiency. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<hr>
<p>
Do you want to learn more about StormForger - Performance Testing as a Service?
<a href="https://stormforger.com">Sign up for free!</a>
</p>
Performance Testing – Pre and Post Cloud?
https://stormforger.com/blog/performance-testing-pre-and-post-cloud/
2016-08-19T10:00:00+00:00
2016-08-19T10:00:00+00:00
Sebastian Cohnen
<p><a href="/blog/performance-testing-pre-and-post-cloud/"><img src="/blog/performance-testing-pre-and-post-cloud/inline-left-sebastian-cohnen-performance-testing-cloud-architecture-2f1b52e5.jpg" alt="Cloud"></a></p>
<p>The last part of our blog post series "Why Load and Performance Testing in the Cloud?" answers the question: what is different about all this compared to pre-cloud times? The article is based upon a talk I gave at the <a href="https://aws.amazon.com/de/start-ups/loft/de-loft/">AWS PopUp Loft</a> in Berlin.</p>
<h2 id="difference-to-pre-cloud">Difference to pre-cloud?</h2>
<p>After my overview on the <a href="/blog/types-of-performance-testing/">types of testing</a> I'd like to raise the question: Is this actually any different from what it used to be pre-cloud?</p>
<p>I would have to answer clearly: yes and no. The testing needs and methods have not changed a lot. The needs have maybe increased a bit, because our environments tend to be more complex overall.</p>
<p>The main difference, though, is that multiple, scalable performance testing environments used to be very expensive and very hard to manage. Not many could actually do that for real projects.</p>
<p>This is where the automation capability of the cloud can really make a difference. If you have your entire infrastructure, services, servers, and code automated, spinning up testing environments suddenly becomes viable. You can create entire infrastructures and environments fully automatically, run a series of performance tests, gather all relevant data, and shut everything down again within a couple of hours and for very little money.</p>
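<p>As a sketch, such a disposable test environment boils down to a three-step cycle. Stack name, template file, and the test-runner command below are placeholders, and the commands are only echoed as a dry run:</p>
<div class="highlight"><pre class="highlight console"><code>
```shell
# Hypothetical spin-up / test / tear-down cycle for a disposable test
# environment, echoed as a dry run. The stack name, template file, and
# test-runner command are placeholders for your own tooling.
STACK="perf-test-$(date +%Y%m%d)"
echo aws cloudformation deploy --stack-name "$STACK" --template-file infra.yml
echo run-performance-tests --target-stack "$STACK"   # your test runner here
echo aws cloudformation delete-stack --stack-name "$STACK"
```
</code></pre></div>
<p>Because you only pay while the stack exists, the whole cycle costs a fraction of a permanent staging environment.</p>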
<p>One important thing to keep in mind, though, when aiming for such a testing environment: do not forget about state. By state I am referring to databases, warm caches, etc. Having a reproducible environment, including test data, is one of the bigger challenges. Frankly, I haven't seen a good and sound solution for that yet.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In a nutshell, you can state:</p>
<ul>
<li>Scaling resources is not the same as scaling your application</li>
<li>Understanding is very important to make your system efficient and scalable</li>
<li>Complexity is still around and will become apparent again in non-trivial ways when it comes to performance</li>
<li>You should utilize the cloud to make testing simpler</li>
</ul>
<p>Infrastructure and software architecture in general are nowadays ever changing and evolving. Getting solid data on how your system is performing is not only relevant for your application in particular, but is also an important piece when it comes to doing Continuous Architecture. Your architecture is ever evolving – so run tests and validate it!</p>
<p>Want to know more? Switch to the previous articles:</p>
<ul>
<li><a href="/blog/why-load-performance-testing-the-cloud/">Why Load and Performance Testing in the Cloud?</a></li>
<li><a href="/blog/types-of-performance-testing/">Types of Performance Testing</a></li>
</ul>
<h2 id="slides">Slides</h2>
<script async="" class="speakerdeck-embed" data-id="1bf9d5cc7fff40c7a9f48feb428c343a" data-ratio="1.77777777777778" src="//speakerdeck.com/assets/embed.js"></script>
<hr>
<p>
Do you want to learn more about StormForger - Performance Testing as a Service?
<a href="https://stormforger.com">Sign up for free!</a>
</p>
Types of Performance Testing
https://stormforger.com/blog/types-of-performance-testing/
2016-07-08T15:00:00+00:00
2016-07-08T15:00:00+00:00
Sebastian Cohnen
<p><a href="/blog/types-of-performance-testing"><img src="/blog/types-of-performance-testing/inline-right-sebastian-cohnen-testing-types-26e39795.jpg" alt="Cloud"></a></p>
<p>The second part of our blog posting series shows an overview on the different types of performance testing. Learn more about load testing, scalability testing, stress, spike and soak testing, configuration testing as well as availability and resilience testing. The article is based upon a talk I gave at the <a href="https://aws.amazon.com/de/start-ups/loft/de-loft/">AWS PopUp Loft</a>, <a href="https://devopsconference.de/">DevOpsCon 2016</a> and other occasions.</p>
<h2 id="types-of-testing">Types of Testing</h2>
<p>In the <a href="/blog/why-load-performance-testing-the-cloud/">last blog post</a> I wrote about performance, scalability and the importance of performance testing in the cloud era. So, now, let us take a closer look at the performance testing methods.</p>
<p>Needless to say, some of these types of testing are not really special in a cloud environment, while others are especially interesting.</p>
<h3 id="load-testing">Load Testing</h3>
<p>Load testing is arguably the simplest form of performance testing. You induce a normal or expected workload on a system under test and observe it. You can use load tests to determine general system behavior, latency and throughput. In general, load tests are used to verify your quality criteria.</p>
<h3 id="stress-testing">Stress Testing</h3>
<p>Stress testing is basically a load test, but we apply a higher-than-expected workload to see how the system behaves under serious stress and when exceeding its design limits. You want to learn when your system breaks and how it starts to fail in a serious traffic situation.</p>
<p>A typical approach is to steadily increase the load to see where the system under test begins to violate its non-functional requirements. You can use this "tipping point" to describe the capacity of the given system, like "we can handle 1000 concurrent users per application server before we start to violate our quality requirements".</p>
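As a sketch of that approach: the loop below steadily increases the load level until the observed 99th percentile violates the requirement, and the first violating level marks the tipping point. Both the load levels and the latency curve here are invented for illustration:

```python
# Sketch: ramp up the load level until the system under test violates
# its SLO; the first violating level is the "tipping point".
# `measure_p99` stands in for actually running a stress test at that
# load and reading back the observed 99th-percentile latency.

def find_tipping_point(measure_p99, slo_ms, load_levels):
    """Return the first load level whose p99 exceeds the SLO, or None."""
    for load in load_levels:
        if measure_p99(load) > slo_ms:
            return load
    return None

def synthetic_p99(load):
    # Purely illustrative latency curve: flat until the system
    # saturates around 1000 concurrent users, then degrading linearly.
    return 80 if load <= 1000 else 80 + (load - 1000) * 0.5

tipping_point = find_tipping_point(synthetic_p99, 250, range(100, 2001, 100))
```

With this made-up curve, a quality requirement of a 250ms p99 is first violated at a load level of 1400, which is exactly the kind of capacity statement described above.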
<h3 id="scalability-testing">Scalability Testing</h3>
<p>With scalability testing you change the perspective to answer the question: how effectively can I grow? You can run a series of stress tests and gather data on how effectively you really scale.</p>
<p>Using stress tests in a series where you steadily increase the system's resources, you can easily tell whether your system can translate them into additional capacity.</p>
<p><img src="/blog/types-of-performance-testing/scalability-analysis-fb53aecd.png" alt="Cloud" /></p>
<p>Knowing how well and how far your system will scale by adding resources, you can now make an informed decision: you might not need to do anything right now, you might need to take action (e.g. remove bottlenecks), or you might simply add more resources to mitigate a scaling issue for the time being. Suffice it to say that this is a basic requirement for capacity planning and cost estimation.</p>
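One way to quantify this from such a test series: divide the capacity measured with n instances by n times the single-instance capacity. A short sketch with invented numbers:

```python
# Compute scaling efficiency from a series of stress-test results:
# how much of the theoretically linear capacity gain did we keep?

def scaling_efficiency(capacity_by_instances):
    """capacity_by_instances: {instance count: measured capacity, e.g. rps}.
    Returns {instance count: efficiency}, where 1.0 means perfectly linear."""
    base = capacity_by_instances[1]
    return {n: cap / (n * base) for n, cap in capacity_by_instances.items()}

# Invented measurements: capacity in requests per second.
measured = {1: 250, 2: 480, 4: 900, 8: 1500}
efficiency = scaling_efficiency(measured)
```

Here the hypothetical system keeps 96% of the linear gain at two instances but only 75% at eight: precisely the kind of curve such a scalability analysis visualizes.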
<h3 id="spike-testing">Spike Testing</h3>
<p>Spike testing can be used to determine how well your system can cope with sudden traffic spikes. It is comparable to a load or stress test, but modeled as a sudden burst of traffic. It can be a good preparation for a planned marketing campaign or an unplanned event like being featured on Reddit or Hacker News. Spike Testing can tell you if you are making good use of the elasticity of the cloud when faced with these kinds of events.</p>
<h3 id="soak-testing">Soak Testing</h3>
<p>Soak testing, again, is basically a load test where you hold the load over longer periods of time to look for long-term effects, like memory leaks, disk space filling up, etc. The duration of a soak test depends on your situation. Usually a soak test runs for several hours.</p>
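A crude way to spot such long-term effects is to check whether a resource metric, say memory usage sampled at fixed intervals during the soak test, trends upward. A sketch using a simple least-squares slope; the sample data below is synthetic:

```python
# Sketch: detect a steady upward trend (e.g. a memory leak) in metric
# samples collected during a soak test, via an ordinary least-squares
# slope over equally spaced samples. The data below is synthetic.

def trend_slope(samples):
    """Least-squares slope of samples taken at equal time intervals."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

def looks_like_leak(samples, threshold=0.0):
    return trend_slope(samples) > threshold

leaky_mb = [100 + 2 * i for i in range(12)]   # grows 2 MB per interval
stable_mb = [100, 101, 100, 99, 100, 101, 100, 99, 100, 101, 100, 99]
```

In a real soak test you would of course use many hours of monitoring data and a sensible threshold, not twelve samples; the point is only that trend detection over time is what distinguishes a soak-test analysis from a snapshot.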
<h3 id="configuration-testing">Configuration Testing</h3>
<p>While load, stress, spike and soak testing are not particularly special when it comes to the cloud, the next testing method is one of the most interesting ones: Configuration Testing.</p>
<p>The perspective now shifts to looking at how performance changes when the configuration is modified. The change might be positive, but can also be negative in case you want to optimize for costs (remember: performance is resource usage per unit of work, and money is also a resource). The point is that you know, and can quantify, the change.</p>
<p>Configuration can be almost anything here: your environment, services that you are using, dependencies of your software — all can be seen as configuration.</p>
<p>Configuration testing is, for obvious reasons, a very important tool in order to learn about the impact of a system's environment to its performance. It is always a series of test runs where you compare and analyze the impact of multiple configurations.</p>
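In practice that comparison boils down to computing relative changes between runs. A minimal sketch with invented numbers:

```python
# Compare two configuration test runs via the relative change per metric.
# A positive change means the metric went up; whether that is good
# (throughput) or bad (latency, cost) depends on the metric.

def relative_change(baseline, candidate):
    """Percent change per metric, from the baseline run to the candidate run."""
    return {
        metric: (candidate[metric] - value) / value * 100
        for metric, value in baseline.items()
    }

# Invented numbers: a config change that buys throughput at a latency cost.
baseline = {"rps": 250, "p99_ms": 200, "cost_per_hour": 4.0}
candidate = {"rps": 300, "p99_ms": 230, "cost_per_hour": 4.0}
change = relative_change(baseline, candidate)
```

With these made-up figures the candidate configuration gains 20% throughput for 15% worse p99 latency at the same cost, and whether that trade-off is acceptable is a business decision, not a testing one.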
<h4 id="configuration-options">Configuration Options</h4>
<p>Now, it starts to get really interesting: We have a lot of configuration options to choose from when it comes to the cloud. These options include:</p>
<ul>
<li>Instance types selection (for EC2, RDS, EC, ES)</li>
<li>Auto scaling configuration (Scaling Policies, Instance Launch Time, Scaling Lifecycle)</li>
<li>Throughput provisioning (EBS IOPS, DynamoDB throughput, Kinesis bandwidth)</li>
<li>Service usage optimization (ELB pre-warming, Index Usages)</li>
</ul>
<p>Finally, you have approached the parts that you have more or less fully under your control, like the operating system, network stack and other kernel settings, software/web server/app server configuration, dependencies, etc. etc.</p>
<p>To emphasize this once more: The aim is to look for change in performance. So, you can use configuration testing to optimize for costs as well, while you know what kind of trade-off you are taking.</p>
<h3 id="availability--resilience-testing">Availability & Resilience Testing</h3>
<p>The last type of performance testing I will introduce is availability & resilience testing. I will only briefly touch on this, because it is probably enough for a blog article of its own. The idea is to look at certain processes and behaviors under load and check whether you have them covered:</p>
<ul>
<li>What about deployments? Possible even with DB migrations under load?</li>
<li>Ever changing, ever evolving infrastructure? Automatic scaling environments?</li>
<li>Failure scenarios and failover mechanisms?</li>
</ul>
<h4 id="principles-of-chaos-engineering">Principles of Chaos Engineering</h4>
<p>The idea to give this its own testing category is inspired by the <a href="http://principlesofchaos.org">Principles of Chaos Engineering</a>. You have most probably checked these points in some form of manual or automated functional testing, but did you also check them under load? Not everyone is Netflix and can do that in production, but at least consider doing it in a test environment using artificial traffic.</p>
<p>Want to know more? Take a look at the previous article:</p>
<ul>
<li><a href="/blog/why-load-performance-testing-the-cloud/">Why Load and Performance Testing in the Cloud?</a></li>
</ul>
<p>And here you find the third part of our blog posting series:</p>
<ul>
<li><a href="/blog/performance-testing-pre-and-post-cloud/">Performance Testing – Pre and Post Cloud?</a></li>
</ul>
<h2 id="slides">Slides</h2>
<p>Find the slides of the DevOpsCon Berlin 2016 talk [DE] here:</p>
<script async="" class="speakerdeck-embed" data-id="8ca1161b9cc24ece87d5c29809c0a659" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
<hr>
<p>
Do you want to learn more about StormForger - Performance Testing as a Service?
<a href="https://stormforger.com">Sign up for free!</a>
</p>
Why Load & Performance Testing in the Cloud?
https://stormforger.com/blog/why-load-performance-testing-the-cloud/
2015-11-14T09:00:00+00:00
2015-11-14T09:00:00+00:00
Sebastian Cohnen
<!-- [![Sebastian Cohnen](/blog/2015-11-14-why-load-performance-testing-the-cloud/Sebastian-Cohnen_Why-load-and-performance-testing-the-cloud.jpg)]() -->
<p><a href="/blog/why-load-performance-testing-the-cloud"><img src="/blog/why-load-performance-testing-the-cloud/inline-left-header-stormforger-load-test-api-http-8e3841a8.jpg" alt="Cloud"></a></p>
<p>At the <a href="https://aws.amazon.com/de/start-ups/loft/de-loft/">AWS PopUp Loft</a> in Berlin I gave a talk on why load and performance testing is still (or especially) relevant in the era of cloud infrastructure. I expanded the talk for AWS Summit Berlin 2016 and again at <a href="https://devopsconference.de">DevOpsCon 2016</a>. Here in our blog you will find it in three crunchy bites, beginning with performance and scalability and the importance of performance testing in the cloud era. I will give an overview on the different types of testing and explain the differences between pre and post cloud times.</p>
<h2 id="abstract">Abstract</h2>
<p>The Cloud™ is infinite and scalable. Period.
Why is it important to test the performance and scalability characteristics of a cloud-based system? Won't AWS scale for me as long as I can afford it?
Yes, but… AWS only operates and scales resources. They won't automatically make your system fast, stable and — more importantly — scalable. Performance testing is crucial to understand your system, your architecture design and your cloud hosting environment.</p>
<h2 id="performance">Performance</h2>
<p>I state: Load and performance testing is still a thing in the cloud. To prove that, I will briefly cover some basics, starting with performance.</p>
<p>Performance is often used in a not very well defined way. The most abstract description of performance I can think of is the ability of a system to fulfill a task within defined dimensions. It is a measurement for efficiency! The most often used dimension is time and a unit of work could be a request or transaction. This would then describe the performance of a system in terms of request latency for example.</p>
<p>You can extend this definition to describe the efficiency of a system or server to get a performance statement like "1 instance can handle 250 rps at a 99th percentile of 250ms".</p>
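Statements like that are derived from raw measurements. As an illustration, here is the nearest-rank method for computing a percentile from latency samples (real tools may use a different method, e.g. interpolation between neighboring samples):

```python
# Nearest-rank percentile over latency samples, as used in statements
# like "a 99th percentile of 250ms". Note that tools differ in the
# exact method; some interpolate between neighboring samples.
import math

def percentile(samples, pct):
    """Smallest sample value such that pct percent of samples are <= it."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

latencies_ms = list(range(1, 101))  # stand-in for measured request latencies
p99 = percentile(latencies_ms, 99)
```

The percentile matters because an average hides outliers: a system can have a pleasant mean latency while one request in a hundred is unacceptably slow.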
<h2 id="scalability">Scalability</h2>
<p>Next up: Scalability. That is a term often used alongside or even interchangeably with performance, even though the two are not as closely related as many think. While performance is a measurement of efficiency, scalability describes how effectively you can translate additional resources into additional capacity.</p>
<p>This distinction is crucial: when you want to design a scalable system, you don't want to fall for performance optimizations when you are actually aiming for scalability.</p>
<p>An example of an ideally scaling system: if you add tenfold the resources and get tenfold the capacity, and you can sustain that linear scaling, then you have done a terrific job!</p>
<p>But you have to keep in mind that one does not imply the other. You can have an inefficient system that scales very well. Think of a video encoding service that can be scaled very well simply by adding servers. But you can also have a highly efficient, optimized system that might not be scalable at all, maybe because the communication overhead between multiple nodes would make that impossible.</p>
<p><a href="https://twitter.com/jboner">Jonas Bonér</a> (Founder & CTO, Lightbend) has a nice <a href="http://www.slideshare.net/jboner/scalability-availability-stability-patterns/15-How_do_I_know_if">presentation</a> on that topic and he asks two good questions:</p>
<blockquote>
<p>How do I know if I have a <strong>performance problem</strong>? If your system is slow for a single user! <br />
How do I know if I have a <strong>scalability problem</strong>? If your system is fast for a single user, but slow under heavy load!</p>
</blockquote>
<h2 id="load-and-performance-testing">Load and Performance Testing</h2>
<p>So what is performance testing? The Wikipedia article on software performance testing gives a nice definition:</p>
<blockquote>
<p>In software engineering, performance testing is a testing practice performed to determine how a system performs in terms of responsiveness and stability under a particular workload. — Wikipedia on <a href="https://en.wikipedia.org/wiki/Software_performance_testing">Software Performance Testing</a></p>
</blockquote>
<p>In other words: Performance testing is a family of non-functional testing methods, which all:</p>
<ul>
<li>induce a well defined workload</li>
<li>in order to observe the system under test</li>
<li>in order to verify and understand its performance characteristics</li>
</ul>
<p>There are a bunch of testing methods that can be seen as types of performance testing. The main difference is that they have different goals and testing perspectives. I will have a look at all the types in part 2.</p>
<h2 id="so-why-performance-testing-in-the-cloud">So Why Performance Testing In The Cloud?</h2>
<p>I think it is settled that the cloud is here to stay. Some of the most interesting things the cloud offers us are IaaS, PaaS and other-things-as-a-Service. Combined with APIs, a high degree of automation and resources that are available on demand, we get a cost-effective and scalable environment that was not available in this dimension before.</p>
<p>It is obvious that you should care about performance. But some might ask: why care about performance testing in the cloud? You can always take out your credit card and throw more resources at the problem. But scaling resources is not the same as scaling your application. You have to design your system very carefully to make that statement true, as there are many pitfalls to avoid.</p>
<p>A simple — possibly even stupid — example would be that you have your ELB set up (which scales automatically), and your application servers are scaled automatically by the AWS Auto Scaler. But you forgot to take care of scaling your persistence layer as well.</p>
<p><a href=""><img src="/blog/why-load-performance-testing-the-cloud/cloud-scaling-elb-asg-rds-19e0d128.png" alt="Cloud" /></a></p>
<p>The reason why you still have to think a lot, even in a managed environment like AWS, comes down to one word: understanding.</p>
<h2 id="know-your-application-architecture">Know your application architecture!</h2>
<p>You need to understand the performance-related characteristics and implications of your application architecture, and it doesn't really matter whether it is a distributed, microservice-style or monolithic system. You are also running your system in a complex environment, in this case the cloud. And you might utilize third-party services that also have an impact on your system's performance.</p>
<p>The reason why you have to have a good understanding of what is going on when it comes to performance is that the complexity has not simply vanished. It is a lot easier these days to get started and run systems based on managed cloud resources and services. But the underlying complexity has merely moved somewhere else (to become someone else's problem).</p>
<p>The issue is that this complexity can have a non-trivial impact on performance, which can make it very hard to reason about a system if you have to treat it as a black box.</p>
<p>Read more in part 2 and part 3 of our blog posting series:</p>
<ul>
<li>
<p><a href="/blog/types-of-performance-testing/">Types of Performance Testing</a></p>
</li>
<li>
<p><a href="/blog/performance-testing-pre-and-post-cloud/">Performance Testing – Pre and Post Cloud?</a></p>
</li>
</ul>
<h2 id="slides">Slides</h2>
<script async="" class="speakerdeck-embed" data-id="1bf9d5cc7fff40c7a9f48feb428c343a" data-ratio="1.77777777777778" src="//speakerdeck.com/assets/embed.js"></script>
<hr>
<p>
Do you want to learn more about StormForger - Performance Testing as a Service?
<a href="https://stormforger.com">Sign up for free!</a>
</p>
Performance Testing an E-commerce Platform
https://stormforger.com/blog/performance-testing-e-commerce-platform/
2015-08-13T09:00:00+00:00
2015-08-13T09:00:00+00:00
Sebastian Cohnen
<p><a href="/blog/performance-testing-e-commerce-platform/"><img src="/blog/performance-testing-e-commerce-platform/inline-left-galeria-kaufhof-e552ea29.png" alt="Galeria Kaufhof"></a>
In June 2015 our customer GALERIA Kaufhof relaunched their E-commerce platform
galeria-kaufhof.de.</p>
<p>Several teams worked for about a year on this greenfield project aimed at
building a new foundation for their customer shopping experience.</p>
<p>Prior to the official relaunch we conducted comprehensive performance and load
tests for quality assurance, configuration testing and capacity planning.</p>
<h2 id="new-architecture">New Architecture</h2>
<p><a href="https://inoio.de">inoio</a>, one of GALERIA Kaufhof's contractors, wrote a
nice (German) <a href="https://inoio.de/blog/2014/09/20/technologie-sprung-bei-galeria-kaufhof/">blog post</a>
(<a href="https://translate.google.com/translate?hl=en&sl=de&u=https://inoio.de/blog/2014/09/20/technologie-sprung-bei-galeria-kaufhof/">translation</a>)
on the general architecture of this endeavor.</p>
<p>The goal of the rebuild was to get rid of the monolithic system and introduce a
new scalable,
<a href="http://en.wikipedia.org/wiki/Shared_nothing_architecture">shared-nothing</a>,
<a href="https://www.innoq.com/de/links/self-contained-systems-infodeck/">self contained
systems</a>
architecture to be ready for future features with a reduced time-to-market.
Everything from the operational environment up to the user interface and user
experience was redesigned and built from scratch.</p>
<p>Kaufhof's organization and architecture are divided along business domain areas into scrum
teams:</p>
<p><img src="/blog/performance-testing-e-commerce-platform/inline-right-overview-77c049aa.png" alt="Overview" /></p>
<ul>
<li><strong>Front-end Integration</strong>: Integrating the systems to a coherent website
experience</li>
<li><strong>Explore</strong>: Controlling teasers</li>
<li><strong>Search</strong>: Product search and navigation</li>
<li><strong>Evaluate</strong>: Details on products</li>
<li><strong>Order</strong>: Order process</li>
<li><strong>Control</strong>: Customer account handling</li>
<li><strong>Foundation Systems</strong>: Horizontal services like media & asset delivery,
feature toggles etc.</li>
<li><strong>Platform Engineering</strong>: The machine room. Covering tools, deployment and
platform operations.</li>
</ul>
<p>Systems contained in these domain areas, also called "verticals", are designed
to</p>
<ul>
<li>not share any code</li>
<li>be loosely coupled and</li>
<li>technologically independent (they currently have Java/Scala and Ruby on Rails
teams).</li>
</ul>
<p>The only thing these self contained systems share, is the platform they operate
on. Each of these systems has the authority about its front-end, business logic
down to data storage.</p>
<h2 id="testing-approach">Testing Approach</h2>
<p>We joined the effort backed by the platform engineering team (PENG for short).
PENG provided insights like log and monitoring data while we conducted our
testing and analysis.</p>
<p>Our general approach is to assess systems in a bottom-up approach. For this
project we ended up with three test categories:</p>
<ol>
<li>
<p><strong>Service Tests</strong>: We first identified performance critical components of
each vertical and developed a test case scenario for each. The purpose of these
fine-grained test cases is to get a starting point and baseline for further
testing. These isolated tests are also used to artificially stress code paths
when troubleshooting performance related issues.</p>
</li>
<li>
<p><strong>Vertical Tests</strong>: The next step is to compile these isolated tests into
vertical or service-wide tests that will cover multiple endpoints and features
of a given service. These tests are still isolated on a vertical level so that
each team can run those tests without affecting other systems.</p>
</li>
<li>
<p><strong>Combined Tests</strong>: In the last step more complete tests are compiled. These
tests are modeled after reference data from the existing shop system and will
reuse scenarios developed in previous tests. By design these tests will cover
almost all verticals and are aimed at identifying previously undiscovered
performance critical dependencies. The goal is to get a broader view at the
system and it is the first time a more user-centric workload is modeled.</p>
</li>
</ol>
<p>There were also a couple of tests to establish a baseline in network performance
and latency. As well as a set of specialized tests to stress the shop's content
delivery and caching architecture.</p>
<h2 id="infrastructure-provider-evaluation">Infrastructure Provider Evaluation</h2>
<p>Using test cases from our first step as reference, we evaluated different
<a href="https://www.openstack.org/">OpenStack</a>-based providers and their <em>flavors</em> of
compute instances and other configuration aspects. All providers were located in
Europe, but we still had to verify that the bandwidth and base latency between
the individual providers and our load generators in Frankfurt (AWS
<em>eu-central-1</em>) and Dublin (AWS <em>eu-west-1</em>) are well known and understood.</p>
<p>The platform engineering team built an operating environment to which each team
can deploy their applications. They make this possible by utilizing OpenStack
APIs and automating the entire provisioning process using
Puppet. This made it very simple
to bootstrap and manage different target environments and e.g. make
configuration related changes.</p>
<p>For our customer we ended up comparing multiple providers with traffic
originating from Frankfurt and Dublin.</p>
<blockquote>
<p><em>"The flexibility of StormForger enabled our platform engineering team to run
large load tests targeting different environments and data centers with ease."</em></p>
<p>— <em>Torsten Hamper, Head of System Engineering eShop Systems, GALERIA
Kaufhof GmbH</em></p>
</blockquote>
<h2 id="conclusion">Conclusion</h2>
<p>GALERIA Kaufhof succeeded with their relaunch project and managed to create a
modern, well designed and scalable E-commerce platform.</p>
<p>Evaluating different target environments is as important as testing system- and
application-level configurations. Understanding the basic performance
characteristics is also crucial for capacity planning and resource estimation,
e.g. when you have to ensure the system is ready for big marketing events and
traffic spikes.</p>
<p>In case you speak German and would like to read more about the ongoing development,
check out GALERIA Kaufhof's blog at
<a href="https://galeria-kaufhof.github.io/">galeria-kaufhof.github.io</a>.</p>
<hr>
<p>
Do you want to learn more about StormForger - Performance Testing as a Service?
<a href="https://stormforger.com">Sign up for free!</a>
</p>
Building a Startup with NoSQL
https://stormforger.com/blog/building-a-startup-with-nosql/
2014-12-10T10:00:00+00:00
2014-12-10T10:00:00+00:00
Sebastian Cohnen
<p>At NoSQL Matters 2014 conference in Barcelona I was privileged to give a talk on how we at StormForger go about screening and selecting NoSQL and other technologies; with which to build our products. Although NoSQL is not necessarily the primary focus here at StormForger, we still use a number of these exciting systems and I would like to share our thoughts.</p>
<p><img src="/blog/building-a-startup-with-nosql/tisba-at-nosql-matters-bcn-3036df2a.jpg" alt="Sebastian at NoSQL Matters"></p>
<h2 id="startups-and-nosql">Startups and NoSQL</h2>
<p>Startups are agile & open minded; they often consist of small teams and have to be pragmatic for that reason. I won't go into defining NoSQL; I'd like to refer to the <a href="https://en.wikipedia.org/wiki/NoSQL">Wikipedia definition</a> for that. What is important is that NoSQL is (or at least should be) selected for one of two reasons: ease of use, or because you have a <strong>very</strong> special problem to solve.</p>
<p>I'd argue that most of the time, you don't have an uber-special problem to solve. You are probably not dealing with big data (TB+ of data, or many billions of items), you don't have to guarantee five-nines availability and you don't have to scale right away to millions of requests per second…</p>
<p>If you want to be lean and agile, you should focus on the ease of use aspects (development & operation). NoSQL is about modeling data in other means than tabular relations, so maybe your data model fits to a document store, or column store, or key-value store. If that is the case, it might be "easier" to just use a NoSQL database and not try to fiddle around with mapping your data structures to <a href="http://www.postgresql.org/">PostgreSQL</a> or <a href="http://www.mysql.com">MySQL</a>.</p>
<p>But ease of use does not stop with how you model your data. Keep in mind that you still need integration into your environments, languages and frameworks, you need good tooling for your operation needs, general maturity and reliability is also important and last but not least, you need some kind of support — community or commercial.</p>
<h2 id="polyglot-persistence-at-stormforger">Polyglot Persistence at StormForger</h2>
<p>We at StormForger — besides being a startup — have different kinds of needs for data persistence. Following the <a href="http://martinfowler.com/bliki/PolyglotPersistence.html">polyglot persistence</a> approach, we have found different tools for each job.</p>
<p>Here are some examples for NoSQL usage at StormForger. We use…</p>
<ul>
<li>InfluxDB for time series data,</li>
<li><a href="http://redis.io">Redis</a> for all our caching needs, and</li>
<li><a href="http://www.elasticsearch.org">Elasticsearch</a> for log aggregation & analysis.</li>
</ul>
<p>There is no one size fits all solution for us — and most probably not for you either!</p>
<h2 id="being-lean-and-pragmatic">Being lean and pragmatic</h2>
<p>We have one more use case for data which is not really a good fit for tabular relations: Our test case definitions consist of highly structured and complex data structures.</p>
<p>Although we have begun to evaluate solutions for this need as well, we saw that we can be pragmatic about that for now. Our current solution? <strong>Serialize the JSON and just stuff it into MySQL</strong>. Although we might be in need for a more sophisticated solution in the near future, we don't need it right now to build our MVP and test our first assumptions about what the customer actually wants.</p>
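The approach really is that simple. A minimal sketch, using SQLite as a stand-in for MySQL and a purely illustrative table layout:

```python
# "Serialize the JSON and just stuff it into MySQL": store a complex,
# highly structured document in a plain TEXT column. SQLite stands in
# for MySQL here, and the table layout is purely illustrative.
import json
import sqlite3

def save_test_case(conn, name, definition):
    conn.execute(
        "INSERT INTO test_cases (name, definition) VALUES (?, ?)",
        (name, json.dumps(definition)),
    )

def load_test_case(conn, name):
    row = conn.execute(
        "SELECT definition FROM test_cases WHERE name = ?", (name,)
    ).fetchone()
    return json.loads(row[0]) if row else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_cases (name TEXT PRIMARY KEY, definition TEXT)")
```

The obvious trade-off: you cannot query inside the document without pulling it back out, which is exactly why this only stays viable as long as you don't need to.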
<p>Especially with limited resources that you have in a startup context it's very important to take a step back every time you encounter a new interesting technology. I myself often also fall for the fancy new stuff, but as a startup we want to get one thing right: <strong>be lean and test if the product we envision is actually what the customer wants.</strong></p>
<h2 id="conclusion">Conclusion</h2>
<p>It can be perfectly fine <strong>not</strong> to have the optimal technical solution upfront. Maybe it's fine to serialize your structured data like we do, or maybe you can just use PostgreSQL's <a href="http://www.postgresql.org/docs/9.4/static/hstore.html">hstore</a> or the upcoming <a href="http://www.postgresql.org/docs/9.4/static/datatype-json.html">jsonb fields</a>.</p>
<p>If your product is indeed fundamentally based on dealing with, e.g., highly structured data directly, take a look at document stores with powerful query capabilities, like <a href="https://www.arangodb.com/">ArangoDB</a>. Or maybe you have to crunch terabytes of data with tools from the <a href="https://hadoop.apache.org/">Hadoop</a> ecosystem. In all other cases: Be agile, think lean and focus on validating your ideas!</p>
<p><em>Besides embedded here, you can find the slides on <a href="https://speakerdeck.com/tisba/building-a-startup-with-nosql">Speaker Deck</a>.</em></p>
<script async="" class="speakerdeck-embed" data-id="a53838005b730132958c5acb03821d2a" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
<hr>
<p>
Do you want to learn more about StormForger - Performance Testing as a Service?
<a href="https://stormforger.com">Sign up for free!</a>
</p>
2nd Screen Mobile App: Backend API Load Testing
https://stormforger.com/blog/2nd-screen-mobile-app-backend-api-load-testing/
2014-11-12T08:57:00+00:00
2014-11-12T08:57:00+00:00
Sebastian Cohnen
<p>Having had the opportunity to support the <a href="https://www.grandcentrix.net/">grandcentrix</a> team in May to <a href="/blog/load-testing-an-interactive-tv-show-with-over-1-million-users/">help Quizduell im Ersten</a>, I was happy to once again be called to assess another interactive TV show with a 2nd screen app.</p>
<p><a href="/blog/2nd-screen-mobile-app-backend-api-load-testing/"><img src="/blog/2nd-screen-mobile-app-backend-api-load-testing/inline-right-logo_rtli-f63e5400.png" alt="RTL interactive"></a>
For <a href="http://www.rtl-interactive.de/">RTL interactive</a> I tested a sophisticated backend for an upcoming casting show. The backend was accessed through their <em>RTL Inside</em> app. This opportunity presented me with a very interesting scenario and in this post I would like to outline how their system was designed and how I was able to test the performance and scalability of the architecture.</p>
<blockquote>
<p><img src="/blog/2nd-screen-mobile-app-backend-api-load-testing/jerome_patt_rtl_interactive-30a6ab31.png" alt="Jérôme Patt, Project Manager at RTL interactive" />
<em>"We were actually quite relaxed during the show premiere. Everything worked
flawlessly and our complex workflows performed pretty well. Sebastian provided a
great deal of help with his professional guidance and the extensive load tests!
This brought us very close to an optimal situation, where the going live is not
the first-time peak-traffic for the architecture we designed. Load testing from
a professional third party is a major success factor for a software project and
should be on everyone's top priority list."</em></p>
<p>— <em>Jérôme Patt, Project Manager at <a href="http://www.rtl-interactive.de/">RTL interactive</a></em></p>
</blockquote>
<p>Please note that, for obvious reasons, I'm not allowed to mention any further specific details. Suffice it to say that RTL interactive (RTLi for short) was pleased with the findings and all the systems tested performed well in production. Please direct your inquiries regarding RTL interactive, RTL, or the mentioned show to RTL's <a href="http://www.rtl-interactive.de/cms/presse/kontakt.php">press office</a>. If you have a similar project or any questions regarding performance, load or scalability testing, please feel free <a href="mailto:support@stormforger.com">to get in contact</a> with me!</p>
<h2 id="the-show">The Show</h2>
<p>The show format is that of a casting show (otherwise known as a <a href="https://en.wikipedia.org/wiki/Talent_show">talent show</a>). Several acts are introduced to the audience and perform behind a giant wall of TV screens.</p>
<p>The unique feature of this casting show is that the viewers at home are also the jury. While an act performs on stage, the audience at home can vote on whether they like it or not. The big display wall fills up with pictures of viewers who voted in favor of the act. As the voting progresses, the audience's appreciation is shown, and once the act reaches 75% approval before the performance is over, the wall is lifted and the act may continue to the next round.</p>
<p>Before viewers can participate, they have to download and install the <em>RTL Inside</em> app and register to be a show juror. Optionally they can then choose to</p>
<ul>
<li>upload a photo,</li>
<li>use their Facebook picture or</li>
<li>use no picture at all.</li>
</ul>
<p>For every show act there are two interaction steps for the viewers at home:</p>
<ol>
<li>check-in</li>
<li>vote</li>
</ol>
<ul>
<li>
<p><strong>Check-In</strong>: In the check-in phase viewers have to signal whether they are going to vote for the next act or not. The check-in is required and is later used to calculate the total percentage value. When the check-in phase is over and the act starts to perform, the voting phase begins.</p>
</li>
<li>
<p><strong>Vote</strong>: During the voting phase, viewers use the <em>RTL inside</em> app to signal whether they like the current act or not.</p>
</li>
</ul>
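<p>The voting mechanics boil down to simple arithmetic: the approval value is the share of checked-in viewers who voted in favor of the act. A minimal sketch of this idea (the function names and the exact formula are assumptions for illustration; the real backend logic was not published):</p>

```python
WALL_THRESHOLD = 75.0  # the act advances once it crosses 75% approval


def approval_percentage(checked_in: int, votes_in_favor: int) -> float:
    """Share of checked-in viewers who voted for the act, in percent."""
    if checked_in == 0:
        return 0.0
    return 100.0 * votes_in_favor / checked_in


def wall_lifts(checked_in: int, votes_in_favor: int) -> bool:
    """True once the display wall would be lifted for this act."""
    return approval_percentage(checked_in, votes_in_favor) >= WALL_THRESHOLD
```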
<p><a href="http://www.rtl.de"><img src="/blog/2nd-screen-mobile-app-backend-api-load-testing/inline-right-logo_rtl-5f1fb72b.png" alt="RTL" /></a>
Needless to say, an interactive show format like this running on <a href="http://rtl.de">RTL</a> has strict quality and performance requirements. Besides the potential reach of such a show, the synchronized check-in and voting phases were expected to produce a significant number of requests at their peak.</p>
<blockquote>
<p><em>About 2,000,000 jurors signed up using the "RTL inside" app, resulting in 7,400,000 positive votes and 9,900,000 check-ins.</em></p>
<p>— RTL press release, <em>roughly translated from German</em>.</p>
</blockquote>
<h2 id="background--architecture">Background & Architecture</h2>
<p>The backend systems were developed in-house at RTL interactive. I had the opportunity to work very closely with the development and operation teams and was impressed by the chosen architectural approaches.</p>
<p><a href="http://aws.amazon.com"><img src="/blog/2nd-screen-mobile-app-backend-api-load-testing/inline-right-logo-aws-8407fa01.png" alt="Amazon Web Services" /></a></p>
<p>RTLi decided to run the show's backend systems on <a href="https://aws.amazon.com">Amazon Web Services</a> (AWS). They committed to an interesting hybrid approach, using many AWS managed services together with custom services built on top of <a href="https://aws.amazon.com/ec2/">Amazon EC2</a>.</p>
<p>The <em>RTL Inside</em> app had a special area created for the show, where an in-app browser is used to render the user interface. The app communicates directly (authenticated via <a href="http://docs.aws.amazon.com/general/latest/gr/signature-version-4.html">AWS Signature Version 4</a>) with various AWS services, such as <a href="https://aws.amazon.com/s3/">Amazon S3</a> and <a href="https://aws.amazon.com/sqs/">Amazon SQS</a>. This allows the system to scale almost automatically without having to deal with <a href="https://aws.amazon.com/autoscaling/">auto scaling</a> or any other complicated moving parts. Other components were handled by services running on Amazon EC2 and 3rd party content delivery networks.</p>
<h2 id="performance-load--scalability-testing">Performance, Load & Scalability Testing</h2>
<p>When launching a high-profile interactive TV show format like this, you have to test relentlessly and extensively before going live. Even though the RTLi team put great effort into making optimal use of the scaling properties of AWS, dynamic performance and scalability testing is always mandatory. Results and findings of our load tests were also used as a basis for the capacity planning process.</p>
<p>The system itself was very nice to test, since the majority of interactions happened through AWS APIs. The architecture makes heavy use of Amazon SQS internally, which results in a highly decoupled environment. As a result, most parts could be tested in isolation, one after another, which is always a great testing property, especially when you aim for very high throughput and service quality. Another great aspect: while the team was investigating a finding from a previous test run, the other system components could still be tested independently, cutting down on the time invested.</p>
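<p>Such a decoupled design keeps each worker a pure function of its queue messages, which is exactly what makes isolated testing possible. A hypothetical sketch (the message format and names are invented for illustration; the actual RTLi services were not published):</p>

```python
import json


def handle_vote_message(body: str) -> dict:
    """Process one (hypothetical) vote message the way an SQS worker would.

    Because the worker only depends on the message body, it can be
    load-tested on its own with synthetic messages, without the queue,
    the app, or the rest of the pipeline.
    """
    msg = json.loads(body)
    return {
        "user_id": msg["user_id"],
        "act_id": msg["act_id"],
        "in_favor": bool(msg["in_favor"]),
    }
```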
<h3 id="modelling-test-cases">Modelling Test Cases</h3>
<p>Several test scenarios were modelled: from simple cases testing single components to complete, comprehensive test cases that interact with all APIs just like a user would through the app: sign-up, photo upload, state polling, check-in, vote.</p>
<p>Since almost all test cases had to interact with AWS, I had to implement the <a href="http://docs.aws.amazon.com/general/latest/gr/signature-version-4.html">AWS Signature Version 4</a> calculation as efficiently as possible. Pre-computation was not an option, since the signature is based on the request (e.g. its payload) and every user has their own unique API keys, authentication and device tokens. Efficiency was important because we wanted to test with <strong>a lot</strong> of users, all of them generating a good number of concurrent requests!</p>
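<p>At its core, SigV4 derives a per-day signing key through a chain of HMAC-SHA256 operations and then signs a canonicalized request digest with it. A sketch of that key derivation, following the algorithm as documented by AWS (building the canonical request and string-to-sign is omitted here):</p>

```python
import hashlib
import hmac


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """Derive the AWS SigV4 signing key for one day/region/service."""
    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date)  # date like "20141112"
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")


def sign(secret_key: str, date: str, region: str, service: str,
         string_to_sign: str) -> str:
    """Sign the (already canonicalized) string-to-sign; returns a hex digest."""
    key = signing_key(secret_key, date, region, service)
    return hmac.new(key, string_to_sign.encode("utf-8"),
                    hashlib.sha256).hexdigest()
```

<p>Note that since the signing key depends only on day, region and service, it can at least be cached per user and day, even though the final signature still has to be computed per request.</p>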
<p>Some services were developed and operated by RTLi using Amazon EC2 and <a href="https://aws.amazon.com/elasticloadbalancing/">Amazon Elastic Load Balancing</a>. Needless to say, they had to be thoroughly tested as well. General service quality, performance and scalability were the primary testing goals, besides system stability in edge-case and disaster scenarios. Having a solid basis for capacity estimation and understanding the scaling properties of those systems and services was very important, too.</p>
<h3 id="end-to-end-test">End-to-End Test</h3>
<p>Despite the testability of system components in isolation, we also performed <strong>extensive end-to-end tests</strong> to assess the entire process chain, spanning from processes running on Amazon EC2 and background workers consuming Amazon SQS messages down to data feeds for the TV studio and administrative dashboards reading data from various metric systems and <a href="https://aws.amazon.com/dynamodb/">Amazon DynamoDB</a>. In addition to the client-facing APIs, several background systems were involved, e.g. to process votes and to aggregate and analyze data.</p>
<p>The extensive end-to-end tests served several goals:</p>
<ul>
<li>testing component integration under load</li>
<li>gathering data for capacity planning</li>
<li>performing scalability analysis for services running on Amazon EC2</li>
<li>assessing the system stability under stress and peak workloads</li>
<li>ensuring that service quality requirements are fulfilled</li>
</ul>
<p>On a related note: AWS also recommends testing architectures built with AWS services in a proof-of-concept approach. As always, be sure to give your provider a heads-up when you run large scale load tests! :) For AWS this could mean letting them <a href="https://aws.amazon.com/articles/1636185810492479#pre-warming">pre-warm Amazon Elastic Load Balancers</a>, changing the <a href="http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html">partitioning of Amazon S3 buckets</a> for very high throughput or increasing all kinds of <a href="http://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html">account limits</a>. If you are in doubt, reach out to <a href="https://aws.amazon.com/premiumsupport/">AWS Support</a>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We had a number of important findings which could all be addressed prior to the first show and I was told that all systems worked flawlessly in production. Together with the excellent AWS Enterprise Support, the RTLi team and I were able to pinpoint unexpected effects and conduct root cause analysis for strange latency impacts and service behavior we were seeing during tests.</p>
<hr>
<p>
Do you want to learn more about StormForger - Performance Testing as a Service?
<a href="https://stormforger.com">Sign up for free!</a>
</p>
Coming up: A whole week on performance and infrastructure in Barcelona, Spain - we are in!
https://stormforger.com/blog/coming-up-a-whole-week-on-performance-and-infrastructure-in-barcelona-spain-we-are-in/
2014-11-08T17:07:00+00:00
2014-11-08T17:07:00+00:00
Lars Wolff
<p><img src="/blog/coming-up-a-whole-week-on-performance-and-infrastructure-in-barcelona-spain-we-are-in/2014-11-07-barcelona-webperformancedays-velocity-nosqlmatters-0ee58424.png" alt="One Week Performance and Infrastructure with O'REILLY Velocity, WebPerformanceDays and NoSQL Matters in Barcelona!"></p>
<p>From November 17th to November 22nd the conferences O'REILLY® Velocity EU 2014, WebPerfDays and NoSQL
Matters will take place in Barcelona, Spain.</p>
<p>In the week starting on November 17th 2014 the O’REILLY® Velocity EU 2014 takes place in Barcelona, Spain. Sebastian (<a href="https://twitter.com/tisba">@tisba</a>) and I (<a href="https://twitter.com/larsvegas">@larsvegas</a>) are looking forward to attending this conference. I’ve been to O’REILLY® Velocity EU in London back in 2010 and really loved it. I am curious how it will be this time.</p>
<p>On Thursday, November 20th 2014, the
<a href="http://www.webperfdays.org/events/2014-barcelona/index.html">WebPerfDays</a>
follow as the next event. WebPerfDays, for those who have never heard of it, is an
<a href="http://en.wikipedia.org/wiki/Unconference">unconference</a>
for the web performance community. Sebastian will give an ignite talk on "Load Testing with over 1.000.000 Users", while I will give one on "Continuous Performance and Load Testing". I will also offer an open space session on this topic. Come and join us!</p>
<p>Concluding the week of conferences, with trainings and talks on NoSQL, is NoSQL Matters on Friday and Saturday. Sebastian will give a talk on building a startup with NoSQL, where he will share some insight into how we at StormForger use NoSQL technology and how you can, too!</p>
<p>Let’s have a good time! Drop us a line or give us a shout on Twitter:
<a href="https://twitter.com/tisba">@tisba</a>
or
<a href="https://twitter.com/larsvegas">@larsvegas</a></p>
<p>See you around! :)</p>
<hr>
<p>
Do you want to learn more about StormForger - Performance Testing as a Service?
<a href="https://stormforger.com">Sign up for free!</a>
</p>
Amazon Web Services launched EU (Frankfurt) Region in Germany
https://stormforger.com/blog/amazon-web-services-launches-eu-frankfurt-region-in-germany-load-testing/
2014-10-23T15:28:00+00:00
2014-10-23T15:28:00+00:00
Lars Wolff
<p><img src="/blog/amazon-web-services-launches-eu-frankfurt-region-in-germany-load-testing/lets-make-the-web-a-faster-place-8821a033.png" alt="Lets make the web a faster place!"></p>
<p>Amazon Web Services launched a new Region in Frankfurt am Main /
Germany. We offer you a discount on Load and Performance Testing your
application in the EU (Frankfurt) Region.</p>
<p>The new Region<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> supports most of the services, as @jeffbarr announced on the Amazon Web Services Blog.</p>
<blockquote class="twitter-tweet" lang="en"><p>Now Open - <a href="https://twitter.com/hashtag/AWS?src=hash">#AWS</a> Germany (Frankfurt)
Region - EC2, DynamoDB, S3, and Much More - <a href="http://t.co/MauSKIm0Nq">http://t.co/MauSKIm0Nq</a> <a href="http://t.co/MmZow0DLLR">pic.twitter.com/MmZow0DLLR</a></p>— Jeff
Barr (@jeffbarr) <a href="https://twitter.com/jeffbarr/status/525272946021916674">October 23,
2014</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p><br /></p>
<p>To be honest, I personally don’t know Frankfurt that well. As for most Germans, to me Frankfurt is mainly about:</p>
<ul>
<li>the Frankfurt Stock Exchange<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></li>
<li>the Frankfurt Airport<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></li>
<li>and food like Frankfurter Würstchen<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">4</a></sup> or Grüne Soße<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">5</a></sup></li>
</ul>
<p>And of course, Frankfurt has DE-CIX<sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">6</a></sup>, the world's largest internet exchange point by peak traffic.</p>
<p><strong>Finally, Frankfurt now has an Amazon Web Services Region. Nice!<br />
Your data is welcome in Germany!</strong></p>
<p>Since we want to make the web a faster place, we offer:</p>
<div class="alert alert-info" role="alert">
<p class="lead text-center text-error">
<strong>15% discount*</strong>
</p>
<p class="lead text-center text-error">
for full service and consulting on Load and Performance Testing
<br />
your HTTP API in
the new Amazon Web Services EU (Frankfurt) Region.
</p>
<p class="lead text-center text-error">
<a class="Button-cta" href="mailto:support+aws-eu-fra@stormforger.com"><strong>Request a concrete quote!</strong></a>
</p>
</div>
<p>Happy Load Testing :)</p>
<p class="smallsmall text-muted">
* This offer is limited and valid until December 31st, 2014. Example project: 3-day load testing. First day: definition of non-functional requirements; second and third day: load testing and analysis, assessment and recommendation.
</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p><a href="http://aws.amazon.com/de/blogs/aws/aws-region-germany/">Amazon Web Services launched a new Region in Frankfurt am Main / Germany</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p><a href="https://www.deutsche-boerse.com/dbg-en/our-company/frankfurt-stock-exchange">Frankfurt Stock Exchange</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p><a href="https://www.frankfurt-airport.com/en.html">Frankfurt Airport</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p><a href="http://en.wikipedia.org/wiki/Frankfurter_W%C3%BCrstchen">Frankfurter Würstchen</a> <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p><a href="http://en.wikipedia.org/wiki/Green_sauce#German_Gr.C3.BCne_So.C3.9Fe">Frankfurter Grüne Soße</a> <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:8" role="doc-endnote">
<p><a href="http://en.wikipedia.org/wiki/DE-CIX">DE-CIX</a> <a href="#fnref:8" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<hr>
<p>
Do you want to learn more about StormForger - Performance Testing as a Service?
<a href="https://stormforger.com">Sign up for free!</a>
</p>
Getting Started with Time Series Data
https://stormforger.com/blog/getting-started-with-time-series-data/
2014-09-08T12:08:00+00:00
2014-09-08T12:08:00+00:00
Sebastian Cohnen
<p><img src="/blog/getting-started-with-time-series-data/nosql_01-c585fb75.jpg" alt="NoSQL Matters Dublin 2014">
<small class="text-muted">NoSQL Matters Dublin 2014 ⋅ © Dr. Celler Cologne Lectures GmbH & Co. KG</small></p>
<p>Last week I gave a talk at NoSQL Matters Dublin 2014 on "Getting Started with Time Series Data".</p>
<p>In this presentation I gave a quick introduction into time series data and databases. In particular I presented InfluxDB's Query Language and how you can organize, down-sample and aggregate data when it arrives at InfluxDB using their <a href="https://www.influxdata.com/blog/continuous-queries-in-influxdb-part-i/">Continuous Query</a> feature. You can find the slides over at <a href="https://speakerdeck.com/tisba/getting-started-with-time-series-data">Speaker Deck</a> or embedded at the end of this post.</p>
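<p>What such a continuous query computes is easy to picture: incoming points are grouped into fixed time buckets and aggregated on arrival. A plain-Python sketch of the concept (this is not InfluxDB's API, just an illustration of the down-sampling a mean-over-time query performs):</p>

```python
from collections import defaultdict


def downsample_mean(points, bucket_seconds=60):
    """Average (unix_timestamp, value) points into fixed time buckets,
    like a continuous query computing mean() grouped by time."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return {start: sum(vs) / len(vs) for start, vs in sorted(buckets.items())}
```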
<blockquote class="twitter-tweet" lang="en">
<p>I blame <a href="https://twitter.com/tisba">@tisba</a> who's infected me with InfluxDB, but I must confess <a href="https://twitter.com/dweet_io">@dweet_io</a> + <a href="https://twitter.com/InfluxDB">@InfluxDB</a> == fun
Will keep you posted …
<a href="https://twitter.com/hashtag/IoT?src=hash">#IoT</a> <a href="https://twitter.com/hashtag/timeseries?src=hash">#timeseries</a></p>— Michael Hausenblas (@mhausenblas) <a href="https://twitter.com/mhausenblas/status/508325184529784832">September 6, 2014</a>
</blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p><a href="https://twitter.com/mhausenblas">Michael Hausenblas</a> immediately got <a href="https://twitter.com/mhausenblas/status/508325184529784832">inspired</a> and blogged about using InfluxDB to collect sensor data from https://dweet.io (Internet of Things) and visualize it using <a href="http://grafana.org/">Grafana</a>. Michael's <a href="https://medium.com/large-scale-data-processing/time-series-databases-fb4618201867">post</a> is a must-read if you are interested in getting to know InfluxDB.</p>
<p><img src="/blog/getting-started-with-time-series-data/nosql_02-94fb5d96.jpg" alt="NoSQL Matters Dublin 2014" />
<small class="text-muted">NoSQL Matters Dublin 2014 ⋅ © Dr. Celler Cologne Lectures GmbH & Co. KG</small></p>
<script async="" class="speakerdeck-embed" data-id="3c0dfc80167a0132e24f3ade9a061f96" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
<p><img src="/blog/getting-started-with-time-series-data/nosql_03-3b577af3.jpg" alt="NoSQL Matters Dublin 2014" />
<small class="text-muted">NoSQL Matters Dublin 2014 ⋅ © Dr. Celler Cologne Lectures GmbH & Co. KG</small></p>
<hr>
<p>
Do you want to learn more about StormForger - Performance Testing as a Service?
<a href="https://stormforger.com">Sign up for free!</a>
</p>
StormForger on Hacker News
https://stormforger.com/blog/stormforger-on-hacker-news/
2014-07-02T13:22:00Z
2014-07-02T13:22:00Z
Sebastian Cohnen
<p>You might have noticed that StormForger was <a href="https://news.ycombinator.com/item?id=7920930">trending on Hacker News</a> on Friday, 20th June. For quite some time we were listed 3rd place. Thanks a lot for voting and your interest!</p>
<blockquote class="twitter-tweet" lang="en">
<p>Stormforger – Cloud-based Load Testing as a Service <a href="https://t.co/i1wHk3yMDQ">https://t.co/i1wHk3yMDQ</a></p>— Hacker News (@newsycombinator) <a href="https://twitter.com/newsycombinator/statuses/480033402822094848">June 20, 2014</a>
</blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p><img src="/blog/stormforger-on-hacker-news/Screen_Shot_2014_06_20_at_18_19_36-1de54c89.png" alt="StormForger on Hacker News" /></p>
<p>It was quite a surprise when I started to notice StormForger being mentioned more and more on Twitter, and it took a while to find out what was going on: someone had posted a link to our landing page!</p>
<blockquote class="twitter-tweet" lang="en"><p><a href="https://t.co/0mTXufY40Q">https://t.co/0mTXufY40Q</a> <a href="https://twitter.com/StormForgerApp">@StormForgerApp</a>, folks from Cologne are gaining traction with their wonderful project. <a href="https://twitter.com/hashtag/loadtesting?src=hash">#loadtesting</a> <a href="https://twitter.com/hashtag/testing?src=hash">#testing</a></p>— ziya aktas (@zaktas) <a href="https://twitter.com/zaktas/statuses/480662209006407680">June 22, 2014</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>We got tons of valuable feedback, requests for beta access and quite a few sign ups for our newsletter. If you want beta access, <a href="https://app.stormforger.com/users/sign_up">sign up</a>!</p>
<p>Beside the event on Hacker News a lot of stuff is going on and things are rapidly moving forward. Exciting times! :)</p>
<p><em>PS, a little side note: Although the landing page was hit by quite a few traffic spikes on Friday evening, the server remained very relaxed. The reason is pretty simple: static content + nginx! ;-)</em></p>
<p><img src="/blog/stormforger-on-hacker-news/Screen_Shot_2014_06_20_at_18_21_41-c798b6c0.png" alt="Side note" /></p>
<hr>
<p>
Do you want to learn more about StormForger - Performance Testing as a Service?
<a href="https://stormforger.com">Sign up for free!</a>
</p>
Load Testing an interactive TV Show with over 1 Million Users
https://stormforger.com/blog/load-testing-an-interactive-tv-show-with-over-1-million-users/
2014-05-27T11:30:00+00:00
2014-05-27T11:30:00+00:00
Sebastian Cohnen
<p>On May 12th the new and innovative show <a href="http://www.daserste.de/unterhaltung/quiz-show/quizduell/index.html">Quizduell im Ersten</a> started on German national television (<a href="https://en.wikipedia.org/wiki/ARD_(broadcaster)">ARD</a>). The show is an adaption of the very popular mobile game <a href="http://www.quizduell-game.de/">Quizduell</a>, which has a user base of over 30 million players, so the TV show was expected to be very popular.</p>
<p>I had previously <a href="/blog/quizduell-last-und-testing/">blogged</a> about the specific challenges faced when load testing such an application. That post prompted the makers of Quizduell im Ersten to approach me for assistance with these specific problems. This post is a summary and translation of both articles.</p>
<p><a href="/team">I</a> was approached by <a href="https://www.grandcentrix.net/">grandcentrix</a> as an external load testing service provider. Grandcentrix had published a German FAQ (which is no longer available) on their company blog about what the challenges were and what went wrong. Most information in this article is based on those FAQs. Please note that, for obvious reasons, I'm not allowed to explicitly mention any specific details. Suffice it to say they were pleased with the results.</p>
<blockquote>
<p>"Sebastian was of tremendous help to set up and conduct the required large scale load tests. To be able to watch our Mobile Mass Response platform under load, was more than helpful in those stressful days […]"</p>
<p>— <em>Ralf Rottmann (<a href="https://twitter.com/ralf">@ralf</a>), Managing Partner at <a href="https://grandcentrix.net/">grandcentrix GmbH</a></em></p>
</blockquote>
<h3 id="flashback">Flashback</h3>
<p>I won't go into detail on the game mechanics of Quizduell im Ersten (check that out for yourselves). One of the big technical challenges was achieving the so-called "TV synchronicity": enabling an interactive, real-time game experience with the TV audience via the Quizduell app at home. Depending on the game state, the app communicates with the backend API at least once every 1 to 10 seconds. This means that (at peak) the requests per second correspond to the total number of users playing the game.</p>
<p>In the above-mentioned FAQs, grandcentrix explains where the essential problems were. Parts of the complexity and the requirements were removed, and I was asked to conduct load tests to back the refactoring efforts. In addition to the originally tested 85,000 requests per second, we hit the system with over 330,000 requests per second, which corresponds to about <strong>1 million Quizduell players</strong>.</p>
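<p>The mapping from players to request rate follows directly from the polling behavior: if every client polls the API on average every few seconds, the peak request rate is roughly the player count divided by the average polling period. A back-of-the-envelope sketch (the 3-second average period below is an assumption chosen to match the numbers in this post):</p>

```python
def peak_requests_per_second(players: int, avg_poll_period_s: float) -> float:
    """Estimated backend request rate for clients polling every
    avg_poll_period_s seconds on average."""
    return players / avg_poll_period_s


# ~1,000,000 players polling roughly every 3 seconds lands in the
# region of the 330,000 requests per second we tested.
estimate = peak_requests_per_second(1_000_000, 3)
```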
<p>I worked very closely with the grandcentrix team on quality assurance and we conducted lots of rather large scale stress tests. In this article I'd like to outline a few more details about what we did.</p>
<h3 id="test-setup">Test Setup</h3>
<p>The test cluster consisted of up to 50 AWS EC2 instances with 800 cores, 1.5 TB of RAM and well over 50 GBit/s of bandwidth. This setup was chosen to rule out any effects caused by overloading the testing system itself.</p>
<p>The test case was modeled so that test clients actually play Quizduell: a simulated API client reacts to different game states, respects polling-interval instructions, chooses game categories and answers questions, though not always correctly :)</p>
<p>After the test was modeled, I took care of provisioning the test systems and conducted and monitored each test execution. The grandcentrix team could therefore focus entirely on analyzing internal metrics and logs, while a bot automatically reported the current test state (number of current users, current request rates, bandwidth, latency statistics, etc.) to a <a href="https://slack.com/">Slack</a> channel. After each test execution, all relevant metrics and charts were generated and thoroughly analyzed and interpreted by the team.</p>
<h3 id="challenges">Challenges</h3>
<p>One of the bigger challenges was that the Quizduell API runs on <a href="https://cloud.google.com/products/app-engine/">Google App Engine</a>. A deeper look into the runtime environment was therefore not possible, and we had to rely on the Google support team, which was outstanding.</p>
<ul>
<li>
<p><strong>DoS protection</strong>: Since the test does not originate from 1,000,000 servers with as many IP addresses, Google's DoS protection was quite a problem for some time. Intervention by the Google support team was required to permanently unban the load generators.</p>
</li>
<li>
<p><strong>Google Magic</strong>: There are a lot of tuning knobs that control the (scaling) behavior of App Engine, and some of them can only be changed by Google itself. We had to carefully adjust those parameters to further optimize the response times and stability of the API.</p>
</li>
<li>
<p><strong>Network</strong>: Strange things happen at high request rates and the resulting network bandwidth. Odd effects in TCP connect timings, flow control, routing and other phenomena had to be traced back to their root causes, understood and, if possible, eliminated.</p>
</li>
</ul>
<h3 id="results">Results</h3>
<p>Conducting comprehensive load tests prior to relaunching the app-enabled show demonstrated that grandcentrix's "Mobile Mass Response" platform (no longer available) is capable of handling the expected load. The latest shows have proven that most of the initial performance problems could be resolved.</p>
<p>Here are a few numbers: while testing, my setup made a total of <strong>1,213,583,187 requests</strong> against the Quizduell system in over 50 load test runs, moving a total of about <strong>2.21 TB</strong> of data. The error rate was about 0.0000216% (1 error every 4,624,616 requests).</p>
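<p>The error-rate arithmetic can be checked from the two figures given above (total requests and requests per error):</p>

```python
total_requests = 1_213_583_187
requests_per_error = 4_624_616

errors = total_requests / requests_per_error   # ~262 errors in total
error_rate_percent = 100 / requests_per_error  # ~0.0000216 %

print(round(errors))
print(f"{error_rate_percent:.7f} %")
```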
<p><strong>Follow-up</strong></p>
<p>Last Friday (May 23) we conducted another round of large tests, resulting in another <strong>800+ million requests</strong> and close to another terabyte of data transfer. We were able to identify and remove the remaining performance issues.</p>
<blockquote>
<p>"We’d like to thank <a href="https://twitter.com/tisba">@tisba</a> for supporting the making of <a href="https://twitter.com/search?q=%23Quizduell&src=hash">#Quizduell</a> im Ersten with the best load testing service ever! Lovin’ it!"</p>
<p><em>grandcentrix (@grandcentrix)</em> <a href="https://twitter.com/grandcentrix/statuses/470096959567839232">May 24, 2014</a></p>
</blockquote>
<hr>
<p>
Do you want to learn more about StormForger - Performance Testing as a Service?
<a href="https://stormforger.com">Sign up for free!</a>
</p>
Quizduell im Lasttest
https://stormforger.com/blog/quizduell-im-lasttest/
2014-05-22T19:01:00+00:00
2014-05-22T19:01:00+00:00
Sebastian Cohnen
<p><em>Dear English reader, <strike>I'll translate this post later</strike> take a look at the <a href="/blog/load-testing-an-interactive-tv-show-with-over-1-million-users/">translated post</a>. <strong>tl;dr</strong>: I had the opportunity to load test the API backend for a big German TV show („Quizduell im Ersten“) by simulating up to 1 million users.</em></p>
<p>After the launch of <a href="http://www.daserste.de/unterhaltung/quiz-show/quizduell/index.html">Quizduell im Ersten</a> with a number of <a href="/blog/quizduell-last-und-testing/">mishaps</a>, <a href="/team">I</a> was asked by grandcentrix to act as an external load tester. <strike>In the meantime, <a href="https://grandcentrix.net">grandcentrix</a> has published an FAQ on their company blog</strike> (<em>no longer available</em>) that addresses and explains mostly technical questions. Most of the information in this post comes from those FAQs. Please understand that I cannot share any further details.</p>
<h3 id="rckblick">Flashback</h3>
<p>Without going into the details of the game mechanics of Quizduell im Ersten, one of the big technical challenges was achieving so-called "TV synchronicity": enabling interactive, near-real-time play with the audience via the Quizduell app. Depending on the game state, the app communicates with the API at least once every 1 to 10 seconds. This means that at peak load, the number of players corresponds to the number of API requests per second.</p>
<p>In the FAQs on Quizduell im Ersten, grandcentrix explained what the essential problems at the show's launch were. The complexity of and requirements on the system were scaled back, and at the same time external load tests were conducted to back the rework. Beyond the originally tested peak load of 85,000 requests per second, the new system was tested with my help at over 330,000 requests per second, which corresponds to about <strong>1 million Quizduell players</strong>.</p>
<p>Over the past few days I worked intensively with the grandcentrix team on quality assurance and conducted many stress tests. I'd like to outline a few more details here.</p>
<h3 id="testsetup">Test Setup</h3>
<p>The test cluster consisted of up to 50 AWS EC2 instances with 800 CPU cores, 1.5 TB of RAM and over 50 GBit/s of bandwidth. This setup was chosen to rule out any effects caused by overloading the testing system in any case.</p>
<p>The test case was modeled so that it can actually play Quizduell. A simulated client reacts to state changes in the game, respects instructions about polling intervals, chooses categories and even answers questions – the latter, however, not always correctly :)</p>
<p>After the test was modeled, I took over provisioning the test systems as well as conducting and monitoring each test run. The grandcentrix team could therefore focus entirely on analyzing internal metrics and logs, while a bot automatically reported the current state (number of users, current request rates, bandwidth, latencies, etc.) to a <a href="https://slack.com/">Slack</a> channel. Finally, all relevant metrics and charts were generated and thoroughly analyzed and interpreted by the team.</p>
<h3 id="herausforderungen">Challenges</h3>
<p>The main challenge in testing was that the Quizduell API runs on <a href="https://cloud.google.com/products/app-engine/">Google App Engine</a>, which makes a detailed look into the runtime environment difficult.</p>
<ul>
<li>
<p><strong>DoS protection</strong>: Since the test does not originate from 1,000,000 machines with as many IP addresses, Google's DoS protection kept kicking in. Intervention by the Google support team was required every time.</p>
</li>
<li>
<p><strong>Google Magic</strong>: There are a number of parameters that control the (scaling) behavior of App Engine – some of which only Google itself can change. These settings had to be adjusted further to optimize response times.</p>
</li>
<li>
<p><strong>Network</strong>: At high request rates and the resulting network bandwidth, strange effects often occur. Unexplained spikes in TCP connection setup, <a href="https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Flow_control">flow control</a> and many other phenomena had to be traced back to a root cause as far as possible, explained and, where possible, eliminated.</p>
</li>
</ul>
<h3 id="ergebnisse">Results</h3>
<p>In essence, the comprehensive load tests conducted ahead of the relaunch of the app-enabled show confirmed that the Mobile Mass Response platform is capable of handling Quizduell im Ersten. The recent shows have likewise confirmed that most of the initial performance problems could be resolved.</p>
<blockquote>
<p>"Sebastian was a great help in setting up and conducting the required load tests during the functional adjustments. Being able to observe the behavior of our Mobile Mass Response platform under load was extremely helpful during the stressful time of the past few days […]"</p>
</blockquote>
<blockquote>
<p><em>Ralf Rottmann (<a href="https://twitter.com/ralf">@ralf</a>), grandcentrix GmbH</em></p>
</blockquote>
<p>Here are a few more numbers: in total we made <strong>1,213,583,187 requests</strong> against the Quizduell system in over 50 load tests, moving around <strong>2.21 TB</strong> of data. The error rate was about 0.0000216% (or 1 error per 4,624,616 requests).</p>
<p><strong>Addendum:</strong> Last Friday we conducted a few more large tests with another <strong>800+ million requests</strong> and close to another terabyte of data transfer. We successfully identified and eliminated the last remaining problems.</p>
<hr>
<p>
Do you want to learn more about StormForger - Performance Testing as a Service?
<a href="https://stormforger.com">Sign up for free!</a>
</p>
Quizduell, Last und Testing
https://stormforger.com/blog/quizduell-last-und-testing/
2014-05-13T15:15:00+00:00
2014-05-13T15:15:00+00:00
Sebastian Cohnen
<p><em>Dear English reader, this post is about an event that happened yesterday evening. <strong>tl;dr</strong>: A new interactive TV show had technical issues sustaining the load.</em></p>
<p><a href="http://www.daserste.de/unterhaltung/quiz-show/quizduell/index.html">Das Quizduell</a> (an adaptation of the <a href="http://www.quizduell-game.de/">Quizduell</a> app) with Jörg Pilawa launched with technical mishaps, followed by hours of mockery on Twitter and Facebook. The online editions of various media outlets reported on it, partly while it was still unfolding.</p>
<h3 id="was-war-geschehen">What Happened?</h3>
<p>During the show, Jörg Pilawa said that the servers were overloaded and, later on, that a single user was blocking over 15,000 servers. Other media outlets (<a href="http://www.welt.de/vermischtes/weltgeschehen/article127930874/Quizduell-Premiere-von-Hacker-attackiert.html">Welt</a>, <a href="http://www.focus.de/kultur/kino_tv/blamage-fuer-ard-premieren-panne-hacker-legen-quizduell-server-lahm_id_3838448.html">Focus</a>, <a href="http://www.sueddeutsche.de/medien/quizduell-in-der-ard-server-gehackt-team-deutschland-darf-nicht-mitspielen-1.1960320">Süddeutsche</a>, <a href="http://www.sueddeutsche.de/medien/quizduell-premiere-in-der-ard-wenn-du-einen-hintern-in-der-hose-hast-dann-melde-dich-1.1960203">Süddeutsche</a>, <a href="http://www.spiegel.de/kultur/tv/quizduell-start-hacker-legen-server-bei-ard-show-lahm-a-969029.html">Spiegel</a>) immediately reported a hacker attack, of which, however, nothing remains in the NDR <a href="http://daserste.ndr.de/quizduell/quizduell133.html">press release</a> from last night. On Twitter, too, a plain software bug and/or the rush of users is the prevailing theory.</p>
<h3 id="technische-hintergrnde">Technical Background</h3>
<p>According to grandcentrix (operator and maker of the app and its backend), the Mobile Mass Response system runs on Google infrastructure (i.e. either <a href="https://cloud.google.com/products/compute-engine/">Google Compute Engine</a> or <a href="https://cloud.google.com/products/app-engine/">Google App Engine</a>). However, I consider the claim that at least 15,000 servers were in use to be misinformation: with <a href="http://www.sueddeutsche.de/medien/quizduell-in-der-ard-server-gehackt-team-deutschland-darf-nicht-mitspielen-1.1960320">173,000 registrations</a> beforehand, that would mean about 11.5 users per server. I suspect what is meant are processes on App Engine.</p>
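<p>The arithmetic behind that skepticism, using only the two numbers mentioned above:</p>

```python
registrations = 173_000
claimed_servers = 15_000

users_per_server = registrations / claimed_servers
print(f"~{users_per_server:.1f} users per server")  # far too few to be plausible
```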
<p>Unfortunately, no further technical details are publicly available. Perhaps more about the architecture, the incident and its background will be revealed at Interactive Cologne. I would be very happy about a <a href="https://twitter.com/ottopoellath/status/465957498248495104">post-mortem</a> or a "lessons learned" talk.</p>
<h3 id="lsst-sich-so-etwas-verhindern">Can This Be Prevented?</h3>
<p>With the voting system of the Eurovision Song Contest, grandcentrix has proven that they are able to build and operate such demanding architectures. But can incidents like this be prevented at all?</p>
<p>Whether outages like the one at Quizduell can be prevented in principle is questionable, at the very least. A good mix of know-how and perhaps an external audit is necessary to minimize the risks. Even so, events like this carry enormous potential for problems, since it is unclear beforehand how many requests to expect, whether there will be massive load spikes, and how the system will behave under such conditions.</p>
<p>It is therefore essential to conduct comprehensive, non-functional load tests. Such tests – especially at this scale – are a non-trivial undertaking, even if the API may look relatively simple at first glance. Many test frameworks already fail at reliably simulating many hundreds of thousands of users over longer periods of time; the test and the load generators themselves must be monitored to prevent overload. How do you test API clients on slow connections? How do you model dynamic behavior? Are the test scenarios realistic enough? And so on.</p>
<p>Besides the application software itself, such problems under load are often caused by issues with the (cloud) infrastructure, load balancers, or the system and network configuration of the application servers. A pure audit based on source code or model-based analyses is therefore not sufficient. Operating such systems on infrastructure like Google's <a href="https://cloud.google.com/products/app-engine/">Google App Engine</a> is particularly problematic, since the user has little direct control over or influence on the environment.</p>
<h3 id="fazit">Conclusion</h3>
<p>Technically scalable, cloud-based architectures are not necessarily a guarantee of smooth operation. Especially at events like the Eurovision Song Contest or Quizduell, enormous load spikes, "strange" traffic or "attacks" are to be expected. Without load tests that exercise the entire stack (from the load balancer to the application server), no reliable statement about a system's behavior can be made, given the complexity of the systems involved.</p>
<hr>
<p>
Do you want to learn more about StormForger - Performance Testing as a Service?
<a href="https://stormforger.com">Sign up for free!</a>
</p>
Using InfluxDB at StormForger
https://stormforger.com/blog/using-influxdb-at-stormforger/
2014-05-08T17:43:00+00:00
2014-05-08T17:43:00+00:00
Sebastian Cohnen
<p>Yesterday I gave a talk on InfluxDB at the <a href="http://www.nosql-cologne.org">NoSQL user group Cologne</a>. I gave a general introduction to time series data, InfluxDB and how we will use InfluxDB at StormForger to handle huge quantities of real-time processed metrics.</p>
<p>Find the (German) slides here at <a href="https://speakerdeck.com/tisba/2014">Speaker Deck</a> or here:</p>
<script async="" class="speakerdeck-embed" data-id="295dc280b8c80131e4474251e58a135f" data-ratio="1.77777777777778" src="//speakerdeck.com/assets/embed.js"></script>
<p><img src="/blog/using-influxdb-at-stormforger/influxdb-40b0f132.png" alt="InfluxDB" /></p>
<hr>
<p>
Do you want to learn more about StormForger - Performance Testing as a Service?
<a href="https://stormforger.com">Sign up for free!</a>
</p>
Introducing StormForger
https://stormforger.com/blog/introducing-stormforger/
2014-02-06T19:55:00Z
2014-02-06T19:55:00Z
Sebastian Cohnen
<p>Hi, my name is Sebastian (<a href="https://twitter.com/tisba">@tisba</a> on Twitter) and I'm working on a project called <a href="/">StormForger</a>. In this blog I'll keep you posted on API, performance and load testing related topics.</p>
<p>StormForger is a Load Testing as a Service platform targeting HTTP APIs. By providing a comprehensive and expressive DSL to specify your test cases and a high degree of automation, I try to enable people to do <strong>Continuous Load Testing</strong>.</p>
<p>To tell you in more detail what StormForger is about, I recently published a high-level introduction video. If you are interested in testing your APIs, just get in contact. I'd love to chat with you about the topic, tell you more about the project or even set up a demo specifically for testing your APIs.</p>
<div class="centered">
<div class="Youtube" data-embed="lhFbeXbiais">
<div class="PlayButton"></div>
</div>
</div>
<hr />
<p>Here is a transcript of the video:</p>
<p>Hi, my name is Sebastian and today I would like to introduce StormForger – a next generation, comprehensive API load testing solution. But before I start with StormForger, let me give you some context.</p>
<p>HTTP-based APIs can be found everywhere. Examples for APIs are…</p>
<ul>
<li>APIs to web applications</li>
<li>backends for mobile applications and online games</li>
<li>single page web applications</li>
<li>tracking & monitoring systems</li>
</ul>
<p>and so on. Applications that offer such APIs can be complex systems with extensive software stacks and their performance can be hard to predict.</p>
<p>While you most likely have test suites in place, that cover functional requirements, you might lack good answers to questions like:</p>
<ul>
<li>How fast is my system?</li>
<li>How many requests can my architecture handle?</li>
<li>Can my application scale?</li>
</ul>
<p>Load testing is a way of providing data to answer those questions.</p>
<p>It is, simply put, the process of putting a specific amount of traffic on a target system while observing the target's behavior.</p>
<p>For example, you might be interested in the response time, resource usage or error rates of your application under a specific amount of pressure.</p>
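<p>As a minimal sketch of that process, the following toy load generator fires a fixed number of requests at a target with limited concurrency and records latencies and errors. The target here is a stand-in function; in a real test it would be an HTTP request, and the metrics collected would be far richer.</p>

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def load_test(target, requests, concurrency):
    """Call `target` `requests` times with up to `concurrency` workers."""
    latencies, errors = [], 0

    def one_call(_):
        start = time.perf_counter()
        try:
            target()
        except Exception:
            return None  # count as an error
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for elapsed in pool.map(one_call, range(requests)):
            if elapsed is None:
                errors += 1
            else:
                latencies.append(elapsed)

    return {
        "requests": requests,
        "errors": errors,
        "mean_s": statistics.mean(latencies),
        "p95_s": statistics.quantiles(latencies, n=20)[-1],  # 95th percentile
    }

# Example: a fake target that takes ~1 ms per call.
result = load_test(lambda: time.sleep(0.001), requests=100, concurrency=10)
print(result["requests"], result["errors"])
```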
<p>Let's have a very brief look at what you have to do in order to run a load test. You have to…</p>
<ul>
<li>set test goals</li>
<li>plan tests</li>
<li>provision test resources</li>
<li>setup and deploy the test case</li>
<li>execute the test</li>
<li>monitor the running test</li>
<li>tear down the test environment</li>
<li>gather and analyze log and other sensor data</li>
<li>evaluate & compare results to previous test runs</li>
</ul>
<p>In short: load testing can be a time-consuming and challenging task, and it isn't easy to get started with. But performance can be crucial to success, so you want to perform load tests on a regular basis. So why not run load tests the way you do Continuous Integration?</p>
<p>The solution for tackling this in a fast and easy way is StormForger.</p>
<p>StormForger is a load testing as a service platform targeting HTTP APIs. It tries to simplify and automate as much as possible in regard to setting up, executing and analyzing API load tests. More importantly, StormForger tries to enable the user to conduct affordable comprehensive and reproducible load tests on a regular basis.</p>
<p>I think that getting started with load testing should be a no-brainer. My vision is to enable people to run load tests continuously: Continuous Load Testing so to speak. I also want to provide you with ongoing visibility into performance characteristics.</p>
<p>To achieve this, StormForger will provide you with the following:</p>
<ul>
<li>easy to understand test case language to specify your tests</li>
<li>management of test resources (setup, deployment, execution & monitoring)</li>
<li>detailed analytics of results</li>
<li>tools and metrics to compare multiple test runs over time</li>
<li>comprehensive HTTP API to automate and integrate StormForger into your environment</li>
</ul>
<p>StormForger is in open beta right now. If you want to try it out, get in contact.</p>
<p>Thanks for watching and happy load testing!</p>
<hr>
<p>
Do you want to learn more about StormForger - Performance Testing as a Service?
<a href="https://stormforger.com">Sign up for free!</a>
</p>