We have big sales events more than a couple of times a year. Many items go on sale at discounts from 15% up to 50%. While this is great for customers, for engineers it means panic and lots of sleepless nights spent monitoring services.
Most of our services are connected by APIs. During these events, the QPS of our service's API increases up to 3 times. Although we run lots of stress tests, there is always a chance that something unexpected will happen in the long run.
One of these unexpected “issues” happened recently. While we were monitoring our databases, we saw a spike in queries. We monitor our databases by query type (e.g. SELECT, UPDATE, etc.). Queries per minute had been running smoothly when the number of SELECT queries suddenly spiked to roughly 15 to 20 times the usual level. There was no way this was going to end without trouble. Surprisingly, however, there were no errors. Even more surprisingly, there was no change in requests per minute for the API that accesses the databases.
Since I was responsible for the service architecture, I spent the next few days investigating the cause of the spike on the database. There was no clue: no change in API calls, no background tasks running, no batches, nothing. The only team that can access the databases besides us is the DBA team. When I asked them whether they had done anything at the time of the spike, they said they had not.
We use a monitoring tool provided by the DBA team to monitor our databases. The query spike looked something like the graph below:
Finally, instead of looking at the graph, I decided to look at the raw data. After spending some time on it, I realised that there was no spike; the issue was with the visualization of the data.
There had been a network issue for a short time, during which the monitoring data wasn't received by the visualization tool. Instead of reporting an error, the tool simply waited until it received data again. When the network issue was eventually resolved, the data for the whole outage was shipped combined, without its original timestamps, and was recorded against a single timestamp.
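To make the effect concrete, here is a minimal Python sketch of how a perfectly steady query rate can turn into an apparent spike. It is only an illustration under assumptions: the function name `record_samples`, the rate of 1000 SELECT queries per minute, and the 15-minute outage are made-up values, not details of the actual monitoring tool.

```python
def record_samples(per_minute_counts, outage_minutes):
    """Simulate a collector that buffers counts while the network is down.

    per_minute_counts: query counts, one entry per minute.
    outage_minutes: set of minute indexes during which nothing is delivered.
    Returns (minute, count) pairs as the visualization tool would record them.
    """
    recorded = []
    buffered = 0
    for minute, count in enumerate(per_minute_counts):
        if minute in outage_minutes:
            # Nothing reaches the tool; the count waits to be shipped later.
            buffered += count
            continue
        # First delivery after the outage carries everything that was
        # buffered, all recorded against this single timestamp.
        recorded.append((minute, count + buffered))
        buffered = 0
    return recorded


steady_traffic = [1000] * 20      # hypothetical: 1000 SELECTs every minute
outage = set(range(3, 18))        # hypothetical: minutes 3 to 17 are lost
samples = record_samples(steady_traffic, outage)

for minute, count in samples:
    print(minute, count)
```

With these numbers, the sample at minute 18 holds 16 minutes of traffic, so it appears as a 16x spike even though the real rate never changed.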
On top of that, the visualization tool wasn't spacing points according to the actual time difference between them. A gap of 1 minute and a gap of 15+ minutes were drawn at the same distance, because no data had been received in between.
So a slightly better visualization, with correct distances, would look like this:
However, this is also misleading. Instead of showing that no data was received during the network issue, it looks as if a spike happened and then slowly cooled down. So an even better visualization would be:
It clearly shows that no data was received for a certain time span. From this we can tell that there was an issue with data collection or shipping.
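One way to get this kind of graph is to mark missing intervals explicitly before plotting, so the line is broken instead of being drawn straight across the hole. The sketch below is an assumption about how that could be done, not what our tool does: `insert_gaps` and `expected_interval` are hypothetical names, and the sample data is invented. Many charting libraries can be set up to break a line at a missing value such as `None` or NaN.

```python
def insert_gaps(samples, expected_interval=1):
    """Add (minute, None) markers wherever samples are further apart than expected.

    samples: (minute, value) pairs sorted by minute.
    expected_interval: normal spacing between samples, in minutes.
    """
    result = []
    previous_minute = None
    for minute, value in samples:
        if previous_minute is not None and minute - previous_minute > expected_interval:
            # Place a "no data" marker right after the last known sample
            # so a plotting tool does not connect the two sides of the gap.
            result.append((previous_minute + expected_interval, None))
        result.append((minute, value))
        previous_minute = minute
    return result


# Hypothetical recorded data: nothing arrived between minute 2 and minute 18.
recorded = [(0, 1000), (1, 1000), (2, 1000), (18, 16000), (19, 1000)]
with_gaps = insert_gaps(recorded)

for minute, value in with_gaps:
    print(minute, "no data" if value is None else value)
```

The same check can also feed an alert, since a missing interval is a signal about the monitoring pipeline rather than about the database.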
In conclusion, when creating a visualization of anything, it is better to test it against unusual scenarios. Also, visualizations should not always be trusted. Sometimes the visualization tool itself is a suspect.