As we'd added new versions of our system into the live environment fairly recently, the development team was asked to join into the discussion of how to solve the problem. This was a little unusual, as we don't normally get involved in the support of the live systems... be that firefighting or strategic support. And it wasn't that our applications were under suspicion, it was just that it made political sense to get us involved.
Rather than jump straight into the database, we took a holistic look at the performance of the applications, taking in as many components as we could: from client machines, through network infrastructure down to individual datafiles on the database server. In the end we came to the usual, almost inevitable conclusion... the database was the bottleneck, and it wasn't any single aspect of the database.
One thing we did notice however, was that the memory settings for the database server appeared to be unusual. So we advised to the DBA that these be looked at straight away. The DBA's suggested approach to this was
"Change the settings one at a time on live, if it doesn't break the system then we're probably on the right lines. Because it's dangerous doing this, we can only make one change per day and then wait for any calls on the service desk before we make the next change"
The rest of us didn't think that sounded like a safe way to go, so I suggested a new approach.
I put forward the idea of taking a production quality server that we could run in isolation, and importing the database owned by the brand that had the most urgent problems. We would then write an application that would stress that database in line with the behaviour of the live system, we would monitor the system's peroformance during the test and record our findings. I argued that once we had that in place we could produce a benchmark, make a change and rerun the test. This would then tell us how the system's performance characteristics had changed. We could then make the next proposed change, run the test again. Then make another change...
Having this tool in place would mean that we could thoroughly test the performance of the system in relation to our changes without ever having to rollout a change into a live environment. Once we were sure that we had made a step change in the performance of the system we could roll that change onto the live system reasonably safe in the knowledge that we would always have a positive impact on the system.
The rest of the team agreed and we set about producing our performance test rig.
Our requirements were:
- The test should be completely repeatable. Meaning that, as far as is reasonably possible, the test should perform the same queries, at the same time, with the same parameters, every time the test runs.
- When the test is running, the database being tested should appear to the DBA like the live version of the system under a particularly high level of stress.
- The connections to the database should be made in the same style as the live system. E.g. emulations of connections from a web application should appear, make their requests and then disappear. Connections from a client machine application should appear at the start of the test and then remain connected for the duration of the test.
- It should be easy to increase / decrease / change the characteristics of the test load.
- It should be possible to run the test by entering a single command into a suitably set-up machine. It should then be possible to re-run the test by entering the same single command.
- Running the test should include reporting on the performance characteristics.
I hope to go into the design of the solution in another blog entry soon...
We've since put the performance test rig together, and using it we managed to very closely replicate what was hapenning on live. Each test takes an hour to run and emulates a load equal to around double to the normal live load. We ran several test runs, each time changing a single memory setting until such a time as we had a decent change in performance.
We measured the performance differences by checking the average, minimum and maximum run times of each of our sets of emulation queries (we had a set for each of our applications, sometimes a set for each module in the application). The test rig also produces two stats pack snashots for each test, one at the start and another at the end. All the test results are then copied into a seperate database for long term storage and reporting.
Once we had the test rig up and running it took a day to tune the memory settings for our biggest database. We were able to run 5 tests in that day and end up with a very clear picture of the effects our changes would make on the live system.
The changes involved slashing the shared pool to less than a quarter its original size, taking the PGA target up by a factor of four and doubling the buffer cache. We then rolled the complete set of changes out onto the live database in one go.
It more than doubled the overall performance of the database. Exactly as predicted.
Since then the performance test rig has been earmarked for a few other jobs:
- Balancing our read : write ratio on the local disk controllers.
- Testing and tuning a Network Attached Storage device as a replacement for local disks.
- Checking the impact of adding new indexes on heavily used areas of the system.
- Investigation into possible designs for a new realtime "MIS light" application module.
- Testing if regular defragmentation of our DBs give us an increase in performance.
- Examination of how block sizes affect the performance of the system.
Basically, because the test rig properly emulates what happens on the live system, it can be used to investigate any change that would impact the performance of the system, be that hardware or software. And it allows us to make that investigation quickly and without any impact on any live system.