Saturday, January 15, 2005

The Importance of Load Testing

I recently finished up the first phase of a complete rebuild of my company's intranet. A big part of the new application is an integrated search that hits not only the intranet pages (which are largely managed using Macromedia Contribute), but also our knowledge management tool, a CF 5 app that uses Oracle as well as Verity collections as its datasources.


Because the knowledge management (KM) app contains some sensitive information and that team was reluctant to have us hit their datasources directly, our first notion of how we'd integrate the two sides of the intranet was to have our CFMX app hit a CFML page on the CF5 side with a query string, then get a WDDX packet back as the response. It seemed to work great, but then we fired up Load Runner to do some load and performance testing, and that's when things got interesting.
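
To make the setup concrete, here's roughly what that distributed call looked like. (This is a simplified sketch, not our actual code; the URL, variable names, and result columns are made up for illustration.)

    <!--- Hit the CFML search page on the CF5 server with the search term
          in the query string; the response body is a WDDX packet.
          The URL and names here are hypothetical. --->
    <cfhttp url="http://km-server/remoteSearch.cfm?term=#URLEncodedFormat(form.searchTerm)#"
            method="get">

    <!--- Deserialize the WDDX packet into a CF query object --->
    <cfwddx action="wddx2cfml"
            input="#cfhttp.fileContent#"
            output="searchResults">

    <!--- searchResults can now be used like any other query --->
    <cfoutput query="searchResults">
        #title#<br>
    </cfoutput>

The appeal of this approach is that the CF5 side never has to expose its datasources; the downside, as you'll see, is everything that happens under load.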




Load Runner is a really, really great tool. If you haven't used it before, you can easily script virtual users by hitting a "record" button and clicking around on your web app. For something like a form (such as the search form in this case), you can fill it out once and submit it, then parameterize the search in the Load Runner script by feeding it a text file. In our case we fed it a text file containing a bunch of different search terms (70, to be exact) and told it to run searches randomly based on those terms, which helps you simulate real users pretty realistically. You can also have it record what they call "think time" as you're building your script, meaning the time you spend sitting on a page just looking at it, and then tell it to randomly vary those values within a percentage range, again to simulate real users more accurately. Very cool stuff.


In our case we basically had three user types: searchers, who just repeatedly searched; clickers, who just navigated around all the content pages; and a few who did some clicking and then some searching. You then use the Load Runner controller to tell it which virtual users you want to use and how many of each, give it ramp-up, duration, and ramp-down times for the test, and fire it off. You can monitor practically everything you can think of during the test, including live stats on the server you're hitting. (Load Runner should run on its own separate, dedicated server, not the same server you're trying to test.)


I won't bore you with all the gory details, but the long and short of the situation is that under a load of even 30 search users things got nasty. REALLY nasty. Page response times started getting into the 20-30 second range, and that's not just the search functionality. Whatever was going on seemed to negatively impact the content pages as well, many of which are cached on the server.


So that's the bad news. Our architecture that works fine with a very small number of users apparently just isn't going to scale. The CFML pages held up extremely well even with 100 users or more really pounding away, but the search functionality was pretty ugly.


The good news is that we figured this out before we launched it to the users. I can't imagine the stress we would have been under had we launched it first and THEN figured out it wouldn't scale. Better to know all of this now than figure it out when my phone's ringing off the hook later.


The other great thing is that between the Load Runner reports, the server stats, and the JRun logs (we're running CFMX on JRun), we can at least figure out where the bottlenecks are. For further testing, we did the following:




  • Used WDDX as well as more "standard" XML data locally instead of shipping it over the network


  • Created a test database in SQL Server (on a separate physical machine) containing much of the same data so we could hit a database instead of using WDDX over HTTP


  • Created local Verity collections containing much of the same data (a sketch of these two local-data tests follows this list)



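For flavor, the two local-data tests boiled down to something like the following sketch (the datasource, table/column, and collection names are all made up for illustration):

    <!--- Local database test: query the SQL Server copy directly.
          Datasource and table/column names are hypothetical. --->
    <cfquery name="dbResults" datasource="kmTest">
        SELECT docID, title, summary
        FROM   documents
        WHERE  title LIKE <cfqueryparam value="%#form.searchTerm#%"
                                        cfsqltype="cf_sql_varchar">
    </cfquery>

    <!--- Local Verity test: search a collection on the same box.
          Collection name is hypothetical. --->
    <cfsearch name="verityResults"
              collection="kmLocal"
              criteria="#form.searchTerm#">

Hitting the database and the Verity collection this way takes CFHTTP and the WDDX/XML parsing completely out of the picture, which is exactly the variable we wanted to isolate.
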
From these tests and analysis of the Load Runner reports, we've determined the following so far:




  • CPU utilization is always extremely high (averaging 95%) when using WDDX or more "standard" XML, either locally or over HTTP


  • Response times are always horrendously bad under load when using WDDX over HTTP


  • Response times are quite good when using WDDX or XML locally, but CPU usage still seems high


  • Response times and CPU utilization are both excellent when hitting the SQL Server database


  • Response times and CPU utilization are both excellent when hitting local Verity collections, but this is very slightly slower than hitting SQL Server (which honestly surprised me a bit)



We're still working through some of our options, which at this point are as follows (feel free to suggest more!):




  • Hit the Oracle database and Verity collections on the knowledge management side directly. This may or may not be possible depending on what the KM team will allow us to do.


  • Replicate all the data in another database as well as replicate the Verity collections. Upside: distributes load, would perform extremely well. Downside: multiple points of failure, added maintenance headache.


  • Ship XML data over to our server on a scheduled basis. Upside: we're hitting local data. Downside: CPU utilization when you're pounding away at XML data seems high. (A rough sketch of the scheduled pull follows this list.)



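As a rough sketch of what that third option might look like, a CF scheduled task could pull the XML down nightly and save it to a local file for the search code to parse (the task name, URL, dates, and path here are all hypothetical):

    <!--- Register a task that fetches the KM data as XML nightly and
          saves it to a local file. Everything here is made up for
          illustration. --->
    <cfschedule action="update"
                task="pullKMSearchData"
                operation="HTTPRequest"
                url="http://km-server/exportSearchData.cfm"
                startdate="01/17/2005"
                starttime="2:00 AM"
                interval="daily"
                publish="yes"
                file="kmData.xml"
                path="d:\intranet\data\"
                requesttimeout="600">
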
That's where we are at this point--we're going to make a final determination on a path forward this week. I'm just sharing this because it's been an extremely educational process to go through, and it points out the huge importance of load testing your apps.


If you don't have the bucks for Load Runner, consider using Microsoft's Web Application Stress Tool or OpenSTA to load test your apps before deployment. Believe me, you'll be glad you did!


Comments


Matt: I first hit your post as ammunition for my colleagues to institute some "best practices" (and a timely reminder for us too - cheers mate). After having a really good read, I gotta ask what exactly you were testing for. I ask because we may have a similar app architecture here and it could be a heads up for us on a different level.

"CPU utilization is always extremely high (averaging 95%) when using WDDX or more "standard" XML, either locally or over HTTP" - so this was due to using WDDX/XML's send/load (as a transport medium)? Or CF parsing of the WDDX/XML? (I take it the WDDX/XML data from the CF5 system was converted into something the CFMX side could use - not streamed directly to the browser and processed there, possibly with Javascript?)

"Ship XML data over to our server on a scheduled basis." - so while WDDX/XML had its problems, provided it was on the local server it was still "viable" to do it this way?

"Our architecture that works fine with a very small number of users apparently just isn't going to scale" - we're backed into a corner to use XML for what we have to do, and that's why I'm asking these specific questions.

Getting back to the original intent of your post - if it were me, we'd already have LoadRunner built into our unit testing. We've busted a gut creating reusable components, and I'd like the unit testing to benchmark each as they're made. As I added to the link to your post when I sent it to the whole team this morning: "IMHO, making the code compile (work) is only half the story..." Cheers barry.b


Hi Barry--I'm not sure I understand exactly what you mean with the "what exactly you were testing for" question, so if you can clarify that I can give you more information. At the outset we were just testing for basic performance under load and needless to say were a bit disappointed with the distributed portion of the application. This led to the further investigations that I've outlined.

Concerning CPU utilization, when I initially saw the high CPU figures I thought for sure it had to do with the CFHTTP calls we were making to the remote server and the overhead involved with that. It turns out that this was definitely a reason for the slow response times (when we switched to local data the response times improved dramatically), but it didn't seem to affect CPU utilization, because when we use WDDX or XML data that's local to the machine, the CPU utilization is still really high. My big question at this point is why this is happening. The good news is that at 30 users or 100, the CPU utilization stays about the same, but I'm still not happy with it and I'm definitely afraid that during peak use the server would become non-responsive.

Oh yes--one thing I failed to mention. On a lark I tested this code on a standalone CFMX box, since we're building this on CFMX on JRun. Exact same results, so I know it's not an issue with JRun (or if it is, it's a consistent issue with the full version of JRun and the one that runs underneath CFMX standalone). At this point all I can conclude is that XML parsing is CPU intensive, and if you have 100 people or more banging away at operations that involve XML parsing, you'd better have a pretty beefy server on which to run your app.

As for the difference between WDDX and XML, it seemed minimal. When we get the WDDX over HTTP, I just run the cfwddx tag to convert the WDDX data to a query object. I timed just that portion of the operation (the cfwddx tag) and even under load it's extremely quick. Just to make sure that wasn't the bottleneck, however, I tested things without doing that step and it didn't seem to make a discernible difference. This is what led me to trying "standard" XML as well. Again, not much difference, either in data packet size (I was concerned that WDDX was just heavy by nature, but it's really not bad) or in CPU utilization.

As for using local XML data, I don't really see this as a viable option at this point, but since we're still working through all of this I'm just trying to list every possibility. Then at least our team can be satisfied that we've thought through literally every option. Unless we got a quad-processor box with tons of RAM (our dev box is a relatively old dual P-III 500 with 1.25GB of RAM) or started getting into clustering (which we may do anyway ...), I don't really see this working under load. For our targeted user base, if we had two killer servers behind a load balancer it would work OK, but at that point I see it as using hardware to solve an application architecture issue. If you're completely stuck using XML and there's no other way around it (which might not be the case in our situation), then that may be what you're facing. Unless I'm doing something horrendously wrong, XML operations in CFMX just seem pretty CPU intensive. Another test I'm going to do if I have time is rewrite just this piece in Java to see if that works any better, but I don't suspect that it will.

I think in our case we're going to determine that we don't want to spend tons of money on hardware to solve the problem, and we'll likely end up having the knowledge management team create a read-only view in Oracle that they're happy with from a data access/security standpoint, and we'll either hit their Verity collections directly somehow or replicate those to our server once a day. Based on the testing I did hitting a database and Verity, this performed really well even on our relatively modest dev server.

You're absolutely right about testing early and often. Since this is the first large distributed app I've really worked on (like most CFers I'm far more used to having direct access to a database), when I was testing on my own and seeing response times of 2-4 seconds for searches, I thought "a bit sluggish but it'll be fine." The Load Runner testing told a completely different story! In the end, testing for potential bottlenecks like this early in the development process is pretty crucial. We haven't announced this app to the company yet but they're chomping at the bit to do so, so we're a bit under the gun. If I had tested this piece of things a lot earlier, we would have worked through the possible solutions a lot earlier as well.

I'll definitely post what we wind up doing, and if anyone reading this has any further ideas or questions, I'm all ears! Matt
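
P.S. For anyone who wants to try the same kind of measurement, the timing I mention above was nothing fancier than wrapping the cfwddx call in getTickCount() calls, along these lines (the variable names are made up):

    <cfset tickBegin = getTickCount()>

    <!--- wddxPacket holds the raw packet returned from the CFHTTP call --->
    <cfwddx action="wddx2cfml" input="#wddxPacket#" output="searchResults">

    <cfset tickEnd = getTickCount()>
    <cfoutput>cfwddx took #tickEnd - tickBegin# ms</cfoutput>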


cheers Matt: I should take this off the comments section and email you directly, but this touches on two things in your post. You've clarified that it's XML parsing - not transport - that's the root of the issue. So...

1) the actual problem (well, the XML part anyway): we're forced into using XML as a transport medium to drive a very complex UI. Javascript is used client-side, but your post has me worried about when it's time to parse the XML in CF. You're saying that XmlParse() and manipulation of XML (as arrays and structs) is the cause of the 95% CPU utilisation? gee, that's a worry... For us we may be in luck - the server-side XML building and parsing is done as strings (the guy that wrote it is an old unix programmer so RegExp is his best friend - he's also slightly mad...). I suppose if it *can* be traced back to the code level, slotting in different parsing techniques might change the results. Just a thought...

2) the original intent of your post - the value of load testing: I've been banging my head against a wall here to get some best practices going: the present culture is "ship it and we'll refactor later" (especially for performance). My biggest worry is that the XML solution will be accepted but we won't be able to prove that it'll scale until it's too late - the technique will be all over the app and we won't be able to do a thing about it. It should have had its own feasibility study done, or at least incorporate load testing in the unit tests (which we've been begging for anyway).

bottom line? I want to be able to sleep at night... I'll be very curious how you get on. good luck barry.b


Hi again Barry--as far as I can tell, it is indeed the XML parsing itself that's causing the high CPU utilization. Bear in mind, however, that in our case, since this is a search application, we're talking about quite a lot of XML data. I tested with a local XML document that was about 115K for each user (this represented about 100 or so search results), so when you take that times 50 or 100 users, you're getting into a lot of data being parsed. So what I'm getting at is that if you're dealing with a lot less data than that, it may not be quite so much of an issue.

Concerning manipulation, I've tried everything from a simple dump to doing a toString() to using XSLT (both read from a file and cached in RAM) to manipulate the XML once it's parsed. Nothing seemed to have much of an impact.

My next avenue of investigation is going to be the XML parsing CF is doing under the hood. My understanding is that CFMX by default is using DOM (which I'm going to verify), which of course is a lot more RAM intensive than SAX. DOM reads the whole XML document into memory for random-access capabilities, which I don't need in my case, whereas SAX is a top-to-bottom read, which would work for me. If I remember correctly from when I last messed with all this in Java, SAX is much faster than DOM, so that could give me more avenues of investigation. This isn't ideal since I'm still dealing with a CF 5 app on the other side that's generating the data, but if using SAX spanks the pants off of DOM then it may be another possibility. (This gets at what you mention concerning alternative parsing strategies.)

I completely hear you on the "I want to be able to sleep at night" comment! I think everyone thinks I'm a bit nuts at this point because I'm obsessing over this, but I'm not going to feel comfortable launching this to my entire company until this is resolved. I'd never really done load testing before, and I'd never had problems, but believe me, after this experience I'm going to load test everything. I just hadn't had problems before because I was very lucky.

I'll let you know what more I find out, and thanks for the dialogue--it's getting my brain going in other directions on some of this! Matt
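
P.S. If I do get time to test SAX against DOM, the quick-and-dirty comparison from CFML would look something like this sketch, which times XmlParse() against the JDK's SAX parser using a do-nothing handler (the file path is made up, and a real SAX solution would need a small Java handler class to actually collect the data):

    <!--- DOM: CFMX's built-in XmlParse() builds the whole document in RAM --->
    <cfset xmlFile = "d:\intranet\data\kmData.xml">
    <cffile action="read" file="#xmlFile#" variable="rawXml">
    <cfset t1 = getTickCount()>
    <cfset domDoc = XmlParse(rawXml)>
    <cfset domTime = getTickCount() - t1>

    <!--- SAX: parse the same file with a no-op handler, so this measures
          raw parse speed only --->
    <cfset t2 = getTickCount()>
    <cfset factory = createObject("java", "javax.xml.parsers.SAXParserFactory").newInstance()>
    <cfset parser = factory.newSAXParser()>
    <cfset handler = createObject("java", "org.xml.sax.helpers.DefaultHandler").init()>
    <cfset parser.parse(createObject("java", "java.io.File").init(xmlFile), handler)>
    <cfset saxTime = getTickCount() - t2>

    <cfoutput>DOM: #domTime# ms / SAX: #saxTime# ms</cfoutput>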


Just a quick follow-up--we're retooling this to hit the Oracle database directly and we'll be using local Verity collections as well. I'm sure we'll get some radically different results from the next round of load testing! In the end we determined that in our situation the original setup just wasn't going to scale unless we threw a ton of money at more hardware, which wasn't necessary since we have other options. The original architecture looked great on paper and kept things really simple in terms of the integration between the systems, but alas, it just wasn't going to hold up under load. Live and learn! Matt


>> I think everyone thinks I'm a bit nuts at this point because I'm obsessing over this, but I'm not going to feel comfortable launching this to my entire company until this is resolved. well, who's been vindicated over the worth of these tests, eh? A timely reminder to us all, methinks. now if only I can get my people to learn from this... (sleep well, Matt) cheers barry.b


Matt: I've come across someone else who had performance issues with the Apache Crimson classes (v1.0) that CFMX uses, especially when using XPath queries, etc. Because he was on a shared server that had additional Java libraries installed for XML, he used those instead, with much better performance. They can be found at http://www.dom4j.org/ - hope this helps. cheers barry.b http://barryb.coldfusionjournal.com/


Thanks Barry--very good to know! We're moving forward with the direct database solution but I'm sure this will come in very handy in the future.
