Wednesday, January 26, 2005

DAOs and Composition

In the spirit of what Joe Rinehart and a few others have been doing so far in this the "year of OO for CFers," I thought I'd share what I've been working through this week. Specifically I've been grappling with Data Access Objects (DAOs) and how best to use them when composition is involved with the objects. I've done this several times before but I'm using a rebuild of our CFUG site to dig deep and try and figure out the best way to handle this (I'll stop short of saying "the right way to handle this"). For example, let's say I have a Person bean and an Address bean, and the Person bean has an Address bean in it. What's the best way to handle this situation in the DAOs?
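
To make the composition concrete, here's a minimal sketch of the two beans (the property and method names here are illustrative assumptions, not the actual CFUG site code):

    <!--- Address.cfc (sketch): a simple bean holding address data --->
    <cfcomponent displayname="Address">
        <cfset variables.id = 0>
        <cfset variables.street = "">

        <cffunction name="setId" returntype="void" output="false">
            <cfargument name="id" type="numeric" required="true">
            <cfset variables.id = arguments.id>
        </cffunction>

        <cffunction name="getId" returntype="numeric" output="false">
            <cfreturn variables.id>
        </cffunction>
    </cfcomponent>

    <!--- Person.cfc (sketch): composes an Address bean --->
    <cfcomponent displayname="Person">
        <cffunction name="setAddress" returntype="void" output="false">
            <cfargument name="address" type="Address" required="true">
            <cfset variables.address = arguments.address>
        </cffunction>

        <cffunction name="getAddress" returntype="Address" output="false">
            <cfreturn variables.address>
        </cffunction>
    </cfcomponent>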




Let's focus on the create method specifically since it's the most messy. If we're dealing with a Person and an Address, let's assume we have a form on which a person enters their basic personal information (name, email, etc.) and also their address information. When they submit this form, on the backend we need to create a Person object and an Address object, populate them with the form data, and pass them to a DAO to run the create method. There are a few different scenarios I've mulled over, and I've come up with my preferred way of doing this, but I'd be curious to get your feedback.


Scenario One: Handle Person and Address Separately


Because these are separate objects and each have their own DAO, one way of handling this would be to handle the Person and Address objects separately. In other words, after the form submission, in whatever component is handling the logic of processing the form, populate a Person, populate an Address, then call personDAO.create(person) and addressDAO.create(address) in sequence.


The Issues


This may seem like the simplest way to handle things, but there are a few issues that arise. First, there is a bit of chicken and egg stuff going on with the relationship between Person and Address on the RDBMS side. Since the person and address tables are separate and related through a key (address_id in the person table), the Person really should have an Address id before personDAO.create() gets called. So we could reverse the order of the calls above and call addressDAO.create() first, then grab the address id from the result of that call and put it in the Person object, then call personDAO.create().
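
In code, the reversed order might look something like this (a sketch; assume for illustration that create() returns the new id):

    <!--- create the Address first so its new id can be set
          on the Person before the Person itself is saved --->
    <cfset newAddressId = addressDAO.create(address)>
    <cfset person.setAddressId(newAddressId)>
    <cfset personDAO.create(person)>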


This isn't necessarily a bad way to do things, and would certainly work in this situation, but what happens when you get into more complex scenarios with multiple instances of composition (which in my application is the case)? In some cases the relationships and chicken-and-egg stuff gets even more complex, so you end up making multiple calls to your DAOs (create first to get the ids needed, then update later to put the ids in the necessary spots). Also, in my opinion you end up really muddying up things in the component that's handling the logic of the form submission (which in Mach-II is done in the listeners). So in my mind I scratched this option off as not the way to go about doing things.


Scenario Two: Put Address in Person and Have Separate Queries in the PersonDAO


It doesn't take more than a few lines of code in this scenario to realize the logic is faulty. It may seem like a decent idea at first, and you can even put the person and address queries in a nice tidy transaction on the database side, but you end up completely duplicating the code that's in the Address DAO, which is a big, big, big no-no. I must admit in some cases for deletion I have done this, but delete queries are typically extremely simple and likely wouldn't change over time. Something like a create might change in the future (if you add a field, for example), so you'd end up having to maintain the create logic in two places. Clearly not the right solution.


Scenario Three: Leverage the Address DAO Within the Person DAO


We're using composition in our beans, so why not use composition of sorts (this isn't strictly composition, but bear with me ...) in our DAO as well? When we instantiate the Person DAO, why not just instantiate an Address DAO inside the Person DAO so we can call things that way? Then after the form submission we're instantiating the Person bean, the Address bean, putting the Address bean in the Person bean, and just calling personDAO.create(person).
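
Roughly what I have in mind (a sketch; it assumes AddressDAO.create() returns the new address id, and the column and method names are just for illustration):

    <!--- PersonDAO.cfc (sketch): instantiates an AddressDAO internally
          and delegates the address insert to it --->
    <cfcomponent displayname="PersonDAO">

        <cffunction name="init" returntype="PersonDAO" output="false">
            <cfargument name="dsn" type="string" required="true">
            <cfset variables.dsn = arguments.dsn>
            <cfset variables.addressDAO = createObject("component", "AddressDAO").init(arguments.dsn)>
            <cfreturn this>
        </cffunction>

        <cffunction name="create" returntype="void" output="false">
            <cfargument name="person" type="Person" required="true">
            <!--- delegate the address insert, then use the new id --->
            <cfset var newAddressId = variables.addressDAO.create(arguments.person.getAddress())>
            <cfquery datasource="#variables.dsn#">
                INSERT INTO person (name, email, address_id)
                VALUES (
                    <cfqueryparam cfsqltype="CF_SQL_VARCHAR" value="#arguments.person.getName()#">,
                    <cfqueryparam cfsqltype="CF_SQL_VARCHAR" value="#arguments.person.getEmail()#">,
                    <cfqueryparam cfsqltype="CF_SQL_INTEGER" value="#newAddressId#">
                )
            </cfquery>
        </cffunction>
    </cfcomponent>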


At this point, after much pacing and pondering, this is what makes the most sense to me. That isn't to say there aren't downsides here as well, which is why I'm posting my thoughts to get some feedback from others on this. I've seen plenty of examples of Java code that do this, so I'm assuming it's not flat-out "wrong," it seems to work well, and I really like the fact that it keeps things self-contained by only necessitating one call to the Person DAO (even if behind the scenes the Person DAO is actually calling the create method of the Address DAO as well).


There are still some chicken-and-egg situations when you do things this way; those are relatively unavoidable in any of these scenarios. But what this gains you is A) complete reuse of the DAO code (no duplication of code as in scenario two above), and B) much cleaner logic in the form-processing component, because you're just instantiating some objects and passing a single object to its DAO for the create action.


So am I on the right track? Is there a fourth option I haven't considered? Should I find a new career? Let me know your thoughts.


Comments


I'm glad you posted this. I have the same personal debates and haven't been able to fully commit to any one solution as being the "best". Maybe you have an object UserAddressDAO that you call to create, and it instantiates the UserDAO and the AddressDAO--so the "chicken and egg" logic is handled in that one place, where the id relationship is handled at a data-abstracted location (as opposed to in your business object)? (I guess this is kind of a facade.) I am anxious to hear what others think. I am currently debating the inclusion of some deletion code in a DAO, as you glossed over ("I must admit in some cases for deletion I have done this, but delete queries are typically extremely simple and likely wouldn't change over time.") on the cfc-dev mailing list, and can't really come to a "feels right" answer on that either. I currently do just what you mention--I'm just not sure it's the best way.


I think it makes sense... I do this type of thing all the time. It makes sense that if Person HAS-A Address, then PersonDAO has an AddressDAO. In fact, I even have DAOs that contain gateways for another type. Just the other day I had a "student" object that had a property like "previous schools attended" (an array). So my studentDAO received an instance of a schoolGateway via dependency injection (rather than creating it on its own), and during its read() operation it placed a result from schoolGateway.getPreviouslyAttendedByStudentID(id) into the student instance. All the merits you mentioned are valid... in fact I use this same instance of the schoolGateway elsewhere. I know I'm starting to sound like a broken record, but you should take a look at what I've done w/ the Spring -> CFC port. It makes managing these situations a hell of a lot easier.
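
Roughly sketched (the exact wiring will vary), the idea looks like this:

    <!--- StudentDAO receives a schoolGateway via dependency injection --->
    <cffunction name="init" returntype="StudentDAO" output="false">
        <cfargument name="dsn" type="string" required="true">
        <cfargument name="schoolGateway" type="SchoolGateway" required="true">
        <cfset variables.dsn = arguments.dsn>
        <cfset variables.schoolGateway = arguments.schoolGateway>
        <cfreturn this>
    </cffunction>

    <cffunction name="read" returntype="Student" output="false">
        <cfargument name="id" type="numeric" required="true">
        <cfset var student = createObject("component", "Student").init()>
        <!--- ...populate the student from the student table here... --->
        <cfset student.setPreviousSchools(variables.schoolGateway.getPreviouslyAttendedByStudentID(arguments.id))>
        <cfreturn student>
    </cffunction>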


Matt, This is a GREAT post. I think you're right on track, and you did a great job explaining why Scenario Three is probably the rig-- erm, a Very Good way to handle this problem. Patrick


Thanks for the thoughts Bill--the issue I see with the UserAddressDAO situation is that you wind up with a class explosion of sorts. If we throw a Company object into the mix, do you then build a UserAddressCompanyDAO? That could get ugly real quick. ;-) When I'm thinking through this stuff I try to talk through all the scenarios, no matter how unlikely they seem at first, just to make sure I understand *why* something isn't a good solution. Then I can feel like I've covered all my bases. There's someone who posts on one of the CF lists with the tag line, "I haven't failed. I've found 10,000 ways that don't work" (or something along those lines), which I think is attributed to Edison. That's definitely how I feel doing all this OO stuff. I often spend more time thinking through all the options than actually implementing the one that makes the most sense, but this isn't necessarily a bad thing! Better to spend a lot of time thinking through everything than building a bunch of stuff you have to throw out when you realize you've coded yourself into a corner.


Thanks Dave--I'm really interested in checking out the Spring port you've done. I've read quite a bit about Spring on the Java side of the world and it's pretty impressive, particularly in contrast to something like Struts. Can you post a link here or email it to me? I definitely want to check that out.


Thanks Patrick--given the brief feedback so far at least I know I'm not way off the mark. That will allow me to sleep better at night. ;-)


Though I find myself using the third scenario most often, here's a fourth scenario to throw out there for discussion: Scenario Four: Drop the AddressDAO In this scenario, all the DB functions for managing Addresses would be handled through the PersonDAO. I'm sure there are alarms going off in your head, thinking "What?! I'll have to rewrite the code to manage Addresses for any other business object that has an Address!". That's more than likely correct... the Person-Address example isn't really a good one for this scenario. Instead, consider this example of composition (as opposed to aggregation or association): a Worm is composed of Segments. Since Segments cannot exist without a Worm (and don't make sense outside the context of a Worm), Worm manages the lifecycle of its Segments. Similarly, we can have WormDAO manage all the DB functions for Segments, since no other DAO will ever have to work with Segments (no danger of duplicate code). One big plus of this approach is that you can take advantage of the DB's joining capability (and grab the Worm and all its Segments in one, optimized query). Of course, if WormDAO starts getting bloated, you can factor out the Segment functionality into a SegmentDAO.
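
As a rough sketch (table and column names invented for illustration, and assuming Worm exposes an addSegment() method), the read could grab everything in one shot:

    <!--- WormDAO.read(): one joined query pulls the Worm and all its Segments --->
    <cffunction name="read" returntype="Worm" output="false">
        <cfargument name="wormId" type="numeric" required="true">
        <cfset var qWorm = "">
        <cfset var worm = createObject("component", "Worm").init()>
        <cfquery name="qWorm" datasource="#variables.dsn#">
            SELECT w.worm_id, w.name, s.segment_id, s.position
            FROM worm w LEFT OUTER JOIN segment s ON w.worm_id = s.worm_id
            WHERE w.worm_id = <cfqueryparam cfsqltype="CF_SQL_INTEGER" value="#arguments.wormId#">
            ORDER BY s.position
        </cfquery>
        <!--- first row populates the Worm; every row adds a Segment --->
        <cfloop query="qWorm">
            <cfif qWorm.currentRow EQ 1>
                <cfset worm.setName(qWorm.name)>
            </cfif>
            <cfif len(qWorm.segment_id)>
                <cfset worm.addSegment(qWorm.segment_id, qWorm.position)>
            </cfif>
        </cfloop>
        <cfreturn worm>
    </cffunction>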


Nice post. How would you handle multiple addresses per person with your scenario three? I suppose you'd have person.address be an array of Address objects, eh? Now, what I am really stuck on is how to wrap these creates into a DB transaction. What happens if the person.create is successful but the address.create is not? Doug


Definitely true Doug--if there is no need for a separate Address object (or Segment object in your example), then you wouldn't need, and really shouldn't have, a separate DAO unless there's some other driving need. I actually thought about not having the Person and Address have separate DAOs, but in the end I decided to keep the two separate so I could expand in the future as needed. Great point though--don't create objects just for kicks; make sure you need them and it makes sense to have them be separate entities.


Douglas, I'd likely handle multiple addresses as you outline, namely with an array of objects. As for your other comment concerning transactions, I'm still messing with that myself, so I'd be curious to hear what other folks have done. What I usually do (at this point anyway) is have the create() methods return some sort of indication of whether or not they were successful, and that way I can proceed or not based on the success of each step. That doesn't address the rollback issue, however, so if the first step fails I'm OK, but if the last step fails I'm in a bit of trouble. I haven't quite thought through the best way to handle that, other than keeping track of which steps succeeded so that if a later step fails, I can go back and run deletes as needed on the previous steps. Not particularly pretty but it does the trick, although then you get into what happens if the deletes fail ...


Good post Doug. I think what you say about the Segment/Worm makes a lot of sense and will probably apply to the problem I am having. I hadn't thought to have one "big" DAO that takes care of a parent element and its subordinate objects (those that can only exist if the parent exists), then pull out those subordinate objects that end up needing to be in their own DAO. I guess I was under (the false?) impression that a DAO would only deal with one object's interface to the datasource.


I think this was my impression when I first started learning all this as well Bill, that a DAO would only deal with one object. The more code samples I looked at (which are largely in Java, which is one of the reasons I'm trying to make this big charge for CF-specific OO materials this year), the more I saw that this wasn't necessarily the case, and it started to make more sense as well as simplify the code overall, which is always a good goal to have. The CFUG site I'm working on is turning into a bit of a beast, but once it's done I think it should be a relatively decent example of a lot of these concepts as well as a good semi-large Mach-II sample application. I've gotten a lot out of studying Phil Cruz's mach-ii.info site as well as several of the examples that Joe Rinehart, Scott Barnes, and others have put on their blogs.


In regards to the rollback, I seem to remember reading that a transaction will work across components in CF 6.1 now. So it seems to me you could have a "transaction" object that calls your various methods for you--a facade, I guess, since it simplifies the interface while still giving you full access to the power behind it--and in the facade you would wrap those method calls in a cftransaction. Now, something I have never tried is having a cftransaction wrapped around these generic factory/execution combinations, so that the cftransaction is in place as needed if my datasource is a db that supports transactions, but the code won't blow up if my datasource is something else with no queries involved. I'll have to confirm whether that is a valid scenario--cftransaction wrapped around code that contains no queries. It seems like it would be OK but might have a slight performance hit. Hrm. Of course, you could have a facade factory, I guess, that returns the correct facade based on your datastore; the XML-based facade wouldn't have the cftransactions in it, but the SQL-based one would. I guess the merged DAO thing that Doug mentioned would work here--in the same regard, a merged "facade" could exist to handle an object and all of its subordinate objects (those that wouldn't exist without it). So you could have FacadeFactory -> Facade_xml || Facade_odbc, and Facade_odbc would be responsible for instantiating whatever objects it needed, then wrapping those object method calls in a cftransaction. This way you get the rollback, the separation of methods into their reusable components (DAOs and so forth), and an easy-to-access interface for your application to deal with. What do you think? Does this seem like a huge stretch?
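
To make it concrete, the SQL-flavored facade might look roughly like this (a sketch; the DAO setup in init() is assumed, as is addressDAO.create() returning the new id):

    <!--- Facade_odbc (sketch): wraps the composed DAO calls in one
          cftransaction so an error anywhere rolls everything back --->
    <cffunction name="createPerson" returntype="void" output="false">
        <cfargument name="person" type="Person" required="true">
        <cfset var newAddressId = 0>
        <cftransaction>
            <cfset newAddressId = variables.addressDAO.create(arguments.person.getAddress())>
            <cfset arguments.person.setAddressId(newAddressId)>
            <cfset variables.personDAO.create(arguments.person)>
        </cftransaction>
    </cffunction>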


That gives me a lot of very cool ideas to try out Bill--thanks! As with most of this OO stuff, there's a million different ways to go with things, so I'll have to experiment, particularly with the transaction stuff. If there's a way to get that going without necessarily involving database queries that would be very cool, I've just never tried it. One of those things I just assumed wouldn't work so I never messed with it.


I always use a service object that has my DAOs called inside of it. This helps with the situation you talk about above, because I would have a PersonService that has getPerson() and savePerson() methods. savePerson() knows to call save using the personDAO and then to call save using the addressDAO for each address inside my person. This service layer also works great when Flash Remoting and/or web services are involved. I would be happy to do a blog post on my method to get more feedback if people are interested.
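
In outline it looks something like this (a sketch; it assumes person.getAddresses() returns an array of Address beans):

    <!--- PersonService.savePerson(): coordinates the DAOs --->
    <cffunction name="savePerson" returntype="void" output="false">
        <cfargument name="person" type="Person" required="true">
        <cfset var addresses = arguments.person.getAddresses()>
        <cfset var i = 0>
        <cfset variables.personDAO.save(arguments.person)>
        <!--- save each address composed in the person --->
        <cfloop from="1" to="#arrayLen(addresses)#" index="i">
            <cfset variables.addressDAO.save(addresses[i])>
        </cfloop>
    </cffunction>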


Thanks Kurt, that would be cool. It sounds like that's similar to what I'm doing (as far as calling the person DAO method and then that in turn calls the address DAO method), only with the addition of the service layer in the picture. I'm always happy to see details on what other folks are doing with all of this.


Best stuff I read today. Thanks for the inputs. DK


This is great Matt! This blog (and all the comments) just answered my previous problem posted on CFCDev about an Article object with multiple properties in it. I didn't know it, but what I meant was _multiple composition_; it's hard when you don't know the terminology of what you need. Now I can have my _ContentObjectClass_ DAO managing all _ContentObjectClassProperty_ DB queries, like Doug Keen's segment and worm example. Or maybe I can simply have the _ContentObjectClass_ DAO calling the _ContentObjectClassProperty_ DAO, but anyway, things are getting better all the time, thank you guys.


I think I saw that discussion on CFCDev, Marcantonio--it's part of what got my wheels turning on all of this. I'm getting a lot of great ideas from these discussions as well.


Excellent discussion! I ran into the same issue, but with a header/detail object scenario. I used the same idea as Kurt and created a Service. That allowed the service to contain the logic that knows how to deal with the interaction of the DAO objects, and I can keep the DAOs as "pure" as one can. For the batch header and detail row issue, I used a UUID to solve the chicken-and-egg issue. It requires 32 more bytes per row to store a char(36) vs. an integer, but our main tables will only have up to 1,000,000 rows. We archive to a data warehouse, and those tables use identity columns for smaller storage. I have built and torn apart my objects and services 3 times now in the past 2 weeks, but they keep getting better. And don't get caught up in the *best* process - it will never be good enough. I have been at this for more than 16 years and the idea of *satisficing* is best practiced. No matter how good you build it this time, you will rebuild in the future to take advantage of new technologies and methods. That's the price of progress. And you can't let the new folks take advantage of new ideas/technologies while you stay connected to your old *best* methods. Things move too fast for that to be successful. Excellent conversation!
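
The nice thing is that with CreateUUID() the keys exist before anything touches the database; for example (bean and method names assumed):

    <!--- with UUID keys, both ids are known before either insert runs,
          so there is no chicken-and-egg ordering problem --->
    <cfset person.setId(createUUID())>
    <cfset address.setId(createUUID())>
    <cfset person.setAddressId(address.getId())>
    <cfset personDAO.create(person)>
    <cfset addressDAO.create(address)>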


Thanks for the thoughts Paul--you know, I thought about using UUIDs and then "just decided not to" (no rhyme or reason to it), but now that I've thought through this stuff 100 times I think I may rework my database and all my CFCs to use UUIDs instead. I'll think through all my scenarios to make sure it'll get me what I'm after, but you're right, that would definitely avoid the chicken-and-egg situation with the IDs. Thanks for bringing up something that makes me want to do yet more rework! ;-) Excellent thoughts on refactoring as well. It's nice to hear from someone with a lot more experience than myself so I know I'm not just bad at this stuff! I'm a big enough dork that I'm having a great time thinking about all these OO design issues, reading all the books I can get my hands on, experimenting, reworking, etc. Sometimes I feel like I'm spending tons of time thinking and not a lot of time doing, but as I think I said before, better to do that and make well-informed decisions than just slap something together and hope it's maintainable. Thanks again--now off to see about using UUIDs ...


You need not use UUIDs to beat chicken-and-egg; you could use a Sequence. In a current project I have an MS SQL Server db table named 'Sequence' with columns 'SequenceName' and 'SequenceId'. My DAOs all extend a GenericDAO.cfc which has a getNewId() method that calls a stored procedure to get the next integer id in a specified (named) sequence. In this app my table has 3 rows, with 'SequenceName' = 'UserGroup', 'NodeItem' and 'ValueObject', and the SequenceId for each row is the next integer id that will be returned. My UserDAO and GroupDAO both use a 'UserGroup' sequence, so each sets variables.sequenceName = "UserGroup". getNewId() calls the stored procedure like this:

    EXEC nextVal @sequenceName=<cfqueryparam cfsqltype="CF_SQL_VARCHAR" maxlength="50" value="#variables.sequenceName#">

We adapted the stored procedure from one by James Thornton ( http://jamesthornton.com/software/coldfusion/nextval.html ) -- ours looks like this:

    CREATE PROCEDURE nextVal
        @sequenceName nvarchar(50),
        @sequenceId int=NULL OUTPUT
    AS
        -- return an error if sequence does not exist
        -- so we will know if someone truncates the table
        set @sequenceId = -1
        UPDATE Sequence
        SET @sequenceId = SequenceId = SequenceId + 1
        WHERE SequenceName = @sequenceName
        SELECT @sequenceId as sequenceId
    GO

So if you want, each of your DAOs can have its own sequence of integer ids, or all can use the same one. My DAOs use getNewId() within their create() methods to grab an id just before inserting the new record, but you could just as well call getNewId() as a public method up front, set it in your User bean, and do the same for your Address bean. Now each can know the other's id before you persist them. Hmmm, I guess for flexibility the create() method should check for a non-zero id in the incoming object and use that if present; if the id is zero (as set by the constructor) then call getNewId() for the id to use in the INSERT query. There's your 32 bytes per record back, at the expense of calling the stored proc once per id (instead of using CreateUUID() in CF). Also, integer ids are nice if you have to move things between (say) Oracle and SQL Server, heaven forbid...
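
For completeness, the getNewId() method itself is more or less this (trimmed down):

    <cffunction name="getNewId" returntype="numeric" output="false">
        <cfset var qNextVal = "">
        <cfquery name="qNextVal" datasource="#variables.dsn#">
            EXEC nextVal @sequenceName=<cfqueryparam cfsqltype="CF_SQL_VARCHAR" maxlength="50" value="#variables.sequenceName#">
        </cfquery>
        <!--- the stored procedure SELECTs the new id back as sequenceId --->
        <cfreturn qNextVal.sequenceId>
    </cffunction>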


lars: I understand your approach, but there are 3 issues for us: 1. We would have to incorporate into the sproc a check for uniqueness. We are going to key off that new ID, and the Sequence table row for a given SequenceName can be reset without any db constraints validating it. This only pertains if you care about uniqueness, which we do in our case. 2. This solution will have a difficult time scaling, as every item in the application will now depend on this one DAO and sproc to handle all new rows. That could cause a very large logjam. 3. UUIDs offer one more benefit: they do not have to be created on your system. For instance, in our solution, our own application will be calling the Service and DAO objects from within our app, but there is also a Web service that uses the same objects. Other systems can pass in their chunk of information, using their own UUID, but we can also use it since it will be unique not only in our system, but in the world. UUIDs pretty much guarantee us uniqueness from any system. In this manner, all of the disparate systems have their own unique key (the UUID), but we are able to link them together without having to map our internal ID in a row to the UUID in order to speak with other systems regarding this chunk of data. The CreateUUID() CF function also saves us a trip to the db server. Every little bit counts (pun intended) when looking at large volume transaction systems. For us, storage is cheap and network bandwidth is a bigger concern. But that may differ in other scenarios, in which case your solution handles the tasks with ease.


Matt, I think your chicken-and-egg issue is due to the DB design. Instead of putting the address_id in the person table, you should have a separate table to join the two:

    person_addresses {
        person_id
        address_id
    }

That would actually more closely model the separation of objects, if that is your goal. It also allows a person to have more than one address (which is a very likely scenario). Using this method, you no longer need to worry about which gets created first--you simply create each one, presumably retrieving the new ids from each operation, then insert those new ids into the person_addresses table.
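
In other words (a sketch; it assumes the create() calls return the new ids):

    <cfset newPersonId = personDAO.create(person)>
    <cfset newAddressId = addressDAO.create(address)>
    <!--- link the two in the join table; creation order no longer matters --->
    <cfquery datasource="#dsn#">
        INSERT INTO person_addresses (person_id, address_id)
        VALUES (
            <cfqueryparam cfsqltype="CF_SQL_INTEGER" value="#newPersonId#">,
            <cfqueryparam cfsqltype="CF_SQL_INTEGER" value="#newAddressId#">
        )
    </cfquery>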


Paul: yes, UUID vs Sequence is not a one-size-fits-all discussion! Your point #1 (Sequence ids must be unique) would suggest table-level security on the Sequence table. Your point #2 -- you could have many DAOs, but yes, they'd all call the same stored procedure. Throw enough traffic at it and I guess anything bottlenecks. There must be stopgap solutions (e.g. 2 independent Sequences and sprocs -- one produces odd numbers, the other even ones... or do it with N greater than 2). Come to that, the Sequence could live on a different box than the main db. Not arguing with you, just "thinking aloud". Your point #3 -- yes, I _love_ UUIDs for being GLOBALLY unique, and they can come from a whole other process. I hate that not all platforms I use support UUID / GUID as native types -- I'm sure casting them to 36-character strings and having to index them as text primary keys in, say, Oracle, incurs a wee performance hit that might add up at very high volume. Thanks for the reply.


Roland--thanks for the thoughts. The reason I'm not using a join table for addresses is that, in this current app anyway, people will always have only one address. I also have companies who have addresses, however (again, they'll only have one), so that's why I separated the addresses out. In other cases in this app I do have linking tables (person_skill for example), but not for the addresses. If there really isn't a need for a join table (in this case there might be, but in other cases not), I'm just not sure it's worth having the extra table, but it's worth considering I suppose. Good point about avoiding some problems with this though. In the interest of incorporating this and allowing for the possibility of multiple addresses for people and companies (total overkill for this app, but what the heck!) I'll probably rework this as well. Thanks!


Lars--thanks for the info on sequences. Yet another option to consider! And thanks to everyone for all the discussion on this point, it's really helping a lot of people I think. There's a lot of ins and outs to some of this stuff so bouncing these ideas around is really great.


I agree, Matt. Some great ideas from everyone. It's nice to see that people can share thoughts in a way that we all learn how other people approach problems. Lars's comments got me thinking down another road...but I will test some things out and we will have another discussion on another day! Good Luck!

Saturday, January 15, 2005

The Importance of Load Testing

I recently finished up the first phase of a complete rebuild of my company's intranet. A big part of the new application is an integrated search that hits not only the intranet pages (which are largely managed using Macromedia Contribute), but also hits our knowledge management tool, which is a CF 5 app that uses Oracle as well as Verity collections as its datasources.


Because the knowledge management (KM) app contains some sensitive information and that team was reluctant to have us hit their datasources directly, our first notion of how we'd integrate the two sides of the intranet was to have our CFMX app hit a CFML page on the CF5 side of things with a query string, then we'd get a WDDX packet as our response. Seemed to work great, but then we fired up Load Runner to do some load and performance testing, and this is when things got interesting.




Load Runner is a really, really great tool. If you haven't used it before, you can easily script virtual users by hitting a "record" button and clicking around on your web app. For things like forms (such as the search form in this case), you can fill it out once and submit it, then in the script on the Load Runner side you can parameterize your search by feeding it a text file. In our case we fed it a text file containing a bunch of different search terms (70 to be exact) and told it to randomly run searches based on those terms. This can help you pretty realistically simulate real users. You can also have it record what they call "think time" as you're building your script, meaning the time you're sitting on a page just looking at it, and then tell it to randomly use values based on a percentage range, again to simulate real users more accurately. Very cool stuff.


In our case we basically had three user types: searchers, who just repeatedly searched, clickers, who just navigated around all the content pages, and a few who did some clicking and then some searching. You then use the Load Runner controller to tell it which virtual users you want to use, how many of each you want to use, give it ramp up, duration, and ramp down times for the test, and fire it off. You can monitor practically everything you can think of during the test, including live stats on the server you're hitting. (Load Runner should run on its own separate, dedicated server, not the same server you're trying to test.)


I won't bore you with all the gory details, but the long and short of the situation is that under a load of even 30 search users things got nasty. REALLY nasty. Page response times started getting into the 20-30 second range, and that's not just the search functionality. Whatever was going on seemed to negatively impact the content pages as well, many of which are cached on the server.


So that's the bad news. Our architecture that works fine with a very small number of users apparently just isn't going to scale. The CFML pages held up extremely well even with 100 users or more really pounding away, but the search functionality was pretty ugly.


The good news is that we figured this out before we launched it to the users. I can't imagine the stress we would have been under had we launched it first and THEN figured out it wouldn't scale. Better to know all of this now than figure it out when my phone's ringing off the hook later.


The other great thing is that between the Load Runner reports, the server stats, and the JRun logs (we're running CFMX on JRun), we at least can figure out where the bottlenecks are. For further testing, we did the following:




  • Used WDDX as well as more "standard" XML data locally instead of shipping it over the network


  • Created a test database in SQL Server (on a separate physical machine) containing much of the same data so we could hit a database instead of using WDDX over HTTP


  • Created local Verity collections containing much of the same data



From these tests and analysis of the Load Runner reports, at this point we've determined the following:




  • CPU utilization is always extremely high (averaging 95%) when using WDDX or more "standard" XML, either locally or over HTTP


  • Response times are always horrendously bad under load when using WDDX over HTTP


  • Response times are quite good when using WDDX or XML locally, but CPU usage still seems high


  • Response times and CPU utilization are both excellent when hitting the SQL Server database


  • Response times and CPU utilization are both excellent when hitting local Verity collections, but this is very slightly slower than hitting SQL Server (which honestly surprised me a bit)



We're still working through some of our options, which at this point are as follows (feel free to suggest more!):




  • Hit the Oracle database and Verity collections on the knowledge management side directly. This may or may not be possible depending on what the KM team will allow us to do.


  • Replicate all the data in another database as well as replicate the Verity collections. Upside: distributes load, would perform extremely well. Downside: multiple points of failure, added maintenance headache.


  • Ship XML data over to our server on a scheduled basis. Upside: we're hitting local data. Downside: CPU utilization when you're pounding away at XML data seems high.



That's where we are at this point--we're going to make a final determination for a path forward this week. I'm just sharing this because it's been an extremely educational process to go through, and points out the huge importance of load testing your apps.


If you don't have the bucks for Load Runner, consider using Microsoft's Web Application Stress Tool or Open STA to load test your apps before deployment. Believe me, you'll be glad you did!


Comments


Matt: I first hit your post as ammunition for my colleagues to institute some "best practices" (and a timely reminder for us too - cheers mate). After having a really good read, I gotta ask what exactly you were testing for. I ask because we may have a similar app architecture here and it could be a heads up for us on a different level. "CPU utilization is always extremely high (averaging 95%) when using WDDX or more 'standard' XML, either locally or over HTTP" - so this was due to using WDDX/XML's send/load (as a transport medium)? Or CF parsing of the WDDX/XML? (I take it the WDDX/XML data from the CF5 system was converted into something the CFMX side could use - not streamed directly to the browser and processed there, possibly with Javascript?) "Ship XML data over to our server on a scheduled basis." - so while WDDX/XML had its problems, provided it was on the local server it was still "viable" to do it this way? "Our architecture that works fine with a very small number of users apparently just isn't going to scale" - we're backed into a corner to use XML for what we have to do, and that's why I'm asking these specific questions. Getting back to the original intent of your post: if it were me, we'd already have LoadRunner built into our unit testing. We've busted a gut creating reusable components, and I'd like the unit testing to benchmark each as they're made. As I added to the link to your post when I sent it to all the team this morning: "IMHO, making the code compile (work) is only half the story..." Cheers barry.b


Hi Barry--I'm not sure I understand exactly what you mean with the "what exactly you were testing for" question, so if you can clarify that I can give you more information. At the outset we were just testing for basic performance under load, and needless to say we were a bit disappointed with the distributed portion of the application. This led to the further investigations that I've outlined.

Concerning CPU utilization, when I initially saw the high CPU figures I thought for sure it had to do with the CFHTTP calls we were making to the remote server and the overhead involved with that. It turns out that this was definitely a reason for the slow response times (when we switched to local data the response times improved dramatically), but it didn't seem to affect CPU utilization, because when we use WDDX or XML data that's local to the machine, the CPU utilization is still really high. My big question at this point is why this is happening. The good news is that at 30 users or 100, the CPU utilization stays about the same, but I'm still not happy with it and I'm definitely afraid that during peak use the server would become non-responsive.

Oh yes--one thing I failed to mention. On a lark I tested this code on a CFMX standalone box since we're building this on CFMX on JRun. Exact same results, so I know it's not an issue with JRun (or if it is, it's a consistent issue with the full version of JRun and the one that runs underneath CFMX standalone). At this point all I can conclude is that XML parsing is CPU intensive, and if you have 100 people or more banging away at operations that involve XML parsing, you better have a pretty beefy server on which to run your app.

As for the difference between WDDX and XML, it seemed minimal. When we get the WDDX over HTTP, I just run the cfwddx tag to convert the WDDX data to a query object. I timed just that portion of the operation (the cfwddx tag) and even under load it's extremely quick. Just to make sure that wasn't the bottleneck, however, I tested things without doing that step and it didn't seem to make a discernible difference. This is what led me to trying "standard" XML as well. Again, not much difference either in data packet size (I was concerned that WDDX was just heavy by nature, but it's really not bad) or in CPU utilization.

As for using local XML data, I don't really see this as a viable option at this point, but since we're still working through all of this I'm just trying to list every possibility. Then at least our team can be satisfied that we've thought through literally every option. Unless we got a quad-processor box with tons of RAM (our dev box is a relatively old dual P-III 500 with 1.25GB of RAM) or start getting into clustering (which we may do anyway ...), I don't really see this working under load. For our targeted user base, if we had two killer servers behind a load balancer it would work OK, but at that point I see it as using hardware to solve an application architecture issue. If you're completely stuck using XML and there's no other way around it (which might not be the case in our situation), then that may be what you're facing. Unless I'm doing something horrendously wrong, XML operations in CFMX just seem pretty CPU intensive. Another test I'm going to do if I have time is re-write just this piece in Java to see if that works any better, but I don't suspect that it will.
I think in our case we're going to determine that we don't want to spend tons of money on hardware to solve the problem, and we'll likely end up having the knowledge management team create a read-only view in Oracle that they're happy with from a data access/security standpoint, and we'll either hit their Verity collections directly somehow or replicate those to our server once a day. Based on the testing I did hitting a database and Verity, this performed really well even on our relatively modest dev server. You're absolutely right about testing early and often. Since this is the first large distributed app I've really worked on (like most CFers I'm far more used to having direct access to a database), when I was testing on my own and seeing response times of 2-4 seconds for searches, I thought "a bit sluggish but it'll be fine." The Load Runner testing told a completely different story! In the end, testing for potential bottlenecks like this early in the development process is pretty crucial. We haven't announced this app to the company yet but they're chomping at the bit to do so, so we're a bit under the gun. If I had tested this piece of things a lot earlier we would have worked through the possible solutions a lot earlier as well. I'll definitely post what we wind up doing, and if anyone reading this has any further ideas or questions, I'm all ears! Matt


Cheers Matt: I should take this off the comments section and email you directly, but this touches on two things in your post. You've clarified that it's XML parsing - not transport - that's the root of the issue. So... 1) The actual problem (well, the XML part anyway): we're forced into using XML as a transport medium to drive a very complex UI. Javascript is used client-side, but your post has me worried about when it's time to parse the XML in CF. You're saying that XmlParse() and manipulation of XML (as arrays and structs) is the cause of the 95% CPU utilisation? Gee, that's a worry... For us we may be in luck - the server-side XML building and parsing is done as strings (the guy that wrote it is an old unix programmer so RegExp is his best friend - he's also slightly mad...). I suppose if it *can* be traced back to the code level, slotting in different parsing techniques might change the results. Just a thought... 2) The original intent of your post - the value of load testing: I've been banging my head against a wall here to get some best practices going; the present culture is "ship it and we'll refactor later" (especially for performance). My biggest worry is that the XML solution will be accepted but we won't be able to prove that it'll scale until it's too late - the technique will be all over the app and we won't be able to do a thing about it. It should have had its own feasibility study done, or at least incorporate load testing in the unit tests (which we've been begging for anyway). Bottom line? I want to be able to sleep at night... I'll be very curious how you get on. Good luck barry.b


Hi again Barry--as far as I can tell, it is indeed the XML parsing itself that's causing the high CPU utilization. Bear in mind, however, that in our case since this is a search application we're talking about quite a lot of XML data. I tested with a local XML document that was about 115K for each user (this represented about 100 or so search results), so when you take that times 50 or 100 users, you're getting into a lot of data being parsed. So what I'm getting at is that if you're dealing with a lot less data than that it may not be quite so much of an issue. Concerning manipulation, I've tried everything from a simple dump to doing a toString() to using XSLT (both read from a file and cached in RAM) to manipulate the XML once it's parsed. Nothing seemed to have much of an impact. My next avenue of investigation is going to be the XML parsing CF is doing under the hood. My understanding is that CFMX by default is using DOM (which I'm going to verify), which of course is a lot more RAM intensive than SAX. DOM reads the whole XML document into memory for random access capabilities, which I don't need in my case, whereas SAX is a top-to-bottom read, which would work for me. If I remember correctly from when I last messed with all this in Java, SAX is much faster than DOM so that could give me more avenues of investigation. This isn't ideal since I'm still dealing with a CF 5 app on the other side that's generating the data, but if using SAX spanks the pants off of DOM then it may be another possibility. (This gets at what you mention concerning alternative parsing strategies.) I completely hear you on the "I want to be able to sleep at night" comment! I think everyone thinks I'm a bit nuts at this point because I'm obsessing over this, but I'm not going to feel comfortable launching this to my entire company until this is resolved. I'd never really done load testing before, and I'd never had problems, but believe me after this experience I'm going to load test everything. I just hadn't had problems before because I was very lucky. I'll let you know what more I find out, and thanks for the dialogue--it's getting my brain going in other directions on some of this! Matt


Just a quick follow-up--we're retooling this to hit the Oracle database directly and we'll be using local Verity collections as well. I'm sure we'll get some radically different results from the next round of load testing! In the end we determined that in our situation that other setup just wasn't going to scale unless we threw a ton of money into more hardware, which wasn't necessary since we have other options. The original architecture looked great on paper and kept things really simple in terms of the integration between the systems, but alas, it just wasn't going to hold up under load. Live and learn! Matt


>> I think everyone thinks I'm a bit nuts at this point because I'm obsessing over this, but I'm not going to feel comfortable launching this to my entire company until this is resolved. well, who's been vindicated over the worth of these tests, eh? A timely reminder to us all, methinks. now if only I can get my people to learn from this... (sleep well, Matt) cheers barry.b


Matt: I've come across someone else who had performance issues with the Apache Crimson classes (v1.0) that CFMX uses, especially when using XPath queries, etc. Because he was on a shared server that had additional Java libraries installed for XML, he used those instead, with much better performance. They can be found at http://www.dom4j.org/ Hope this helps. Cheers barry.b http://barryb.coldfusionjournal.com/


Thanks Barry--very good to know! We're moving forward with the direct database solution but I'm sure this will come in very handy in the future.

Wednesday, January 5, 2005

What's an Object?

What started as a simple blog entry quickly became quite lengthy, so I'm posting what I hope will be the first in a series of articles introducing object-oriented programming to ColdFusion programmers, as a PDF.


Please feel free to comment, correct, scrutinize, and ask questions as needed!




whats_an_object.pdf
Download this file



Comments


Thanks for writing this piece and I look forward to any future ones!


Ben Forta blogged this. That's how I found it.


Excellent Matt! While this was not new to me, this is the first time I gained a real understanding of it. You have a great way of explaining. Really looking forward to the future "follow ups".


Thanks everyone--appreciate the feedback!


I was looking for an OO explanation in a CF way... and this is it... (and like Trond... I'm really looking forward to more.) Thank you


Hi Matt, Good show on the article! I am looking forward to seeing this blog in Flex calling CFCs through Flash Remoting. :)


Thanks Mike--getting that up and running is definitely on the roadmap! The code's done, just have to find the time to clean it up a bit and get Flex installed on this server. I'm putting it on Tomcat so I'm sure I'll have some tales to tell going through that experience ...


I too would like to express my congratulatory remarks on a well-written article. I am definitely looking forward to your next one, just don't wait too long to release it! :)


Great article, I look forward to more in the future. I have been trying to get my head around CFCs and OOP in general for some time. Your first article reinforced what I have picked up to date, and added some depth to what an Object is. What I look forward to, hopefully in a future article, is something along the lines of "OK, now you know what an object is; here is how you find them in the requirements for an application and get them to work together to accomplish said application's goals."


Thanks for all the nice feedback. Ian, what you're discussing, specifically OO design, is probably the hardest part of OOP. I'll definitely be addressing it, but a very, very firm foundation needs to be in place first, so it will come late in the series of articles I have planned. As a preview, the next article (no timing on it yet) will likely be along the lines of "OK, now I know what an object is ... how do I use one in the context of an application?" This will be really simple stuff but will help build a foundation so everyone can start seeing objects in their apps as Ian points out. It's fantastic to see so many CFers interested in this!


Nice job, Matt!


Great introductory article. I'm glad to see that you are not stuffing all your bean properties into the "this" scope like I've seen so many other people do.


Matt, Thanks for writing this article!


Great article Matt! I know I'm late to the table, but this has actually helped me see that I've been going about OOP all wrong. ;) Now I know how to start looking at it. I've always tied my DAOs and my BOs to my databases like there was some sort of magical mystical invisible binding between them (Procedural Programmer stuck in an OOP world). I now see how I should be looking at it and have found a new excitement for trying. :) Thanks again!