5 Replies - Latest reply: Aug 1, 2012 7:37 AM by leeja

Bounded work queues?

leeja

We use database connections to drive many of our workflows, but lately we've had a convergence of bad luck where large amounts of data come in at once and trigger a big series of workflows.  We follow the model that our professional services team laid out for us, where a database monitor feeds a dispatcher workflow.  This dispatcher spawns a process for each incoming event and passes the event to the spawned workflow.

 

It would appear from our experience that BAO does not throttle incoming work or gracefully handle more incoming work than its resources allow for.  In particular, we are seeing OutOfMemoryException errors.  So I'm wondering if anyone has experience with implementing such throttling, and if so, where was it applied?  I have experimented with implementing queues in my database tables and then using stored procedures to ensure that no more than X processes are in "processing" state at a given time.  Although this works for the most part, it creates its own set of problems: the queue can become stuck when errors are encountered, and you need some sort of mechanism to purge or bypass work items that can't make forward progress for whatever reason.  It's a huge coding effort just to maintain the flow of work, and not where I want to spend my time.
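
For reference, the core of what I've been experimenting with looks roughly like the sketch below.  It's written in Python against sqlite3 purely to illustrate the idea; the work_queue table, its columns, and the cap are placeholders, and in our environment the same logic lives in a stored procedure against our real tables:

import sqlite3

MAX_IN_FLIGHT = 10  # placeholder cap on concurrent workflows


def claim_next_item(conn: sqlite3.Connection):
    """Claim one queued item, but only if fewer than MAX_IN_FLIGHT are processing."""
    with conn:  # commits on success, rolls back on error; a real stored procedure would also lock while counting
        in_flight = conn.execute(
            "SELECT COUNT(*) FROM work_queue WHERE status = 'processing'"
        ).fetchone()[0]
        if in_flight >= MAX_IN_FLIGHT:
            return None  # at capacity; the dispatcher just tries again later
        row = conn.execute(
            "SELECT id FROM work_queue WHERE status = 'queued' "
            "ORDER BY created_at LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # nothing waiting
        conn.execute(
            "UPDATE work_queue SET status = 'processing', "
            "claimed_at = CURRENT_TIMESTAMP WHERE id = ?",
            (row[0],),
        )
        return row[0]  # hand this id to the spawned workflow

The point is that the count check and the claim happen together, so two dispatcher runs shouldn't both be able to push the number of in-flight items past the cap.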

 

I briefly considered maintaining a global context variable to track the number of active processes as well, but was worried about losing updates to this variable.  There doesn't seem to be an "atomic increment", so a process that reads the current count and then sets it could lose updates when another process running at the same time is doing the same thing.  This in turn would lead to an inaccurate count of the number of in-progress work items.
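
A database counter, by contrast, can do the check and the increment in a single statement, which is part of why I leaned toward the table-based approach.  Something along these lines, again just a sketch with a placeholder one-row process_counter table and a placeholder cap:

import sqlite3

MAX_IN_FLIGHT = 10  # placeholder cap


def try_acquire_slot(conn: sqlite3.Connection) -> bool:
    """Atomically take one slot; the check and the increment are a single statement."""
    with conn:
        cur = conn.execute(
            "UPDATE process_counter SET active = active + 1 WHERE active < ?",
            (MAX_IN_FLIGHT,),
        )
        return cur.rowcount == 1  # True means we got a slot and may spawn the workflow


def release_slot(conn: sqlite3.Connection) -> None:
    """Called when the spawned workflow finishes (or fails)."""
    with conn:
        conn.execute("UPDATE process_counter SET active = active - 1 WHERE active > 0")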

 

Is there a better way to go about this?  Our current architecture consists of a dedicated Access Manager, a dedicated repository, and a dedicated CDP and HA-CDP.  Aside from implementing lots of throttling, is it possible to scale out our environment beyond just adding more memory to the CDP/HA-CDP?

  • 1. Re: Bounded work queues?
    Ranganath Samudrala

    Gordon McKeown has done exactly what you are looking for; hopefully he can answer.  In the meantime, a few questions:

     

    1. What is the OS type?

    2. What is the CPU speed?

    3. What is the memory available?

    4. 32bit or 64bit?

    5. What is the -Xmx value provided to AO?

    6. What is the version of AO?

     

    Ranga

  • 2. Re: Bounded work queues?
    leeja

    Our environment is composed of VMs, so we have some flexibility as far as memory and CPU go.  At present, our production CDP is as follows:

     

    1. Windows Server 2008 R2 Enterprise

    2. Intel Xeon E7330@2.40GHz

    3. 3.50 GB RAM

    4. 64-bit OS

    5. lax.nl.java.option.additional=-Xms1280m -Xmx2048m

    6. Product Version=7.6.02.02 - no=44 id=2011082302 c=316582

  • 3. Re: Bounded work queues?
    Ranganath Samudrala

    1. Overall memory on the machine may be too low for the load you are putting on the AO process - I suggest increasing it to 8 GB.

    2. Your -Xms and -Xmx should be increased to 3072m or 4096m.

    3. You should upgrade to the latest version, v7.6.02.05.

     

    When you encountered the OOM errors, do you know how many jobs were running concurrently in the system? You can find that by navigating to http://host:port/baocdp/console -> GRID -> Jobs.

     

    Ranga

  • 4. Re: Bounded work queues?
    Gordon McKeown

    The trick, I have found, is to do as little as possible in the triggered workflow. My experience with this was with ticket brokering, so the incoming requests were from Remedy or web services. I would immediately record the key fields from the event into a queue database table, exactly as you described, and then the event handler workflow would exit.
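
    In outline, the event handler does nothing more than something like this (a Python/sqlite3 sketch purely for illustration; the request_queue table and its columns are placeholders for whatever fields matter in your events):

    import sqlite3


    def handle_incoming_event(conn: sqlite3.Connection, event: dict) -> None:
        """The triggered workflow does the bare minimum: persist the event and exit."""
        with conn:
            conn.execute(
                "INSERT INTO request_queue "
                "(ticket_id, source, payload, status, retries, created_at) "
                "VALUES (?, ?, ?, 'queued', 0, CURRENT_TIMESTAMP)",
                (event["ticket_id"], event["source"], event["payload"]),
            )
        # No further processing here; scheduled workers drain the queue at their own pace.

    Everything else (claiming, processing, error handling) is done later by scheduled workflows that read from the table.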

     

    You are correct that this does introduce its own problems around handling stale status codes for requests. I used a timestamp for each queue entry, and a separate process that would sweep the queue for "stale" entries -- i.e. those over a certain age. Depending on the status, these would be marked as "error" or re-submitted.
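
    The sweeper itself can stay fairly simple.  Something along these lines, again a sketch with placeholder names; the retry limit and the age threshold are values you would tune for your own workloads:

    import sqlite3

    STALE_AFTER_MINUTES = 30  # placeholder age threshold for "stale"
    MAX_RETRIES = 3           # placeholder retry limit


    def sweep_stale_entries(conn: sqlite3.Connection) -> None:
        """Scheduled sweeper: fail or re-submit entries stuck past the age threshold."""
        cutoff = f"-{STALE_AFTER_MINUTES} minutes"
        with conn:
            # Entries that have already been retried too often get marked for manual review.
            conn.execute(
                "UPDATE request_queue SET status = 'error' "
                "WHERE status = 'processing' AND retries >= ? "
                "AND claimed_at < datetime('now', ?)",
                (MAX_RETRIES, cutoff),
            )
            # Everything else that went stale goes back into the queue for another attempt.
            conn.execute(
                "UPDATE request_queue SET status = 'queued', retries = retries + 1, "
                "claimed_at = NULL "
                "WHERE status = 'processing' AND claimed_at < datetime('now', ?)",
                (cutoff,),
            )

    Run it on a schedule and stuck entries either go back into the queue or surface as errors for someone to look at, instead of silently blocking the flow.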

     

    There are benefits to this approach, though. You reduce the risk of ever losing a request, and can track the status of each request. It also can make failure-handling easier in some cases. By running multiple scheduled processes to handle specific actions against the queue, you can even cope with a complete grid failure -- the processes will just pick up each request at the right step on restart.

     

    Potentially you could add a couple of APs to this grid to increase workflow-processing capacity. I will probably get myself into trouble with the R&D guys by saying that, because technically AO does not scale for performance by adding peers. For certain workloads, however, additional peers can increase processing capacity; only testing will tell whether this will work for you. After about 4 peers, you are likely to get better performance by scaling to multiple grids (on different peers, not multiple grids across the same peers).

     

    I would avoid using global context items if you can. They're awkward to initialise, and introduce additional overhead to the grid as it needs to constantly synchronise them between peers. Also you have the atomicity issue you mention.

  • 5. Re: Bounded work queues?
    leeja

    Thank you both, this is very helpful information.  And it looks like I was on the right track with throttling all work through a database queue.  I've never actually monitored the active jobs on our system, although I have been watching the number of active threads.  During low points of activity, our system is running around 200 to 300 active threads.  And during exceptional times we've seen it hit well over 1000 active threads.

     

    Most of my newer workflows already throttle their workloads, but that was mostly done out of consideration for the downstream systems that BAO interacts with.  It looks like I may have quite a bit of work ahead of me, looking at all incoming work streams and building queues around them.