Transfer’s nifty cache causes real problems in a cluster: when data changes on one node, the other nodes are not notified. Over time, one server may report value A while another reports value B, and the only way to correct this is to turn caching off. Boo! The solution is TransferSync, a recent project by Tom de Manincor, based on concepts from Sean Corfield’s early work to synchronize the actions of the Transfer ORM among nodes in a cluster.
Getting Synchronized
There’s more to it than just dropping in the files, so I’ve written a brief HOWTO. Here are the steps I took to get it working on a 2-node CentOS 5.2 cluster for a Model-Glue/Coldspring/Transfer application, starting with my CF8 install and upgrade. The steps are roughly the same for Windows, minus some of the Unix-isms. I chose to run an instance of ActiveMQ JMS on each of my webservers instead of one central instance because it’s a lightweight daemon with few requirements. Perform these steps on each of the nodes in your cluster:
- Get the ActiveMQ 5.1.0 binary:
wget http://mirror.its.uidaho.edu/pub/apache/activemq/apache-activemq/5.1.0/apache-activemq-5.1.0-bin.tar.gz
- Extract it. I installed mine to /opt alongside my multi-server ColdFusion installation:
tar zxvf apache-activemq-5.1.0-bin.tar.gz
- If you have iptables running as a local firewall, you will need to poke some holes (config is in /etc/sysconfig/iptables):
# activemq clustering
-A RH-Firewall-1-INPUT -m state --state NEW -p tcp -s 192.168.2.0/255.255.255.0 --dport 61616 -j ACCEPT
-A RH-Firewall-1-INPUT -p udp -s 192.168.2.0/255.255.255.0 -d 239.255.2.3 --dport 6155 -j ACCEPT
- Make sure you can resolve your two nodes via DNS as ActiveMQ won’t be able to automagically connect to the other host if not:
nslookup node1.site.com; nslookup node2.site.com
- Start ActiveMQ and see it find the other nodes:
/opt/apache-activemq-5.1.0/bin/activemq &
- Make ActiveMQ automatically start on boot by adding this to your /etc/rc3.d/S99local file taking care to point at the proper JRE:
JAVA_HOME=/opt/jrun4/jre /opt/apache-activemq-5.1.0/bin/activemq &
Although I am using Java 1.6.0 u10 for ColdFusion, ActiveMQ 5.1.0 did not like it, so instead I pointed it at the default JRE that comes with ColdFusion 8.
- Copy the top-level JAR file into your JVM ext folder, for example:
cp /opt/apache-activemq-5.1.0/activemq-all-5.1.0.jar /opt/jrun4/jre/lib/ext
Caution: if you have pointed CF to another JRE like I did in the above install howto, the ext folder is no longer under the ColdFusion installation directory! Instead, it might be something like /usr/java/jdk1.6.0_10 or c:\program files\java\jdk1.6.0_10. If not done properly, this will result in an error like the following:
javax.naming.NoInitialContextException: Cannot instantiate class: org.apache.activemq.jndi.ActiveMQInitialContextFactory [Root exception is java.lang.ClassNotFoundException: org.apache.activemq.jndi.ActiveMQInitialContextFactory]
- Restart ColdFusion to load the ActiveMQ JAR.
- For Model-Glue applications, you need two folders out of Tom’s TransferSync distribution: gateway and model. Copy these into your application somewhere that you can reference in your Coldspring configuration. In my case, I split out my API from my Model-Glue application, so I put the TransferSync model folder under my API at /PUKKA_API_MAP/transfersync and the gateway folder in my webroot under the mapping /PUKKA_CORE_MAP/transfersync. The event gateway needs to be under the application whose Transfer instance you want to access, or else it can’t discard modified objects.
- Define the Gateway according to Tom’s documentation using the CF Administrator, choosing Event Gateways and adding a Gateway Instance, using your paths as appropriate:
ID = TransferSync
Type = ActiveMQ
CFC Path = /web/www/transfersync/gateway/TransferSync.cfc
Configuration Path = /web/www/transfersync/gateway/TransferSync.cfg
Startup Mode = Automatic
Make sure that your transfersync/model/definition directory is writable by the webserver. If you’re using SELinux, be sure the context is set properly.
- CRITICAL: Modify your TransferSync.cfc event gateway to configure how to access your Transfer instance in the method getTransfer(). If you fail to do this, you will get errors saying Transfer can’t be found. For a Model-Glue/Coldspring application with the CS beanFactory defined as application.cs, my code looks like:
<cffunction name="getTransfer" access="private" returntype="transfer.com.transfer" output="false">
<cfreturn application.cs.getBean('ormService').getTransfer() />
</cffunction>
- You should be able to start the gateway at this point. If it fails to start, look at your eventgateway.log file for clues. The most likely cause is the JAR not being loaded, resulting in a ClassNotFound exception:
"Information","Thread-29","09/04/08","15:26:18",,"Starting Gateway: ID=TransferSync, Class=examples.ActiveMQ.JMSGateway."
"Information","Thread-29","09/04/08","15:26:18",,"JMSConsumer.start() called"
"Information","Thread-29","09/04/08","15:26:18",,"JMSConsumer.start() initializing"
"Error","Thread-29","09/04/08","15:26:18",,"Failed to start gateway: Cannot instantiate class: org.activemq.jndi.ActiveMQInitialContextFactory"
"Error","Thread-29","09/04/08","15:26:19",,"Error starting gateway TransferSync: Cannot instantiate class: org.activemq.jndi.ActiveMQInitialContextFactory"
If everything went to plan, your ColdFusion startup log should show a successful init of the event gateway:
"Information","Thread-16","09/04/08","15:39:04",,"Starting Gateway: ID=TransferSync, Class=examples.ActiveMQ.JMSGateway."
"Information","Thread-16","09/04/08","15:39:04",,"JMSConsumer.start() called"
"Information","Thread-16","09/04/08","15:39:04",,"JMSConsumer.start() initializing"
"Information","Thread-16","09/04/08","15:39:04",,"JMSConsumer.start() starting connection"
"Information","Thread-16","09/04/08","15:39:07",,"JMSConsumer.start() done"
- Next, edit your coldspring.xml where you define your Transfer instance and add the following configuration to enable the synchronization:
<!-- provides cluster support via ActiveMQ for cache synchronization -->
<bean id="TransferKeyRetriever" class="PUKKA_API_MAP.transfersync.model.transfer.TransferKeyRetriever" lazy-init="false">
<constructor-arg name="transfer"><bean id="transfer" factory-bean="ormService" factory-method="getTransfer" /></constructor-arg>
<constructor-arg name="definitionPath"><value>/PUKKA_API_MAP/transfersync/model/definition</value></constructor-arg>
</bean>
<bean id="TransferSyncObserver" class="PUKKA_API_MAP.transfersync.model.transfer.TransferSyncObserver" lazy-init="false">
<constructor-arg name="transfer"><bean id="transfer" factory-bean="ormService" factory-method="getTransfer" /></constructor-arg>
<constructor-arg name="keyRetriever"><ref bean="TransferKeyRetriever" /></constructor-arg>
<constructor-arg name="gatewayName"><value>TransferSync</value></constructor-arg>
<constructor-arg name="JMSTopic"><value>dynamicTopics/transfer.cache</value></constructor-arg>
</bean>
Note that PUKKA_API_MAP and PUKKA_CORE_MAP are just two CF mappings. Technically they are placeholders that are replaced during deployment with Ant but that’s a separate topic: you need to be able to refer to the files in some manner. This new configuration goes alongside my existing Transfer configuration:
<!-- create an instance of Transfer -->
<alias alias="ormService" name="ormService.Transfer" />
<bean id="ormService.Transfer" class="transfer.TransferFactory">
<constructor-arg name="configuration"><ref bean="transferConfiguration" /></constructor-arg>
</bean>
<!-- datasource and ORM adapter -->
<bean id="datasource" factory-bean="ormService" factory-method="getDatasource" singleton="true" />
<!-- This is your application specific Transfer Configuration. Paths are from the webroot or mapping -->
<bean id="transferConfiguration" class="transfer.com.config.Configuration">
<constructor-arg name="datasourcePath"><value>/PUKKA_CORE_MAP/config/transfer/datasource.xml</value></constructor-arg>
<constructor-arg name="configPath"><value>/PUKKA_CORE_MAP/config/transfer/transfer.xml</value></constructor-arg>
<constructor-arg name="definitionPath"><value>/PUKKA_API_MAP/generated</value></constructor-arg>
</bean>
- At this point, you should be able to restart your Model-Glue application and have everything come up like normal. If you get an error saying the gateway couldn’t be started or the Gateway Instance TransferSync doesn’t exist, check your eventgateway.log file to debug and find what went wrong. If you modify an object managed by Transfer, you will see in your eventgateway.log a successful message like:
"Information","jrpp-0","09/04/08","15:43:56",,"JMSPublisher.sendMessage topic=dynamicTopics/transfer.cache; message={}"
"Information","ActiveMQ Session Task","09/04/08","15:43:56",,"Added message 'ID:brian_x61-1209-1220567944500-0:2:1:1:1/null' to event gateway queue."
Now your cache is synchronized!
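If you would rather not eyeball the log after every restart, the check can be scripted. The following is only a rough sketch: gateway_status is a hypothetical helper name, and it assumes TransferSync is the only JMS gateway writing to eventgateway.log, because the "JMSConsumer.start() done" line does not carry the gateway ID. It keys off the same log messages shown in the steps above.

```shell
#!/bin/sh
# Hypothetical helper (not part of TransferSync or ColdFusion).
# Reports the outcome of the most recent gateway start attempt recorded
# in an eventgateway.log file. Prints "started", "failed", or "unknown".
# Assumption: only one JMS gateway writes start messages to this log.
gateway_status() {
    _log_file="$1"
    # Keep only the last line that marks the end of a start attempt.
    _last=$(grep -E 'JMSConsumer\.start\(\) done|Failed to start gateway' "$_log_file" 2>/dev/null | tail -n 1)
    case "$_last" in
        *'JMSConsumer.start() done'*) echo "started" ;;
        *'Failed to start gateway'*)  echo "failed" ;;
        *)                            echo "unknown" ;;
    esac
}

# Example call (the path is a placeholder, adjust it for your install):
#   gateway_status /path/to/eventgateway.log
```

A result of "failed" most likely means the ActiveMQ JAR is not in the ext folder of the JRE that ColdFusion is actually using.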
In Production
I’ve had my application up and running now for a month using TransferSync and things are running well. As part of my migration to CF8, I did some load testing with TransferSync enabled and everything ran flawlessly under heavy load. Since the event gateway threads run asynchronously and separate from regular CF threads, I don’t think there is any noticeable performance impact either.
More important to me than outright performance is failover and redundancy for reliability. We need to be up 24×7 and going back to multiple redundant web servers accomplishes that goal while using TransferSync allows us to keep Transfer caching turned on for performance.
Hiccups
On three occasions so far, one of the ActiveMQ instances has hung, which causes the process trying to trigger the event gateway to hang as well. What’s dangerous about this is that it ties up regular CF threads, so the server very quickly becomes unresponsive as threads queue up waiting for the hung ones to finish. The underlying errors in eventgateway.log look like:
"Error","jrpp-11524","10/15/08","13:46:29",,"JMSPublisher.sendMessage failed: javax.jms.JMSException: java.io.EOFException"
"Error","jrpp-11524","10/15/08","13:46:29",,"Failed to send message with exception: javax.jms.JMSException: Channel was inactive for too long: localhost/127.0.0.1:61616"
Restarting ActiveMQ on the node solved the problem, but that’s not very helpful. I am trying a series of things now in production to squelch this problem, including coordinating with Tom about wrapping the sendGatewayMessage() call in TransferSyncObserver.cfc with a CFLOCK that would time out after a few seconds and generate an email. The risk is that your cluster could end up out of sync, but I think that, combined with an alert, is far preferable to the server locking up. I’ve also modified my ActiveMQ configuration on both nodes (/opt/apache-activemq-5.1.0/conf/activemq.xml) to change my transportConnector to disable inactivity checks:
<transportConnector name="openwire" uri="tcp://localhost:61616?wireFormat.maxInactivityDuration=0" discoveryUri="multicast://default"/>
The addition of the wireFormat.maxInactivityDuration=0 supposedly disables checking the channel for inactivity. I have had this in production now for about 12 hours and so far it’s fine, but the three previous failures occurred typically after more than a day of uptime. I’ll update here if I find this doesn’t resolve the problem.
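Until the hang is understood, it helps to at least notice it quickly. Here is a minimal sketch of a log check that could run from cron. count_jms_failures is a hypothetical helper name, the log path in the usage comment is a placeholder, and the MM/DD/YY date format is an assumption based on the log excerpts above. It counts matching log lines rather than incidents, and the excerpt above suggests each failed send writes two lines.

```shell
#!/bin/sh
# Hypothetical watchdog helper (not part of TransferSync or ActiveMQ).
# Counts the eventgateway.log lines for a given day that carry one of the
# two JMS failure messages seen during a hang.
#   $1 = path to eventgateway.log
#   $2 = day in the log's own format, e.g. 10/15/08 (assumed MM/DD/YY)
count_jms_failures() {
    _log="$1"
    _day="$2"
    # First keep only lines stamped with the requested day, then count
    # the ones that match either failure message.
    _n=$(grep -F "\"$_day\"" "$_log" 2>/dev/null \
        | grep -c -E 'JMSPublisher\.sendMessage failed|Channel was inactive for too long')
    echo "${_n:-0}"
}

# Example cron-style use (path is a placeholder; swap echo for your alert):
#   if [ "$(count_jms_failures /path/to/eventgateway.log "$(date +%m/%d/%y)")" -gt 0 ]; then
#       echo "TransferSync: JMS send failures logged today"
#   fi
```

This only detects the failure after the fact. It does not free the CF threads that are already blocked, so restarting ActiveMQ on the affected node is still needed.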
Update 10/30 – I’ve had one hang since making the above wireFormat change. It’s not clear if it’s really due to inactivity or not as in the logs I can see successful calls not long before. I may try some kind of scheduled task that periodically drops a known object from each side to act as a keepalive ping?
Update 1/26/09 – The wireFormat change did seem to improve things, but I still had occasional failures with ActiveMQ. I’ve updated to 5.2.0 and that has solved the problem for me. See more in my new post TransferSync stable on ActiveMQ 5.2.0.
Thanks to Tom for a great piece of software – this makes Transfer enterprise-ready by being able to scale out in a cluster.
brian said:
on October 28, 2008 at 9:43 am
A note about dev/staging servers. In our network we have a separate server for staging and we do NOT want this machine to be part of the ActiveMQ cluster. In short, changes made on the test server should not cause the production machines to discard their objects.
Assuming you don’t add the IPTABLES rules listed above, you will get error messages from the other nodes like:
INFO DiscoveryNetworkConnector - Establishing network connection between from vm://localhost to tcp://web1:61616
WARN DiscoveryNetworkConnector - Could not start network bridge between: vm://localhost and: tcp://web1:61616 due to: java.net.NoRouteToHostException: No route to host
INFO DemandForwardingBridge - localhost bridge to Unknown stopped
You can eliminate these by adding one simple rule on the box you wish to exclude:
-A OUTPUT -p udp -d 239.255.2.3 --dport 6155 -j DROP
This will stop the machine from sending out the multicast UDP packets that tell other machines “hey, connect to me!”
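Since this rule is easy to lose when a staging box gets rebuilt, a quick check of the saved rules file may be worthwhile. This is just a sketch: has_multicast_drop_rule is a hypothetical helper name, and it only inspects the saved configuration file (such as /etc/sysconfig/iptables mentioned earlier), not the rules currently loaded in the kernel.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) if the given iptables
# rules file contains an OUTPUT rule dropping traffic to the ActiveMQ
# multicast discovery address and port used in this post.
has_multicast_drop_rule() {
    grep -E -q -- '^-A OUTPUT .*-d 239\.255\.2\.3 .*--dport 6155 .*-j DROP' "$1"
}

# Example use on the staging box:
#   has_multicast_drop_rule /etc/sysconfig/iptables || echo "multicast DROP rule is missing"
```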