I ran into a situation over the weekend, where an application and schema, which were stable under 10.2.0.3, started hitting ORA-00060 deadlocks in 220.127.116.11, in spite of the fact that no application code changes had occurred. It seems that 11g was more sensitive to deadlocks in this situation than 10gR2 was.
The situation developed this way. We are starting to work with 11g. We have a brand new application that is all new from the ground up, so, we thought we’d give 11g a try, as there is no legacy code base, and it seemed like a good opportunity to get our feet wet with 11g. Well, we have an RMAN based backup system, so, to be able to backup the new 11g development database, I need to upgrade the RMAN catalog and catalog database to 11g. When attempting to do so, I ran into an upgrade bug, which, without getting into the ugly details, meant that I needed to restore my catalog database from backup. Open SR with Oracle, after much going round and round with them, they determined that my best course of action would be to upgrade the 10g database to 11g, and then NOT upgrade the 10g catalog to 11g (which is where I hit the bug initially). Instead, leave the catalog owned by the ‘RMAN’ user at 10.2.0.3. Create a new user, ‘RMAN11G’, and create a brand new 11g catalog there. Now, I can continue to backup all my pre-11g databases to the RMAN user, and when I have an 11g database, I connect to the same rman catalog database, but as RMAN11g, rather than as RMAN. So, I did the upgrade on Friday night, and everything seemed to go well. I tried a few archive log backups, just to make sure everything seemed to be fine.
So, that’s a bit of the background.
On Saturday, I start getting pages, backups are failing. And they’re failing w/ ORA-00060 deadlock detected. Huh? I’ve never heard of this happening before. This makes no sense to me. A quick look at the backup log, and it appears to happen on catalog resync. So, I look at a sampling of the trace files, and they all seem to be deadlocking on TM enqueues with mode held ‘SX’ and mode waited on ‘SSX’. So, this is definitely due to foreign keys referencing unindexed columns. No doubt in my mind about that. So, I go over to Steve Adams’ website, and find the script that identifies all the columns used in foreign key relationships that are missing indexes, and I slightly modify that script to generate DDL to create the missing indexes. I run that, create all the missing indexes, and the deadlock problem goes away.
But, the mystery, in my mind, is why this happened at all. Note that the deadlocks were happening when a 10.2.0.3 database tried to backup archive logs running 10.2.0.3 version of RMAN connecting to a 10.2.0.3 catalog living in a 18.104.22.168 database. The same scenario worked fine for months and months, when the 10.2.0.3 catalog was in a 10.2.0.3 database. So, I upgraded the database, but not the catalog or the RMAN binary that was being used. That is, the “application” (RMAN binary and RMAN catalog) were not upgraded.
This would seem to imply that 11g is somehow more deadlock sensitive than 10gR2 is? That strikes me as troublesome….and definitely concerns me, if that’s really the case. I’m not sure I have enough information to prove that this is the case, but I’m definitely cautious and suspicious, at the moment….. It wouldn’t be the first time that an Oracle upgrade exacerbated a situation, rather than improve it….