11g is more deadlock sensitive than 10g?

I ran into a situation over the weekend, where an application and schema, which were stable under 10.2.0.3, started hitting ORA-00060 deadlocks in 11.1.0.6, in spite of the fact that no application code changes had occurred.  It seems that 11g was more sensitive to deadlocks in this situation than 10gR2 was.

The situation developed this way.  We are starting to work with 11g.  We have a brand new application that is all new from the ground up, so, we thought we’d give 11g a try, as there is no legacy code base, and it seemed like a good opportunity to get our feet wet with 11g.  Well, we have an RMAN based backup system, so, to be able to backup the new 11g development database, I need to upgrade the RMAN catalog and catalog database to 11g.  When attempting to do so, I ran into an upgrade bug, which, without getting into the ugly details, meant that I needed to restore my catalog database from backup.  Open SR with Oracle, after much going round and round with them, they determined that my best course of action would be to upgrade the 10g database to 11g, and then NOT upgrade the 10g catalog to 11g (which is where I hit the bug initially).  Instead, leave the catalog owned by the ‘RMAN’ user at 10.2.0.3.  Create a new user, ‘RMAN11G’, and create a brand new 11g catalog there.  Now, I can continue to backup all my pre-11g databases to the RMAN user, and when I have an 11g database, I connect to the same rman catalog database, but as RMAN11g, rather than as RMAN.  So, I did the upgrade on Friday night, and everything seemed to go well.  I tried a few archive log backups, just to make sure everything seemed to be fine.

So, that’s a bit of the background.

On Saturday, I start getting pages, backups are failing.  And they’re failing w/ ORA-00060 deadlock detected.  Huh?  I’ve never heard of this happening before.  This makes no sense to me.  A quick look at the backup log, and it appears to happen on catalog resync.  So, I look at a sampling of the trace files, and they all seem to be deadlocking on TM enqueues with mode held ‘SX’ and mode waited on ‘SSX’.  So, this is definitely due to foreign keys referencing unindexed columns.  No doubt in my mind about that.  So, I go over to Steve Adams’ website, and find the script that identifies all the columns used in foreign key relationships that are missing indexes, and I slightly modify that script to generate DDL to create the missing indexes.   I run that, create all the missing indexes, and the deadlock problem goes away.

But, the mystery, in my mind, is why this happened at all.  Note that the deadlocks were happening when a 10.2.0.3 database tried to backup archive logs running 10.2.0.3 version of RMAN connecting to a 10.2.0.3 catalog living in a 11.1.0.6 database.  The same scenario worked fine for months and months, when the 10.2.0.3 catalog was in a 10.2.0.3 database.  So, I upgraded the database, but not the catalog or the RMAN binary that was being used.  That is, the “application” (RMAN binary and RMAN catalog) were not upgraded.

This would seem to imply that 11g is somehow more deadlock sensitive than 10gR2 is?  That strikes me as troublesome….and definitely concerns me, if that’s really the case.  I’m not sure I have enough information to prove that this is the case, but I’m definitely cautious and suspicious, at the moment…..  It wouldn’t be the first time that an Oracle upgrade exacerbated a situation, rather than improve it….

12 comments on “11g is more deadlock sensitive than 10g?

  1. orainternals says:

    Yo Mark!
    Glad you are blogging. Did you get any deadlock graph, by any chance?
    Cheers
    Riyaj

  2. mbobak says:

    Riyaj,

    I have several trace files, all indicate the same type of deadlock graph, all holding TM enqueue in ‘SX’ mode, and waiting on ‘SSX’ mode. There are at least two sessions involved in each deadlock, these are not self-deadlocks. Also, adding the indexes that were missing has clearly solved the problem.

    -Mark

  3. dombrooks says:

    If I read correctly, the only thing that has changed is that database version has upgraded.

    The RMAN codebase and table structures are the same

    So, it sounds like the performance of some bit or bits of code that 10.2.0.3 RMAN runs has degraded in 11g compared to 10.2.03?

    In which case, something that you might expect from an upgrade – some code gets better, some stays the same, some gets worse. And as it’s internal oracle code, would that be something that Oracle would have tested as part of 11g development – 10.2.0.3 RMAN on Oracle 11g?

  4. grancher says:

    good evening Mark,

    thank you for starting this blog. Already in my Google Reader favorites.

    We have hit another issue with RMAN and 11.1.0.6 (see https://metalink.oracle.com/metalink/plsql/ml2_documents.showDocument?p_database_id=NOT&p_id=577600.1)
    We had to go back to 10.2 for the RMAN repository (all work done by my colleague).

    see you,
    eric

  5. mbobak says:

    Eric,

    Interesting, I was not aware of this bug. Can you clarify? It seems that this bug affects only 10.2 (or lower?) databases being backed up w/ 11g catalog? The problem does not occur on 11g database backed up using 11g catalog? If so, I think I’ll be ok, for the time being. So far, we are maintaining the old 10g catalog as well as the new 11g catalog, and only using the 11g catalog for backing up the 11g databases. The plan was to start migrating the 10.2 and lower databases over to the 11g catalog. But, after learning about this bug, I think we’ll be holdling off on that, at least till a patch is available.

    Thanks Eric!

    -Mark

  6. Eric Grancher says:

    hi Mark,

    yes, from what my collegue Chris has investigated, the issue is 10.2.0.3 rdbms backing up with 11.1.0.6 catalog (fine for 11.1.0.6 rdbms backing up with 11.1.0.6 catalog. So you are fine with the first step but indeed, it is advisable to wait and not do the second step for now.

    see you,
    eric

  7. dallasdeeds says:

    Hi Mark,

    Which bug did you hit when you ran the catalog upgrade? Did you log the deadlock issue as a bug?

    We are doing a lot 11g testing now and I am trying to document these issues in our upgrade documentation.

    Thanks!

    Dallas

  8. mbobak says:

    Hi Dallas,

    Sorry, apparently I’m useless. I can no longer query the SR, so, I think it’s been closed too long and aged out of MetaLink. As to the deadlock issue, no, I don’t think I filed a bug. I just created the missing indexes and moved on.

    -Mark

  9. damirvadas says:

    Hi!
    I have created deadlock solution which was working on 9i. After I read your article I have tried to re change to work on 11i. Anyone welcome to se it more close…

    http://damir-vadas.blogspot.com/2009/11/deadlock-deadlock-how-to-monitor.html

    Cheers!

  10. […] couple of months after I generated a test case like the one below, I read an interesting blog article by Mark Bobak, a fellow OakTable Network member and frequent contributor to the Oracle OTN forums […]

  11. dombrooks says:

    Mark – I’m trying to find a copy of your presentation “Understanding and Interpreting Deadlocks”? Is it still publicly available? It used to be downloadable from the oaktable site but I can’t find it since the redesign. Any chance you could publish it on this site?

Leave a comment