Last Comment Bug 470487 - Firefox Crash [@ nsWindow::GetParentWindow(int)] & [@ nsBaseWidget::Destroy()]?
: Firefox Crash [@ nsWindow::GetParentWindow(int)] & [@ nsBaseWidget::Destroy()]?
Status: VERIFIED FIXED
: [sg:critical?][crashkill][crashkill-fix]
: crash, regression, testcase-wanted, topcrash, verified1.9.2
Product: Core
Classification: Components
Component: Widget: Win32
: 1.9.1 Branch
: x86 Windows XP
: P2 critical (vote)
: mozilla1.9.2a1
Assigned To: Jim Mathies [:jimm]
: win32
:
: 488377 523186
: 503561 506108 530070
  Show dependency treegraph
 
Reported: 2008-12-19 14:41 PST by chris hofmann
Modified: 2010-07-15 19:11 PDT (History)
35 users (show)
benjamin: blocking1.9.2+
dveditz: blocking1.9.0.19-
samuel.sidler+old: wanted1.9.0.x+
See Also:
Crash Signature:
[@ nsWindow::GetParentWindow(int)] [@ nsBaseWidget::Destroy()]
  ---
  ---
  ---
  ---
  ---
  ---
  ---
  ---
  ---
  ---
  ---
  ---
  beta1-fixed
  .6+
  .6-fixed


Attachments
getparent / destroy variable cleanup v.1 (2.26 KB, patch)
2009-04-16 12:59 PDT, Jim Mathies [:jimm]
vladimir: review+
Details | Diff | Splinter Review
getparent / destroy variable cleanup v.2 (7.61 KB, patch)
2009-04-16 14:36 PDT, Jim Mathies [:jimm]
no flags Details | Diff | Splinter Review
getparent / destroy variable cleanup v.3 (9.97 KB, patch)
2009-04-17 10:10 PDT, Jim Mathies [:jimm]
no flags Details | Diff | Splinter Review
[m-c checkin: a7b2a881595a] getparent / destroy variable cleanup v.3 (10.01 KB, patch)
2009-04-17 13:14 PDT, Jim Mathies [:jimm]
vladimir: review+
mbeltzner: approval1.9.1+
Details | Diff | Splinter Review
Properly call Destroy() (44.43 KB, patch)
2009-05-07 16:24 PDT, Jeremy Lea
no flags Details | Diff | Splinter Review
[1.9.1 checkin: 46b703c76f3a] getparent / variable shadowing patch (9.99 KB, patch)
2009-05-19 16:17 PDT, Jim Mathies [:jimm]
no flags Details | Diff | Splinter Review
[m-c checkin: da42cf530fa7] teardown refactor v.1 (15.80 KB, patch)
2009-07-07 10:27 PDT, Jim Mathies [:jimm]
neil: review+
Details | Diff | Splinter Review
teardown refactor 1.9.1 v.1 (14.84 KB, patch)
2009-07-23 14:39 PDT, Jim Mathies [:jimm]
no flags Details | Diff | Splinter Review
teardown refactor 1.9.1 v.3 (15.69 KB, patch)
2009-08-19 11:51 PDT, Jim Mathies [:jimm]
samuel.sidler+old: approval1.9.1.6+
Details | Diff | Splinter Review
java refs (13.89 KB, text/plain)
2009-11-08 16:11 PST, Jim Mathies [:jimm]
no flags Details

Summon comment box

Description chris hofmann 2008-12-19 14:41:08 PST
Currently the #4 top crash in Firefox 3.1b2

0  	xul.dll  	nsWindow::GetParentWindow  	widget/src/windows/nsWindow.cpp:1544
1 	xul.dll 	nsWindow::GetParent 	widget/src/windows/nsWindow.cpp:1518
2 	xul.dll 	nsBaseWidget::Destroy 	widget/src/xpwidgets/nsBaseWidget.cpp:260
3 	xul.dll 	nsWindow::Destroy 	widget/src/windows/nsWindow.cpp:1416
4 	xul.dll 	nsPluginInstanceOwner::Destroy 	layout/generic/nsObjectFrame.cpp:4030
5 	xul.dll 	DoStopPlugin 	layout/generic/nsObjectFrame.cpp:1987
6 	xul.dll 	nsStopPluginRunnable::Run 	layout/generic/nsObjectFrame.cpp:2024
7 	xul.dll 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:510
8 	xul.dll 	nsBaseAppShell::Run 	widget/src/xpwidgets/nsBaseAppShell.cpp:170
9 	nspr4.dll 	PR_GetEnv 	
10 	firefox.exe 	wmain 	toolkit/xre/nsWindowsWMain.cpp:87
11 	firefox.exe 	firefox.exe@0x2197 	
12 	kernel32.dll 	kernel32.dll@0x17066

not many comments to go on.

*replying to message in yahoo mail at the time it crashed. Had just hit reply
Comment 1 chris hofmann 2008-12-19 14:50:27 PST
recent changes around that code...
http://hg.mozilla.org/mozilla-central/log/7cd58efd035a/widget/src/windows/nsWindow.cpp
Comment 2 chris hofmann 2008-12-19 15:40:51 PST
there is also another similar stack that goes though
 nsWindow::Destroy     widget/src/windows/nsWindow.cpp:1416

I wonder if this is a related crash, or another bug that needs to be spun off?

This stack signature "nsBaseWidget::Destroy()" is currently ranked #7

0  	xul.dll  	nsBaseWidget::Destroy  	widget/src/xpwidgets/nsBaseWidget.cpp:262
1 	xul.dll 	nsWindow::Destroy 	widget/src/windows/nsWindow.cpp:1416
2 	xul.dll 	nsPluginInstanceOwner::Destroy 	layout/generic/nsObjectFrame.cpp:4030
3 	xul.dll 	DoStopPlugin 	layout/generic/nsObjectFrame.cpp:1987
4 	xul.dll 	nsStopPluginRunnable::Run 	layout/generic/nsObjectFrame.cpp:2024
5 	xul.dll 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:510
6 	xul.dll 	nsBaseAppShell::Run 	widget/src/xpwidgets/nsBaseAppShell.cpp:170
7 	nspr4.dll 	PR_GetEnv 	
8 	firefox.exe 	wmain 	toolkit/xre/nsWindowsWMain.cpp:87
9 	firefox.exe 	firefox.exe@0x2197 	
10 	kernel32.dll 	kernel32.dll@0x17066
Comment 3 Masayuki Nakano (Mozilla Japan) 2008-12-19 22:56:11 PST
I didn't change the actual logic of nsWindow::GetParentWindow which its old name is nsWindow::GetParent.

The crash timing is:
  if (widget->mIsDestroying) {
this line.

So, looks like the parent widget is already destroyed.
Comment 4 Masayuki Nakano (Mozilla Japan) 2008-12-19 23:24:41 PST
The pointer to nsWindow is released from HWND by SubclassWindow(FALSE). It is called only in nsWindow::OnDestroy. nsWindow::Destroy and getting WM_CLOSE.

I guess that the parent window didn't get WM_CLOSE message but the instance was already released. Then, SubclassWindow(FALSE) is called via nsWindow::Destroy. But then, mWnd was already set to null, so, we fail to release the pointer to nsWindow from HWND.
Comment 5 Masayuki Nakano (Mozilla Japan) 2008-12-19 23:27:56 PST
(In reply to comment #4)
> The pointer to nsWindow is released from HWND by SubclassWindow(FALSE). It is
> called only in nsWindow::OnDestroy. nsWindow::Destroy and getting WM_CLOSE.

nsWindow::Destroy and getting WM_CLOSE are callers of nsWindow::OnDestroy.
Comment 6 Masayuki Nakano (Mozilla Japan) 2008-12-20 00:20:04 PST
It's strange... ::DestroyWindow automatically destroys its children. So, why can ::GetParent in nsWindow::GetParentWindow get the destroyed parent window after the parent nsWindow destroyed??
Comment 7 chris hofmann 2008-12-20 08:19:44 PST
Masayuki,   thanks for helping to investigate.  I've cc'ed some others that have some worked in the windows widget code in the past to see if they have ideas.
Comment 8 chris hofmann 2009-01-06 16:42:22 PST
any more ideas on this one?
Comment 9 Vladimir Vukicevic (:vlad) 2009-01-09 12:53:56 PST
This is actually more likely to be plugin related -- I know there were some bugs & changes about plugin destruction.  Cc'ing jst to see if this looks similar..
Comment 10 Vladimir Vukicevic (:vlad) 2009-01-20 16:04:12 PST
Can we figure out what plugins are loaded for people who are crashing?  Is it always yahoo-related?  We've had problems with the yahoo state plugin in the past.
Comment 11 Johnny Stenback (:jst, jst@mozilla.com) 2009-01-20 16:24:18 PST
Given that this looks windows specific it's even more odd as on windows we reparent the plugins widget to the root widget (the desktop) when we do plugin destruction off of an event, which seems to be the case here.

I looked through a dozen or so crash reports, and only one of them had the yahoo state plugin dll loaded (npYState.dll), so it's likely not specific to yahoo.

It's also interesting to note that out of the 6886 crashes found by my query, 6675 happened on December 1st of 2008. So how much of a top-crash this is atm is not clear to me.
Comment 12 chris hofmann 2009-01-20 17:09:36 PST
This query shows 851 crashes in the last week for users on beta2

 http://crash-stats.mozilla.com/report/list?product=Firefox&version=Firefox%3A3.1b2&query_search=signature&query_type=contains&query=&date=&range_value=1&range_unit=weeks&do_query=1&signature=nsWindow%3A%3AGetParentWindow%28%29

That ranks it as #7 top crash.

Glancing though a few of the module lists I've seen NPAskSBr.dll (the ask tool bar) show up in a couple of reports.
Comment 13 Robert O'Callahan (:roc) (Mozilla Corporation) 2009-01-28 14:28:27 PST
I think this should block.
Comment 14 Mike Beltzner [:beltzner] 2009-01-28 15:01:50 PST
Not blocking for now since it looks like we might have just seen a spike caused by other software/sites? If it persists in B3 as a topcrash please renom.
Comment 15 Samuel Sidler (old account; do not CC) 2009-03-03 08:22:00 PST
This ranks as the number 7 for Firefox 3.0.6. My guess is that it'll be highly present in b3 just as it was for b2.
Comment 16 Samuel Sidler (old account; do not CC) 2009-03-24 08:21:13 PDT
This is currently #6 in the list of top crashes for Firefxo 3.1 beta 3. Renoming per comment 14.
Comment 17 Vladimir Vukicevic (:vlad) 2009-03-31 10:37:24 PDT
I've stared at this code and I can't see anything obviously wrong; we need to find steps to reproduce.
Comment 18 Vladimir Vukicevic (:vlad) 2009-04-07 10:59:17 PDT
(In reply to comment #15)
> This ranks as the number 7 for Firefox 3.0.6. My guess is that it'll be highly
> present in b3 just as it was for b2.

Did this start in 3.0.6, or earlier?  A regression range would be quite handy.
Comment 19 Aakash Desai [:aakashd] 2009-04-07 11:17:19 PDT
I've spent half a day trying different scenarios concerning Yahoo Mail and the Yahoo Mail extension and haven't been able to reproduce it whatsoever. The comments on the crash reports aren't helpful for this at all either. Has anyone been able to see this once or twice on their profile? I'd like to get some deatils as to what extensions, sites and operations/actions were used.
Comment 20 chris hofmann 2009-04-07 11:22:38 PDT
maybe between 3.0.4 and 3.0.5?   searching the topcrasher lists for recent
3.0.x releases I see these in nsWindow code.

release  rank  stack
3.0.4    43       nsWindow::GetParent()
3.0.5    28       nsWindow::GetParentWindow()
3.0.4    17       nsWindow::GetParentWindow()
Comment 21 chris hofmann 2009-04-07 11:29:51 PDT
er, that should have been

maybe between 3.0.3 and 3.0.4?   searching the topcrasher lists for recent
3.0.x releases I see these in nsWindow code.

release  rank  stack
3.0.3    43       nsWindow::GetParent()
3.0.4    17       nsWindow::GetParentWindow()
3.0.5    28       nsWindow::GetParentWindow()
3.0.6    20  	  nsWindow::GetParentWindow()
3.0.7    19  	 nsWindow::GetParentWindow()
Comment 22 chris hofmann 2009-04-07 11:37:53 PDT
and for the 1.9.1 related releases
 
release  rank    stack
3.1a2   nothing
3.1b1     7  	 nsWindow::GetParentWindow()
3.1b2    15  	 nsWindow::GetParentWindow()
3.1b3     9  	 nsWindow::GetParentWindow()
3.5b4pre 18  	 nsWindow::GetParentWindow()
Comment 23 chris hofmann 2009-04-07 11:51:26 PDT
so maybe some time last sept. looks like when it started showing up.

v3.0.3, released September 26, 2008  -- v3.0.4, released November 12, 2008

3.1 alpha2 Sept 5, 2008  ---   v3.1 Beta 1, released October 14, 2008
Comment 24 chris hofmann 2009-04-07 12:16:23 PDT
at first glance the only thing I see landing around then that might be related is https://bugzilla.mozilla.org/show_bug.cgi?id=452385

any possibility of some bad interaction between the patches there and plugins?
Comment 25 Masatoshi Kimura [:emk] 2009-04-07 15:30:55 PDT
Bug 452385 renamed nsWindow::GetParent() to nsWindow::GetParentWindow().
So this bug had occured in 3.0.3.
Comment 26 chris hofmann 2009-04-07 16:43:18 PDT
> re: comment 25

ok, that's helping to make a bit more sense of the pattern seen in the releases.  I looked at the stack traces in 3.0.3 and they look basically the same as comment 0.

one thing looking at this bug shows is an increase in the ranking after 3.0.4, and from no visability to being ranked in 3.1 beta1.  the increase in visability in 3.1 beta1 may just be related to more users ramping up on 3.1 when it hit the first beta.   It does appear that that it takes a large user pool to hit this problem.

Is it possible the fixes that went in for bug 452385 reduced the chances of bug being surfaced in the bookmark UI case, but increased the chances in plugin window popups or some other windowing cases, or that some hangs for remaining edge cases in that bug have been turned into crashes?
Comment 27 Samuel Sidler (old account; do not CC) 2009-04-07 17:27:25 PDT
(In reply to comment #26)
> one thing looking at this bug shows is an increase in the ranking after 3.0.4,
> and from no visability to being ranked in 3.1 beta1.  the increase in
> visability in 3.1 beta1 may just be related to more users ramping up on 3.1
> when it hit the first beta.   It does appear that that it takes a large user
> pool to hit this problem.

This is likely related to the fact that crash-stats top crash reports don't "freeze", so they're still based on current data (last 2 weeks or whatever). There's likely less users on 3.0.3 than on 3.1b1 or something weird like that...
Comment 28 Jeremy Lea 2009-04-07 19:58:48 PDT
A random thought...  Can someone who can reproduce this bug (can anyone?) please try removing the mIsDestroying and mOnDestroyCalled variables from windows/nsWindow.h (they already exist in nsBaseWidget).  Duplicate variables are never a good idea.  They've been there for a long time, but maybe something is now tickling the classes into using different versions?
Comment 29 Vladimir Vukicevic (:vlad) 2009-04-07 20:31:30 PDT
Yeah, I don't think we have a way to reproduce it.. the fix you suggest is probably a good one in any case, though.
Comment 30 Vladimir Vukicevic (:vlad) 2009-04-14 10:18:31 PDT
This is still sitting on qawanted -- given that noone can reproduce this, it would be helpful for QA to:

- try to find a way to reproduce, by looking at crash info, perhaps by contacting users or some other type of outreach (blog post saying "hey, if you crashed, and you have an about:crashes log that looks like this, get a hold of us");
- get this in the debugger for a developer to look at;
- try to find a regression range.

Cc'ing Jim as well -- Jim, could you take a look at jeremy's suggestion on comment #28 and fix up those variables?
Comment 32 Henrik Skupin (:whimboo) 2009-04-15 05:14:29 PDT
Juan, you probably added the wrong url? This are crashes from Songbird and only two with comments.

Our ones aren't helpful either:
http://crash-stats.mozilla.com/report/list?product=Firefox&branch=1.9.1&version=Firefox%3A3.1b3&query_search=signature&query_type=exact&query=nsWindow%3A%3AGetParentWindow%28%29&date=&range_value=1&range_unit=weeks&do_query=1&signature=nsWindow%3A%3AGetParentWindow%28%29

We could file an IT ticket and ask for a couple of URLs where the crash happens.
Comment 33 juan becerra [:juanb] 2009-04-15 09:34:41 PDT
Henrik, I added the intended URL, because it had a couple of comments, as opposed to our stats site which, when I tried to access, would time out.
Comment 34 Jim Mathies [:jimm] 2009-04-15 09:53:15 PDT
(In reply to comment #32)
> Juan, you probably added the wrong url? This are crashes from Songbird and only
> two with comments.
> 
> Our ones aren't helpful either:
> http://crash-stats.mozilla.com/report/list?product=Firefox&branch=1.9.1&version=Firefox%3A3.1b3&query_search=signature&query_type=exact&query=nsWindow%3A%3AGetParentWindow%28%29&date=&range_value=1&range_unit=weeks&do_query=1&signature=nsWindow%3A%3AGetParentWindow%28%29
> 
> We could file an IT ticket and ask for a couple of URLs where the crash
> happens.

Clicking through a number of these, all seem to have some form of plugin loaded, but there doesn't appear to be a particular plugin that's causing it, so I'm guessing it's more generic.
Comment 35 Jeremy Lea 2009-04-15 11:15:21 PDT
OK, here's my debugging assessment.  I don't have time to work up a patch and test it now.  But basically, widget destruction on windows is completely broken - it's a wonder it doesn't crash more...  This is besides the duplicate member variable problem.

The sequence of preferred actions appears to be nsWindow::Destroy() which calls nsBaseWidget::Destroy(), nsWindow::OnDestroy() which calls nsBaseWidget::OnDestroy() and finally nsWindow::~nsWindow().  Lets look at these in order:

1381 NS_METHOD nsWindow::Destroy()
1382 {
1383   // Switch to the "main gui thread" if necessary... This method must
1384   // be executed on the "gui thread"...
1385   nsToolkit* toolkit = (nsToolkit *)mToolkit;
1386   if (toolkit != nsnull && !toolkit->IsGuiThread()) {
1387     MethodInfo info(this, nsWindow::DESTROY);
1388     toolkit->CallMethod(&info);
1389     return NS_ERROR_FAILURE;
1390   }

This is fine (the bogus failure code is always ignored).

1392   // disconnect from the parent
1393   if (!mIsDestroying) {
1394     nsBaseWidget::Destroy();
1395   }

This is wrong. This should test be:

if (mIsDestroying)  // Should be an atomic test and set...
  return NS_OK;

nsCOMPtr<nsIWidget> kungFuDeathGrip(this);
mIsDestroying = PR_TRUE;
nsBaseWidget::Destroy();

1397   // just to be safe. If we're going away and for some reason we're still
1398   // the rollup widget, rollup and turn off capture.
1399   if ( this == gRollupWidget ) {
1400     if ( gRollupListener )
1401       gRollupListener->Rollup(nsnull);
1402     CaptureRollupEvents(nsnull, PR_FALSE, PR_TRUE);
1403   }

This is OK.

1405   EnableDragDrop(PR_FALSE);
1406 
1407   // destroy the HWND
1408   if (mWnd) {
1409     // prevent the widget from causing additional events
1410     mEventCallback = nsnull;
1411 
1412     // if IME is disabled, restore it.
1413     if (mOldIMC) {
1414       mOldIMC = ::ImmAssociateContext(mWnd, mOldIMC);
1415       NS_ASSERTION(!mOldIMC, "Another IMC was associated");
1416     }
1417 
1418     HICON icon;
1419     icon = (HICON) ::SendMessageW(mWnd, WM_SETICON, (WPARAM)ICON_BIG, (LPARAM) 0);
1420     if (icon)
1421       ::DestroyIcon(icon);
1422 
1423     icon = (HICON) ::SendMessageW(mWnd, WM_SETICON, (WPARAM)ICON_SMALL, (LPARAM) 0);
1424     if (icon)
1425       ::DestroyIcon(icon);
1426 
1427 #ifdef MOZ_XUL
1428     if (eTransparencyTransparent == mTransparencyMode)
1429     {
1430       SetupTranslucentWindowMemoryBitmap(eTransparencyOpaque);
1431 
1432     }
1433 #endif

This is OK, until here.  We should add:

SubclassWindow(FALSE);

1434 
1435     VERIFY(::DestroyWindow(mWnd));
1436 
1437     mWnd = NULL;
1438     //our windows can be subclassed by
1439     //others and these nameless, faceless others
1440     //may not let us know about WM_DESTROY. so,
1441     //if OnDestroy() didn't get called, just call
1442     //it now. MMP
1443     if (PR_FALSE == mOnDestroyCalled)
1444       OnDestroy();
1445   }
1446 
1447   return NS_OK;
1448 }

This should be:

    VERIFY(::DestroyWindow(mWnd));
    mWnd = NULL;
  }

  //our windows can be subclassed by
  //others and these nameless, faceless others
  //may not let us know about WM_DESTROY. so,
  //if OnDestroy() didn't get called, just call
  //it now. MMP
  if (PR_FALSE == mOnDestroyCalled)
    OnDestroy();

  return NS_OK;
}

Looking at OnDestroy:

5759 void nsWindow::OnDestroy()
5760 {
5761   mOnDestroyCalled = PR_TRUE;

This should test and exit.

5763   SubclassWindow(FALSE);
5764 
5765   // We have to destroy the native drag target before we null out our
5766   // window pointer
5767   EnableDragDrop(PR_FALSE);
5768 
5769   mWnd = NULL;

This should test mIsDestroying and call Destroy() and exit.  This should be done before the test for mOnDestroyCalled.

5770 
5771   // free GDI objects
5772   if (mBrush) {
5773     VERIFY(::DeleteObject(mBrush));
5774     mBrush = NULL;
5775   }
5776 
5777 #if 0
5778   if (mPalette) {
5779     VERIFY(::DeleteObject(mPalette));
5780     mPalette = NULL;
5781   }
5782 #endif
5783 
5784   // if we were in the middle of deferred window positioning then
5785   // free the memory for the multiple-window position structure
5786   if (mDeferredPositioner) {
5787     VERIFY(::EndDeferWindowPos(mDeferredPositioner));
5788     mDeferredPositioner = NULL;
5789   }
5790 
5791   // release references to children, device context, toolkit, and app shell
5792   nsBaseWidget::OnDestroy();

nsBaseWidget::OnDestroy() does not destroy children (and there is not more app shell).  GTK widgets explicitly destroy their children in Destroy()... should windows be doing that too?  It seems cleaner...  Otherwise OK.

5793 
5794   // dispatch the event
5795   if (!mIsDestroying) {
5796     // dispatching of the event may cause the reference count to drop to 0
5797     // and result in this object being destroyed. To avoid that, add a reference
5798     // and then release it after dispatching the event
5799     AddRef();
5800     DispatchStandardEvent(NS_DESTROY);
5801     Release();
5802   }
5803 }

This is not needed.  Or otherwise should the ordering be OnDestroy(), Destroy()?  If it is needed, should this have a kungfuDeathGrip?

697 nsWindow::~nsWindow()
698 {
699   mIsDestroying = PR_TRUE;

This should test and call Destroy().  Or just call destroy unconditionally...

700   if (gCurrentWindow == this) {
701     gCurrentWindow = nsnull;
702   }
703 
704   MouseTrailer* mtrailer = nsToolkit::gMouseTrailer;
705   if (mtrailer) {
706     if (mtrailer->GetMouseTrailerWindow() == mWnd)
707       mtrailer->DestroyTimer();
708 
709     if (mtrailer->GetCaptureWindow() == mWnd)
710       mtrailer->SetCaptureWindow(nsnull);
711   }

These need mWnd, so should be moved to Destroy().

713   // If the widget was released without calling Destroy() then the native
714   // window still exists, and we need to destroy it
715   if (NULL != mWnd) {
716     Destroy();
717   }

Called above.

719   if (mCursor == -1) {
720     // A successfull SetCursor call will destroy the custom cursor, if it's ours
721     SetCursor(eCursor_standard);
722   }

Move to destroy?

724   sInstanceCount--;
725 
726 #ifdef NS_ENABLE_TSF
727   if (!sInstanceCount)
728     nsTextStore::Terminate();
729 #endif //NS_ENABLE_TSF
730 
731 #ifndef WINCE
732   //
733   // delete any of the IME structures that we allocated
734   //
735   if (sInstanceCount == 0) {
736     if (sIMECompUnicode) 
737       delete sIMECompUnicode;
738     if (sIMEAttributeArray) 
739       delete [] sIMEAttributeArray;
740     if (sIMECompClauseArray) 
741       delete [] sIMECompClauseArray;
742 
743     NS_IF_RELEASE(gCursorImgContainer);
744 
745     if (sIsOleInitialized) {
746       ::OleFlushClipboard();
747       ::OleUninitialize();
748       sIsOleInitialized = FALSE;
749     }
750   }
751 
752   NS_IF_RELEASE(mNativeDragTarget);
753 #endif
754 
755 }

These are OK.

Otherwise, once the order of Destroy() and OnDestroy() is fixed, all of the other tests for mIsDestroying and mOnDetroyCalled should be changed to whichever is first.

Finally, and what I think is actually the bug here.  SubClassWindow should addref and release the this pointer.

Hope this helps.
Comment 36 Jeremy Lea 2009-04-15 11:22:54 PDT
One more thing... The WindowProc does and conditional kungfuDeathGrip on the target, which assumes nsIsDestroying is only set in the destructor.  GTK has a nsIsDestroyed flag which is set in the destructor.  Maybe windows should have one of these too?
Comment 37 Jim Mathies [:jimm] 2009-04-15 16:07:48 PDT
Adding a bit of data - I was looking at the normal shutdown steps a windowed plugin takes when it's destroyed. Similar to Jeremy's break down:

Starting a shutdown for a Flash nsObjectFrame:

nsObjectFrame::StopPluginInternal
nsPluginInstanceOwner::PrepareToStop -> mWidget->SetParent(nsnull)
nsObjectFrame::StopPluginInternal done.

(time passes for a delayed plugin teardown which gets fired off as a runnable event)

nsObjectFrame's DoStopPlugin
nsPluginInstanceOwner::Destroy()
nsWindow::Destroy()
nsBaseWidget::Destroy()
nsWindow::GetParentWindow()

mWnd = 0x260afe | parent hwnd = 0x10010 | parent widget = 0x0

This is where this crash would occur.

A couple notes - we set the native parent window of the plugin to null way up in nsPluginInstanceOwner's PrepareToStop. This happens way before we actually destroy the window. This is the same parent hwnd we retrieve in GetParentWindow later on when we are destroying. During destruction, GetParentWindow's ::GetParent(mWnd) call actually returns the handle to the desktop instead, since GetParent returns the parent _or_ the owner, which in this case is the desktop. We then query that for a mozilla property through GetProp to get the widget, which fails and everything moves on smoothly...

nsWindow::GetParentWindow() done.
nsBaseWidget::Destroy() done.
nsWindow::OnDestroy()
nsWindow::SubclassWindow(0) 0x260afe
nsBaseWidget::OnDestroy()
nsBaseWidget::OnDestroy() done.
nsWindow::OnDestroy() done.
nsWindow::Destroy() done.
~nsWindow()
~nsWindow() done.

So that's a normal shutdown. This bug is rather odd in that in the crash case GetParentWindow retrieves a valid parent hwnd, retrieves a valid mozilla property for the widget pointer, and then tries to access mIsDestroying for the widget. But if shutdown is working correctly, parent should have been set to null and the desktop hwnd would have been returned. 

I still don't have a specific answer as to how this could happen, but I think it's something more than a slight timing issue in how we destroy. I'm going to look a bit more at the plugin code. I'm wondering if there are cases where we do not delay the unload of plugins like flash.

(Also we should probably be calling GetAncestor instead of GetParent in GetParentWindow so we can avoid checking for properties on the desktop window during shutdown.)
Comment 38 Jim Mathies [:jimm] 2009-04-15 16:37:11 PDT
Actually, looking at the stacks, all of these crashes occur on threads that are coming through nsStopPluginRunnable::Run, so the delayed vs. immediate destruction isn't the issue. These crashes are all on delayed tear down events.
Comment 39 Jim Mathies [:jimm] 2009-04-16 12:59:16 PDT
Created attachment 373186 [details] [review]
getparent / destroy variable cleanup v.1

I'm going to look at Jeremy's other cleanup suggestions but I'd like to start with something simple - cleanup the variable madness and make sure get parent is always working with a parent window handle.
Comment 40 Vladimir Vukicevic (:vlad) 2009-04-16 13:51:06 PDT
Comment on attachment 373186 [details] [review]
getparent / destroy variable cleanup v.1

ugh, field shadowing.. can you get rid of the other shadowed fields as well?  mIs{Shift,Control,Alt}Down are the ones I see at a quick glance
Comment 41 Jim Mathies [:jimm] 2009-04-16 14:08:41 PDT
Looks like it was this little block - beos and os2 also use most of these.

PRPackedBool      mIsShiftDown;
PRPackedBool      mIsControlDown;
PRPackedBool      mIsAltDown;
PRPackedBool      mIsDestroying;
PRPackedBool      mOnDestroyCalled;

new patch coming up...
Comment 42 Jim Mathies [:jimm] 2009-04-16 14:36:58 PDT
Created attachment 373201 [details] [review]
getparent / destroy variable cleanup v.2

Threw this at try. I'll land it on m-c once that clears.
Comment 43 Masatoshi Kimura [:emk] 2009-04-16 15:51:21 PDT
> -    HWND parent = ::GetParent(mWnd);
> +    HWND parent = ::GetAncestor(mWnd, GA_PARENT);
Actually we also need the owner here when GetTopLevelWindow() is called with aStopOnDialogOrPopup=PR_FALSE.
Comment 44 Jim Mathies [:jimm] 2009-04-16 16:21:40 PDT
(In reply to comment #43)
> > -    HWND parent = ::GetParent(mWnd);
> > +    HWND parent = ::GetAncestor(mWnd, GA_PARENT);
> Actually we also need the owner here when GetTopLevelWindow() is called with
> aStopOnDialogOrPopup=PR_FALSE.

Hmm, I'm not seeing why this is. GetTopLevelWindow would return the current window if it's top and I don't see how that would mess up caret position data in calls to ResolveIMECaretPos. I'll do some debugging though and see if I can spot what the potential problem might be. If we do need owner for the ime code, I think it might be best to differentiate between the two calls so that other calls to getparentwindow return the true parent.
Comment 45 Masatoshi Kimura [:emk] 2009-04-16 16:29:25 PDT
It will affect the popup window (e.g. "Organize Bookmarks" dialog). We want a main window pointer from children of the popup.
> I think it might be best to differentiate between the two calls so that other
> calls to getparentwindow return the true parent.
I agree.
Comment 46 Jim Mathies [:jimm] 2009-04-17 09:04:13 PDT
(In reply to comment #45)
> It will affect the popup window (e.g. "Organize Bookmarks" dialog). We want a
> main window pointer from children of the popup.
> > I think it might be best to differentiate between the two calls so that other
> > calls to getparentwindow return the true parent.
> I agree.

How will it effect windows like organize bookmarks? Is this an IME issue or something more general? I've been using a build of this for a day now and haven't seen any strange behavior during normal use.
Comment 47 Jim Mathies [:jimm] 2009-04-17 10:10:09 PDT
Created attachment 373343 [details] [review]
getparent / destroy variable cleanup v.3

Here's an update that chooses between the two so that calls to GetTopLevelWindow continue to return owner.
Comment 48 Jim Mathies [:jimm] 2009-04-17 10:53:39 PDT
Independent of these changes, this crash still seems pretty prevenlent in pre beta 4. Three crash signatures all seems to tie in with this -

From the last week:

nsWindow::GetParentWindow() (rank 14) -

http://crash-stats.mozilla.com/report/list?product=Firefox&version=Firefox%3A3.5b4pre&platform=windows&query_search=signature&query_type=exact&query=&date=&range_value=1&range_unit=weeks&do_query=1&signature=nsWindow%3A%3AGetParentWindow()

0  	xul.dll  	nsWindow::GetParentWindow  	widget/src/windows/nsWindow.cpp:1651
1 	xul.dll 	nsWindow::GetParent 	widget/src/windows/nsWindow.cpp:1625
2 	xul.dll 	nsBaseWidget::Destroy 	widget/src/xpwidgets/nsBaseWidget.cpp:259
3 	xul.dll 	nsWindow::Destroy 	widget/src/windows/nsWindow.cpp:1523 

nsBaseWidget::Destroy() (rank 39) -

http://crash-stats.mozilla.com/report/list?product=Firefox&version=Firefox%3A3.5b4pre&platform=windows&query_search=signature&query_type=exact&query=&date=&range_value=1&range_unit=weeks&do_query=1&signature=nsBaseWidget%3A%3ADestroy()

0  	xul.dll  	nsBaseWidget::Destroy  	widget/src/xpwidgets/nsBaseWidget.cpp:261
1 	xul.dll 	nsWindow::Destroy 	widget/src/windows/nsWindow.cpp:1523
2 	xul.dll 	nsPluginInstanceOwner::Destroy 	layout/generic/nsObjectFrame.cpp:4069
3 	xul.dll 	DoStopPlugin 	layout/generic/nsObjectFrame.cpp:1986 


254 NS_METHOD nsBaseWidget::Destroy()
255 {
256   // Just in case our parent is the only ref to us
257   nsCOMPtr<nsIWidget> kungFuDeathGrip(this);
258   // disconnect from the parent
259   nsIWidget *parent = GetParent();
260   if (parent) {
261     parent->RemoveChild(this);

nsWindow::Destroy() (rank 78) -

(Last one of these was the 12th)

http://crash-stats.mozilla.com/report/list?product=Firefox&version=Firefox%3A3.5b4pre&platform=windows&query_search=signature&query_type=exact&query=&date=&range_value=1&range_unit=weeks&do_query=1&signature=nsWindow%3A%3ADestroy()

0  	 	@0x740072  	
1 	xul.dll 	nsWindow::Destroy 	widget/src/windows/nsWindow.cpp:1489
2 	xul.dll 	nsPluginInstanceOwner::Destroy 	layout/generic/nsObjectFrame.cpp:4069
3 	xul.dll 	DoStopPlugin 	layout/generic/nsObjectFrame.cpp:1986
4 	xul.dll 	nsStopPluginRunnable::Run 	layout/generic/nsObjectFrame.cpp:2023
5 	xul.dll 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:510 


I was hoping to reproduce this from some extended use of a test build but still haven't seen it.
Comment 49 Masatoshi Kimura [:emk] 2009-04-17 11:43:25 PDT
(In reply to comment #46)
> How will it effect windows like organize bookmarks?
Sorry, I have confused a dialog name. It should read "Edit This Bookmark" dialog (i.e. a dialog which will appear when tha star icon is clicked).
> Is this an IME issue or
> something more general? I've been using a build of this for a day now and
> haven't seen any strange behavior during normal use.
IME issue. With cleanup v.2 patch and ChangJie, composition window was displayed at wrong position on "Edit Bookmarks" dialog.
Even worse, With cleanup v.3 patch, Firefox hanged when I tried to input on "Edit Bookmarks" dialog using ChangJie. It seems to regress bug 452385.
Without neither patches, there was no problem.
Comment 50 Jim Mathies [:jimm] 2009-04-17 11:48:30 PDT
(In reply to comment #49)
> (In reply to comment #46)
> > How will it effect windows like organize bookmarks?
> Sorry, I have confused a dialog name. It should read "Edit This Bookmark"
> dialog (i.e. a dialog which will appear when tha star icon is clicked).
> > Is this an IME issue or
> > something more general? I've been using a build of this for a day now and
> > haven't seen any strange behavior during normal use.
> IME issue. With cleanup v.2 patch and ChangJie, composition window was
> displayed at wrong position on "Edit Bookmarks" dialog.
> Even worse, With cleanup v.3 patch, Firefox hanged when I tried to input on
> "Edit Bookmarks" dialog using ChangJie. It seems to regress bug 452385.
> Without neither patches, there was no problem.

Thanks for testing. I guess I'll go back to using ::GetParent on the IME calls.
Comment 51 Jim Mathies [:jimm] 2009-04-17 13:14:07 PDT
Created attachment 373373 [details] [review]
[m-c checkin: a7b2a881595a] getparent / destroy variable cleanup v.3
Comment 52 Masatoshi Kimura [:emk] 2009-04-17 16:51:44 PDT
The new patch looks good.
I haven't find any apparent problems during my short, non-scientific testing.
Comment 53 Jim Mathies [:jimm] 2009-04-18 12:48:14 PDT
Comment on attachment 373373 [details] [review]
[m-c checkin: a7b2a881595a] getparent / destroy variable cleanup v.3

r? -> vlad since there have been some additional changes. This might fix the crash problem so we should consider getting it into 1.9.1 after it's sat on trunk for a few days.
Comment 54 Tomas 2009-05-03 15:54:33 PDT
And... ?
Comment 56 Vladimir Vukicevic (:vlad) 2009-05-05 11:02:02 PDT
So we generally know which thing is NULL; is there any data at that point that we could use to figure out what happened?  I'm thinking we should maybe add some logging; unfortunately breakpad doesn't support sending arbitrary logged data, but we could leave a file in some dir and ask people to mail it in if they see it.  (Should always append to it, etc.)
Comment 57 Vladimir Vukicevic (:vlad) 2009-05-05 11:13:13 PDT
Looks like this can actually be done: http://mxr.mozilla.org/mozilla-central/source/xpcom/system/nsICrashReporter.idl#101 (appendAppNotesToCrashReport) and it'll show up in breakpad.
Comment 58 Jeremy Lea 2009-05-07 16:24:12 PDT
Created attachment 376333 [details] [review]
Properly call Destroy()

I just pushed this patch to the try server...  It's a big hammer approach to this, which makes sure Destroy() is called correctly.  To do that, SetParent() also needs to manage the parent's child list, which most platforms where not doing.  My brief testing showed that the new asserts were not being called, and that reftests and mochitests looked good, except for a few, which appears to caused by a focus problem (the main window isn't refocused after a popup).  I'll see what the tryserver says.

I think this patch is to intrusive for 3.5.  But some things I did learn...  It appears that nsWindow::~nsWindow could be called multiple times...  That's not good.  I think what might be happening is that the code is actually stepping through deleted memory which gets overwritten at some point.
Comment 59 Karl Tomlinson, offline (:karlt) 2009-05-07 17:21:25 PDT
Jeremy, out of interest, do you have a test case where SetParent is called from nsViewManager::ReparentChildWidgets?
Comment 60 Jeremy Lea 2009-05-11 14:06:25 PDT
Karl...  No.  I've just been reading the code.  bug 129034 is the original implementation.  The test case there now asserts on something else (bug 485893 - it doesn't get to SetParent()).  It seems to have regressed from 3.0.10, but I haven't looked at it carefully.  If SetParent(nonnull) is not needed then these could all be changed to ResetParent(), like what Cocoa did recently, and the code greatly simplified.  The assert in nsViewManager::ReparentChildWidgets has never had a bug filed on it.  But this is getting off topic.
Comment 61 Jim Mathies [:jimm] 2009-05-11 15:20:37 PDT
A possible fix for 1.9.1 might be to move (or add) a call to SetParent(nsnull) just before we destroy the widget in nsPluginInstanceOwner's Destroy.  We clear it in PrepareToStop (just below Destroy in the code) a little bit before we actually tear down.

http://mxr.mozilla.org/mozilla-central/source/layout/generic/nsObjectFrame.cpp#4217
Comment 62 Jim Mathies [:jimm] 2009-05-11 15:40:31 PDT
(In reply to comment #61)
> A possible fix for 1.9.1 might be to move (or add) a call to SetParent(nsnull)
> just before we destroy the widget in nsPluginInstanceOwner's Destroy.  We clear
> it in PrepareToStop (just below Destroy in the code) a little bit before we
> actually tear down.
> 
> http://mxr.mozilla.org/mozilla-central/source/layout/generic/nsObjectFrame.cpp#4217

Hmm, we call GetParent in SetParent, so that's not going to work without changes to the window code.
Comment 63 Jim Mathies [:jimm] 2009-05-12 06:36:33 PDT
Summarizing what's known at this point - 

- The crash is sporadic, and hard to reproduce
- The crash occurs in delayed tear down of a plugin window
- We know that SetParent is called in PreparToStop before the tear down event is sent, and it seems that it must succeed in setting the parent widget pointer in the native window's property list to null. 
- When the event comes back, we go into what is a rather haphazzard process of unhooking and destroying the widget and window. During this process we retrieve the parent widget pointer via the property list.
- The parent widget pointer returned on a valid native window handle is not null and does not point to a valid parent widget which results in the crash.

I've tried to reproduce this, and looked at ways to do a one off for 1.9.1 but really am not coming up with much of anything except hacks. Vlad suggested logging but I didn't see anything there that would help. What has been suggested is a rework of our tear down methodology with a heavy handed patch in the hopes it will fix the problem. But nobody feels confident about applying that to 1.9.1 this late in the ball game.

I'm not entirely sure what direction to take, so I'd welcome any help / suggestions.
Comment 64 Samuel Sidler (old account; do not CC) 2009-05-12 10:17:58 PDT
(In reply to comment #63)
> What has been suggested is a rework of our tear down methodology with a heavy
> handed patch in the hopes it will fix the problem. But nobody feels confident
> about applying that to 1.9.1 this late in the ball game.

Can we start working on this now, land it for 1.9.2 as soon as it's ready and see if it fixes the problem there? If so, and if it bakes for a while, we might be able to take it in 1.9.1.x, despite its size/scope.
Comment 65 Jim Mathies [:jimm] 2009-05-12 10:51:11 PDT
(In reply to comment #58)
> Created an attachment (id=376333) [details]
> Properly call Destroy()
> 
> I just pushed this patch to the try server...  It's a big hammer approach to
> this, which makes sure Destroy() is called correctly.  To do that, SetParent()
> also needs to manage the parent's child list, which most platforms where not
> doing.  My brief testing showed that the new asserts were not being called, and
> that reftests and mochitests looked good, except for a few, which appears to
> caused by a focus problem (the main window isn't refocused after a popup). 
> I'll see what the tryserver says.
> 
> I think this patch is to intrusive for 3.5.  But some things I did learn...  It
> appears that nsWindow::~nsWindow could be called multiple times...  That's not
> good.  I think what might be happening is that the code is actually stepping
> through deleted memory which gets overwritten at some point.

Jeremy, how are you feeling about your destroy rework patch? Want to kick off some reviews?
Comment 66 Jim Mathies [:jimm] 2009-05-13 16:07:31 PDT
Spoke to Jeremy through email, his patch had some issues and didn't make it through try, so I'll start in on something myself.
Comment 67 Jeremy Lea 2009-05-14 14:01:54 PDT
I was going to attach an updated patch, but the more I little bugs I've fixed, the more creep in...

What I have become convinced of is that the windows cleverness of using the windows ::GetParent() call is the main problem, because it's ownership model and semantics are just not compatible with what nsIWidget wants.  I think that an alternative approach here is to add a nsRefPtr<nsWindow> mParent to nsWindow, and manage it in Create() [=baseParent], SetParent() and Destroy().  Then we can remove GetParentWindow(), since mParent would now be correct.  Also, many of the uses of ::GetParent() in nsWindow.cpp are bogus - they should already be using GetParentWindow().

Also, I think that Destroy() should be called on all widgets.  I think that the only needed fixes are in nsXULWindow (the last hunk of my patch), and then a loop over the children in Destroy(), like GTK already does.  I would also make the loop over the children use a nsCOMPtr so that they are only Released() and destructed when Destroy has returned, just to make life more predictable.

I'll merge most of the rest of my changes into the cleanup I'm doing in bug 431634.
Comment 68 Mike Beltzner [:beltzner] 2009-05-14 14:07:22 PDT
This is marked topcrash - do we know how much of a crasher this is under b4?
Comment 69 Damon Sicore (:damons) 2009-05-14 14:31:16 PDT
According to a discussion w/ jimm in IRC, we saw 4991 in a week on 3.0.0.11.  

In any case, after talking this over with Joe Drew and jimm in IRC, since we can't reproduce this crash, it's in 3.0, we don't have a regression range, *and* we're really down to the last 1-3 bugs that are going to actually hold the release, I think we're at the point where we should minus this.  Marking as such.  If people disagree and think we should hold the release for this bug, please re-nom and comment.
Comment 70 Jim Mathies [:jimm] 2009-05-14 14:34:03 PDT
(In reply to comment #68)
> This is marked topcrash - do we know how much of a crasher this is under b4?

http://crash-stats.mozilla.com/query/query?do_query=1&product=Firefox&version=Firefox%3A3.5b4&platform=windows&date=&range_value=1&range_unit=weeks&query_search=signature&query_type=contains&query=GetParentWindow

1597 for one week. 

If anyone has any ideas on how to pull up a regression range pre the 3.0 release I'd welcome the input. This was present in the first 3.0 release.
Comment 71 Samuel Sidler (old account; do not CC) 2009-05-14 15:39:29 PDT
(In reply to comment #69)
> If people disagree and think we should hold the release for this bug,
> please re-nom and comment.

I don't disagree, but I think we should block on this for 1.9.2 and consider backporting it to 1.9.1.x if a fix sticks and is relatively safe.
Comment 72 Henrik Skupin (:whimboo) 2009-05-14 15:43:58 PDT
Jim, it's hard without any way of reproducing this crash. All the crash stats have no comments.

Chris, could you check if we have a domain or url where this mostly happens? If you can't we should file an it bug to get the results. We still have the email addresses per crash report if users accept their transmission, right? Could we find some users for those the crash happens regularly?
Comment 73 chris hofmann 2009-05-14 17:09:37 PDT
looking at the url data it doesn't look to helpful in this case.  the distribution of sites and urls to these crash signatures is very wide.  

that matches up with the ideas in comment 67 suggesting these crashes are a result of bad interaction within the chrome and general window management; not with anything content related.

As for examples of this it looks like 760 different sites for just the 1500 some odd nsWindow::GetParentWindow() crashes I'm looking at in a weeks worth of data.   The list of top sites looks like:

#crash - domain

  78 www.youtube.com
  23 mail.google.com
  22 www.google.com
  21 www.facebook.com
   9 www.google.com.vn
   8 www.google.fr
   8 apps.facebook.com
   7 www.orkut.com.br
   7 win.mail.ru
   7 us.mg2.mail.yahoo.com
   6 www.orkut.co.in
   6 in.mg50.mail.yahoo.com
   5 us.mg1.mail.yahoo.com
   5 depositfiles.com
   5 addons.mozilla.org
   4 www.ynet.co.il
   4 www.mininova.org
   4 www.microsoft.com
   4 www.justin.tv
   4 www.google.ro
   4 www.google.it
   4 www.google.co.id
   4 www.gametrailers.com

one user might have even hit the crash on about:config , and others hit it on searches, startpage and just a variety of frequent and common browsing activities.

to drill down on the top of the youtube list it looks like:

Firefox 3.5b4   2       http://www.youtube.com/watch?v=daTTOyu-E1w&hd=1&feature=hd      nsWindow::GetParentWindow()
Firefox 3.5b4   2       http://www.youtube.com/watch?v=L5brwhalCG0&feature=PlayList&p=BBCB18EBEEF42EC2&index=0&playnext=1       nsWindow::GetParentWindow()
Firefox 3.5b4   2       http://www.youtube.com/watch?v=BsAtRKrbB8w      nsWindow::GetParentWindow()
Firefox 3.5b4   2       http://www.youtube.com/watch?v=wXUMZb7gqdg      nsWindow::GetParentWindow()
Firefox 3.5b4   1       http://www.youtube.com/watch?v=1DODjKLUMVU      nsWindow::GetParentWindow()
Firefox 3.5b4   1       http://www.youtube.com/watch?v=UYpxIaAI51M      nsWindow::GetParentWindow()
Firefox 3.5b4   1       http://www.youtube.com/watch?v=SJvtkCFaKaE      nsWindow::GetParentWindow()
Firefox 3.5b4   1       http://www.youtube.com/watch?v=PGuhX5AmjuA      nsWindow::GetParentWindow()
Firefox 3.5b4   1       http://www.youtube.com/watch?v=UuBm0dvIzcc&feature=fvst nsWindow::GetParentWindow()
Firefox 3.5b4   1       http://www.youtube.com/ nsWindow::GetParentWindow()
Firefox 3.5b4   1       http://www.youtube.com/watch?v=MYNEYp06EkI      nsWindow::GetParentWindow()
Firefox 3.5b4   1       http://www.youtube.com/watch?v=kMEFCRbbi0g      nsWindow::GetParentWindow()
Firefox 3.5b4   1       http://www.youtube.com/activesharing_history    nsWindow::GetParentWindow()
Firefox 3.5b4   1       http://www.youtube.com/watch?v=I8kcnYCHxTk      nsWindow::GetParentWindow()
Firefox 3.5b4   1       http://www.youtube.com/watch?v=eUm-zvGsgiE      nsWindow::GetParentWindow()


for nsBaseWidget::Destroy stack signature its basically the same thing.  Another 625 crash reports in a weeks worth of 3.5b4 data and 457 different domains listed in the urls

  40 www.youtube.com
  29 mail.google.com
  17 www.facebook.com
  11 www.google.com
  11 apps.facebook.com
   5 www.orkut.co.in
   5 www.google.com.vn
   5 hi5.com
   4 www.google.co.in
Comment 74 chris hofmann 2009-05-14 17:18:58 PDT
I do see a few interesting things that appear to be messing with/reading chrome.

I guess one check would be to look at a bunch of reports to try and figure out if there is a common or set of common addons installed like adblockplus, ietab, or speeddial.   There really are not very many of these kinds of reports, but the general question of "does chrome get messed up by an addon or cause timing problems not seen otherwise?" might be an interesting one to try and figure out.

release       # crashes     url at time of crash        -- stack signature

Firefox 3.5b4   1       chrome://adblockplus/content/tip_subscriptions.xul      nsWindow::GetParentWindow()
Firefox 3.5b4   1       chrome://ietab/content/reloaded.html?url=http://dl.xunlei.com/index.htm?tag=1   nsWindow::GetParentWindow()
Firefox 3.5b4   1       chrome://speeddial/content/speeddial.xul        nsWindow::GetParentWindow()
Firefox 3.5b4   1       chrome://wot/locale/loading.html#aHR0cDovL3d3dy50dnZhdWdoYW4uY29tLw%3D%3D       nsWindow::GetParentWindow()

Firefox 3.5b4   1       about:sessionrestore    nsBaseWidget::Destroy()
Firefox 3.5b4   1       chrome://ietab/content/reloaded.html?url=http://update.microsoft.com/*de-ch     nsBaseWidget::Destroy()
Firefox 3.5b4   1       chrome://speeddial/content/speeddial.xul        nsBaseWidget::Destroy()
Comment 75 Vladimir Vukicevic (:vlad) 2009-05-14 17:37:34 PDT
Because it generally seems to be related to plugin destruction, the actual URL may not even be relevant -- for example, it could be caused by destroying a plugin that was on the page you were visiting previously.  That would be my guess even, given the high number of google.com and speeddial crashes... then again, there's an about:sessionrestore crash, and that's kinda weird :)
Comment 76 Jim Mathies [:jimm] 2009-05-14 18:26:52 PDT
(In reply to comment #75)
> Because it generally seems to be related to plugin destruction, the actual URL
> may not even be relevant -- for example, it could be caused by destroying a
> plugin that was on the page you were visiting previously.  That would be my
> guess even, given the high number of google.com and speeddial crashes... then
> again, there's an about:sessionrestore crash, and that's kinda weird :)

Yeah the current url is not assured to be accurate. The dealyed tear down can take some time. 

There is an interesting little bit of code in nsStopPluginRunnable's Run that delays tear down due to nesting level of events - 

http://mxr.mozilla.org/mozilla-central/source/layout/generic/nsObjectFrame.cpp#1965

I've been looking at this recently because in my testing this code never gets used. A possible suspect maybe.
Comment 77 chris hofmann 2009-05-14 20:17:02 PDT
> there's an about:sessionrestore crash, and that's kinda weird :)

yeah, I've been looking at a bunch of cases where this helps as a semi-reliable way to find reproducable crashers.   e.g.

1) user crashes on a page
2) sesssion restore kicks in and user selects to load the pages from previous crashes
3) page(s) load and the crash happens again

this case pokes holes in that theory as being fully reliable.  I'm guessing the session restore isn't initiating any plugin page teardowns.

I'm getting time of crash, time since last crash, and uptime (time between startup and crash) and that will help to understand and connect these about:sessionrestore crashes to other events on a given users PC a bit better.
Comment 78 Jim Mathies [:jimm] 2009-05-18 08:51:08 PDT
I am getting some really crazy stuff going on, including what looked like this crash, by triggering the code in bug 389931. I'm also seeing a lot of slow script warnings, assertions in /mozilla/docshell/shistory/src/nsshentry.cpp's nsSHEntry::AddChild, assertions in the plugin code, hung instances, and plugins that hang around in memory and keep playing. 

Steps to reproduce:

1) open two windows
2) in the second window, browse to a page and hit ctrl-s
3) leave the modal file chooser open and switch back to the first window
4) browse around to sites with plugins like youtube & yahoo, use multiple tabs, close tabs, etc.
5) try switching back and forth between the two windows occasionally.

All kinds of interesting things happen.

Script: chrome://global/content/bindings/general.xml:0
Script: chrome://browser/content/browser.js:3552
Script: chrome://global/content/bindings/general.xml:0

&& 

WARNING: NS_ENSURE_TRUE(doc) failed: file f:/Mozilla/firefox/470487/mozilla/modules/plugin/base/src/nsNPAPIPlugin.cpp, line 1179
WARNING: NS_ENSURE_TRUE(cx) failed: file f:/Mozilla/firefox/470487/mozilla/modules/plugin/base/src/nsNPAPIPlugin.cpp, line 1192
WARNING: NS_ENSURE_TRUE(doc) failed: file f:/Mozilla/firefox/470487/mozilla/modules/plugin/base/src/nsNPAPIPlugin.cpp, line 1499

are common.

I can't say for sure that this is the cause of this crash. I did see a couple crashes in nswindow's destroy code, but it wasn't in the same place. It was however at points where nswindow was accessing member variables in the base widget.

I can address most of this (except the slow script warnings) by simply commenting out the timer code block in nsObjectFrame's nsStopPluginRunnable::Run(), so that plugins destroy when they are initially asked to. Maybe QA can lend a hand in getting some reproducible cases, so much is going on here I'm not sure where all of it is coming from.

This is all on a trunk build.
Comment 79 Jim Mathies [:jimm] 2009-05-18 08:54:29 PDT
The patch in bug 389931 looks to have landed in Fx 3.0, beta 3, so that would put it in the right time frame for when this crash started showing up.
Comment 80 Jim Mathies [:jimm] 2009-05-18 09:11:04 PDT
I've successfully reproduced a crash in beta 4 (no stack dump though) following the above steps. The browser re-opened both windows and reloaded the same content, which might explain some of the crash reports we've seen.
Comment 81 Jim Mathies [:jimm] 2009-05-18 10:09:06 PDT
Bug 420886 is where the delayed tear down code for plugins landed. Apparently it addressed a crash bug in realplayer post the landing of bug 389931.
Comment 82 Jim Mathies [:jimm] 2009-05-18 10:20:39 PDT
Bug 422926 may also be related.
Comment 83 Vladimir Vukicevic (:vlad) 2009-05-18 11:27:33 PDT
Given Jim's data, I'd like to see this go back on the 1.9.1 blocking list -- I think we've got a potential can of worms here.
Comment 84 Jim Mathies [:jimm] 2009-05-18 11:35:54 PDT
Filed bug 493578 on the slow script warnings, they are unrelated to this.
Comment 85 Jim Mathies [:jimm] 2009-05-18 12:04:51 PDT
(In reply to comment #81)
> Bug 420886 is where the delayed tear down code for plugins landed. Apparently
> it addressed a crash bug in realplayer post the landing of bug 389931.

A general observation - the code in bug 420886 seems to be compensating for a bug in real player. In nsPluginInstanceOwner's PrepareToStop, we hide the window, disable it, and then during destruction the realplayer window receives wm_destroy, at which point it should destroy any modal windows. Maybe the right thing to do here is to pull the patch in 420886, and get with real to work on a fix. We seem to be jumping through a lot of hoops and destablizing our own code in other ways to compensate for a an issue in realplayer.
Comment 86 Jim Mathies [:jimm] 2009-05-19 08:02:30 PDT
Moving the blocking request to bug 493601.
Comment 87 Mike Beltzner [:beltzner] 2009-05-19 13:10:04 PDT
Comment on attachment 373373 [details] [review]
[m-c checkin: a7b2a881595a] getparent / destroy variable cleanup v.3

a191=beltzner
Comment 88 chris hofmann 2009-05-19 13:31:23 PDT
re: comment 85
kev / jst/ josh, do we have any recent contacts at real that we could open up that discussion.
Comment 89 Kev [:kev] Needham 2009-05-19 13:34:54 PDT
I'll ping a couple people I know there and see if we can get the RP team engaged.
Comment 90 Jim Mathies [:jimm] 2009-05-19 16:17:00 PDT
Created attachment 378454 [details] [review]
[1.9.1 checkin: 46b703c76f3a] getparent / variable shadowing patch
Comment 91 Mike Beltzner [:beltzner] 2009-05-27 15:52:06 PDT
http://hg.mozilla.org/releases/mozilla-1.9.1/rev/46b703c76f3a
Comment 92 Marcia Knous [:marcia] 2009-07-01 14:00:46 PDT
http://tinyurl.com/nds6x8 shows the current crashes for the Firefox 3.5 release. With 3058 crashes it is currently the #4 topcrash, so it doesn't appear the patch fixed the issue for 1.9.1.
Comment 93 Jim Mathies [:jimm] 2009-07-01 14:51:11 PDT
Might be interesting to look at mozilla-central in a few days with bug 500925  landing today.
Comment 94 Jim Mathies [:jimm] 2009-07-01 14:59:36 PDT
(In reply to comment #93)
> Might be interesting to look at mozilla-central in a few days with bug 500925 
> landing today.

My guess is that won't help much as this is a widget tear down issue. In general I guess this is still a top issue and needs to be addressed sooner rather than later. Now that the nswindow refactoring is complete, I'll look at this again.
Comment 95 Samuel Sidler (old account; do not CC) 2009-07-06 08:27:13 PDT
Jim: Is it possible this is fixed in bug 493601? If so, it's likely that this will be fixed in 1.9.1 later.
Comment 96 Jim Mathies [:jimm] 2009-07-06 08:45:57 PDT
(In reply to comment #95)
> Jim: Is it possible this is fixed in bug 493601? If so, it's likely that this
> will be fixed in 1.9.1 later.

Doesn't look like it as it's still showing up on 3.6a1pre - 

http://crash-stats.mozilla.com/query/query?do_query=1&product=Firefox&version=Firefox%3A3.6a1pre&platform=windows&date=&range_value=1&range_unit=weeks&query_search=signature&query_type=contains&query=GetParentWindow
Comment 97 Jim Mathies [:jimm] 2009-07-06 09:30:37 PDT
Seems like we have more signatures now -

xul.dll  	nsWindow::GetParentWindow  	 widget/src/windows/nsWindow.cpp:1668
xul.dll 	nsWindow::GetParent 	widget/src/windows/nsWindow.cpp:1634
xul.dll 	nsBaseWidget::SetZIndex 	widget/src/xpwidgets/nsBaseWidget.cpp:385
xul.dll 	xul.dll@0x2a44f9 	
xul.dll 	nsIView::CreateWidget 	view/src/nsView.cpp:699
xul.dll 	nsMenuPopupFrame::CreateWidgetForView 	layout/xul/base/src/nsMenuPopupFrame.cpp:280
xul.dll 	xul.dll@0x320c70

--

xul.dll  	nsWindow::GetParentWindow  	 widget/src/windows/nsWindow.cpp:1668
xul.dll 	nsWindow::GetParent 	widget/src/windows/nsWindow.cpp:1634
xul.dll 	nsBaseWidget::SetZIndex 	widget/src/xpwidgets/nsBaseWidget.cpp:385
xul.dll 	xul.dll@0x2a44f9 	
xul.dll 	nsIView::CreateWidget 	view/src/nsView.cpp:699
xul.dll 	nsMenuPopupFrame::CreateWidgetForView 	layout/xul/base/src/nsMenuPopupFrame.cpp:280
xul.dll 	nsMenuPopupFrame::EnsureWidget 	layout/xul/base/src/nsMenuPopupFrame.cpp:230
xul.dll 	nsMenuPopupFrame::InitializePopup 	layout/xul/base/src/nsMenuPopupFrame.cpp:497
xul.dll 	nsXULPopupManager::ShowPopup 	layout/xul/base/src/nsXULPopupManager.cpp:454

The DoStopPlugin one though is still the most common. We might have two bugs here, one on Destroy and another on CreateWidget.
Comment 98 Jim Mathies [:jimm] 2009-07-06 09:38:23 PDT
The CreateWidget crash is also on trunk - 

http://crash-stats.mozilla.com/report/index/d4431975-aee4-44e4-bcac-f1f3c2090629
Comment 99 Jim Mathies [:jimm] 2009-07-07 10:27:19 PDT
Created attachment 387214 [details] [review]
[m-c checkin: da42cf530fa7] teardown refactor v.1

This is a refactor of the tear down code in nsWindow. I’d welcome comments. I’ve paired Destroy() down to simply calling DestroyWindow, and rely on the WM_DESTROY event handler to do the actual tear down. Within OnDestroy, I’ve reorganized the order things into what I would consider a more proper tear down order.

This may not fix the crash problem, but it should give us a better feel for exactly what’s going on and when. If the crash still occurs in the call to GetParent in  nsBaseWidget’s Destroy(), my guess would be that we could eliminate nsWindow code as being the problem and instead look at the view/plug-in code. The reason I say this is that the plugin code calls SetParent(nsnull) on the widget in nsPluginInstanceOwner’s PrepareToStop, which sets the native parent to null and also removes the window from the parent widget's child list. This (supposedly) happens way before Destroy is called on the window.  If a call to GetParent on the final Destroy() call returns a valid Mozilla window handle with a bad |this| pointer, nsWindow’s tear down wouldn’t be the problem AFAICT.
Comment 100 Jim Mathies [:jimm] 2009-07-08 13:22:08 PDT
Comment on attachment 387214 [details] [review]
[m-c checkin: da42cf530fa7] teardown refactor v.1

This passed try fine. I've also been running the try build now for a couple days without problem.  

Neil, would you be a good reviewer for something like this?
Comment 101 Aaron Train [:aaronmt] 2009-07-17 10:11:46 PDT
So, what's the status on this crash?

Currently it's the top crasher for 3.5.1 on Windows

http://crash-stats.mozilla.com/query/query?do_query=1&product=Firefox&version=Firefox%3A3.5.1&platform=windows&date=&range_value=1&range_unit=weeks&query_search=signature&query_type=exact&query=
Comment 104 Robert Sayre 2009-07-18 10:54:56 PDT
did this land on the branch?
Comment 105 Jim Mathies [:jimm] 2009-07-19 07:08:08 PDT
(In reply to comment #104)
> did this land on the branch?

We could do that if it looks like the patch actually addressed the issue. It would require a little back porting since we refactored nswindow on trunk. 

I'd like to let this sit on trunk for a while (maybe 2 or 3 weeks) to test to see if it actually fixed the problem. So far we're about 5 days out from the landing and there hasn't been an occurrence in nightlies - not enough time to confirm IMHO.
Comment 106 Jim Mathies [:jimm] 2009-07-19 07:50:51 PDT
For reference, the patch landed on July 13th - 

http://hg.mozilla.org/mozilla-central/rev/da42cf530fa7

Crash stats -

http://crash-stats.mozilla.com/report/list?product=Firefox&version=Firefox%3A3.6a1pre&platform=windows&query_search=signature&query_type=contains&query=GetParentWindow&date=&range_value=2&range_unit=weeks&do_query=1&signature=nsWindow%3A%3AGetParentWindow%28int%29

The last DoStopPlugin crash was in a build from the 12th. We had a crash in GetParentWindow from a build on the 13th, but that was in CreateWidget, which might be something else. Hard to say at this point if the problem is fixed.
Comment 107 Robert Sayre 2009-07-19 11:09:40 PDT
I see. you have two bugs in one here. that's why beltzner marked it fixed1.9.1
Comment 108 Jim Mathies [:jimm] 2009-07-22 10:01:04 PDT
on trunk!

I'll get a 1.9.1 patch put together.
Comment 109 Jim Mathies [:jimm] 2009-07-23 14:39:15 PDT
Created attachment 390333 [details] [review]
teardown refactor 1.9.1 v.1
Comment 110 Jim Mathies [:jimm] 2009-07-23 23:30:31 PDT
This is definitely fixed on trunk.

http://crash-stats.mozilla.com/report/list?product=Firefox&version=Firefox%3A3.6a1pre&platform=windows&query_search=signature&query_type=contains&query=GetParentWindow&date=&range_value=2&range_unit=weeks&do_query=1&signature=nsWindow%3A%3AGetParentWindow%28int%29

Posted a 1.9.1 patch to try, if all goes well after a little testing, we should be good for a 1.9.1.3 release. Bug 506108 is filed on the CreateWidget crash.

We could also back port this pretty easily to 3.0 if somebody thinks it's needed.
Comment 111 chris hofmann 2009-08-18 19:52:14 PDT
looking at crash reports from yesterday I still see this signature on 3.5.2

494 total crashes for nsWindow::GetParentWindow(int) on 20090817

distribution of versions where the crash was found on 20090817-crashdata.csv
 430 Firefox 3.5.2
  44 Firefox 3.5
  20 Firefox 3.5.1

but nothing for 3.0.x

does that match up with the places the patch has landed and/or needs to land?
Comment 112 Samuel Sidler (old account; do not CC) 2009-08-18 21:25:17 PDT
chofmann: This was initially filed with the signature of nsWindow::GetParentWindow() and changed to be nsWindow::GetParentWindow(int) with an additional nsBaseWidget::Destroy() signature. While the "int" version of the first signature does not show on 1.9.0, the non-int version is still present, ranked #31 with the nsBaseWidget signature ranked at #73.

The above is consistent with this patch having not landed on 1.9.1 or 1.9.0.
Comment 113 Henrik Skupin (:whimboo) 2009-08-19 00:35:32 PDT
Verified fixed on trunk based on crash stats.
Comment 114 Jim Mathies [:jimm] 2009-08-19 09:12:16 PDT
(In reply to comment #111)
> looking at crash reports from yesterday I still see this signature on 3.5.2
> 
> 494 total crashes for nsWindow::GetParentWindow(int) on 20090817
> 
> distribution of versions where the crash was found on 20090817-crashdata.csv
>  430 Firefox 3.5.2
>   44 Firefox 3.5
>   20 Firefox 3.5.1
> 
> but nothing for 3.0.x
> 
> does that match up with the places the patch has landed and/or needs to land?

That's correct. 3.6 is what you want to look at to see if it's fixed. A 3.5.x patch hasn't landed because the original 3.6 patch back ported to 3.5 caused leaks in component manager that I was never able to track down.
Comment 115 Mike Shaver (:shaver) 2009-08-19 09:14:17 PDT
What do you mean "never able to track down"?  Are you no longer working on this?  It's one of our major crashes, so if you need help with it please do ask for it.
Comment 116 Jim Mathies [:jimm] 2009-08-19 11:51:21 PDT
(In reply to comment #115)
> What do you mean "never able to track down"?  Are you no longer working on
> this?  It's one of our major crashes, so if you need help with it please do ask
> for it.

These are the leaks that I can reliably reproduce on the unit test box:

TEST-UNEXPECTED-FAIL | runtests-leaks | leaked 484 bytes during test execution
TEST-UNEXPECTED-FAIL | runtests-leaks | leaked 1 instance of nsComponentManagerImpl with size 276 bytes
TEST-UNEXPECTED-FAIL | runtests-leaks | leaked 2 instances of nsLocalFile with size 88 bytes each (176 bytes total)
TEST-UNEXPECTED-FAIL | runtests-leaks | leaked 3 instances of nsStringBuffer with size 8 bytes each (24 bytes total)
TEST-UNEXPECTED-FAIL | runtests-leaks | leaked 2 instances of nsTArray_base with size 4 bytes each (8 bytes total)

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1250700153.1250706746.27537.gz&fulltext=1#err7

Why a slight re-org of the window destruction code causes this is a mystery. I've poked around in component manager looking for an answer but haven't turned anything up. If anyone has any suggestions on how to better track this down I'd welcome the input. I'll post the current rev of the patch here in a sec.
Comment 117 Jim Mathies [:jimm] 2009-08-19 11:51:55 PDT
Created attachment 395363 [details] [review]
teardown refactor 1.9.1 v.3
Comment 118 Mike Beltzner [:beltzner] 2009-08-25 10:37:54 PDT
Mass change: adding fixed1.9.2 keyword

(This bug was identified as a mozilla1.9.2 blocker which was fixed before the mozilla-1.9.2 repository was branched (August 13th, 2009) as per this query: http://is.gd/2ydcb - if this bug is not actually fixed on mozilla1.9.2, please remove the keyword. Apologies for the bugspam)
Comment 119 Daniel Veditz 2009-10-02 11:49:42 PDT
Are the leaks that bad? If this fixes the topcrash it might be worth it.
Comment 120 Jim Mathies [:jimm] 2009-10-02 13:36:59 PDT
(In reply to comment #119)
> Are the leaks that bad? If this fixes the topcrash it might be worth it.

They caused leak check test failures on try.

FWIW, I can revisit this next week after the freeze.
Comment 121 Boris Zbarsky (:bz) 2009-10-02 18:23:16 PDT
So that leak output is basically leaking the component manager; everything else that's leaked is members of the component manager: mRegistryFile and mComponentsDir for the nsIFiles, mLoaderData and mPendingServices for the nsTArray_bases; I'd guess the strings are from mLoaderData (though not sure why it only has 3 entries).

Is it possible that this patch effectively changes the timing of things somehow so that we end up executing code after the component manager has been shut down (via nsComponentManagerImpl::Shutdown) and somehow reinstantiates the component manager?  Or something like that....
Comment 122 Jim Mathies [:jimm] 2009-10-05 16:17:02 PDT
If this is what people were looking at, it's a different bug:

http://crash-stats.mozilla.com/query/query?product=Firefox&version=Firefox%3A3.7a1pre&platform=windows&date=&range_value=3&range_unit=weeks&query_search=signature&query_type=contains&query=GetParentWindow&do_query=1#

0  	xul.dll  	nsWindow::GetParentWindow  	 widget/src/windows/nsWindow.cpp:1062
1 	xul.dll 	nsWindow::GetParent 	widget/src/windows/nsWindow.cpp:1024
2 	xul.dll 	GetWidgetOffset 	layout/base/nsLayoutUtils.cpp:833
3 	xul.dll 	nsLayoutUtils::TranslateWidgetToView 	layout/base/nsLayoutUtils.cpp:854
4 	xul.dll 	nsLayoutUtils::GetEventCoordinatesRelativeTo 	layout/base/nsLayoutUtils.cpp:674
5 	xul.dll 	PresShell::HandleEvent 	layout/base/nsPresShell.cpp:6188
6 	xul.dll 	nsViewManager::HandleEvent 	view/src/nsViewManager.cpp:1202
7 	xul.dll 	nsViewManager::DispatchEvent 	view/src/nsViewManager.cpp:1181
8 	xul.dll 	HandleEvent 	view/src/nsView.cpp:167
9 	xul.dll 	nsWindow::DispatchEvent 	widget/src/windows/nsWindow.cpp:2763
10 	xul.dll 	nsWindow::DispatchWindowEvent 	widget/src/windows/nsWindow.cpp:2786
11 	xul.dll 	nsWindow::DispatchMouseEvent 	widget/src/windows/nsWindow.cpp:3161
12 	xul.dll 	ChildWindow::DispatchMouseEvent
Comment 123 Henrik Skupin (:whimboo) 2009-10-09 10:49:45 PDT
As long as we don't have a testcase or steps how to reproduce this crash there will be no way to run a regression test. Clearing flag for now.
Comment 124 Samuel Sidler (old account; do not CC) 2009-10-14 16:26:07 PDT
We're going mark this as blocking 1.9.1.5. We'd like to get it in sooner rather than later, however, I'd like a second review on this. Technically security bugs need a superreview, but I'm explicitly asking for a second review (superreview or not) to get a better sanity check on the code since we haven't yet shipped this code in a release.

Jim or Vlad: Is there a Windows person that would be good for a second review here?
Comment 125 Jim Mathies [:jimm] 2009-10-14 16:42:34 PDT
(In reply to comment #124)
> We're going mark this as blocking 1.9.1.5. We'd like to get it in sooner rather
> than later, however, I'd like a second review on this. Technically security
> bugs need a superreview, but I'm explicitly asking for a second review
> (superreview or not) to get a better sanity check on the code since we haven't
> yet shipped this code in a release.
> 
> Jim or Vlad: Is there a Windows person that would be good for a second review
> here?

rob arnold, neil rashbrook, and doug turner come to mind. we still have the leaks to deal with though, so this isn't ready to be reviewed/land on 1.9.1.
Comment 126 Damon Sicore (:damons) 2009-10-14 16:51:47 PDT
Jim, can you fix those leaks?  Per our discussion earlier today, crashes are our top priority right now.
Comment 127 Jim Mathies [:jimm] 2009-10-14 18:04:40 PDT
(In reply to comment #126)
> Jim, can you fix those leaks?  Per our discussion earlier today, crashes are
> our top priority right now.

I can certainly try. Will look into what bz thought might be the cause (comment 121).
Comment 128 Jim Mathies [:jimm] 2009-10-15 17:05:57 PDT
(In reply to comment #127)
> (In reply to comment #126)
> > Jim, can you fix those leaks?  Per our discussion earlier today, crashes are
> > our top priority right now.
> 
> I can certainly try. Will look into what bz thought might be the cause (comment
> 121).

Things haven't changed much since the last time I worked on this. These leaks only show up on try server - reproducing them on an XP vmware image with both debug and release builds doesn't work.

I guess I'll try playing around with a winnt 5.2 unit test image next. If I can reproduce this on something other than try server where I have more control, I might be able to find something.

TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 484 bytes during test execution
TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 1 instance of nsComponentManagerImpl with size 276 bytes
TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 2 instances of nsLocalFile with size 88 bytes each (176 bytes total)
TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 3 instances of nsStringBuffer with size 8 bytes each (24 bytes total)
TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 2 instances of nsTArray_base with size 4 bytes each (8 bytes total)
TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 484 bytes during test execution
TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 1 instance of nsComponentManagerImpl with size 276 bytes
TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 2 instances of nsLocalFile with size 88 bytes each (176 bytes total)
TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 3 instances of nsStringBuffer with size 8 bytes each (24 bytes total)
TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 2 instances of nsTArray_base with size 4 bytes each (8 bytes total)

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1255636233.1255645227.4871.gz&fulltext=1#err12
Comment 129 Johnny Stenback (:jst, jst@mozilla.com) 2009-10-16 08:36:15 PDT
Reassigning to Jim since he's investigating here...
Comment 130 Jim Mathies [:jimm] 2009-10-23 10:37:22 PDT
After getting access to a try server slave, I found the crashtest leaks corresponded to two tests involving applets - 

layout/base/crashtests/265986-1.html
parser/htmlparser/tests/crashtests/328751-1.html

Disabling those two tests addresses the crashtest leak problem. 

Will try to track down the mochitest-plain leaks, they probably related to applets as well.

Another interesting note - disabling one test and leaving the other results in the same number of leaked bytes.
Comment 131 Jim Mathies [:jimm] 2009-10-26 12:21:09 PDT
The other tests for mochitest-plain all tie to some html validation tests that make use of a java applet:

http://mxr.mozilla.org/mozilla-central/source/dom/tests/mochitest/dom-level2-html/
http://mxr.mozilla.org/mozilla-central/source/dom/tests/mochitest/dom-level2-html/files/

test_HTMLBodyElement07.html \
test_HTMLBodyElement08.html \
test_HTMLBodyElement09.html \
test_HTMLBodyElement10.html \
test_HTMLBodyElement11.html \
test_HTMLBodyElement12.html \

test_HTMLDocument01.html \
test_HTMLDocument02.html \
test_HTMLDocument03.html \
test_HTMLDocument04.html \
test_HTMLDocument05.html \
test_HTMLDocument07.html \
test_HTMLDocument08.html \
test_HTMLDocument09.html \
test_HTMLDocument10.html \
test_HTMLDocument11.html \
test_HTMLDocument13.html \
test_HTMLDocument14.html \
test_HTMLDocument15.html \
test_HTMLDocument16.html \
test_HTMLDocument17.html \
test_HTMLDocument18.html \
test_HTMLDocument19.html \
test_HTMLDocument20.html \
test_HTMLDocument21.html \
test_HTMLDocument22.html \
test_HTMLDocument23.html \
test_HTMLDocument24.html \
test_HTMLDocument25.html \
test_HTMLDocument26.html \
test_HTMLDocument27.html \

I'm guessing we probably don't want to disable these on 1.9.1.
Comment 132 Damon Sicore (:damons) 2009-11-02 11:27:46 PST
What's next here?  Let's get this done.
Comment 133 Damon Sicore (:damons) 2009-11-02 11:28:13 PST
Ah, nevermind.  I see it now.
Comment 134 Jim Mathies [:jimm] 2009-11-02 11:42:33 PST
(In reply to comment #132)
> What's next here?  Let's get this done.

Next step would be to make a decision on how to handle these leaks. We can disable the tests (which allows this to pass try cleanly) or we can spend some additional time looking for a fix to the applet problem.
Comment 135 Samuel Sidler (old account; do not CC) 2009-11-04 17:36:57 PST
I don't think we should disable the tests, but I'm willing to live with the leaks. Can we change our expected leak value to accomodate these new leaks? We'd really like to get this in 1.9.1.6 which code freezes on November 10 at 11:59pm.
Comment 136 Jim Mathies [:jimm] 2009-11-04 17:55:24 PST
(In reply to comment #135)
> I don't think we should disable the tests, but I'm willing to live with the
> leaks. Can we change our expected leak value to accomodate these new leaks?
> We'd really like to get this in 1.9.1.6 which code freezes on November 10 at
> 11:59pm.

Who would make that call?  Any pointers on how we would change tolerated leak values? I can work up a patch and request approval on it.
Comment 137 Samuel Sidler (old account; do not CC) 2009-11-04 18:05:34 PST
I just made that call. :)

I don't know how to change the tolerate leak values, unfortunately. Hopefully someone else can help with that.
Comment 138 Robert Sayre 2009-11-04 18:07:44 PST
Just fix the leaks.
Comment 139 Mike Shaver (:shaver) 2009-11-04 18:14:32 PST
I also don't understand why we're so stumped by the leaks here -- what's the interaction that's changed?  What have we looked at or tried?  A lot of different tests leak, so it seems like we should be able to track the lifecycle of the relevant objects without too much hassle.  It's not like we allocate the component manager very often in the normal course of things!
Comment 140 Jim Mathies [:jimm] 2009-11-08 14:52:28 PST
Has anyone else tried reproducing the crashtest leaks? I still can't, and I've even recently upgraded my system. If not I guess I can allocate a try slave again and see if I can get a debug build going on it.
Comment 141 Jim Mathies [:jimm] 2009-11-08 14:56:23 PST
(In reply to comment #140)
> Has anyone else tried reproducing the crashtest leaks? I still can't, and I've
> even recently upgraded my system. If not I guess I can allocate a try slave
> again and see if I can get a debug build going on it.

Never mind! After installing the latest java run time, I'm able to reproduce them!
Comment 142 Jim Mathies [:jimm] 2009-11-08 16:11:09 PST
Created attachment 411102 [details]
java refs

This appears to be the root of the problem. I can't find corresponding releases for two AddRef's from the java plugin, and on shutdown the component manager asserts with a left over refcount of 2.
Comment 143 Jim Mathies [:jimm] 2009-11-09 08:55:50 PST
FWIW, I can reproduce this on 1.9.1, w/Java(TM) Platform SE 6 U17, without the tear down patch applied. Trunk builds don't exhibit the problem.
Comment 144 Samuel Sidler (old account; do not CC) 2009-11-10 17:40:11 PST
So... code freeze for 1.9.1.6 is tonight (6.5 hours). While I'm all for getting leaks fixed, this is a leak on shutdown when using Java. It's not a huge issue and it only happens on 1.9.1. The trade off of stability vs leaking on shutdown is pretty clear to me (stability wins!).

I don't think we should push off landing this patch longer than 1.9.1.6. The chances are 1.9.1.7 won't be until early February.

Jim, other than the leaks, is attachment 395363 [details] [review] ready for the 1.9.1 branch?
Comment 145 Johnny Stenback (:jst, jst@mozilla.com) 2009-11-10 18:13:38 PST
I fully agree with Sam here, let's take this, the leaks mean nothing compared to the stability gains we're hoping to get out of this fix!
Comment 146 Jim Mathies [:jimm] 2009-11-10 18:43:24 PST
(In reply to comment #144)
> So... code freeze for 1.9.1.6 is tonight (6.5 hours). While I'm all for getting
> leaks fixed, this is a leak on shutdown when using Java. It's not a huge issue
> and it only happens on 1.9.1. The trade off of stability vs leaking on shutdown
> is pretty clear to me (stability wins!).
> 
> I don't think we should push off landing this patch longer than 1.9.1.6. The
> chances are 1.9.1.7 won't be until early February.
> 
> Jim, other than the leaks, is attachment 395363 [details] [review] ready for the 1.9.1 branch?

I can now reproduce this sans my patch, so the java issue is independent, although my patch on 1.9.1 triggers it with try server slave config.

Let me see if the patch if fresh, will post back here in a sec.
Comment 147 Jim Mathies [:jimm] 2009-11-10 18:49:56 PST
This still applies cleanly.
Comment 148 Samuel Sidler (old account; do not CC) 2009-11-10 19:18:40 PST
Comment on attachment 395363 [details] [review]
teardown refactor 1.9.1 v.3

Approved for 1.9.1.6. a=ss
Comment 149 Jim Mathies [:jimm] 2009-11-10 19:35:41 PST
This landed tonight:
http://hg.mozilla.org/releases/mozilla-1.9.1/rev/3d0b0cfd9644

The java leaks were apparently known. We have leak threshold values related to java set on 1.9.1 that aren't currently set on try, which is why this problem manifested on test runs. (hat tip to nthomas for helping track this down.)

Related bugs: bug 471647, and bug 495533.
Comment 150 Al Billings [:abillings] 2009-11-19 17:33:22 PST
Henrik, you marked this verified fixed for 1.9.2. How did you verify this? I need to know since I need to verify this for 1.9.1 and I don't see a way to do so based on my reading of the comments.
Comment 151 Henrik Skupin (:whimboo) 2009-11-20 03:23:15 PST
I did that regarding our crash stats due to a missing testcase.

Jim, it looks like that a similar crash has been returned for 1.9.2. The first frames are identical. Shall we open a new bug on that? See bp-673ba0da-133f-488d-91ff-2d63b2091120

0  	xul.dll  	nsWindow::GetParentWindow  	 widget/src/windows/nsWindow.cpp:1063
1 	xul.dll 	nsWindow::GetParent 	widget/src/windows/nsWindow.cpp:1025
2 	xul.dll 	GetWidgetOffset 	layout/base/nsLayoutUtils.cpp:839
3 	xul.dll 	nsLayoutUtils::TranslateWidgetToView 	layout/base/nsLayoutUtils.cpp:860
4 	xul.dll 	nsLayoutUtils::GetEventCoordinatesRelativeTo 	layout/base/nsLayoutUtils.cpp:680
Comment 152 Jim Mathies [:jimm] 2009-11-20 06:06:01 PST
(In reply to comment #151)
> I did that regarding our crash stats due to a missing testcase.
> 
> Jim, it looks like that a similar crash has been returned for 1.9.2. The first
> frames are identical. Shall we open a new bug on that? See
> bp-673ba0da-133f-488d-91ff-2d63b2091120
> 
> 0      xul.dll      nsWindow::GetParentWindow      
> widget/src/windows/nsWindow.cpp:1063
> 1     xul.dll     nsWindow::GetParent     widget/src/windows/nsWindow.cpp:1025
> 2     xul.dll     GetWidgetOffset     layout/base/nsLayoutUtils.cpp:839
> 3     xul.dll     nsLayoutUtils::TranslateWidgetToView    
> layout/base/nsLayoutUtils.cpp:860
> 4     xul.dll     nsLayoutUtils::GetEventCoordinatesRelativeTo    
> layout/base/nsLayoutUtils.cpp:680

Definitely, that's a completely different stack. Looks like it has something to do with scroll.
Comment 153 Jim Mathies [:jimm] 2009-11-20 06:25:41 PST
There are a few common crash signatures here mixed in together. One is CreateWidget - 

http://crash-stats.mozilla.com/report/index/bcd92777-df9d-4f36-a7f5-8f0e02091120
http://crash-stats.mozilla.com/report/index/ebe43c86-d12d-44dd-affc-bf5ec2091120

That's likely bug 506108.

Another has to do with dispatching events -

http://crash-stats.mozilla.com/report/index/a85d727a-fcc7-46df-aab2-b5b652091120

We should split out a separate bug for the events crash.
Comment 154 Jim Mathies [:jimm] 2009-11-20 06:56:31 PST
Filed bug 530070.

How do we get bug numbers added to crash stats for specific stacks?
Comment 155 Johnny Stenback (:jst, jst@mozilla.com) 2009-11-20 08:36:04 PST
I think the only way we could do that would be to add enough of the top frames to the skiplist in the crash processor so that we can tell the two appart from just the signature of the crash report. In these two cases it seems we would need to add at least nsWindow::GetParentWindow and nsWindow::GetParent, so that the signatures for the two individual crashes would become "nsWindow::GetParentWindow | nsWindow::GetParent | nsBaseWidget::SetZIndex" vs "nsWindow::GetParentWindow | nsWindow::GetParent | GetWidgetOffset". I don't know whether that's a good idea in general, or if it's worth it just for this specific case.
Comment 156 Daniel Veditz 2010-01-24 20:49:55 PST
This appears to still happen on the 1.9.0 branch (where it's topcrash #32 or so). Would the patch here work for that branch? Do we need a new one? Can't really unhide the bug while it's still happening on that branch.

Does not appear to happen on the 1.8 branch, or at least doesn't appear anywhere in the top-500 crash reports. Tried searching to see if it happened at all but can't get a response from the talkback server.
Comment 157 Daniel Veditz 2010-01-25 10:17:53 PST
Would love to get this into 3.0.18 but blocking may not be realistic. If the patch applies please request approval ASAP, otherwise we'll block for the next 3.0.x release.

Note You need to log in before you can comment on or make changes to this bug.