Last Comment Bug 639885 - Crash [@ memmove ] via mozilla::layers::ReadbackManagerD3D10::ProcessTasks
: Crash [@ memmove ] via mozilla::layers::ReadbackManagerD3D10::ProcessTasks
Status: RESOLVED FIXED
: [sg:moderate][fx4-rc-ridealong] requi...
: relnote
Product: Core
Classification: Components
Component: Graphics
: Trunk
: x86 Windows 7
: -- critical (vote)
: ---
Assigned To: Bas Schouten (:bas)
: thebes
:
:
:
  Show dependency treegraph
 
Reported: 2011-03-08 10:00 PST by Ted Mielczarek [:ted, :luser] (on vacation until 07/18/2011)
Modified: 2011-04-08 13:03 PDT (History)
20 users (show)
See Also:
Crash Signature:
  ---
  ---
  ---
  ---
  ---
  ---
  ---
  ---
  ---
  Macaw+
  .1-fixed
  ---
  ---
  ---
  ---


Attachments
Fix potential access-after-free (6.81 KB, patch)
2011-03-08 13:21 PST, Bas Schouten (:bas)
no flags Details | Diff | Splinter Review
Fix potential access-after-free v2 (6.64 KB, patch)
2011-03-08 14:23 PST, Bas Schouten (:bas)
jones.chris.g: review+
Details | Diff | Splinter Review
Abort in case of unsafe program state (for Gecko 2.0) (1.96 KB, patch)
2011-03-08 17:32 PST, Ehsan Akhgari [:ehsan]
jmuizelaar: review+
jones.chris.g: review+
bas.schouten: review+
Details | Diff | Splinter Review
Part 2: Stop using named objects [checkin: comment 58] (1.11 KB, patch)
2011-03-08 19:12 PST, Bas Schouten (:bas)
jmuizelaar: review+
dveditz: approval2.0+
Details | Diff | Splinter Review
Abort in case of unsafe program state (for Gecko 2.0) [check-in: comment 51] (1.99 KB, patch)
2011-03-08 21:37 PST, Ehsan Akhgari [:ehsan]
mbeltzner: review+
mbeltzner: approval2.0+
Details | Diff | Splinter Review

Summon comment box

Description Ted Mielczarek [:ted, :luser] (on vacation until 07/18/2011) 2011-03-08 10:00:18 PST
I hit this crash three times in the past two days:

 	mozcrt19.dll!memmove(unsigned char * dst=0x67e23fa0, unsigned char * src=0x67e23fa4, unsigned long count=4294967292)  Line 188	Asm
 	xul.dll!nsTArray_base<nsTArrayDefaultAllocator>::ShiftData(unsigned int start=0, unsigned int oldLen=0, unsigned int newLen=0, unsigned int elemSize=4)  Line 180 + 0xd bytes	C++
 	xul.dll!nsTArray<nsAutoPtr<mozilla::layers::ReadbackTask>,nsTArrayDefaultAllocator>::RemoveElementsAt(unsigned int start=1736935553, unsigned int count=1)  Line 836	C++
 	xul.dll!nsTArray<nsAutoPtr<mozilla::layers::ReadbackTask>,nsTArrayDefaultAllocator>::RemoveElementAt(unsigned int index=1736935646)  Line 841	C++
>	xul.dll!mozilla::layers::ReadbackManagerD3D10::ProcessTasks()  Line 232	C++
 	xul.dll!mozilla::layers::StartTaskThread(void * aManager=0x124f7730)  Line 135	C++
 	kernel32.dll!@BaseThreadInitThunk@12()  + 0x12 bytes	
 	ntdll.dll!___RtlUserThreadStart@8()  + 0x27 bytes	
 	ntdll.dll!__RtlUserThreadStart@8()  + 0x1b bytes	

The worst part is that before we crash, the memmove manages to overwrite a static member variable that Breakpad needs to function properly, so the exception handler crashes, so you just get a Windows crash dialog instead of our crash reporter.

All three times I managed to reproduce this I had another instance of Firefox running (with -no-remote). I hit it twice today under those circumstances. (The second time I attached a debugger and set a data breakpoint, which is how I determined that the member variable was being overwritten here.)
Comment 1 Andreas Gal :gal 2011-03-08 11:37:36 PST
Hidden until further analysis.
Comment 2 Robert Sayre 2011-03-08 11:42:08 PST
Need more data on whether we have a frequent crash. Leaving the nomination now.
Comment 3 Bas Schouten (:bas) 2011-03-08 11:44:59 PST
This crash is very strange, I've got a theory on what might cause something like this, but it should only occur on a device driver hang/reset or some 10-20 seconds after startup.

I will write a fix for that issue and hope it takes care of this. Sadly considering this shows a weakness in our crash reporter, we have no idea how often this happens in the wild.

I can see no way this can be exploited.
Comment 4 Benjamin Smedberg [:bsmedberg] 2011-03-08 11:46:30 PST
Given the codepath here, this is probably only related to windowless plugins on D3D10, which is most commonly testable on hulu and blip... most youtube videos are windowed.
Comment 5 chris hofmann 2011-03-08 11:59:59 PST
> Need more data on whether we have a frequent crash.

I don't get any matches for a stack or signature looking like this when trying to find this in reports in socorro, or searching the top ten lines of all the stacks in the top 300 for beta 12 or 3.6.14

http://people.mozilla.org/crash_stacks/stack-summary-4.0b12.txt
http://people.mozilla.org/crash_stacks/stack-summary-3.6.14.txt
Comment 6 Benjamin Smedberg [:bsmedberg] 2011-03-08 12:01:00 PST
Yes, that's because the bug here causes no crash reporter to come up.
Comment 7 Bas Schouten (:bas) 2011-03-08 13:21:38 PST
Created attachment 517848 [details] [review]
Fix potential access-after-free

This fixes the potential access-after-free bug here. We call terminate before releasing our last external reference inside the DeviceAttachments.

Terminate signals the task thread to stop after checking whether it has grabbed a reference to the ReadbackManager yet. That reference will be released when the task thread terminates.

This should fix the only cause I can think of for this crash, outside of someone else corrupting the readbackmanager.
Comment 8 Andreas Gal :gal 2011-03-08 13:37:27 PST
Is this read or write after free? Write after free is almost always exploitable in a multi-threaded system with user control over background threads.
Comment 9 Bas Schouten (:bas) 2011-03-08 14:08:07 PST
This is a write after free. But I don't see how anyone could affect what's being written to from JS or HTML or anything like that.
Comment 10 Chris Jones [:cjones] [:warhammer] 2011-03-08 14:08:56 PST
Comment on attachment 517848 [details] [review]
Fix potential access-after-free

This looks mostly OK, comments follow.

> struct DeviceAttachments
> {
>+  ~DeviceAttachments() {
>+    if (mReadbackManager) {
>+      mReadbackManager->Terminate();
>+    }

I would remove this ...

>+  mMayTerminate = ::CreateEventA(NULL, TRUE, FALSE, "MayTerminateEvent");

... rename this to mReadbackThreadHasRef ...

>+void
>+ReadbackManagerD3D10::Terminate()
>+{
>+  // Practically always this event should have been set by the time we reach
>+  // this, we need to do this to prevent the rare race condition where we'd
>+  // get terminated and released on the main thread before we take a reference
>+  // to ourselves.
>+  ::WaitForSingleObject(mMayTerminate, INFINITE);

... remove this ...

... and in ReadbackManagerD3D10::ReadbackManagerD3D10(), after
creating the readback thread, block on

  ::WaitForSingleObject(mReadbackThreadHasRef, INFINITE);

Your impl here looks OK, but I don't like distributing the invariant
that the readback thread has a reference around this code.  This will
mean that we block on startup of the readback thread, but I would hope
that's not too slow on windows.
Comment 11 Jeff Muizelaar [:jrmuizel] 2011-03-08 14:14:42 PST
It looks like this is:
  memmove(base, base+4, 4294967292);
where base is a heap address.

So move everything on the heap forward 4 bytes until we reach the end of the heap.
I'm sort of surprised that the static member variables for crash reporter are in memory contiguous with the heap.
Comment 12 Bas Schouten (:bas) 2011-03-08 14:23:48 PST
Created attachment 517877 [details] [review]
Fix potential access-after-free v2

Addressed cjones' comments as per IRC discussion.
Comment 13 Chris Jones [:cjones] [:warhammer] 2011-03-08 14:42:13 PST
Comment on attachment 517877 [details] [review]
Fix potential access-after-free v2

I still think the "mMayTerminate" name is ambiguous and confusing, but don't need to hold an sg on that.

>+void
>+ReadbackManagerD3D10::Terminate()
>+{

Assert main-thread here.
Comment 14 Mike Beltzner [:beltzner] 2011-03-08 15:22:56 PST
In order to make a judgement on blocking status, the following questions need to be answered:

 - is this a recent regression?
 - do we see any secondary indicators that the crash is common?
 - can it be trivially reproduced on Windows systems running common plugins on common websites?
Comment 15 JP Rosevear 2011-03-08 15:34:32 PST
(In reply to comment #14)
> In order to make a judgement on blocking status, the following questions need
> to be answered:
> 
>  - is this a recent regression?

Probably since this code landed only in b12.

>  - do we see any secondary indicators that the crash is common?

MS crash data will hopefully provide the answer.  Asa found nothing concrete in the input database (see release-drivers).

>  - can it be trivially reproduced on Windows systems running common plugins on
> common websites?

Not yet at least.
Comment 16 Jeff Muizelaar [:jrmuizel] 2011-03-08 16:20:20 PST
Bas, can you give a step-by-step walk through about your what happens with the current code, provided your theory about the problem is true?
Comment 17 Jeff Muizelaar [:jrmuizel] 2011-03-08 16:35:01 PST
Also, this patch seems to assume that DeviceAttachments will be the last thing to hold the reference. This stack seems to suggest otherwise:

 xul.dll!mozilla::layers::ReadbackManagerD3D10::~ReadbackManagerD3D10()  Line 148	C++
 xul.dll!mozilla::layers::ReadbackManagerD3D10::`scalar deleting destructor'()  + 0xf bytes	C++
 xul.dll!mozilla::layers::ReadbackManagerD3D10::Release()  Line 213 + 0x1c bytes	C++
 xul.dll!nsRefPtr<mozilla::layers::ReadbackManagerD3D10>::~nsRefPtr<mozilla::layers::ReadbackManagerD3D10>()  Line 970	C++
 xul.dll!mozilla::layers::LayerManagerD3D10::~LayerManagerD3D10()  Line 117 + 0x37 bytes	C++
 xul.dll!mozilla::layers::LayerManagerD3D10::`vector deleting destructor'()  + 0x4d bytes	C++
 xul.dll!mozilla::layers::LayerManager::Release()  Line 256 + 0x141 bytes	C++
 xul.dll!nsRefPtr<mozilla::layers::LayerManager>::assign_assuming_AddRef(mozilla::layers::LayerManager * newPtr=0x00000000)  Line 958	C++
 xul.dll!nsRefPtr<mozilla::layers::LayerManager>::assign_with_AddRef(mozilla::layers::LayerManager * rawPtr=0x00000000)  Line 942	C++
 xul.dll!nsRefPtr<mozilla::layers::LayerManager>::operator=(mozilla::layers::LayerManager * rhs=0x00000000)  Line 1026	C++
 xul.dll!nsWindow::Destroy()  Line 760	C++
 xul.dll!nsXULWindow::Destroy()  Line 556	C++
Comment 18 Jeff Muizelaar [:jrmuizel] 2011-03-08 16:45:33 PST
Ehsan and I have artificially reproduced the crash by changing the CreateSemaphore initial count to be one less than the maximum and visiting hulu.com
Comment 19 Jeff Muizelaar [:jrmuizel] 2011-03-08 16:49:50 PST
(In reply to comment #18)
> Ehsan and I have artificially reproduced the crash by changing the
> CreateSemaphore initial count to be one less than the maximum and visiting
> hulu.com

However, the reproduction may be a red herring...
Comment 20 Jeff Muizelaar [:jrmuizel] 2011-03-08 17:08:42 PST
Ted, did you notice a browser hang of about 5 seconds before the crash?
Comment 21 Ehsan Akhgari [:ehsan] 2011-03-08 17:32:54 PST
Created attachment 517941 [details] [review]
Abort in case of unsafe program state (for Gecko 2.0)

One possible failure scenario here is if the thread takes more than 5 seconds to shut down.  <http://mxr.mozilla.org/mozilla-central/source/gfx/layers/d3d10/ReadbackManagerD3D10.cpp#160>

In this case we end up with a running thread touching the ReadbackManagerD3D10 object.  Note that the destructor for nsTArray calls Clear on it, which sets mLength to 0, which might lead to the exact crash in comment 0.

I think it's best for us to change things so that we do a runtime abort if this happens.  This enables us to get crash data from our users who hit this, and might help us in debugging this bug further and fixing it in a future release.  This is what this patch does.

This code is really hard to reason about.  Jeff and I discussed this code and Bas patch for several hours today, and in the end neither of us seemed comfortable with taking Bas' patch here.  I highly recommend against taking that patch for Firefox 4, unless a safe fix which has been tested on the nightlies for some time can be backported to the 2.0 branch in a future dot-release.
Comment 22 Ehsan Akhgari [:ehsan] 2011-03-08 17:36:16 PST
(In reply to comment #21)
> One possible failure scenario here is if the thread takes more than 5 seconds
> to shut down. 

Possible examples of the cases where this might happen is having lots of windowless plugins on a page (pages with ads?), or an extremely busy system (possibly thrashing stuff to disk).  Basically either a large number of jobs to process, or the starvation of the background thread.
Comment 23 Ehsan Akhgari [:ehsan] 2011-03-08 17:43:39 PST
And let me also add that this code has a lot of other problems as well (such as using named objects, not doing enough error checking, using a semaphore as a means of tracking the number of objects in the queue, using CreateThread directly, etc).  I think those problems should be dealt with in other bugs though.
Comment 24 Ehsan Akhgari [:ehsan] 2011-03-08 17:55:19 PST
Note that if the scenario in comment 21 is correct, this might be exploitable, because we'd be accessing |this| after the object has gone away, and it would be possible to cause the memmove to overwrite a vtable, which means that we're owned.  Determining if this can really happen requires understanding the code way better than I do.
Comment 25 Ted Mielczarek [:ted, :luser] (on vacation until 07/18/2011) 2011-03-08 18:05:04 PST
(In reply to comment #20)
> Ted, did you notice a browser hang of about 5 seconds before the crash?

Can't say that I did, but I'm not a reliable observer, I don't think. This is on a Core i7 machine running Windows 7 x64, FWIW. The machine is not otherwise under heavy load, just browsing.
Comment 26 Bas Schouten (:bas) 2011-03-08 18:12:21 PST
I've setup a reporting structure for Ted that keeps reporting how many tasks are in the task queue. If it happens for him again, or maybe even if it doesn't, if we see a rising trend on the amount of tasks in the queue this could be responsible. It sounds very unlikely though. Even if this was the cause we'd simply start leaking tasks (the semaphore will refuse to increment while the queue grows).

It will be interesting debug information anyway, so let's hope it crashes and let's see what comes out of it.
Comment 27 Bas Schouten (:bas) 2011-03-08 18:17:37 PST
(In reply to comment #24)
> Note that if the scenario in comment 21 is correct, this might be exploitable,
> because we'd be accessing |this| after the object has gone away, and it would
> be possible to cause the memmove to overwrite a vtable, which means that we're
> owned.  Determining if this can really happen requires understanding the code
> way better than I do.

I agree with all this, I don't see a reasonable way for anyone to control what memory the memmove moves where though. But I don't see any attack vectors to inject anything into the vtable. Nor a way to get this vtable to be called after object destruction (the task function itself makes no virtual function calls, i.e. there's no vtable access after free).

So yes, memory can be moved into an area that once was a vtable, but not a vtable that's going to be used for anything.
Comment 28 Bas Schouten (:bas) 2011-03-08 18:22:42 PST
(In reply to comment #21)
> This code is really hard to reason about.  Jeff and I discussed this code and
> Bas patch for several hours today, and in the end neither of us seemed
> comfortable with taking Bas' patch here.  I highly recommend against taking
> that patch for Firefox 4, unless a safe fix which has been tested on the
> nightlies for some time can be backported to the 2.0 branch in a future
> dot-release.

Although any lifecycle management changes in a multithreaded environment are risky. I personally think this is a pretty safe patch (yes, it would require a QA cycle). If we do have any future RC I'd certainly recommend taking it as this is a pretty nasty bug, even though I can't imagine the prevalence being high.
Comment 29 Ehsan Akhgari [:ehsan] 2011-03-08 18:23:03 PST
(In reply to comment #27)
> (In reply to comment #24)
> > Note that if the scenario in comment 21 is correct, this might be exploitable,
> > because we'd be accessing |this| after the object has gone away, and it would
> > be possible to cause the memmove to overwrite a vtable, which means that we're
> > owned.  Determining if this can really happen requires understanding the code
> > way better than I do.
> 
> I agree with all this, I don't see a reasonable way for anyone to control what
> memory the memmove moves where though. But I don't see any attack vectors to
> inject anything into the vtable. Nor a way to get this vtable to be called
> after object destruction (the task function itself makes no virtual function
> calls, i.e. there's no vtable access after free).
> 
> So yes, memory can be moved into an area that once was a vtable, but not a
> vtable that's going to be used for anything.

I was not merely talking about the vtables for these objects.  The fault memmove call goes ahead and corrupts a large portion of the heap this way, so a value which is stored adjacent to an object pointer can be replaced with the pointer, for example, and that value could be constructed to point to the attacker's favorite vtable coming from an image, etc.  :-)
Comment 30 Ehsan Akhgari [:ehsan] 2011-03-08 18:24:52 PST
(In reply to comment #28)
> (In reply to comment #21)
> > This code is really hard to reason about.  Jeff and I discussed this code and
> > Bas patch for several hours today, and in the end neither of us seemed
> > comfortable with taking Bas' patch here.  I highly recommend against taking
> > that patch for Firefox 4, unless a safe fix which has been tested on the
> > nightlies for some time can be backported to the 2.0 branch in a future
> > dot-release.
> 
> Although any lifecycle management changes in a multithreaded environment are
> risky. I personally think this is a pretty safe patch (yes, it would require a
> QA cycle). If we do have any future RC I'd certainly recommend taking it as
> this is a pretty nasty bug, even though I can't imagine the prevalence being
> high.

Can you reply to comment 16, please?
Comment 31 Ehsan Akhgari [:ehsan] 2011-03-08 18:46:11 PST
Try server push: http://tbpl.mozilla.org/?tree=MozillaTry&rev=0ff4f7282a24
Comment 32 Bas Schouten (:bas) 2011-03-08 19:12:48 PST
Created attachment 517958 [details] [review]
Part 2: Stop using named objects [checkin: comment 58]

I don't agree with all of Ehsan's problems with the code, however he's certainly right that the use of named objects here is a -very- bad thing here! Let's stop doing it now.

I wrote this with in my mind that this was a singleton class, where it would've been benign, but not really that informative anyway. However under some conditions (driver crashes/etc.), it's not, if the shutdown then also takes longer than 5 seconds (I'll detail this situation as requested by Jeff in a separate comment), this will linger the old objects. This means that if the semaphore lingers, our new semaphore will share its state with the old one! This is a big problem since this could make the worker thread on one object process an event that was posted on the other. It would access the wrong array and not find any task there, and crash. This situation should also be extremely rare, and indeed if it does occur we're already stuck with an access-after-free without my other patch.

However it is very bad, as such I'm posting this here as we certainly want to fix this, quickly, as well.
Comment 33 Bas Schouten (:bas) 2011-03-08 19:21:14 PST
(In reply to comment #16)
> Bas, can you give a step-by-step walk through about your what happens with the
> current code, provided your theory about the problem is true?

Ehsan gives a pretty good description in comment 21.

Essentially what happens is this:

1) The destructor is called, this tells the thread to shutdown when it unblocks
2) The destructor waits for 5 seconds, and in order not to freeze the browser if the video driver is acting up, it continues execution, it finalizes and deallocates the object out under the task thread
3) The taskthread unblocks, and accesses the freed member variables
Comment 34 Bas Schouten (:bas) 2011-03-08 19:24:49 PST
Comment on attachment 517941 [details] [review]
Abort in case of unsafe program state (for Gecko 2.0)

As a measurement device this is a fine solution. I'm r+'ing this so we can go forward with it if this approach is the consensus.
Comment 35 Ehsan Akhgari [:ehsan] 2011-03-08 21:37:29 PST
Created attachment 517979 [details] [review]
Abort in case of unsafe program state (for Gecko 2.0) [check-in: comment 51]

This version changes the patch header so that if we decide to land it, it's hg import'able.
Comment 36 Mike Beltzner [:beltzner] 2011-03-09 05:00:00 PST
Comment on attachment 517979 [details] [review]
Abort in case of unsafe program state (for Gecko 2.0) [check-in: comment 51]

r+a=beltzner for landing on mozilla-central to get nightly feedback
Comment 37 Mike Beltzner [:beltzner] 2011-03-09 05:00:30 PST
(the r+ is carried over from Bas, whoops)
Comment 38 Ted Mielczarek [:ted, :luser] (on vacation until 07/18/2011) 2011-03-09 05:41:46 PST
Okay, so, I found STR thanks to Bas:
1) Run Firefox, load Hulu
2) Run another Firefox -no-remote, load Hulu
3) The first browser will crash
Comment 39 Ted Mielczarek [:ted, :luser] (on vacation until 07/18/2011) 2011-03-09 05:42:44 PST
I think this means that it's only reproducible when running two instances of Firefox, so this shouldn't be an issue that most people run into.
Comment 40 Jeff Muizelaar [:jrmuizel] 2011-03-09 05:48:23 PST
(In reply to comment #39)
> I think this means that it's only reproducible when running two instances of
> Firefox, so this shouldn't be an issue that most people run into.

And also suggests that this has to do with the named synchronization primitives.
Comment 41 Bas Schouten (:bas) 2011-03-09 05:48:39 PST
So, the second hunk of Ehsan's patch will catch what the actual bug is here. I've figured it out after thinking about it more.

The evil here is the named objects (indirectly: Nice catch Ehsan!), if you have two browser sessions open (i.e. two instances of firefox.exe) open in a single user login, the second will open the semaphore of the first (after all the name already exists within that Desktop Session). Then when one process gets a task processed, either one of the two can get released, if the wrong one is released mPendingReadbackTasks will be empty, and everything will fall apart. The fix for this is trivial, safe, and already part 2 on this bug.

This particular bug will only ever occur for users running -two or more- instances of firefox.exe (i.e. starting one with -no-remote), and having used windowless plugins in both of them.
Comment 42 Benjamin Smedberg [:bsmedberg] 2011-03-09 05:58:47 PST
Not a blocker.
Comment 43 Ehsan Akhgari [:ehsan] 2011-03-09 06:18:25 PST
(In reply to comment #41)
> So, the second hunk of Ehsan's patch will catch what the actual bug is here.
> I've figured it out after thinking about it more.
> 
> The evil here is the named objects (indirectly: Nice catch Ehsan!), if you have
> two browser sessions open (i.e. two instances of firefox.exe) open in a single
> user login, the second will open the semaphore of the first (after all the name
> already exists within that Desktop Session). Then when one process gets a task
> processed, either one of the two can get released, if the wrong one is released
> mPendingReadbackTasks will be empty, and everything will fall apart. The fix
> for this is trivial, safe, and already part 2 on this bug.
> 
> This particular bug will only ever occur for users running -two or more-
> instances of firefox.exe (i.e. starting one with -no-remote), and having used
> windowless plugins in both of them.

Is there any other scenario under which we allocate a second instance of ReadbackManagerD3D10 without the first instance's dtor being called?  For example, what happens if the graphics driver crashes/gets reset?

And BTW, does this mean that my abort patch is no longer needed?
Comment 44 Ehsan Akhgari [:ehsan] 2011-03-09 06:24:10 PST
And I still think that a scenario such as comment 21 needs to be addressed too.  In fact, I don't think why we're waiting 5 seconds for the thread to finish.  Why can't that be INFINITE?  I think whatever we do here, we need to address this case too, and my patch may help to see if comment 21 happens in practice, even with attachment 517958 [details] [review].
Comment 45 Andreas Gal :gal 2011-03-09 06:43:37 PST
I am not particularly confident in the patch, based on the discussion in the bug. We don't seem to understand at all still how this really triggers.
Comment 46 Andreas Gal :gal 2011-03-09 06:44:55 PST
Comment on attachment 517958 [details] [review]
Part 2: Stop using named objects [checkin: comment 58]

This seems to be the patch we really want to take instead.
Comment 47 Andreas Gal :gal 2011-03-09 06:45:56 PST
Also if comment 39 is correct, then this is most definitely not a blocker.
Comment 48 Ehsan Akhgari [:ehsan] 2011-03-09 06:48:36 PST
(In reply to comment #45)
> I am not particularly confident in the patch, based on the discussion in the
> bug. We don't seem to understand at all still how this really triggers.

Which patch do you mean?
Comment 49 Andreas Gal :gal 2011-03-09 06:53:15 PST
The one that is ready to land and got approval. Thats not going to fix anything. The unreviewed ones are the ones we need, right?
Comment 50 Ehsan Akhgari [:ehsan] 2011-03-09 07:47:27 PST
(In reply to comment #49)
> The one that is ready to land and got approval. Thats not going to fix
> anything. The unreviewed ones are the ones we need, right?

The purpose of that patch is not to fix anything.  Its purpose is to make the crashes visible in our crash data.  Please see the release-drivers thread which discusses why we need this patch at length.

The "part 2" patch fixes at least one manifestation of this problem, but I'm not convinced that it's all that we need to do in this bug yet.
Comment 51 Ehsan Akhgari [:ehsan] 2011-03-09 07:53:02 PST
Attachment 517979 [details] landed on mozilla-central as <http://hg.mozilla.org/mozilla-central/rev/cae1cd4aae73>.

(I initially landed it on the relbranch by mistake: <http://hg.mozilla.org/mozilla-central/rev/30a98bd29e53> and immediately backed it out: <http://hg.mozilla.org/mozilla-central/rev/864d3507de6e>)
Comment 52 Jeff Muizelaar [:jrmuizel] 2011-03-09 08:47:44 PST
Comment on attachment 517958 [details] [review]
Part 2: Stop using named objects [checkin: comment 58]

This is certainly an improvement, but why do we use win32 primitives instead of nspr here?
Comment 53 Bas Schouten (:bas) 2011-03-09 08:56:13 PST
(In reply to comment #52)
> Comment on attachment 517958 [details] [review]
> Part 2: Stop using named objects
> 
> This is certainly an improvement, but why do we use win32 primitives instead of
> nspr here?

This is platform specific code, roc told me it was fine to use platform primitives directly in these cases.

NSPR only offers us condition variables for use here, these (particularly our implementation), have a not-insignificant overhead. Implementing this using condition variables is also slightly less elegant. Semaphores are better suited for the Producer/Consumer problem and are, at least at this point, not available in an efficient mapping to Win32 APIs.

The main reason was to make the code as thin and light-weight as possible.
Comment 54 Daniel Veditz 2011-03-10 14:02:18 PST
The memory corruption memmove/memcpy can cause is usually exploitable, but getting your victim to run two copies of Firefox at the same time is a pretty big hurdle. Guessing [sg:low] severity in practice.

Probably don't need to keep this hidden.
Comment 55 [:Cww] 2011-03-11 12:18:44 PST
*** Bug 637331 has been marked as a duplicate of this bug. ***
Comment 56 Joe Drew (:JOEDREW!) 2011-03-11 14:57:56 PST
Stuart has given us the go-ahead to land attachment 517958 [details] [review] on mozilla-central. Bas, can you do so? (Or anyone else - adding checkin-needed.)
Comment 57 Ehsan Akhgari [:ehsan] 2011-03-11 18:23:26 PST
If there was ever another product using Gecko 2.0, users who ran those programs at the same time as Firefox could potentially run into the same issue as well!
Comment 58 Ehsan Akhgari [:ehsan] 2011-03-11 18:29:59 PST
http://hg.mozilla.org/mozilla-central/rev/4173c1b88642

After landing I realized that I had committed this under my name.  Going through bash log shows why:

hg qref -U 'Bas Schouten <bas.schouten@live.nl>'

(Note the uppercase U).

Sorry about that.  And in the future, please add a From header to the patches that you attach so that this won't happen again.  Thanks!
Comment 59 Joel 2011-03-15 17:28:22 PDT
Hi I Just had my first bug marked a dup of this one (sorry)

I can reproduce it is every time in 5 min or less if that helps.

Here's my dup https://bugzilla.mozilla.org/show_bug.cgi?id=641293

Let me know if there is anything a non-coding nOOb can do...
Comment 60 Ehsan Akhgari [:ehsan] 2011-03-23 19:50:57 PDT
This bug in its original form has been fixed.  Further patches should move to their own bugs.
Comment 61 Daniel Veditz 2011-03-31 12:04:37 PDT
Ehsan: what are the risks in taking the two approved patches on the mozilla2.0 repo?
Comment 62 Benjamin Smedberg [:bsmedberg] 2011-03-31 12:29:33 PDT
I believe it is absolutely essential to take these on the branch if there are going to be any other apps such as tbird running at the same time. Otherwise, this is going to end up as a guaranteed crash.
Comment 63 Ehsan Akhgari [:ehsan] 2011-03-31 17:25:03 PDT
(In reply to comment #61)
> Ehsan: what are the risks in taking the two approved patches on the mozilla2.0
> repo?

The *only* patch needed to fix this bug is attachment 517958 [details] [review].  It is as safe as a patch could get.  We basically create unnamed event objects, which is how event objects should be created by default, so I think this patch is a very safe change to take on a stable branch, and as Benjamin points out in comment 62, we should definitely take it.
Comment 64 Ehsan Akhgari [:ehsan] 2011-03-31 17:25:55 PDT
Comment on attachment 517958 [details] [review]
Part 2: Stop using named objects [checkin: comment 58]

Asking for approval to take this patch on mozilla-2.0.
Comment 65 Daniel Veditz 2011-04-01 11:47:27 PDT
Comment on attachment 517958 [details] [review]
Part 2: Stop using named objects [checkin: comment 58]

Approved for the mozilla2.0 repository, a=dveditz for release-drivers
Comment 66 Ehsan Akhgari [:ehsan] 2011-04-01 14:32:05 PDT
http://hg.mozilla.org/releases/mozilla-2.0/rev/d4b16b847d92
Comment 67 Joel 2011-04-01 16:05:11 PDT
Not fixed for me. went away for a week or so. now it's back. Forcing me back to Chrome, opera and safari.

Could someone please change the status or un-dup my bug.
Comment 68 Ehsan Akhgari [:ehsan] 2011-04-02 09:30:40 PDT
(In reply to comment #67)
> Not fixed for me. went away for a week or so. now it's back. Forcing me back to
> Chrome, opera and safari.

Ca you please go to about:crashes, and paste the contents of that page here?
Comment 69 Ryan Pertusio 2011-04-02 22:24:04 PDT
Verifying that crash no longer occurs for the following sites: Hulu, Pogo, Blip (which were 3 easy sites to test the crash).

CRASHES  with 4.0pre (2011-04-01) (20110401030210)or earlier
NO CRASH with 4.0pre (2011-04-02) (20110402030209)

Good job!
Comment 70 Joel 2011-04-02 23:22:14 PDT
No new reports in About:crashes. Ie, it's not leaving a crash report. The only crashes there have the wrong time/date.... It's in the but I lodged that was marked as a dupe (I'm not clever enough to confirm the dupe, Everything on this page goes straight over my head.)

It was fixed in RC2 yet since final release (and I don't think I updated?) it's crashing again.

When it crashes, its really fast, the window just closes (sometime as I type) with no warning and no crash report.

I'm mostly on Facebook and Mafia Wars (with both windows/profiles) when it happens. But it can happen with the Firefox start page as the only tab in two profiles. I don't think it's site related....

about:crashes copy/paste:
Report ID
        Date Submitted
      
    
    
    bp-84f2e4a9-0e5b-47f9-8faf-9b23021104012/04/20118:41 AM10340c57-8d37-4baf-bb76-ea5d44fac63331/03/20118:16 AM96ca20bc-26d8-4091-9fc5-464d2e65cd555/03/20115:19 PM22e9683e-15df-4426-9f5d-e2cfac2a13a327/01/20114:42 PMc5156629-7bbf-48f8-992b-3b69721ac0392/01/201112:02 PMa2e23d5f-7566-4be2-8381-9ae5538932192/01/20119:28 AMbp-b4cc27c8-d36e-4875-b649-7a0ae210122930/12/201011:22 AMbp-1cfb0957-1991-4e2c-92e1-be444210122727/12/201010:03 PMbp-ad475ca0-4a9d-4237-9500-794d5210122727/12/20109:58 PM9a812b12-70d5-458a-be1c-04b3032884c626/12/20101:57 PM77126794-b2ec-4d92-b538-2724394144ac26/12/201012:19 PMbp-2e0fe8e4-f330-43c7-b01c-fb422210122324/12/201012:34 PMbp-180b52eb-c175-48aa-b4de-c8f6c210122323/12/20108:00 PM

as you can see these crashes go back months, yet I have encountered multiple profile crashes at least daily for the last few weeks (except for a few days between RC2 and final FF4, when it worked fine).... The last one was on 31/3/11 just after 9pm, it was the third in under an hour (I went back to chrome for second profile after that). There should be strings of crash reports 5-10 min apart......

Hope this helps.
Comment 71 Benjamin Smedberg [:bsmedberg] 2011-04-08 13:03:10 PDT
*** Bug 641293 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.