
Tail Call bug in RyuJIT - Incorrect parameters passed #4391

Closed
NickCraver opened this issue Jul 27, 2015 · 63 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), optimization

Comments

@NickCraver
Member

The following critical bug is known and fixed internally (not yet released) by Microsoft (per private discussions while the security impact was evaluated). In this bug, "we" refers to Marc Gravell (@mgravell) and me; we tracked down the issue on the Stack Overflow side of the fence. I am posting it here for public visibility and to help users hitting the same issue. I have also written a blog post covering discovery and severity here: Why you should wait on upgrading to .Net 4.6.

There is an issue in the .Net 4.6 RTM RyuJIT implementation that incorrectly hooks up parameters during some tail call optimizations. The net result of this is that tail call methods can get the wrong parameters passed. Here's our code exhibiting the issue:

void Set<T>(string key, T val, int? durationSecs, bool sliding, bool broadcastRemoveFromCache = false)
{
    SetWithPriority<T>(key, val, durationSecs, sliding, CacheItemPriority.Default);
}

void SetWithPriority<T>(string key, T val, int? durationSecs, bool isSliding, CacheItemPriority priority)
{
    key = KeyInContext(key);

    RawSet(key, val, durationSecs, isSliding, priority);
}

void RawSet(string cacheKey, object val, int? durationSecs, bool isSliding, CacheItemPriority priority)
{
    var absolute = !isSliding && durationSecs.HasValue 
                   ? DateTime.UtcNow.AddSeconds(durationSecs.Value) 
                   : Cache.NoAbsoluteExpiration;
    var sliding = isSliding && durationSecs.HasValue 
                  ? TimeSpan.FromSeconds(durationSecs.Value) 
                  : Cache.NoSlidingExpiration;

    HttpRuntime.Cache.Insert(cacheKey, val, null, absolute, sliding, priority, null);
}

RawSet() is the last call in the chain, and only when optimizations are enabled is it incorrectly optimized. If, for example, we call Set<T> with a durationSecs of 3600, we would expect that value to go all the way down the method chain. This doesn't happen. Instead, the value passed to the tail method (RawSet) is seemingly random (our assumption is that it pulls some other value from the stack). For example, it may be 30, or 57, or null.

The net result for us is that items are getting cached for drastically shorter durations, or not at all (null often gets passed). On Stack Overflow this causes unpredictable local HTTP cache usage, more hits, and heavier load on anything populating it. This is a production blocker on deployment for us, and (we believe) should be for anyone else.

We have collapsed the reproduction of this bug into a project you can run locally, available on GitHub. Here's an example test run (see the repo README for a full description):

[Screenshot: results window from an example test run]

I can't stress enough how serious a bug this is, given the subtlety of the occurrence, the production-only likelihood of hitting it (it manifests only in RELEASE builds), and the scary examples we can easily come up with. When the parameters you're passing aren't the ones the method is getting, all sanity goes out the window.

What if your method says how much stock to buy? What if it gives dosing information to a patient? What if it tells a plane what altitude to climb to?

This bug is critical. I am posting the issue for several reasons:

  • We're not jerks
  • To raise awareness
  • To prevent as many people as possible from deploying affected code
  • To usher a fix and deployment from Microsoft as soon as possible

If you have already deployed .Net 4.6, our recommendation at this point is to disable RyuJIT immediately.
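
For reference, here is a minimal sketch of the opt-out. The useLegacyJit switch below is the documented way to fall back to the legacy 64-bit JIT; confirm the exact steps against the blog post and Microsoft's advisory before deploying.

<!-- app.config / web.config sketch: opt a single application out of RyuJIT on .NET 4.6 -->
<configuration>
  <runtime>
    <useLegacyJit enabled="1" />
  </runtime>
</configuration>

Machine-wide, the same switch is a DWORD value named useLegacyJit, set to 1, under HKLM\SOFTWARE\Microsoft\.NETFramework.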

@vcsjones
Member

I understand from various sources that given the normal patch workflow, the next hotfix window is September. For the appropriate people at Microsoft, please consider an out-of-band emergency fix for this. This issue is a show-stopper for us in moving to .NET 4.6.

@richlander
Member

Thanks @NickCraver for raising visibility.

I'll be talking with the team later today.

@vcsjones - For clarity, is this a show-stopper because you are concerned or because you have a repro w/your workload?

/cc @CarolEidt, @cmckinsey

@YuvalItzchakov

Has anyone been able to understand the root cause of this? Narrowing the problem to certain use-cases can help "calm" people down.

@NickCraver
Member Author

@richlander not to be difficult, but given the nature of the bug... how would you (as a developer) know? This is such a subtle, production-only bug that in most cases I think people wouldn't find it, even in testing.

@YuvalItzchakov the issue here is it can affect any library as well, not just your code. Even if you had specific patterns, I'm not sure that would help any here. In our case it was a RELEASE library distributed by our internal NuGet server...so really like any other NuGet package.

@vcsjones
Member

@richlander @NickCraver pointed out that this was originally discovered while investigating an issue in MiniProfiler, which we actively use. We are re-opening a few incidents we had on an application to determine if this is the cause, or if it is related. At the time we did not have a mini-dump available to investigate, but that is the next thing we are working on.

@mgravell
Member

@vcsjones the "mini profiler" thing here was just what was visible; the actual glitch manifested in a different library that was providing a custom storage provider for mini profiler. So: the issue didn't actually happen in mini-profiler, but at the same time I can't say that mini-profiler isn't affected. Just don't focus on the mention of mini-profiler.

@vcsjones
Member

@mgravell Agree. My point being, this was seen in a real library. It isn't just some mocked-together code that uses tricks to make the issue manifest. It doesn't take special work to hit it. Thus it is very likely this affects other real-world applications, and it has severe consequences.

@mgravell
Member

Yes; absolutely; that's what scares us. Especially if Windows 10 deploys .NET 4.6... Wednesday could see random eager people installing, and potentially have their desktop business apps start behaving differently in subtle, hard-to-spot, yet dangerous ways (whether to business well-being or to personal safety)... Plus, of course, the VS2015 release prompting server-side installs.

@masonwheeler
Contributor

How likely is this to affect the current work in progress to get RyuJIT working for the 32- and 64-bit versions of ARM? Is this bug AMD64-specific or platform-agnostic?

@richlander
Member

The team is taking this very seriously. We're going to talk about it later today as folks get into the office.

Definitely appreciate all the attention on the issue.

@ChrisMcKee

@mgravell would the release have any effect if the apps are targeting < 4.6 though? Probably too late to slipstream RyuJIT as disabled considering the mode of distribution. 👍 Good find

@vcsjones
Member

@ChrisMcKee Yes, this affects all applications that target 4.5.x (possibly earlier?) through 4.6. Just having RyuJIT installed is enough to introduce the problem. http://nickcraver.com/blog/2015/07/27/why-you-should-wait-on-dotnet-46/#comment-2159337569

@JamesThorpe

@vcsjones My understanding is that the same JIT compiler would be used in any 4.x environment, possibly even for things targeting .NET 2 but running under the 4.x CLR?

@masonwheeler
Contributor

Wait, this will glitch out on 4.5 as well?

@NickCraver
Member Author

@masonwheeler Correct, if .Net 4.6 is installed the JIT appears to be used for both. Our repro case targets .Net 4.5 and .Net 4.5.2 in the CorruptionRepro and StackRedis projects respectively.

@ChrisMcKee

@NickCraver even bigger 'good catch'. I was sat whistling, ah well they'll fix 4.6. Nobody expects the regression fail.

@NotMyself

Thanks for the heads up @NickCraver, was going to push to get 4.6 on our build server this week. Going to hold back till a patch is issued now.

@latkin

latkin commented Jul 27, 2015

Is this the same tailcall bug as raised here? dotnet/fsharp#536

@AndreyAkinshin
Member

I have found some trouble in the asm code.
Part 1.
[Screenshot: screen1]
The call of OnLogDuration from SetWithPriority:

mov         r8,5BFFFCC3E8h
mov         r8,qword ptr [r8]        ; "LocalCache.SetWithPriority"
mov         rcx,rsi                  ; key
mov         rdx,qword ptr [rsp+70h]  ; durationSecs
call        00007FFA7292EF80         ; OnLogDuration

We can see that durationSecs is stored at address [rsp+70h]. So far, everything looks fine.

Part 2.
[Screenshot: screen3]
The call of OnLogDuration from RawSet:

mov         r8,5BFFFCC3F0h  
mov         r8,qword ptr [r8]        ; "RawSet"
mov         rcx,rsi                  ; key
mov         rdx,qword ptr [rsp+0C8h] ; durationSecs
call        00007FFA7292EF80         ; OnLogDuration

We can see that RyuJIT tries to read durationSecs from [rsp+0C8h]. But:

RSP                  = 0x0000005787AAD6C0
durationSecs address = 0x0000005787AAD790
durationSecs offset  = 0x00000000000000D0

As we can see, the calculated durationSecs offset is 0C8h, but the actual offset is 0D0h. In this case, the TCO calculation error is 8 bytes.

@mgravell
Member

@latkin I don't think so; my reasoning: the bug you link to threw a fatal execution exception. It was broken in a way that was detectable. Contrast: silently substituting parameter values.

@mgravell
Member

@AndreyAkinshin nice bit of demonstration, thanks

@mmitche
Member

mmitche commented Jul 27, 2015

@latkin I'll try and confirm whether it is the same or not.

@viboux

viboux commented Jul 27, 2015

Which parameter is causing the trouble? The "object val" or the "int? durationSecs"?

The int? adds some wrapping/unwrapping around the real 32-bit integer in memory. What is the size of the wrapper (with the hasValue flag)?

@AndreyAkinshin
Member

@viboux, the parameter is int? durationSecs. It is stored on the stack in unboxed form; the size is 8 bytes.
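
A minimal sketch (hypothetical, not part of the original comment) that makes this layout visible, using the same explicit-layout overlay trick as the Hack struct that appears later in this thread: byte 0 holds the hasValue flag and bytes 4-7 hold the value, for 8 bytes in total once alignment padding is counted.

using System;
using System.Runtime.InteropServices;

class NullableLayoutDemo
{
    // Overlay an int? with a raw long so the in-memory layout is visible.
    [StructLayout(LayoutKind.Explicit)]
    struct Overlay
    {
        [FieldOffset(0)] public int? NullableInt;
        [FieldOffset(0)] public long Raw;
    }

    static void Main()
    {
        var o = new Overlay { NullableInt = 42 };
        // Low byte is the hasValue flag (01); the value 0x2A sits in the upper 4 bytes.
        Console.WriteLine("0x{0:X16}", o.Raw); // prints 0x0000002A00000001 on the desktop CLR
    }
}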

@mmitche
Member

mmitche commented Jul 27, 2015

@masonwheeler You should only see this on the CLR running in 64-bit mode.

@masonwheeler
Contributor

@mmitche That's good to know. I know that if you want to load a 32-bit native DLL into the process, it will force everything into 32-bit mode, but what if you're calling an out-of-process COM server that's implemented in a 32-bit DLL? (Yes, this question is based on a very specific real-world use case.)

@mmitche
Member

mmitche commented Jul 27, 2015

@masonwheeler I don't believe that would force it into 32-bit mode. However, IIRC, targeting your app to x86 instead of AnyCPU should cause it to run on the 32-bit CLR.
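
As an illustration (a sketch, not something from this thread), forcing an executable onto the 32-bit CLR is a project-file setting; these are the standard MSBuild properties, so verify them for your project type.

<!-- Project file (.csproj) fragment: run this executable on the 32-bit CLR. -->
<PropertyGroup>
  <PlatformTarget>x86</PlatformTarget>
  <!-- Alternatively, AnyCPU executables on .NET 4.5+ honor Prefer32Bit: -->
  <!-- <Prefer32Bit>true</Prefer32Bit> -->
</PropertyGroup>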

@masonwheeler
Contributor

@mgravell

I don't think so; my reasoning: the bug you link to threw a fatal execution exception. It was broken in a way that was detectable.

Are you sure? It's not difficult to imagine any number of different ways that data corruption could lead to an exception being raised. For example, if one of the affected arguments was a delegate, you could easily see something like the scenario described in that bug report.

Not saying it is, or even probably is the same bug; just that it looks to me like it could be.

@mgravell
Member

@masonwheeler fair enough; that's a good point


@masonwheeler
Contributor

It was already plenty small. Are we playing Bug Golf? :P

@AndreyAkinshin
Member

@masonwheeler, in such issues, it is very important to produce a minimal reproducible example, because every excess instruction complicates the analysis of the bug.

@mgravell
Member

Nice; we did try a console repro, but it didn't show initially, so we stuck with IIS to make sure we had something concrete to show. The fact that it can be repro'd so minimally is fun.


@trayburn

For those who have been deep into the internals for this bug, can anyone confirm that the random memory address that is being read to get the "incorrect" value is 100% Absolutely Without A Doubt coming from within the memory space of the .NET Process? If not, as horrible as this is already, this becomes much worse as an angle of attack to read and/or write random memory locations on boxes running .NET Framework 4.6.

Not being limited to the memory space of the process means this could introduce instability into systems running 4.6, and break the process isolation of most shared hosting environments.

@ayende
Contributor

ayende commented Jul 28, 2015

There is no way for the process to be just randomly reading some other process's memory. You need ReadProcessMemory() calls to do that, and those don't just happen, even with this issue.


@JamesThorpe

@trayburn Nick addressed this point on his blog:

The exact line we got from Microsoft (when discussing security) was: "The only data that could be erroneously passed would be locals from the current frame". We can't confirm this without seeing the fix applied (not in the public repo yet), but we have no reason to believe otherwise. That being said, we couldn't relate which locals we were seeing as values here (they may not be your locals, for example).

@Sebazzz

Sebazzz commented Jul 28, 2015

Not being limited to the memory space of the process means this could introduce instability into systems running 4.6, and break the process isolation of most shared hosting environments.

As Ayende said, it will not break process isolation, because reading from another process like that is not allowed in Windows. However, applications in the same application pool can affect each other, since an app domain within an application pool cannot provide isolation from this, I think.

@trayburn

@JamesThorpe I agree; I've seen the comment where he passed that along. But given the digging done by @AndreyAkinshin above, note his last line:

As we can see, the calculated durationSecs offset is 0C8h, but the actual offset is 0D0h. In this case, the TCO calculation error is 8 bytes.

We see durationSecs being read, via an offset, and being undershot from the correct value (0C8h < 0D0h), but if that could ever be an overshoot instead, things get much worse, hence I think it is worth asking.

Also, given the utterly minimal repro we now have from @angelsl and @AndreyAkinshin, I'm challenged to think where the value 32763 could possibly be a local, unless we're reading the middle of a pointer, which, if we can assign to it, means we could break memory boundaries simply by returning to the original object/pointer.

@YuvalItzchakov

@jamesmanning You're talking about tail-recursion, which is a special case of tail-call.

Wikipedia:

Tail calls can be implemented without adding a new stack frame to the call stack. Most of the frame of the current procedure is not needed any more, and it can be replaced by the frame of the tail call, modified as appropriate (similar to overlay for processes, but for function calls).

@masonwheeler
Contributor

@jamesmanning: Tail recursion is a specific case of tail-call optimization where the tail call is to the same function. It's not the only use for tail-call optimization, though.
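
To make the distinction concrete, here is a small illustrative sketch (not from the thread; the method names are hypothetical):

using System;

static class TailCallExamples
{
    // Tail recursion: the call in tail position targets the same method.
    static long SumTo(long n, long acc = 0) => n == 0 ? acc : SumTo(n - 1, acc + n);

    // A plain tail call: the last action is a call to a different method,
    // which is the shape of the Set -> SetWithPriority -> RawSet chain above.
    static void Log(string message) => Console.WriteLine(message);
    static void LogError(string message) => Log("ERROR: " + message);
}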

@jamesmanning

@YuvalItzchakov thanks - I had never done TCO outside of the context of recursive calls, but now it makes sense. 😄

Sorry for the noise!

@AndreyAkinshin
Member

@mgravell, @angelsl, @trayburn, I have new interesting results: the bug allows us to get the return address. Let's discuss it in detail. My new code:

using System;
using System.Runtime.InteropServices;

class Program
{
    static void Main()
    {
        new Program().SetWithPriority<string>(null, null, int.MaxValue, true);
    }

    void SetWithPriority<U>(string key, U val, int? durationSecs, bool isSliding)
    {
        key = key;
        RawSet(key, val, durationSecs, isSliding);
    }

    void RawSet(string cacheKey, object val, int? durationSecs, bool isSliding)
    {
        Hack hack = new Hack();
        hack.NullableInt = durationSecs;
        Console.WriteLine("{0:X}", hack.Long);
    }

    [StructLayout(LayoutKind.Explicit)]
    struct Hack
    {
        [FieldOffset(0)]
        public int? NullableInt;
        [FieldOffset(0)]
        public long Long;
    }
}

The expected value is 7FFFFFFF00000001, but with RyuJIT I get 7FFBAC784303 (you may get a different value). We need to go deeper.

Part 1

[Screenshot: screen1]
Position: before the call of SetWithPriority. The address of the next line after the call line is 7FFBAC784303 (return address).

Part 2

[Screenshot: screen2]
Position: the last line of SetWithPriority, the jmp instruction. In the screenshot you can see the correct value of durationSecs: 7FFFFFFF00000001. OK, let's press F11.

Part 3

[Screenshot: screen3]
Position: the first line of RawSet. As we can see, the jmp instruction puts the return address (7FFBAC784303) before the value of durationSecs. As I wrote earlier, the durationSecs offset error equals 8 bytes. Thus, RyuJIT uses the return address instead of durationSecs. My Hack structure lets me read the value of the return address and print it.

@mmitche
Member

mmitche commented Jul 28, 2015

@brandonagr

@mmitche Am I correct in thinking that the fix for dotnet/coreclr#1299 is for a different issue than the fix here? Just adding the TailCallOpt registry key does not seem to fix 1299 for me locally, but it did fix the regression example you committed earlier for this issue when I ran both locally in the same project.

@Eyas

Eyas commented Jul 29, 2015

Yup, @schellap indicates it's a bug in copy propagation, a completely different compiler optimization. I don't think there's a public fix for that yet.

@forki

forki commented Jul 29, 2015

Very cool. Can we expect similar blog posts for dotnet/coreclr#1299 and dotnet/fsharp#536?

@RichardD2

Disabling RyuJIT caused our 64-bit ASP.NET 4.5 AppPool to die!

An unhandled exception occurred and the process was terminated.

Exception: System.Threading.ThreadAbortException
   at System.Web.HttpRuntime.ProcessRequestNotificationPrivate(IIS7WorkerRequest wr, HttpContext context)
   at System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper(IntPtr rootedObjectsPointer, IntPtr nativeRequestContext, IntPtr moduleData, Int32 flags)
   at System.Web.Hosting.PipelineRuntime.ProcessRequestNotification(IntPtr rootedObjectsPointer, IntPtr nativeRequestContext, IntPtr moduleData, Int32 flags)

Followed by:

Faulting application name: w3wp.exe, version: 7.5.7601.17514, time stamp: 0x4ce7afa2
Faulting module name: KERNELBASE.dll, version: 6.1.7601.18847, time stamp: 0x554d7df1
Exception code: 0xe0434352
Fault offset: 0x000000000000b3dd
Faulting process id: 0xe58
Faulting application start time: 0x01d0ca0eb6823e3e
Faulting application path: c:\windows\system32\inetsrv\w3wp.exe
Faulting module path: C:\Windows\system32\KERNELBASE.dll

And:

Problem signature:
P1: w3wp.exe
P2: 7.5.7601.17514
P3: 4ce7afa2
P4: KERNELBASE.dll
P5: 6.1.7601.18847
P6: 554d7df1
P7: e0434352
P8: 000000000000b3dd
P9: 
P10: 

And finally:

An unhandled win32 exception occurred in w3wp.exe [3672]. Just-In-Time debugging this exception failed with the following error: Debugger could not be started because no user is logged on.

Check the documentation index for 'Just-in-time debugging, errors' for more information.

After that error was logged five times, the AppPool was disabled, and the site returned a 503 error.

After removing the "EnableLegacyJIT" value and restarting the AppPool, the site seems to be back to normal.

This was on Windows Server 2008 R2 SP1 with all updates installed.

@mmitche
Member

mmitche commented Jul 29, 2015

@RichardD2 This is interesting. Can you file a bug on Connect? http://connect.microsoft.com/VisualStudio

@mmitche
Member

mmitche commented Jul 29, 2015

@RichardD2 Thanks!

@Rutix

Rutix commented Jul 29, 2015

@mmitche Now that this is merged, when can we expect the hotfix/patch to be rolled out?

@mmitche
Member

mmitche commented Jul 29, 2015

@Rutix. We are still working that out. Watch the blog post for details :) http://blogs.msdn.com/b/dotnet/archive/2015/07/28/ryujit-bug-advisory-in-the-net-framework-4-6.aspx

@clivetong

@mmitche I put in a repro for what looks like this 64-bit RyuJIT bug some time ago - https://connect.microsoft.com/VisualStudio/feedback/details/1372514
Is it possible to check that this is fixed by this change?

@forki

forki commented Jul 31, 2015

@clivetong your link is 404 for me

@clivetong

@forki Thanks for the notification. I originally submitted it as private, but then changed it to public several weeks ago. I can't find any other visibility options, and other issues I have submitted as public are visible to others, so I guess it's game over.
[Screenshot: Connect issue]

@forki

forki commented Jul 31, 2015

so you say this bug was known before release?!

@leecow
Member

leecow commented Jul 31, 2015

@clivetong - The issue you reported through Connect is the same as what we're talking about in this thread.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 30, 2020
@dotnet dotnet locked as resolved and limited conversation to collaborators Jan 5, 2021