
[torch.compile]: Enhanced Error Reporting and Performance Canary Mode #126644

Open
bhack opened this issue May 19, 2024 · 12 comments
Labels
feature A request for a proper, new feature. oncall: pt2 triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments


bhack commented May 19, 2024

🚀 The feature, motivation and pitch

Background

Handling PyTorch compile issues and ensuring reproducibility on minimal isolated code is currently quite labor-intensive. This challenge impacts both:

  • Users and developers trying to isolate and reproduce errors.
  • Triagers or compiler team members working with third-party compiled code, especially for public OSS models.

The complexity increases significantly when compiling full models or high-level def functions in a chain. Often, a single error might be hidden within a chain of errors, complicating error reporting and resolution.

Proposal

  1. Enhanced Error Isolation and Reporting:

    • Isolate Failed Function:
      Implement a mechanism to exactly isolate the function where the compilation failed. This will allow users to report the specific function causing the issue without additional effort.
    • Record Fake Inputs:
      Automatically record fake inputs to facilitate error reproduction without the need for users to fully reproduce their dataset setup. This ensures that developers and triagers can recreate the issue reliably with minimal setup.
  2. Performance Canary Mode:

    • Store Baseline Info:
      Introduce a mode where running an uncompiled model stores baseline performance data (e.g., memory usage, speed) on disk.
    • Automatic Regression Detection:
      When running the compiled model, automatically compare current performance against the stored baseline. If there are regressions in memory usage or speed, users should be warned.
    • Simplified Reporting:
      In case of performance regressions, provide an easy and straightforward way for users to report these issues.
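As a rough illustration of the canary-mode proposal, here is a minimal plain-Python sketch. The `PerfCanary` class, the JSON file format, and the `tolerance` knob are all hypothetical names, not an existing torch.compile API; a real implementation would measure wall time and CUDA memory around the model instead of `tracemalloc`.

```python
# Hypothetical "performance canary" sketch: record a baseline run's
# timing/memory to disk, then warn if a later (e.g. compiled) run regresses.
import json
import time
import tracemalloc
import warnings
from pathlib import Path

class PerfCanary:
    def __init__(self, path="canary_baseline.json", tolerance=0.10):
        self.path = Path(path)
        self.tolerance = tolerance  # allowed fractional regression

    def _measure(self, fn, *args, **kwargs):
        # Time the call and track its peak Python heap allocation.
        tracemalloc.start()
        t0 = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - t0
        _, peak_mem = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        return result, {"seconds": elapsed, "peak_bytes": peak_mem}

    def record_baseline(self, fn, *args, **kwargs):
        """Run the uncompiled model once and persist its stats to disk."""
        result, stats = self._measure(fn, *args, **kwargs)
        self.path.write_text(json.dumps(stats))
        return result

    def check(self, fn, *args, **kwargs):
        """Run the (compiled) model and warn on regressions vs. the baseline."""
        result, stats = self._measure(fn, *args, **kwargs)
        baseline = json.loads(self.path.read_text())
        for key in ("seconds", "peak_bytes"):
            if stats[key] > baseline[key] * (1 + self.tolerance):
                warnings.warn(
                    f"canary: {key} regressed from {baseline[key]:.4g} "
                    f"to {stats[key]:.4g}"
                )
        return result
```

Usage would be `canary.record_baseline(model, batch)` on the eager model once, then `canary.check(compiled_model, batch)` afterwards; the warning is the "automatic regression detection" hook where simplified reporting could attach.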

Benefits

  • For Users/Developers:
    • Simplifies the process of isolating and reporting compile errors.
    • Enhances reproducibility by automatically recording necessary inputs.
  • For Triagers/Compiler Team:
    • Provides clearer insights into the specific functions causing issues.
    • Facilitates quicker diagnosis and resolution of performance regressions.

/cc @ezyang @msaroufim @bdhirsh @anijain2305 @chauhang

Alternatives

No response

Additional context

No response

xmfan added the feature (A request for a proper, new feature.) label May 20, 2024

ezyang commented May 21, 2024

Performance canary mode is a good idea, I often want information about baseline versus optimized comparison.


bhack commented May 21, 2024

What about the first point instead?


ezyang commented May 21, 2024

Want to serialize MetaTensorDesc from fakeification, logical place is in structured_trace. Also good idea, not too difficult. Failed function should work already, we have user stacks and just report it.


bhack commented May 21, 2024

Failed function should work already, we have user stacks and just report it.

But triagers often still require a minimal repro, and producing one is a lot of work, especially for intermediate/leaf functions.
So when you have decorated/compiled a high-level def and something fails somewhere in the chain of compiled defs, we need a quick way to report the issue without sharing everything for reproducibility.

An additional point: a compile-deactivation decorator, so that while we open a ticket we could still disable compilation of the failing def without doing a binary search over the full def chain in the code.
That way we could still run the partially compiled working code, or open new tickets for other failures in the compile backtrace.
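As an illustration, such a deactivation decorator could be as small as a deny-list wrapper. `maybe_compile` and the `COMPILE_DENYLIST` env var are made-up names, and the compiler backend is injectable so the sketch runs without torch; in practice you would pass `compiler=torch.compile`.

```python
# Sketch of the "compile deactivation" idea: compile a function unless its
# name appears in a deny-list, so a known-failing def can be skipped without
# bisecting the whole chain by hand. All names here are hypothetical.
import functools
import os

def maybe_compile(fn=None, *, compiler=None, env_var="COMPILE_DENYLIST"):
    if fn is None:
        # Called as @maybe_compile(compiler=...); return the real decorator.
        return functools.partial(maybe_compile, compiler=compiler, env_var=env_var)
    denylist = set(filter(None, os.environ.get(env_var, "").split(",")))
    if fn.__name__ in denylist or compiler is None:
        # Leave the failing (or un-requested) def uncompiled but callable.
        return fn
    return compiler(fn)
```

PyTorch does already ship `torch.compiler.disable` as a per-function escape hatch; the sketch only adds the "toggle by name, without touching the code" angle discussed above.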

@ezyang
Copy link
Contributor

ezyang commented May 29, 2024

Yeah, agreed. There is definitely stuff here, holistically, that we can do better.

ezyang reopened this May 31, 2024
zou3519 added triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) and removed triage review labels Jun 4, 2024
petrex pushed a commit to petrex/pytorch that referenced this issue Jun 5, 2024
This adds dumps of MetaTensorDesc and MetaStorageDesc to structured logs
when they are triggered from Dynamo.  The logs look like this:

```
V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:195] {"describe_storage": {"id": 0, "describer_id": 0, "size": 32}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}
V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:220] {"describe_tensor": {"id": 0, "ndim": 1, "dtype": "torch.float32", "device": "device(type='cpu')", "size": [8], "is_leaf": true, "stride": [1], "storage": 0, "view_func": "<built-in method _view_func_unsafe of Tensor object at 0x7f882959e840>", "describer_id": 0}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}
V0522 08:13:25.268000 140224882566144 torch/_subclasses/meta_utils.py:1594] {"describe_source": {"describer_id": 0, "id": 0, "source": "L['x']"}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}
```

The `describer_id` is used to disambiguate ids.  We expect it to be
unique per frame id, but if there is a bug it possibly is not.  Note you will get
redundant dumps when evaluation restarts.

tlparse can use this to give a visualization of input tensors to a
model, you could also use this to generate example inputs to run graphs
on.
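For illustration (this is not part of the commit above), a `describe_tensor` line like the excerpt can be mechanically turned into an allocation recipe. The field names follow the log excerpt; `tensor_recipe` itself is a hypothetical helper, and the emitted `torch.empty_strided` call is one plausible target.

```python
# Sketch: parse one structured-log line back into code that allocates a
# matching example input tensor (contents are garbage, metadata matches).
import json

def tensor_recipe(log_line: str) -> str:
    record = json.loads(log_line)
    desc = record["describe_tensor"]
    size = desc["size"]
    stride = desc["stride"]
    dtype = desc["dtype"]
    # The device arrives as a repr(), e.g. "device(type='cpu')"; keep it verbatim.
    device = desc["device"]
    return (
        f"torch.empty_strided({size}, {stride}, "
        f"dtype={dtype}, device=torch.{device})"
    )

line = (
    '{"describe_tensor": {"id": 0, "describer_id": 0, "ndim": 1, '
    '"dtype": "torch.float32", "device": "device(type=\'cpu\')", '
    '"size": [8], "is_leaf": true, "stride": [1], "storage": 0}, '
    '"frame_id": 0}'
)
print(tensor_recipe(line))
# -> torch.empty_strided([8], [1], dtype=torch.float32, device=torch.device(type='cpu'))
```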

Some care is taken to avoid redumping the tensor metadata multiple
times, which would happen ordinarily because AOTAutograd refakifies
everything after Dynamo, to deal with metadata mutation.

Partially fixes pytorch#126644

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: pytorch#126879
Approved by: https://github.com/jamesjwu

bhack commented Jun 11, 2024

See also
#128134 (comment)


ezyang commented Jun 11, 2024

@bhack here's a doc I've been working on that I'd love your help previewing. Any comments / suggestions for what else to add would be helpful: https://docs.google.com/document/d/1y5CRfMLdwEoF1nTk9q8qEu1mgMUuUtvhklPKJ2emLU8/edit


bhack commented Jun 11, 2024

It is a good starting point, but my impression is that it moves too fast toward users who want to go deep into compiler internals. I think the first goal is to lower the number of issues stuck waiting on reproducibility, also for the less advanced PyTorch developer.
I also expect, and it was the main scope of this ticket, that beyond the minifier we could really have something else to automate the isolation/reporting of a failure, other than the full TORCH_TRACE, which is probably a solution more oriented to released OSS models (or to internal model failures reported in internal Phabricator tickets).

How do you think we could make this document more visible to the community, to collect comments from more user/developer profiles?
Do you think we could post an RFC on the forum, or something else?


ezyang commented Jun 12, 2024

I think the first goal is to lower the number of issues stuck waiting on reproducibility, also for the less advanced PyTorch developer.

For this, I think we need to actually do some coding, unfortunately. Even for experts like me it is not easy extracting repros from live production issues. The doc is really the best I know how to do right now.

I also expect, and it was the main scope of this ticket, that beyond the minifier we could really have something else to automate the isolation/reporting of a failure, other than the full TORCH_TRACE, which is probably a solution more oriented to released OSS models (or to internal model failures reported in internal Phabricator tickets).

Is there something wrong with TORCH_TRACE for reporting failures? I was hoping it would not be too burdensome for people to run with TORCH_TRACE and upload it with their bug report.

How do you think we could make this document more visible to the community to collect comments from more users/dev profiles?

I'm currently collecting comments internally, and then I'll be doing more social media in the wider community as it gets more baked.


bhack commented Jun 12, 2024

For this, I think we need to actually do some coding, unfortunately. Even for experts like me it is not easy extracting repros from live production issues. The doc is really the best I know how to do right now.

Is it so hard to trace the source of a decorated def failure and record its inputs? Because then we could at least tell the user to bisect to the right point by moving torch.compile from parent to leaf (or backward) and to upload the serialized inputs.

Is there something wrong with TORCH_TRACE for reporting failures? I was hoping it would not be too burdensome for people to run with TORCH_TRACE and upload it with their bug report.

It could be OK, but I think it will work mainly for released OSS models or small function traces without too many disclosure issues, or for internal (Meta) models, since the traces are shared in the internal tracking system. But what about research/unpublished models? I don't know how practical it is to share a more end-to-end trace in that case.


ezyang commented Jun 12, 2024

Is it so hard to trace the source of a decorated def failure and record its inputs? Because then we could at least tell the user to bisect to the right point by moving torch.compile from parent to leaf (or backward) and to upload the serialized inputs.

@bhack I actually added dumps for this at #126879 but I haven't gotten around to actually using it for something like repros.

It could be OK, but I think it will work mainly for released OSS models or small function traces without too many disclosure issues, or for internal (Meta) models, since the traces are shared in the internal tracking system. But what about research/unpublished models? I don't know how practical it is to share a more end-to-end trace in that case.

OK, that's fair. But I think we are quickly getting out of the zone of feasibility here. If you have a bug, that happens on a private model, and you cannot share detailed logs, and you are not expert enough to do some minimization / investigation on your own, then there's not really much you can do besides put up the error message and pray someone can look at it and figure it out as is.

In an ideal world, automatic repro production would work great for this sort of situation. But we actually have a little bit of experience with this in the minifier, and the problem is that as the bugs get harder and harder to reproduce, we need more and more fidelity out of the minifier, and this just becomes quite a lot of work to maintain in the terminal state. Sometimes, finding the needle in the haystack (what exactly you needed to produce the problem) is most of the way to solving the problem in the first place.


bhack commented Jun 12, 2024

Yes, I meant something probably "light" like #126879, at least to support bisecting defs: a quick way for triagers to replay the inputs on the failed def, so that in many cases the user could copy-paste the Python def source code and upload the recorded inputs as an attachment to replay.
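That replay idea could look something like the following plain-Python sketch. `record_on_failure`, the output directory layout, and the pickle format are all invented for illustration; real tensor inputs would go through `torch.save` rather than `pickle`.

```python
# Sketch: on exception, dump the failing def's inputs and traceback to disk
# so they can be attached to a bug report and replayed by a triager.
import functools
import pathlib
import pickle
import traceback

def record_on_failure(fn, out_dir="repro"):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            d = pathlib.Path(out_dir)
            d.mkdir(exist_ok=True)
            # Save the exact inputs that triggered the failure...
            (d / f"{fn.__name__}_inputs.pkl").write_bytes(
                pickle.dumps((args, kwargs)))
            # ...and the traceback, then let the error propagate as usual.
            (d / f"{fn.__name__}_traceback.txt").write_text(
                traceback.format_exc())
            raise
    return wrapper
```

A triager could then `pickle.load` the inputs file and call the copy-pasted def directly, without needing the reporter's dataset or training loop.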
