
[pipelining] add back support for multi-use parameters/buffers #126653

Closed

Conversation

kwen2501
Contributor

@kwen2501 kwen2501 commented May 19, 2024

Stack from ghstack (oldest at bottom):

Motivation

Resolves #126626 to support TorchTitan.

With this PR, we add back support for cases where a parameter or buffer is used in multiple stages. An example of such usage is in LLaMA (torchtitan), code snippet:

for layer in self.layers.values():
    h = layer(h, self.freqs_cis)

Solution

Step 1:
Remove the previous guards of `if len(node.users) == 1`.
Step 2:
Call `move_param_to_callee` multiple times, once for each stage ("callee").
Step 3:
Delay deletion of the `get_attr` node (for getting the param) from the root until the param has been sunk into every stage that uses it.

The PR also cleans up the old code around this (dropping the TRANSMIT mode and supporting REPLICATE mode only). A minimal sketch of the per-stage result follows below.
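For intuition only, here is a minimal sketch of the REPLICATE behavior (plain PyTorch with made-up names, not the splitter's actual output): the multi-use parameter ends up registered on each stage submodule, and only those per-stage copies remain in the model.

```
import torch
import torch.nn as nn

d_hid = 16

class Stage(nn.Module):
    # Illustrative stand-in for a submodule produced by splitting.
    def __init__(self, shared: torch.Tensor):
        super().__init__()
        # Each stage holds its own registered copy of the multi-use parameter.
        self.mm_param1 = nn.Parameter(shared.detach().clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.mm(x, self.mm_param1)

shared = torch.randn(d_hid, d_hid)
stages = [Stage(shared), Stage(shared)]  # the same value is replicated per stage

x = torch.randn(4, d_hid)
for stage in stages:
    x = stage(x)  # the replicated parameter participates in every stage that uses it
```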

Test

Changed the ExampleCode model to use mm_param1 in multiple stages.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k


pytorch-bot bot commented May 19, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126653

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 89de9ec with merge base 5ea956a:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed label May 19, 2024
kwen2501 added a commit that referenced this pull request May 19, 2024
Resolves #126626

ghstack-source-id: 35a3783f260d57972079289291f4ce827584d037
Pull Request resolved: #126653
@kwen2501 kwen2501 requested review from wconstab and H-Huang May 20, 2024 16:56
@wconstab
Contributor

Which titan issue was this addressing? something with freqs_cis?

@kwen2501
Contributor Author

See #126626. I filed it against pytorch rather than titan.
But yeah, it is wrt this code block in titan:

for layer in self.layers.values():
    h = layer(h, self.freqs_cis)

freqs_cis will be used in multiple stages once we cut the model into groups of layers.

self.mm_param1 = torch.nn.Parameter(torch.randn(d_hid, d_hid))
self.mm_param2 = torch.nn.Parameter(torch.randn(d_hid, d_hid))
self.lin1 = torch.nn.Linear(d_hid, d_hid)
self.lin2 = torch.nn.Linear(d_hid, d_hid)

def forward(self, x, y):
    x = torch.mm(x, self.mm_param0)
    x = torch.mm(x, self.mm_param1) # mutli-use param
Contributor


typo. (again, typo below)

logger.info(
    f"Parameter {node.target} used in multiple stages: {node.users}."  # noqa: G004
)
for user in node.users:
    assert user.op == "call_module"
    # Move parameter into submodule
    move_param_to_callee(
Contributor


does this affect the fqn of the shared parameter?

Contributor Author


No. This PR targets parameters (single FQN) used by multiple stages once the original model is split.

Contributor Author


@pianpwk 's PR targets the tied parameter case (aliasing):
#127094

skip_connection = x
x = x + y
x = torch.relu(x)
pipe_split()
x = torch.mm(x, self.mm_param1)
x = torch.mm(x, self.mm_param1) # mutli-use param
Contributor


do we have tests that verify fqn sanity (perhaps you added them along with unflattener)?

it'd be nice to confirm that when using multi-use param, the model's state_dict is clean and only has the original copy so checkpoint save/load will work as expected.

Contributor Author

@kwen2501 kwen2501 May 28, 2024


tbh, we don't have support for multi-use params in training yet, because that would require an all-reduce between the multiple copies of that param before the next batch's forward happens. So it would be kind of early to talk about how to save them before we can train them :)
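For context, a hedged sketch of roughly what that all-reduce would look like, assuming each rank's stage holds a local replica of the shared parameter and a process group spanning those stages is already initialized (the helper name and `group` argument are illustrative, not an existing API):

```
import torch
import torch.distributed as dist

def sync_shared_param_grad(param: torch.nn.Parameter, group=None) -> None:
    # Average this replica's gradient across the ranks whose stages also hold
    # a copy of the same parameter, so every replica applies the same update
    # before the next batch's forward.
    if param.grad is not None:
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=group)
        param.grad /= dist.get_world_size(group=group)
```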

Contributor Author


But multi-use buffers (as in the titan case) and multi-use params in inference are a different story; they can be supported today.

@wconstab
Contributor

I pulled this PR to see if it helps run torchtitan with the tracer. It does get further (it no longer errors during tracing), so presumably the freqs_cis thing is worked out.

But there is still a tracer issue with applying TP/DP while iterating over the transformer layers.

[screenshot of the tracer error]

@wconstab
Contributor

I checked the FQNs and they look correct to me. So I think this PR is good to land based on fixing the immediate issue with freqs_cis. However, we will need to do more work to verify e2e.
[screenshot of the FQN check]

callee = root.get_submodule(callee_name)
assert not hasattr(
    callee, param_fqn
), f"Module {callee_name} already has a parameter named {param_fqn}"

# Assign the parameter to the submodule
if is_buffer:
    _assign_attr(
Contributor


I'm kinda confused though, how come we can assign the attr to a submodule and not cause FQN duplication?

Contributor Author


We are moving the attr to the submodule.
The original attr will be removed IIRC.
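A self-contained sketch of that "move, not copy" behavior using plain `nn.Module` calls (the module and parameter names are illustrative; this is not the PR's `_assign_attr` helper itself):

```
import torch
import torch.nn as nn

root = nn.Module()
root.callee = nn.Module()  # stand-in for a pipeline stage submodule
root.register_parameter("mm_param1", nn.Parameter(torch.randn(4, 4)))

# Move the attribute: register it on the callee, then delete it from the root.
root.callee.register_parameter("mm_param1", root.mm_param1)
delattr(root, "mm_param1")

# Only one FQN remains, now under the stage submodule.
assert dict(root.named_parameters()).keys() == {"callee.mm_param1"}
```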

@pianpwk pianpwk self-requested a review May 28, 2024 18:46
Contributor

@pianpwk pianpwk left a comment


changes make sense, and stacked tests seem to work well

@kwen2501
Contributor Author

I pulled this PR to see if it helps run torchtitan with the tracer. It does get further (it no longer errors during tracing), so presumably the freqs_cis thing is worked out.

But there is still a tracer issue with applying TP/DP while iterating over the transformer layers.

[screenshot of the tracer error]

Thanks for checking.
The error you see is basically saying:
"I want a ModuleDict after split to be still a ModuleDict, and I want .items() to still work on it."
But that is currently not in pippy's contract -- what's broken is broken.
User code needs change to support all cases, e.g. .items() --> .children().
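A hedged sketch of that kind of user-side change, with toy names rather than torchtitan's actual code: `named_children()` and `register_module()` are available on every `nn.Module`, so the loop keeps working even after the split replaces the `ModuleDict`:

```
import torch.nn as nn

def rewrap_layers(layers: nn.Module, wrap) -> None:
    # Works whether or not `layers` is still a ModuleDict after splitting.
    for name, child in layers.named_children():
        layers.register_module(name, wrap(child))

# Toy usage: apply some per-layer wrapping (e.g. for TP/DP) to each child.
layers = nn.ModuleDict({"0": nn.Linear(4, 4), "1": nn.Linear(4, 4)})
rewrap_layers(layers, lambda m: nn.Sequential(m, nn.ReLU()))
```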

@kwen2501
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label May 28, 2024
@pytorchmergebot
Collaborator

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team: raised by workflow job

kwen2501 added a commit to pytorch/torchtitan that referenced this pull request May 29, 2024
…dules"


This PR fixes the issue mentioned [here](pytorch/pytorch#126653 (comment)):
"Module object has no attributed items."

The reason is, a split `ModuleDict` is no longer a `ModuleDict`. (Future support is not guaranteed.)

It would be more generally applicable if we use `named_children()` and `register_module()` to access and update submodules.

[ghstack-poisoned]
kwen2501 added a commit to pytorch/torchtitan that referenced this pull request May 29, 2024
This PR fixes the issue mentioned [here](pytorch/pytorch#126653 (comment)):
"Module object has no attributed items."

The reason is, a split `ModuleDict` is no longer a `ModuleDict`. (Future support is not guaranteed.)

It would be more generally applicable if we use `named_children()` and `register_module()` to access and update submodules.

[ghstack-poisoned]
@kwen2501 kwen2501 added the release notes: distributed (pipeline) label May 29, 2024
@kwen2501
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

kwen2501 added a commit to pytorch/torchtitan that referenced this pull request May 29, 2024
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* #362
* __->__ #371

This PR fixes the issue mentioned
[here](pytorch/pytorch#126653 (comment)):
"Module object has no attributed items."

The reason is, a split `ModuleDict` is no longer a `ModuleDict`.

It would be more generally applicable if we use `named_children()` and
`register_module()` to access and update submodules.
Aidyn-A pushed a commit to tinglvv/pytorch that referenced this pull request May 30, 2024
…ch#126653)

## Motivation
Resolves pytorch#126626 to support TorchTitan.

With this PR, we add back support for cases where a parameter or buffer is used in multiple stages. An example of such usage is in LLaMA (torchtitan), code snippet:
```
for layer in self.layers.values():
    h = layer(h, self.freqs_cis)
```

## Solution
Step 1:
Remove the previous guards of `if len(node.users) == 1`.
Step 2:
Call `move_param_to_callee` multiple times, one for each stage ("callee").
Step 3:
Delay deletion of the `get_attr` node (for getting the param) from root till this param has been sunk into each stage that uses it.

The PR also cleans up the old code around this (dropping the TRANSMIT mode and supporting REPLICATE mode only).

## Test
Changed the `ExampleCode` model to use `mm_param1` in multiple stages.

Pull Request resolved: pytorch#126653
Approved by: https://github.com/pianpwk
kwen2501 added a commit to pytorch/torchtitan that referenced this pull request Jun 1, 2024
…dules"


This PR fixes the issue mentioned [here](pytorch/pytorch#126653 (comment)):
"Module object has no attributed items."

The reason is, a split `ModuleDict` is no longer a `ModuleDict`.

It would be more generally applicable if we use `named_children()` and `register_module()` to access and update submodules.

[ghstack-poisoned]
kwen2501 added a commit to pytorch/torchtitan that referenced this pull request Jun 1, 2024
This PR fixes the issue mentioned [here](pytorch/pytorch#126653 (comment)):
"Module object has no attributed items."

The reason is, a split `ModuleDict` is no longer a `ModuleDict`.

It would be more generally applicable if we use `named_children()` and `register_module()` to access and update submodules.

[ghstack-poisoned]