Add NHWC support for group normalization #126635
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126635. Note: links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 7ad6380 with merge base 71f4915. This comment was automatically generated by Dr. CI and updates every 15 minutes.
auto cur_b = b[cur_sample * C + cur_channel];
Y[index] = (static_cast<T_ACC>(X[index]) + cur_b) * cur_a;
  }
}
I'm actually not sure how to achieve this behavior using tensor iterators and gpu_kernel, so it may be the case that this is unneeded.
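As a side note, the per-element indexing this NHWC kernel performs can be sketched in plain Python (`apply_scale_bias_nhwc` is a hypothetical name; flat lists stand in for device buffers, and the loop stands in for the CUDA thread grid):

```python
def apply_scale_bias_nhwc(X, a, b, N, H, W, C):
    # X is a flat list in NHWC (channels-last) order, so the channel index
    # is the fastest-moving dimension; a and b are flat (N*C) per-sample,
    # per-channel scale/bias tables, as in the kernel above.
    Y = [0.0] * (N * H * W * C)
    for index in range(N * H * W * C):
        cur_sample = index // (H * W * C)  # which image in the batch
        cur_channel = index % C            # channel is innermost in NHWC
        cur_a = a[cur_sample * C + cur_channel]
        cur_b = b[cur_sample * C + cur_channel]
        Y[index] = (X[index] + cur_b) * cur_a
    return Y
```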
for (int64_t c = 0; c < group_channels; c++) {
  val = welford_op.reduce(val, static_cast<T_ACC>(X[index + c]), index + c);
}
This kernel uses a different indexing strategy that works for NHWC tensors and uses Welford's algorithm. Aside from that, the logic is very similar.
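For reference, Welford's algorithm computes the mean and variance in a single, numerically stable pass; a minimal Python sketch:

```python
def welford(xs):
    # Welford's online algorithm: one pass, no large intermediate sums,
    # so it stays accurate even when values are large and close together.
    count, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # note: uses the *updated* mean
    variance = m2 / count  # population variance, as used in normalization
    return mean, variance
```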
case MemoryFormat::ChannelsLast: {
  ApplyScaleBiasNHWCKernel<T><<<N * G, num_threads, 0, cuda_stream>>>(
      X_data, Y_data, N, height, width, C, D * HxW, a_data, b_data);
  C10_CUDA_KERNEL_LAUNCH_CHECK();
  break;
I do not know how to do this with a tensor iterator. If someone could show me how, that would be great.
I intend to add tests for this.
Really looking forward to this! I've found that, at least in some cases, a naive implementation is actually faster than the existing native group norm kernel when using channels-last format, and I'd love to see how much better a proper channels-last kernel does.
I'm interested to know the context: what GPU and architecture are you on? If possible, could you give me a minimal reproducible example of a naive implementation outperforming native, so I could take a look?
This was on a 3090, using Stable Diffusion 1.5 in inference mode. I'm not sure that it would be easy to make a minimal reproducible example, because I think it is at least partially dependent on having operations dispatched fairly far ahead of how fast they execute on the GPU. But to summarize, I first made sure that every
Running the tests that failed before in
  break;
}
default: {
  break; // is this okay?
If it's unsupported, we might want to explicitly raise an exception here.
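A minimal Python sketch of that suggestion (hypothetical function and format names; the real code would use TORCH_CHECK in C++): fail loudly on unsupported memory formats instead of silently falling through the default case.

```python
def dispatch_kernel(memory_format, supported=("contiguous", "channels_last")):
    # Raise on unsupported formats rather than silently doing nothing,
    # so users learn immediately that their layout isn't handled.
    if memory_format not in supported:
        raise ValueError(f"unsupported memory format: {memory_format}")
    return memory_format
```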
int num_blocks =
    (N * height * width * C + kCUDANumThreads - 1) / kCUDANumThreads;
ApplyScaleBiasNHWCKernel<T>
Could the previous TensorIterator code just be adapted by adding a permute before the view here?
Do you mean to use permute to convert them to NCHW, and then convert them back to NHWC after it's done? Please correct me if I'm wrong; I apologize, as I'm not that familiar with getting these iterators to work.
Something like this?
auto X_permuted = X.permute({0, 3, 1, 2}); // N, C, H, W
auto Y_permuted = Y.permute({0, 3, 1, 2}); // N, C, H, W
TensorIterator iter =
TensorIteratorConfig()
.check_all_same_dtype(std::is_same<T, T_ACC>::value)
.resize_outputs(false)
.add_owned_output(Y_permuted.view({N * C, H * W}))
.add_owned_const_input(X_permuted.view({N * C, H * W}))
.add_owned_input(a.view({N * C, 1}))
.add_owned_input(b.view({N * C, 1}))
.build();
gpu_kernel(iter, [] GPU_LAMBDA(T x, T_ACC a, T_ACC b) -> T {
return a * static_cast<T_ACC>(x) + b;
});
Y = Y_permuted.permute({0, 2, 3, 1}); // N, H, W, C
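For what it's worth, the reason the `.view` after a permute is problematic can be sketched in plain Python (`nhwc_to_nchw_flat` is a hypothetical helper; a flat list stands in for tensor storage): in NHWC storage, the `H*W` elements belonging to one channel are strided `C` apart, so the `(N*C, H*W)` layout the NCHW-style code expects requires a gather/copy rather than a zero-copy reinterpretation.

```python
def nhwc_to_nchw_flat(X, N, C, H, W):
    # X is flat channels-last storage: element (n, h, w, c) lives at
    # offset n*H*W*C + h*W*C + w*C + c. Collect each channel's H*W
    # elements into a row, producing the (N*C, H*W) shape the NCHW
    # pointwise code views the tensor as. This is a copy: the source
    # elements of one row are not contiguous in X.
    out = []
    for n in range(N):
        for c in range(C):
            row = [X[n * H * W * C + h * W * C + w * C + c]
                   for h in range(H) for w in range(W)]
            out.append(row)
    return out
```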
Yes, something like that. I'm not sure if TensorIterator is smart about getting the right (fast) memory access pattern for the permuted tensor, but it's worth an attempt.
@eqy So I get this with the above code:
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
Am I missing some steps here, or is it just not straightforward to work with these iterators when the tensors are in NHWC?
(Somewhat off-topic) Curious to know whether these reused tensor iterator kernels have better compile times than hand-defined kernels, or do they have clever dispatching and optimizations going on?
Reshape should be fine in this case as well (same semantics as what we intend).
I wouldn't say it's really about compile time in this case. TensorIterator kernels save a lot of boilerplate code (imagine writing the same kernel for every pointwise op variant), but they also do a good job of achieving high B/W utilization for common-case workloads. There's a fair amount of optimization work that was done by e.g. @zasdfgbnm, and if you look at the underlying kernels they dispatch to, you'll find the same optimizations that you would find in manually optimized kernels (vectorization, loop unrolling, etc.).
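The boilerplate-saving idea can be sketched in plain Python (hypothetical names, not the real TensorIterator API): the elementwise driver is written once, and each pointwise op reduces to a one-line lambda.

```python
def elementwise(op, *arrays):
    # Generic pointwise driver. In the real thing, broadcasting, dtype
    # promotion, vectorization, and launch configuration would live here
    # once, instead of being rewritten in every hand-coded kernel.
    return [op(*vals) for vals in zip(*arrays)]

# Each new pointwise op is now a one-liner:
scale_bias = lambda x, a, b: a * x + b
relu = lambda x: max(x, 0.0)
```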
Hm, I can't seem to get the above code to work with reshape either: 100% mismatched elements in the tests, lol. I've tried quite a few different approaches, but I can't seem to get it right...
I can spend some time optimizing this kernel if need be with some vectorization, but I'll have to defer to you on whether that level of hand-tuned optimization is less maintainable/desired at this layer of PyTorch.
Do you think you could help me get the iterator part right? Sorry for the trouble 😅
helper(self, (2, 9, 7, 200, 15), 3, torch.channels_last_3d, is_mixed)
helper(self, (2, 60, 7, 200, 15), 3, torch.channels_last_3d, is_mixed)

if device == 'cpu':
Maybe leave a comment about channels_last_3d not being supported on CUDA?
Please back out the kineto submodule change.
Is there an easy way to achieve this? I'd like to avoid messing up and doing a bad rebase, somehow leading to 100+ people getting notified.
Edit: I seem to have figured it out, but I'd like to avoid running into this problem in the future. Could you outline your git workflow and what commands you run, etc.? Do you do an interactive add, or do you run commands for the submodules regularly?
Fixes #111824
Currently, if the user specifies their group normalization to be of NHWC format, PyTorch will default to NCHW tensors and convert. This conversion is not immediately obvious to the user unless they check the format themselves, which is not intuitive. This PR adds support for NHWC on CUDA by adding the necessary kernels.
cc: @mikaylagawarecki