

Hello,This is a followup to Adam’s original RFC (RFC: Firstclass Matrix type, [1]) and his followup ([RFC] Matrix support (take 2, [2]). The main sticking points of the last thread were centred around representing padding, propagating shape information (dimensions) and the number of proposed intrinsics [8].We've incorporated the feedback into our design and would like to share our updated proposal. We stripped it down to the minimum required to realize the major benefits, like elimination of loads/stores to temporaries and generation of tuned loops for matrix multiplies. We discuss further extensions and points that came up during the earlier discussions after the new proposal.TLDR: Our new proposal boils down to adding 4 matrix related intrinsics (transpose, multiply, and strided load & store) with shape information and an IR pass, that lowers the intrinsics to regular vector operations. The IR pass can also be extended to break down operations into chunks suitable for specialised matrix ISAs (like the matrix instructions included in ARM v8.6) and to reducing spilling. We've designed the approach so it's flexible and easy to extend.ProposalAfter the initial feedback, we evaluated using ‘flattened’ vectors to hold matrix values, instead of adding a new matrix type, as suggested originally. This was option #1 suggested in [3]. With that in mind, we propose to add a set of experimental intrinsics for matrix operations that require information about the shape/layout of the underlying matrix. The suggested intrinsics take the shape information as additional (constant) integer arguments and column major layout is assumed initially. With the new proposal, there are very few intrinsics that actually care about the memory layout and they can be easily extended to also support rowmajor layouts.Initially, we would like to propose the following intrinsics: <C x Ty> matrix_transpose(<C x Ty> %in, i32 <M>, i32 <N>)
Treat %in as containing a matrix with M rows and N columns and transpose it.
 <C x Ty> matrix_multiply(<A x Ty> %X, <B x Ty> %Y, i32 <M>, i32 <N>, i32<K>)
Treat %X as matrix with M rows and K columns, %Y as matrix with K rows and N columns and multiply them.
 <C x Ty>matrix_columnwise_load(Ty* %Ptr, i64 %Stride, i32 <M>, i32 <N>)
Load a matrix with M rows and N columns, using a stride of %Stride between columns. This allows for convenient loading of sub matrixes.
 void matrix_columnwise_store(<C x Ty> %MatrixVal, ty* %Ptr, i64 %Stride, i32 <M>, i32 <N>)
Store a matrix with M rows and N columns, using a stride of %Stride between columns. This allows for convenient storing of sub matrixes. The floating point versions of the intrinsics also take fastmath flags, which can be used to optin to FMA generation and/or constant folding opportunities via NoInfs and NoNaNs. We plan to add them to the lowered instructions and rely on InstCombine & Co for related optimisations.The intrinsics will be lowered to regular LLVM vector operations in a IR lowering pass. This means per default, we can lower the builtins on all targets. Additionally, targets can implement their own lowering to a specialized ISA extension. We implemented such a lowering for an internal matrix accelerator. The example below shows how the matrix_transpose builtin will be lowered in the IR lowering pass.define void @transpose(<8 x double> * %A, <8 x double>* %B) { entry:%a = load <8 x double>, <8 x double>* %A, align 16%b = call <8 x double> @llvm.matrix.transpose(<8 x double> %a, i32 2, i32 4)store <8 x double> %c, <8 x double>* %B, align 16ret void}declare <8 x double> @llvm.matrix.transpose(<8 x double>, i32, i32)Will get lowered to something like define void @transpose(<8 x double> * %A, <8 x double>* %B) {entry:%0 = bitcast <8 x double>* %A to <2 x double>*%1 = load <2 x double>, <2 x double>* %0, align 8%2 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 2%3 = bitcast double* %2 to <2 x double>*%4 = load <2 x double>, <2 x double>* %3, align 8%5 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 4%6 = bitcast double* %5 to <2 x double>*%7 = load <2 x double>, <2 x double>* %6, align 8%8 = shufflevector <2 x double> %7, <2 x double> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>%9 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 6%10 = bitcast double* %9 to <2 x double>*%11 = load <2 x double>, <2 x double>* %10, align 8%12 = shufflevector <2 x double> %11, <2 x double> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>%13 = shufflevector <2 x double> %1, <2 x double> %4, <4 x i32> <i32 0, i32 2, i32 undef, i32 undef>%14 = shufflevector <4 x double> %13, <4 x double> %8, <4 x i32> <i32 0, i32 1, i32 4, i32 undef>%15 = shufflevector <4 x double> %14, <4 x double> %12, <4 x i32> <i32 0, i32 1, i32 2, i32 4>%16 = shufflevector <2 x double> %1, <2 x double> %4, <4 x i32> <i32 1, i32 3, i32 undef, i32 undef>%17 = shufflevector <4 x double> %16, <4 x double> %8, <4 x i32> <i32 0, i32 1, i32 5, i32 undef>%18 = shufflevector <4 x double> %17, <4 x double> %12, <4 x i32> <i32 0, i32 1, i32 2, i32 5>%19 = bitcast <8 x double>* %C to <4 x double>*store <4 x double> %15, <4 x double>* %19, align 8%20 = getelementptr <8 x double>, <8 x double>* %C, i64 0, i64 4%21 = bitcast double* %20 to <4 x double>*store <4 x double> %18, <4 x double>* %21, align 8ret void}Before we do the actual lowering, we propagate the shape information from intrinsics to connected instructions. This allows us to improve the code we generate for regular IR operations on matrixes embedded in a flattened vector. In the example above, we propagate the information that we are loading a matrix with 2 rows and 4 columns to `%a = load <8 x double>, <8 x double>* %A, align 16` and lower it to a series of `load <2 x double>, <2 x double>*`, which helps with avoiding a large number of shufflevector instructions to get column vectors. Please note that propagating the shape information allows us to improve the code we generate during lowering, but is not required for correctness. Without propagating shape information, we would just need additional shuffles to extract the rows/columns at the point where we lower a matrix multiply for example.This infrastructure can also be used to improve codegen for large vector operations in general. For example, consider loading 2 <100 x double> vectors, adding them and storing the result. Currently this will be lowered naively doing most of the loads first, then most of the adds and then most of the stores, causing plenty of spilling on AArch64. We can significantly improve the generated code by breaking it into load/add/store blocks that fit into the target registers.We also plan to implement some additional optimisations as part of the lowering, like generating fused/tiled loops for matrix multiplications and direct FMA generation.As future work, we are also evaluating adding additional operations including clamping a vector or matrix, min/max of matrixes, inverting a matrix and computing matrix determinates. Some of those might require additional intrinsics, which we can discuss on a case by case basis.In terms of integration, we are planning on adding a separate MatrixIRBuilder utility (as suggested in [4]), which can be used by frontends to create various matrix operations. For point wise operations, they will lower to vector instructions directly. Potential Future ExtensionsDuring the previous discussions, a few extensions were discussed, but we would like exclude them from the initial implementation. Those are NDimension
Given that the dimensions are represented as arguments to intrinsics, we can go from supporting 2 dimensions to N dimensions by using variadic arguments for shape information and handle it accordingly while lowering.  Padding
For small matrixes, it might be desirable to automatically add additional padding to columns/rows, e.g. add 1 element padding to each column in a 3 x 3 matrix, to allow for using vector instructions operating on powerof2 number of elements or satisfy an alignment requirement by a target. This allows for additional optimizations, but is not required for lowering the intrinsics. We also haven't seen this being an issue so far. We should be able to iterate on that, once it becomes an issue. (Earlier discussion [5] )
Support in ClangWe are also planning to expose the matrix support in Clang via a matrix type on the C/C++ level (similar to the ext_vector_type attribute) together with a set of matrix builtins. Similar to the LLVM part, we would like to propose a pragmatic solution for common matrix operations, while being flexible enough to allow iterative improvements to accommodate additional requirements. The main motivation for the matrix support on the Clang side is to give users a way to Guarantee generating highquality code for matrix operations and trees of matrix operations. For isolated operations, we can guarantee vector code generation suitable for the target. For trees of operations, the proposed value type helps with eliminating temporary loads & stores.
 Make use of specialized matrix ISA extensions, like the new matrix instructions in ARM v8.6 or various proprietary matrix accelerators, in their C/C++ code.
 Move optimisations from matrix wrapper libraries into the compiler. We use it internally to simplify an Eigenstyle matrix library, by relying on LLVM for generating tiled & fused loops for matrix operations.
We propose adding a new matrix value type, that can be declared via a `matrix_type` attribute. Alternatively we could also generalise the existing ext_vector_type attribute ([6]), if that is preferred. Initially, the matrix type supports 2 dimensions (rows & columns). For example, a 4x4 float matrix would be declared as `typedef float m4x4_t __attribute__((matrix_type(4, 4)));`. Initially we want to focus on 2D matrixes without padding in columnmajor layout as a concrete use case. But our proposed type can be extended naturally to Support N (known constant) dimensions by turning matrix_type attribute into a variadic attribute.
 Support column/rowwise padding, by adding a column_padding clause to the attribute.
Dealing with the padding could be exclusively handled on the frontend side, by emitting additional shufflevector instructions to extract the data. If there is a desire to exploit the padding more on the LLVM side, we can add a set of intrinsics for that.  Support row & column major layouts, by adding a layout clause to the attribute.
Again, this naively could be handled while lowering to LLVM IR in Clang using shufflevector to produce flattened vectors with the required layout. For better optimisations, the LLVM intrinsics relying on shape/layout information can be extended to take the layout as additional argument. Through propagating the layout information similar to the dimensions, we should be able to optimise the points where we need to transform the layout of the underlying matrixes. In all cases, we require known constants as dimensions and we do not plan to support dynamic dimensions for now.In addition to the type, we propose adding a set of overloaded builtins to perform operations on matrix values. To start with, we would like to add the following builtins: __builtin_matrix_add, __builtin_matrix_subtract, __builtin_matrix_multiply, __builtin_matrix_scalar_multiply, __builtin_matrix_transpose, __builtin_matrix_insert, __builtin_matrix_extract, __builtin_matrix_column_load, __builtin_matrix_column_store. Those builtins allowed us to cover a wide range of uses cases internally, but as mentioned on the LLVM side already, we are also evaluating adding additional operations, like clamping matrixes or getting the minimum/maximum of a matrix. We should be able to iterate on the set of operations, as required by actual users.In terms of LLVM integration, we plan to provide a MatrixIRBuilder (see LLVM part), that can lower the various builtins to LLVM IR (in terms of matrix builtins or vector operations, depending on the builtin).We currently have an internal implementation based on Adam’s original proposal ('first class matrix type') and are using that for a large internal project in production. We also prototyped the approach outlined in this proposal ('flattened matrixes') to make sure we can produce equivalent code (and thus performance) with both approaches on a large codebase.We think predication is out of scope for the initial proposal and the right forum to discuss predication support in general are the related discussions on the list, the LLVMVP proposal, [7] and the round table at the Dev Meeting.We think our current proposal addresses the concerns raised previously, especially the concerns around the high cost of adding a new IR type, adding too many new intrinsics and generalising the approach to N dimensions. Unless there are any additional major concerns, we should be able to share patches for review soon.Cheers,Florian[1] http://lists.llvm.org/pipermail/llvmdev/2018October/126871.html[2] http://lists.llvm.org/pipermail/llvmdev/2018December/128322.html[3] http://lists.llvm.org/pipermail/llvmdev/2018October/126982.html[4] http://lists.llvm.org/pipermail/llvmdev/2018December/128636.html[5] http://lists.llvm.org/pipermail/llvmdev/2018December/128331.html[6] https://clang.llvm.org/docs/LanguageExtensions.html#vectorsandextendedvectors[7] https://reviews.llvm.org/D57504[8] http://lists.llvm.org/pipermail/llvmdev/2018December/128636.html_______________________________________________
cfedev mailing list
[hidden email]
https://lists.llvm.org/cgibin/mailman/listinfo/cfedev


Hi Florian, thanks for the update! On Oct 28, 2019, at 9:07 AM, Florian Hahn via cfedev < [hidden email]> wrote:
... After the initial feedback, we evaluated using ‘flattened’ vectors to hold matrix values, instead of adding a new matrix type, as suggested originally. This was option #1 suggested in [3].
Nice! With that in mind, we propose to add a set of experimental intrinsics for matrix operations that require information about the shape/layout of the underlying matrix. The suggested intrinsics take the shape information as additional (constant) integer arguments and column major layout is assumed initially. With the new proposal, there are very few intrinsics that actually care about the memory layout and they can be easily extended to also support rowmajor layouts.Initially, we would like to propose the following intrinsics: <C x Ty> matrix_transpose(<C x Ty> %in, i32 <M>, i32 <N>)
Treat %in as containing a matrix with M rows and N columns and transpose it.
 <C x Ty> matrix_multiply(<A x Ty> %X, <B x Ty> %Y, i32 <M>, i32 <N>, i32<K>)
Treat %X as matrix with M rows and K columns, %Y as matrix with K rows and N columns and multiply them.
 <C x Ty>matrix_columnwise_load(Ty* %Ptr, i64 %Stride, i32 <M>, i32 <N>)
Load a matrix with M rows and N columns, using a stride of %Stride between columns. This allows for convenient loading of sub matrixes.
 void matrix_columnwise_store(<C x Ty> %MatrixVal, ty* %Ptr, i64 %Stride, i32 <M>, i32 <N>)
Store a matrix with M rows and N columns, using a stride of %Stride between columns. This allows for convenient storing of sub matrixes.
This all seems reasonable to me  a constrained extension that can start out as experimental. I personally don’t have any significant objections to this. The floating point versions of the intrinsics also take fastmath flags, which can be used to optin to FMA generation and/or constant folding opportunities via NoInfs and NoNaNs. We plan to add them to the lowered instructions and rely on InstCombine & Co for related optimisations.
What is your though on the FP semantics of matmul? There is a lot of room for interpretation and various ‘fast’ approximations that are occasionally interesting. Should this use the existing fast math flags or are they different? The intrinsics will be lowered to regular LLVM vector operations in a IR lowering pass. This means per default, we can lower the builtins on all targets.
Nice! Before we do the actual lowering, we propagate the shape information from intrinsics to connected instructions. This allows us to improve the code we generate for regular IR operations on matrixes embedded in a flattened vector. In the example above, we propagate the information that we are loading a matrix with 2 rows and 4 columns to `%a = load <8 x double>, <8 x double>* %A, align 16` and lower it to a series of `load <2 x double>, <2 x double>*`, which helps with avoiding a large number of shufflevector instructions to get column vectors. Please note that propagating the shape information allows us to improve the code we generate during lowering, but is not required for correctness. Without propagating shape information, we would just need additional shuffles to extract the rows/columns at the point where we lower a matrix multiply for example.
I’m very glad to hear that this only affects performance but not correctness. This ensures that the extension is correct in the face of outlining and various code motion passes that can introduces weird phi nodes that may (in unusual cases) escape the limits of your reasoning. As future work, we are also evaluating adding additional operations including clamping a vector or matrix, min/max of matrixes, inverting a matrix and computing matrix determinates. Some of those might require additional intrinsics, which we can discuss on a case by case basis.
+1. It would be interesting to extend the set of target independent vector intrinsics in general. Potential Future Extensions
 Padding
For small matrixes, it might be desirable to automatically add additional padding to columns/rows, e.g. add 1 element padding to each column in a 3 x 3 matrix, to allow for using vector instructions operating on powerof2 number of elements or satisfy an alignment requirement by a target. This allows for additional optimizations, but is not required for lowering the intrinsics. We also haven't seen this being an issue so far. We should be able to iterate on that, once it becomes an issue. (Earlier discussion [5] )
Random thought, but would it be enough to model these as undef elements if they became important? We propose adding a new matrix value type, that can be declared via a `matrix_type` attribute. Alternatively we could also generalise the existing ext_vector_type attribute ([6]), if that is preferred.
I don’t care *strongly* about this at all, but I have a weak preference for introducing a new attribute for this. There is effectively no cost to doing so, and this would make the documentation in the clang extensions manual much easier to understand.
In all cases, we require known constants as dimensions and we do not plan to support dynamic dimensions for now.
I’m sure the clang folks will want to know what you mean by ‘constant’s in terms of ICE, constexpr, integer template arguments, etc... We think our current proposal addresses the concerns raised previously, especially the concerns around the high cost of adding a new IR type, adding too many new intrinsics and generalising the approach to N dimensions. Unless there are any additional major concerns, we should be able to share patches for review soon.
+1, sounds like a very promising direction, thank you!
Chris
_______________________________________________
cfedev mailing list
[hidden email]
https://lists.llvm.org/cgibin/mailman/listinfo/cfedev


Hi Chris,
Thanks for taking a look!
Hi Florian, thanks for the update! On Oct 28, 2019, at 9:07 AM, Florian Hahn via cfedev < [hidden email]> wrote:
... After the initial feedback, we evaluated using ‘flattened’ vectors to hold matrix values, instead of adding a new matrix type, as suggested originally. This was option #1 suggested in [3].
Nice! With that in mind, we propose to add a set of experimental intrinsics for matrix operations that require information about the shape/layout of the underlying matrix. The suggested intrinsics take the shape information as additional (constant) integer arguments and column major layout is assumed initially. With the new proposal, there are very few intrinsics that actually care about the memory layout and they can be easily extended to also support rowmajor layouts.Initially, we would like to propose the following intrinsics: <C x Ty> matrix_transpose(<C x Ty> %in, i32 <M>, i32 <N>)
Treat %in as containing a matrix with M rows and N columns and transpose it.
 <C x Ty> matrix_multiply(<A x Ty> %X, <B x Ty> %Y, i32 <M>, i32 <N>, i32<K>)
Treat %X as matrix with M rows and K columns, %Y as matrix with K rows and N columns and multiply them.
 <C x Ty>matrix_columnwise_load(Ty* %Ptr, i64 %Stride, i32 <M>, i32 <N>)
Load a matrix with M rows and N columns, using a stride of %Stride between columns. This allows for convenient loading of sub matrixes.
 void matrix_columnwise_store(<C x Ty> %MatrixVal, ty* %Ptr, i64 %Stride, i32 <M>, i32 <N>)
Store a matrix with M rows and N columns, using a stride of %Stride between columns. This allows for convenient storing of sub matrixes.
This all seems reasonable to me  a constrained extension that can start out as experimental. I personally don’t have any significant objections to this. The floating point versions of the intrinsics also take fastmath flags, which can be used to optin to FMA generation and/or constant folding opportunities via NoInfs and NoNaNs. We plan to add them to the lowered instructions and rely on InstCombine & Co for related optimisations.
What is your though on the FP semantics of matmul? There is a lot of room for interpretation and various ‘fast’ approximations that are occasionally interesting. Should this use the existing fast math flags or are they different?
Initially we thought of using the existing fast math flags directly, apply them to the generated instruction while lowering and rely on the existing related optimizations in InstCombine & co. But it should be possible to handle additional FP semantics directly while doing the lowering, if there are users that could benefit. The intrinsics will be lowered to regular LLVM vector operations in a IR lowering pass. This means per default, we can lower the builtins on all targets.
Nice! Before we do the actual lowering, we propagate the shape information from intrinsics to connected instructions. This allows us to improve the code we generate for regular IR operations on matrixes embedded in a flattened vector. In the example above, we propagate the information that we are loading a matrix with 2 rows and 4 columns to `%a = load <8 x double>, <8 x double>* %A, align 16` and lower it to a series of `load <2 x double>, <2 x double>*`, which helps with avoiding a large number of shufflevector instructions to get column vectors. Please note that propagating the shape information allows us to improve the code we generate during lowering, but is not required for correctness. Without propagating shape information, we would just need additional shuffles to extract the rows/columns at the point where we lower a matrix multiply for example.
I’m very glad to hear that this only affects performance but not correctness. This ensures that the extension is correct in the face of outlining and various code motion passes that can introduces weird phi nodes that may (in unusual cases) escape the limits of your reasoning. As future work, we are also evaluating adding additional operations including clamping a vector or matrix, min/max of matrixes, inverting a matrix and computing matrix determinates. Some of those might require additional intrinsics, which we can discuss on a case by case basis.
+1. It would be interesting to extend the set of target independent vector intrinsics in general.
Potential Future Extensions
 Padding
For small matrixes, it might be desirable to automatically add additional padding to columns/rows, e.g. add 1 element padding to each column in a 3 x 3 matrix, to allow for using vector instructions operating on powerof2 number of elements or satisfy an alignment requirement by a target. This allows for additional optimizations, but is not required for lowering the intrinsics. We also haven't seen this being an issue so far. We should be able to iterate on that, once it becomes an issue. (Earlier discussion [5] )
Random thought, but would it be enough to model these as undef elements if they became important?
Interesting! Yes, I think we could use undef while lowering the builtins in clang, to deal with padding. We propose adding a new matrix value type, that can be declared via a `matrix_type` attribute. Alternatively we could also generalise the existing ext_vector_type attribute ([6]), if that is preferred.
I don’t care *strongly* about this at all, but I have a weak preference for introducing a new attribute for this. There is effectively no cost to doing so, and this would make the documentation in the clang extensions manual much easier to understand.
In all cases, we require known constants as dimensions and we do not plan to support dynamic dimensions for now.
I’m sure the clang folks will want to know what you mean by ‘constant’s in terms of ICE, constexpr, integer template arguments, etc... We think our current proposal addresses the concerns raised previously, especially the concerns around the high cost of adding a new IR type, adding too many new intrinsics and generalising the approach to N dimensions. Unless there are any additional major concerns, we should be able to share patches for review soon.
+1, sounds like a very promising direction, thank you!
Thanks for all the valuable input during the earlier discussions!
Cheers, Florian _______________________________________________
cfedev mailing list
[hidden email]
https://lists.llvm.org/cgibin/mailman/listinfo/cfedev


In reply to this post by Kristof Beyls via cfedev
Thanks a lot for the comments so far!
I’m planning to start preparing a set of initial patches for the LLVM side in a couple of days. Please let me know if there are any remaining highlevel concerns.
I received additional feedback offlist to try to remove the transpose and load/store intrinsics initially. I'll probably exclude them from the initial set of patches (we currently mostly have them for convenience) and add them as we go along, if required.
Cheers,
Florian
> On Oct 28, 2019, at 17:07, Florian Hahn via llvmdev < [hidden email]> wrote:
>
> Hello,
>
> This is a followup to Adam’s original RFC (RFC: Firstclass Matrix type, [1]) and his followup ([RFC] Matrix support (take 2, [2]). The main sticking points of the last thread were centred around representing padding, propagating shape information (dimensions) and the number of proposed intrinsics [8].
>
> We've incorporated the feedback into our design and would like to share our updated proposal. We stripped it down to the minimum required to realize the major benefits, like elimination of loads/stores to temporaries and generation of tuned loops for matrix multiplies. We discuss further extensions and points that came up during the earlier discussions after the new proposal.
>
> TLDR: Our new proposal boils down to adding 4 matrix related intrinsics (transpose, multiply, and strided load & store) with shape information and an IR pass, that lowers the intrinsics to regular vector operations. The IR pass can also be extended to break down operations into chunks suitable for specialised matrix ISAs (like the matrix instructions included in ARM v8.6) and to reducing spilling. We've designed the approach so it's flexible and easy to extend.
> Proposal
>
> After the initial feedback, we evaluated using ‘flattened’ vectors to hold matrix values, instead of adding a new matrix type, as suggested originally. This was option #1 suggested in [3].
>
> With that in mind, we propose to add a set of experimental intrinsics for matrix operations that require information about the shape/layout of the underlying matrix. The suggested intrinsics take the shape information as additional (constant) integer arguments and column major layout is assumed initially. With the new proposal, there are very few intrinsics that actually care about the memory layout and they can be easily extended to also support rowmajor layouts.
>
> Initially, we would like to propose the following intrinsics:
> • <C x Ty> matrix_transpose(<C x Ty> %in, i32 <M>, i32 <N>)
> Treat %in as containing a matrix with M rows and N columns and transpose it.
>
> • <C x Ty> matrix_multiply(<A x Ty> %X, <B x Ty> %Y, i32 <M>, i32 <N>, i32<K>)
> Treat %X as matrix with M rows and K columns, %Y as matrix with K rows and N columns and multiply them.
>
> • <C x Ty>matrix_columnwise_load(Ty* %Ptr, i64 %Stride, i32 <M>, i32 <N>)
> Load a matrix with M rows and N columns, using a stride of %Stride between columns. This allows for convenient loading of sub matrixes.
>
> • void matrix_columnwise_store(<C x Ty> %MatrixVal, ty* %Ptr, i64 %Stride, i32 <M>, i32 <N>)
> Store a matrix with M rows and N columns, using a stride of %Stride between columns. This allows for convenient storing of sub matrixes.
>
> The floating point versions of the intrinsics also take fastmath flags, which can be used to optin to FMA generation and/or constant folding opportunities via NoInfs and NoNaNs. We plan to add them to the lowered instructions and rely on InstCombine & Co for related optimisations.
>
> The intrinsics will be lowered to regular LLVM vector operations in a IR lowering pass. This means per default, we can lower the builtins on all targets. Additionally, targets can implement their own lowering to a specialized ISA extension. We implemented such a lowering for an internal matrix accelerator.
>
> The example below shows how the matrix_transpose builtin will be lowered in the IR lowering pass.
>
> define void @transpose(<8 x double> * %A, <8 x double>* %B) {
> entry:
> %a = load <8 x double>, <8 x double>* %A, align 16
> %b = call <8 x double> @llvm.matrix.transpose(<8 x double> %a, i32 2, i32 4)
> store <8 x double> %c, <8 x double>* %B, align 16
> ret void
> }
> declare <8 x double> @llvm.matrix.transpose(<8 x double>, i32, i32)
>
> Will get lowered to something like
>
> define void @transpose(<8 x double> * %A, <8 x double>* %B) {
> entry:
> %0 = bitcast <8 x double>* %A to <2 x double>*
> %1 = load <2 x double>, <2 x double>* %0, align 8
> %2 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 2
> %3 = bitcast double* %2 to <2 x double>*
> %4 = load <2 x double>, <2 x double>* %3, align 8
> %5 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 4
> %6 = bitcast double* %5 to <2 x double>*
> %7 = load <2 x double>, <2 x double>* %6, align 8
> %8 = shufflevector <2 x double> %7, <2 x double> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
> %9 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 6
> %10 = bitcast double* %9 to <2 x double>*
> %11 = load <2 x double>, <2 x double>* %10, align 8
> %12 = shufflevector <2 x double> %11, <2 x double> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
> %13 = shufflevector <2 x double> %1, <2 x double> %4, <4 x i32> <i32 0, i32 2, i32 undef, i32 undef>
> %14 = shufflevector <4 x double> %13, <4 x double> %8, <4 x i32> <i32 0, i32 1, i32 4, i32 undef>
> %15 = shufflevector <4 x double> %14, <4 x double> %12, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
> %16 = shufflevector <2 x double> %1, <2 x double> %4, <4 x i32> <i32 1, i32 3, i32 undef, i32 undef>
> %17 = shufflevector <4 x double> %16, <4 x double> %8, <4 x i32> <i32 0, i32 1, i32 5, i32 undef>
> %18 = shufflevector <4 x double> %17, <4 x double> %12, <4 x i32> <i32 0, i32 1, i32 2, i32 5>
> %19 = bitcast <8 x double>* %C to <4 x double>*
> store <4 x double> %15, <4 x double>* %19, align 8
> %20 = getelementptr <8 x double>, <8 x double>* %C, i64 0, i64 4
> %21 = bitcast double* %20 to <4 x double>*
> store <4 x double> %18, <4 x double>* %21, align 8
> ret void
> }
>
> Before we do the actual lowering, we propagate the shape information from intrinsics to connected instructions. This allows us to improve the code we generate for regular IR operations on matrixes embedded in a flattened vector. In the example above, we propagate the information that we are loading a matrix with 2 rows and 4 columns to `%a = load <8 x double>, <8 x double>* %A, align 16` and lower it to a series of `load <2 x double>, <2 x double>*`, which helps with avoiding a large number of shufflevector instructions to get column vectors. Please note that propagating the shape information allows us to improve the code we generate during lowering, but is not required for correctness. Without propagating shape information, we would just need additional shuffles to extract the rows/columns at the point where we lower a matrix multiply for example.
>
> This infrastructure can also be used to improve codegen for large vector operations in general. For example, consider loading 2 <100 x double> vectors, adding them and storing the result. Currently this will be lowered naively doing most of the loads first, then most of the adds and then most of the stores, causing plenty of spilling on AArch64. We can significantly improve the generated code by breaking it into load/add/store blocks that fit into the target registers.
> We also plan to implement some additional optimisations as part of the lowering, like generating fused/tiled loops for matrix multiplications and direct FMA generation.
>
> As future work, we are also evaluating adding additional operations including clamping a vector or matrix, min/max of matrixes, inverting a matrix and computing matrix determinates. Some of those might require additional intrinsics, which we can discuss on a case by case basis.
>
> In terms of integration, we are planning on adding a separate MatrixIRBuilder utility (as suggested in [4]), which can be used by frontends to create various matrix operations. For point wise operations, they will lower to vector instructions directly.
> Potential Future Extensions
>
> During the previous discussions, a few extensions were discussed, but we would like exclude them from the initial implementation. Those are
> • NDimension
> Given that the dimensions are represented as arguments to intrinsics, we can go from supporting 2 dimensions to N dimensions by using variadic arguments for shape information and handle it accordingly while lowering.
> • Padding
> For small matrixes, it might be desirable to automatically add additional padding to columns/rows, e.g. add 1 element padding to each column in a 3 x 3 matrix, to allow for using vector instructions operating on powerof2 number of elements or satisfy an alignment requirement by a target. This allows for additional optimizations, but is not required for lowering the intrinsics. We also haven't seen this being an issue so far. We should be able to iterate on that, once it becomes an issue. (Earlier discussion [5] )
> Support in Clang
>
> We are also planning to expose the matrix support in Clang via a matrix type on the C/C++ level (similar to the ext_vector_type attribute) together with a set of matrix builtins. Similar to the LLVM part, we would like to propose a pragmatic solution for common matrix operations, while being flexible enough to allow iterative improvements to accommodate additional requirements. The main motivation for the matrix support on the Clang side is to give users a way to
> • Guarantee generating highquality code for matrix operations and trees of matrix operations. For isolated operations, we can guarantee vector code generation suitable for the target. For trees of operations, the proposed value type helps with eliminating temporary loads & stores.
> • Make use of specialized matrix ISA extensions, like the new matrix instructions in ARM v8.6 or various proprietary matrix accelerators, in their C/C++ code.
> • Move optimisations from matrix wrapper libraries into the compiler. We use it internally to simplify an Eigenstyle matrix library, by relying on LLVM for generating tiled & fused loops for matrix operations.
>
> We propose adding a new matrix value type, that can be declared via a `matrix_type` attribute. Alternatively we could also generalise the existing ext_vector_type attribute ([6]), if that is preferred. Initially, the matrix type supports 2 dimensions (rows & columns). For example, a 4x4 float matrix would be declared as `typedef float m4x4_t __attribute__((matrix_type(4, 4)));`.
>
> Initially we want to focus on 2D matrixes without padding in columnmajor layout as a concrete use case. But our proposed type can be extended naturally to
> • Support N (known constant) dimensions by turning matrix_type attribute into a variadic attribute.
> • Support column/rowwise padding, by adding a column_padding clause to the attribute.
> Dealing with the padding could be exclusively handled on the frontend side, by emitting additional shufflevector instructions to extract the data. If there is a desire to exploit the padding more on the LLVM side, we can add a set of intrinsics for that.
> • Support row & column major layouts, by adding a layout clause to the attribute.
> Again, this naively could be handled while lowering to LLVM IR in Clang using shufflevector to produce flattened vectors with the required layout. For better optimisations, the LLVM intrinsics relying on shape/layout information can be extended to take the layout as additional argument. Through propagating the layout information similar to the dimensions, we should be able to optimise the points where we need to transform the layout of the underlying matrixes.
> In all cases, we require known constants as dimensions and we do not plan to support dynamic dimensions for now.
>
> In addition to the type, we propose adding a set of overloaded builtins to perform operations on matrix values. To start with, we would like to add the following builtins: __builtin_matrix_add, __builtin_matrix_subtract, __builtin_matrix_multiply, __builtin_matrix_scalar_multiply, __builtin_matrix_transpose, __builtin_matrix_insert, __builtin_matrix_extract, __builtin_matrix_column_load, __builtin_matrix_column_store. Those builtins allowed us to cover a wide range of uses cases internally, but as mentioned on the LLVM side already, we are also evaluating adding additional operations, like clamping matrixes or getting the minimum/maximum of a matrix. We should be able to iterate on the set of operations, as required by actual users.
>
> In terms of LLVM integration, we plan to provide a MatrixIRBuilder (see LLVM part), that can lower the various builtins to LLVM IR (in terms of matrix builtins or vector operations, depending on the builtin).
>
>
>
> We currently have an internal implementation based on Adam’s original proposal ('first class matrix type') and are using that for a large internal project in production. We also prototyped the approach outlined in this proposal ('flattened matrixes') to make sure we can produce equivalent code (and thus performance) with both approaches on a large codebase.
>
> We think predication is out of scope for the initial proposal and the right forum to discuss predication support in general are the related discussions on the list, the LLVMVP proposal, [7] and the round table at the Dev Meeting.
>
> We think our current proposal addresses the concerns raised previously, especially the concerns around the high cost of adding a new IR type, adding too many new intrinsics and generalising the approach to N dimensions. Unless there are any additional major concerns, we should be able to share patches for review soon.
>
> Cheers,
> Florian
>
>
> [1] http://lists.llvm.org/pipermail/llvmdev/2018October/126871.html> [2] http://lists.llvm.org/pipermail/llvmdev/2018December/128322.html> [3] http://lists.llvm.org/pipermail/llvmdev/2018October/126982.html> [4] http://lists.llvm.org/pipermail/llvmdev/2018December/128636.html> [5] http://lists.llvm.org/pipermail/llvmdev/2018December/128331.html> [6] https://clang.llvm.org/docs/LanguageExtensions.html#vectorsandextendedvectors> [7] https://reviews.llvm.org/D57504> [8] http://lists.llvm.org/pipermail/llvmdev/2018December/128636.html> _______________________________________________
> LLVM Developers mailing list
> [hidden email]
> https://lists.llvm.org/cgibin/mailman/listinfo/llvmdev_______________________________________________
cfedev mailing list
[hidden email]
https://lists.llvm.org/cgibin/mailman/listinfo/cfedev


Hi,
I just uploaded the initial patch adding the matrix intrinsics as described, as well as an initial lowering pass: https://reviews.llvm.org/D70456The first version of the patch works without any shape propagation and lowers each intrinsic independently. We plan to iterate on the pass once it is in, to improve the generated code as discussed in the RFC.
In a bit, I’ll also share a patch for the clang side, to get a concrete discussion started about the Clang side.
Thanks,
Florian
_______________________________________________
cfedev mailing list
[hidden email]
https://lists.llvm.org/cgibin/mailman/listinfo/cfedev


In reply to this post by Kristof Beyls via cfedev
On Mon, 28 Oct 2019 at 09:08, Florian Hahn via cfedev < [hidden email]> wrote: Hello,This is a followup to Adam’s original RFC (RFC: Firstclass Matrix type, [1]) and his followup ([RFC] Matrix support (take 2, [2]). The main sticking points of the last thread were centred around representing padding, propagating shape information (dimensions) and the number of proposed intrinsics [8].We've incorporated the feedback into our design and would like to share our updated proposal. We stripped it down to the minimum required to realize the major benefits, like elimination of loads/stores to temporaries and generation of tuned loops for matrix multiplies. We discuss further extensions and points that came up during the earlier discussions after the new proposal.TLDR: Our new proposal boils down to adding 4 matrix related intrinsics (transpose, multiply, and strided load & store) with shape information and an IR pass, that lowers the intrinsics to regular vector operations. The IR pass can also be extended to break down operations into chunks suitable for specialised matrix ISAs (like the matrix instructions included in ARM v8.6) and to reducing spilling. We've designed the approach so it's flexible and easy to extend.ProposalAfter the initial feedback, we evaluated using ‘flattened’ vectors to hold matrix values, instead of adding a new matrix type, as suggested originally. This was option #1 suggested in [3]. With that in mind, we propose to add a set of experimental intrinsics for matrix operations that require information about the shape/layout of the underlying matrix. The suggested intrinsics take the shape information as additional (constant) integer arguments and column major layout is assumed initially. With the new proposal, there are very few intrinsics that actually care about the memory layout and they can be easily extended to also support rowmajor layouts.Initially, we would like to propose the following intrinsics: <C x Ty> matrix_transpose(<C x Ty> %in, i32 <M>, i32 <N>)
Treat %in as containing a matrix with M rows and N columns and transpose it.
 <C x Ty> matrix_multiply(<A x Ty> %X, <B x Ty> %Y, i32 <M>, i32 <N>, i32<K>)
Treat %X as matrix with M rows and K columns, %Y as matrix with K rows and N columns and multiply them.
 <C x Ty>matrix_columnwise_load(Ty* %Ptr, i64 %Stride, i32 <M>, i32 <N>)
Load a matrix with M rows and N columns, using a stride of %Stride between columns. This allows for convenient loading of sub matrixes.
 void matrix_columnwise_store(<C x Ty> %MatrixVal, ty* %Ptr, i64 %Stride, i32 <M>, i32 <N>)
Store a matrix with M rows and N columns, using a stride of %Stride between columns. This allows for convenient storing of sub matrixes. The floating point versions of the intrinsics also take fastmath flags, which can be used to optin to FMA generation and/or constant folding opportunities via NoInfs and NoNaNs. We plan to add them to the lowered instructions and rely on InstCombine & Co for related optimisations.The intrinsics will be lowered to regular LLVM vector operations in a IR lowering pass. This means per default, we can lower the builtins on all targets. Additionally, targets can implement their own lowering to a specialized ISA extension. We implemented such a lowering for an internal matrix accelerator. The example below shows how the matrix_transpose builtin will be lowered in the IR lowering pass.define void @transpose(<8 x double> * %A, <8 x double>* %B) { entry:%a = load <8 x double>, <8 x double>* %A, align 16%b = call <8 x double> @llvm.matrix.transpose(<8 x double> %a, i32 2, i32 4)store <8 x double> %c, <8 x double>* %B, align 16ret void}declare <8 x double> @llvm.matrix.transpose(<8 x double>, i32, i32)Will get lowered to something like define void @transpose(<8 x double> * %A, <8 x double>* %B) {entry:%0 = bitcast <8 x double>* %A to <2 x double>*%1 = load <2 x double>, <2 x double>* %0, align 8%2 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 2%3 = bitcast double* %2 to <2 x double>*%4 = load <2 x double>, <2 x double>* %3, align 8%5 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 4%6 = bitcast double* %5 to <2 x double>*%7 = load <2 x double>, <2 x double>* %6, align 8%8 = shufflevector <2 x double> %7, <2 x double> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>%9 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 6%10 = bitcast double* %9 to <2 x double>*%11 = load <2 x double>, <2 x double>* %10, align 8%12 = shufflevector <2 x double> %11, <2 x double> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>%13 = shufflevector <2 x double> %1, <2 x double> %4, <4 x i32> <i32 0, i32 2, i32 undef, i32 undef>%14 = shufflevector <4 x double> %13, <4 x double> %8, <4 x i32> <i32 0, i32 1, i32 4, i32 undef>%15 = shufflevector <4 x double> %14, <4 x double> %12, <4 x i32> <i32 0, i32 1, i32 2, i32 4>%16 = shufflevector <2 x double> %1, <2 x double> %4, <4 x i32> <i32 1, i32 3, i32 undef, i32 undef>%17 = shufflevector <4 x double> %16, <4 x double> %8, <4 x i32> <i32 0, i32 1, i32 5, i32 undef>%18 = shufflevector <4 x double> %17, <4 x double> %12, <4 x i32> <i32 0, i32 1, i32 2, i32 5>%19 = bitcast <8 x double>* %C to <4 x double>*store <4 x double> %15, <4 x double>* %19, align 8%20 = getelementptr <8 x double>, <8 x double>* %C, i64 0, i64 4%21 = bitcast double* %20 to <4 x double>*store <4 x double> %18, <4 x double>* %21, align 8ret void}Before we do the actual lowering, we propagate the shape information from intrinsics to connected instructions. This allows us to improve the code we generate for regular IR operations on matrixes embedded in a flattened vector. In the example above, we propagate the information that we are loading a matrix with 2 rows and 4 columns to `%a = load <8 x double>, <8 x double>* %A, align 16` and lower it to a series of `load <2 x double>, <2 x double>*`, which helps with avoiding a large number of shufflevector instructions to get column vectors. Please note that propagating the shape information allows us to improve the code we generate during lowering, but is not required for correctness. Without propagating shape information, we would just need additional shuffles to extract the rows/columns at the point where we lower a matrix multiply for example.This infrastructure can also be used to improve codegen for large vector operations in general. For example, consider loading 2 <100 x double> vectors, adding them and storing the result. Currently this will be lowered naively doing most of the loads first, then most of the adds and then most of the stores, causing plenty of spilling on AArch64. We can significantly improve the generated code by breaking it into load/add/store blocks that fit into the target registers.We also plan to implement some additional optimisations as part of the lowering, like generating fused/tiled loops for matrix multiplications and direct FMA generation.As future work, we are also evaluating adding additional operations including clamping a vector or matrix, min/max of matrixes, inverting a matrix and computing matrix determinates. Some of those might require additional intrinsics, which we can discuss on a case by case basis.In terms of integration, we are planning on adding a separate MatrixIRBuilder utility (as suggested in [4]), which can be used by frontends to create various matrix operations. For point wise operations, they will lower to vector instructions directly. Potential Future ExtensionsDuring the previous discussions, a few extensions were discussed, but we would like exclude them from the initial implementation. Those are NDimension
Given that the dimensions are represented as arguments to intrinsics, we can go from supporting 2 dimensions to N dimensions by using variadic arguments for shape information and handle it accordingly while lowering.  Padding
For small matrixes, it might be desirable to automatically add additional padding to columns/rows, e.g. add 1 element padding to each column in a 3 x 3 matrix, to allow for using vector instructions operating on powerof2 number of elements or satisfy an alignment requirement by a target. This allows for additional optimizations, but is not required for lowering the intrinsics. We also haven't seen this being an issue so far. We should be able to iterate on that, once it becomes an issue. (Earlier discussion [5] )
Support in ClangWe are also planning to expose the matrix support in Clang via a matrix type on the C/C++ level (similar to the ext_vector_type attribute) together with a set of matrix builtins. Similar to the LLVM part, we would like to propose a pragmatic solution for common matrix operations, while being flexible enough to allow iterative improvements to accommodate additional requirements. The main motivation for the matrix support on the Clang side is to give users a way to Guarantee generating highquality code for matrix operations and trees of matrix operations. For isolated operations, we can guarantee vector code generation suitable for the target. For trees of operations, the proposed value type helps with eliminating temporary loads & stores.
 Make use of specialized matrix ISA extensions, like the new matrix instructions in ARM v8.6 or various proprietary matrix accelerators, in their C/C++ code.
 Move optimisations from matrix wrapper libraries into the compiler. We use it internally to simplify an Eigenstyle matrix library, by relying on LLVM for generating tiled & fused loops for matrix operations.
We propose adding a new matrix value type, that can be declared via a `matrix_type` attribute. Alternatively we could also generalise the existing ext_vector_type attribute ([6]), if that is preferred. Initially, the matrix type supports 2 dimensions (rows & columns). For example, a 4x4 float matrix would be declared as `typedef float m4x4_t __attribute__((matrix_type(4, 4)));`.
Hi Florian,
This extension clearly satisfies point 2, and I'm happy to trust that we can iterate to satisfying points 6 and 7. I would like to hear your thoughts on the other criteria.
Initially we want to focus on 2D matrixes without padding in columnmajor layout as a concrete use case. But our proposed type can be extended naturally to Support N (known constant) dimensions by turning matrix_type attribute into a variadic attribute.
 Support column/rowwise padding, by adding a column_padding clause to the attribute.
Dealing with the padding could be exclusively handled on the frontend side, by emitting additional shufflevector instructions to extract the data. If there is a desire to exploit the padding more on the LLVM side, we can add a set of intrinsics for that.  Support row & column major layouts, by adding a layout clause to the attribute.
Again, this naively could be handled while lowering to LLVM IR in Clang using shufflevector to produce flattened vectors with the required layout. For better optimisations, the LLVM intrinsics relying on shape/layout information can be extended to take the layout as additional argument. Through propagating the layout information similar to the dimensions, we should be able to optimise the points where we need to transform the layout of the underlying matrixes. In all cases, we require known constants as dimensions and we do not plan to support dynamic dimensions for now.In addition to the type, we propose adding a set of overloaded builtins to perform operations on matrix values. To start with, we would like to add the following builtins: __builtin_matrix_add, __builtin_matrix_subtract, __builtin_matrix_multiply, __builtin_matrix_scalar_multiply, __builtin_matrix_transpose, __builtin_matrix_insert, __builtin_matrix_extract, __builtin_matrix_column_load, __builtin_matrix_column_store. Those builtins allowed us to cover a wide range of uses cases internally, but as mentioned on the LLVM side already, we are also evaluating adding additional operations, like clamping matrixes or getting the minimum/maximum of a matrix. We should be able to iterate on the set of operations, as required by actual users.In terms of LLVM integration, we plan to provide a MatrixIRBuilder (see LLVM part), that can lower the various builtins to LLVM IR (in terms of matrix builtins or vector operations, depending on the builtin).We currently have an internal implementation based on Adam’s original proposal ('first class matrix type') and are using that for a large internal project in production. We also prototyped the approach outlined in this proposal ('flattened matrixes') to make sure we can produce equivalent code (and thus performance) with both approaches on a large codebase.We think predication is out of scope for the initial proposal and the right forum to discuss predication support in general are the related discussions on the list, the LLVMVP proposal, [7] and the round table at the Dev Meeting.We think our current proposal addresses the concerns raised previously, especially the concerns around the high cost of adding a new IR type, adding too many new intrinsics and generalising the approach to N dimensions. Unless there are any additional major concerns, we should be able to share patches for review soon.Cheers,Florian[1] http://lists.llvm.org/pipermail/llvmdev/2018October/126871.html[2] http://lists.llvm.org/pipermail/llvmdev/2018December/128322.html[3] http://lists.llvm.org/pipermail/llvmdev/2018October/126982.html[4] http://lists.llvm.org/pipermail/llvmdev/2018December/128636.html[5] http://lists.llvm.org/pipermail/llvmdev/2018December/128331.html[6] https://clang.llvm.org/docs/LanguageExtensions.html#vectorsandextendedvectors[7] https://reviews.llvm.org/D57504[8] http://lists.llvm.org/pipermail/llvmdev/2018December/128636.html _______________________________________________
cfedev mailing list
[hidden email]
https://lists.llvm.org/cgibin/mailman/listinfo/cfedev
_______________________________________________
cfedev mailing list
[hidden email]
https://lists.llvm.org/cgibin/mailman/listinfo/cfedev

