RFC: Supporting the new PowerPC MMA instructions in Clang/LLVM


Baptiste Saleil via cfe-dev
Summary
-------

New Power ISA v3.1 [0] introduces instructions to accelerate matrix
multiplication. We want to expose these instructions through a list of
target-dependent builtins and new Clang types in the form of a language
extension. This RFC gives more details on the requirements for these
types and explains how we (IBM) are implementing them in Clang.

We present the frontend implementation as an RFC because we need to add
target-specific checks in Sema and want to get feedback on our implementation
of these checks. The backend implementation does not impact the other targets
so it is not part of this RFC. Comments and questions are welcome.

Introduction
------------

The new instructions manipulate matrices that the CPU represents in new 512-bit
registers called `accumulators`. Copying matrices, modifying their values and
extracting values from them may cause the CPU to copy data to and from the
matrix multiplication unit. To avoid degrading performance, we want to
minimize how often these operations occur, so the user will only be able to
modify matrices, extract their values and perform computations with them
through the dedicated builtins. The instructions are designed to be used in
computational kernels and we want to enforce that specific workflow.

Because of this restriction, we cannot rely on the target-independent matrix
types [1]. We need to add a new target-dependent type and restrict its use.
We give more details on these restrictions below. To be able to manipulate
these matrices, we want to add the `__vector_quad` type to Clang. This type
would be a PowerPC-specific builtin type mapped to the new 512-bit registers.

Similarly, some of these instructions take 256-bit values that must be stored
in two consecutive VSX registers. To represent these values and minimize the
number of copies between VSX registers, we also want to add the PowerPC-specific
builtin type `__vector_pair` that would be mapped to consecutive VSX registers.

Value initialization
--------------------

The only way to initialize a `__vector_pair` is by calling a builtin taking two
128-bit vectors and assembling them to form a 256-bit pair. A similar builtin
exists to assemble four 128-bit vectors to form a 512-bit `__vector_quad`:

vector unsigned char v1 = ...;
vector unsigned char v2 = ...;
vector unsigned char v3 = ...;
vector unsigned char v4 = ...;
__vector_pair vp;
__vector_quad vq;
__builtin_mma_assemble_pair(&vp, v1, v2);
__builtin_mma_assemble_acc(&vq, v1, v2, v3, v4);

The other way to initialize a `__vector_quad` is to call a builtin mapped to an
instruction generating a new value of this type:

__vector_quad vq1;
__builtin_mma_xxsetaccz(&vq1); // zero-initializes vq1
__vector_quad vq2;
__builtin_mma_xvi4ger8(&vq2, v1, v2); // new value generated in vq2

Both `__vector_pair` and `__vector_quad` values can also be loaded through
pointers, which may have been cast from void or char pointers.

Value extraction
----------------

The only way to extract values from a matrix is to call the builtins
disassembling `__vector_pair` and `__vector_quad` values back into two
and four 128-bit vectors respectively:

vector unsigned char vpr[2];
vector unsigned char vqr[4];
__builtin_mma_disassemble_pair(vpr, &vp);
__builtin_mma_disassemble_acc(vqr, &vq);

Once the values are disassembled to vectors, the user can extract values as
usual, for example using the subscript operator on the vector unsigned char
values. So the typical workflow to efficiently use these instructions in a
kernel is to first initialize the matrices, then perform computations and finally
disassemble them to extract the result of the computations. These three steps
should be done using the provided builtins.
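Put together, a kernel following this workflow might look like the sketch below. This is a hypothetical sketch, not part of the patches: it assumes a POWER10 toolchain with MMA enabled, and the accumulating `__builtin_mma_xvi4ger8pp` form defined by the ISA (this RFC only names the overwriting `__builtin_mma_xvi4ger8`):

```c
/* Sketch of the three-step workflow: initialize, compute, disassemble.
   Assumes a POWER10 toolchain (-mcpu=power10); __builtin_mma_xvi4ger8pp
   (the accumulating outer-product form from the ISA) is an assumption,
   as the RFC above only names the overwriting variant. */
#include <altivec.h>

void kernel(const vector unsigned char *a, const vector unsigned char *b,
            int n, vector unsigned char out[4]) {
  __vector_quad acc;
  __builtin_mma_xxsetaccz(&acc);                  /* 1. zero-initialize     */
  for (int i = 0; i < n; i++)
    __builtin_mma_xvi4ger8pp(&acc, a[i], b[i]);   /* 2. outer-product + add */
  __builtin_mma_disassemble_acc(out, &acc);       /* 3. extract to vectors  */
}
```

All three steps go through the builtins, so the accumulator stays in the matrix multiplication unit for the whole loop.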

Semantics
---------

To restrict values of these types to kernels, and thus avoid copies to and
from the matrix multiplication unit, we want to prevent as many implicit
copies as possible. That means it should only be possible to declare values
of these types as local variables. We want to prevent every other way of
declaring and using non-pointer variables of these types (global variables,
function parameters, function returns, etc.).

The only situations in which these types and values of these types can be
used are:
  * Local variable declaration
  * Assignment operator
  * Builtin call parameter
  * Memory allocation
  * Typedef & alias
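Concretely, the restrictions would accept and reject declarations as in the sketch below (hypothetical: the exact diagnostics are up to the Sema patch, and the snippet only compiles with a POWER10 toolchain):

```c
/* Sketch of what the Sema restrictions allow and reject. */
#include <stdlib.h>

__vector_quad gq;                 /* error: not a local variable      */
__vector_quad *gqp;               /* ok: pointer declarations allowed */
typedef __vector_quad quad_t;     /* ok: typedef & alias              */

__vector_quad ret(void);          /* error: function return           */
void param(__vector_quad vq);     /* error: function parameter        */

void kernel(void) {
  __vector_quad vq;               /* ok: local variable declaration   */
  __vector_quad vq2 = vq;         /* ok: assignment operator          */
  quad_t *heap = malloc(sizeof *heap);  /* ok: memory allocation      */
  free(heap);
}
```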

Implementation
--------------

We have implemented support for these types, builtins and intrinsics in both
Clang's frontend and the LLVM PowerPC backend. We will post the backend
implementation later. We implemented and tested this support out-of-tree in
conjunction with the GCC team to guarantee a common API and source
compatibility. For this RFC, we have 5 patches for the frontend:
  * Add options to control MMA support on PowerPC targets [2].
  * Define the two new types as Clang target-dependent builtin types.
    As with other targets, we decided to define these types in a separate
    `PPCtypes.def` file to improve extensibility in case we need to add other
    PowerPC-specific types in the future [3].
  * Add the builtin definitions. These builtins use the two new types,
    so they use custom type descriptors. To avoid pervasive changes,
    we use custom decoding of these descriptors [4].
  * Add the Sema checks to restrict the use of the two types.
    We prevent the use of non-pointer values of these types in any
    declaration that is not a local variable declaration. We also prevent
    them from being passed as function arguments and returned from
    functions [5].
  * Implement the minimal required changes to LLVM to support the builtins.
    In this patch, we enable the use of v256i1 for intrinsic arguments and
    define all the MMA intrinsics the builtins are mapped to [6].

The backend implementation should not impact other targets. We do not plan to
add any type to LLVM. `__vector_pair` and `__vector_quad` are generated as
`v256i1` and `v512i1` respectively (both are currently unused in the PowerPC
backend). VSX pair registers will be allocated to the `v256i1` type and the
new accumulator registers will be allocated to the `v512i1` type.

[0] Power ISA v3.1, https://ibm.ent.box.com/s/hhjfw0x0lrbtyzmiaffnbxh2fuo0fog0
[1] https://clang.llvm.org/docs/MatrixTypes.html
[2] https://reviews.llvm.org/D81442
[3] https://reviews.llvm.org/D81508
[4] https://reviews.llvm.org/D81748
[5] https://reviews.llvm.org/D82035
[6] https://reviews.llvm.org/D81744

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: RFC: Supporting the new PowerPC MMA instructions in Clang/LLVM

Hal Finkel via cfe-dev


On 6/19/20 9:31 PM, Baptiste Saleil via cfe-dev wrote:
The new instructions manipulate matrices that the CPU represents in new 512-bit
registers called `accumulators`. Copying matrices, modifying their values and
extracting values from them may cause the CPU to copy data to and from the
matrix multiplication unit. To avoid degrading performance, we want to
minimize how often these operations occur, so the user will only be able to
modify matrices, extract their values and perform computations with them
through the dedicated builtins. The instructions are designed to be used in
computational kernels and we want to enforce that specific workflow.

Because of this restriction, we cannot rely on the target-independent matrix
types [1].


If this is part of the documented system ABI, and what will be supported by GCC, then we should support it too.

That having been said, I'm not convinced that this is a good idea, and supporting the target-independent matrix types would be better. I understand that the copying will be expensive, and is something that should be avoided, but this is true to some extent for everything: there are some usages that compile to machine code efficiently and some that don't. We generally, however, favor the ability to create abstractions that *can* be compiled efficiently as part of expected use cases, even if we cannot guarantee that all uses will produce efficient code. In this case, you're prohibiting the creation of abstractions (by semantically restricting to local variables) because you fear that not all uses will compile to efficient code. Are there some other structural reasons why supporting these as regular values would be problematic?


We need to add a new target-dependent type and restrict its use.
We give more details on these restrictions below. To be able to manipulate
these matrices, we want to add the `__vector_quad` type to Clang. This type
would be a PowerPC-specific builtin type mapped to the new 512-bit registers.


Okay.

 -Hal



-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Re: RFC: Supporting the new PowerPC MMA instructions in Clang/LLVM

Baptiste Saleil via cfe-dev


On Mon, 22 Jun 2020 at 19:01, Hal Finkel <[hidden email]> wrote:


On 6/19/20 9:31 PM, Baptiste Saleil via cfe-dev wrote:

Because of this restriction, we cannot rely on the target-independent matrix
types [1].


If this is part of the documented system ABI, and what will be supported by GCC, then we should support it too.

That having been said, I'm not convinced that this is a good idea, and supporting the target-independent matrix types would be better. I understand that the copying will be expensive, and is something that should be avoided, but this is true to some extent for everything: there are some usages that compile to machine code efficiently and some that don't. We generally, however, favor the ability to create abstractions that *can* be compiled efficiently as part of expected use cases, even if we cannot guarantee that all uses will produce efficient code. In this case, you're prohibiting the creation of abstractions (by semantically restricting to local variables) because you fear that not all uses will compile to efficient code. Are there some other structural reasons why supporting these as regular values would be problematic?

Supporting these as regular values would be problematic for several reasons. Each of these new accumulator registers is associated with 4 of the existing 128-bit VSX vector registers (VSRs). A particularity of MMA is that while an accumulator contains defined data, its 4 associated registers contain undefined data and cannot be used. When copying an accumulator, we need to:
  1. Copy its value back to its four associated VSRs
  2. Copy these 4 VSRs to the VSRs associated with the destination accumulator
  3. Copy these VSRs to the destination accumulator
  4. If the copy is not a kill, copy the 4 VSRs associated with the source back to the source accumulator

So if these registers were supported as regular values, copies (and function calls and returns) would be really expensive, and every live accumulator would take 4 vector registers out of use. More importantly (something I should have mentioned in the RFC), the new instructions actually implement a single operation, the outer product. That means supporting these as regular values would imply copying accumulators back to their associated VSRs and generating non-MMA instructions for every other operation anyway. Therefore, programs using matrices this way would likely be less efficient.

However, although we're not planning on supporting the target-independent matrix types for these reasons, we're not excluding supporting the target-independent matrix operations. We are exploring implementing the target-independent matrix multiplication operation with MMA kernels. That way, on PowerPC, programs using target-independent matrix types and operations would actually benefit from MMA for matrix multiplication with no additional effort.

Baptiste.



Re: RFC: Supporting the new PowerPC MMA instructions in Clang/LLVM

Florian Hahn via cfe-dev
Hi,

Sorry for jumping in a bit late, I missed the initial discussion.

On Jul 21, 2020, at 05:13, Baptiste Saleil via cfe-dev <[hidden email]> wrote:
On Mon, 22 Jun 2020 at 19:01, Hal Finkel <[hidden email]> wrote:
On 6/19/20 9:31 PM, Baptiste Saleil via cfe-dev wrote:

Because of this restriction, we cannot rely on the target-independent matrix
types [1].


If this is part of the documented system ABI, and what will be supported by GCC, then we should support it too.

That having been said, I'm not convinced that this is a good idea, and supporting the target-independent matrix types would be better. I understand that the copying will be expensive, and is something that should be avoided, but this is true to some extent for everything: there are some usages that compile to machine code efficiently and some that don't. We generally, however, favor the ability to create abstractions that *can* be compiled efficiently as part of expected use cases, even if we cannot guarantee that all uses will produce efficient code. In this case, you're prohibiting the creation of abstractions (by semantically restricting to local variables) because you fear that not all uses will compile to efficient code. Are there some other structural reasons why supporting these as regular values would be problematic?

Supporting these as regular values would be problematic for several reasons. These new accumulator registers are actually each associated with 4 of the existing 128-bit VSR vector registers. A particularity of MMA is that when an accumulator contains defined data, its 4 associated registers contain undefined data and cannot be used. When copying an accumulator, we need to:
  1. Copy its value back to its four associated VSRs
  2. Copy these 4 VSRs to the VSRs associated with the destination accumulator
  3. Copy these VSRs to the destination accumulator 
  4. If the copy is not a kill, copy the 4 VSRs associated with the source back to the source accumulator

So if these registers were supported as regular values, we would have really expensive copy (and also expensive function calls and returns) and we would prevent from using 4 vector registers per live accumulator. More importantly (something I should have mentioned in the RFC), the new instructions actually implement a single operation that is the outer product. That means that supporting these as regular values would imply copying accumulators back to their associated VSRs and generating non-MMA instructions for any other operation anyway. Therefore, it is likely that programs using matrices would actually be less efficient.


From the reasoning above, it sounds like there are no structural reasons that prevent using the matrix type extension, unless I am missing something.

But if I understand correctly, the main motivation for introducing the new types with the additional restrictions is to prevent users from writing code that cannot be mapped directly to the hardware?

In particular, is the concern with the matrix types extension that a user could specify a matrix operation that cannot be mapped directly to the MMA extension, e.g. a multiply of 13 x 7 float matrices?
And specify costly accesses, for example repeated access to elements that live in different VSR registers?

However, although we're not planning on supporting the target-independent matrix types for these reasons, we're not excluding supporting the target-independent matrix operations. We are exploring implementing the target-independent matrix multiplication operation with MMA kernels. That way, on PowerPC, programs using target-independent matrix types and operations would actually benefit from MMA for matrix multiplication with no additional effort.

We recently started working on providing some infrastructure to allow for target-specific lowering upstream. Any collaboration on that front would be very welcome, to make sure things are general enough to support many different hardware extensions.

Cheers,
Florian


Re: RFC: Supporting the new PowerPC MMA instructions in Clang/LLVM

Baptiste Saleil via cfe-dev
Hi,

On Tue, 21 Jul 2020 at 10:06, Florian Hahn <[hidden email]> wrote:
Hi,

Sorry for jumping in a bit late, I missed the initial discussion.

On Jul 21, 2020, at 05:13, Baptiste Saleil via cfe-dev <[hidden email]> wrote:
On Mon, 22 Jun 2020 at 19:01, Hal Finkel <[hidden email]> wrote:
On 6/19/20 9:31 PM, Baptiste Saleil via cfe-dev wrote:
The new instructions manipulate matrices that the CPU represents by new 512-bit
registers called `accumulators`. Copying matrices, modifying values and
extracting values of matrices may cause the CPU to copy values from/to the
matrix multiplication unit. To avoid degrading performance, we thus want to
minimize the number of times these operations are used. So the user will be able
to modify and extract values of the matrices and perform computations with them
by using the dedicated builtins only. The instructions are designed to be used in
computational kernels and we want to enforce that specific workflow.

Because of this restriction, we cannot rely on the target-independent matrix
types [1].


If this is part of the documented system ABI, and what will be supported by GCC, then we should support it too.

That having been said, I'm not convinced that this is a good idea, and supporting the target-independent matrix types would be better. I understand that the copying will be expensive, and is something that should be avoided, but this is true to some extent for everything: there are some usages that compile to machine code efficiently and some that don't. We generally, however, favor the ability to create abstractions that *can* be compiled efficiently as part of expected use cases, even if we cannot guarantee that all uses will produce efficient code. In his case, you're prohibiting the creation of abstractions (by semantically restricting to local variables) because you fear that not all uses will compile to efficient code. Are there some other structural reasons why supporting these are regular values would be problematic?

Supporting these as regular values would be problematic for several reasons. These new accumulator registers are actually each associated with 4 of the existing 128-bit VSR vector registers. A particularity of MMA is that when an accumulator contains defined data, its 4 associated registers contain undefined data and cannot be used. When copying an accumulator, we need to:
  1. Copy its value back to its four associated VSRs
  2. Copy these 4 VSRs to the VSRs associated with the destination accumulator
  3. Copy these VSRs to the destination accumulator 
  4. If the copy is not a kill, copy the 4 VSRs associated with the source back to the source accumulator

So if these registers were supported as regular values, we would have really expensive copies (and also expensive function calls and returns) and each live accumulator would prevent us from using 4 vector registers. More importantly (something I should have mentioned in the RFC), the new instructions actually implement a single operation: the outer product. That means that supporting these as regular values would imply copying accumulators back to their associated VSRs and generating non-MMA instructions for any other operation anyway. Therefore, it is likely that programs using matrices would actually be less efficient.
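To make the cost concrete, the four copy steps above would expand to something like the following instruction sequence. This is only a rough sketch: the register numbers and the use of `xxlor` for the VSR-to-VSR moves are illustrative assumptions, while `xxmfacc`/`xxmtacc` are the actual unprime/prime instructions.

```
# Copying accumulator acc0 (associated with vs0-vs3) to acc1
# (associated with vs4-vs7); numbering is illustrative only.
xxmfacc acc0            # 1. move acc0's data back into vs0-vs3
xxlor   vs4, vs0, vs0   # 2. copy the four source VSRs to the
xxlor   vs5, vs1, vs1   #    VSRs associated with the destination
xxlor   vs6, vs2, vs2
xxlor   vs7, vs3, vs3
xxmtacc acc1            # 3. prime the destination accumulator
xxmtacc acc0            # 4. if acc0 is still live, re-prime it too
```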


From the reasoning above, it sounds like there are no structural reasons that prevent using the matrix type extension, unless I am missing something.

But if I understand correctly, the main motivation for introducing the new types with the additional restrictions is to prevent users from writing code that cannot be mapped directly to the hardware?

In particular, is a concern with the matrix types extension that a user could specify a matrix operation that cannot be mapped directly to the MMA extension, e.g. a multiply of 13 x 7 float matrices?
And specify costly accesses, for example repeated access to elements that live in different VSR registers?
 
Not really. The problem is that MMA allows storing matrices in the new registers but actually supports no operation on these matrices.
The only two things it can do are compute the outer product of two vectors and store the result as a matrix, or compute the outer product
of two vectors and add the result to a given matrix (no copy, no addition, no element access, no transpose, etc.).
So if we used the new MMA registers for the matrix extension, *all* the operations except matrix multiplication would actually be slower than without MMA.
For example, accessing a single element of a matrix would imply copying the accumulator register to its four VSR registers, extracting the element from one of the VSRs, then copying the four VSRs back to the accumulator (whereas without MMA we just extract the element from a VSR).
Similarly for binary operations, we would need to copy the accumulator to its VSRs, do the operation on the VSRs, and copy them back to the accumulator.

That's the reason why we think it is better to support the matrix extension through the llvm.matrix.multiply only. That way, the multiplication is accelerated without the need to write target-dependent code and there is no negative impact on the other operations.

The motivation for adding target-dependent types is that users who want to explicitly generate code to accelerate matrix multiplication on PowerPC (typically linear algebra library developers) can do so without the need to write inline assembly. And the additional restrictions are added to help them write these kernels with optimal performance, e.g. by preventing copies.

However, although we're not planning on supporting the target-independent matrix types for these reasons, we're not excluding supporting the target-independent matrix operations. We are exploring implementing the target-independent matrix multiplication operation with MMA kernels. That way, on PowerPC, programs using target-independent matrix types and operations would actually benefit from MMA for matrix multiplication with no additional effort.

We recently started working on providing some infrastructure to allow for target-specific lowering upstream. Any collaboration on that front would be very welcome, to make sure things are general enough to support many different hardware extensions.

Thanks, we'll take a look at that.

Baptiste.

Cheers,
Florian

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: RFC: Supporting the new PowerPC MMA instructions in Clang/LLVM

Keane, Erich via cfe-dev
In reply to this post by Keane, Erich via cfe-dev


On 7/20/20 11:13 PM, Baptiste Saleil wrote:


On Mon, 22 Jun 2020 at 19:01, Hal Finkel <[hidden email]> wrote:


On 6/19/20 9:31 PM, Baptiste Saleil via cfe-dev wrote:
Summary
-------

New Power ISA v3.1 [0] introduces instructions to accelerate matrix
multiplication. We want to expose these instructions through a list of
target-dependent builtins and new Clang types in the form of a language
extension. This RFC gives more details on the requirements for these
types and explains how we (IBM) are implementing them in Clang.

We present the frontend implementation as an RFC because we need to add
target-specific checks in Sema and want to get feedback on our implementation
of these checks. The backend implementation does not impact the other targets
so it is not part of this RFC. Comments and questions are welcome.

Introduction
------------

The new instructions manipulate matrices that the CPU represents by new 512-bit
registers called `accumulators`. Copying matrices, modifying values and
extracting values of matrices may cause the CPU to copy values from/to the
matrix multiplication unit. To avoid degrading performance, we thus want to
minimize the number of times these operations are used. So the user will be able
to modify and extract values of the matrices and perform computations with them
by using the dedicated builtins only. The instructions are designed to be used in
computational kernels and we want to enforce that specific workflow.

Because of this restriction, we cannot rely on the target-independent matrix
types [1].


If this is part of the documented system ABI, and what will be supported by GCC, then we should support it too.

That having been said, I'm not convinced that this is a good idea, and supporting the target-independent matrix types would be better. I understand that the copying will be expensive, and is something that should be avoided, but this is true to some extent for everything: there are some usages that compile to machine code efficiently and some that don't. We generally, however, favor the ability to create abstractions that *can* be compiled efficiently as part of expected use cases, even if we cannot guarantee that all uses will produce efficient code. In this case, you're prohibiting the creation of abstractions (by semantically restricting to local variables) because you fear that not all uses will compile to efficient code. Are there some other structural reasons why supporting these as regular values would be problematic?

Supporting these as regular values would be problematic for several reasons. These new accumulator registers are actually each associated with 4 of the existing 128-bit VSR vector registers. A particularity of MMA is that when an accumulator contains defined data, its 4 associated registers contain undefined data and cannot be used. When copying an accumulator, we need to:
  1. Copy its value back to its four associated VSRs
  2. Copy these 4 VSRs to the VSRs associated with the destination accumulator
  3. Copy these VSRs to the destination accumulator
  4. If the copy is not a kill, copy the 4 VSRs associated with the source back to the source accumulator

So if these registers were supported as regular values, we would have really expensive copies (and also expensive function calls and returns


I don't see why you call four vector moves or memory accesses expensive. What's the alternative? The programmer needs to move the data around somehow anyway if that's the thing that they need to do.


) and each live accumulator would prevent us from using 4 vector registers. More importantly (something I should have mentioned in the RFC), the new instructions actually implement a single operation: the outer product. That means that supporting these as regular values would imply copying accumulators back to their associated VSRs and generating non-MMA instructions for any other operation anyway. Therefore, it is likely that programs using matrices would actually be less efficient.


I'm assuming that you'll model the registers explicitly (using RegisterTuples or similar in TableGen), so you'll end up with a collection of registers that alias appropriately with their underlying VSRs, and the general infrastructure will handle the details of copying, killing, and so on. Is that correct?

If we add patterns for, say, adding, that use subregister extraction along with the underlying VSR instructions, then hopefully the infrastructure will coalesce away any unnecessary copies and we'll get the right "in place" matrix addition. To say that the types support only outer product is, based on my interpretation of your description, technically correct, but on the other hand, elementwise operations (e.g., add) can be directly supported using the underlying operations on the VSRs at reasonably-low cost. Is this correct?

I'm not sure exactly what our legalization framework does for matrix types currently, but presumably it should handle expansion for the other cases.

Regardless of what the frontend accepts, I would prefer to see, where possible, types modeled using generic LLVM types and operations.



However, although we're not planning on supporting the target-independent matrix types for these reasons, we're not excluding supporting the target-independent matrix operations. We are exploring implementing the target-independent matrix multiplication operation with MMA kernels. That way, on PowerPC, programs using target-independent matrix types and operations would actually benefit from MMA for matrix multiplication with no additional effort.


That sounds good to me.

Thanks again,

Hal



Baptiste.


We need to add a new target-dependent type and restrict its use.
We give more details on these restrictions below. To be able to manipulate
these matrices, we want to add the `__vector_quad` type to Clang. This type
would be a PowerPC-specific builtin type mapped to the new 512-bit registers.


Okay.

 -Hal



Similarly, some of these instructions take 256-bit values that must be stored
in two consecutive VSX registers. To represent these values and minimize the
number of copies between VSX registers, we also want to add the PowerPC-specific
builtin type `__vector_pair` that would be mapped to consecutive VSX registers.

Value initialization
--------------------

The only way to initialize a `__vector_pair` is by calling a builtin taking two
128-bit vectors and assembling them to form a 256-bit pair. A similar builtin
exists to assemble four 128-bit vectors to form a 512-bit `__vector_quad`:

vector unsigned char v1 = ...;
vector unsigned char v2 = ...;
vector unsigned char v3 = ...;
vector unsigned char v4 = ...;
__vector_pair vp;
__vector_quad vq;
__builtin_mma_assemble_pair(&vp, v1, v2);
__builtin_mma_assemble_acc(&vq, v1, v2, v3, v4);

The other way to initialize a `__vector_quad` is to call a builtin mapped to an
instruction generating a new value of this type:

__vector_quad vq1;
__builtin_mma_xxsetaccz(&vq1); // zero-initializes vq1
__vector_quad vq2;
__builtin_mma_xvi4ger8(&vq2, v1, v2); // new value generated in vq2

Both `__vector_pair` and `__vector_quad` can also be loaded from pointers that
can potentially be cast from void or char pointers.

Value extraction
----------------

The only way to extract values from a matrix is to call the builtins
disassembling `__vector_pair` and `__vector_quad` values back into two
and four 128-bit vectors respectively:

vector unsigned char* vpr = ...;
vector unsigned char* vqr = ...;
__builtin_mma_disassemble_pair(vpr, &vp);
__builtin_mma_disassemble_acc(vqr, &vq);

Once the values are disassembled to vectors, the user can extract values as
usual, for example using the subscript operator on the vector unsigned char
values. So the typical workflow to efficiently use these instructions in a
kernel is to first initialize the matrices, then perform computations and finally
disassemble them to extract the result of the computations. These three steps
should be done using the provided builtins.
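Putting those three steps together, a minimal kernel built from the builtins in this RFC might look like the sketch below. This is only illustrative: it assumes a POWER10 target with MMA enabled, and `__builtin_mma_xvi4ger8pp` (an accumulating form of `xvi4ger8`) is an assumption here, since only the non-accumulating form is shown above:

```c
/* Hedged sketch, not portable C: requires a PowerPC compiler with
 * MMA support. Accumulates n outer products, then extracts the
 * result as four 128-bit vectors. */
void mma_kernel(const vector unsigned char *a,
                const vector unsigned char *b,
                vector unsigned char result[4], int n) {
  __vector_quad acc;
  __builtin_mma_xxsetaccz(&acc);                 /* 1. initialize      */
  for (int i = 0; i < n; i++)
    __builtin_mma_xvi4ger8pp(&acc, a[i], b[i]);  /* 2. compute         */
  __builtin_mma_disassemble_acc(result, &acc);   /* 3. extract results */
}
```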

Semantics
---------

To enforce using values of these types in kernels, and thus avoid copies from/to
the matrix multiplication unit, we want to prevent as many implicit copies
as possible. That means that it should only be possible to declare values of
these types as local variables. We want to prevent any other way to declare and
use non-pointer variables of these types (global variable, function parameter,
function return, etc...).

The only situations in which these types and values of these types can be
used are:
  * Local variable declaration
  * Assignment operator
  * Builtin call parameter
  * Memory allocation
  * Typedef & alias
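Concretely, under these restrictions we would expect declarations like the following to be accepted or rejected (a sketch only; the exact diagnostics emitted by the Sema checks may differ):

```c
__vector_quad gq;                   /* rejected: global variable           */
__vector_quad f(__vector_quad v);   /* rejected: parameter and return type */

typedef __vector_quad quad_t;       /* accepted: typedef & alias           */

void kernel(void) {
  __vector_quad vq;                 /* accepted: local variable            */
  __vector_quad vq2 = vq;           /* accepted: assignment                */
  __vector_quad *p = &vq;           /* accepted: pointers are allowed      */
}
```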

Implementation
--------------

We have implemented support for these types, builtins and intrinsics in both
Clang's frontend and the LLVM PowerPC backend. We will post the backend
implementation later. We implemented and tested this support out-of-tree in
conjunction with the GCC team to ensure a common API and source
compatibility. For this RFC, we have 5 patches for the frontend:
  * Add options to control MMA support on PowerPC targets [2].
  * Define the two new types as Clang target-dependent builtin types.
    As with the other targets, we decided to define these types in a separate
    `PPCtypes.def` file to improve extensibility in case we need to add other
    PowerPC-specific types in the future [3].
  * Add the builtin definitions. These builtins use the two new types,
    so they use custom type descriptors. To avoid pervasive changes,
    we use custom decoding of these descriptors [4].
  * Add the Sema checks to restrict the use of the two types.
    We prevent the use of non-pointer values of these types in any declaration
    that is not a local variable declaration. We also prevent them from being
    passed as function arguments and returned from functions [5].
  * Implement the minimal required changes to LLVM to support the builtins.
    In this patch, we enable the use of v256i1 for intrinsic arguments and
    define all the MMA intrinsics the builtins are mapped to [6].

The backend implementation should not impact other targets. We do not plan to
add any type to LLVM. `__vector_pair` and `__vector_quad` are generated as
`v256i1` and `v512i1` respectively (both are currently unused in the PowerPC
backend). VSX pair registers will be allocated to the `v256i1` type and the
new accumulator registers will be allocated to the `v512i1` type.

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

Re: RFC: Supporting the new PowerPC MMA instructions in Clang/LLVM

Keane, Erich via cfe-dev


On Mon, 27 Jul 2020 at 13:28, Hal Finkel <[hidden email]> wrote:


On 7/20/20 11:13 PM, Baptiste Saleil wrote:


On Mon, 22 Jun 2020 at 19:01, Hal Finkel <[hidden email]> wrote:


On 6/19/20 9:31 PM, Baptiste Saleil via cfe-dev wrote:
Summary
-------

New Power ISA v3.1 [0] introduces instructions to accelerate matrix
multiplication. We want to expose these instructions through a list of
target-dependent builtins and new Clang types in the form of a language
extension. This RFC gives more details on the requirements for these
types and explains how we (IBM) are implementing them in Clang.

We present the frontend implementation as an RFC because we need to add
target-specific checks in Sema and want to get feedback on our implementation
of these checks. The backend implementation does not impact the other targets
so it is not part of this RFC. Comments and questions are welcome.

Introduction
------------

The new instructions manipulate matrices that the CPU represents by new 512-bit
registers called `accumulators`. Copying matrices, modifying values and
extracting values of matrices may cause the CPU to copy values from/to the
matrix multiplication unit. To avoid degrading performance, we thus want to
minimize the number of times these operations are used. So the user will be able
to modify and extract values of the matrices and perform computations with them
by using the dedicated builtins only. The instructions are designed to be used in
computational kernels and we want to enforce that specific workflow.

Because of this restriction, we cannot rely on the target-independent matrix
types [1].


If this is part of the documented system ABI, and what will be supported by GCC, then we should support it too.

That having been said, I'm not convinced that this is a good idea, and supporting the target-independent matrix types would be better. I understand that the copying will be expensive, and is something that should be avoided, but this is true to some extent for everything: there are some usages that compile to machine code efficiently and some that don't. We generally, however, favor the ability to create abstractions that *can* be compiled efficiently as part of expected use cases, even if we cannot guarantee that all uses will produce efficient code. In this case, you're prohibiting the creation of abstractions (by semantically restricting to local variables) because you fear that not all uses will compile to efficient code. Are there some other structural reasons why supporting these as regular values would be problematic?

Supporting these as regular values would be problematic for several reasons. These new accumulator registers are actually each associated with 4 of the existing 128-bit VSR vector registers. A particularity of MMA is that when an accumulator contains defined data, its 4 associated registers contain undefined data and cannot be used. When copying an accumulator, we need to:
  1. Copy its value back to its four associated VSRs
  2. Copy these 4 VSRs to the VSRs associated with the destination accumulator
  3. Copy these VSRs to the destination accumulator
  4. If the copy is not a kill, copy the 4 VSRs associated with the source back to the source accumulator

So if these registers were supported as regular values, we would have really expensive copies (and also expensive function calls and returns


I don't see why you call four vector moves or memory accesses expensive. What's the alternative? The programmer needs to move the data around somehow anyway if that's the thing that they need to do.

The reason I say it's expensive is because moving an accumulator cannot be done by just moving its 4 associated VSRs. With the new ISA, before being able to use an accumulator, it needs to be primed: we need to generate an instruction (xxmtacc) to copy the 4 VSRs to the accumulator. Another particularity is that when an accumulator is primed, the data in its associated VSRs is undefined, so to get the data back to the VSRs, the accumulator needs to be unprimed: we need to generate an instruction (xxmfacc) to copy the accumulator back to its associated VSRs. That means that to copy an accumulator, we need to generate an xxmfacc, generate the VSR copies, generate an xxmtacc for the destination and if the copy is not a kill, generate an xxmtacc to also reprime the source.

The alternative would be to try to prevent the user from copying these registers (this is what we want to do with Sema checks).
Because these registers are designed to be used for matrix multiplication, in typical kernels with a very limited set of features, the user shouldn't need to move the data around.
For a user, the classical way of using these registers for matrix multiplication is to prime the accumulators they want to use, then compute successive outer products in loops using the accumulators as builtin call arguments, and finally unprime the accumulators to get the result of the matrix multiplications.


) and each live accumulator would prevent us from using 4 vector registers. More importantly (something I should have mentioned in the RFC), the new instructions actually implement a single operation: the outer product. That means that supporting these as regular values would imply copying accumulators back to their associated VSRs and generating non-MMA instructions for any other operation anyway. Therefore, it is likely that programs using matrices would actually be less efficient.


I'm assuming that you'll model the registers explicitly (using RegisterTuples or similar in TableGen), so you'll end up with a collection of registers that alias appropriately with their underlying VSRs, and the general infrastructure will handle the details of copying, killing, and so on. Is that correct?

That's correct, these registers will be defined as any other register class in the PowerPC backend and we let the general infrastructure handle copying, killing, etc... 

If we add patterns for, say, adding, that use subregister extraction along with the underlying VSR instructions, then hopefully the infrastructure will coalesce away any unnecessary copies and we'll get the right "in place" matrix addition. To say that the types support only outer product is, based on my interpretation of your description, technically correct, but on the other hand, elementwise operations (e.g., add) can be directly supported using the underlying operations on the VSRs at reasonably-low cost. Is this correct?

I'm not sure exactly what our legalization framework does for matrix types currently, but presumably it should handle expansion for the other cases.

This is where the concept of prime/unprime is an issue. The infrastructure would allow supporting these operations at reasonably low cost (with generated code similar to what we have now), except that the accumulator would need to be unprimed (xxmfacc) before each operation and primed (xxmtacc) after. And there would be no way to avoid generating these instructions.

Regardless of what the frontend accepts, I would prefer to see, where possible, types modeled using generic LLVM types and operations.

Yes, we're trying to avoid using target-specialized code as much as possible.

Thanks,

Baptiste.



However, although we're not planning on supporting the target-independent matrix types for these reasons, we're not excluding supporting the target-independent matrix operations. We are exploring implementing the target-independent matrix multiplication operation with MMA kernels. That way, on PowerPC, programs using target-independent matrix types and operations would actually benefit from MMA for matrix multiplication with no additional effort.


That sounds good to me.

Thanks again,

Hal



Baptiste.


We need to add a new target-dependent type and restrict its use.
We give more details on these restrictions below. To be able to manipulate
these matrices, we want to add the `__vector_quad` type to Clang. This type
would be a PowerPC-specific builtin type mapped to the new 512-bit registers.


Okay.

 -Hal



Similarly, some of these instructions take 256-bit values that must be stored
in two consecutive VSX registers. To represent these values and minimize the
number of copies between VSX registers, we also want to add the PowerPC-specific
builtin type `__vector_pair` that would be mapped to consecutive VSX registers.

Value initialization
--------------------

The only way to initialize a `__vector_pair` is by calling a builtin taking two
128-bit vectors and assembling them to form a 256-bit pair. A similar builtin
exists to assemble four 128-bit vectors to form a 512-bit `__vector_quad`:

vector unsigned char v1 = ...;
vector unsigned char v2 = ...;
vector unsigned char v3 = ...;
vector unsigned char v4 = ...;
__vector_pair vp;
__vector_quad vq;
__builtin_mma_assemble_pair(&vp, v1, v2);
__builtin_mma_assemble_acc(&vq, v1, v2, v3, v4);

The other way to initialize a `__vector_quad` is to call a builtin mapped to an
instruction generating a new value of this type:

__vector_quad vq1;
__builtin_mma_xxsetaccz(&vq1); // zero-initializes vq1
__vector_quad vq2;
__builtin_mma_xvi4ger8(&vq2, v1, v2); // new value generated in vq2

Both `__vector_pair` and `__vector_quad` can also be loaded from pointers that
can potentially be cast from void or char pointers.

Value extraction
----------------

The only way to extract values from a matrix is to call the builtins
disassembling `__vector_pair` and `__vector_quad` values back into two
and four 128-bit vectors respectively:

vector unsigned char* vpr = ...;
vector unsigned char* vqr = ...;
__builtin_mma_disassemble_pair(vpr, &vp);
__builtin_mma_disassemble_acc(vqr, &vq);

Once the values are disassembled to vectors, the user can extract values as
usual, for example using the subscript operator on the vector unsigned char
values. So the typical workflow to efficiently use these instructions in a
kernel is to first initialize the matrices, then perform computations and finally
disassemble them to extract the result of the computations. These three steps
should be done using the provided builtins.

Semantics
---------

To enforce using values of these types in kernels, and thus avoid copies from/to
the matrix multiplication unit, we want to prevent as many implicit copies
as possible. That means that it should only be possible to declare values of
these types as local variables. We want to prevent any other way to declare and
use non-pointer variables of these types (global variable, function parameter,
function return, etc...).

The only situations in which these types and values of these types can be
used are:
  * Local variable declaration
  * Assignment operator
  * Builtin call parameter
  * Memory allocation
  * Typedef & alias

Implementation
--------------

We have implemented support for these types, builtins and intrinsics in both
Clang's frontend and the LLVM PowerPC backend. We will post the backend
implementation later. We implemented and tested this support out-of-tree in
conjunction with the GCC team to ensure a common API and source
compatibility. For this RFC, we have 5 patches for the frontend:
  * Add options to control MMA support on PowerPC targets [2].
  * Define the two new types as Clang target-dependent builtin types.
    As with the other targets, we decided to define these types in a separate
    `PPCtypes.def` file to improve extensibility in case we need to add other
    PowerPC-specific types in the future [3].
  * Add the builtin definitions. These builtins use the two new types,
    so they use custom type descriptors. To avoid pervasive changes,
    we use custom decoding of these descriptors [4].
  * Add the Sema checks to restrict the use of the two types.
    We prevent the use of non-pointer values of these types in any declaration
    that is not a local variable declaration. We also prevent them from being
    passed as function arguments and returned from functions [5].
  * Implement the minimal required changes to LLVM to support the builtins.
    In this patch, we enable the use of v256i1 for intrinsic arguments and
    define all the MMA intrinsics the builtins are mapped to [6].

The backend implementation should not impact other targets. We do not plan to
add any type to LLVM. `__vector_pair` and `__vector_quad` are generated as
`v256i1` and `v512i1` respectively (both are currently unused in the PowerPC
backend). VSX pair registers will be allocated to the `v256i1` type and the
new accumulator registers will be allocated to the `v512i1` type.

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory