Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen


Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

Nathan Ridge via cfe-dev
Hi,

Alexey Bataev and I (Lingda Li) would like to draw your attention to an ongoing discussion of 2 schemes to implement the OpenMP 5.0 declare mapper. The detailed discussion can be found at https://reviews.llvm.org/D59474

Scheme 1 (the one I have implemented in https://reviews.llvm.org/D59474):
For each mapper function, the compiler generates a function like this:

```
void <type>.mapper(void *base, void *begin, size_t size, int64_t type) {
  // Allocate space for an array section first.
  if (size > 1 && !maptype.IsDelete)
     <push>(base, begin, size*sizeof(Ty), clearToFrom(type));
   
  // Map members.
  for (unsigned i = 0; i < size; i++) {
     // For each component specified by this mapper:
     for (auto c : components) {
       ...; // code to generate c.arg_base, c.arg_begin, c.arg_size, c.arg_type
       if (c.hasMapper())
         (*c.Mapper())(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
       else
         <push>(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
     }
  }
  // Delete the array section.
  if (size > 1 && maptype.IsDelete)
    <push>(base, begin, size*sizeof(Ty), clearToFrom(type));
}
```
This function is passed to the OpenMP runtime, and the runtime will call this function to finish the data mapping.
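
For reference, a minimal sketch of what the runtime side of scheme 1 could look like, assuming <push> simply appends an entry to a growable container (as discussed later in this thread). All names here (MapComponent, pushComponent, runMapper) are illustrative, not the actual libomptarget API:

```
#include <cstddef>
#include <cstdint>
#include <vector>

// One entry per component pushed by a mapper function.
struct MapComponent {
  void *Base;
  void *Begin;
  size_t Size;
  int64_t Type;
};

// Signature of a compiler-generated <type>.mapper function.
using MapperFn = void (*)(void *Base, void *Begin, size_t Size, int64_t Type);

// Components collected for one mapping operation (per-thread/per-call handling
// is omitted in this sketch).
static std::vector<MapComponent> Components;

// What <push> boils down to in this sketch: a single push_back.
void pushComponent(void *Base, void *Begin, size_t Size, int64_t Type) {
  Components.push_back({Base, Begin, Size, Type});
}

// The runtime invokes the mapper to collect all components, then performs the
// actual host<->device transfers in one pass over Components (not shown).
void runMapper(MapperFn Mapper, void *Base, void *Begin, size_t Size,
               int64_t Type) {
  Components.clear();
  Mapper(Base, Begin, Size, Type);
  // ... issue the real mapping for every entry in Components ...
}
```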


Scheme 2 (proposed by Alexey):
Move parts of the mapper function above into the OpenMP runtime, so the compiler will generate the code below:
```
void <type>.mapper(void *base, void *begin, size_t size, int64_t type) {
  ...; // code to generate arg_base, arg_begin, arg_size, arg_type, arg_mapper.
  auto sub_components[] = {...}; // fill in generated begin, base, ...
  __tgt_mapper(base, begin, size, type, sub_components);
}
```

`__tgt_mapper` is a runtime function, sketched below:
```
void __tgt_mapper(void *base, void *begin, size_t size, int64_t type, auto components[]) {
  // Allocate space for an array section first.
  if (size > 1 && !maptype.IsDelete)
     <push>(base, begin, size*sizeof(Ty), clearToFrom(type));
   
  // Map members.
  for (unsigned i = 0; i < size; i++) {
     // For each component specified by this mapper:
     for (auto c : components) {
       if (c.hasMapper())
         (*c.Mapper())(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
       else
         <push>(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
     }
  }
  // Delete the array section.
  if (size > 1 && maptype.IsDelete)
    <push>(base, begin, size*sizeof(Ty), clearToFrom(type));
}
```
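
For concreteness, one possible shape for the sub_components entries that scheme 2 would have the compiler materialize. The field names simply mirror arg_base/arg_begin/arg_size/arg_type/arg_mapper above; the actual layout would have to be fixed by the compiler/runtime ABI:

```
#include <cstddef>
#include <cstdint>

// One element of the sub_components array passed to __tgt_mapper.
struct SubComponent {
  void *Base;                                        // arg_base
  void *Begin;                                       // arg_begin
  size_t Size;                                       // arg_size
  int64_t Type;                                      // arg_type
  void (*Mapper)(void *, void *, size_t, int64_t);   // arg_mapper, null if none
};
```

In this scheme __tgt_mapper would also need to know how many entries the array holds, e.g. via an extra count parameter (elided above), which ties into the MEMBER_OF fix-up discussed below.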

Comparison:
Why choose 1 over 2:
1. In scheme 2, the compiler needs to generate all map types and pass them to __tgt_mapper through sub_components. But in this case, the compiler won't be able to generate the correct MEMBER_OF field in the map type, so the runtime has to fix it up using the mechanism we already have (__tgt_mapper_num_components). This not only increases complexity, it also means the runtime must further manipulate the map type, which creates locality issues. In the current scheme, the map type is generated by the compiler once, so data locality is much better.
2. In scheme 2, sub_components includes all components that should be mapped. If we are mapping an array, we need to map many components, which requires allocating memory for sub_components on the heap. This adds a memory-management burden and is not an efficient way to use memory.
3. In scheme 1, we are able to inline nested mapper functions. As a result, the compiler can further optimize the mapper function, e.g., eliminate redundant computation or unroll loops, and thus potentially achieve better performance. We cannot get these optimizations in scheme 2.

Why choose 2 over 1:
1. Less code in the mapper function codegen (I doubt this claim, since the codegen function for scheme 1 is under 200 lines of code).


We would appreciate it if you could share your opinions.

Thanks,
Lingda Li

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

Nathan Ridge via cfe-dev


On Fri, Jun 28, 2019 at 9:49 AM Li, Lingda <[hidden email]> wrote:
I don't think we can have the buffer allocated within the mapper function. It has to be done in the runtime, because of nested mappers.
First, all mapper functions are generated the same way. We cannot make the outermost mapper function allocate memory while the inner ones don't and instead have to use what the outermost mapper function allocated.
I suppose we still need to allocate memory in the runtime, so the runtime can pass the pointer and size to the mapper function, and the outer mapper function can then pass them into inner ones.
Again, this is just like the current implementation, except that instead of vector::push_back() we use something like a manual implementation of push_back (because we need to use the pointer and the current index).

I believe the key question here is whether it is true that (the overhead of push_back()) > (the overhead of precalculating the total number + the memory allocation overhead + direct memory writes). This will decide whether this change is necessary. Any opinions?
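
To make that comparison concrete, a rough sketch of the two filling strategies (hypothetical helper names; neither is meant as the actual implementation):

```
#include <cstddef>
#include <cstdint>
#include <vector>

struct MapComponent { void *Base; void *Begin; size_t Size; int64_t Type; };

// Current scheme: the mapper pushes into a growable vector owned by the runtime.
void fillByPushBack(std::vector<MapComponent> &V, void *Base, void *Begin,
                    size_t Size, int64_t Type) {
  V.push_back({Base, Begin, Size, Type});  // may occasionally reallocate/copy
}

// Alternative: the total count is precalculated, a buffer of exactly that many
// entries is allocated once, and the mapper writes through a running index.
void fillPreallocated(MapComponent *Buf, size_t &Idx, void *Base, void *Begin,
                      size_t Size, int64_t Type) {
  Buf[Idx++] = {Base, Begin, Size, Type};  // plain store, no growth logic
}
```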

Thanks,
Lingda Li

From: Alexey Bataev <[hidden email]>
Sent: Thursday, June 27, 2019 5:05 PM
To: Li, Lingda
Cc: Alexandre Eichenberger; Chapman, Barbara (Contact); Kevin K O'Brien; Carlo Bertolli; Deepak Eachempati; Denny, Joel E.; David Oehmke; Ettore Tiotto; [hidden email]; Rokos, Georgios; Gheorghe-Teod Bercea; [hidden email]; Hal Finkel; Sharif, Hashim; Cownie, James H; Sjodin, Jan; [hidden email]; Doerfert, Johannes Rudolf; Jones, Jeff C; [hidden email]; Robichaux, Joseph; Jeff Heath; [hidden email]; Kelvin Li; Bobrovsky, Konstantin S; Kotsifakou, Maria; [hidden email]; Lopez, Matthew Graham; Menard, Lorri; Martin Kong; Sarah McNamara; Rice, Michael P; Matt Martineau; [hidden email]; Jeeva Paudel; Rao, Premanand M; Krishnaiyer, Rakesh; Narayanaswamy, Ravi; Monteleone, Robert; Lieberman, Ron; Samuel Antao; Jeffrey Sandoval; Sunita Chandrasekaran; [hidden email]; Sergio Pino Gallardo; Dmitriev, Serguei N; Chan, SiuChi; Sunil Shrestha; Wilmarth, Terry L; Tianyi Zhang; [hidden email]; Wang Chen; Wael Yehia; Tian, Xinmin
Subject: Re: Re: Re: RE: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen
 
Yes, we need 2 functions, but the first one can be optimized very effectively. After the optimizations and inlining it will end up with just return s1+s2+s3... I think in most cases those sizes will be constant, since the mapper maps a constant number of elements, and thus this expression will be optimized to just a constant value.
You don't need to pass these functions to the runtime. We can call them directly from compiler-generated code.
1st call: get the number of elements.
2nd call: allocate the buffer.
3rd call: call the mapper with this preallocated buffer; it fills the buffer without any calls to runtime functions.
4th call: call the runtime to pass it the buffer.
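
A sketch of how that compiler-generated call sequence might be strung together; all names are hypothetical and the two per-mapper helpers are shown only as stand-ins for compiler-generated code:

```
#include <cstddef>
#include <cstdint>
#include <cstdlib>

struct MapComponent { void *Base; void *Begin; size_t Size; int64_t Type; };

// Stand-ins for the two compiler-generated functions per mapper and for the
// final runtime hand-off.
using NumComponentsFn = size_t (*)(size_t Size);
using FillFn = void (*)(MapComponent *Buf, size_t *Idx, void *Base,
                        void *Begin, size_t Size, int64_t Type);
using HandOffFn = void (*)(MapComponent *Buf, size_t Num);

void emitMaps(NumComponentsFn NumComponents, FillFn Fill, HandOffFn HandOff,
              void *Base, void *Begin, size_t Size, int64_t Type) {
  size_t Num = NumComponents(Size);                      // 1st call: element count
  auto *Buf = static_cast<MapComponent *>(
      std::malloc(Num * sizeof(MapComponent)));          // 2nd: allocate buffer
  size_t Idx = 0;
  Fill(Buf, &Idx, Base, Begin, Size, Type);              // 3rd: fill, no runtime calls
  HandOff(Buf, Num);                                     // 4th: hand buffer to runtime
  std::free(Buf);
}
```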

Best regards,
Alexey Bataev

On June 27, 2019, at 16:53, Li, Lingda <[hidden email]> wrote:

If we precalculate the size, it means we need to generate 2 functions for each mapper rather than 1 as we do now: one that fills in the mapping information as we have today, and another that calculates the size (this one will not return constant values, because the size depends on how many instances we are mapping). Both functions would need to be passed to the runtime. The runtime would then precalculate the number of components first, allocate memory, and finally call the mapper function to fill it up.

Compared with scheme 1, the differences are:
1) An extra call to calculate the total number of components, which scheme 1 does not need;
2) A preallocated buffer, whose pointer and current index must be passed to the mapper function so it can fill in the components, whereas scheme 1 uses push_back() to do the same thing.

Is there really a benefit to doing this? push_back() should be efficient enough compared with writing directly to memory.

If people here think that, the overhead of push_back() > the overhead of precalculating the total number + the memory allocation overhead + directly memory write, then we can consider this scheme.

Thanks,
Lingda Li



From: Alexey Bataev <[hidden email]>
Sent: Thursday, June 27, 2019 4:26 PM
To: Li, Lingda
Cc: Alexandre Eichenberger; Chapman, Barbara (Contact); Kevin K O'Brien; Carlo Bertolli; Deepak Eachempati; Denny, Joel E.; David Oehmke; Ettore Tiotto; [hidden email]; Rokos, Georgios; Gheorghe-Teod Bercea; [hidden email]; Hal Finkel; Sharif, Hashim; Cownie, James H; Sjodin, Jan; [hidden email]; Doerfert, Johannes Rudolf; Jones, Jeff C; [hidden email]; Robichaux, Joseph; Jeff Heath; [hidden email]; Kelvin Li; Bobrovsky, Konstantin S; Kotsifakou, Maria; [hidden email]; Lopez, Matthew Graham; Menard, Lorri; Martin Kong; Sarah McNamara; Rice, Michael P; Matt Martineau; [hidden email]; Jeeva Paudel; Rao, Premanand M; Krishnaiyer, Rakesh; Narayanaswamy, Ravi; Monteleone, Robert; Lieberman, Ron; Samuel Antao; Jeffrey Sandoval; Sunita Chandrasekaran; [hidden email]; Sergio Pino Gallardo; Dmitriev, Serguei N; Chan, SiuChi; Sunil Shrestha; Wilmarth, Terry L; Tianyi Zhang; [hidden email]; Wang Chen; Wael Yehia; Tian, Xinmin
Subject: Re: Re: RE: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen
 

If the functions intended for size precalculation are inlined, they can be optimized very effectively, since in most cases they will return constant values.
If we could do this, we wouldn't need vectors and push_backs; we could use preallocated memory and an internal counter.
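
A tiny illustration of the constant-folding argument, with hypothetical per-mapper component-count helpers (an inner mapper mapping two members, an outer mapper mapping one member plus the nested struct):

```
#include <cstddef>

constexpr size_t innerMapperNumComponents() { return 2; }
constexpr size_t outerMapperNumComponents() { return 1 + innerMapperNumComponents(); }

// After inlining, the outer count folds to the constant 3, so the buffer size
// is known at compile time and no vector growth is needed.
static_assert(outerMapperNumComponents() == 3, "folds to a constant");
```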
--------------
Best regards,
Alexey Bataev


From: "Li, Lingda" <[hidden email]>
To: Alexey Bataev <[hidden email]>, Deepak Eachempati <[hidden email]>
Cc: "Narayanaswamy, Ravi" <[hidden email]>, "Alexandre Eichenberger" <[hidden email]>, "Chapman, Barbara (Contact)" <[hidden email]>, "Bobrovsky, Konstantin S" <[hidden email]>, Carlo Bertolli <[hidden email]>, "Chan, SiuChi" <[hidden email]>, "Cownie, James H" <[hidden email]>, David Oehmke <[hidden email]>, "Denny, Joel E." <[hidden email]>, "Dmitriev, Serguei N" <[hidden email]>, "Doerfert, Johannes Rudolf" <[hidden email]>, Ettore Tiotto <[hidden email]>, "[hidden email]" <[hidden email]>, Gheorghe-Teod Bercea <[hidden email]>, Hal Finkel <[hidden email]>, "[hidden email]" <[hidden email]>, Jeeva Paudel <[hidden email]>, Jeff Heath <[hidden email]>, Jeffrey Sandoval <[hidden email]>, "Jones, Jeff C" <[hidden email]>, "[hidden email]" <[hidden email]>, Kelvin Li <[hidden email]>, "Kevin K O'Brien" <[hidden email]>, "[hidden email]" <[hidden email]>, "Kotsifakou, Maria" <[hidden email]>, "Krishnaiyer, Rakesh" <[hidden email]>, "Lieberman, Ron" <[hidden email]>, "Lopez, Matthew Graham" <[hidden email]>, "[hidden email]" <[hidden email]>, Martin Kong <[hidden email]>, Matt Martineau <[hidden email]>, "Menard, Lorri" <[hidden email]>, "Monteleone, Robert" <[hidden email]>, "[hidden email]" <[hidden email]>, "Rao, Premanand M" <[hidden email]>, "Rice, Michael P" <[hidden email]>, "Robichaux, Joseph" <[hidden email]>, "[hidden email]" <[hidden email]>, "Rokos, Georgios" <[hidden email]>, Samuel Antao <[hidden email]>, "Sarah McNamara" <[hidden email]>, "[hidden email]" <[hidden email]>, Sergio Pino Gallardo <[hidden email]>, "Sharif, Hashim" <[hidden email]>, "Sjodin, Jan" <[hidden email]>, Sunil Shrestha <[hidden email]>, Sunita Chandrasekaran <[hidden email]>, "Tian, Xinmin" <[hidden email]>, Tianyi Zhang <[hidden email]>, "[hidden email]" <[hidden email]>, Wael Yehia <[hidden email]>, Wang Chen <[hidden email]>, "Wilmarth, Terry L" <[hidden email]>
Date: 06/27/2019 04:13 PM
Subject: [EXTERNAL] Re: RE: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen





Hi Alexey,

I think that's why we chose to use variable-size storage like std::vector to store the mapping information in the first place, right? It would be costly to precalculate the total number of components, especially in the presence of nested mappers. Besides, a runtime function call is just a std::vector::push_back, so I think it's okay to have multiple function calls.

Thanks,
Lingda Li


From: Alexey Bataev <[hidden email]>
Sent:
Thursday, June 27, 2019 3:52 PM
To:
Deepak Eachempati
Cc:
Li, Lingda; Narayanaswamy, Ravi; Alexandre Eichenberger; Chapman, Barbara (Contact); Bobrovsky, Konstantin S; Carlo Bertolli; Chan, SiuChi; Cownie, James H; David Oehmke; Denny, Joel E.; Dmitriev, Serguei N; Doerfert, Johannes Rudolf; Ettore Tiotto; [hidden email]; Gheorghe-Teod Bercea; Hal Finkel; [hidden email]; Jeeva Paudel; Jeff Heath; Jeffrey Sandoval; Jones, Jeff C; [hidden email]; Kelvin Li; Kevin K O'Brien; [hidden email]; Kotsifakou, Maria; Krishnaiyer, Rakesh; Lieberman, Ron; Lopez, Matthew Graham; [hidden email]; Martin Kong; Matt Martineau; Menard, Lorri; Monteleone, Robert; [hidden email]; Rao, Premanand M; Rice, Michael P; Robichaux, Joseph; [hidden email]; Rokos, Georgios; Samuel Antao; Sarah McNamara; [hidden email]; Sergio Pino Gallardo; Sharif, Hashim; Sjodin, Jan; Sunil Shrestha; Sunita Chandrasekaran; Tian, Xinmin; Tianyi Zhang; [hidden email]; Wael Yehia; Wang Chen; Wilmarth, Terry L
Subject:
Re: RE: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

Lingda, can we in scheme 1 precalculate the total number of components, allocate memory for that precalculated number of elements, then fill it with the mappings, and only after that call the runtime function (only once!) to transfer the mappings to the runtime?

Best regards,
Alexey Bataev

On June 27, 2019, at 15:44, Deepak Eachempati <[hidden email]> wrote:
      Got it. Thanks.

      -- Deepak

      From: Li, Lingda [[hidden email]]
      Sent:
      Thursday, June 27, 2019 2:41 PM
      To:
      Deepak Eachempati <[hidden email]>; Narayanaswamy, Ravi <[hidden email]>; 'Alexandre Eichenberger' <[hidden email]>; 'Alexey Bataev' <[hidden email]>; Chapman, Barbara (Contact) <[hidden email]>; Bobrovsky, Konstantin S <[hidden email]>; 'Carlo Bertolli' <[hidden email]>; 'Chan, SiuChi' <[hidden email]>; Cownie, James H <[hidden email]>; David Oehmke <[hidden email]>; 'Denny, Joel E.' <[hidden email]>; Dmitriev, Serguei N <[hidden email]>; Doerfert, Johannes Rudolf <[hidden email]>; 'Ettore Tiotto' <[hidden email]>; '[hidden email]' <[hidden email]>; 'Gheorghe-Teod Bercea' <[hidden email]>; Hal Finkel <[hidden email]>; '[hidden email]' <[hidden email]>; 'Jeeva Paudel' <[hidden email]>; 'Jeff Heath' <[hidden email]>; Jeffrey Sandoval <[hidden email]>; Jones, Jeff C <[hidden email]>; '[hidden email]' <[hidden email]>; 'Kelvin Li' <[hidden email]>; 'Kevin K O'Brien' <[hidden email]>; '[hidden email]' <[hidden email]>; 'Kotsifakou, Maria' <[hidden email]>; Krishnaiyer, Rakesh <[hidden email]>; Lieberman, Ron <[hidden email]>; 'Lopez, Matthew Graham' <[hidden email]>; '[hidden email]' <[hidden email]>; 'Martin Kong' <[hidden email]>; 'Matt Martineau' <[hidden email]>; Menard, Lorri <[hidden email]>; Monteleone, Robert <[hidden email]>; [hidden email]; Rao, Premanand M <[hidden email]>; Rice, Michael P <[hidden email]>; Robichaux, Joseph <[hidden email]>; [hidden email]; Rokos, Georgios <[hidden email]>; '[hidden email]' <[hidden email]>; 'Sarah McNamara' <[hidden email]>; '[hidden email]' <[hidden email]>; 'Sergio Pino Gallardo' <[hidden email]>; 'Sharif, Hashim' <[hidden email]>; Sjodin, Jan <[hidden email]>; Sunil Shrestha <[hidden email]>; 'Sunita Chandrasekaran' <[hidden email]>; Tian, Xinmin <[hidden email]>; Tianyi Zhang <[hidden email]>; '[hidden email]' <[hidden email]>; 'Wael Yehia' <[hidden email]>; 'Wang Chen' <[hidden email]>; Wilmarth, Terry L <[hidden email]>
      Subject:
      Re: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

       In the current scheme, all mappings within a mapper function are done atomically by one thread. In the mapper function of the example in the original email, <push> just pushes the mapping information into an internal data structure. Once all the mapping information is available, the runtime performs the actual mapping in one go. For your example, the behavior is the same as the code below:

      ...
       #pragma omp parallel num_threads(2)
       {
         if (omp_get_thread_num() == 0) {
           #pragma omp target map(s.x, s.p[0:s.x])
           {
             for (int i = 0; i < s.x; i++) s.p[i] = i;
           }
         } else {
           #pragma omp target map(other_data)
           {
             // work on other_data
           }
         }
       }
       ...

      From: Deepak Eachempati <[hidden email]>
      Sent:
      Thursday, June 27, 2019 3:34 PM
      To:
      Li, Lingda; Narayanaswamy, Ravi; 'Alexandre Eichenberger'; 'Alexey Bataev'; Chapman, Barbara (Contact); Bobrovsky, Konstantin S; 'Carlo Bertolli'; 'Chan, SiuChi'; Cownie, James H; David Oehmke; 'Denny, Joel E.'; Dmitriev, Serguei N; Doerfert, Johannes Rudolf ; 'Ettore Tiotto'; '[hidden email]'; 'Gheorghe-Teod Bercea'; Hal Finkel; '[hidden email]'; 'Jeeva Paudel'; 'Jeff Heath'; Jeffrey Sandoval; Jones, Jeff C; '[hidden email]'; 'Kelvin Li'; 'Kevin K O'Brien'; '[hidden email]'; 'Kotsifakou, Maria'; Krishnaiyer, Rakesh; Lieberman, Ron ; 'Lopez, Matthew Graham'; '[hidden email]'; 'Martin Kong'; 'Matt Martineau'; Menard, Lorri; Monteleone, Robert; [hidden email]; Rao, Premanand M; Rice, Michael P; Robichaux, Joseph; [hidden email]; Rokos, Georgios; '[hidden email]'; 'Sarah McNamara'; '[hidden email]'; 'Sergio Pino Gallardo'; 'Sharif, Hashim'; Sjodin, Jan ; Sunil Shrestha; 'Sunita Chandrasekaran'; Tian, Xinmin; Tianyi Zhang; '[hidden email]'; 'Wael Yehia'; 'Wang Chen'; Wilmarth, Terry L
      Subject:
      RE: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

      I was referring to something like this, where another thread is not trying to map the same data:

      #pragma omp declare mapper(S s) map(s.x) map(s.p[0:s.x])
      S s;
      ...
       #pragma omp parallel num_threads(2)
       {
         if (omp_get_thread_num() == 0) {
           #pragma omp target map(s)
           {
             for (int i = 0; i < s.x; i++) s.p[i] = i;
           }
         } else {
           #pragma omp target map(other_data)
           {
             // work on other_data
           }
         }
       }
       ...

      Since I believe you are mapping s.x and s.p as separate map operations, it is possible that another thread could map ‘other_data’ in between those two maps. If this happens, will your implementation still ensure that s.x and s.p are positioned at the right offsets with respect to the same base address (&s)?

      -- Deepak

      From: Li, Lingda [[hidden email]]
      Sent:
      Thursday, June 27, 2019 2:26 PM
      To:
      Deepak Eachempati <[hidden email]>; Narayanaswamy, Ravi <[hidden email]>; 'Alexandre Eichenberger' <[hidden email]>; 'Alexey Bataev' <[hidden email]>; Chapman, Barbara (Contact) <[hidden email]>; Bobrovsky, Konstantin S <[hidden email]>; 'Carlo Bertolli' <[hidden email]>; 'Chan, SiuChi' <[hidden email]>; Cownie, James H <[hidden email]>; David Oehmke <[hidden email]>; 'Denny, Joel E.' <[hidden email]>; Dmitriev, Serguei N <[hidden email]>; Doerfert, Johannes Rudolf <[hidden email]>; 'Ettore Tiotto' <[hidden email]>; '[hidden email]' <[hidden email]>; 'Gheorghe-Teod Bercea' <[hidden email]>; Hal Finkel <[hidden email]>; '[hidden email]' <[hidden email]>; 'Jeeva Paudel' <[hidden email]>; 'Jeff Heath' <[hidden email]>; Jeffrey Sandoval <[hidden email]>; Jones, Jeff C <[hidden email]>; '[hidden email]' <[hidden email]>; 'Kelvin Li' <[hidden email]>; 'Kevin K O'Brien' <[hidden email]>; '[hidden email]' <[hidden email]>; 'Kotsifakou, Maria' <[hidden email]>; Krishnaiyer, Rakesh <[hidden email]>; Lieberman, Ron <[hidden email]>; 'Lopez, Matthew Graham' <[hidden email]>; '[hidden email]' <[hidden email]>; 'Martin Kong' <[hidden email]>; 'Matt Martineau' <[hidden email]>; Menard, Lorri <[hidden email]>; Monteleone, Robert <[hidden email]>; [hidden email]; Rao, Premanand M <[hidden email]>; Rice, Michael P <[hidden email]>; Robichaux, Joseph <[hidden email]>; [hidden email]; Rokos, Georgios <[hidden email]>; '[hidden email]' <[hidden email]>; 'Sarah McNamara' <[hidden email]>; '[hidden email]' <[hidden email]>; 'Sergio Pino Gallardo' <[hidden email]>; 'Sharif, Hashim' <[hidden email]>; Sjodin, Jan <[hidden email]>; Sunil Shrestha <[hidden email]>; 'Sunita Chandrasekaran' <[hidden email]>; Tian, Xinmin <[hidden email]>; Tianyi Zhang <[hidden email]>; '[hidden email]' <[hidden email]>; 'Wael Yehia' <[hidden email]>; 'Wang Chen' <[hidden email]>; Wilmarth, Terry L <[hidden email]>
      Subject:
      Re: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

       When 2 threads try to concurrently map the same data, it behaves the same as when 2 threads concurrently map the same data using map clauses; mappers don't introduce extra considerations here. For instance, both threads could issue #pragma omp target enter data concurrently.

       When 2 threads concurrently map the same data, my understanding based on the current code is that it will create 2 copies of the same data, and either copy is correct to use. There may be a problem if both copies are mapped back without proper synchronization, but that is a programming issue, not the responsibility of OpenMP.

      Thanks,
      Lingda Li

      From: Deepak Eachempati <[hidden email]>
      Sent:
      Thursday, June 27, 2019 3:17 PM
      To:
      Li, Lingda; Narayanaswamy, Ravi; 'Alexandre Eichenberger'; 'Alexey Bataev'; Chapman, Barbara (Contact); Bobrovsky, Konstantin S; 'Carlo Bertolli'; 'Chan, SiuChi'; Cownie, James H; David Oehmke; 'Denny, Joel E.'; Dmitriev, Serguei N; Doerfert, Johannes Rudolf ; 'Ettore Tiotto'; '[hidden email]'; 'Gheorghe-Teod Bercea'; Hal Finkel; '[hidden email]'; 'Jeeva Paudel'; 'Jeff Heath'; Jeffrey Sandoval; Jones, Jeff C; '[hidden email]'; 'Kelvin Li'; 'Kevin K O'Brien'; '[hidden email]'; 'Kotsifakou, Maria'; Krishnaiyer, Rakesh; Lieberman, Ron ; 'Lopez, Matthew Graham'; '[hidden email]'; 'Martin Kong'; 'Matt Martineau'; Menard, Lorri; Monteleone, Robert; [hidden email]; Rao, Premanand M; Rice, Michael P; Robichaux, Joseph; [hidden email]; Rokos, Georgios; '[hidden email]'; 'Sarah McNamara'; '[hidden email]'; 'Sergio Pino Gallardo'; 'Sharif, Hashim'; Sjodin, Jan ; Sunil Shrestha; 'Sunita Chandrasekaran'; Tian, Xinmin; Tianyi Zhang; '[hidden email]'; 'Wael Yehia'; 'Wang Chen'; Wilmarth, Terry L
      Subject:
      RE: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

      Thanks.

       Is it possible for another thread to be concurrently mapping something else while the maps from the mapper function are taking place? If so, how do you guarantee that the allocation for each component will get you the right addresses in device memory? Sorry if this was covered before and I missed it.

      -- Deepak

      From: Li, Lingda [[hidden email]]
      Sent:
      Thursday, June 27, 2019 2:08 PM
      To:
      Deepak Eachempati <[hidden email]>; Narayanaswamy, Ravi <[hidden email]>; 'Alexandre Eichenberger' <[hidden email]>; 'Alexey Bataev' <[hidden email]>; Chapman, Barbara (Contact) <[hidden email]>; Bobrovsky, Konstantin S <[hidden email]>; 'Carlo Bertolli' <[hidden email]>; 'Chan, SiuChi' <[hidden email]>; Cownie, James H <[hidden email]>; David Oehmke <[hidden email]>; 'Denny, Joel E.' <[hidden email]>; Dmitriev, Serguei N <[hidden email]>; Doerfert, Johannes Rudolf <[hidden email]>; 'Ettore Tiotto' <[hidden email]>; '[hidden email]' <[hidden email]>; 'Gheorghe-Teod Bercea' <[hidden email]>; Hal Finkel <[hidden email]>; '[hidden email]' <[hidden email]>; 'Jeeva Paudel' <[hidden email]>; 'Jeff Heath' <[hidden email]>; Jeffrey Sandoval <[hidden email]>; Jones, Jeff C <[hidden email]>; '[hidden email]' <[hidden email]>; 'Kelvin Li' <[hidden email]>; 'Kevin K O'Brien' <[hidden email]>; '[hidden email]' <[hidden email]>; 'Kotsifakou, Maria' <[hidden email]>; Krishnaiyer, Rakesh <[hidden email]>; Lieberman, Ron <[hidden email]>; 'Lopez, Matthew Graham' <[hidden email]>; '[hidden email]' <[hidden email]>; 'Martin Kong' <[hidden email]>; 'Matt Martineau' <[hidden email]>; Menard, Lorri <[hidden email]>; Monteleone, Robert <[hidden email]>; [hidden email]; Rao, Premanand M <[hidden email]>; Rice, Michael P <[hidden email]>; Robichaux, Joseph <[hidden email]>; [hidden email]; Rokos, Georgios <[hidden email]>; '[hidden email]' <[hidden email]>; 'Sarah McNamara' <[hidden email]>; '[hidden email]' <[hidden email]>; 'Sergio Pino Gallardo' <[hidden email]>; 'Sharif, Hashim' <[hidden email]>; Sjodin, Jan <[hidden email]>; Sunil Shrestha <[hidden email]>; 'Sunita Chandrasekaran' <[hidden email]>; Tian, Xinmin <[hidden email]>; Tianyi Zhang <[hidden email]>; '[hidden email]' <[hidden email]>; 'Wael Yehia' <[hidden email]>; 'Wang Chen' <[hidden email]>; Wilmarth, Terry L <[hidden email]>
      Subject:
      Re: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

      Hi Deepak,

       Yes, it handles this case. The first part of the mapper function (which initially allocates space for the whole array) is just an optimization, not required for correctness, as you suggested in an earlier discussion.

      In your example, s.x and s.p will be allocated separately (not in a single allocation). But Clang guarantees that their addresses will be correct because s.x and s.p share the same base address, which is &s.
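
       As an illustration in the <push> notation from the first email (the map-type values are placeholders, not the exact libomptarget flags), the mapper for one instance s would record something like:

       <push>(&s,   &s.x, sizeof(s.x),       <member map type>);            // s.x, base is &s
       <push>(&s,   &s.p, sizeof(s.p),       <member map type>);            // the pointer s.p itself, base is &s
       <push>(&s.p, s.p,  s.x * sizeof(int), <pointer/section map type>);   // array section s.p[0:s.x]

       Because the first two entries carry the same base (&s), the runtime can keep their device offsets consistent even though they are separate map operations.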

      Thanks,
      Lingda Li

      From: Deepak Eachempati <[hidden email]>
      Sent:
      Thursday, June 27, 2019 2:49 PM
      To:
      Li, Lingda; Narayanaswamy, Ravi; 'Alexandre Eichenberger'; 'Alexey Bataev'; Chapman, Barbara (Contact); Bobrovsky, Konstantin S; 'Carlo Bertolli'; 'Chan, SiuChi'; Cownie, James H; David Oehmke; 'Denny, Joel E.'; Dmitriev, Serguei N; Doerfert, Johannes Rudolf ; '[hidden email]'; 'Ettore Tiotto'; '[hidden email]'; 'Gheorghe-Teod Bercea'; Hal Finkel; '[hidden email]'; 'Jeeva Paudel'; 'Jeff Heath'; Jeffrey Sandoval; Jones, Jeff C; '[hidden email]'; 'Kelvin Li'; 'Kevin K O'Brien'; '[hidden email]'; 'Kotsifakou, Maria'; Krishnaiyer, Rakesh; Lieberman, Ron ; 'Lopez, Matthew Graham'; '[hidden email]'; 'Martin Kong'; 'Matt Martineau'; Menard, Lorri; Monteleone, Robert; [hidden email]; Rao, Premanand M; Rice, Michael P; Robichaux, Joseph; [hidden email]; Rokos, Georgios; '[hidden email]'; 'Sarah McNamara'; '[hidden email]'; 'Sergio Pino Gallardo'; 'Sharif, Hashim'; Sjodin, Jan ; Sunil Shrestha; 'Sunita Chandrasekaran'; Tian, Xinmin; Tianyi Zhang; '[hidden email]'; 'Wael Yehia'; 'Wang Chen'; Wilmarth, Terry L
      Subject:
      RE: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

      For Scheme 1, it looks like you are doing separate maps for each component when size == 1. It seems like the first and last if statements should have “size >= 1” rather than “size > 1”.

      If the mapper is declared like this:

       struct S {
         int x;
         ... // other stuff
         int *p;
       };

      #pragma omp declare mapper(S s) map(s.x) map(s.p[0:s.x])

      And you have:

      S s;
      ...
      #pragma omp target map(s)
       {
         for (int i = 0; i < s.x; i++) s.p[i] = i;
       }

      Since the target construct is just mapping a single structure of type S, there should be one map that takes care of mapping storage for s.x and s.p with a single allocation, and a separate map for the array section s.p[0:s.x], and finally the pointer attachment of s.p to s.p[0:s.x]. Does Scheme 1 handle this?

      -- Deepak


      From: Li, Lingda [[hidden email]]
      Sent:
      Thursday, June 27, 2019 1:07 PM
      To:
      Narayanaswamy, Ravi <[hidden email]>; 'Alexandre Eichenberger' <[hidden email]>; 'Alexey Bataev' <[hidden email]>; Chapman, Barbara (Contact) <[hidden email]>; Bobrovsky, Konstantin S <[hidden email]>; 'Carlo Bertolli' <[hidden email]>; 'Chan, SiuChi' <[hidden email]>; Cownie, James H <[hidden email]>; David Oehmke <[hidden email]>; Deepak Eachempati <[hidden email]>; 'Denny, Joel E.' <[hidden email]>; Dmitriev, Serguei N <[hidden email]>; Doerfert, Johannes Rudolf <[hidden email]>; '[hidden email]' <[hidden email]>; 'Ettore Tiotto' <[hidden email]>; '[hidden email]' <[hidden email]>; 'Gheorghe-Teod Bercea' <[hidden email]>; Hal Finkel <[hidden email]>; '[hidden email]' <[hidden email]>; 'Jeeva Paudel' <[hidden email]>; 'Jeff Heath' <[hidden email]>; Jeffrey Sandoval <[hidden email]>; Jones, Jeff C <[hidden email]>; '[hidden email]' <[hidden email]>; 'Kelvin Li' <[hidden email]>; 'Kevin K O'Brien' <[hidden email]>; '[hidden email]' <[hidden email]>; 'Kotsifakou, Maria' <[hidden email]>; Krishnaiyer, Rakesh <[hidden email]>; Lieberman, Ron <[hidden email]>; Li, Lingda <[hidden email]>; 'Lopez, Matthew Graham' <[hidden email]>; '[hidden email]' <[hidden email]>; 'Martin Kong' <[hidden email]>; 'Matt Martineau' <[hidden email]>; Menard, Lorri <[hidden email]>; Monteleone, Robert <[hidden email]>; [hidden email]; Rao, Premanand M <[hidden email]>; Rice, Michael P <[hidden email]>; Robichaux, Joseph <[hidden email]>; [hidden email]; Rokos, Georgios <[hidden email]>; '[hidden email]' <[hidden email]>; 'Sarah McNamara' <[hidden email]>; '[hidden email]' <[hidden email]>; 'Sergio Pino Gallardo' <[hidden email]>; 'Sharif, Hashim' <[hidden email]>; Sjodin, Jan <[hidden email]>; Sunil Shrestha <[hidden email]>; 'Sunita Chandrasekaran' <[hidden email]>; Tian, Xinmin <[hidden email]>; Tianyi Zhang <[hidden email]>; '[hidden email]' <[hidden email]>; 'Wael Yehia' <[hidden email]>; 'Wang Chen' <[hidden email]>; Wilmarth, Terry L <[hidden email]>
      Subject:
      Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

      Hi,

       Alexey and I would like to draw your attention to an ongoing discussion of 2 schemes to implement the OpenMP 5.0 declare mapper. The detailed discussion can be found at https://reviews.llvm.org/D59474

       Scheme 1 (the one I have implemented in https://reviews.llvm.org/D59474):
      The detailed design can be found at https://github.com/lingda-li/public-sharing/blob/master/mapper_runtime_design.pptx
      For each mapper function, the compiler generates a function like this:

       ```
       void <type>.mapper(void *base, void *begin, size_t size, int64_t type) {
         // Allocate space for an array section first.
         if (size > 1 && !maptype.IsDelete)
           <push>(base, begin, size*sizeof(Ty), clearToFrom(type));

         // Map members.
         for (unsigned i = 0; i < size; i++) {
           // For each component specified by this mapper:
           for (auto c : components) {
             ...; // code to generate c.arg_base, c.arg_begin, c.arg_size, c.arg_type
             if (c.hasMapper())
               (*c.Mapper())(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
             else
               <push>(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
           }
         }
         // Delete the array section.
         if (size > 1 && maptype.IsDelete)
           <push>(base, begin, size*sizeof(Ty), clearToFrom(type));
       }
       ```
      This function is passed to the OpenMP runtime, and the runtime will call this function to finish the data mapping.


       Scheme 2 (proposed by Alexey):
       Move parts of the mapper function above into the OpenMP runtime, so the compiler will generate the code below:
       ```
       void <type>.mapper(void *base, void *begin, size_t size, int64_t type) {
         ...; // code to generate arg_base, arg_begin, arg_size, arg_type, arg_mapper.
         auto sub_components[] = {...}; // fill in generated begin, base, ...
         __tgt_mapper(base, begin, size, type, sub_components);
       }
       ```

       `__tgt_mapper` is a runtime function, sketched below:
       ```
       void __tgt_mapper(void *base, void *begin, size_t size, int64_t type, auto components[]) {
         // Allocate space for an array section first.
         if (size > 1 && !maptype.IsDelete)
           <push>(base, begin, size*sizeof(Ty), clearToFrom(type));

         // Map members.
         for (unsigned i = 0; i < size; i++) {
           // For each component specified by this mapper:
           for (auto c : components) {
             if (c.hasMapper())
               (*c.Mapper())(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
             else
               <push>(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
           }
         }
         // Delete the array section.
         if (size > 1 && maptype.IsDelete)
           <push>(base, begin, size*sizeof(Ty), clearToFrom(type));
       }
       ```

      Comparison:
       Why choose 1 over 2:
       1. In scheme 2, the compiler needs to generate all map types and pass them to __tgt_mapper through sub_components. But in this case, the compiler won't be able to generate the correct MEMBER_OF field in the map type, so the runtime has to fix it up using the mechanism we already have (__tgt_mapper_num_components). This not only increases complexity, it also means the runtime must further manipulate the map type, which creates locality issues. In the current scheme, the map type is generated by the compiler once, so data locality is much better.
       2. In scheme 2, sub_components includes all components that should be mapped. If we are mapping an array, we need to map many components, which requires allocating memory for sub_components on the heap. This adds a memory-management burden and is not an efficient way to use memory.
       3. In scheme 1, we are able to inline nested mapper functions. As a result, the compiler can further optimize the mapper function, e.g., eliminate redundant computation or unroll loops, and thus potentially achieve better performance. We cannot get these optimizations in scheme 2.

       Why choose 2 over 1:
       1. Less code in the mapper function codegen (I doubt this claim, since the codegen function for scheme 1 is under 200 lines of code)
      Alexey may have other reasons.

       We would appreciate it if you could share your thoughts.

      Thanks,
      Lingda Li

      From: Narayanaswamy, Ravi <[hidden email]>
      Sent:
      Wednesday, June 19, 2019 3:09 PM
      To:
      'Alexandre Eichenberger'; 'Alexey Bataev'; '[hidden email]'; Bobrovsky, Konstantin S; 'Carlo Bertolli'; 'Chan, SiuChi'; Cownie, James H; David Oehmke; Deepak Eachempati; 'Denny, Joel E.'; Dmitriev, Serguei N; Doerfert, Johannes Rudolf ; '[hidden email]'; 'Ettore Tiotto'; '[hidden email]'; 'Gheorghe-Teod Bercea'; '[hidden email]'; '[hidden email]'; 'Jeeva Paudel'; 'Jeff Heath'; Jeffrey Sandoval; Jones, Jeff C; '[hidden email]'; 'Kelvin Li'; 'Kevin K O'Brien'; '[hidden email]'; 'Kotsifakou, Maria'; Krishnaiyer, Rakesh; Lieberman, Ron ; '[hidden email]'; 'Lopez, Matthew Graham'; '[hidden email]'; 'Martin Kong'; 'Matt Martineau'; Menard, Lorri; Monteleone, Robert; Narayanaswamy, Ravi; 'Oscar R. Hernandez'; Rao, Premanand M; Rice, Michael P; Robichaux, Joseph; Rodgers, Gregory; Rokos, Georgios; '[hidden email]'; 'Sarah McNamara'; '[hidden email]'; 'Sergio Pino Gallardo'; 'Sharif, Hashim'; Sjodin, Jan ; Sunil Shrestha ([hidden email]); 'Sunita Chandrasekaran'; Tian, Xinmin; Tianyi Zhang; '[hidden email]'; 'Wael Yehia'; 'Wang Chen'; Wilmarth, Terry L
      Subject:
      OpenMP / HPC in Clang / LLVM Multi-company Telecom Meeting Minutes June 19th 2019

      Next Meeting : July 10th (Moved from July 3rd)

      Opens :
      - Documentation
       - Greg : Can we have documents for libopenmp and Libomptarget?
       - Alexey suggested having 3 documents: libopenmp, Libomptarget and the device plugin
       - Hal will convert the existing libomptarget document. Once done, others can update the document to capture the existing implementation.
       Future additions to libomptarget will also require updates to the document.
       - Next, the libopenmp document will be created if it does not exist, or updated if one exists.

      LTO for fat binary linking
       - Serguei (Intel) has an implementation which enables LTO and does away with linker scripts.
       Everybody agreed this is a good idea, especially since some linkers don't support linker scripts.
       AMD is interested in enabling LTO and would like to see the code.
       Serguei to post the code to get feedback from all.
       - Hal to present his proposal to support static fat archives using LTO at the next meeting.

      OpenMP 5.0 Features
      - No update on setting up the public website. Johannes was out attending ISC.
      - New features added since last release (courtesy of Kelvin)
      - allocate clause/allocate directive - parsing+sema, codegen
      - mutexinout dependence-type for task
      - user-defined mapper (declare mapper) - parsing+sema.
      - omp_get_device_num() API routine


      Development Activity
      - Async API
       Support in Clang and libopenmp, including lit tests, has been checked in by Doru.

      - Mapper support
       Initial support for the mapper has been posted for review by Lingda. Once approved, the rest of the support will be done.
       Lingda : Should the old API, which is being replaced by a similar new API with an extra mapper argument, be obsoleted?
       The suggestion was for clang to stop generating calls to it but to keep the API in libomptarget for backward compatibility. In the future it can be obsoleted.

      - Required Directives
      Support for required directives has been checked in by Doru.
       There was one issue with checking for the requires directive and confirming that the declare type is TO or LINK.
       Doru removed the check and added a note so that this code is revisited if things change in the future.

      Roll Call :
Company: Attendees (x = attended on June 19)
AMD: Greg Rodgers (x), Ashwin Aji, Jan Sjodin (x), Ron Lieberman (x), sameer Sahasrabuddhe, Andrey Kasaurov
ANL: Hal Finkel (x), Johannes Doerfert
IBM: Alexandre Eichenberger, Carlo Bertolli, Kelvin Li, Doru (x), Alexey Bataev (x)
Intel: Andrey Churbanov, Ravi Narayanaswamy (x), Serguei Dmitriev (x), Rajiv Deodhar, Lorri Menard, Terry Wilmarth, Prem Rao, Hansang Bae, George Rokos (x)
Cray: Deepak Eachempati (x)
Micron: John Leidel
Nvidia: James Beyer (x)
ORNL: Graham Lopez, Joel Denny, Geoffroy Vallee, Oscar Hernandez
SBU/BNL: Lingda Li (x), Jose Monlsave, Martin Kong
TI: Eric Stotzer
U of Bristol: Mat Martineau
U of Delaware: Sunita Chandrasekaran
U of Illinois: Hashim Sharif
Rice: John Mellor-Crummey
LSU: Tianyi Zhang








Re: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

Nathan Ridge via cfe-dev

Hi Lingda, thanks for your comments.
We can allocate the buffer either on the stack or by calling the OpenMP allocate function.
With this solution, we allocate memory only once (no need to resize the buffer after push_backs) and we do not need to call a runtime function to put the map data into the buffer; compiler-generated code can do it.
But anyway, I agree, it would be good to hear some other opinions.
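
As a minimal sketch of the two allocation options mentioned above, assuming the number of components is already known (MapComponent and the surrounding fill/hand-off code are hypothetical):

```
#include <cstddef>
#include <cstdint>
#include <omp.h>

struct MapComponent { void *Base; void *Begin; size_t Size; int64_t Type; };

void allocateComponentBuffer(size_t NumComponents) {
  // Option 1: stack allocation, viable when the count is a small compile-time
  // constant (common when a mapper maps a fixed set of members).
  MapComponent OnStack[4];
  (void)OnStack;

  // Option 2: the OpenMP 5.0 allocator API for the general case.
  auto *Buf = static_cast<MapComponent *>(
      omp_alloc(NumComponents * sizeof(MapComponent), omp_default_mem_alloc));
  // ... compiler-generated code would fill Buf and pass it to the runtime ...
  omp_free(Buf, omp_default_mem_alloc);
}
```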
--------------
Best regards,
Alexey Bataev

Inactive hide details for "Li, Lingda" ---06/28/2019 09:49:09 AM---............................................................"Li, Lingda" ---06/28/2019 09:49:09 AM---...............................................................................................................................

From: "Li, Lingda" <[hidden email]>
To: Alexey Bataev <[hidden email]>
Cc: Alexandre Eichenberger <[hidden email]>, "Chapman, Barbara (Contact)" <[hidden email]>, Kevin K O'Brien <[hidden email]>, "Carlo Bertolli" <[hidden email]>, Deepak Eachempati <[hidden email]>, "Denny, Joel E." <[hidden email]>, David Oehmke <[hidden email]>, "Ettore Tiotto" <[hidden email]>, "[hidden email]" <[hidden email]>, "Rokos, Georgios" <[hidden email]>, Gheorghe-Teod Bercea <[hidden email]>, "[hidden email]" <[hidden email]>, Hal Finkel <[hidden email]>, "Sharif, Hashim" <[hidden email]>, "Cownie, James H" <[hidden email]>, "Sjodin, Jan" <[hidden email]>, "[hidden email]" <[hidden email]>, "Doerfert, Johannes Rudolf" <[hidden email]>, "Jones, Jeff C" <[hidden email]>, "[hidden email]" <[hidden email]>, "Robichaux, Joseph" <[hidden email]>, Jeff Heath <[hidden email]>, "[hidden email]" <[hidden email]>, Kelvin Li <[hidden email]>, "Bobrovsky, Konstantin S" <[hidden email]>, "Kotsifakou, Maria" <[hidden email]>, "[hidden email]" <[hidden email]>, "Lopez, Matthew Graham" <[hidden email]>, "Menard, Lorri" <[hidden email]>, Martin Kong <[hidden email]>, Sarah McNamara <[hidden email]>, "Rice, Michael P" <[hidden email]>, "Matt Martineau" <[hidden email]>, "[hidden email]" <[hidden email]>, Jeeva Paudel <[hidden email]>, "Rao, Premanand M" <[hidden email]>, "Krishnaiyer, Rakesh" <[hidden email]>, "Narayanaswamy, Ravi" <[hidden email]>, "Monteleone, Robert" <[hidden email]>, "Lieberman, Ron" <[hidden email]>, Samuel Antao <[hidden email]>, Jeffrey Sandoval <[hidden email]>, Sunita Chandrasekaran <[hidden email]>, "[hidden email]" <[hidden email]>, Sergio Pino Gallardo <[hidden email]>, "Dmitriev, Serguei N" <[hidden email]>, "Chan, SiuChi" <[hidden email]>, Sunil Shrestha <[hidden email]>, "Wilmarth, Terry L" <[hidden email]>, Tianyi Zhang <[hidden email]>, "[hidden email]" <[hidden email]>, Wang Chen <[hidden email]>, Wael Yehia <[hidden email]>, "Tian, Xinmin" <[hidden email]>, "Li, Lingda (Contact)" <[hidden email]>
Date: 06/28/2019 09:49 AM
Subject: [EXTERNAL] Re: Re: Re: RE: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen





I don't think we can have the buffer allocated within the mapper function. It has to be done in the runtime, because of nested mappers.
First, all mapper functions are born in the same way. We cannot make the outer most mapper function allocate memory, whether the inner one doesn't and has to use what is allocated by the outer most mapper function.
I suppose we still need to allocate memory in the runtime, so the runtime can pass the pointer and size to the mapper function, and the outer mapper function can then pass them into inner ones.
Again, this is just like the current implementation, except that we don't use vecter::push_back(), instead we use something like a manual implementation of vector::push_back() (because we need to use the pointer and the current index)

I believe the key question here is whether it is true that (the overhead of push_back() > the overhead of precalculating the total number + the memory allocation overhead + directly memory write). This will decide whether this change is necessary. Any opinions?

Thanks,
Lingda Li


From: Alexey Bataev <[hidden email]>
Sent:
Thursday, June 27, 2019 5:05 PM
To:
Li, Lingda
Cc:
Alexandre Eichenberger; Chapman, Barbara (Contact); Kevin K O'Brien; Carlo Bertolli; Deepak Eachempati; Denny, Joel E.; David Oehmke; Ettore Tiotto; [hidden email]; Rokos, Georgios; Gheorghe-Teod Bercea; [hidden email]; Hal Finkel; Sharif, Hashim; Cownie, James H; Sjodin, Jan; [hidden email]; Doerfert, Johannes Rudolf; Jones, Jeff C; [hidden email]; Robichaux, Joseph; Jeff Heath; [hidden email]; Kelvin Li; Bobrovsky, Konstantin S; Kotsifakou, Maria; [hidden email]; Lopez, Matthew Graham; Menard, Lorri; Martin Kong; Sarah McNamara; Rice, Michael P; Matt Martineau; [hidden email]; Jeeva Paudel; Rao, Premanand M; Krishnaiyer, Rakesh; Narayanaswamy, Ravi; Monteleone, Robert; Lieberman, Ron; Samuel Antao; Jeffrey Sandoval; Sunita Chandrasekaran; [hidden email]; Sergio Pino Gallardo; Dmitriev, Serguei N; Chan, SiuChi; Sunil Shrestha; Wilmarth, Terry L; Tianyi Zhang; [hidden email]; Wang Chen; Wael Yehia; Tian, Xinmin
Subject:
Re: Re: Re: RE: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

Yes, we need 2 functions, but thw first one can be optimized very effectively. After the optimizations and inlining it will end up with just return s1+s2+s3... I think, inost cases those sizes will be constant, since the mapper maps constant number of elements. And, thus, this expression will be optimized to just a constant value.
You don't need to pass these functions to runtime. We can call the directly from the compiler.
1st call: get number of elements.
2nd: allocate the buffer
3rd call: call mapper with this preallocated buffer that fills this buffer without any calls of the runtime functions.
4th call: call the runtime to pass the buffer to the runtime.

Best regards,
Alexey Bataev

27 июня 2019 г., в 16:53, Li, Lingda <[hidden email]> написал(а):
      If we precalculate the size, first, it means we need to generate 2 functions for each mapper, rather than 1 now. One for mapping information filling as we have, the other for size calculation (This will not return constant values, because size depends on how many instances we are mapping). Both these 2 functions will need to be passed to the runtime. The runtime will need to precalculate the number of components first, then allocate memory, then call the mapper function to fill it up.

      Compared with the scheme 1, the differences are:
      1) An extra call to calculate the total number, while scheme 1 does not;
      2) A preallocated buffer, whose pointer and the current number should be passed to the mapper function, then the mapper function uses them to fill components, while scheme 1 uses push_back() to do the same thing.

      Is there really a benefit doing this? push_back() should be efficient enough compared with directly writing to memory.

      If people here think that, the overhead of push_back() > the overhead of precalculating the total number + the memory allocation overhead + directly memory write, then we can consider this scheme.

      Thanks,
      Lingda Li




      From: Alexey Bataev <[hidden email]>
      Sent:
      Thursday, June 27, 2019 4:26 PM
      To:
      Li, Lingda
      Cc:
      Alexandre Eichenberger; Chapman, Barbara (Contact); Kevin K O'Brien; Carlo Bertolli; Deepak Eachempati; Denny, Joel E.; David Oehmke; Ettore Tiotto; [hidden email]; Rokos, Georgios; Gheorghe-Teod Bercea; [hidden email]; Hal Finkel; Sharif, Hashim; Cownie, James H; Sjodin, Jan; [hidden email]; Doerfert, Johannes Rudolf; Jones, Jeff C; [hidden email]; Robichaux, Joseph; Jeff Heath; [hidden email]; Kelvin Li; Bobrovsky, Konstantin S; Kotsifakou, Maria; [hidden email]; Lopez, Matthew Graham; Menard, Lorri; Martin Kong; Sarah McNamara; Rice, Michael P; Matt Martineau; [hidden email]; Jeeva Paudel; Rao, Premanand M; Krishnaiyer, Rakesh; Narayanaswamy, Ravi; Monteleone, Robert; Lieberman, Ron; Samuel Antao; Jeffrey Sandoval; Sunita Chandrasekaran; [hidden email]; Sergio Pino Gallardo; Dmitriev, Serguei N; Chan, SiuChi; Sunil Shrestha; Wilmarth, Terry L; Tianyi Zhang; [hidden email]; Wang Chen; Wael Yehia; Tian, Xinmin
      Subject:
      Re: Re: RE: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

      If the functions are inlined (the ines, intended for size precalculation). They can be optimized out very effectively since in most cases they will return constant values.
      If we could do this, we won't need vectors and oush_backs, we can use preallocated memory and internal counter.
      --------------
      Best regards,
      Alexey Bataev


      <graycol.gif>"Li, Lingda" ---06/27/2019 04:13:03 PM---Hi Alexey, I think that's why we choose to use variable size storage like std::vector to store the m

      From:
      "Li, Lingda" <[hidden email]>
      To:
      Alexey Bataev <[hidden email]>, Deepak Eachempati <[hidden email]>
      Cc:
      "Narayanaswamy, Ravi" <[hidden email]>, "Alexandre Eichenberger" <[hidden email]>, "Chapman, Barbara (Contact)" <[hidden email]>, "Bobrovsky, Konstantin S" <[hidden email]>, Carlo Bertolli <[hidden email]>, "Chan, SiuChi" <[hidden email]>, "Cownie, James H" <[hidden email]>, David Oehmke <[hidden email]>, "Denny, Joel E." <[hidden email]>, "Dmitriev, Serguei N" <[hidden email]>, "Doerfert, Johannes Rudolf" <[hidden email]>, Ettore Tiotto <[hidden email]>, "[hidden email]" <[hidden email]>, Gheorghe-Teod Bercea <[hidden email]>, Hal Finkel <[hidden email]>, "[hidden email]" <[hidden email]>, Jeeva Paudel <[hidden email]>, Jeff Heath <[hidden email]>, Jeffrey Sandoval <[hidden email]>, "Jones, Jeff C" <[hidden email]>, "[hidden email]" <[hidden email]>, Kelvin Li <[hidden email]>, "Kevin K O'Brien" <[hidden email]>, "[hidden email]" <[hidden email]>, "Kotsifakou, Maria" <[hidden email]>, "Krishnaiyer, Rakesh" <[hidden email]>, "Lieberman, Ron" <[hidden email]>, "Lopez, Matthew Graham" <[hidden email]>, "[hidden email]" <[hidden email]>, Martin Kong <[hidden email]>, Matt Martineau <[hidden email]>, "Menard, Lorri" <[hidden email]>, "Monteleone, Robert" <[hidden email]>, "[hidden email]" <[hidden email]>, "Rao, Premanand M" <[hidden email]>, "Rice, Michael P" <[hidden email]>, "Robichaux, Joseph" <[hidden email]>, "[hidden email]" <[hidden email]>, "Rokos, Georgios" <[hidden email]>, Samuel Antao <[hidden email]>, "Sarah McNamara" <[hidden email]>, "[hidden email]" <[hidden email]>, Sergio Pino Gallardo <[hidden email]>, "Sharif, Hashim" <[hidden email]>, "Sjodin, Jan" <[hidden email]>, Sunil Shrestha <[hidden email]>, Sunita Chandrasekaran <[hidden email]>, "Tian, Xinmin" <[hidden email]>, Tianyi Zhang <[hidden email]>, "[hidden email]" <[hidden email]>, Wael Yehia <[hidden email]>, Wang Chen <[hidden email]>, "Wilmarth, Terry L" <[hidden email]>
      Date:
      06/27/2019 04:13 PM
      Subject:
      [EXTERNAL] Re: RE: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen





      Hi Alexey,


      I think that's why we choose to use variable size storage like std::vector to store the mapping information at the first place, right? It'll be costly to precalculate the total number of components, especially in the presence of nested mappers. Besides, a runtime function call is just a std::vector::push, so I think it's okay to have multiple function calls.


      Thanks,
      Lingda Li



      From:
      Alexey Bataev <[hidden email]>
      Sent:
      Thursday, June 27, 2019 3:52 PM
      To:
      Deepak Eachempati
      Cc:
      Li, Lingda; Narayanaswamy, Ravi; Alexandre Eichenberger; Chapman, Barbara (Contact); Bobrovsky, Konstantin S; Carlo Bertolli; Chan, SiuChi; Cownie, James H; David Oehmke; Denny, Joel E.; Dmitriev, Serguei N; Doerfert, Johannes Rudolf; Ettore Tiotto; [hidden email]; Gheorghe-Teod Bercea; Hal Finkel; [hidden email]; Jeeva Paudel; Jeff Heath; Jeffrey Sandoval; Jones, Jeff C; [hidden email]; Kelvin Li; Kevin K O'Brien; [hidden email]; Kotsifakou, Maria; Krishnaiyer, Rakesh; Lieberman, Ron; Lopez, Matthew Graham; [hidden email]; Martin Kong; Matt Martineau; Menard, Lorri; Monteleone, Robert; [hidden email]; Rao, Premanand M; Rice, Michael P; Robichaux, Joseph; [hidden email]; Rokos, Georgios; Samuel Antao; Sarah McNamara; [hidden email]; Sergio Pino Gallardo; Sharif, Hashim; Sjodin, Jan; Sunil Shrestha; Sunita Chandrasekaran; Tian, Xinmin; Tianyi Zhang; [hidden email]; Wael Yehia; Wang Chen; Wilmarth, Terry L
      Subject:
      Re: RE: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

      Lingda, can we in scheme 1 precalculate the total number of components, allocate memory for these precalculate number of elements, then fill it with mappers and only after that call the runtime function (only once!) to transfer the mappings to the runtime?

      Best regards,
      Alexey Bataev

      27 июня 2019 г., в 15:44, Deepak Eachempati <[hidden email]> написал(а):
              Got it. Thanks.

              -- Deepak

              From:
              Li, Lingda [[hidden email]]
              Sent:
              Thursday, June 27, 2019 2:41 PM
              To:
              Deepak Eachempati <[hidden email]>; Narayanaswamy, Ravi <[hidden email]>; 'Alexandre Eichenberger' <[hidden email]>; 'Alexey Bataev' <[hidden email]>; Chapman, Barbara (Contact) <[hidden email]>; Bobrovsky, Konstantin S <[hidden email]>; 'Carlo Bertolli' <[hidden email]>; 'Chan, SiuChi' <[hidden email]>; Cownie, James H <[hidden email]>; David Oehmke <[hidden email]>; 'Denny, Joel E.' <[hidden email]>; Dmitriev, Serguei N <[hidden email]>; Doerfert, Johannes Rudolf <[hidden email]>; 'Ettore Tiotto' <[hidden email]>; '[hidden email]' <[hidden email]>; 'Gheorghe-Teod Bercea' <[hidden email]>; Hal Finkel <[hidden email]>; '[hidden email]' <[hidden email]>; 'Jeeva Paudel' <[hidden email]>; 'Jeff Heath' <[hidden email]>; Jeffrey Sandoval <[hidden email]>; Jones, Jeff C <[hidden email]>; '[hidden email]' <[hidden email]>; 'Kelvin Li' <[hidden email]>; 'Kevin K O'Brien' <[hidden email]>; '[hidden email]' <[hidden email]>; 'Kotsifakou, Maria' <[hidden email]>; Krishnaiyer, Rakesh <[hidden email]>; Lieberman, Ron <[hidden email]>; 'Lopez, Matthew Graham' <[hidden email]>; '[hidden email]' <[hidden email]>; 'Martin Kong' <[hidden email]>; 'Matt Martineau' <[hidden email]>; Menard, Lorri <[hidden email]>; Monteleone, Robert <[hidden email]>; [hidden email]; Rao, Premanand M <[hidden email]>; Rice, Michael P <[hidden email]>; Robichaux, Joseph <[hidden email]>; [hidden email]; Rokos, Georgios <[hidden email]>; '[hidden email]' <[hidden email]>; 'Sarah McNamara' <[hidden email]>; '[hidden email]' <[hidden email]>; 'Sergio Pino Gallardo' <[hidden email]>; 'Sharif, Hashim' <[hidden email]>; Sjodin, Jan <[hidden email]>; Sunil Shrestha <[hidden email]>; 'Sunita Chandrasekaran' <[hidden email]>; Tian, Xinmin <[hidden email]>; Tianyi Zhang <[hidden email]>; '[hidden email]' <[hidden email]>; 'Wael Yehia' <[hidden email]>; 'Wang Chen' <[hidden email]>; Wilmarth, Terry L <[hidden email]>
              Subject:
              Re: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

              In the current scheme, all mappings within a mapper function is done atomically by one thread. In the mapper function of the example in the original email, <push> will just push the mapping information into an internal data structure. Once all mapping information is available, the runtime will do the real mapping together. For your example, the behavior is the same as the code below:

              ...
              #pragma omp parallel num_threads(2)
              {
              if (omp_get_thread_num() == 0) {
              #pragma omp target map(s.x, s.p[0:s.x])
              {
              for (int i = 0; i < s.x; i++) s.p[i] = i;
              }
              } else {
              #pragma omp target map(other_data)
              {
              // work on other_data
              }
              }
              ...

              From: Deepak Eachempati <[hidden email]>
              Sent:
              Thursday, June 27, 2019 3:34 PM
              To:
              Li, Lingda; Narayanaswamy, Ravi; 'Alexandre Eichenberger'; 'Alexey Bataev'; Chapman, Barbara (Contact); Bobrovsky, Konstantin S; 'Carlo Bertolli'; 'Chan, SiuChi'; Cownie, James H; David Oehmke; 'Denny, Joel E.'; Dmitriev, Serguei N; Doerfert, Johannes Rudolf ; 'Ettore Tiotto'; '[hidden email]'; 'Gheorghe-Teod Bercea'; Hal Finkel; '[hidden email]'; 'Jeeva Paudel'; 'Jeff Heath'; Jeffrey Sandoval; Jones, Jeff C; '[hidden email]'; 'Kelvin Li'; 'Kevin K O'Brien'; '[hidden email]'; 'Kotsifakou, Maria'; Krishnaiyer, Rakesh; Lieberman, Ron ; 'Lopez, Matthew Graham'; '[hidden email]'; 'Martin Kong'; 'Matt Martineau'; Menard, Lorri; Monteleone, Robert; [hidden email]; Rao, Premanand M; Rice, Michael P; Robichaux, Joseph; [hidden email]; Rokos, Georgios; '[hidden email]'; 'Sarah McNamara'; '[hidden email]'; 'Sergio Pino Gallardo'; 'Sharif, Hashim'; Sjodin, Jan ; Sunil Shrestha; 'Sunita Chandrasekaran'; Tian, Xinmin; Tianyi Zhang; '[hidden email]'; 'Wael Yehia'; 'Wang Chen'; Wilmarth, Terry L
              Subject:
              RE: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

              I was referring to something like this, where another thread is not trying to map the same data:

              #pragma omp declare mapper(S s) map(s.x) map(s.p[0:s.x])
              S s;
              ...
#pragma omp parallel num_threads(2)
{
  if (omp_get_thread_num() == 0) {
    #pragma omp target map(s)
    {
      for (int i = 0; i < s.x; i++) s.p[i] = i;
    }
  } else {
    #pragma omp target map(other_data)
    {
      // work on other_data
    }
  }
}
...

              Since I believe you are mapping s.x and s.p as separate map operations, it is possible that another thread could map ‘other_data’ in between those two maps. If this happens, will your implementation still ensure that s.x and s.p are positioned at the right offsets with respect to the same base address (&s)?

              -- Deepak

              From:
              Li, Lingda [[hidden email]]
              Sent:
              Thursday, June 27, 2019 2:26 PM
              To:
Deepak Eachempati; Narayanaswamy, Ravi; [full recipient list trimmed]
              Subject:
              Re: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

              When 2 threads try to concurrently map the same data, it behaves the same as when 2 threads concurrently map the same data using map clauses, and mappers don't introduce extra considerations here. For instance, both threads use #omp target enter data concurrently.

When 2 threads concurrently map the same data, my understanding based on the current code is that it will create 2 copies of the same data, and either copy is correct to use. It may be a problem when both copies are mapped back without correct synchronization, but this is a programming issue, not the responsibility of OpenMP.

              Thanks,
              Lingda Li

              From: Deepak Eachempati <[hidden email]>
              Sent:
              Thursday, June 27, 2019 3:17 PM
              To:
Li, Lingda; Narayanaswamy, Ravi; [full recipient list trimmed]
              Subject:
              RE: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

              Thanks.

Is it possible for another thread to concurrently map something else while the maps from the mapper function are taking place? If so, how do you guarantee that the allocation for each component will get you the right addresses in device memory? Sorry if this was covered before and I missed it.

              -- Deepak

              From:
              Li, Lingda [[hidden email]]
              Sent:
              Thursday, June 27, 2019 2:08 PM
              To:
Deepak Eachempati; Narayanaswamy, Ravi; [full recipient list trimmed]
              Subject:
              Re: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

              Hi Deepak,

Yes, it handles this case. The first part of the mapper function (initially allocating space for the whole array) is just an optimization, not required for correctness, as you suggested in an earlier discussion.

              In your example, s.x and s.p will be allocated separately (not in a single allocation). But Clang guarantees that their addresses will be correct because s.x and s.p share the same base address, which is &s.
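A minimal sketch of the component arguments involved (illustrative values only, not the code Clang actually emits):

```
#include <cstddef>

struct S { int x; /* ... other stuff ... */ int *p; };
S s;

// Component for s.x: base is &s, begin is &s.x.
void  *base_x  = &s;
void  *begin_x = &s.x;
size_t size_x  = sizeof(s.x);

// Component for the pointer member s.p: base is still &s.
void  *base_p  = &s;
void  *begin_p = &s.p;
size_t size_p  = sizeof(s.p);

// The pointee s.p[0:s.x] is a separate entry whose base is &s.p, which is
// what enables the pointer attachment once both sides are mapped.
```

Because every member component carries &s as its base, the runtime can compute each member's offset from wherever the copy of s ends up on the device, even when the individual maps are issued separately.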

              Thanks,
              Lingda Li

              From: Deepak Eachempati <[hidden email]>
              Sent:
              Thursday, June 27, 2019 2:49 PM
              To:
Li, Lingda; Narayanaswamy, Ravi; [full recipient list trimmed]
              Subject:
              RE: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

              For Scheme 1, it looks like you are doing separate maps for each component when size == 1. It seems like the first and last if statements should have “size >= 1” rather than “size > 1”.

              If the mapper is declared like this:

struct S {
  int x;
  ... // other stuff
  int *p;
};

#pragma omp declare mapper(S s) map(s.x) map(s.p[0:s.x])

And you have:

S s;
...
#pragma omp target map(s)
{
  for (int i = 0; i < s.x; i++) s.p[i] = i;
}

              Since the target construct is just mapping a single structure of type S, there should be one map that takes care of mapping storage for s.x and s.p with a single allocation, and a separate map for the array section s.p[0:s.x], and finally the pointer attachment of s.p to s.p[0:s.x]. Does Scheme 1 handle this?
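For reference, ignoring the declared mapper, a rough hand-written equivalent of the behavior described here would be (a sketch only; the mapper implementation does not have to literally expand to these clauses):

```
S s;
...
#pragma omp target map(tofrom: s) map(tofrom: s.p[0:s.x])
{
  for (int i = 0; i < s.x; i++) s.p[i] = i;
}
```

Mapping s as a whole gives the single allocation that covers s.x and s.p, and mapping the array section s.p[0:s.x] on the same construct is what triggers the pointer attachment of the device copy of s.p to the device copy of that section.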

              -- Deepak


              From:
              Li, Lingda [[hidden email]]
              Sent:
              Thursday, June 27, 2019 1:07 PM
              To:
Narayanaswamy, Ravi; [full recipient list trimmed]
              Subject:
              Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

              Hi,


              Alexey and I would like to have your attention on an ongoing discussion of 2 schemes to implement the declare mapper in OpenMP 5.0. The detailed discussion can be found at
              https://reviews.llvm.org/D59474

              Scheme 1 (the one has been implemented by me in
              https://reviews.llvm.org/D59474):
              The detailed design can be found at
              https://github.com/lingda-li/public-sharing/blob/master/mapper_runtime_design.pptx
              For each mapper function, the compiler generates a function like this:


              ```
              void <type>.mapper(void *base, void *begin, size_t size, int64_t type) {
              // Allocate space for an array section first.
              if (size > 1 && !maptype.IsDelete)
              <push>(base, begin, size*sizeof(Ty), clearToFrom(type));

              // Map members.
              for (unsigned i = 0; i < size; i++) {
              // For each component specified by this mapper:
              for (auto c : components) {
              ...; // code to generate c.arg_base, c.arg_begin, c.arg_size, c.arg_type
              if (c.hasMapper())
              (*c.Mapper())(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
              else
              <push>(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
              }
              }
              // Delete the array section.
              if (size > 1 && maptype.IsDelete)
              <push>(base, begin, size*sizeof(Ty), clearToFrom(type));
              }
              ```
              This function is passed to the OpenMP runtime, and the runtime will call this function to finish the data mapping.



              Scheme 2 (which Alexey proposes):
              Alexey proposes to move parts of the mapper function above into the OpenMP runtime, so the compiler will generate code below:
              ```
              void <type>.mapper(void *base, void *begin, size_t size, int64_t type) {
              ...; // code to generate arg_base, arg_begin, arg_size, arg_type, arg_mapper.
              auto sub_components[] = {...}; // fill in generated begin, base, ...
              __tgt_mapper(base, begin, size, type, sub_components);
              }
              ```


              `__tgt_mapper` is a runtime function as below:
              ```
              void __tgt_mapper(void *base, void *begin, size_t size, int64_t type, auto components[]) {
              // Allocate space for an array section first.
              if (size > 1 && !maptype.IsDelete)
              <push>(base, begin, size*sizeof(Ty), clearToFrom(type));

              // Map members.
              for (unsigned i = 0; i < size; i++) {
              // For each component specified by this mapper:
              for (auto c : components) {
              if (c.hasMapper())
              (*c.Mapper())(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
              else
              <push>(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
              }
              }
              // Delete the array section.
              if (size > 1 && maptype.IsDelete)
              <push>(base, begin, size*sizeof(Ty), clearToFrom(type));
              }
              ```


              Comparison:
              Why to choose 1 over 2:
              1. In scheme 2, the compiler needs to generate all map types and pass them to __tgt_mapper through sub_components. But in this case, the compiler won't be able to generate the correct MEMBER_OF field in map type. As a result, the runtime has to fix it using the mechanism we already have here: __tgt_mapper_num_components. This not only increases complexity, but also, it means the runtime needs further manipulation of the map type, which creates locality issues. While in the current scheme, the map type is generated by compiler once, so the data locality will be very good in this case.
              2. In scheme 2, sub_components includes all components that should be mapped. If we are mapping an array, this means we need to map many components, which will need to allocate memory for sub_components in the heap. This creates further memory management burden and is not an efficient way to use memory.
3. In scheme 1, we are able to inline nested mapper functions. As a result, the compiler can further optimize the mapper function, e.g., eliminate redundant computation and unroll loops, and thus achieve potentially better performance. We cannot achieve these optimizations in scheme 2.
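To illustrate point 3, here is a hypothetical sketch of the nested case under scheme 1, in the same pseudocode style as above (the types and names are made up for illustration):

```
// struct T has a member of type S, and S has its own declared mapper.
void <S>.mapper(void *base, void *begin, size_t size, int64_t type) { ... }

void <T>.mapper(void *base, void *begin, size_t size, int64_t type) {
  ...
  // The member of type S has a mapper, so scheme 1 emits a direct call:
  <S>.mapper(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
  ...
}
```

Because <S>.mapper is an ordinary compiler-generated function called directly from <T>.mapper, the optimizer can inline it and then optimize across both levels (e.g., hoist loop-invariant argument computation, unroll loops). In scheme 2 the inner components are only walked inside the opaque runtime call __tgt_mapper, so these cross-level optimizations are not available.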


Why choose 2 over 1:
1. Less code in the mapper function codegen (I doubt this matters, because the codegen function of scheme 1 is fewer than 200 lines of code).
Alexey may have other reasons.


We would appreciate it if you could share your thoughts.


              Thanks,
              Lingda Li

              From: Narayanaswamy, Ravi <[hidden email]>
              Sent:
              Wednesday, June 19, 2019 3:09 PM
              To:
'Alexandre Eichenberger'; 'Alexey Bataev'; [full recipient list trimmed]
              Subject:
              OpenMP / HPC in Clang / LLVM Multi-company Telecom Meeting Minutes June 19th 2019

Next Meeting: July 10th (moved from July 3rd)

              Opens :

              - Documentation

- Greg: Can we have documents for libopenmp and libomptarget?
- Alexey suggested having 3 documents: libopenmp, libomptarget, and the device plugin.
- Hal will convert the existing libomptarget document. Once done, others can update the document to capture the existing implementation.
Future additions to libomptarget will also require updates to the document.
- Next, a libopenmp document will be created if it does not exist, or updated if one exists.


              LTO for fat binary linking

- Serguei (Intel) has an implementation which enables LTO and does away with linker scripts.
Everybody agreed this is a good idea, especially since some linkers don't support linker scripts.
AMD is interested in enabling LTO and would like to see the code.
Serguei to post the code to get feedback from all.
              - Hal to present in next meeting his proposal to support static fat archives using LTO.


              OpenMP 5.0 Features

              - No update on setting up the public website. Johannes was out attending ISC.
              - New features added since last release (courtesy of Kelvin)

              - allocate clause/allocate directive - parsing+sema, codegen
- mutexinoutset dependence-type for task
              - user-defined mapper (declare mapper) - parsing+sema.
              - omp_get_device_num() API routine


              Development Activity
              - Async API

Support in Clang and libopenmp, including a lit test, has been checked in by Doru.


              - Mapper support

Initial support for the mapper has been posted for review by Lingda. Once approved, the rest of the support will be done.
Lingda: Should the old API, which is being replaced by a similar new API with an extra mapper argument, be obsoleted?
The suggestion was for Clang to not generate calls to it, but to keep the API in libomptarget for backward compatibility. In the future it can be obsoleted.


              - Required Directives

              Support for required directives has been checked in by Doru.
There was one issue with checking for the requires directive and confirming whether the declare target type is TO or LINK.
Doru removed the check and added a note as a reminder that this code needs to be modified if things change in the future.


Roll Call (x = attended June 19):
- AMD: Greg Rodgers (x), Ashwin Aji, Jan Sjodin (x), Ron Lieberman (x), Sameer Sahasrabuddhe, Andrey Kasaurov
- ANL: Hal Finkel (x), Johannes Doerfert
- IBM: Alexandre Eichenberger, Carlo Bertolli, Kelvin Li, Doru (x), Alexey Bataev (x)
- Intel: Andrey Churbanov, Ravi Narayanaswamy (x), Serguei Dmitriev (x), Rajiv Deodhar, Lorri Menard, Terry Wilmarth, Prem Rao, Hansang Bae, George Rokos (x)
- Cray: Deepak Eachempati (x)
- Micron: John Leidel
- Nvidia: James Beyer (x)
- ORNL: Graham Lopez, Joel Denny, Geoffroy Vallee, Oscar Hernandez
- SBU/BNL: Lingda Li (x), Jose Monlsave, Martin Kong
- TI: Eric Stotzer
- U of Bristol: Mat Martineau
- U of Delaware: Sunita Chandrasekaran
- U of Illinois: Hashim Sharif
- Rice: John Mellor-Crummey
- LSU: Tianyi Zhang






_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


Re: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

Nathan Ridge via cfe-dev

Hi, Alexey, Lingda,

I haven't been following this closely, so a few questions/comments:

 1. Recursive mappers are not supported in OpenMP 5, but do we expect that to change in the future?

 2. Our experience so far suggests that the most important optimization in this space is to limit the number of distinct host-to-device transfers (or data copies) on systems where data needs to be copied. In these schemes, where does that coalescing occur?

 3. So long as the mappers aren't recursive, I agree with Alexey that the total number of to-be-mapped components should be efficient to calculate. The counting function should simplify to a trivial expression in nearly all cases. The only case where it might not is where the type contains an array section with dynamic bounds, and the element type also has a mapper with an array section with dynamic bounds. In this case (similar to the unsupported recursive cases, which, as an aside, we should probably support as an extension) we would need to walk the data structure twice to precalculate the total number of components to map. However, this case is certainly detectable by static analysis of the declared mappers, and so I think that we can get the best of both worlds: we could use Alexey's proposed scheme except in cases where we truly need to walk the data structure twice, in which case we could use Lingda's combined walk/push_back scheme. Is there any reason why that wouldn't work?
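(A minimal sketch of the kind of counting function meant in point 3, in the same pseudocode style as the mapper functions earlier in the thread; the names are illustrative only.)

```
size_t <type>.mapper.count(void *begin, size_t size) {
  size_t n = 0;
  for (unsigned i = 0; i < size; i++) {
    n += <number of components listed in the declared mapper>;
    // Only a component that itself has a mapper over a dynamically sized
    // array section forces a recursive walk of the data structure:
    //   n += <member_type>.mapper.count(member_begin, member_size);
  }
  return n; // folds to a trivial expression when no such component exists
}
```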

Thanks again,

Hal

On 6/28/19 9:00 AM, Alexey Bataev wrote:

Hi Lingda, thanks for your comments.
We can allocate the buffer either by allocating it on the stack or calling OpenMP allocate function.
With this solution, we allocate memory only once (no need to resize buffer after push_backs) and we do not need to call the runtime function to put map data to the buffer, compiler generated code can do it.
But anyway, I agree, it would be good to hear some other opinions.
--------------
Best regards,
Alexey Bataev


...

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

Nathan Ridge via cfe-dev

Recursive data structures are important if you consider linked lists important. 

 

Supporting these is challenging but not impossible; I would expect that if someone manages to implement a cost-effective way to support linked lists, we would add support to OpenMP with ease.

 

From: Finkel, Hal J. <[hidden email]>
Sent: Friday, June 28, 2019 10:46 PM
To: Alexey Bataev <[hidden email]>; Li, Lingda <[hidden email]>
Cc: [full recipient list trimmed]
Subject: Re: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

 

Hi, Alexey, Lingda,

I haven't been following this closely, so a few questions/comments:

 1. Recursive mappers are not supported in OpenMP 5, but do we expect that to change in the future?

 2. Our experience so far suggests that the most important optimization in this space is to limit the number of distinct host-to-device transfers (or data copies) on systems where data needs to be copied. In these schemes, where does that coalescing occur?

 3. So long as the mappers aren't recursive, I agree with Alexey that the total number of to-be-mapped components should be efficient to calculate. The counting function should simplify to a trivial expression in nearly all cases. The only case where it might not is where the type contains an array section with dynamic bounds, and the element type also has a mapper with an array section with dynamic bounds. In this case (similar to the unsupported recursive cases, which as an aside, we should probably support it as an extension) we could need to walk the data structure twice to precalculate the number of total components to map. However, this case is certainly detectable by static analysis of the declared mappers, and so I think that we can get the best of both worlds: we could use Alexey's proposed scheme except in cases where we truly need to walk the data-structure twice, in which case we could use Lingda's combined walk/push_back scheme. Is there any reason why that wouldn't work?

Thanks again,

Hal

On 6/28/19 9:00 AM, Alexey Bataev wrote:

Hi Lingda, thanks for your comments.
We can allocate the buffer either by allocating it on the stack or calling OpenMP allocate function.
With this solution, we allocate memory only once (no need to resize buffer after push_backs) and we do not need to call the runtime function to put map data to the buffer, compiler generated code can do it.
But anyway, I agree, it would be good to hear some other opinions.
--------------
Best regards,
Alexey Bataev



...

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory



_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

Nathan Ridge via cfe-dev
In reply to this post by Nathan Ridge via cfe-dev
Hi Hal,

Best regards,
Alexey Bataev

28 июня 2019 г., в 23:46, Finkel, Hal J. via cfe-dev <[hidden email]> написал(а):

Hi, Alexey, Lingda,

I haven't been following this closely, so a few questions/comments:

 1. Recursive mappers are not supported in OpenMP 5, but do we expect that to change in the future?

Good question. I do not know, actually, but I think both of these schemes can be adapted to support recursive mappers.

 2. Our experience so far suggests that the most important optimization in this space is to limit the number of distinct host-to-device transfers (or data copies) on systems where data needs to be copied. In these schemes, where does that coalescing occur?

In both schemes we transfer the data only once: after we gather all the required data-mapping info, we transfer it to the device at once. The only difference between these schemes is the number of runtime function calls required to fill in this mapping data.
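A rough sketch of that difference (pseudocode with illustrative names, not the actual runtime interface):

```
// Scheme 1: one runtime call per component appends to a growable buffer.
for (auto c : components)
  <push>(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);   // push_back

// The precounting variant: count, preallocate, let compiler-generated code
// write the entries directly, then hand the buffer over in a single call.
size_t n = <count_components>(begin, size);
MapEntry *buf = <allocate>(n * sizeof(MapEntry));
size_t idx = 0;
for (auto c : components)
  buf[idx++] = {c.arg_base, c.arg_begin, c.arg_size, c.arg_type};
<pass_buffer_to_runtime>(buf, n);
```

Either way, the actual transfer to the device happens once, after the buffer is complete.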

 3. So long as the mappers aren't recursive, I agree with Alexey that the total number of to-be-mapped components should be efficient to calculate. The counting function should simplify to a trivial expression in nearly all cases. The only case where it might not is where the type contains an array section with dynamic bounds, and the element type also has a mapper with an array section with dynamic bounds. In this case (similar to the unsupported recursive cases, which as an aside, we should probably support it as an extension) we could need to walk the data structure twice to precalculate the number of total components to map. However, this case is certainly detectable by static analysis of the declared mappers, and so I think that we can get the best of both worlds: we could use Alexey's proposed scheme except in cases where we truly need to walk the data-structure twice, in which case we could use Lingda's combined walk/push_back scheme. Is there any reason why that wouldn't work?

I think it is better to use only one scheme. I rather doubt that we can implement some kind of analysis in the frontend. Later, when the real codegen is moved to the backend, we can try to implement 2 schemes. But not today. We need to choose one, and I just want to hear all the pros and cons of both (actually, there are already 3) schemes, to choose the most flexible, reliable, and fast one.

Thanks again,

Hal

On 6/28/19 9:00 AM, Alexey Bataev wrote:

Hi Lingda, thanks for your comments.
We can allocate the buffer either by allocating it on the stack or calling OpenMP allocate function.
With this solution, we allocate memory only once (no need to resize buffer after push_backs) and we do not need to call the runtime function to put map data to the buffer, compiler generated code can do it.
But anyway, I agree, it would be good to hear some other opinions.
--------------
Best regards,
Alexey Bataev


...

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

Nathan Ridge via cfe-dev
In reply to this post by Nathan Ridge via cfe-dev
Hi Lingda,

may I ask that discussions about important decisions related to
Clang's OpenMP support be started on the public mailing list instead of in
private conversations? That would help to get feedback from people who are
not part of the selected circle participating in the "OpenMP / HPC in
Clang / LLVM Multi-company Telecom".

Thanks,
Jonas

On 2019-06-28 15:59, Lingda Li via cfe-dev wrote:

> On Fri, Jun 28, 2019 at 9:49 AM Li, Lingda <[hidden email]> wrote:
>
>> I don't think we can have the buffer allocated within the mapper
>> function. It has to be done in the runtime, because of nested
>> mappers.
>> First, all mapper functions are born in the same way. We cannot
>> make the outermost mapper function allocate memory while the
>> inner ones don't and instead have to use what is allocated by the
>> outermost mapper function.
>> I suppose we still need to allocate memory in the runtime, so the
>> runtime can pass the pointer and size to the mapper function, and
>> the outer mapper function can then pass them into inner ones.
>> Again, this is just like the current implementation, except that we
>> don't use vector::push_back(); instead we use something like a
>> manual implementation of vector::push_back() (because we need to use
>> the pointer and the current index).
>>
>> I believe the key question here is whether it is true that (the
>> overhead of push_back() > the overhead of precalculating the total
>> number + the memory allocation overhead + directly memory write).
>> This will decide whether this change is necessary. Any opinions?
>>
>> Thanks,
>> Lingda Li
>>
>> -------------------------
>>
>> FROM: Alexey Bataev <[hidden email]>
>> SENT: Thursday, June 27, 2019 5:05 PM
>> TO: Li, Lingda
>> CC: [full recipient list trimmed]
>> SUBJECT: Re: Re: Re: RE: Comparison of 2 schemes to implement OpenMP
>> 5.0 declare mapper codegen
>>
>> Yes, we need 2 functions, but the first one can be optimized very
>> effectively. After the optimizations and inlining it will end up
>> with just return s1+s2+s3... I think in most cases those sizes will
>> be constant, since the mapper maps a constant number of elements. And,
>> thus, this expression will be optimized to just a constant value.
>> You don't need to pass these functions to the runtime. We can call
>> them directly from the compiler-generated code.
>> 1st call: get number of elements.
>> 2nd: allocate the buffer
>> 3rd call: call mapper with this preallocated buffer that fills this
>> buffer without any calls of the runtime functions.
>> 4th call: call the runtime to pass the buffer to the runtime.
>>
>> Best regards,
>> Alexey Bataev
>>
>> 27 июня 2019 г., в 16:53, Li, Lingda <[hidden email]>
>> написал(а):
>>
>>> If we precalculate the size, first, it means we need to generate
>>> 2 functions for each mapper, rather than 1 now. One for mapping
>>> information filling as we have, the other for size calculation
>>> (This will not return constant values, because size depends on how
>>> many instances we are mapping). Both these 2 functions will need
>>> to be passed to the runtime. The runtime will need to precalculate
>>> the number of components first, then allocate memory, then call
>>> the mapper function to fill it up.
>>>
>>> Compared with the scheme 1, the differences are:
>>> 1) An extra call to calculate the total number, while scheme 1
>>> does not;
>>> 2) A preallocated buffer, whose pointer and the current number
>>> should be passed to the mapper function, then the mapper function
>>> uses them to fill components, while scheme 1 uses push_back() to
>>> do the same thing.
>>>
>>> Is there really a benefit doing this? push_back() should be
>>> efficient enough compared with directly writing to memory.
>>>
>>> If people here think that, the overhead of push_back() > the
>>> overhead of precalculating the total number + the memory
>>> allocation overhead + directly memory write, then we can consider
>>> this scheme.
>>>
>>> Thanks,
>>> Lingda Li
>>>
>>> -------------------------
>>>
>>> FROM: Alexey Bataev <[hidden email]>
>>> SENT: Thursday, June 27, 2019 4:26 PM
>>> TO: Li, Lingda
>>> CC: [full recipient list trimmed]
>>> SUBJECT: Re: Re: RE: Comparison of 2 schemes to implement OpenMP
>>> 5.0 declare mapper codegen
>>>
>>> If the functions (the ones intended for size precalculation) are
>>> inlined, they can be optimized out very effectively since in most
>>> cases they will return constant values.
>>> If we could do this, we wouldn't need vectors and push_backs; we
>>> could use preallocated memory and an internal counter.
>>> --------------
>>> Best regards,
>>> Alexey Bataev
>>>
>>>
>>> From: "Li, Lingda" <[hidden email]>
>>> To: Alexey Bataev; Deepak Eachempati
>>> Cc: [full recipient list trimmed]
>>> Date: 06/27/2019 04:13 PM
>>> Subject: [EXTERNAL] Re: RE: Comparison of 2 schemes to implement
>>> OpenMP 5.0 declare mapper codegen
>>>
>>> -------------------------
>>>
>>> Hi Alexey,
>>>
>>> I think that's why we chose to use variable-size storage like
>>> std::vector to store the mapping information in the first place,
>>> right? It'll be costly to precalculate the total number of
>>> components, especially in the presence of nested mappers. Besides,
>>> a runtime function call is just a std::vector::push_back, so I think
>>> it's okay to have multiple function calls.
>>>
>>> Thanks,
>>> Lingda Li
>>>
>>> -------------------------
>>>
>>> FROM: Alexey Bataev <[hidden email]>
>>> Sent: Thursday, June 27, 2019 3:52 PM
>>> To: Deepak Eachempati
>>> Cc: [full recipient list trimmed]
>>> Subject: Re: RE: Comparison of 2 schemes to implement OpenMP 5.0
>>> declare mapper codegen
>>>
>>> Lingda, can we in scheme 1 precalculate the total number of
>>> components, allocate memory for this precalculated number of
>>> elements, then fill it with the mappings, and only after that call
>>> the runtime function (only once!) to transfer the mappings to the
>>> runtime?
>>>
>>> Best regards,
>>> Alexey Bataev
>>>
>>> 27 июня 2019 г., в 15:44, Deepak Eachempati
>>> <[hidden email]> написал(а):
>>>
>>> Got it. Thanks.
>>>
>>> -- Deepak
>>>
>>> FROM: Li, Lingda [mailto:[hidden email]]
>>> Sent: Thursday, June 27, 2019 2:41 PM
>>> To: Deepak Eachempati; [full recipient list trimmed]
>>> Subject: Re: Comparison of 2 schemes to implement OpenMP 5.0
>>> declare mapper codegen
>>>
>>> In the current scheme, all mappings within a mapper function is
>>> done atomically by one thread. In the mapper function of the
>>> example in the original email, <push> will just push the mapping
>>> information into an internal data structure. Once all mapping
>>> information is available, the runtime will do the real mapping
>>> together. For your example, the behavior is the same as the code
>>> below:
>>>
>>> ...
>>> #pragma omp parallel num_threads(2)
>>> {
>>> if (omp_get_thread_num() == 0) {
>>> #pragma omp target map(s.x, s.p[0:s.x])
>>> {
>>> for (int i = 0; i < s.x; i++) s.p[i] = i;
>>> }
>>> } else {
>>> #pragma omp target map(other_data)
>>> {
>>> // work on other_data
>>> }
>>> }
>>> ...
>>>
>>> -------------------------
>>> FROM: Deepak Eachempati <[hidden email]>
>>> Sent: Thursday, June 27, 2019 3:34 PM
>>> To: Li, Lingda; [full recipient list trimmed]
>>> Subject: RE: Comparison of 2 schemes to implement OpenMP 5.0
>>> declare mapper codegen
>>>
>>> I was referring to something like this, where another thread is
>>> not trying to map the same data:
>>>
>>> #pragma omp declare mapper(S s) map(s.x) map(s.p[0:s.x])
>>> S s;
>>> ...
>>> #pragma omp parallel num_threads(2)
>>> {
>>> if (omp_get_thread_num() == 0) {
>>> #pragma omp target map(s)
>>> {
>>> for (int i = 0; i < s.x; i++) s.p[i] = i;
>>> }
>>> } else {
>>> #pragma omp target map(other_data)
>>> {
>>> // work on other_data
>>> }
>>> }
>>> ...
>>>
>>> Since I believe you are mapping s.x and s.p as separate map
>>> operations, it is possible that another thread could map
>>> ‘other_data’ in between those two maps. If this happens, will
>>> your implementation still ensure that s.x and s.p are positioned
>>> at the right offsets with respect to the same base address (&s)?
>>>
>>> -- Deepak
>>>
>>> FROM: Li, Lingda [mailto:[hidden email]]
>>> Sent: Thursday, June 27, 2019 2:26 PM
>>> To: Deepak Eachempati; [full recipient list trimmed]
>>> Subject: Re: Comparison of 2 schemes to implement OpenMP 5.0
>>> declare mapper codegen
>>>
>>> When 2 threads try to concurrently map the same data, it behaves
>>> the same as when 2 threads concurrently map the same data using
>>> map clauses, and mappers don't introduce extra considerations
>>> here. For instance, both threads use #omp target enter data
>>> concurrently.
>>>
>>> When 2 threads concurrently maps the same data, my understanding
>>> based on the current code is, it will create 2 copies of the same
>>> data, either copy is correctly to use. It may have a problem when
>>> both copies are mapped back if not synchronized correctly, but
>>> this is a programming issue, not the responsibility of OpenMP.
>>>
>>> Thanks,
>>> Lingda Li
>>>
>>> -------------------------
>>> FROM: Deepak Eachempati <[hidden email]>
>>> Sent: Thursday, June 27, 2019 3:17 PM
>>> To: Li, Lingda; [full recipient list trimmed]
>>> Subject: RE: Comparison of 2 schemes to implement OpenMP 5.0
>>> declare mapper codegen
>>>
>>> Thanks.
>>>
>>> Is it possible for another thread to be concurrently mapped
>>> something else while the maps from the mapper function are taking
>>> place? If so, how do you guarantee that the allocation for each
>>> component will get you the right addresses in device memory? Sorry
>>> if this was covered before and I missed it.
>>>
>>> -- Deepak
>>>
>>> FROM: Li, Lingda [mailto:[hidden email]]
>>> Sent: Thursday, June 27, 2019 2:08 PM
>>> To: Deepak Eachempati; [full recipient list trimmed]
>>> Subject: Re: Comparison of 2 schemes to implement OpenMP 5.0
>>> declare mapper codegen
>>>
>>> Hi Deepak,
>>>
>>> Yes, it handles this case. The first part of the mapper function
>>> (initially allocating space for the whole array) is just an
>>> optimization, not required for correctness, as you suggested in an
>>> earlier discussion.
>>>
>>> In your example, s.x and s.p will be allocated separately (not in
>>> a single allocation). But Clang guarantees that their addresses
>>> will be correct because s.x and s.p share the same base address,
>>> which is &s.
>>>
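>>> To make this concrete, here is a rough sketch (not the actual
>>> codegen; the map-type flag names are schematic and the per-element
>>> loop from the pseudocode above is omitted) of the two entries the
>>> generated mapper could push for your example under scheme 1:
>>>
>>> ```
>>> void S.mapper(void *base, void *begin, size_t size, int64_t type) {
>>>   S *s = (S *)begin;
>>>   // map(s.x): same base object (&s), member begins at &s->x.
>>>   <push>(begin, &s->x, sizeof(s->x), MEMBER_OF | clearToFrom(type));
>>>   // map(s.p[0:s.x]): pointer section; sharing the base is what lets the
>>>   // runtime keep the offsets right and attach s.p to s.p[0:s.x].
>>>   <push>(&s->p, s->p, s->x * sizeof(int), PTR_AND_OBJ | clearToFrom(type));
>>> }
>>> ```
>>>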
>>> Thanks,
>>> Lingda Li
>>>
>>> -------------------------
>>> FROM: Deepak Eachempati <[hidden email]>
>>> Sent: Thursday, June 27, 2019 2:49 PM
>>> To: Li, Lingda; [multi-company OpenMP distribution list]
>>> Subject: RE: Comparison of 2 schemes to implement OpenMP 5.0
>>> declare mapper codegen
>>>
>>> For Scheme 1, it looks like you are doing separate maps for each
>>> component when size == 1. It seems like the first and last if
>>> statements should have “size >= 1” rather than “size > 1”.
>>>
>>> If the mapper is declared like this:
>>>
>>> struct S {
>>>   int x;
>>>   ... // other stuff
>>>   int *p;
>>> };
>>>
>>> #pragma omp declare mapper(S s) map(s.x) map(s.p[0:s.x])
>>>
>>> And you have:
>>>
>>> S s;
>>> ...
>>> #pragma omp target map(s)
>>> {
>>>   for (int i = 0; i < s.x; i++) s.p[i] = i;
>>> }
>>>
>>> Since the target construct is just mapping a single structure of
>>> type S, there should be one map that takes care of mapping storage
>>> for s.x and s.p with a single allocation, and a separate map for
>>> the array section s.p[0:s.x], and finally the pointer attachment
>>> of s.p to s.p[0:s.x]. Does Scheme 1 handle this?
>>>
>>> -- Deepak
>>>
>>> FROM: Li, Lingda [mailto:[hidden email]]
>>> Sent: Thursday, June 27, 2019 1:07 PM
>>> To: Narayanaswamy, Ravi <[hidden email]>; [multi-company OpenMP distribution list]
>>> Subject: Comparison of 2 schemes to implement OpenMP 5.0 declare
>>> mapper codegen
>>>
>>> Hi,
>>>
>>> Alexey and I would like to have your attention on an ongoing
>>> discussion of 2 schemes to implement the declare mapper in OpenMP
>>> 5.0. The detailed discussion can be found at
>>> https://reviews.llvm.org/D59474 [1]
>>>
>>> Scheme 1 (the one has been implemented by me in
>>> https://reviews.llvm.org/D59474 [1]):
>>> The detailed design can be found at
>>>
>>
> https://github.com/lingda-li/public-sharing/blob/master/mapper_runtime_design.pptx
>>> [2]
>>> For each mapper function, the compiler generates a function like
>>> this:
>>>
>>> ```
>>> void <type>.mapper(void *base, void *begin, size_t size, int64_t
>>> type) {
>>> // Allocate space for an array section first.
>>> if (size > 1 && !maptype.IsDelete)
>>> <push>(base, begin, size*sizeof(Ty), clearToFrom(type));
>>>
>>> // Map members.
>>> for (unsigned i = 0; i < size; i++) {
>>> // For each component specified by this mapper:
>>> for (auto c : components) {
>>> ...; // code to generate c.arg_base, c.arg_begin, c.arg_size,
>>> c.arg_type
>>> if (c.hasMapper())
>>> (*c.Mapper())(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
>>> else
>>> <push>(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
>>> }
>>> }
>>> // Delete the array section.
>>> if (size > 1 && maptype.IsDelete)
>>> <push>(base, begin, size*sizeof(Ty), clearToFrom(type));
>>> }
>>> ```
>>> This function is passed to the OpenMP runtime, and the runtime
>>> will call this function to finish the data mapping.
>>>
>>> Scheme 2 (which Alexey proposes):
>>> Alexey proposes to move parts of the mapper function above into
>>> the OpenMP runtime, so the compiler will generate code below:
>>> ```
>>> void <type>.mapper(void *base, void *begin, size_t size, int64_t
>>> type) {
>>> ...; // code to generate arg_base, arg_begin, arg_size, arg_type,
>>> arg_mapper.
>>> auto sub_components[] = {...}; // fill in generated begin, base,
>>> ...
>>> __tgt_mapper(base, begin, size, type, sub_components);
>>> }
>>> ```
>>>
>>> `__tgt_mapper` is a runtime function as below:
>>> ```
>>> void __tgt_mapper(void *base, void *begin, size_t size, int64_t
>>> type, auto components[]) {
>>> // Allocate space for an array section first.
>>> if (size > 1 && !maptype.IsDelete)
>>> <push>(base, begin, size*sizeof(Ty), clearToFrom(type));
>>>
>>> // Map members.
>>> for (unsigned i = 0; i < size; i++) {
>>> // For each component specified by this mapper:
>>> for (auto c : components) {
>>> if (c.hasMapper())
>>> (*c.Mapper())(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
>>> else
>>> <push>(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
>>> }
>>> }
>>> // Delete the array section.
>>> if (size > 1 && maptype.IsDelete)
>>> <push>(base, begin, size*sizeof(Ty), clearToFrom(type));
>>> }
>>> ```
>>>
>>> Comparison:
>>> Why to choose 1 over 2:
>>> 1. In scheme 2, the compiler needs to generate all map types and
>>> pass them to __tgt_mapper through sub_components. But in this
>>> case, the compiler won't be able to generate the correct MEMBER_OF
>>> field in map type. As a result, the runtime has to fix it using
>>> the mechanism we already have here: __tgt_mapper_num_components.
>>> This not only increases complexity, but also, it means the runtime
>>> needs further manipulation of the map type, which creates locality
>>> issues. While in the current scheme, the map type is generated by
>>> compiler once, so the data locality will be very good in this
>>> case.
>>> 2. In scheme 2, sub_components includes all components that should
>>> be mapped. If we are mapping an array, this means we need to map
>>> many components, which will need to allocate memory for
>>> sub_components in the heap. This creates further memory management
>>> burden and is not an efficient way to use memory.
>>> 3. In scheme 1, we are able to inline nested mapper functions. As
>>> a result, the compiler can further optimize the mapper function,
>>> e.g., eliminating redundant computation and unrolling loops, and
>>> thus achieve potentially better performance. We cannot achieve
>>> these optimizations in scheme 2.
>>>
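>>> As an illustration of the nested-mapper case in point 3 (the types
>>> and mapper below are hypothetical), consider:
>>>
>>> ```
>>> // T contains a member of type S that has its own mapper. In scheme 1 the
>>> // inner call (*c.Mapper())(...) is a direct call to S.mapper, so the
>>> // optimizer can inline it into T.mapper and simplify the combined code.
>>> struct T { S s; double d; };
>>> #pragma omp declare mapper(T t) map(t.s) map(t.d)
>>> ```
>>>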
>>> Why to choose 2 over 1:
>>> 1. Less code in the mapper function codegen (I doubt this matters,
>>> because the codegen function for scheme 1 is less than 200 lines of code).
>>> Alexey may have other reasons.
>>>
>>> We would appreciate it if you could share your thoughts.
>>>
>>> Thanks,
>>> Lingda Li
>>>
>>> -------------------------
>>> FROM: Narayanaswamy, Ravi <[hidden email]>
>>> Sent: Wednesday, June 19, 2019 3:09 PM
>>> To: [multi-company OpenMP distribution list]
>>> Subject: OpenMP / HPC in Clang / LLVM Multi-company Telecom
>>> Meeting Minutes June 19th 2019
>>>
>>> NEXT MEETING : JULY 10TH (MOVED FROM JULY 3RD)
>>>
>>> OPENS :
>>> - DOCUMENTATION
>>> - Greg : Can we have documents for libopenmp and libomptarget?
>>> - Alexey suggested having 3 documents: libopenmp, libomptarget and
>>> the device plugin.
>>> - Hal will convert the existing libomptarget document. Once done,
>>> others can update the document to capture the existing implementation.
>>> Future additions to libomptarget will also require updates to the
>>> document.
>>> - Next, a libopenmp document will be created if it does not exist, or
>>> updated if one exists.
>>>
>>> LTO FOR FAT BINARY LINKING
>>> - Serguei (Intel) has an implementation which enables LTO and
>>> does away with linker scripts.
>>> Everybody agreed this is a good idea, especially since some linkers
>>> don't have support for linker scripts.
>>> AMD is interested in enabling LTO and would like to see the code.
>>> Serguei to post the code to get feedback from all.
>>> - Hal to present his proposal to support static fat archives using
>>> LTO in the next meeting.
>>>
>>> OPENMP 5.0 FEATURES
>>> - No update on setting up the public website. Johannes was out
>>> attending ISC.
>>> - New features added since last release (courtesy of Kelvin)
>>> - allocate clause/allocate directive - parsing+sema, codegen
>>> - mutexinout dependence-type for task
>>> - user-defined mapper (declare mapper) - parsing+sema.
>>> - omp_get_device_num() API routine
>>>
>>> DEVELOPMENT ACTIVITY
>>> - ASYNC API
>>> Support in Clang and libopenmp, including lit tests, has been checked
>>> in by Doru.
>>>
>>> - MAPPER SUPPORT
>>> Initial support for the mapper has been posted for review by Lingda.
>>> Once approved, the rest of the support will be done.
>>> Lingda : Should the old API being replaced by the new, similar API
>>> with an extra mapper argument be obsoleted?
>>> Suggestion was for clang to not generate it but keep the API in
>>> libomptarget for backward compatibility. In the future it can be
>>> obsoleted.
>>>
>>> - REQUIRED DIRECTIVES
>>> Support for required directives has been checked in by Doru.
>>> There was one issue with checking for the requires directive and
>>> confirming whether the Declare type is TO or LINK.
>>> Doru removed the check and added a note so that if things change in
>>> the future this code will need to be modified.
>>>
>>> ROLL CALL :
>>>
>>> COMPANY / ATTENDEES (x = attended 19-JUN):
>>>
>>> AMD: Greg Rodgers (x), Ashwin Aji, Jan Sjodin (x), Ron Lieberman (x),
>>> Sameer Sahasrabuddhe, Andrey Kasaurov
>>> ANL: Hal Finkel (x), Johannes Doerfert
>>> IBM: Alexandre Eichenberger, Carlo Bertolli, Kelvin Li, Doru (x),
>>> Alexey Bataev (x)
>>> INTEL: Andrey Churbanov, Ravi Narayanaswamy (x), Serguei Dmitriev (x),
>>> Rajiv Deodhar, Lorri Menard, Terry Wilmarth, Prem Rao, Hansang Bae,
>>> George Rokos (x)
>>> CRAY: Deepak Eachempati (x)
>>> MICRON: John Leidel
>>> NVIDIA: James Beyer (x)
>>> ORNL: Graham Lopez, Joel Denny, Geoffroy Vallee, Oscar Hernandez
>>> SBU/BNL: Lingda Li (x), Jose Monlsave, Martin Kong
>>> TI: Eric Stotzer
>>> U OF BRISTOL: Mat Martineau
>>> U OF DELAWARE: Sunita Chandrasekaran
>>> U OF ILLINOIS: Hashim Sharif
>>> RICE: John Mellor-Crummey
>>> LSU: Tianyi Zhang
>>>
>>>
>>
> .........................................................................................................................................
>>> Join Skype Meeting [3]
>>>
>>> Trouble Joining? Try Skype Web App [4]
>>>
>>> Join by phone
>>> +1(916)356-2663 (or your local bridge access #) Choose bridge 5.
>>> [5] (Global) English (United States)
>>> Find a local number [6]
>>>
>>> Conference ID: 7607896966
>>> Forgot your dial-in PIN? [6] |Help [7]
>>>
>>>
>>
> .........................................................................................................................................
>
>
> Links:
> ------
> [1]
> https://urldefense.proofpoint.com/v2/url?u=https-3A__reviews.llvm.org_D59474&amp;d=DwMFaQ&amp;c=aTOVZmpUfPKZuaG9NO7J7Mh6imZbfhL47t9CpZ-pCOw&amp;r=RLUU7gQynM_GwGu2QR7zHw&amp;m=0c8CuLZZzM3R7PecCmFPYLuPYEOtCJHYTIGjSgIPaWU&amp;s=EVaPRpEtSzi0Y56zmjD5fXRzN87UZDOaYp5PY3TXiVQ&amp;e=
> [2]
> https://github.com/lingda-li/public-sharing/blob/master/mapper_runtime_design.pptx
> [3]
> https://urldefense.proofpoint.com/v2/url?u=https-3A__meet.intel.com_ravi.narayanaswamy_DK7943NR&amp;d=DwMFaQ&amp;c=aTOVZmpUfPKZuaG9NO7J7Mh6imZbfhL47t9CpZ-pCOw&amp;r=RLUU7gQynM_GwGu2QR7zHw&amp;m=0c8CuLZZzM3R7PecCmFPYLuPYEOtCJHYTIGjSgIPaWU&amp;s=K4msFCmDvK4n0MdVQd7UTXRRvRkaNwLzMaP8fnX0iOg&amp;e=
> [4]
> https://urldefense.proofpoint.com/v2/url?u=https-3A__meet.intel.com_ravi.narayanaswamy_DK7943NR-3Fsl-3D1&amp;d=DwMFaQ&amp;c=aTOVZmpUfPKZuaG9NO7J7Mh6imZbfhL47t9CpZ-pCOw&amp;r=RLUU7gQynM_GwGu2QR7zHw&amp;m=0c8CuLZZzM3R7PecCmFPYLuPYEOtCJHYTIGjSgIPaWU&amp;s=krI3wEp2z8GhcZt6feFq3WgaBjcEoTDRk-GvI1BIdO8&amp;e=
> [5]
> tel:+1(916)356-2663%20(or%20your%20local%20bridge%20access%20#)%20Choose%20bridge%205.
> [6]
> https://urldefense.proofpoint.com/v2/url?u=https-3A__dial.intel.com&amp;d=DwMFaQ&amp;c=aTOVZmpUfPKZuaG9NO7J7Mh6imZbfhL47t9CpZ-pCOw&amp;r=RLUU7gQynM_GwGu2QR7zHw&amp;m=0c8CuLZZzM3R7PecCmFPYLuPYEOtCJHYTIGjSgIPaWU&amp;s=g2dQtoTqaRXyBMaIUpfyoPFDRTtrQbgbWbb9b90tgBg&amp;e=
> [7]
> https://urldefense.proofpoint.com/v2/url?u=https-3A__o15.officeredir.microsoft.com_r_rlidLync15-3Fclid-3D1033-26p1-3D5-26p2-3D2009&amp;d=DwMFaQ&amp;c=aTOVZmpUfPKZuaG9NO7J7Mh6imZbfhL47t9CpZ-pCOw&amp;r=RLUU7gQynM_GwGu2QR7zHw&amp;m=0c8CuLZZzM3R7PecCmFPYLuPYEOtCJHYTIGjSgIPaWU&amp;s=6OCBXxzOIJfra2Pewq_p-l2pY3MyKnuG-TLr7M1xq-s&amp;e=
> _______________________________________________
> cfe-dev mailing list
> [hidden email]
> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

Nathan Ridge via cfe-dev
Hi Jonas,

Sure, we are trying to do so. The public lists often reject my emails because they are large, and I cannot include all of the people on this thread via the mailing list, though.

Thanks,
Lingda Li

On Sat, Jun 29, 2019 at 8:39 AM Jonas Hahnfeld <[hidden email]> wrote:
Hi Lingda,

may I ask to start discussions about important decisions related to
Clang's OpenMP support on the public mailing list instead of having
private conversations? That would help to get feedback from people not
being part of the selected circle participating in the "OpenMP / HPC in
Clang / LLVM Multi-company Telecom".

Thanks,
Jonas

On 2019-06-28 15:59, Lingda Li via cfe-dev wrote:
> On Fri, Jun 28, 2019 at 9:49 AM Li, Lingda <[hidden email]> wrote:
>
>> I don't think we can have the buffer allocated within the mapper
>> function. It has to be done in the runtime, because of nested
>> mappers.
>> First, all mapper functions are generated in the same way. We cannot
>> make the outermost mapper function allocate memory while the inner
>> ones don't and instead have to use what was allocated by the
>> outermost mapper function.
>> I suppose we still need to allocate memory in the runtime, so the
>> runtime can pass the pointer and size to the mapper function, and
>> the outer mapper function can then pass them into the inner ones.
>> Again, this is just like the current implementation, except that we
>> don't use vector::push_back(); instead we use something like a
>> manual implementation of vector::push_back() (because we need to
>> track the pointer and the current index ourselves).
>>
>> I believe the key question here is whether it is true that the
>> overhead of push_back() > (the overhead of precalculating the total
>> number + the memory allocation overhead + direct memory writes).
>> This will decide whether this change is necessary. Any opinions?
>>
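>> For concreteness, a minimal sketch of the "manual push_back"
>> alternative being discussed (all names below are illustrative, not
>> the actual libomptarget API):
>>
>> ```
>> // Preallocated storage plus a running index, shared with nested mappers,
>> // written directly instead of calling the runtime's push_back each time.
>> struct MapBuffer {
>>   void **bases; void **begins; int64_t *sizes; int64_t *types;
>>   size_t idx; // current fill position, passed down to inner mappers
>> };
>>
>> static void buf_push(MapBuffer *b, void *base, void *begin,
>>                      int64_t size, int64_t type) {
>>   b->bases[b->idx] = base;
>>   b->begins[b->idx] = begin;
>>   b->sizes[b->idx] = size;
>>   b->types[b->idx] = type;
>>   ++b->idx; // no reallocation: the total count was precalculated
>> }
>> ```
>>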
>> Thanks,
>> Lingda Li
>>
>> -------------------------
>>
>> FROM: Alexey Bataev <[hidden email]>
>> SENT: Thursday, June 27, 2019 5:05 PM
>> TO: Li, Lingda
>> CC: [multi-company OpenMP distribution list]
>> SUBJECT: Re: Re: Re: RE: Comparison of 2 schemes to implement OpenMP
>> 5.0 declare mapper codegen
>>
>> Yes, we need 2 functions, but the first one can be optimized very
>> effectively. After the optimizations and inlining it will end up
>> with just return s1+s2+s3... I think in most cases those sizes will
>> be constant, since the mapper maps a constant number of elements.
>> And, thus, this expression will be optimized to just a constant
>> value. You don't need to pass these functions to the runtime. We can
>> call them directly from the compiler-generated code:
>> 1st call: get the number of elements.
>> 2nd call: allocate the buffer.
>> 3rd call: call the mapper with this preallocated buffer; it fills the
>> buffer without any calls to runtime functions.
>> 4th call: call the runtime once to pass the buffer to it.
>>
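>> A sketch of that call sequence, reusing the illustrative MapBuffer
>> from above (none of these names are an existing libomptarget
>> interface):
>>
>> ```
>> // 1st call: ask the generated size function for the component count.
>> int64_t n = S.mapper.num_components(begin, size);
>> // 2nd call: allocate the buffer once, up front.
>> MapBuffer buf = alloc_map_buffer(n);
>> // 3rd call: the generated mapper fills the buffer directly, with no
>> // runtime calls (it would take the buffer as an extra argument).
>> S.mapper(&buf, base, begin, size, type);
>> // 4th call: hand the completed buffer to the runtime in one call.
>> pass_mappings_to_runtime(&buf);
>> ```
>>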
>> Best regards,
>> Alexey Bataev
>>
>> On June 27, 2019, at 16:53, Li, Lingda <[hidden email]> wrote:
>>
>>> If we precalculate the size, it first means we need to generate
>>> 2 functions for each mapper, rather than 1 as now: one for filling
>>> in the mapping information, as we have today, and the other for
>>> size calculation (this will not always return constant values,
>>> because the size depends on how many instances we are mapping).
>>> Both of these functions will need to be passed to the runtime. The
>>> runtime will need to precalculate the number of components first,
>>> then allocate memory, then call the mapper function to fill it up.
>>>
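>>> As a rough illustration (using Deepak's S mapper from earlier, and
>>> purely hypothetical names), the extra size-calculation function
>>> might look like this:
>>>
>>> ```
>>> // Counts the components the filling function would push; for this mapper
>>> // it folds to 2 * size, but a nested mapper would add its own count here.
>>> int64_t S.mapper.num_components(void *begin, size_t size) {
>>>   int64_t n = 0;
>>>   for (size_t i = 0; i < size; ++i)
>>>     n += 2; // one entry for s.x, one for the array section s.p[0:s.x]
>>>   return n;
>>> }
>>> ```
>>>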
>>> Compared with scheme 1, the differences are:
>>> 1) an extra call to calculate the total number of components, which
>>> scheme 1 does not need;
>>> 2) a preallocated buffer, whose pointer and current index are
>>> passed to the mapper function, which then uses them to fill in the
>>> components, whereas scheme 1 uses push_back() to do the same
>>> thing.
>>>
>>> Is there really a benefit to doing this? push_back() should be
>>> efficient enough compared with writing directly to memory.
>>>
>>> If people here think that the overhead of push_back() is greater
>>> than (the overhead of precalculating the total number + the memory
>>> allocation overhead + direct memory writes), then we can consider
>>> this scheme.
>>>
>>> Thanks,
>>> Lingda Li
>>>
>>> -------------------------
>>>
>>> FROM: Alexey Bataev <[hidden email]>
>>> SENT: Thursday, June 27, 2019 4:26 PM
>>> TO: Li, Lingda
>>> CC: [multi-company OpenMP distribution list]
>>> SUBJECT: Re: Re: RE: Comparison of 2 schemes to implement OpenMP
>>> 5.0 declare mapper codegen
>>>
>>> If the functions are inlined (the ones intended for size
>>> precalculation), they can be optimized out very effectively, since
>>> in most cases they will return constant values.
>>> If we could do this, we wouldn't need vectors and push_backs; we
>>> could use preallocated memory and an internal counter.
>>> --------------
>>> Best regards,
>>> Alexey Bataev
>>>
>>>
>>> From: "Li, Lingda" <[hidden email]>
>>> To: Alexey Bataev <[hidden email]>, Deepak Eachempati <[hidden email]>
>>> Cc: [multi-company OpenMP distribution list]
>>> Date: 06/27/2019 04:13 PM
>>> Subject: [EXTERNAL] Re: RE: Comparison of 2 schemes to implement
>>> OpenMP 5.0 declare mapper codegen
>>>
>>> -------------------------
>>>
>>> Hi Alexey,
>>>
>>> I think that's why we chose to use variable-size storage like
>>> std::vector to store the mapping information in the first place,
>>> right? It will be costly to precalculate the total number of
>>> components, especially in the presence of nested mappers. Besides,
>>> each runtime function call is just a std::vector::push_back, so I
>>> think it's okay to have multiple function calls.
>>>
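>>> As an illustration of how cheap each call is (the names here are
>>> made up, not the actual runtime entry point), a single <push> in
>>> scheme 1 is essentially just:
>>>
>>> ```
>>> // Append one component's mapping record to the runtime's growable storage.
>>> void push_component(MappingInfo *info, void *base, void *begin,
>>>                     int64_t size, int64_t type) {
>>>   info->bases.push_back(base);
>>>   info->begins.push_back(begin);
>>>   info->sizes.push_back(size);
>>>   info->types.push_back(type);
>>> }
>>> ```
>>>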
>>> Thanks,
>>> Lingda Li
>>>
>>> -------------------------
>>>
>>> FROM: Alexey Bataev <[hidden email]>
>>> Sent: Thursday, June 27, 2019 3:52 PM
>>> To: Deepak Eachempati
>>> Cc: Li, Lingda; [multi-company OpenMP distribution list]
>>> Subject: Re: RE: Comparison of 2 schemes to implement OpenMP 5.0
>>> declare mapper codegen
>>>
>>> Lingda, can we in scheme 1 precalculate the total number of
>>> components, allocate memory for this precalculated number of
>>> elements, then fill it in with the mappers, and only after that call
>>> the runtime function (only once!) to transfer the mappings to the
>>> runtime?
>>>
>>> Best regards,
>>> Alexey Bataev
>>>
>>> On June 27, 2019, at 15:44, Deepak Eachempati <[hidden email]> wrote:
>>>
>>> Got it. Thanks.
>>>
>>> -- Deepak
>>>
>>> FROM: Li, Lingda [mailto:[hidden email]]
>>> Sent: Thursday, June 27, 2019 2:41 PM
>>> To: Deepak Eachempati <[hidden email]>; [multi-company OpenMP distribution list]
>>> Subject: Re: Comparison of 2 schemes to implement OpenMP 5.0
>>> declare mapper codegen
>>>
>>> In the current scheme, all mappings within a mapper function are
>>> done atomically by one thread. In the mapper function of the
>>> example in the original email, <push> just pushes the mapping
>>> information into an internal data structure. Once all mapping
>>> information is available, the runtime does the real mapping in one
>>> go. For your example, the behavior is the same as the code below:
>>>
>>> ...
>>> #pragma omp parallel num_threads(2)
>>> {
>>>   if (omp_get_thread_num() == 0) {
>>>     #pragma omp target map(s.x, s.p[0:s.x])
>>>     {
>>>       for (int i = 0; i < s.x; i++) s.p[i] = i;
>>>     }
>>>   } else {
>>>     #pragma omp target map(other_data)
>>>     {
>>>       // work on other_data
>>>     }
>>>   }
>>> ...
>>>
>>> -------------------------
>>> FROM: Deepak Eachempati <[hidden email]>
>>> Sent: Thursday, June 27, 2019 3:34 PM
>>> To: Li, Lingda; [multi-company OpenMP distribution list]
>>> Subject: RE: Comparison of 2 schemes to implement OpenMP 5.0
>>> declare mapper codegen
>>>
>>> I was referring to something like this, where another thread is
>>> not trying to map the same data:
>>>
>>> #pragma omp declare mapper(S s) map(s.x) map(s.p[0:s.x])
>>> S s;
>>> ...
>>> #pragma omp parallel num_threads(2)
>>> {
>>>   if (omp_get_thread_num() == 0) {
>>>     #pragma omp target map(s)
>>>     {
>>>       for (int i = 0; i < s.x; i++) s.p[i] = i;
>>>     }
>>>   } else {
>>>     #pragma omp target map(other_data)
>>>     {
>>>       // work on other_data
>>>     }
>>>   }
>>> ...
>>>
>>> Since I believe you are mapping s.x and s.p as separate map
>>> operations, it is possible that another thread could map
>>> ‘other_data’ in between those two maps. If this happens, will
>>> your implementation still ensure that s.x and s.p are positioned
>>> at the right offsets with respect to the same base address (&s)?
>>>
>>> -- Deepak
>>>
>>> FROM: Li, Lingda [mailto:[hidden email]]
>>> Sent: Thursday, June 27, 2019 2:26 PM
>>> To: Deepak Eachempati <[hidden email]>; [multi-company OpenMP distribution list]
>>> Subject: Re: Comparison of 2 schemes to implement OpenMP 5.0
>>> declare mapper codegen
>>>
>>> When 2 threads try to concurrently map the same data, it behaves
>>> the same as when 2 threads concurrently map the same data using
>>> map clauses, and mappers don't introduce extra considerations
>>> here. For instance, both threads use #pragma omp target enter data
>>> concurrently.
>>>
>>> When 2 threads concurrently map the same data, my understanding
>>> based on the current code is that it will create 2 copies of the
>>> same data, and either copy is correct to use. It may be a problem if
>>> both copies are mapped back without correct synchronization, but
>>> this is a programming issue, not the responsibility of OpenMP.
>>>
>>> Thanks,
>>> Lingda Li
>>>

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

Nathan Ridge via cfe-dev
In reply to this post by Nathan Ridge via cfe-dev
Hi Alexey, Hal, and James,

Please see my response inline below:

On Sat, Jun 29, 2019 at 8:30 AM Alexey Bataev <[hidden email]> wrote:
Hi Hal,

Best regards,
Alexey Bataev

On June 28, 2019, at 11:46 PM, Finkel, Hal J. via cfe-dev <[hidden email]> wrote:

Hi, Alexey, Lingda,

I haven't been following this closely, so a few questions/comments:

 1. Recursive mappers are not supported in OpenMP 5, but do we expect that to change in the future?

Good question. I do not know, actually, but I think both of those schemes can be adapted to support recursive mappers.

I agree. It will be trivial to support recursive mappers within the framework of these schemes if needed in the future. In the case of recursive mappers, mapper functions won't be able to be fully inlined in scheme 1, so compiler optimization may be limited.

 2. Our experience so far suggests that the most important optimization in this space is to limit the number of distinct host-to-device transfers (or data copies) on systems where data needs to be copied. In these schemes, where does that coalescing occur?

In both schemes we transfer the data only once: after we gather all the required data mapping info, we transfer it to the device at once. The only difference between these schemes is the number of runtime function calls required to fill this mapping data.

Both schemes can do such coalescing in the runtime after all mapping information is collected. Scheme 1 could also do such coalescing during compiler optimization of the mapper function, though it would be hard to do so.
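To make the "gather everything first, then transfer" behavior concrete, here is a minimal sketch; the descriptor struct and helper names are invented for illustration and are not the actual libomptarget interface:

```
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flat descriptor matching the (base, begin, size, type)
// quadruple used throughout this thread.
struct MapComponent {
  void *base;
  void *begin;
  size_t size;
  int64_t type;
};

// Mapper code only appends descriptors; nothing is transferred yet.
void pushComponent(std::vector<MapComponent> &table, void *base, void *begin,
                   size_t size, int64_t type) {
  table.push_back({base, begin, size, type});
}

// Only after the whole table has been gathered are the host-to-device copies
// issued, so transfers can be coalesced over the full set of collected
// entries rather than performed per mapper call.
void transferAllToDevice(const std::vector<MapComponent> &table) {
  for (const MapComponent &c : table) {
    // ... allocate device memory and copy c.size bytes starting at c.begin ...
    (void)c;
  }
}
```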
 

 3. So long as the mappers aren't recursive, I agree with Alexey that the total number of to-be-mapped components should be efficient to calculate. The counting function should simplify to a trivial expression in nearly all cases. The only case where it might not is where the type contains an array section with dynamic bounds, and the element type also has a mapper with an array section with dynamic bounds. In this case (similar to the unsupported recursive cases, which as an aside, we should probably support it as an extension) we could need to walk the data structure twice to precalculate the number of total components to map. However, this case is certainly detectable by static analysis of the declared mappers, and so I think that we can get the best of both worlds: we could use Alexey's proposed scheme except in cases where we truly need to walk the data-structure twice, in which case we could use Lingda's combined walk/push_back scheme. Is there any reason why that wouldn't work?

I think it is better to use only one scheme. I rather doubt that we can implement some kind of analysis in the frontend. Later, when the real codegen is moved to the backend, we can try to implement 2 schemes. But not today. We need to choose one, and I just want to hear all the pros and cons for the schemes (actually, there are 3 schemes already) to choose the most flexible, reliable, and fast one.

The benefit of scheme 2 is to have memory preallocated instead of using push_back().
Hal, do you think the performance overhead of push_back() is larger than the overhead of precalculating total size, and why?
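For concreteness, here is a minimal sketch of the two variants being weighed, for a hypothetical mapper that emits two components per element; the type, helper names, and layout are all made up for illustration:

```
#include <cstddef>
#include <cstdint>
#include <vector>

struct MapComponent { void *base; void *begin; size_t size; int64_t type; };

// A hypothetical mapped type with two components per element.
struct S { int x; double *p; };

// Scheme 1 style: the generated mapper appends components as it walks.
void mapper_push_back(std::vector<MapComponent> &table, S *begin, size_t size,
                      int64_t type) {
  for (size_t i = 0; i < size; ++i) {
    table.push_back({&begin[i], &begin[i].x, sizeof(int), type});
    table.push_back({&begin[i], &begin[i].p, sizeof(double *), type});
  }
}

// Preallocation variant: a generated counting function that, when the number
// of components per element is fixed, folds to size * constant ...
size_t mapper_count(size_t size) { return size * 2; }

// ... so the caller can allocate the table once and let the mapper fill it
// by index instead of calling push_back().
void mapper_fill(MapComponent *table, size_t &idx, S *begin, size_t size,
                 int64_t type) {
  for (size_t i = 0; i < size; ++i) {
    table[idx++] = {&begin[i], &begin[i].x, sizeof(int), type};
    table[idx++] = {&begin[i], &begin[i].p, sizeof(double *), type};
  }
}

// Hypothetical driver: count once, allocate once, fill, hand off in one call.
void map_with_preallocation(S *begin, size_t size, int64_t type) {
  std::vector<MapComponent> table(mapper_count(size));
  size_t idx = 0;
  mapper_fill(table.data(), idx, begin, size, type);
  // ... a single runtime call taking (table.data(), table.size()) goes here ...
}
```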

Thanks,
Lingda Li

Thanks again,

Hal

On 6/28/19 9:00 AM, Alexey Bataev wrote:

Hi Lingda, thanks for your comments.
We can allocate the buffer either by allocating it on the stack or calling OpenMP allocate function.
With this solution, we allocate memory only once (no need to resize buffer after push_backs) and we do not need to call the runtime function to put map data to the buffer, compiler generated code can do it.
But anyway, I agree, it would be good to hear some other opinions.
--------------
Best regards,
Alexey Bataev


...

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Re: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

Nathan Ridge via cfe-dev
In reply to this post by Nathan Ridge via cfe-dev

Lingda Li via cfe-dev wrote:

On Fri, Jun 28, 2019 at 9:49 AM Li, Lingda [hidden email] wrote

I believe the key question here is whether it is true that (the
overhead of push_back() > the overhead of precalculating the total
number + the memory allocation overhead + directly memory write).

Not familiar with this area, nevertheless given

I believe the key question here is whether it is true that (the
overhead of push_back() > the overhead of precalculating the total
number + the memory allocation overhead + directly memory write).

What we appear to be looking at is something very close to vectors of vectors, using position pointers and indexes, as opposed to a separate key index as in the case of std::map.

Given that we have vectors of vectors (nested mappers), the nested vector (the vector under the parent vector) would almost necessarily need to be reached via a pointer in the parent vector. The reductio ad absurdum is that if a nested dynamic (resizable) vector were stored inline, it would be impossible to use pointer arithmetic on the parent, because the parent object could change size as the nested vector grows.

The result is that the size of the parent object is kept fixed by using a pointer to the nested dynamic vector. Hence the ability to allocate these vectors at compile time depends on knowing the lengths of the vectors in descending order, parent first. When the length of a child vector is not known, that vector and its children would need to be allocated at run time.

I believe the key question here is whether it is true that (the
overhead of push_back() > the overhead of precalculating the total
number + the memory allocation overhead + directly memory write).

There are two inefficiencies associated with dynamic vectors. That is, the question appears to be whether or not we know the vector lengths at compile time; if we do not have the lengths, we must use dynamic vectors.

Dynamic vectors require a reserve area that can receive push_back objects. Some of that empty memory area will be wasted, as it will likely never be used, and if you do not have a reasonable idea of how long the eventual vector is likely to be, you can waste a large amount of memory.

And if the initial estimate of the vector length is too small, reallocating memory and copying to a larger vector when the first allocation is used up takes time.

A deque may be a better choice than a dynamic vector. You can still use position indexes, though not pointer arithmetic, and a deque grows in chunks automatically without reallocating existing elements.

I suggest that the downsides of dynamic vectors favor allocation in the compiler when the lengths can be known.
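As a small, self-contained illustration of the reallocation behavior discussed above (plain C++, independent of the mapper codegen):

```
#include <cstdio>
#include <vector>

int main() {
  // Growing by push_back without a size hint: the vector reallocates and
  // copies its elements every time its capacity is exhausted.
  std::vector<int> grown;
  size_t reallocations = 0;
  for (int i = 0; i < 1000; ++i) {
    if (grown.size() == grown.capacity())
      ++reallocations;
    grown.push_back(i);
  }

  // When the final length is known up front, a single reserve() (or a sized
  // construction) avoids all of that copying.
  std::vector<int> reserved;
  reserved.reserve(1000);
  for (int i = 0; i < 1000; ++i)
    reserved.push_back(i);

  std::printf("reallocations without reserve: %zu\n", reallocations);
  return 0;
}
```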

Neil Nelson



Re: [Openmp-dev] Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

Nathan Ridge via cfe-dev
In reply to this post by Nathan Ridge via cfe-dev


On 6/29/19 7:58 AM, Lingda Li via Openmp-dev wrote:
Hi Jonas,

Sure, we are trying to do so. The public lists often reject my emails because they are large, and I cannot fit all of the people on this thread into that mailing list here, though.


That shouldn't be necessary. Everyone on that list should be subscribed to openmp-dev. That's where technical discussions about libomptarget should be held. If there are particular people of interest, then cc them, but there's no need to directly cc everyone who might be on the biweekly call.

Thanks again,

Hal



Thanks,
Lingda Li

On Sat, Jun 29, 2019 at 8:39 AM Jonas Hahnfeld <[hidden email]> wrote:
Hi Lingda,

may I ask that discussions about important decisions related to
Clang's OpenMP support be started on the public mailing list instead
of in private conversations? That would help to get feedback from
people who are not part of the select circle participating in the
"OpenMP / HPC in Clang / LLVM Multi-company Telecom".

Thanks,
Jonas

On 2019-06-28 15:59, Lingda Li via cfe-dev wrote:
> On Fri, Jun 28, 2019 at 9:49 AM Li, Lingda <[hidden email]> wrote:
>
>> I don't think we can have the buffer allocated within the mapper
>> function. It has to be done in the runtime, because of nested
>> mappers.
>> First, all mapper functions are generated in the same way. We cannot
>> have the outermost mapper function allocate memory while the inner
>> ones don't and instead use what was allocated by the outermost
>> mapper function.
>> I suppose we still need to allocate memory in the runtime, so the
>> runtime can pass the pointer and size to the mapper function, and
>> the outer mapper function can then pass them on to the inner ones.
>> Again, this is just like the current implementation, except that
>> instead of vector::push_back() we use something like a manual
>> implementation of vector::push_back() (because we need to use the
>> pointer and the current index)
>>
>> I believe the key question here is whether it is true that (the
>> overhead of push_back() > the overhead of precalculating the total
>> number + the memory allocation overhead + directly memory write).
>> This will decide whether this change is necessary. Any opinions?
>>
>> Thanks,
>> Lingda Li
>>
>> -------------------------
>>
>> FROM: Alexey Bataev <[hidden email]>
>> SENT: Thursday, June 27, 2019 5:05 PM
>> TO: Li, Lingda
>> SUBJECT: Re: Re: Re: RE: Comparison of 2 schemes to implement OpenMP
>> 5.0 declare mapper codegen
>>
>> Yes, we need 2 functions, but the first one can be optimized very
>> effectively. After the optimizations and inlining it will end up
>> with just return s1+s2+s3... I think in most cases those sizes will
>> be constant, since the mapper maps a constant number of elements.
>> And, thus, this expression will be optimized to just a constant
>> value. You don't need to pass these functions to the runtime. We
>> can call them directly from the compiler.
>> 1st call: get the number of elements.
>> 2nd: allocate the buffer.
>> 3rd call: call the mapper with this preallocated buffer; it fills
>> the buffer without any calls to the runtime functions.
>> 4th call: pass the buffer to the runtime.
>>
>> Best regards,
>> Alexey Bataev
>>
>> On June 27, 2019, at 4:53 PM, Li, Lingda <[hidden email]>
>> wrote:
>>
>>> If we precalculate the size, it means we first need to generate
>>> 2 functions for each mapper, rather than 1 as now: one for filling
>>> in the mapping information, as we have, and the other for the size
>>> calculation (this will not return constant values, because the
>>> size depends on how many instances we are mapping). Both of these
>>> functions will need to be passed to the runtime. The runtime will
>>> need to precalculate the number of components first, then allocate
>>> memory, then call the mapper function to fill it up.
>>>
>>> Compared with scheme 1, the differences are:
>>> 1) An extra call to calculate the total number, which scheme 1
>>> does not need;
>>> 2) A preallocated buffer, whose pointer and the current number
>>> should be passed to the mapper function; the mapper function then
>>> uses them to fill in components, while scheme 1 uses push_back()
>>> to do the same thing.
>>>
>>> Is there really a benefit to doing this? push_back() should be
>>> efficient enough compared with writing directly to memory.
>>>
>>> If people here think that, the overhead of push_back() > the
>>> overhead of precalculating the total number + the memory
>>> allocation overhead + directly memory write, then we can consider
>>> this scheme.
>>>
>>> Thanks,
>>> Lingda Li
>>>
>>> -------------------------
>>>
>>> FROM: Alexey Bataev <[hidden email]>
>>> SENT: Thursday, June 27, 2019 4:26 PM
>>> TO: Li, Lingda
>>> SUBJECT: Re: Re: RE: Comparison of 2 schemes to implement OpenMP
>>> 5.0 declare mapper codegen
>>>
>>> If the functions are inlined (the ones intended for size
>>> precalculation), they can be optimized very effectively, since in
>>> most cases they will return constant values.
>>> If we could do this, we wouldn't need vectors and push_backs; we
>>> could use preallocated memory and an internal counter.
>>> --------------
>>> Best regards,
>>> Alexey Bataev
>>>
>>>
>>> From: "Li, Lingda" <[hidden email]>
>>> To: Alexey Bataev <[hidden email]>, Deepak Eachempati
>>> <[hidden email]>
>>> Cc: "Narayanaswamy, Ravi" <[hidden email]>,
>>> "Alexandre Eichenberger" <[hidden email]>, "Chapman, Barbara
>>> (Contact)" <[hidden email]>, "Bobrovsky,
>>> Konstantin S" <[hidden email]>, Carlo Bertolli
>>> <[hidden email]>, "Chan, SiuChi" <[hidden email]>,
>>> "Cownie, James H" <[hidden email]>, David Oehmke
>>> <[hidden email]>, "Denny, Joel E." <[hidden email]>,
>>> "Dmitriev, Serguei N" <[hidden email]>, "Doerfert,
>>> Johannes Rudolf" <[hidden email]>, Ettore Tiotto
>>> <[hidden email]>, "[hidden email]"
>>> <[hidden email]>, Gheorghe-Teod Bercea
>>> <[hidden email]>, Hal Finkel <[hidden email]>,
>>> "[hidden email]" <[hidden email]>, Jeeva Paudel
>>> <[hidden email]>, Jeff Heath <[hidden email]>, Jeffrey
>>> Sandoval <[hidden email]>, "Jones, Jeff C"
>>> <[hidden email]>, "[hidden email]" <[hidden email]>,
>>> Kelvin Li <[hidden email]>, "Kevin K O'Brien"
>>> <[hidden email]>, "[hidden email]"
>>> <[hidden email]>, "Kotsifakou, Maria"
>>> <[hidden email]>, "Krishnaiyer, Rakesh"
>>> <[hidden email]>, "Lieberman, Ron"
>>> <[hidden email]>, "Lopez, Matthew Graham"
>>> <[hidden email]>, "[hidden email]" <[hidden email]>, Martin
>>> Kong <[hidden email]>, Matt Martineau
>>> <[hidden email]>, "Menard, Lorri"
>>> <[hidden email]>, "Monteleone, Robert"
>>> <[hidden email]>, "[hidden email]" <[hidden email]>,
>>> "Rao, Premanand M" <[hidden email]>, "Rice, Michael P"
>>> <[hidden email]>, "Robichaux, Joseph"
>>> <[hidden email]>, "[hidden email]"
>>> <[hidden email]>, "Rokos, Georgios"
>>> <[hidden email]>, Samuel Antao <[hidden email]>,
>>> "Sarah McNamara" <[hidden email]>,
>>> "[hidden email]" <[hidden email]>,
>>> Sergio Pino Gallardo <[hidden email]>, "Sharif, Hashim"
>>> <[hidden email]>, "Sjodin, Jan" <[hidden email]>, Sunil
>>> Shrestha <[hidden email]>, Sunita Chandrasekaran
>>> <[hidden email]>, "Tian, Xinmin" <[hidden email]>,
>>> Tianyi Zhang <[hidden email]>, "[hidden email]"
>>> <[hidden email]>, Wael Yehia <[hidden email]>, Wang Chen
>>> <[hidden email]>, "Wilmarth, Terry L"
>>> <[hidden email]>
>>> Date: 06/27/2019 04:13 PM
>>> Subject: [EXTERNAL] Re: RE: Comparison of 2 schemes to implement
>>> OpenMP 5.0 declare mapper codegen
>>>
>>> -------------------------
>>>
>>> Hi Alexey,
>>>
>>> I think that's why we chose to use variable-size storage like
>>> std::vector to store the mapping information in the first place,
>>> right? It would be costly to precalculate the total number of
>>> components, especially in the presence of nested mappers. Besides,
>>> a runtime function call is essentially just a std::vector
>>> push_back, so I think it's okay to have multiple function calls.
>>>
>>> Thanks,
>>> Lingda Li
>>>
>>> -------------------------
>>>
>>> FROM: Alexey Bataev <[hidden email]>
>>> Sent: Thursday, June 27, 2019 3:52 PM
>>> To: Deepak Eachempati
>>> Subject: Re: RE: Comparison of 2 schemes to implement OpenMP 5.0
>>> declare mapper codegen
>>>
>>> Lingda, can we in scheme 1 precalculate the total number of
>>> components, allocate memory for this precalculated number of
>>> elements, then fill it with the mappers, and only after that call
>>> the runtime function (only once!) to transfer the mappings to the
>>> runtime?
>>>
>>> Best regards,
>>> Alexey Bataev
>>>
>>> On June 27, 2019, at 3:44 PM, Deepak Eachempati
>>> <[hidden email]> wrote:
>>>
>>> Got it. Thanks.
>>>
>>> -- Deepak
>>>
>>> FROM: Li, Lingda [mailto:[hidden email]]
>>> Sent: Thursday, June 27, 2019 2:41 PM
>>> To: Deepak Eachempati <[hidden email]>; ...
>>> Subject: Re: Comparison of 2 schemes to implement OpenMP 5.0
>>> declare mapper codegen
>>>
>>> In the current scheme, all mappings within a mapper function are
>>> done atomically by one thread. In the mapper function of the
>>> example in the original email, <push> will just push the mapping
>>> information into an internal data structure. Once all mapping
>>> information is available, the runtime will do the real mapping
>>> together. For your example, the behavior is the same as the code
>>> below:
>>>
>>> ...
>>> #pragma omp parallel num_threads(2)
>>> {
>>>   if (omp_get_thread_num() == 0) {
>>>     #pragma omp target map(s.x, s.p[0:s.x])
>>>     {
>>>       for (int i = 0; i < s.x; i++) s.p[i] = i;
>>>     }
>>>   } else {
>>>     #pragma omp target map(other_data)
>>>     {
>>>       // work on other_data
>>>     }
>>>   }
>>> ...
>>>
>>> -------------------------
>>> FROM: Deepak Eachempati <[hidden email]>
>>> Sent: Thursday, June 27, 2019 3:34 PM
>>> To: Li, Lingda; ...
>>> Subject: RE: Comparison of 2 schemes to implement OpenMP 5.0
>>> declare mapper codegen
>>>
>>> I was referring to something like this, where another thread is
>>> not trying to map the same data:
>>>
>>> #pragma omp declare mapper(S s) map(s.x) map(s.p[0:s.x])
>>> S s;
>>> ...
>>> #pragma omp parallel num_threads(2)
>>> {
>>>   if (omp_get_thread_num() == 0) {
>>>     #pragma omp target map(s)
>>>     {
>>>       for (int i = 0; i < s.x; i++) s.p[i] = i;
>>>     }
>>>   } else {
>>>     #pragma omp target map(other_data)
>>>     {
>>>       // work on other_data
>>>     }
>>>   }
>>> ...
>>>
>>> Since I believe you are mapping s.x and s.p as separate map
>>> operations, it is possible that another thread could map
>>> ‘other_data’ in between those two maps. If this happens, will
>>> your implementation still ensure that s.x and s.p are positioned
>>> at the right offsets with respect to the same base address (&s)?
>>>
>>> -- Deepak
>>>
>>> FROM: Li, Lingda [mailto:[hidden email]]
>>> Sent: Thursday, June 27, 2019 2:26 PM
>>> To: Deepak Eachempati <[hidden email]>; ...
>>> Subject: Re: Comparison of 2 schemes to implement OpenMP 5.0
>>> declare mapper codegen
>>>
>>> When 2 threads try to concurrently map the same data, it behaves
>>> the same as when 2 threads concurrently map the same data using
>>> map clauses; mappers don't introduce extra considerations here.
>>> For instance, both threads use #pragma omp target enter data
>>> concurrently.
>>>
>>> When 2 threads concurrently map the same data, my understanding
>>> based on the current code is that it will create 2 copies of the
>>> same data, and either copy is correct to use. There may be a
>>> problem when both copies are mapped back without correct
>>> synchronization, but that is a programming issue, not the
>>> responsibility of OpenMP.
>>>
>>> Thanks,
>>> Lingda Li
>>>
>>> -------------------------
>>> FROM: Deepak Eachempati <[hidden email]>
>>> Sent: Thursday, June 27, 2019 3:17 PM
>>> To: Li, Lingda; ...
>>> Subject: RE: Comparison of 2 schemes to implement OpenMP 5.0
>>> declare mapper codegen
>>>
>>> Thanks.
>>>
>>> Is it possible for another thread to concurrently map something
>>> else while the maps from the mapper function are taking place? If
>>> so, how do you guarantee that the allocation for each component
>>> will get you the right addresses in device memory? Sorry if this
>>> was covered before and I missed it.
>>>
>>> -- Deepak
>>>
>>> FROM: Li, Lingda [mailto:[hidden email]]
>>> Sent: Thursday, June 27, 2019 2:08 PM
>>> To: Deepak Eachempati <[hidden email]>; ...
>>> Subject: Re: Comparison of 2 schemes to implement OpenMP 5.0
>>> declare mapper codegen
>>>
>>> Hi Deepak,
>>>
>>> Yes, it handles this case. The first part of the mapper function
>>> (initially allocating space for the whole array) is just an
>>> optimization, not required for correctness, as you suggested in an
>>> earlier discussion.
>>>
>>> In your example, s.x and s.p will be allocated separately (not in
>>> a single allocation). But Clang guarantees that their addresses
>>> will be correct because s.x and s.p share the same base address,
>>> which is &s.
>>>
>>> Thanks,
>>> Lingda Li
>>>
>>> -------------------------
>>> FROM: Deepak Eachempati <[hidden email]>
>>> Sent: Thursday, June 27, 2019 2:49 PM
>>> To: Li, Lingda; ...
>>> Subject: RE: Comparison of 2 schemes to implement OpenMP 5.0
>>> declare mapper codegen
>>>
>>> For Scheme 1, it looks like you are doing separate maps for each
>>> component when size == 1. It seems like the first and last if
>>> statements should have “size >= 1” rather than “size > 1”.
>>>
>>> If the mapper is declared like this:
>>>
>>> struct S {
>>>   int x;
>>>   ... // other stuff
>>>   int *p;
>>> };
>>>
>>> #pragma omp declare mapper(S s) map(s.x) map(s.p[0:s.x])
>>>
>>> And you have:
>>>
>>> S s;
>>> ...
>>> #pragma omp target map(s)
>>> {
>>>   for (int i = 0; i < s.x; i++) s.p[i] = i;
>>> }
>>>
>>> Since the target construct is just mapping a single structure of
>>> type S, there should be one map that takes care of mapping storage
>>> for s.x and s.p with a single allocation, and a separate map for
>>> the array section s.p[0:s.x], and finally the pointer attachment
>>> of s.p to s.p[0:s.x]. Does Scheme 1 handle this?
>>>
>>> -- Deepak
>>>
>>> FROM: Li, Lingda [mailto:[hidden email]]
>>> Sent: Thursday, June 27, 2019 1:07 PM
>>> To: Narayanaswamy, Ravi <[hidden email]>; ...
>>> Subject: Comparison of 2 schemes to implement OpenMP 5.0 declare
>>> mapper codegen
>>>
>>> Hi,
>>>
>>> Alexey and I would like to have your attention on an ongoing
>>> discussion of 2 schemes to implement the declare mapper in OpenMP
>>> 5.0. The detailed discussion can be found at
>>> https://reviews.llvm.org/D59474
>>>
>>> Scheme 1 (the one that has been implemented by me in
>>> https://reviews.llvm.org/D59474):
>>> The detailed design can be found at
>>> https://github.com/lingda-li/public-sharing/blob/master/mapper_runtime_design.pptx
>>>
>>> ...

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Re: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

Nathan Ridge via cfe-dev
In reply to this post by Nathan Ridge via cfe-dev


On 6/28/19 11:56 PM, James Beyer wrote:

Recursive data structures are important if you consider linked lists important. 


I definitely agree, and I do.


 

Supporting these is challenging but not impossible. I would expect that if someone manages to implement a cost-effective way to support linked lists, we would add support to OpenMP with ease.


In the context of the current proposal, supporting recursion seems to have two effects:

 1. You would not want to use a two-pass traversal to precalculate the size of the mapping table (because if you did two passes, you would traverse the list twice, and that would be unnecessarily expensive).

 2. We'd also need to maintain a "visited addresses" hash table to prevent infinite recursion. As we build up the array of mapping descriptors, we would also add the addresses to the hash table, and should the address already be present, we'd avoid recursing (i.e., just use a regular visited set as one does with a graph traversal).

Am I overlooking something?

 -Hal
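As an illustration of the visited-set idea above, here is a minimal sketch for a hypothetical linked-list walk; the node type, descriptor struct, and names are invented and this is not the proposed codegen:

```
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

struct Node { int payload; Node *next; };

struct MapComponent { void *base; void *begin; size_t size; int64_t type; };

// Walk the list once, pushing one descriptor per node. The visited set stops
// the traversal if a node is reachable twice (e.g. a circular list), so the
// walk cannot loop forever, and each node is mapped only once.
void mapList(Node *head, int64_t type, std::vector<MapComponent> &table,
             std::unordered_set<const void *> &visited) {
  for (Node *n = head; n; n = n->next) {
    if (!visited.insert(n).second)
      break; // address already seen: stop here
    table.push_back({n, n, sizeof(Node), type});
  }
}
```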


 

From: Finkel, Hal J. [hidden email]
Sent: Friday, June 28, 2019 10:46 PM
To: Alexey Bataev [hidden email]; Li, Lingda [hidden email]
Subject: Re: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

 

Hi, Alexey, Lingda,

I haven't been following this closely, so a few questions/comments:

 1. Recursive mappers are not supported in OpenMP 5, but do we expect that to change in the future?

 2. Our experience so far suggests that the most important optimization in this space is to limit the number of distinct host-to-device transfers (or data copies) on systems where data needs to be copied. In these schemes, where does that coalescing occur?

 3. So long as the mappers aren't recursive, I agree with Alexey that the total number of to-be-mapped components should be efficient to calculate. The counting function should simplify to a trivial expression in nearly all cases. The only case where it might not is where the type contains an array section with dynamic bounds, and the element type also has a mapper with an array section with dynamic bounds. In this case (similar to the unsupported recursive cases, which as an aside, we should probably support it as an extension) we could need to walk the data structure twice to precalculate the number of total components to map. However, this case is certainly detectable by static analysis of the declared mappers, and so I think that we can get the best of both worlds: we could use Alexey's proposed scheme except in cases where we truly need to walk the data-structure twice, in which case we could use Lingda's combined walk/push_back scheme. Is there any reason why that wouldn't work?

Thanks again,

Hal

On 6/28/19 9:00 AM, Alexey Bataev wrote:

Hi Lingda, thanks for your comments.
We can allocate the buffer either by allocating it on the stack or calling OpenMP allocate function.
With this solution, we allocate memory only once (no need to resize buffer after push_backs) and we do not need to call the runtime function to put map data to the buffer, compiler generated code can do it.
But anyway, I agree, it would be good to hear some other opinions.
--------------
Best regards,
Alexey Bataev



...

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Re: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

Nathan Ridge via cfe-dev
In reply to this post by Nathan Ridge via cfe-dev
Hi Neil, thanks for your suggestions. Unfortunately, we cannot use deques or any other container without a contiguous memory layout. Yes, it looks like a vector of vectors, but it actually has to be represented as a linearized, collapsed vector of vectors. The runtime functions are not aware of any complex structures; they accept contiguous memory buffers as inputs, not vectors, deques, etc.
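As a sketch of what "linearized, collapsed" means here, assuming a made-up descriptor struct and mapper helpers (not the actual generated code):

```
#include <cstddef>
#include <cstdint>
#include <vector>

struct MapComponent { void *base; void *begin; size_t size; int64_t type; };

struct Inner { double d[4]; };
struct Outer { int x; Inner *in; };

// The inner mapper does not get its own container; it appends into the same
// flat table the outer mapper is filling, so the runtime sees one contiguous
// array of descriptors rather than a vector of vectors.
void mapInner(Inner *p, int64_t type, std::vector<MapComponent> &table) {
  table.push_back({p, p, sizeof(Inner), type});
}

void mapOuter(Outer *p, int64_t type, std::vector<MapComponent> &table) {
  table.push_back({p, &p->x, sizeof(p->x), type});
  table.push_back({p, &p->in, sizeof(p->in), type});
  mapInner(p->in, type, table); // nested mapper output is linearized in place
}
// table.data() and table.size() then describe one contiguous buffer that can
// be handed to a runtime entry point expecting plain memory.
```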

Best regards,
Alexey Bataev

On June 29, 2019, at 3:17 PM, Neil Nelson via cfe-dev <[hidden email]> wrote:

Lingda Li via cfe-dev wrote:

On Fri, Jun 28, 2019 at 9:49 AM Li, Lingda [hidden email] wrote

I believe the key question here is whether it is true that (the
overhead of push_back() > the overhead of precalculating the total
number + the memory allocation overhead + directly memory write).

Not familiar with this area, nevertheless given

I believe the key question here is whether it is true that (the
overhead of push_back() > the overhead of precalculating the total
number + the memory allocation overhead + directly memory write).

What we appear to be looking at is something very close to vectors of vectors in using position pointers and indexes as against a separate key index as in the case of std::map.

Given that we have vectors of vectors (nested mappers), the nested vector, the vector under the parent vector would almost necessarily need to be reached via a pointer in the parent vector. The reductio ad absurdum would be if we have a nested dynamic vector or a resizable vector, it would be impossible to use pointer arithmetic on the parent if the parent object could change size in containing the dynamic nested vector.

The result is that the size of the parent object would be fixed in using a pointer to the nested dynamic vector. Hence the ability to allocate these vectors at compile time depends on knowing the lengths of the vectors in descending order, parent first. When the length of a child vector is not known, that vector and its children would need to be allocated at run time.

I believe the key question here is whether it is true that (the
overhead of push_back() > the overhead of precalculating the total
number + the memory allocation overhead + direct memory writes).

There are two inefficiencies associated with dynamic vectors. That is, the question appears to be whether or not we know the vector lengths at compile time; if we do not have the lengths, we must fall back on dynamic vectors.

Dynamic vectors require a reserve area that can receive push_back objects. Some of that empty memory area will be wasted, as it will likely never be used, and if you do not have a reasonable idea of how long the eventual vector will be, you can waste a large amount of memory.

And if the initial estimate of the vector length is too small, reallocating memory and copying to a larger vector when the first one fills up takes time.
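For illustration only, the two costs in std::vector terms; MapComponent, c, and total_count are placeholders, not names from the patch:

```
std::vector<MapComponent> v;   // length unknown up front
v.push_back(c);                // may reallocate and copy once capacity is hit,
                               // and the slack capacity it keeps may go unused

std::vector<MapComponent> w;
w.reserve(total_count);        // length precalculated: one allocation
w.push_back(c);                // no reallocation, no copying, minimal slack
```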

A deque may be a better choice than a dynamic vector. You can still use position indexes, though not pointer arithmetic, and the deque grows automatically in chunks without reallocating and copying existing elements.

I suggest that the downsides of dynamic vectors favor allocation by the compiler when the lengths can be known.

Neil Nelson


_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Re: [Openmp-dev] Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

Nathan Ridge via cfe-dev
In reply to this post by Nathan Ridge via cfe-dev


Best regards,
Alexey Bataev

On July 2, 2019, at 14:34, Finkel, Hal J. via Openmp-dev <[hidden email]> wrote:


On 6/28/19 11:56 PM, James Beyer wrote:

Recursive data structures are important if you consider linked lists important. 


I definitely agree, and I do.


 

Supporting these is challenging but not impossible. I would expect that if someone manages to implement a cost-effective way to support linked lists, we would add support to OpenMP with ease.


In the context of the current proposal, supporting recursion seems to have two effects:

 1. You would not want to use a two-pass traversal to precalculate the size of the mapping table (because if you did two passes, you would traverse the list twice, and that would be unnecessarily expensive).


In this case we should review the 2 remaining schemes: the original one from Lingda, and the alternative scheme with the functional part moved to the runtime and the mappers called indirectly by the runtime (see the description provided by Lingda).

 2. We'd also need to maintain a "visited addresses" hash table to prevent infinite recursion. As we build up the array of mapping descriptors, we would also add the addresses to the hash table, and should an address already be present, we'd avoid recursing (i.e., just use a regular visited set, as one does with a graph traversal).
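A minimal sketch of such a visited-set guard, assuming a hypothetical singly linked Node type and a push() helper standing in for the runtime's descriptor push used elsewhere in this thread; this is illustrative, not code from either patch:

```
#include <unordered_set>

std::unordered_set<void *> visited;          // addresses already mapped

void map_node(Node *p, int64_t type) {
  // insert() reports whether the address was newly added, so cycles and
  // shared nodes are mapped only once.
  if (!p || !visited.insert(p).second)
    return;
  push(p, p, sizeof(Node), type);            // emit this node's descriptor
  map_node(p->next, type);                   // recurse into pointed-to members
}
```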

Am I overlooking something?

 -Hal


 

From: Finkel, Hal J. [hidden email]
Sent: Friday, June 28, 2019 10:46 PM
To: Alexey Bataev [hidden email]; Li, Lingda [hidden email]
Cc: Alexandre Eichenberger [hidden email]; Chapman, Barbara (Contact) [hidden email]; Kevin K O'Brien [hidden email]; Carlo Bertolli [hidden email]; Deepak Eachempati [hidden email]; Denny, Joel E. [hidden email]; David Oehmke [hidden email]; Ettore Tiotto [hidden email]; [hidden email]; Rokos, Georgios [hidden email]; Gheorghe-Teod Bercea [hidden email]; [hidden email]; Sharif, Hashim [hidden email]; Cownie, James H [hidden email]; Sjodin, Jan [hidden email]; James Beyer [hidden email]; Doerfert, Johannes [hidden email]; Jones, Jeff C [hidden email]; [hidden email]; Robichaux, Joseph [hidden email]; Jeff Heath [hidden email]; [hidden email]; Kelvin Li [hidden email]; Bobrovsky, Konstantin S [hidden email]; Kotsifakou, Maria [hidden email]; Li, Lingda (Contact) [hidden email]; Lopez, Matthew Graham [hidden email]; [hidden email]; Menard, Lorri [hidden email]; Martin Kong [hidden email]; Sarah McNamara [hidden email]; Rice, Michael P [hidden email]; Matt Martineau [hidden email]; [hidden email]; Jeeva Paudel [hidden email]; Rao, Premanand M [hidden email]; Krishnaiyer, Rakesh [hidden email]; Narayanaswamy, Ravi [hidden email]; Monteleone, Robert [hidden email]; Lieberman, Ron [hidden email]; Samuel Antao [hidden email]; Jeffrey Sandoval [hidden email]; Sunita Chandrasekaran [hidden email]; [hidden email]; Sergio Pino Gallardo [hidden email]; Dmitriev, Serguei N [hidden email]; Chan, SiuChi [hidden email]; Sunil Shrestha [hidden email]; Wilmarth, Terry L [hidden email]; Tianyi Zhang [hidden email]; [hidden email]; Wang Chen [hidden email]; Wael Yehia [hidden email]; Tian, Xinmin [hidden email]; [hidden email]; [hidden email]
Subject: Re: Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

 

Hi, Alexey, Lingda,

I haven't been following this closely, so a few questions/comments:

 1. Recursive mappers are not supported in OpenMP 5, but do we expect that to change in the future?

 2. Our experience so far suggests that the most important optimization in this space is to limit the number of distinct host-to-device transfers (or data copies) on systems where data needs to be copied. In these schemes, where does that coalescing occur?

 3. So long as the mappers aren't recursive, I agree with Alexey that the total number of to-be-mapped components should be efficient to calculate. The counting function should simplify to a trivial expression in nearly all cases. The only case where it might not is where the type contains an array section with dynamic bounds, and the element type also has a mapper with an array section with dynamic bounds. In this case (similar to the unsupported recursive cases, which, as an aside, we should probably support as an extension) we might need to walk the data structure twice to precalculate the total number of components to map. However, this case is certainly detectable by static analysis of the declared mappers, and so I think that we can get the best of both worlds: we could use Alexey's proposed scheme except in cases where we truly need to walk the data structure twice, in which case we could use Lingda's combined walk/push_back scheme. Is there any reason why that wouldn't work?

Thanks again,

Hal

On 6/28/19 9:00 AM, Alexey Bataev wrote:

Hi Lingda, thanks for your comments.
We can allocate the buffer either by allocating it on the stack or by calling the OpenMP allocate function.
With this solution, we allocate memory only once (no need to resize the buffer after push_backs), and we do not need to call a runtime function to put the map data into the buffer; the compiler-generated code can do it.
But anyway, I agree, it would be good to hear some other opinions.
--------------
Best regards,
Alexey Bataev



...

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
Openmp-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev