Playbook for using the VTune tool on DevCloud or other clusters
---------------------------------------------------------------

Note: command lines start with the "$" prompt.

Steps 1-4 cover DevCloud usage.
Steps 5-7 build and run the test application nbody.
Step 8 covers APS (Application Performance Snapshot).
Steps 9-12 cover VTune.

1. Log into DevCloud
--------------------

$ ssh devcloud 

Alternative: open a Jupyter notebook and start a terminal there.

2. Clone the oneAPI samples from GitHub
---------------------------------------

$ git clone https://github.com/oneapi-src/oneAPI-samples.git


3. Start an interactive session on a node with a GEN11 GPU
----------------------------------------------------------

(People with NDA accounts may use an ATS-P GPU.)

It is better to compile on a compute node because the login node has very limited memory and other resources.

$ qsub -I -l nodes=1:gen11:ppn=2

4. Check properties 
-------------------

$ sycl-ls --verbose

Platform [#3]:
    Version  : OpenCL 3.0
    Name     : Intel(R) OpenCL HD Graphics
    Vendor   : Intel(R) Corporation
    Devices  : 1
        Device [#2]:
        Type       : gpu
        Version    : 3.0
        Name       : Intel(R) UHD Graphics [0x9a60]
        Vendor     : Intel(R) Corporation
        Driver     : 22.23.23405
Platform [#4]:
    Version  : 1.3
    Name     : Intel(R) Level-Zero
    Vendor   : Intel(R) Corporation
    Devices  : 1
        Device [#0]:
        Type       : gpu
        Version    : 1.3
        Name       : Intel(R) UHD Graphics [0x9a60]
        Vendor     : Intel(R) Corporation
        Driver     : 1.3.23405


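This prints all backends, i.e. each GPU device together with its low-level driver (Level Zero or OpenCL).
In the listing above, the same GPU appears once under OpenCL (Platform #3) and once under Level Zero (Platform #4).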


$ clinfo 

provides more details for the OpenCL backend.

5. Build nbody code from oneAPI-samples
---------------------------------------


If you have not already cloned the samples in step 2, do so now:

$ git clone https://github.com/oneapi-src/oneAPI-samples.git

$ mkdir build
$ cd build
$ cmake ../oneAPI-samples/DirectProgramming/DPC++/N-BodyMethods/Nbody/
$ make 

6. Run nbody 
------------

$ ./src/nbody


The output should look like:

===============================
 Initialize Gravity Simulation
 nPart = 16000; nSteps = 10; dt = 0.1
------------------------------------------------
 s       dt      kenergy     time (s)    GFLOPS
------------------------------------------------
 1       0.1     26.405      0.19124     38.821
 2       0.2     313.77      0.006551    1133.3
 3       0.3     926.56      0.0066749   1112.3
 4       0.4     1866.4      0.0066208   1121.4
 5       0.5     3135.6      0.0065561   1132.4
 6       0.6     4737.6      0.0066551   1115.6
 7       0.7     6676.6      0.0066353   1118.9
 8       0.8     8957.7      0.0065615   1131.5
 9       0.9     11587       0.0066486   1116.7
 10      1       14572       0.006616    1122.2

# Total Time (s)     : 0.25087
# Average Performance : 1121.4 +- 6.799
===============================
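
To compare runs it can help to recompute the average from the per-step table. The sketch below is a hypothetical helper, not part of the sample: mean_gflops reads the nbody output on stdin and averages the GFLOPS column over all steps after the first SKIP steps (default 2). The default of 2 is an assumption inferred from the sample output above, where the mean of steps 3 to 10 gives the reported 1121.4; the sample's source code was not checked for this.

```shell
# mean_gflops [SKIP]
# Hypothetical helper: read nbody output on stdin and print the mean of the
# GFLOPS column over all steps whose step number is greater than SKIP.
# SKIP defaults to 2 (an assumption inferred from the sample output).
mean_gflops() {
    awk -v skip="${1:-2}" '
        # Data rows have exactly five fields and start with an integer step number.
        NF == 5 && $1 ~ /^[0-9]+$/ && $1 + 0 > skip { sum += $5; n++ }
        END { if (n > 0) printf "%.1f\n", sum / n }
    '
}
```

Usage, from the build directory:

$ ./src/nbody | mean_gflops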

7. (Optional) Change the number of particles to 256000
------------------------------------------------------

$ vi ../oneAPI-samples/DirectProgramming/DPC++/N-BodyMethods/Nbody/src/GSimulation.cpp

change line :  set_npart(16000);
to     line :  set_npart(256000);
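
The same edit can be scripted instead of done in vi. This is a minimal sketch: set_npart_in is a hypothetical helper, and it assumes the file contains the call exactly in the form set_npart(<digits>) as quoted above.

```shell
# set_npart_in FILE N
# Hypothetical helper: rewrite the set_npart(<digits>) call in FILE to use N particles.
set_npart_in() {
    file=$1
    n=$2
    # In a basic regular expression the parentheses are literal characters.
    # Write to a temporary file first so the helper does not depend on "sed -i".
    sed "s/set_npart([0-9]*)/set_npart($n)/" "$file" > "$file.tmp" &&
        mv "$file.tmp" "$file"
}
```

Usage, from the build directory:

$ set_npart_in ../oneAPI-samples/DirectProgramming/DPC++/N-BodyMethods/Nbody/src/GSimulation.cpp 256000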

$ make clean
$ make 

$ ./src/nbody

Note: run $ export VERBOSE=1 before make
      to see the build steps in detail.


8. (Application Performance Snapshot) APS usage
------------------------------------------------

Show the help menu:

$ aps -help 


Run the application under APS (from the build directory):

$ aps ./src/nbody

This generates ASCII (text) output and an HTML report.

More options, e.g. for MPI scaling, are listed by:

$ aps-report -help


APS shows high occupancy on the GPU. Its advice is to use VTune to find out why the CPU is underutilized.
That is expected here: we never intended to compute on the CPU.

9. VTune on Devcloud
--------------------


Create an ssh tunnel to the compute node and start the vtune-backend server there.
First allocate a node as described in step 3,

e.g. a node with a DG1 GPU: s011-n001

Log in to DevCloud again from your local machine, using port 55001 for the tunnel:

$ ssh -L 127.0.0.1:55001:127.0.0.1:55001 devcloud

extend tunnel to allocated node

$ ssh -L 127.0.0.1:55001:127.0.0.1:55001 s011-n001
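
With an OpenSSH client that supports the -J (ProxyJump) option, the two hops above can possibly be combined into a single command run on your local machine. Whether DevCloud accepts this depends on your ssh configuration and keys, so treat it as an assumption to verify. tunnel_cmd is a hypothetical helper that only prints the command for a given node and port; it does not open any connection.

```shell
# tunnel_cmd NODE [PORT]
# Hypothetical helper: print a single-hop tunnel command that jumps through
# the "devcloud" host to NODE and forwards local PORT (default 55001).
tunnel_cmd() {
    node=$1
    port=${2:-55001}
    printf 'ssh -J devcloud -L 127.0.0.1:%s:127.0.0.1:%s %s\n' "$port" "$port" "$node"
}
```

Usage:

$ tunnel_cmd s011-n001

prints the ssh command line to copy and run.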

start vtune backend server

$ vtune-backend --web-port=55001 --enable-server-profiling

This prints a line to open in your local web browser:
Serving GUI at https://127.0.0.1:55001

The first time, the line will be longer and include a certificate. Use the whole line!
The first time, you will also be asked for a passphrase. Choose a good password and remember it.

The web browser will now show the VTune GUI.

Start an analysis with "Configure Analysis".

see also: https://www.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top/configuration-recipes/using-vtune-server-with-vs-code-intel-devcloud.html

10. VTune command line 
----------------------

Run the VTune command line (see steps 11 and 12) and copy the result directories into the default VTune projects directory:

$ ls $HOME/intel/vtune/projects

Create a new directory such as MY-NBODY inside projects and copy the results into it.

Alternative: start vtune-backend with parameter --data-directory <your directory> 
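
The copy step described above can be sketched as a small shell function. publish_result is a hypothetical helper name; the default projects path is the one listed above, and the optional third argument is only there so the location can be overridden.

```shell
# publish_result RESULT_DIR PROJECT [PROJECTS_ROOT]
# Hypothetical helper: copy a VTune result directory into a project
# directory below the VTune projects directory.
publish_result() {
    result_dir=$1
    project=$2
    root=${3:-$HOME/intel/vtune/projects}
    # Create the project directory if needed, then copy the whole result tree.
    mkdir -p "$root/$project" && cp -r "$result_dir" "$root/$project/"
}
```

Usage, for a result directory gh created by a command-line run:

$ publish_result gh MY-NBODY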

11. HPC Analysis
----------------

$ vtune -c hpc-performance -- <app> <app parameter>

Good for OpenMP analysis, but nbody does not use OpenMP!

Add memory bandwidth analysis:

$ vtune -c hpc-performance -knob collect-memory-bandwidth=true -- <app> <app parameter>


12. GPU Hotspots
----------------

plain first analysis:

$ vtune -c gpu-hotspots -r gh -- <your app> <app parameter>

The result data is written to the directory gh.

Full instruction instrumentation (very high overhead):

$ vtune -c gpu-hotspots -knob characterization-mode=instruction-count -r ghi -- <your app> <app parameter>

GPU source analysis:

$ vtune -c gpu-hotspots -knob profiling-mode=source-analysis -r ghs -- <your app> <app parameter>

This estimates the time spent per source line (measured per basic block).

$ vtune -c gpu-hotspots -knob profiling-mode=source-analysis -knob source-analysis=mem-latency -r ghl -- <your app> <app parameter>

shows latencies per source line
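
To run all four variants above in sequence, the commands can be wrapped in a small script. This is a sketch only: run and gpu_hotspots_all are hypothetical helper names, and DRY_RUN is an assumed environment variable. By default the commands are only printed, so they can be checked before collecting.

```shell
# run CMD [ARGS...]
# Print the command line (DRY_RUN=1, the default) or execute it (DRY_RUN=0).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# gpu_hotspots_all APP [APP PARAMETERS...]
# Hypothetical helper: the four gpu-hotspots collections from this step,
# writing to the result directories gh, ghi, ghs and ghl.
gpu_hotspots_all() {
    run vtune -c gpu-hotspots -r gh -- "$@"
    run vtune -c gpu-hotspots -knob characterization-mode=instruction-count -r ghi -- "$@"
    run vtune -c gpu-hotspots -knob profiling-mode=source-analysis -r ghs -- "$@"
    run vtune -c gpu-hotspots -knob profiling-mode=source-analysis -knob source-analysis=mem-latency -r ghl -- "$@"
}
```

Usage, from the build directory:

$ gpu_hotspots_all ./src/nbody
$ DRY_RUN=0 gpu_hotspots_all ./src/nbody

The first call prints the four command lines; the second runs them. VTune normally refuses to reuse an existing result directory, so remove or rename gh, ghi, ghs and ghl before collecting again.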