Web-based job monitoring
Spartan Grafana
You can examine time series data from your jobs using the Grafana dashboards we've prepared at Spartan Grafana. You can use these dashboards for currently running jobs as well as your previously run jobs.
Log in to Grafana using your Spartan account and password. Select the "Job Stats" dashboard.
Time Range
Spartan's job data is recorded in a time series database. First, set the range in which your job ran using the time range controls at the top right of the dashboard. It is fine if this range is imprecise to begin with: once the data has loaded, you can focus on part of a graph by clicking and dragging across it with your mouse.
If you do not set this before proceeding, you may see "no data" instead of graphs, and certain elements (such as the Nodes dropdown menu) may fail to automatically populate.
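If you know roughly when the job started and finished (for example from Slurm's accounting records), it can help to work out a slightly padded window before setting the time range. The sketch below is illustrative only: the function name `padded_range_ms` and the ten-minute padding are assumptions made for this example, not part of Spartan or Grafana. It returns the bounds as epoch milliseconds, a common way to express absolute timestamps.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def padded_range_ms(start, end, pad=timedelta(minutes=10)):
    """Return (from_ms, to_ms) around a job's run, padded on both sides.

    start and end must be timezone-aware datetimes. The 10-minute default
    padding is an arbitrary choice for illustration.
    """
    if end < start:
        raise ValueError("end must not be earlier than start")
    one_ms = timedelta(milliseconds=1)
    from_ms = (start - pad - EPOCH) // one_ms
    to_ms = (end + pad - EPOCH) // one_ms
    return from_ms, to_ms


# Hypothetical job that ran for one hour from midnight UTC on 1 January 2024.
job_start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
job_end = job_start + timedelta(hours=1)
window = padded_range_ms(job_start, job_end)
```

A window that is a little too wide is harmless, since you can narrow it afterwards by dragging across a graph.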
Note that data is recorded at a rate of one entry every 2 minutes; this is the finest meaningful granularity that the system can currently report.
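To get a feel for what this sampling rate means for short jobs, the following minimal sketch estimates how many data points a job of a given length would produce for each metric. The helper name `approx_sample_count` is hypothetical, and the result is approximate because the exact count depends on where the job's start and end fall relative to the recording times.

```python
SAMPLE_PERIOD_S = 120  # one entry every 2 minutes, as stated above


def approx_sample_count(elapsed_s):
    """Roughly how many entries are recorded per metric for a job of this length."""
    if elapsed_s < 0:
        raise ValueError("elapsed time must not be negative")
    return elapsed_s // SAMPLE_PERIOD_S


short_job = approx_sample_count(5 * 60)    # a 5-minute job
long_job = approx_sample_count(6 * 3600)   # a 6-hour job
```

A job that runs for only a few minutes yields just a couple of points, so its graphs will look sparse however the dashboard is configured.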
Job Details
Enter the Slurm job number of the job you wish to review into the "Slurm JobID" box at the top left of the dashboard. The Nodes dropdown menu should update automatically with a list of the nodes assigned to the job. You can enable all nodes by clicking the top tickbox in the list, or toggle single nodes by ticking or unticking their corresponding box.
Slurm Stats
This dashboard contains a general summary of job utilisation statistics, largely in their raw form. Entering a job number in the Slurm JobID field fetches the record of that job and displays it in the dashboard. It also populates the Nodes dropdown menu, which lets you include or exclude specific nodes involved in the job from the stats you are viewing. Selecting a node adds a new section to the bottom of the dashboard showing per-node statistics. You can use the blue dash to select all nodes in the job. Jobs that ran on a single node will show only that node in the list.
Finally, the Interval field sets the iteration interval of the job data. Because the data itself is only recorded at 2 minute intervals, you will not see more detail by setting the interval lower than this.
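The point about the Interval field can be expressed as a simple rule: whatever interval you request, the finest detail actually available is the 2 minute recording period. The sketch below shows that rule under this reading; the name `effective_interval_s` is an assumption for illustration and does not describe how the dashboard itself is implemented.

```python
MIN_INTERVAL_S = 120  # data is recorded once every 2 minutes


def effective_interval_s(requested_s):
    """Return the finest interval at which real detail is available."""
    if requested_s <= 0:
        raise ValueError("interval must be positive")
    # Asking for less than the recording period cannot reveal extra detail.
    return max(requested_s, MIN_INTERVAL_S)


fine = effective_interval_s(30)     # requesting 30 s still gives 2-minute detail
coarse = effective_interval_s(600)  # a 10-minute interval is used as requested
```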
Job CPU Utilisation & Job Memory Utilisation
These fields list the raw output from all nodes in a summary graph. Clicking on a node's hostname in the list will show you just the record from that node. You can also enable or disable specific sets of records by shift-clicking their names in the list under the graph. Note that Job CPU Utilisation lists the User, System and combined Total as separate graphs. Similarly, Job Memory Utilisation lists the amounts of Total allocated, RSS, and Used memory separately, as well as any memory failures.
GPU Stats
This section lists GPU Utilisation, onboard Memory Utilisation, Temperature and Power usage. Note that you will not see any data in this section if your job was not GPU enabled.
Node Stats
This section lists CPU, Memory, Frequency adjustments (useful for determining CPU load), Local Disk Read/Write operations in Bytes, and IOPS. There will be one set of each of these graphs for each selected node.
Please note that graphs in this section will be populated with system level data and will show the node's total activity, not just the activity of the selected job or user account.

