When running inference with a trained deep-learning model, you often want to know what system resources the inference process consumes, such as CPU utilization and GPU utilization, which calls for a monitoring tool.
CPU utilization can be read with the psutil library:
```python
psutil.cpu_percent(interval=1, percpu=False)
```
Wrapped in a function:
```python
def get_cpu_utilization():
    try:
        cpu_utilization = psutil.cpu_percent(interval=1, percpu=False)
        return cpu_utilization
    except Exception as e:
        print(f"Error while fetching CPU utilization: {e}")
        return None  # a float is expected on success, so return None on failure
```
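As a side note, psutil can also report one reading per logical core instead of a single overall figure; a quick sketch of the difference, assuming psutil is installed:

```python
import psutil

# Overall utilization: a single float (percentage across all cores)
overall = psutil.cpu_percent(interval=0.1, percpu=False)

# Per-core utilization: one float per logical CPU
per_core = psutil.cpu_percent(interval=0.1, percpu=True)
print(overall, per_core)
```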
GPU utilization can be obtained from the nvidia-smi command-line tool, again wrapped in a function:
```python
def get_gpu_utilization():
    try:
        result = subprocess.run(
            ['nvidia-smi', '--query-gpu=utilization.gpu', '--format=csv,noheader'],
            stdout=subprocess.PIPE)
        output = result.stdout.decode('utf-8').strip().split('\n')
        gpu_utilizations = [int(util.replace(' %', '')) for util in output]
        return gpu_utilizations  # one entry per GPU
    except Exception as e:
        print(f"Error while fetching GPU utilization: {e}")
        return []
```
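The string parsing inside this function can be exercised without a GPU by pulling it into a small helper; parse_gpu_csv is a hypothetical name introduced here only for illustration:

```python
def parse_gpu_csv(output: str) -> list:
    """Parse the output of
    `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader`,
    e.g. '45 %\n12 %', into a list of integer percentages."""
    lines = output.strip().split('\n')
    return [int(line.replace(' %', '')) for line in lines]

# Example with a canned two-GPU string
print(parse_gpu_csv("45 %\n12 %"))  # [45, 12]
```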
Putting the two helper functions together:
```python
def start_monitor(GPU_ID):
    timestamps = []
    cpu_utilizations = []
    gpu_utilizations = []

    try:
        while True:
            timestamp = datetime.datetime.now()
            timestamps.append(timestamp)

            cpu_utilization = get_cpu_utilization()
            cpu_utilizations.append(cpu_utilization)
            print(f"CPU Utilization: {cpu_utilization}%")

            gpu_utilization = get_gpu_utilization()[GPU_ID]
            gpu_utilizations.append(gpu_utilization)
            print(f"GPU Utilization: {gpu_utilization}%")

            time.sleep(1)
    except KeyboardInterrupt:
        print("Monitoring stopped.")
```
To monitor an inference run, pass in the GPU_ID of the device to watch and call start_monitor; it will then report both CPU and GPU utilization.
However, because the monitoring loop is while True, start_monitor has to be terminated by hand once the inference program finishes.
To reduce this manual step and stop the monitor automatically, the inference program should emit a "finished" signal when it exits; on receiving that signal, the monitor stops immediately.
The steps to implement this are as follows:
First, import the multiprocessing module and create a shared variable, stop_flag, to act as the stop signal for the monitor. Pass stop_flag to the monitor as an extra argument and start the monitor in its own process:
```python
stop_flag = multiprocessing.Value('b', False)

monitor_proc = multiprocessing.Process(target=start_monitor, args=(GPU_ID, stop_flag))
monitor_proc.start()
```
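How a shared multiprocessing.Value flag stops a child loop can be sketched in isolation; worker and demo below are hypothetical stand-ins for the monitor:

```python
import multiprocessing
import time

def worker(stop_flag):
    # Spin until the parent flips the shared flag
    while not stop_flag.value:
        time.sleep(0.05)

def demo():
    # 'b' declares a signed char, used here as a boolean
    stop_flag = multiprocessing.Value('b', False)
    proc = multiprocessing.Process(target=worker, args=(stop_flag,))
    proc.start()
    time.sleep(0.2)          # let the worker run a few iterations
    stop_flag.value = True   # the "inference finished" signal
    proc.join(timeout=5)     # worker exits its loop and terminates
    return proc.exitcode

if __name__ == "__main__":
    print(demo())  # 0 on a clean exit
```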
The monitoring loop's condition then changes from while True to while not stop_flag.value.
Next, use subprocess.Popen with the path of the inference script, so that the inference program starts while the monitor is already running:
```python
def start_inference(cmd):
    print(f"Starting inference with command: {' '.join(cmd)}")
    proc = subprocess.Popen(cmd)
    proc.wait()
    print("Inference completed.")

start_inference(cmd)
```
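The Popen-then-wait pattern can be tried with a throwaway child process; the inline script here is just a stand-in for the real inference script:

```python
import subprocess
import sys

def run_and_wait(cmd):
    """Launch a child process, block until it finishes, return its exit code."""
    print(f"Starting: {' '.join(cmd)}")
    proc = subprocess.Popen(cmd)
    proc.wait()  # blocks the caller until the child exits
    return proc.returncode

# A trivial child process standing in for the inference script
code = run_and_wait([sys.executable, '-c', 'print("inference done")'])
```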
Finally, wait for the inference program to finish, then flip the shared stop_flag so that the monitor's while condition fails and its loop exits, which stops the monitor:

```python
def stop_monitoring(stop_flag):
    # Send the stop signal to the monitor process
    stop_flag.value = True

# Stop the monitor once inference is done
stop_monitoring(stop_flag)
monitor_proc.join()
```
The complete code:
```python
import subprocess
import time
import datetime
import psutil
import multiprocessing


def get_cpu_utilization():
    try:
        cpu_utilization = psutil.cpu_percent(interval=1, percpu=False)
        return cpu_utilization
    except Exception as e:
        print(f"Error while fetching CPU utilization: {e}")
        return None


def get_gpu_utilization():
    try:
        result = subprocess.run(
            ['nvidia-smi', '--query-gpu=utilization.gpu', '--format=csv,noheader'],
            stdout=subprocess.PIPE)
        output = result.stdout.decode('utf-8').strip().split('\n')
        gpu_utilizations = [int(util.replace(' %', '')) for util in output]
        return gpu_utilizations
    except Exception as e:
        print(f"Error while fetching GPU utilization: {e}")
        return []


def start_monitor(GPU_ID, stop_flag):
    timestamps = []
    cpu_utilizations = []
    gpu_utilizations = []

    try:
        while not stop_flag.value:
            timestamp = datetime.datetime.now()
            timestamps.append(timestamp)

            cpu_utilization = get_cpu_utilization()
            cpu_utilizations.append(cpu_utilization)
            print(f"CPU Utilization: {cpu_utilization}%")

            gpu_utilization = get_gpu_utilization()[GPU_ID]
            gpu_utilizations.append(gpu_utilization)
            print(f"GPU Utilization: {gpu_utilization}%")

            time.sleep(1)
    except KeyboardInterrupt:
        pass
    finally:
        print("Monitoring stopped.")


def start_inference(cmd):
    print(f"Starting inference with command: {' '.join(cmd)}")
    proc = subprocess.Popen(cmd)
    pid = proc.pid
    print(f"Inference process PID: {pid}")
    proc.wait()
    print("Inference completed.")


def stop_monitoring(stop_flag):
    stop_flag.value = True


def start_(GPU_ID, cmd):
    stop_flag = multiprocessing.Value('b', False)

    monitor_proc = multiprocessing.Process(target=start_monitor, args=(GPU_ID, stop_flag))
    monitor_proc.start()

    try:
        start_inference(cmd)
        stop_monitoring(stop_flag)
        monitor_proc.join()
    except Exception as e:
        print(f"Error: {e}")
    finally:
        print("Main program terminated.")


if __name__ == "__main__":
    GPU_ID = 0
    cmd = ['python', './INFER.py']
    start_(GPU_ID, cmd)
```