nvidia-k8s-device-plugin Source Code Analysis

Updated: 2023-07-12 18:41:15

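1. Introduction
nvidia-k8s-device-plugin is written in Go. The simplicity and power of Go deserve praise here, and the language will presumably keep winning over more developers. To understand a program's source code, start by sorting out what each file does. The code is split across four files, one for each of these roles:
the program entry point;
all the functions that call NVML;
the watcher definitions;
the flow that implements the k8s device plugin.
The NvidiaDevicePlugin struct is defined as follows, with the purpose of each member given in its comment: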
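type NvidiaDevicePlugin struct {
    devs   []*pluginapi.Device    // slice of the Device type defined in the API protobuf; each entry holds a device ID and its health
    socket string                 // path of the socket nvidia-device-plugin listens on, in practice /var/lib/kubelet/device-plugins/nvidia.sock
    stop   chan interface{}       // channel that receives start/stop commands
    health chan *pluginapi.Device // channel that receives unhealthy devices as *pluginapi.Device values
    server *grpc.Server           // gRPC server that handles communication with kubelet
}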
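2. Execution logic
The entry-point file runs the following logic when it is first executed.
1. Load the NVML library. If that succeeds, continue to the next step; otherwise log the error and block indefinitely.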
log.Println("Loading NVML")
if err := nvml.Init(); err != nil {
    log.Printf("Failed to initialize NVML: %s.", err)
    log.Printf("If this is a GPU node, did you set the docker default runtime to `nvidia`?")
    log.Printf("You can check the prerequisites at: /NVIDIA/k8s-device-plugin#prerequisites")
    log.Printf("You can learn how to set the runtime at: /NVIDIA/k8s-device-plugin#quick-start")
    select {} // block forever instead of exiting
}
defer func() { log.Println("Shutdown of NVML returned:", nvml.Shutdown()) }()
2. Get the number of devices on the host. If it is 0, log a waiting message and block indefinitely.
log.Println("Fetching devices.")
if len(getDevices()) == 0 {
    log.Println("No devices found. Waiting indefinitely.")
    select {}
}
3. Create an fsnotify watcher named watcher on the /var/lib/kubelet/device-plugins/ directory. It reports every file change in that directory.
log.Println("Starting FS watcher.")
watcher, err := newFSWatcher(pluginapi.DevicePluginPath) // -> "/var/lib/kubelet/device-plugins/"; watches every file change there
if err != nil {
    log.Println("Failed to created FS watcher.")
    os.Exit(1)
}
defer watcher.Close()
4. Create the OS signal watcher sigs, which listens for operating system signals.
log.Println("Starting OS watcher.")
sigs := newOSWatcher(syscall.SIGHUP, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT) // listen for these signals and forward them to sigs
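5. Monitor the device plugin state and OS signals, and react accordingly.
The for loop labelled L consists of two functional blocks:
1) Restart block:
On the first pass, create a new NvidiaDevicePlugin struct, fill it in, and start the NvidiaDevicePlugin service. On later passes, stop the previous device plugin first and then create a new one.
2) Watcher block:
Handle each kind of event arriving from watcher and sigs, and exit once the system sends a stop signal.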
restart := true
var devicePlugin *NvidiaDevicePlugin
L:
for {
    if restart {
        // stop the previous device plugin if one exists, then create a new one
        if devicePlugin != nil {
            devicePlugin.Stop()
        }
        // returns a struct populated as NvidiaDevicePlugin{devs, socket, stop, health}
        devicePlugin = NewNvidiaDevicePlugin()
        // Serve starts the NvidiaDevicePlugin service, checks connectivity with kubelet,
        // starts health checking, and registers the devices with kubelet
        if err := devicePlugin.Serve(); err != nil {
            log.Println("Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?")
            log.Printf("You can check the prerequisites at: /NVIDIA/k8s-device-plugin#prerequisites")
            log.Printf("You can learn how to set the runtime at: /NVIDIA/k8s-device-plugin#quick-start")
        } else {
            restart = false
        }
    }
    select {
    case event := <-watcher.Events:
        if event.Name == pluginapi.KubeletSocket && event.Op&fsnotify.Create == fsnotify.Create {
            log.Printf("inotify: %s created, restarting.", pluginapi.KubeletSocket)
            restart = true // kubelet.sock was re-created, so restart
        }
    case err := <-watcher.Errors: // log watcher errors
        log.Printf("inotify: %s", err)
    case s := <-sigs: // an OS signal arrived
        switch s {
        case syscall.SIGHUP: // restart signal
            log.Println("Received SIGHUP, restarting.")
            restart = true
        default: // any other signal stops the plugin service
            log.Printf("Received signal \"%v\", shutting down.", s)
            devicePlugin.Stop()
            break L
        }
    }
}
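The restart/watch structure of this loop can be tried out without kubelet or a GPU. The following is a minimal sketch, not the plugin's code: fakePlugin, runLoop, and replayEvents are hypothetical names, plain channels stand in for watcher.Events and sigs, and the stub never fails to start (the real Serve() can fail, in which case restart stays true).

```go
package main

import "fmt"

// fakePlugin is a hypothetical stand-in for NvidiaDevicePlugin.
type fakePlugin struct{ stopped bool }

// runLoop mirrors the restart/watch structure of the loop and returns how many
// times a plugin was started and stopped before a shutdown signal ended it.
func runLoop(socketCreated <-chan struct{}, sigs <-chan string) (starts, stops int) {
	restart := true
	var plugin *fakePlugin
	for {
		if restart {
			if plugin != nil {
				plugin.stopped = true // stop the previous instance first
				stops++
			}
			plugin = &fakePlugin{}
			starts++
			restart = false // assumption: the stub always starts successfully
		}
		select {
		case <-socketCreated: // stands in for "kubelet.sock was created"
			restart = true
		case s := <-sigs:
			if s == "SIGHUP" {
				restart = true
				continue
			}
			plugin.stopped = true // any other signal shuts down
			stops++
			return starts, stops
		}
	}
}

// replayEvents feeds a fixed sequence: socket re-created, SIGHUP, SIGTERM.
// The channels are unbuffered, so the events are consumed in this order.
func replayEvents() (starts, stops int) {
	socketCreated := make(chan struct{})
	sigs := make(chan string)
	go func() {
		socketCreated <- struct{}{}
		sigs <- "SIGHUP"
		sigs <- "SIGTERM"
	}()
	return runLoop(socketCreated, sigs)
}

func main() {
	starts, stops := replayEvents()
	fmt.Printf("starts=%d stops=%d\n", starts, stops)
}
```

Each restart trigger stops the old instance before creating a new one, so the number of starts and stops match once the final signal arrives.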
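The key functions in each of the steps above are examined below.
Before the analysis, look at a package that two of the source files both import: pluginapi "k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1".
This path contains two files, both in package v1beta1. One is generated automatically by gRPC from api.proto; the other defines many constants used in what follows, listed here: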
// \vendor\k8s.io\kubernetes\pkg\kubelet\apis\deviceplugin\
package v1beta1

const (
    // Healthy means that the device is healthy
    Healthy = "Healthy"
    // Unhealthy means that the device is unhealthy
    Unhealthy = "Unhealthy"
    // Current version of the API supported by kubelet
    Version = "v1beta1"
    // DevicePluginPath is the folder the Device Plugin is expecting sockets to be on
    // Only privileged pods have access to this path
    // Note: Placeholder until we find a "standard path"
    DevicePluginPath = "/var/lib/kubelet/device-plugins/"
    // KubeletSocket is the path of the Kubelet registry socket
    KubeletSocket = DevicePluginPath + "kubelet.sock"
    // Timeout duration in secs for PreStartContainer RPC
    KubeletPreStartContainerRPCTimeoutInSecs = 30
)

var SupportedVersions = [...]string{"v1beta1"}
Step 1:
This step is a single call to nvml.Init(), which, as the name suggests, performs the NVML initialization.
Step 2:
//
func getDevices() []*pluginapi.Device {
    n, err := nvml.GetDeviceCount()
    check(err)
    var devs []*pluginapi.Device
    for i := uint(0); i < n; i++ {
        d, err := nvml.NewDeviceLite(i)
        check(err)
        devs = append(devs, &pluginapi.Device{
            ID:     d.UUID,
            Health: pluginapi.Healthy,
        })
    }
    return devs
}
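This function first calls nvml.GetDeviceCount() to get the number of devices on the host, then appends the information for every device to the devs slice. Each element is a pluginapi.Device struct whose ID is initialized to the device's UUID and whose Health field is initialized to "Healthy" (the constant Healthy = "Healthy" from the const block).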
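The same enumeration pattern can be exercised without a GPU. This is a minimal sketch under stated assumptions: gpuSource, fakeGPUs, device, and listDevices are hypothetical names that stand in for the NVML calls and pluginapi.Device, and errors are returned to the caller rather than passed to check().

```go
package main

import "fmt"

const healthy = "Healthy" // same value as the Healthy constant

// device is a hypothetical stand-in for pluginapi.Device.
type device struct {
	ID     string
	Health string
}

// gpuSource is a hypothetical interface standing in for the NVML calls.
type gpuSource interface {
	Count() (uint, error)
	UUID(i uint) (string, error)
}

// fakeGPUs serves a fixed list of UUIDs.
type fakeGPUs []string

func (f fakeGPUs) Count() (uint, error) { return uint(len(f)), nil }

func (f fakeGPUs) UUID(i uint) (string, error) {
	if i >= uint(len(f)) {
		return "", fmt.Errorf("no device at index %d", i)
	}
	return f[i], nil
}

// listDevices builds one entry per device, keyed by UUID and marked healthy.
func listDevices(src gpuSource) ([]*device, error) {
	n, err := src.Count()
	if err != nil {
		return nil, err
	}
	var devs []*device
	for i := uint(0); i < n; i++ {
		uuid, err := src.UUID(i)
		if err != nil {
			return nil, err
		}
		devs = append(devs, &device{ID: uuid, Health: healthy})
	}
	return devs, nil
}

func main() {
	devs, err := listDevices(fakeGPUs{"GPU-aaaa", "GPU-bbbb"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, d := range devs {
		fmt.Println(d.ID, d.Health)
	}
}
```

With an empty source the function returns an empty slice, which corresponds to the "No devices found" branch in step 2.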
Step 3:
newFSWatcher creates and returns a watcher for file changes under pluginapi.DevicePluginPath. From the constant definitions, the watched path is /var/lib/kubelet/device-plugins/, so the watcher observes both kubelet.sock and nvidia.sock.
//
func newFSWatcher(files ...string) (*fsnotify.Watcher, error) {
    watcher, err := fsnotify.NewWatcher()
    if err != nil {
        return nil, err
    }
    for _, f := range files {
        err = watcher.Add(f)
        if err != nil {
            watcher.Close() // release the watcher before reporting the failure
            return nil, err
        }
    }
    return watcher, nil
}
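Because the whole directory is watched, events for nvidia.sock arrive on the same channel as events for kubelet.sock, and the main loop has to filter them. The sketch below illustrates that filter with the standard library only. The op type and its flag values are hypothetical and merely imitate a bitmask of operations; they are not the fsnotify API, and event and needsRestart are illustrative names.

```go
package main

import "fmt"

// op is a hypothetical bitmask of file operations.
type op uint32

const (
	opCreate op = 1 << iota
	opWrite
	opRemove
)

const (
	devicePluginPath = "/var/lib/kubelet/device-plugins/"
	kubeletSocket    = devicePluginPath + "kubelet.sock"
	serverSock       = devicePluginPath + "nvidia.sock"
)

// event is a hypothetical stand-in for a watcher event.
type event struct {
	Name string
	Op   op
}

// needsRestart reports whether the event is a creation of kubelet.sock.
// The mask test tolerates other bits being set alongside the create bit.
func needsRestart(e event) bool {
	return e.Name == kubeletSocket && e.Op&opCreate == opCreate
}

func main() {
	fmt.Println(needsRestart(event{kubeletSocket, opCreate | opWrite}))
	fmt.Println(needsRestart(event{serverSock, opCreate}))
	fmt.Println(needsRestart(event{kubeletSocket, opRemove}))
}
```

Only the first event qualifies: the plugin's own socket and non-create operations on kubelet.sock are ignored.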
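Step 4:
newOSWatcher is defined in the same file. It returns a watcher for the SIGHUP, SIGINT, SIGTERM, and SIGQUIT signals sent by the system. The watcher is simply a chan of os.Signal with a buffer of one. The L loop in the entry-point code watches this chan and reacts accordingly.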
//
func newOSWatcher(sigs ...os.Signal) chan os.Signal {
    sigChan := make(chan os.Signal, 1)
    signal.Notify(sigChan, sigs...) // sigs: syscall.SIGHUP, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT
    return sigChan
}
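The buffer of one matters because the os/signal package does not block when sending to the channel: a signal that arrives while no receiver is ready is dropped unless the channel has room for it. The sketch below imitates that delivery style with a non-blocking send. tryDeliver and deliveredWithCapacity are hypothetical helper names, and no real signal is raised.

```go
package main

import (
	"fmt"
	"os"
)

// tryDeliver performs a non-blocking send and reports whether the value was
// queued. It imitates how a sender that must never block treats the channel.
func tryDeliver(ch chan<- os.Signal, s os.Signal) bool {
	select {
	case ch <- s:
		return true
	default:
		return false // nobody is receiving and there is no buffer space
	}
}

// deliveredWithCapacity tries one delivery on a fresh channel that has no
// receiver waiting, using the given buffer capacity.
func deliveredWithCapacity(capacity int) bool {
	ch := make(chan os.Signal, capacity)
	return tryDeliver(ch, os.Interrupt)
}

func main() {
	fmt.Println("capacity 0:", deliveredWithCapacity(0))
	fmt.Println("capacity 1:", deliveredWithCapacity(1))
}
```

With capacity 0 the value is lost, while capacity 1 keeps it until the main loop gets around to reading it.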
Step 5:
1. devicePlugin.Stop()
Stop shuts down the gRPC server and cleans up.
//
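func (m *NvidiaDevicePlugin) Stop() error {
    if m.server == nil {
        return nil // nothing is running, so there is nothing to stop
    }
    m.server.Stop()
    m.server = nil
    close(m.stop) // closing the channel notifies everything that waits on it
    return m.cleanup()
}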
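Two details of Stop are worth isolating: closing a channel releases every goroutine that is receiving from it, and the nil check makes a second call harmless, since closing an already closed channel would panic. The following sketch demonstrates both under simplified assumptions: service, newService, and stopTwice are hypothetical names, and a boolean stands in for the non-nil gRPC server.

```go
package main

import "fmt"

// service is a hypothetical stand-in for the plugin.
type service struct {
	running bool             // stands in for the non-nil *grpc.Server
	stop    chan interface{} // closed once to tell every worker to quit
}

func newService() *service {
	return &service{running: true, stop: make(chan interface{})}
}

// Stop is safe to call more than once.
func (s *service) Stop() {
	if !s.running {
		return // already stopped: skipping avoids a double-close panic
	}
	s.running = false
	close(s.stop)
}

// stopTwice starts some workers that wait on the stop channel, calls Stop
// twice, and returns how many workers were released.
func stopTwice(workers int) int {
	s := newService()
	done := make(chan struct{}, workers)
	for i := 0; i < workers; i++ {
		go func() {
			<-s.stop // unblocks for every receiver once the channel is closed
			done <- struct{}{}
		}()
	}
	s.Stop()
	s.Stop() // second call is a no-op
	released := 0
	for i := 0; i < workers; i++ {
		<-done
		released++
	}
	return released
}

func main() {
	fmt.Println("workers released:", stopTwice(3))
}
```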
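2. NewNvidiaDevicePlugin()
Returns a struct populated as NvidiaDevicePlugin{devs, socket, stop, health}. devs is the slice returned by getDevices(). socket is the constant serverSock = pluginapi.DevicePluginPath + "nvidia.sock", that is, /var/lib/kubelet/device-plugins/nvidia.sock. stop is an unbuffered chan that accepts values of any type. health is an unbuffered chan that accepts *pluginapi.Device values; its main purpose is to report unhealthy devices to kubelet promptly. // to be confirmed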
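How an unbuffered health chan hands a device over can be sketched as follows. This is an illustration only, matching the unconfirmed description above: it assumes that the receiving side marks the device as "Unhealthy" and then reports the full device list, and the names dev, watchHealth, and reportOneFailure are hypothetical. A callback stands in for sending the list to kubelet.

```go
package main

import "fmt"

// dev is a hypothetical stand-in for pluginapi.Device.
type dev struct{ ID, Health string }

// watchHealth marks each device received on health as unhealthy and passes a
// snapshot of all devices to report, until stop is closed.
func watchHealth(stop <-chan interface{}, health <-chan *dev, devs []*dev, report func([]string)) {
	for {
		select {
		case <-stop:
			return
		case d := <-health:
			d.Health = "Unhealthy" // assumption: the receiver flips the state
			snapshot := make([]string, 0, len(devs))
			for _, x := range devs {
				snapshot = append(snapshot, x.ID+"="+x.Health)
			}
			report(snapshot)
		}
	}
}

// reportOneFailure reports the second device as failed and returns the
// snapshot that the watcher produced.
func reportOneFailure() []string {
	devs := []*dev{{ID: "GPU-0", Health: "Healthy"}, {ID: "GPU-1", Health: "Healthy"}}
	stop := make(chan interface{})
	health := make(chan *dev) // unbuffered: the send below waits for the watcher
	reports := make(chan []string, 1)
	finished := make(chan struct{})
	go func() {
		watchHealth(stop, health, devs, func(s []string) { reports <- s })
		close(finished)
	}()
	health <- devs[1]
	snapshot := <-reports
	close(stop) // same shutdown style as the stop chan in the struct
	<-finished
	return snapshot
}

func main() {
	fmt.Println(reportOneFailure())
}
```

Because the chan is unbuffered, the sender cannot move on until the watcher has actually taken the device, which is what makes the report prompt.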
